<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0406">
<Title>Multiword Expression Filtering for Building Knowledge Maps</Title>
<Section position="5" start_page="1" end_page="1" type="evalu">
<SectionTitle> 3 Evaluation </SectionTitle>
<Paragraph position="0"> We implemented this algorithm in Java and ran it on more than 20 corpora of documents dealing with technology topics. The corpora ranged in size from 4,000 to 500,000 documents, and the average corpus size was around 5-6 MB. The topics discussed include computer networking, instructions on how to install and use application software, and troubleshooting of software problems, among others.</Paragraph>
<Paragraph position="1"> The program's inputs are the documents and a stop-words list.</Paragraph>
<Paragraph position="2"> Benefits of applying our algorithm to filter expressions include the following. Term list size reduction - Applying our algorithm to the expressions extracted from documents reduces the number of terms by at least 30%-40%.</Paragraph>
<Paragraph position="3"> This translated into an order-of-magnitude reduction in time and effort on the part of ontologists and other users. Without the algorithm, ontologists would have had to study the list manually, eliminating meaningless expressions and manipulating other terms to turn them into useful expressions.</Paragraph>
<Paragraph position="4"> Examples of such reduction include the following. Expressions such as &quot;Windows 98 operating system&quot;, &quot;Windows 98 operating system was&quot;, &quot;the Windows 98 operating system&quot;, and &quot;Windows 98 operating system is&quot; are all reduced to &quot;Windows 98 operating system&quot;.</Paragraph>
<Paragraph position="5"> Expressions such as &quot;the screen flickers&quot; and &quot;screen flickers and&quot; are reduced to just &quot;screen flickers&quot;.</Paragraph>
<Paragraph position="6"> Expressions such as &quot;and is a&quot;, &quot;is not&quot;, and &quot;and etc.&quot; are eliminated from the list altogether. The individual words in these expressions appear in the stop-words list, but ordinarily a multiword expression such as &quot;is not&quot; would make it past a stop-words filter because it contains more than one word.</Paragraph>
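A minimal sketch of this boundary-trimming and elimination step is shown below, assuming a small hard-coded stop-word list; the class and method names (ExpressionFilter, filter) are illustrative only and do not reflect the actual implementation. Note that because the sketch trims stop words only at the boundaries of an expression, interior stop words (as in &quot;installed in order to&quot;) are left in place, which mirrors the limitation discussed at the end of this section.

import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: the stop-word list and the class/method names
// are assumptions made for this example, not the actual implementation.
public class ExpressionFilter {

    // A tiny sample stop-word list; a real list would also cover the other
    // articles, prepositions, auxiliary verbs and conjunctions.
    private static final Set<String> STOP_WORDS = Set.of(
            "the", "a", "an", "and", "or", "is", "was", "after", "in", "to", "not", "etc.");

    // Trims stop words from both ends of a candidate expression and returns
    // null when the expression consists entirely of stop words.
    public static String filter(String expression) {
        List<String> words = new LinkedList<>(Arrays.asList(expression.trim().split("\\s+")));

        // Strip leading stop words, e.g. "the Windows 98 operating system".
        while (!words.isEmpty() && STOP_WORDS.contains(words.get(0).toLowerCase())) {
            words.remove(0);
        }
        // Strip trailing stop words, e.g. "Windows 98 operating system was".
        while (!words.isEmpty() && STOP_WORDS.contains(words.get(words.size() - 1).toLowerCase())) {
            words.remove(words.size() - 1);
        }
        // Expressions made up entirely of stop words, such as "and is a",
        // reduce to nothing and are dropped.
        return words.isEmpty() ? null : String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(filter("the Windows 98 operating system")); // Windows 98 operating system
        System.out.println(filter("screen flickers and"));             // screen flickers
        System.out.println(filter("and is a"));                        // null
    }
}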
<Paragraph position="7"> The reduction in the number of terms translated into a reduction in the number of person-weeks required to create a knowledge map from the terms. We observed savings in person-weeks ranging from 50% to close to 90%; in one particular instance, using our algorithm reduced the time required to create a knowledge map from the extracted n-grams from 4 person-weeks to about 0.5 person-weeks. Higher precision - The terms that remain after filtering are more likely to be useful, that is, more likely to be considered useful by users. In our experience, the percentage of useful terms before filtering ranged from 30% to 50%; after filtering, it ranged from 60% to 80%.</Paragraph>
<Paragraph position="8"> In other words, running our algorithm on large corpora of documents has shown that it increases the percentage of useful terms from 40% (+/-10%) to 70% (+/-10%), with an eight-fold improvement observed in some cases.</Paragraph>
<Paragraph position="9"> Domain independence - Pattern extraction from documents yields both domain-specific and domain-independent terms. Domain-specific terms are those that represent the core knowledge of the domain. For example, terms such as &quot;dynamic host control protocol&quot; and &quot;TCP/IP&quot; can be considered domain-specific terms in the computer networking domain, whereas terms such as &quot;document author&quot; are not domain specific. The technique described in this paper aids in filtering both domain-specific and domain-independent terms extracted from documents, which ensures domain portability. Our tests were conducted primarily with documents on technology topics; however, the algorithm also worked well with documents related to electronic commerce.</Paragraph>
<Paragraph position="10"> The algorithm is, of course, not foolproof: there are instances where expressions that ought to be modified are not, and instances where expressions are modified more than necessary. For example, the expression &quot;software was&quot; is correctly reduced to &quot;software&quot;, since &quot;was&quot; is an auxiliary verb, and &quot;computer crashed after&quot; is reduced to &quot;computer crashed&quot;, since &quot;after&quot; is a preposition; but &quot;installed in order to&quot; is reduced to &quot;installed in order&quot;. &quot;Installed in order&quot; is not a useful expression, and it is one of the expressions that our algorithm does not process correctly. On the whole, however, our finding is that applying this algorithm yields a significant savings of time and effort in extracting useful multiword expressions from documents.</Paragraph>
</Section>
</Paper>