<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0406"> <Title>Multiword Expression Filtering for Building Knowledge Maps</Title> <Section position="3" start_page="1" end_page="1" type="metho"> <SectionTitle> 2 Term Extraction and Filtering Algorithms </SectionTitle> <Paragraph position="0"> Researchers have extracted keywords by simple recognition of fixed patterns that occur frequently in the text [Choueka 1988, Tseng 1998]. We adopted Tseng's algorithm, which identifies frequently repeated N-grams, because of its efficiency. We begin by describing Tseng's algorithm and then discuss our modifications to extract useful multiword expressions.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.1 Tseng's Algorithm </SectionTitle> <Paragraph position="0"> This algorithm consists of three major steps. The first step requires only linear time to convert the input text into a list. The second step repeats merging tests until no elements remain to be merged. The third step filters out noisy terms using a stop list.</Paragraph> <Paragraph position="1"> 1. Convert the input text into a LIST of overlapping 2-grams (or 2-words, see an example below).</Paragraph> <Paragraph position="2"> 2. WHILE LIST is not empty 2.1. Set MergeList to empty.</Paragraph> <Paragraph position="3"> 2.2. Push a special token to the end of LIST as a sentinel.</Paragraph> <Paragraph position="4"> 2.3. FOR each pair of adjacent elements K1 and K2 in LIST, IF K1 and K2 are mergeable and both of their occurrence frequencies are greater than a threshold 2.4. Set LIST to MergeList.</Paragraph> <Paragraph position="5"> 3. 
Filter out noisy terms in FinalList and sort the result according to some criteria.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.2 Iterative Term Filtering Algorithm </SectionTitle> <Paragraph position="0"> We used Tseng's algorithm to select a set of multiword expressions. Then, we applied an algorithm based on stopwords and part-of-speech information to filter out the less useful words from the beginning and end of multiword expressions, in order to help identify and construct useful multiword expressions. Our algorithm deletes words from the beginning and end of the multiword expression until one of the following conditions is satisfied: (1) the result is a one-word term that is not a stopword, (2) the words at the beginning and end of the multiword expression are deemed acceptable by our algorithm, or (3) all words in the expression are deleted. Our technique uses the stopword list that is used by step 3 of Tseng's algorithm, along with rules regarding which of those stopwords are acceptable if found at the beginning and at the end of the extracted n-grams. Stopword lists may be generated by analyzing a corpus of documents and identifying the most frequently occurring words across the corpus. Many pre-compiled lists are also available, and the general practice has been to adopt a pre-compiled list and tweak it to meet a specific purpose.</Paragraph> <Paragraph position="1"> The number of entries in stopword lists can range from approximately 50 to 5000 words [&quot;Onix&quot; online stopword lists, &quot;freeWAIS-sf&quot;, &quot;Seattle City Clerk Information Services Database&quot;, &quot;RDS&quot;]. The stopword lists that are used for term suggestion tend to be longer or more aggressive than those used for building standard search engine indexes. The stopword list used by the term suggestion system that we used for our experimentation contained around 600 words. 
Fast-NPE, a noun phrase extractor, uses a stopword list with more than 3500 words [Bennett, 1999].</Paragraph> <Paragraph position="2"> One would find words such as &quot;can&quot;, &quot;cannot&quot;, &quot;should&quot;, &quot;approximately&quot;, and so on in a stopword list for term suggestion even though they may not be present in other stopword lists. The reason is that such words are not useful by themselves to end users or to ontologists who are trying to understand the content of documents.</Paragraph> <Paragraph position="3"> However, these words may be useful when found at the beginning or end of multiword expressions.</Paragraph> <Paragraph position="4"> Our algorithm assumes the use of a stopword list generally used by implementers building term suggestion systems. Our part-of-speech-based heuristics determine which of those stopwords would be considered acceptable at the beginning or end of multiword expressions.</Paragraph> </Section> <Section position="3" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.3 Stopword Analysis </SectionTitle> <Paragraph position="0"> In order to gain an initial insight into what kinds of words are acceptable at the beginning and end of multiword expressions, we built lists of acceptable multiword expressions based on filtering performed by a team of experienced ontologists on a sample set of multiword expressions extracted from hundreds of thousands of documents. Studying words in the multiword expressions that were discarded or retained gave us clues about what words are acceptable at the beginning and end of multiword expressions.</Paragraph> <Paragraph position="1"> This helped us identify patterns that could then be incorporated into a more general algorithm.</Paragraph> <Paragraph position="2"> Most of the time, stopwords were not useful when they were at the beginning or end of multiword expressions. 
Some good examples of stopwords that are not useful at the beginning or end of a multiword expression are coordinating conjunctions such as &quot;and&quot;, &quot;or&quot;, and so on. However, there are exceptions to this rule. The exceptions are described in sections 2.3.1 and 2.3.2.</Paragraph> <Paragraph position="3"> 2.3.1. Stopwords Acceptable at the Beginning of Multiword Expressions We studied words retained and discarded by ontologists at the beginning of multiword expressions. As expected, these were often words that were in the stopword lists. However, there were cases where some stopwords were not discarded. These helped us identify cases or patterns that went into the creation of our algorithm. Those cases are presented below: Prepositions - Words such as &quot;under&quot;, &quot;over&quot;, &quot;after&quot;, and so on. An example of a useful expression that has a preposition at the beginning is &quot;after installing the operating system&quot;. Using the standard stopword list to eliminate words from the beginning of multiword expressions would have resulted in that expression being reduced to &quot;installing the operating system&quot;. The meaning of &quot;after installing the operating system&quot; is quite different from &quot;installing the operating system&quot;. The content of documents containing the expression &quot;after installing the operating system&quot; may be quite different from documents containing just the expression &quot;installing the operating system&quot;. &quot;After installing the operating system&quot; may be indicative of a document about problems users may run into after installation. 
By itself, &quot;installing the operating system&quot; may be indicative of documents about how to install an operating system.</Paragraph> <Paragraph position="4"> The goal here is not to determine whether one multiword expression is really different from another, but to provide the ontologist with all possible information to make those judgment calls.</Paragraph> <Paragraph position="5"> Auxiliary verbs - Words such as &quot;can&quot;, &quot;cannot&quot;, &quot;will&quot;, &quot;won't&quot;, &quot;was&quot;, &quot;has&quot;, &quot;been&quot; are examples of auxiliary or helping verbs. For example, the expression &quot;can uninstall the program&quot; is quite different from &quot;cannot uninstall the program&quot;. Since &quot;can&quot; is both a noun and an auxiliary verb, it is usually not on most stopword lists. However, &quot;cannot&quot; is sometimes found in stopword lists.</Paragraph> <Paragraph position="6"> Adverbs - Words such as &quot;slowly&quot;, &quot;insufficiently&quot;, &quot;fast&quot;, &quot;late&quot;, &quot;early&quot;, etc. may be found in stopword lists used for term suggestion since these words do not carry much meaning by themselves. However, they are useful when found at the beginning of multiword expressions. Examples of such expressions include &quot;early binding&quot; and &quot;late binding&quot;.</Paragraph> <Paragraph position="7"> Adjectives - Adjectives such as &quot;slow&quot;, &quot;fast&quot;, &quot;empty&quot;, &quot;full&quot;, and &quot;intermittent&quot; are useful when found at the beginning of multiword expressions. Examples include &quot;slow CPU&quot;, &quot;intermittent failures&quot;, etc. &quot;All,&quot; &quot;any,&quot; &quot;each,&quot; &quot;few,&quot; &quot;many,&quot; &quot;nobody,&quot; &quot;none,&quot; &quot;one,&quot; &quot;several,&quot; and &quot;some&quot; are some examples of indefinite adjectives. 
Multiword expressions such as &quot;several users&quot; and &quot;all CDROM drives&quot; may convey more meaning than just &quot;users&quot; and &quot;CDROM drives&quot;.</Paragraph> <Paragraph position="8"> Interrogative pronouns - &quot;How&quot;, &quot;why&quot;, &quot;when&quot;, and so on are not useful by themselves, but are very useful when found at the beginning of multiword expressions.</Paragraph> <Paragraph position="9"> Examples of such expressions include &quot;how to install an anti-virus package&quot;, &quot;when to look for updates&quot;, and &quot;how do I fix my computer&quot;.</Paragraph> <Paragraph position="10"> Correlative conjunctions - &quot;Both the computers&quot; and &quot;either freezing or crashing&quot; are examples of expressions that begin with correlative conjunctions. &quot;Both&quot; and &quot;either&quot; are very likely to be found in stopword lists used for term suggestion, but they add meaning to multiword expressions.</Paragraph> </Section> <Section position="4" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.3.2. Stopwords Acceptable at the End of Multiword Expressions </SectionTitle> <Paragraph position="0"> Similarly, we studied words retained and discarded by ontologists at the end of multiword expressions. As expected, these were often words that were in the stopword lists. However, there were cases where some stopwords were not discarded. Those cases are presented below: Numbers - Numbers are generally found on most stopword lists. 0, 1, 2, and so on rarely make sense by themselves, especially in the context of term suggestion.</Paragraph> <Paragraph position="1"> However, when they are found at the end of multiword expressions in digit form (0, 1, 2, and so on) rather than in word form (one, two, three, and so on), they can be useful. 
Examples of such cases are usually product names with their version numbers - &quot;Microsoft Word version 3.0&quot;, &quot;Windows 3.1&quot;, and so on. Closing parentheses - A closing parenthesis usually indicates the presence of an opening parenthesis within the multiword expression. Therefore, retaining the closing parenthesis is a good idea. Examples of such expressions are &quot;Manufacturing (UK division)&quot;, &quot;Transmission Control Protocol (TCP)&quot;, and so on. A nice side effect of this heuristic is that users can learn about acronyms in the domain.</Paragraph> <Paragraph position="2"> Adverbs - Words such as &quot;slowly&quot;, &quot;quickly&quot;, &quot;immediately&quot;, and so on are useful at the end of multiword expressions. Examples of these include &quot;computer shuts down slowly&quot; and &quot;uninstall the program immediately&quot;.</Paragraph> <Paragraph position="3"> 2.4. Set LIST to MergeList.</Paragraph> <Paragraph position="4"> 3. Filter out noisy terms in FinalList and sort the result according to some criteria.</Paragraph> <Paragraph position="5"> 4. FOR each expression FL on the FinalList, 4.1 IF the first word in FL is a stopword and is not: a preposition, an auxiliary verb, an adverb, an adjective, an interrogative pronoun, or a correlative not on the stopword list OR FL is a one-word term that is not a stopword or all the words in the expression are deleted by this algorithm)</Paragraph> </Section> </Section> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 5. Push FL into FilteredList. </SectionTitle> <Paragraph position="0"> This algorithm can be implemented using either a program that does part-of-speech tagging or a program that looks up a thesaurus. Our implementation used a list of stopwords that are acceptable at the beginning of the expression, and another list of stopwords that are acceptable at the end of the expression.</Paragraph> </Section> </Paper>
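<!-- Editorial sketch, not part of the original paper. The iterative trimming of Section 2.2 (delete words from the beginning and end of an expression until it is a one-word non-stopword, its edge words are acceptable, or nothing is left) can be illustrated roughly as below. It assumes the two-list approach mentioned in the implementation note: one list of stopwords acceptable at the beginning and one acceptable at the end. The word lists, the names STOPWORDS, OK_AT_START, OK_AT_END and trim_expression, and the simple checks for digit-form numbers and closing parentheses are all hypothetical choices made for this illustration; the paper does not publish its lists or code.

```python
import re

# Illustrative lists only (assumed); a real system would use a much larger
# stopword list and allow-lists derived from ontologist judgments.
STOPWORDS = {"and", "or", "the", "of", "after", "cannot", "how", "both",
             "slowly", "several", "one", "two"}
OK_AT_START = {"after", "cannot", "how", "both", "several"}
OK_AT_END = {"slowly"}

# Digit-form numbers such as "3" or "3.1" (assumed pattern).
DIGIT_NUMBER = re.compile(r"\d+(\.\d+)*")


def acceptable_first(word):
    """A first word is fine if it is not a stopword or is allow-listed."""
    w = word.lower()
    return w not in STOPWORDS or w in OK_AT_START


def acceptable_last(word):
    """A last word is fine if it is a digit-form number, ends with a closing
    parenthesis, is not a stopword, or is allow-listed for the end."""
    w = word.lower()
    if DIGIT_NUMBER.fullmatch(w) or w.endswith(")"):
        return True
    return w not in STOPWORDS or w in OK_AT_END


def trim_expression(expression):
    """Trim unacceptable words from both ends; return "" if all are deleted."""
    words = expression.split()
    while words:
        if len(words) == 1:
            # One-word result: keep it only if it is not a stopword.
            if words[0].lower() in STOPWORDS:
                words = []
            break
        if not acceptable_first(words[0]):
            words = words[1:]
        elif not acceptable_last(words[-1]):
            words = words[:-1]
        else:
            break  # both edge words are acceptable
    return " ".join(words)
```

With these assumed lists, trim_expression("the Windows 3.1 and") returns "Windows 3.1", and trim_expression("and or") returns the empty string because every word is deleted. A part-of-speech tagger could replace the two allow-lists, as the paper notes, but that variant is not shown here.
-->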