File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/02/w02-1028_metho.xml
Size: 24,521 bytes
Last Modified: 2025-10-06 14:08:01
<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1028"> <Title>A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> In experiments with the MUC-4 corpus (MUC-4 Proceedings, 1992), Roark and Charniak (Roark and Charniak, 1998) reported that 3 of every 5 terms generated by their semantic lexicon learner were not present in WordNet. These results suggest that automatic semantic lexicon acquisition could be used to enhance existing resources such as WordNet, or to produce semantic lexicons for specialized domains.</Paragraph>
<Paragraph position="1"> We have developed a weakly supervised bootstrapping algorithm called Basilisk that automatically generates semantic lexicons. Basilisk hypothesizes the semantic class of a word by gathering collective evidence about semantic associations from extraction pattern contexts. Basilisk also learns multiple semantic classes simultaneously, which helps constrain the bootstrapping process.</Paragraph>
<Paragraph position="2"> First, we present Basilisk's bootstrapping algorithm and explain how it differs from previous work on semantic lexicon induction. Second, we present empirical results showing that Basilisk outperforms a previous algorithm. Third, we explore the idea of learning multiple semantic categories simultaneously by adding this capability to Basilisk as well as another bootstrapping algorithm. Finally, we present results showing that learning multiple semantic categories simultaneously improves performance.</Paragraph>
<Paragraph position="3"> Basilisk is a weakly supervised bootstrapping algorithm that automatically generates semantic lexicons. Figure 1 shows the high-level view of Basilisk's bootstrapping process. The input to Basilisk is an unannotated text corpus and a few manually defined seed words for each semantic category. Before bootstrapping begins, we run an extraction pattern learner over the corpus which generates patterns to extract every noun phrase in the corpus.</Paragraph>
<Paragraph position="4"> The bootstrapping process begins by selecting a subset of the extraction patterns that tend to extract the seed words. We call this the pattern pool. The nouns extracted by these patterns become candidates for the lexicon and are placed in a candidate word pool. Basilisk scores each candidate word by gathering all patterns that extract it and measuring how strongly those contexts are associated with words that belong to the semantic category. The five best candidate words are added to the lexicon, and the process starts over again. In this section, we describe Basilisk's bootstrapping algorithm in more detail and discuss related work.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Basilisk </SectionTitle>
<Paragraph position="0"> The input to Basilisk is a text corpus and a set of seed words. We generated seed words by sorting the words in the corpus by frequency and manually identifying the 10 most frequent nouns that belong to each category. These seed words form the initial semantic lexicon. In this section we describe the learning process for a single semantic category. In Section 3 we will explain how the process is adapted to handle multiple categories simultaneously.</Paragraph>
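As a concrete illustration of this seed selection step, here is a minimal sketch in Python; the tokenized noun list and the is_category_member judgment function are hypothetical stand-ins for the corpus and the manual inspection described above, not part of the paper.

```python
from collections import Counter

def select_seed_words(corpus_nouns, is_category_member, num_seeds=10):
    """Pick the most frequent corpus nouns that a human judges to be
    category members, mirroring Basilisk's seed selection step."""
    freq = Counter(corpus_nouns)  # noun -> corpus frequency
    seeds = []
    for noun, _count in freq.most_common():
        if is_category_member(noun):  # stands in for manual inspection
            seeds.append(noun)
        if len(seeds) == num_seeds:
            break
    return seeds

# Example: seed a toy "location" category.
nouns = ["country", "bomb", "city", "country", "city", "army", "town"]
print(select_seed_words(nouns, lambda n: n in {"country", "city", "town"}, 3))
# ['country', 'city', 'town']
```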
<Paragraph position="1"> To identify new lexicon entries, Basilisk relies on extraction patterns to provide contextual evidence that a word belongs to a semantic class. As our representation for extraction patterns, we used the AutoSlog system (Riloff, 1996). AutoSlog's extraction patterns represent linguistic expressions that extract a noun phrase in one of three syntactic roles: subject, direct object, or prepositional phrase object. For example, three patterns that would extract people are: "<subject> was arrested", "murdered <direct object>", and "collaborated with <pp object>". Extraction patterns represent linguistic contexts that often reveal the meaning of a word by virtue of syntax and lexical semantics. Extraction patterns are typically designed to capture role relationships. For example, consider the verb "robbed" when it occurs in the active voice. The subject of "robbed" identifies the perpetrator, while the direct object of "robbed" identifies the victim or target.</Paragraph>
<Paragraph position="2"> Before bootstrapping begins, we run AutoSlog exhaustively over the corpus to generate an extraction pattern for every noun phrase that appears. The patterns are then applied to the corpus and all of their extracted noun phrases are recorded. Figure 2 shows the bootstrapping process that follows, which we explain in the following sections.</Paragraph>
</Section> </Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> BOOTSTRAPPING </SectionTitle>
<Paragraph position="0"> Figure 2: Basilisk's bootstrapping algorithm. GENERATION: Generate all extraction patterns in the corpus and record their extractions. BOOTSTRAPPING: 1. Score all extraction patterns. 2. pattern pool = top-ranked 20+i patterns. 3. candidate word pool = extractions of patterns in the pattern pool. 4. Score candidate words in the candidate word pool. 5. Add top 5 candidate words to the lexicon. 6. i := i + 1. 7. Go to Step 1.</Paragraph>
<Paragraph position="1"> The first step in the bootstrapping process is to score the extraction patterns based on their tendency to extract known category members. All words that are currently defined in the semantic lexicon are considered to be category members. Basilisk scores each pattern using the RlogF metric that has been used for extraction pattern learning (Riloff, 1996). The score for each pattern is computed as: $RlogF(pattern_i) = \frac{F_i}{N_i} \cdot \log_2(F_i)$, where $F_i$ is the number of category members extracted by $pattern_i$ and $N_i$ is the total number of nouns extracted by $pattern_i$. Intuitively, the RlogF metric is a weighted conditional probability; a pattern receives a high score if a high percentage of its extractions are category members, or if a moderate percentage of its extractions are category members and it extracts a lot of them.</Paragraph>
<Paragraph position="2"> The top N extraction patterns are put into a pattern pool. Basilisk uses a value of N=20 for the first iteration, which allows a variety of patterns to be considered, yet is small enough that all of the patterns are strongly associated with the category. "Depleted" patterns are not included in this set. A pattern is depleted if all of its extracted nouns are already defined in the lexicon, in which case it has no unclassified words to contribute.</Paragraph>
<Paragraph position="3"> The purpose of the pattern pool is to narrow down the field of candidates for the lexicon. Basilisk collects all noun phrases (NPs) extracted by patterns in the pattern pool and puts the head noun of each NP into the candidate word pool. Only these nouns are considered for addition to the lexicon.</Paragraph>
<Paragraph position="4"> As the bootstrapping progresses, using the same value N=20 causes the candidate pool to become stagnant. For example, let's assume that Basilisk performs perfectly, adding only valid category words to the lexicon. After some number of iterations, all of the valid category members extracted by the top 20 patterns will have been added to the lexicon, leaving only non-category words to consider. For this reason, the pattern pool needs to be infused with new patterns so that more nouns (extractions) become available for consideration. To achieve this effect, we increment the value of N by one after each bootstrapping iteration. This ensures that there is always at least one new pattern contributing words to the candidate word pool on each successive iteration.</Paragraph>
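To make the pattern-scoring step concrete, the sketch below implements the RlogF ranking and the growing 20+i pattern pool under an assumed representation in which pattern_extractions maps each pattern to the set of head nouns it extracts; the names and data shapes are illustrative, not the authors' implementation.

```python
import math

def rlogf(extractions, lexicon):
    """RlogF score: (F/N) * log2(F), where F is the number of extracted
    nouns that are known category members and N is the total extracted."""
    f = len(extractions & lexicon)
    n = len(extractions)
    if f == 0:
        return float("-inf")  # no evidence for this category
    return (f / n) * math.log2(f)

def build_pattern_pool(pattern_extractions, lexicon, iteration):
    """Top-ranked 20+i patterns, skipping 'depleted' patterns whose
    extractions are all already defined in the lexicon."""
    usable = {p: ext for p, ext in pattern_extractions.items()
              if ext - lexicon}  # depleted patterns have nothing new
    ranked = sorted(usable, key=lambda p: rlogf(usable[p], lexicon),
                    reverse=True)
    return ranked[:20 + iteration]
```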
<Paragraph position="5"> The next step is to score the candidate words. For each word, Basilisk collects every pattern that extracted the word. All extraction patterns are used during this step, not just the patterns in the pattern pool. Initially, we used a scoring function that computes the average number of category members extracted by those patterns: $score(word_i) = \frac{1}{P_i} \sum_{j=1}^{P_i} F_j$, where $P_i$ is the number of patterns that extract $word_i$ and $F_j$ is the number of distinct category members extracted by $pattern_j$. A word receives a high score if it is extracted by patterns that also have a tendency to extract known category members.</Paragraph>
<Paragraph position="6"> As an example, suppose the word "Peru" is in the candidate word pool as a possible location. Basilisk finds all patterns that extract "Peru" and computes the average number of known locations extracted by those patterns. Let's assume that the three patterns shown below extract "Peru" and that the words marked with an asterisk are known locations. "Peru" would receive a score of (2+3+2)/3 = 2.3. Intuitively, this means that patterns that extract "Peru" also extract, on average, 2.3 known location words.</Paragraph>
<Paragraph position="7"> "was killed in <np>" Extractions: Peru, clashes, a shootout, El Salvador*, Colombia* "<np> was divided" Extractions: the country*, the Medellin cartel, Colombia*, Peru, the army, Nicaragua* "ambassador to <np>" Extractions: Nicaragua*, Peru, the UN, Panama*</Paragraph>
<Paragraph position="8"> Unfortunately, this scoring function has a problem. The average can be heavily skewed by one pattern that extracts a large number of category members. For example, suppose word w is extracted by 10 patterns, 9 of which do not extract any category members, while the tenth extracts 50 category members. The average number of category members extracted by these patterns will be 5. This is misleading because the only evidence linking word w with the semantic category is a single, high-frequency extraction pattern (which may extract words that belong to other categories as well).</Paragraph>
<Paragraph position="9"> To alleviate this problem, we modified the scoring function to compute the average logarithm of the number of category members extracted by each pattern. The logarithm reduces the influence of any single pattern. We will refer to this scoring metric as the AvgLog function: $AvgLog(word_i) = \frac{1}{P_i} \sum_{j=1}^{P_i} \log_2(F_j + 1)$. Since $\log_2(1) = 0$, we add one to $F_j$ so that patterns which extract a single category member contribute a positive value.</Paragraph>
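A minimal sketch of the AvgLog computation, reusing the assumed pattern_extractions representation from the previous sketch:

```python
import math

def avg_log(word, pattern_extractions, lexicon):
    """AvgLog(word) = (1/P) * sum_j log2(F_j + 1), where the sum runs
    over the P patterns that extract the word and F_j counts the
    distinct known category members extracted by pattern j."""
    patterns = [p for p, ext in pattern_extractions.items() if word in ext]
    if not patterns:
        return 0.0
    total = sum(math.log2(len(pattern_extractions[p] & lexicon) + 1)
                for p in patterns)
    return total / len(patterns)

# The "Peru" example: three patterns extracting 2, 3, and 2 known
# locations give AvgLog = (log2(3) + log2(4) + log2(3)) / 3 ~= 1.72.
```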
<Paragraph position="10"> Using this scoring metric, all words in the candidate word pool are scored and the top five words are added to the semantic lexicon. The pattern pool and the candidate word pool are then emptied, and the bootstrapping process starts over again.</Paragraph>
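Putting the pieces together, one hypothetical rendering of the single-category bootstrapping loop is sketched below; it reuses the build_pattern_pool and avg_log helpers defined above and simplifies details (e.g., head-noun extraction) that the paper handles with AutoSlog.

```python
def bootstrap(pattern_extractions, seeds, iterations=200, words_per_iter=5):
    """Single-category Basilisk loop: grow the lexicon by five words per
    iteration, widening the pattern pool (20+i) as bootstrapping proceeds."""
    lexicon = set(seeds)
    for i in range(iterations):
        pool = build_pattern_pool(pattern_extractions, lexicon, i)
        candidates = set()
        for p in pool:
            candidates |= pattern_extractions[p]  # head nouns extracted
        candidates -= lexicon  # only unclassified words are candidates
        ranked = sorted(candidates,
                        key=lambda w: avg_log(w, pattern_extractions, lexicon),
                        reverse=True)
        lexicon |= set(ranked[:words_per_iter])
    return lexicon
```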
<Paragraph position="11"> Several weakly supervised learning algorithms have previously been developed to generate semantic lexicons from text corpora. Riloff and Shepherd (Riloff and Shepherd, 1997) developed a bootstrapping algorithm that exploits lexical co-occurrence statistics, and Roark and Charniak (Roark and Charniak, 1998) refined this algorithm to focus more explicitly on certain syntactic structures. Hale, Ge, and Charniak (Ge et al., 1998) devised a technique to learn the gender of words. Caraballo (Caraballo, 1999) and Hearst (Hearst, 1992) created techniques to learn hypernym/hyponym relationships. None of these previous algorithms used extraction patterns or similar contexts to infer semantic class associations. Several learning algorithms have also been developed for named entity recognition (e.g., (Collins and Singer, 1999; Cucerzan and Yarowsky, 1999)).</Paragraph>
<Paragraph position="12"> (Collins and Singer, 1999) used contextual information of a different sort than we do. Furthermore, our research aims to learn general nouns (e.g., "artist") rather than proper nouns, so many of the features commonly used to great advantage for named entity recognition (e.g., capitalization and title words) are not applicable to our task.</Paragraph>
<Paragraph position="13"> The algorithm most closely related to Basilisk is meta-bootstrapping (Riloff and Jones, 1999), which also uses extraction pattern contexts for semantic lexicon induction. Meta-bootstrapping identifies a single extraction pattern that is highly correlated with a semantic category and then assumes that all of its extracted noun phrases belong to the same category. However, this assumption is often violated, which allows incorrect terms to enter the lexicon.</Paragraph>
<Paragraph position="14"> Riloff and Jones acknowledged this issue and used a second level of bootstrapping (the "Meta" bootstrapping level) to alleviate this problem. While meta-bootstrapping trusts individual extraction patterns to make unilateral decisions, Basilisk gathers collective evidence from a large set of extraction patterns. As we will demonstrate in Section 2.2, Basilisk's approach produces better results than meta-bootstrapping and is also considerably more efficient because it uses only a single bootstrapping loop (meta-bootstrapping uses nested bootstrapping). However, meta-bootstrapping produces category-specific extraction patterns in addition to a semantic lexicon, while Basilisk focuses exclusively on semantic lexicon induction.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Single Category Results </SectionTitle>
<Paragraph position="0"> To evaluate Basilisk's performance, we ran experiments with the MUC-4 corpus (MUC-4 Proceedings, 1992), which contains 1700 texts associated with terrorism. We used Basilisk to learn semantic lexicons for six semantic categories: building, event, human, location, time, and weapon. Before we ran these experiments, one of the authors manually labeled every head noun in the corpus that was found by an extraction pattern. These manual annotations were the gold standard. Table 1 shows the breakdown of semantic categories for the head nouns. These numbers represent a baseline: an algorithm that randomly selects words would be expected to achieve accuracies consistent with these numbers.</Paragraph>
<Paragraph position="1"> Three semantic lexicon learners have previously been evaluated on the MUC-4 corpus (Riloff and Shepherd, 1997; Roark and Charniak, 1998; Riloff and Jones, 1999), and of these meta-bootstrapping achieved the best results. We therefore implemented the meta-bootstrapping algorithm ourselves to directly compare its performance with that of Basilisk. A difference between the original implementation and ours is that our version learns individual nouns (as does Basilisk) instead of noun phrases. We believe that learning individual nouns is a more conservative approach because noun phrases often overlap (e.g., "high-power bombs" and "incendiary bombs" would count as two different lexicon entries in the original meta-bootstrapping algorithm). Consequently, our meta-bootstrapping results differ from those reported in (Riloff and Jones, 1999).</Paragraph>
<Paragraph position="2"> Figure 3 shows the results for Basilisk (ba-1) and meta-bootstrapping (mb-1). We ran both algorithms for 200 iterations, so that 1000 words were added to the lexicon (5 words per iteration). The X axis shows the number of words learned, and the Y axis shows how many of them were correct. The Y axes have different ranges because some categories are more prolific than others. Basilisk outperforms meta-bootstrapping for every category, often substantially. For the human and location categories, Basilisk learned hundreds of words, with accuracies in the 80-89% range through much of the bootstrapping. It is worth noting that Basilisk's performance held up well on the human and location categories even at the end, achieving 79.5% (795/1000) accuracy for humans and 53.2% (532/1000) accuracy for locations.</Paragraph>
</Section> </Section>
<Section position="5" start_page="0" end_page="2" type="metho"> <SectionTitle> 3 Learning Multiple Semantic Categories Simultaneously </SectionTitle>
<Paragraph position="0"> We also explored the idea of bootstrapping multiple semantic classes simultaneously. Our hypothesis was that errors of confusion between semantic categories can be lessened by using information about multiple categories. (We use the term confusion to refer to errors where a word is labeled as category X when it really belongs to category Y.) This hypothesis makes sense only if a word cannot belong to more than one semantic class. In general, this is not true because words are often polysemous. But within a limited domain, a word usually has a dominant word sense. Therefore we make a "one sense per domain" assumption (similar to the "one sense per discourse" observation (Gale et al., 1992)) that a word belongs to a single semantic category within a limited domain. All of our experiments involve the MUC-4 terrorism domain and corpus, for which this assumption seems appropriate.</Paragraph>
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.1 Motivation </SectionTitle>
<Paragraph position="0"> Figure 4 shows one way of viewing the task of semantic lexicon induction. The set of all words in the corpus is visualized as a search space. Each category owns a certain territory within the space (demarcated with a dashed line), representing the words that are true members of that category. Not all territories are the same size, since some categories have more members than others.</Paragraph>
<Paragraph position="1"> Figure 4 depicts what happens when a lexicon is generated for a single category. The seed words for the category (in this case, category C) are represented by the solid black area in category C's territory. The hypothesized words in the growing lexicon are represented by a shaded area. The goal of the bootstrapping algorithm is to expand the area of hypothesized words so that it exactly matches the category's true territory. If the shaded area expands beyond the category's true territory, then incorrect words have been added to the lexicon. In Figure 4, category C has claimed a significant number of words that belong to categories B and E. When generating a lexicon for one category at a time, these confusion errors are impossible to detect because the learner has no knowledge of the other categories.</Paragraph>
<Paragraph position="2"> Figure 5 shows the same search space when lexicons are generated for six categories simultaneously. If the lexicons cannot overlap, then we constrain the ability of a category to overstep its bounds. Category C is stopped when it begins to encroach upon the territories of categories B and E because words in those areas have already been claimed.</Paragraph>
</Section>
<Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Simple Conflict Resolution </SectionTitle>
<Paragraph position="0"> The easiest way to take advantage of multiple categories is to add simple conflict resolution that enforces the "one sense per domain" constraint. If more than one category tries to claim a word, then we use conflict resolution to decide which category should win. We incorporated a simple conflict resolution procedure into Basilisk, as well as the meta-bootstrapping algorithm. For both algorithms, the conflict resolution procedure works as follows. (1) If a word is hypothesized for category A but has already been assigned to category B during a previous iteration, then the category A hypothesis is discarded. (2) If a word is hypothesized for both category A and category B during the same iteration, then it is assigned to the category for which it receives the highest score. In Section 3.4, we will present empirical results showing how this simple conflict resolution scheme affects performance.</Paragraph>
</Section>
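A sketch of this two-rule conflict resolution procedure, assuming each category's newly hypothesized words have already been scored for the current iteration (the data shapes are illustrative):

```python
def resolve_conflicts(hypotheses, assigned):
    """hypotheses: {category: {word: score}} for the current iteration.
    assigned: {word: category} decisions from previous iterations.
    Returns the words each category may add this iteration."""
    winners = {}
    for category, scored in hypotheses.items():
        for word, score in scored.items():
            if word in assigned:      # rule (1): earlier assignment wins
                continue
            best = winners.get(word)  # rule (2): highest score wins
            if best is None or score > best[1]:
                winners[word] = (category, score)
    additions = {c: [] for c in hypotheses}
    for word, (category, _score) in winners.items():
        additions[category].append(word)
    return additions
```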
<Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 A Smarter Scoring Function for Multiple Categories </SectionTitle>
<Paragraph position="0"> Simple conflict resolution helps the algorithm recognize when it has encroached on another category's territory, but it does not actively steer the bootstrapping in a more promising direction. A more intelligent way to handle multiple categories is to incorporate knowledge about other categories directly into the scoring function. We modified Basilisk's scoring function to prefer words that have strong evidence for one category but little or no evidence for competing categories. Each word $w_i$ receives a score for category $c_a$ computed as: $diff(w_i, c_a) = AvgLog(w_i, c_a) - \max_{b \neq a} AvgLog(w_i, c_b)$, where AvgLog is the candidate scoring function used previously by Basilisk (see Section 2.1) and the max function returns the maximum AvgLog value over all competing categories. For example, the score for each candidate location word will be its AvgLog score for the location category minus its maximum AvgLog score for all other categories. A word is ranked highly only if it has a high score for the targeted category and there is little evidence that it belongs to a different category. This has the effect of steering the bootstrapping process away from ambiguous parts of the search space.</Paragraph>
</Section>
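A minimal sketch of the diff score, building on the hypothetical avg_log helper from the earlier sketches; lexicons is assumed to map each category to its current set of known members.

```python
def diff_score(word, target, categories, pattern_extractions, lexicons):
    """diff(w, c_a) = AvgLog(w, c_a) - max over competing c_b of
    AvgLog(w, c_b)."""
    own = avg_log(word, pattern_extractions, lexicons[target])
    competing = max((avg_log(word, pattern_extractions, lexicons[c])
                     for c in categories if c != target), default=0.0)
    return own - competing
```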
<Section position="5" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.4 Multiple Category Results </SectionTitle>
<Paragraph position="0"> We will use the abbreviation 1CAT to indicate that only one semantic category was bootstrapped, and MCAT to indicate that multiple semantic categories were simultaneously bootstrapped. Figure 6 compares the performance of Basilisk-MCAT with conflict resolution (ba-M) against Basilisk-1CAT (ba-1). Most categories show small performance gains, with the building, location, and weapon categories benefitting the most. However, the improvement usually doesn't kick in until many bootstrapping iterations have passed. This phenomenon is consistent with the visualization of the search space in Figure 5. Since the seed words for each category are not generally located near each other in the search space, the bootstrapping process is unaffected by conflict resolution until the categories begin to encroach on each other's territories.</Paragraph>
<Paragraph position="1"> We also compared the multiple category version of meta-bootstrapping against its single category counterpart (mb-1). Learning multiple categories improves the performance of meta-bootstrapping dramatically for most categories. We were surprised that the improvement for meta-bootstrapping was much more pronounced than for Basilisk. It seems that Basilisk was already doing a better job with errors of confusion, so meta-bootstrapping had more room for improvement.</Paragraph>
<Paragraph position="2"> Finally, we evaluated Basilisk using the diff scoring function to handle multiple categories. Figure 8 compares all three MCAT algorithms, with the smarter diff version of Basilisk labeled as ba-M+. Overall, this version of Basilisk performs best, showing a small improvement over the version with simple conflict resolution. Both multiple category versions of Basilisk also consistently outperform the multiple category version of meta-bootstrapping.</Paragraph>
<Paragraph position="3"> Table 2 summarizes the improvement of the best version of Basilisk over the original meta-bootstrapping algorithm. The left-hand column represents the number of words learned, and each cell indicates how many of those words were correct. These results show that Basilisk produces substantially better accuracy and coverage than meta-bootstrapping. Figure 9 shows examples of words learned by Basilisk. Inspection of the lexicons reveals many unusual words that could be easily overlooked by someone building a dictionary by hand. For example, the words "deserter" and "narcoterrorists" appear in a variety of terrorism articles, but they are not commonly used words in general.</Paragraph>
<Paragraph position="4"> We also measured the recall of Basilisk's lexicons after 1000 words had been learned, based on the gold standard data shown in Table 1. The recall results range from 40-60%, which indicates that a good percentage of the category words are being found, although there are clearly more category words lurking in the corpus.</Paragraph>
</Section> </Section> </Paper>