<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1015"> <Title>Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations</Title> <Section position="5" start_page="113" end_page="116" type="metho"> <SectionTitle> 3 The Espresso Algorithm </SectionTitle> <Paragraph position="0"> Espresso is based on the framework adopted in (Hearst 1992). It is a minimally supervised bootstrapping algorithm that takes as input a few seed instances of a particular relation and iteratively learns surface patterns to extract more instances.</Paragraph> <Paragraph position="1"> The key to Espresso lies in its use of generic patterns, i.e., broad-coverage, noisy patterns that extract many correct but also many incorrect relation instances. For example, for part-of relations, the pattern &quot;X of Y&quot; extracts many correct relation instances like &quot;wheel of the car&quot; but also many incorrect ones like &quot;house of representatives&quot;. The key assumption behind Espresso is that in very large corpora, like the Web, correct instances generated by a generic pattern will also be instantiated by some reliable patterns, where reliable patterns are patterns that have high precision but often very low recall (e.g., &quot;X consists of Y&quot; for part-of relations).
In this section, we describe the overall architecture of Espresso, propose a principled measure of reliability, and give an algorithm for exploiting generic patterns.</Paragraph> <Section position="1" start_page="114" end_page="114" type="sub_section"> <SectionTitle> 3.1 System Architecture </SectionTitle> <Paragraph position="0"> Espresso iterates between the following three phases: pattern induction, pattern ranking/selection, and instance extraction.</Paragraph> <Paragraph position="1"> The algorithm begins with seed instances of a particular binary relation (e.g., is-a) and then iterates through the phases until it extracts τ_1 patterns or the average pattern score decreases by more than τ_2 from the previous iteration. In our experiments, we set τ_1 = 5 and τ_2 = 50%.</Paragraph> <Paragraph position="2"> For our tokenization, in order to harvest multi-word terms as relation instances, we adopt a slightly modified version of the term definition given in (Justeson 1995), as it is one of the most commonly used in the NLP literature: ((Adj|Noun)+|((Adj|Noun)*(NounPrep)?)(Adj|Noun)*)Noun</Paragraph> </Section> <Section position="2" start_page="114" end_page="114" type="sub_section"> <SectionTitle> Pattern Induction </SectionTitle> <Paragraph position="0"> In the pattern induction phase, Espresso infers a set of surface patterns P that connects as many of the seed instances as possible in a given corpus.</Paragraph> <Paragraph position="1"> Any pattern learning algorithm would do. We chose the state-of-the-art algorithm described in (Ravichandran and Hovy 2002) with the following slight modification. For each input instance {x, y}, we first retrieve all sentences containing the two terms x and y. The sentences are then generalized into a set of new sentences S_{x,y} by replacing all terminological expressions by a terminological label, TR.
For example: &quot;Because/IN HF/NNP is/VBZ a/DT weak/JJ acid/NN and/CC x is/VBZ a/DT y&quot; is generalized as: &quot;Because/IN TR is/VBZ a/DT TR and/CC x is/VBZ a/DT y&quot;. Term generalization is useful for small corpora to ease data sparseness. Generalized patterns are naturally less precise, but this is ameliorated by our filtering step described in Section 3.3. As in the original algorithm, all substrings linking terms x and y are then extracted from S_{x,y}, and overall frequencies are computed to form P. Pattern Ranking/Selection: In (Ravichandran and Hovy 2002), a frequency threshold on the patterns in P is set to select the final patterns. However, low-frequency patterns may in fact be very good. In this paper, instead of frequency, we propose a novel measure of pattern reliability, r_π, which is described in detail in Section 3.2.</Paragraph> <Paragraph position="2"> Espresso ranks all patterns in P according to reliability r_π and discards all but the top-k, where k is set to the number of patterns from the previous iteration plus one. In general, we expect that the set of patterns is formed by those of the previous iteration plus a new one. Yet, new statistical evidence can lead the algorithm to discard a pattern that was previously discovered.</Paragraph> </Section> <Section position="3" start_page="114" end_page="115" type="sub_section"> <SectionTitle> Instance Extraction </SectionTitle> <Paragraph position="0"> In this phase, Espresso retrieves from the corpus the set of instances I that match any of the patterns in P. In Section 3.2, we propose a principled measure of instance reliability, r_ι, for</Paragraph> <Paragraph position="2"> ranking instances. Next, Espresso filters incorrect instances using the algorithm proposed in Section 3.3 and then selects the highest scoring m instances, according to r_ι, as input for the subsequent iteration. We experimentally set m = 200.
In small corpora, the number of extracted instances can be too low to guarantee sufficient statistical evidence for the pattern discovery phase of the next iteration. In such cases, the system enters an expansion phase, where instances are expanded as follows: Web expansion: New instances of the patterns in P are retrieved from the Web, using the Google search engine. Specifically, for each instance {x, y} ∈ I, the system creates a set of queries, using each pattern in P instantiated with y. For example, given the instance &quot;Italy, country&quot; and the pattern &quot;Y such as X&quot;, the resulting Google query will be &quot;country such as *&quot;. New instances are then created from the retrieved Web results (e.g., &quot;Canada, country&quot;) and added to I. The noise generated from this expansion is attenuated by the filtering algorithm described in Section 3.3.</Paragraph> <Paragraph position="3"> Syntactic expansion: New instances are created from each instance {x, y} ∈ I by extracting sub-terminological expressions from x corresponding to the syntactic head of terms. For example, the relation &quot;new record of a criminal conviction part-of FBI report&quot; expands to: &quot;new record part-of FBI report&quot;, and &quot;record part-of FBI report&quot;.</Paragraph> </Section> <Section position="4" start_page="115" end_page="115" type="sub_section"> <SectionTitle> 3.2 Pattern and Instance Reliability </SectionTitle> <Paragraph position="0"> Intuitively, a reliable pattern is one that is both highly precise and one that extracts many instances. The recall of a pattern p can be approximated by the fraction of input instances that are extracted by p. Since it is non-trivial to estimate automatically the precision of a pattern, we are wary of keeping patterns that generate many instances (i.e., patterns that generate high recall but potentially disastrous precision). Hence, we desire patterns that are highly associated with the input instances.
Pointwise mutual information (Cover and Thomas 1991) is a commonly used metric for measuring this strength of association between two events x and y: $$ pmi(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)} $$</Paragraph> <Paragraph position="2"> We define the reliability of a pattern p, r_π(p), as its average strength of association across each input instance i in I, weighted by the reliability of each instance i: $$ r_\pi(p) = \frac{\sum_{i \in I} \left( \frac{pmi(i, p)}{max_{pmi}} \cdot r_\iota(i) \right)}{|I|} $$</Paragraph> <Paragraph position="4"> where r_ι(i) is the reliability of instance i (defined below) and max_pmi is the maximum pointwise mutual information between all patterns and all instances. r_π(p) takes values in [0,1]. The reliability of the manually supplied seed instances is r_ι(i) = 1.</Paragraph> <Paragraph position="6"> The pointwise mutual information between instance i = {x, y} and pattern p is estimated using the following formula: $$ pmi(i, p) = \log \frac{|x, p, y|}{|x, *, y| \cdot |*, p, *|} $$</Paragraph> <Paragraph position="8"> where |x, p, y| is the frequency of pattern p instantiated with terms x and y, and the asterisk (*) represents a wildcard. A well-known problem is that pointwise mutual information is biased towards infrequent events. We thus multiply pmi(i, p) with the discounting factor suggested in (Pantel and Ravichandran 2004).</Paragraph> <Paragraph position="9"> Estimating the reliability of an instance is similar to estimating the reliability of a pattern.</Paragraph> <Paragraph position="10"> Intuitively, a reliable instance is one that is highly associated with as many reliable patterns as possible (i.e., we have more confidence in an instance when multiple reliable patterns instantiate it). Hence, analogous to our pattern reliability measure, we define the reliability of an instance i, r_ι(i), as: $$ r_\iota(i) = \frac{\sum_{p \in P} \left( \frac{pmi(i, p)}{max_{pmi}} \cdot r_\pi(p) \right)}{|P|} $$</Paragraph> <Paragraph position="12"> where r_π(p) is the reliability of pattern p (defined earlier) and max_pmi is as before.
Note that r_ι(i) = 1</Paragraph> <Paragraph position="14"> for the manually supplied seed instances.</Paragraph> </Section> <Section position="5" start_page="115" end_page="116" type="sub_section"> <SectionTitle> 3.3 Exploiting Generic Patterns </SectionTitle> <Paragraph position="0"> Generic patterns are high recall / low precision patterns (e.g., the pattern &quot;X of Y&quot; can ambiguously refer to part-of, is-a, and possession relations). Using them blindly increases system recall while dramatically reducing precision.</Paragraph> <Paragraph position="1"> Minimally supervised algorithms have typically ignored them for this reason. Only heavily supervised approaches, like (Girju et al. 2006), have successfully exploited them.</Paragraph> <Paragraph position="2"> Espresso's recall can be significantly increased by automatically separating correct instances extracted by generic patterns from incorrect ones. The challenge is to harness the expressive power of the generic patterns while remaining minimally supervised.</Paragraph> <Paragraph position="3"> The intuition behind our method is that in a very large corpus, like the Web, correct instances of a generic pattern will be instantiated by many of Espresso's reliable patterns accepted in P. Recall that, by definition, Espresso's reliable patterns extract instances with high precision (yet often low recall). In a very large corpus, like the Web, we assume that a correct instance will occur in at least one of Espresso's reliable patterns even though the patterns' recall is low. Intuitively, our confidence in a correct instance increases when i) the instance is associated with many reliable patterns; and ii) its association with the reliable patterns is high.
At a given Espresso iteration, where P_R represents the set of previously selected reliable patterns, this intuition is captured by the following measure of confidence in an instance i = {x, y}: $$ S(i) = \sum_{p \in P_R} S_p(i) \times \frac{r_\pi(p)}{T} $$ where T is the sum of the reliability scores r_π(p) for each pattern p ∈ P_R, and $$ S_p(i) = pmi(i, p) = \log \frac{|x, p, y|}{|x, *, y| \times |*, p, *|} $$</Paragraph> <Paragraph position="5"> where pointwise mutual information between instance i and pattern p is estimated with Google as follows: $$ S_p(i) \approx \frac{|x, p, y|}{|x| \times |y| \times |p|} $$</Paragraph> <Paragraph position="7"> An instance i is rejected if S(i) is smaller than some threshold τ.</Paragraph> <Paragraph position="8"> Although this filtering may also be applied to reliable patterns, we found this to be detrimental in our experiments since most instances generated by reliable patterns are correct. In Espresso, we classify a pattern as generic when it generates more than 10 times the instances of previously accepted reliable patterns.</Paragraph> </Section> </Section> <Section position="6" start_page="116" end_page="117" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> In this section, we present an empirical comparison of Espresso with three state-of-the-art systems on the task of extracting various semantic relations.</Paragraph> <Section position="1" start_page="116" end_page="116" type="sub_section"> <SectionTitle> 4.1 Experimental Setup </SectionTitle> <Paragraph position="0"> We perform our experiments using the following two datasets: * TREC: This dataset consists of a sample of articles from the Aquaint (TREC-9) newswire text collection. The sample consists of 5,951,432 words extracted from the following data files: AP890101 - AP890131, AP890201 - AP890228, and AP890310 - AP890319.</Paragraph> <Paragraph position="1"> * CHEM: This small dataset of 313,590 words consists of a college-level textbook of introductory chemistry (Brown et al. 2003). Each corpus is pre-processed using the Alembic Workbench POS-tagger (Day et al. 1997).
Below we describe the systems used in our empirical evaluation of Espresso.</Paragraph> <Paragraph position="2"> * RH02: The algorithm by Ravichandran and Hovy (2002) described in Section 2.</Paragraph> <Paragraph position="3"> * GI03: The algorithm by Girju et al. (2006) described in Section 2.</Paragraph> <Paragraph position="4"> * PR04: The algorithm by Pantel and Ravichandran (2004) described in Section 2. * ESP-: The Espresso algorithm using the pattern and instance reliability measures, but without using generic patterns.</Paragraph> <Paragraph position="5"> * ESP+: The full Espresso algorithm described in this paper exploiting generic patterns. For ESP+, we experimentally set the threshold τ from Section 3.3 to τ = 0.4 for TREC and τ = 0.3 for CHEM by manually inspecting a small set of instances. Espresso is designed to extract various semantic relations exemplified by a given small set of seed instances. We consider the standard is-a and part-of relations as well as the following more specific relations: * succession: This relation indicates that a person succeeds another in a position or title. For example, George Bush succeeded Bill Clinton and Pope Benedict XVI succeeded Pope John Paul II. We evaluate this relation on the TREC-9 corpus.</Paragraph> <Paragraph position="6"> * reaction: This relation occurs between chemical elements/molecules that can be combined in a chemical reaction. For example, hydrogen gas reacts-with oxygen gas and zinc reacts-with hydrochloric acid. We evaluate this relation on the CHEM corpus.</Paragraph> <Paragraph position="7"> * production: This relation occurs when a process or element/object produces a result.</Paragraph> <Paragraph position="9"> For example, ammonia produces nitric oxide. We evaluate this relation on the CHEM corpus.</Paragraph> <Paragraph position="10"> For each semantic relation, we manually extracted a small set of seed examples.
The seeds were used for both Espresso and RH02.</Paragraph> <Paragraph position="11"> Table 1 lists a sample of the seeds as well as sample outputs from Espresso.</Paragraph> </Section> <Section position="2" start_page="116" end_page="117" type="sub_section"> <SectionTitle> 4.2 Precision and Recall </SectionTitle> <Paragraph position="0"> We implemented the systems outlined in Section 4.1, except for GI03, and applied them to the TREC and CHEM datasets. (Note that production is an ambiguous relation; it is intended here as a causation relation in the context of chemical reactions.) For each output set, per relation, we evaluate the precision of the system by extracting a random sample of instances (50 for the TREC corpus and 20 for the CHEM corpus) and evaluating their quality manually using two human judges (a total of 680 instances were annotated per judge). For each instance, judges may assign a score of 1 for correct, 0 for incorrect, and 1/2 for partially correct. Example instances that were judged partially correct include &quot;analyst is-a manager&quot; and &quot;pilot is-a teacher&quot;. The kappa statistic (Siegel and Castellan Jr. 1988) on this task was K = 0.69. The precision for a given set of instances is the sum of the judges' scores divided by the total number of instances. Although knowing the total number of correct instances of a particular relation in any non-trivial corpus is impossible, it is possible to compute the recall of a system relative to another system's recall. Following (Pantel et al.
2004), we define the relative recall of system A given system B, R_{A|B}, as: $$ R_{A|B} = \frac{R_A}{R_B} = \frac{C_A / C}{C_B / C} = \frac{C_A}{C_B} = \frac{P_A \times |A|}{P_B \times |B|} $$ where R_A is the recall of A, C_A is the number of correct instances extracted by A, C is the (unknown) total number of correct instances in the corpus, P_A is A's precision in our experiments,</Paragraph> <Paragraph position="1"> and |A| is the total number of instances discovered by A. (On the judging task above, the kappa statistic rises to K = 0.79 if we treat partially correct classifications as correct.)</Paragraph> <Paragraph position="2"> Tables 2 - 8 report the total number of instances, precision, and relative recall of each system on the TREC-9 and CHEM corpora. The relative recall is always given in relation to the ESP- system. For example, in Table 2, RH02 has a relative recall of 5.31 with ESP-, which means that the RH02 system outputs 5.31 times more correct relations than ESP- (at a cost of much lower precision). Similarly, PR04 has a relative recall of 0.23 with ESP-, which means that PR04 outputs 4.35 times fewer correct relations than ESP- (also with lower precision). We did not include the results from GI03 in the tables since the system is only applicable to part-of relations and we did not reproduce it.
However, the authors evaluated their system on a sample of the TREC-9 dataset and reported 83% precision and 72% recall (this algorithm is heavily supervised.)</Paragraph> </Section> </Section> <Section position="7" start_page="117" end_page="117" type="metho"> <SectionTitle> SYSTEM INSTANCES PRECISION </SectionTitle> <Paragraph position="0"/> </Section> <Section position="8" start_page="117" end_page="117" type="metho"> <SectionTitle> SYSTEM INSTANCES PRECISION </SectionTitle> <Paragraph position="0"/> </Section> <Section position="9" start_page="117" end_page="117" type="metho"> <SectionTitle> SYSTEM INSTANCES PRECISION </SectionTitle> <Paragraph position="0"/> </Section> <Section position="10" start_page="117" end_page="117" type="metho"> <SectionTitle> SYSTEM INSTANCES PRECISION </SectionTitle> <Paragraph position="0"/> </Section> <Section position="11" start_page="117" end_page="119" type="metho"> <SectionTitle> SYSTEM INSTANCES PRECISION </SectionTitle> <Paragraph position="0"> In all tables, RH02 extracts many more relations than ESP-, but with a much lower precision, because it uses generic patterns without filtering.</Paragraph> <Paragraph position="1"> The high precision of ESP- is due to the effective reliability measures presented in Section 3.2.</Paragraph> <Section position="1" start_page="118" end_page="119" type="sub_section"> <SectionTitle> 4.3 Effect of Generic Patterns </SectionTitle> <Paragraph position="0"> Experimental results, for all relations and the two different corpus sizes, show that ESP- greatly outperforms the other methods on precision.</Paragraph> <Paragraph position="1"> However, without the use of generic patterns, the ESP- system shows lower recall in all but the production relation.</Paragraph> <Paragraph position="2"> As hypothesized, exploiting generic patterns using the algorithm from Section 3.3 substantially improves recall without much deterioration in precision. 
ESP+ shows one to two orders of magnitude improvement on recall while losing on average less than 10% precision. The succession relation in Table 6 was the only relation where Espresso found no generic pattern. For other relations, Espresso found from one to five generic patterns. Table 4 shows the power of generic patterns, where system recall increases 577-fold with only a 10% drop in precision. In Table 7, we see a case where the combination of filtering with a large increase in retrieved instances resulted in both higher precision and recall.</Paragraph> <Paragraph position="3"> In order to better analyze our use of generic patterns, we performed the following experiment.</Paragraph> <Paragraph position="4"> For each relation, we randomly sampled 100 instances for each generic pattern and built a gold standard for them (by manually tagging each instance as correct or incorrect). We then sorted the 100 instances according to the scoring formula S(i) derived in Section 3.3 and computed the average precision, recall, and F-score of the top-K ranked instances for each pattern. Due to lack of space, we only present the graphs for four of the 22 generic patterns: &quot;X is a Y&quot; for the is-a relation of Table 2, &quot;X in the Y&quot; for the part-of relation of Table 4, &quot;X in Y&quot; for the part-of relation of Table 5, and &quot;X and Y&quot; for the reaction relation of Table 7. Figure 1 illustrates the results. In each figure, notice that recall climbs at a much faster rate than precision decreases. This indicates that the scoring function of Section 3.3 effectively separates correct and incorrect instances. In Figure 1a, there is a big initial drop in precision that accounts for the poor precision reported in Table 1.</Paragraph> <Paragraph position="5"> Recall that the cutoff points on S(i) were set to τ = 0.4 for TREC and τ = 0.3 for CHEM. The figures show that this cutoff is far from the maximum F-score.
An interesting avenue of future work would be to automatically determine the proper threshold for each individual generic pattern instead of setting a uniform threshold.</Paragraph> <Paragraph position="6"> We can directly compute recall here since we built a gold standard for each set of 100 samples.</Paragraph> </Section> </Section> </Paper>