<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1047"> <Title>A Semantic Approach to IE Pattern Induction</Title> <Section position="3" start_page="379" end_page="380" type="metho"> <SectionTitle> 2 Extraction Pattern Learning </SectionTitle> <Paragraph position="0"> We begin by outlining the general process of learning extraction patterns, similar to the one presented by Yangarber (2003).</Paragraph> <Paragraph position="1"> 1. For a given IE scenario we assume the existence of a set of documents against which the system can be trained. The documents are unannotated and may be either relevant (contain the description of an event relevant to the scenario) or irrelevant, although the algorithm has no access to this information.</Paragraph> <Paragraph position="2"> 2. This corpus is pre-processed to generate the set of all patterns which could be used to represent sentences contained in the corpus; call this set S. The aim of the learning process is to identify the subset of S representing patterns which are relevant to the IE scenario.</Paragraph> <Paragraph position="3"> 3. The user provides a small set of seed patterns, Sseed, which are relevant to the scenario. These patterns are used to form the set of currently accepted patterns, Sacc, so Sacc ← Sseed. The remaining patterns are treated as candidates for inclusion in the accepted set; these form the set Scand (= S \ Sacc).</Paragraph> <Paragraph position="4"> 4. A function, f, is used to assign a score to each pattern in Scand based on those which are currently in Sacc. This function assigns a real number to candidate patterns, so ∀ c ∈ Scand, f(c, Sacc) ∈ ℝ. A set of high scoring patterns (based on absolute scores or ranks after the set of patterns has been ordered by scores) is chosen as being suitable for inclusion in the set of accepted patterns. These form the set Slearn.</Paragraph> <Paragraph position="5"> 5. The patterns in Slearn are added to Sacc and removed from Scand, so Sacc ← Sacc ∪ Slearn and Scand ← Scand \ Slearn.</Paragraph> <Paragraph position="6"> 6. If a suitable set of patterns has been learned then stop, otherwise go to step 4.</Paragraph>
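To make the control flow of steps 3-6 concrete, here is a minimal Python sketch of the loop; it is not code from the original system. The scoring function, acceptance policy and iteration limit are deliberately left as parameters, since Sections 2.1 and 3 discuss competing choices for them, and all names are illustrative.

    def learn_patterns(all_patterns, seed_patterns, score, select,
                       max_iterations=50):
        """Generic weakly supervised pattern-learning loop (steps 3-6)."""
        s_acc = set(seed_patterns)           # currently accepted patterns
        s_cand = set(all_patterns) - s_acc   # candidates for inclusion
        for _ in range(max_iterations):
            if not s_cand:
                break
            # Step 4: score each candidate against the accepted set.
            scored = {c: score(c, s_acc) for c in s_cand}
            s_learn = select(scored)         # set of high-scoring candidates
            if not s_learn:
                break                        # no suitable patterns remain
            # Step 5: move the learned patterns into the accepted set.
            s_acc |= s_learn
            s_cand -= s_learn
        return s_acc

With the choices made in Section 3.3 below, score becomes similarity to the centroid of the accepted patterns and select the top-four-within-0.95 rule.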
<Section position="1" start_page="379" end_page="380" type="sub_section"> <SectionTitle> 2.1 Document-centric approach </SectionTitle> <Paragraph position="0"> A key choice in the development of such an algorithm is step 4, the process of ranking the candidate patterns, which effectively determines which of the candidate patterns will be learned. Yangarber et al. (2000) chose an approach motivated by the assumption that documents containing a large number of patterns already identified as relevant to a particular IE scenario are likely to contain further relevant patterns. This approach, which can be viewed as document-centric, operates by associating confidence scores with patterns and relevance scores with documents. Initially, seed patterns are given a maximum confidence score of 1 and all others a score of 0. Each document is given a relevance score based on the patterns which occur within it. Candidate patterns are ranked according to the proportion of relevant and irrelevant documents in which they occur; those found in relevant documents far more often than in irrelevant ones are ranked highly. After new patterns have been accepted, all patterns' confidence scores are updated based on the documents in which they occur, and documents' relevance scores are updated according to the accepted patterns they contain, as sketched below.</Paragraph>
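The following is a deliberately simplified sketch of the flavour of this document-centric scoring, assuming a crude relevance rule; the actual update formulas in Yangarber et al. (2000) weight documents and patterns more carefully, so every quantity here should be read as illustrative.

    def document_centric_scores(doc_patterns, confidence, threshold=0.5):
        """Rank candidate patterns by the fraction of their occurrences
        that fall in documents judged relevant under the current pattern
        confidences (seeds start at 1.0, everything else at 0.0)."""
        relevant = {d for d, patterns in doc_patterns.items()
                    if sum(confidence.get(p, 0.0) for p in patterns) > threshold}
        counts = {}  # pattern -> (total occurrences, occurrences in relevant docs)
        for d, patterns in doc_patterns.items():
            for p in patterns:
                if confidence.get(p, 0.0) > 0:
                    continue  # already accepted; not a candidate
                total, in_rel = counts.get(p, (0, 0))
                counts[p] = (total + 1, in_rel + (d in relevant))
        return {p: in_rel / total for p, (total, in_rel) in counts.items()}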
<Paragraph position="1"> This approach has been shown to successfully acquire useful extraction patterns which, when added to an IE system, improved its performance (Yangarber et al., 2000). However, it relies on an assumption about the way in which relevant patterns are distributed in a document collection and may learn patterns which tend to occur in the same documents as relevant ones whether or not they are actually relevant. For example, we could imagine an IE scenario in which relevant documents contain a piece of information which is related to, but distinct from, the information we aim to extract. If patterns expressing this information were more likely to occur in relevant documents than irrelevant ones, the document-centric approach would also learn these irrelevant patterns.</Paragraph> <Paragraph position="2"> Rather than focusing on the documents matched by a pattern, an alternative approach is to rank patterns according to how similar their meanings are to those which are known to be relevant. This semantic-similarity approach avoids the problem which may be present in the document-centric approach, since patterns which happen to co-occur in the same documents as relevant ones but have different meanings will not be ranked highly. We now go on to describe a new algorithm which implements this approach.</Paragraph> </Section> </Section> <Section position="4" start_page="380" end_page="382" type="metho"> <SectionTitle> 3 Semantic IE Pattern Learning </SectionTitle> <Paragraph position="0"> For these experiments extraction patterns consist of predicate-argument structures, as proposed by Yangarber (2003). Under this scheme patterns consist of triples representing the subject, verb and object (SVO) of a clause. The first element is the &quot;semantic&quot; subject (or agent); for example, &quot;John&quot; is a clausal subject in each of these sentences: &quot;John hit Bill&quot;, &quot;Bill was hit by John&quot;, &quot;Mary saw John hit Bill&quot;, and &quot;John is a bully&quot;. The second element is the verb in the clause and the third the object (patient) or predicate. &quot;Bill&quot; is a clausal object in the first three example sentences and &quot;bully&quot; in the final one. When a verb is being used intransitively, the pattern for that clause is restricted to only the first pair of elements.</Paragraph> <Paragraph position="1"> The filler of each pattern element can be either a lexical item or a semantic category such as person name, country, currency value, numerical expression, etc. In this paper lexical items are represented in lower case and semantic categories are capitalised. For example, in the pattern COMPANY+fired+ceo, fired and ceo are lexical items and COMPANY a semantic category which could match any lexical item belonging to that type.</Paragraph> <Paragraph position="2"> The algorithm described here relies on identifying patterns with similar meanings. The approach we have developed to do this is inspired by the vector space model which is commonly used in Information Retrieval (Salton and McGill, 1983) and language processing in general (Pado and Lapata, 2003). Each pattern can be represented as a set of pattern element-filler pairs. For example, the pattern COMPANY+fired+ceo consists of three pairs: subject COMPANY, verb fired and object ceo. Each pair consists of a pattern element together with a filler, which is either a lexical item or a semantic category. Once an appropriate set of pairs has been established, a pattern can be represented as a binary vector in which an element with value 1 denotes that the pattern contains a particular pair and 0 that it does not.</Paragraph> <Section position="1" start_page="380" end_page="381" type="sub_section"> <SectionTitle> 3.1 Pattern Similarity </SectionTitle> <Paragraph position="0"> The similarity of two pattern vectors can be compared using the measure shown in Equation 1. Here a and b are pattern vectors, b^T the transpose of b, and W a matrix that lists the similarity between each of the possible pattern element-filler pairs:

similarity(a, b) = (a W b^T) / (|a| |b|)   (1)

</Paragraph> <Paragraph position="1"> [Figure 1: An example vector space formed from three patterns (a. chairman+resign, b. ceo+quit, c. chairman+comment), the element-filler pairs which form its dimensions (1. subject chairman, 2. subject ceo, 3. verb resign, 4. verb quit, 5. verb comment), and a sample semantic similarity matrix W.]</Paragraph> <Paragraph position="2"> The semantic similarity matrix W contains information about the similarity of each pattern element-filler pair stored as non-negative real numbers and is crucial for this measure. Assume that the set of patterns, P, consists of n element-filler pairs denoted by p1, p2, ..., pn. Each row and column of W represents one of these pairs and they are consistently labelled: for any i such that 1 ≤ i ≤ n, row i and column i are both labelled with pair pi. If wij is the element of W in row i and column j then the value of wij represents the similarity between the pairs pi and pj. Note that we assume the similarity of two element-filler pairs is symmetric, so wij = wji and, consequently, W is a symmetric matrix. Pairs with different pattern elements (i.e. grammatical roles) are automatically given a similarity score of 0. Diagonal elements of W represent the self-similarity between pairs and have the greatest values.</Paragraph> <Paragraph position="3"> Figure 1 shows an example using three patterns: chairman+resign, ceo+quit and chairman+comment. It shows how these patterns are represented as vectors and gives a sample semantic similarity matrix. It can be seen that the first pair of patterns are the most similar using the proposed measure.</Paragraph> <Paragraph position="4"> The measure in Equation 1 is similar to the cosine metric, commonly used to determine the similarity of documents in the vector space model approach to Information Retrieval. However, the cosine metric will not perform well for our application since it does not take into account the similarity between elements of a vector and would assign equal similarity to each pair of patterns in the example shown in Figure 1.[1] The semantic similarity matrix in Equation 1 provides a mechanism to capture semantic similarity between lexical items which allows us to identify chairman+resign and ceo+quit as the most similar pair of patterns.</Paragraph> <Paragraph position="5"> [1] The cosine metric for a pair of vectors is given by the calculation (a . b) / (|a| |b|). Substituting the matrix multiplication in the numerator of Equation 1 for the dot product of vectors a and b would give the cosine metric. Note that taking the dot product of a pair of vectors is equivalent to multiplying by the identity matrix, i.e. a . b = a I b^T. Under our interpretation of the similarity matrix, W, this equates to each pattern element-filler pair being identical to itself but not similar to anything else.</Paragraph> </Section>
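A small sketch of Equation 1 applied to the Figure 1 patterns. The binary vectors follow the figure; the off-diagonal values in W below are invented for illustration, since the figure's actual numbers are not reproduced in this text.

    import numpy as np

    # Dimensions: the element-filler pairs from Figure 1.
    # 1. subject chairman, 2. subject ceo, 3. verb resign,
    # 4. verb quit, 5. verb comment
    a = np.array([1, 0, 1, 0, 0])  # chairman+resign
    b = np.array([0, 1, 0, 1, 0])  # ceo+quit
    c = np.array([1, 0, 0, 0, 1])  # chairman+comment

    # Symmetric similarity matrix W: diagonal entries are largest and
    # pairs with different grammatical roles get similarity 0.
    W = np.array([[1.0, 0.9, 0.0, 0.0, 0.0],
                  [0.9, 1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.9, 0.1],
                  [0.0, 0.0, 0.9, 1.0, 0.1],
                  [0.0, 0.0, 0.1, 0.1, 1.0]])

    def similarity(u, v):
        """Equation 1: (u W v^T) / (|u| |v|)."""
        return (u @ W @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

    print(similarity(a, b))  # 0.90: chairman+resign vs ceo+quit
    print(similarity(a, c))  # 0.55: chairman+resign vs chairman+comment

With these illustrative values, chairman+resign and ceo+quit come out as the most similar pair, matching the discussion above.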
<Section position="2" start_page="381" end_page="381" type="sub_section"> <SectionTitle> 3.2 Populating the Matrix </SectionTitle> <Paragraph position="0"> It is important to choose appropriate values for the elements of W. We chose to make use of the research that has concentrated on computing similarity between pairs of lexical items using the WordNet hierarchy (Resnik, 1995; Jiang and Conrath, 1997; Patwardhan et al., 2003). We experimented with several of the measures which have been reported in the literature and found the one proposed by Jiang and Conrath (1997) to be the most effective.</Paragraph> <Paragraph position="1"> The similarity measure proposed by Jiang and Conrath (1997) relies on a technique developed by Resnik (1995) which assigns numerical values to each sense in the WordNet hierarchy based upon the amount of information it represents. These values are derived from corpus counts of the words in the synset, either directly or via the hyponym relation, and are used to derive the Information Content (IC) of a synset c thus: IC(c) = −log(Pr(c)). For two senses, s1 and s2, the lowest common subsumer, lcs(s1, s2), is defined as the sense with the highest information content (most specific) which subsumes both senses in the WordNet hierarchy. Jiang and Conrath used these elements to calculate the semantic distance between a pair of words, w1 and w2, according to the following formula, where senses(w) is the set of all possible WordNet senses for word w:

dist(w1, w2) = min_{s1 ∈ senses(w1), s2 ∈ senses(w2)} [ IC(s1) + IC(s2) − 2 × IC(lcs(s1, s2)) ]   (2)

</Paragraph> <Paragraph position="2"> Patwardhan et al. (2003) convert this distance metric into a similarity measure by taking its multiplicative inverse. Their implementation was used in the experiments described later.</Paragraph> <Paragraph position="3"> As mentioned above, the second part of a pattern element-filler pair can be either a lexical item or a semantic category, such as company. The identifiers used to denote these categories, i.e. COMPANY, do not appear in WordNet and so it is not possible to directly compare their similarity with other lexical items. To avoid this problem these tokens are manually mapped onto the most appropriate node in the WordNet hierarchy which is then used for similarity calculations. This mapping process is not particularly time-consuming since the number of named entity types with which a corpus is annotated is usually quite small. For example, in the experiments described in this paper just seven semantic classes were sufficient to annotate the corpus.</Paragraph> </Section>
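For readers wanting to reproduce this kind of matrix, the sketch below computes word-level Jiang-Conrath scores with NLTK's WordNet interface. This is an assumption-laden substitute: the paper used the Patwardhan et al. (2003) implementation (WordNet::Similarity), not NLTK, and NLTK's jcn_similarity already returns the multiplicative inverse of the distance in Equation 2.

    from itertools import product
    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    # Information Content derived from Brown corpus counts.
    brown_ic = wordnet_ic.ic('ic-brown.dat')

    def word_similarity(w1, w2, pos=wn.NOUN):
        """Best Jiang-Conrath similarity over all noun sense pairs,
        mirroring the min over sense pairs in Equation 2's distance."""
        scores = [s1.jcn_similarity(s2, brown_ic)
                  for s1, s2 in product(wn.synsets(w1, pos),
                                        wn.synsets(w2, pos))]
        return max(scores, default=0.0)

    print(word_similarity('chairman', 'ceo'))
    print(word_similarity('chairman', 'comment'))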
<Section position="3" start_page="381" end_page="382" type="sub_section"> <SectionTitle> 3.3 Learning Algorithm </SectionTitle> <Paragraph position="0"> This pattern similarity measure can be used to create a weakly supervised approach to pattern acquisition following the general outline provided in Section 2. Each candidate pattern is compared against the set of currently accepted patterns using the measure described in Section 3.1. We experimented with several techniques for ranking candidate patterns based on these scores, including using the best and average score, and found that the best results were obtained when each candidate pattern was ranked according to its score when compared against the centroid vector of the set of currently accepted patterns. We also experimented with several schemes for deciding which of the scored patterns to accept (a full description would be too long for this paper), resulting in a scheme in which the four highest scoring patterns whose scores are within 0.95 of the best pattern's score are accepted.</Paragraph> <Paragraph position="1"> Our algorithm disregards any patterns whose corpus occurrences are below a set threshold, α, since these may be due to noise. In addition, a second threshold, β, is used to determine the maximum number of documents in which a pattern can occur, since very frequent patterns are often too general to be useful for IE. Patterns which occur in more than β × C documents, where C is the number of documents in the collection, are not learned. For the experiments in this paper we set α to 2 and β to 0.3.</Paragraph> </Section> </Section>
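A sketch of this ranking-and-acceptance step, reusing the vector representation and similarity function from the earlier snippets; the function names and data structures are ours, not the paper's.

    import numpy as np

    def similarity(u, v, W):
        """Equation 1: (u W v^T) / (|u| |v|)."""
        return (u @ W @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def rank_and_accept(cand_vectors, acc_vectors, W, top_n=4, ratio=0.95):
        """Score candidates against the centroid of the accepted patterns;
        accept at most top_n whose score is within ratio of the best."""
        if not cand_vectors:
            return []
        centroid = np.mean(acc_vectors, axis=0)
        scores = {p: similarity(v, centroid, W)
                  for p, v in cand_vectors.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        best = scores[ranked[0]]
        return [p for p in ranked[:top_n] if scores[p] >= ratio * best]

    def frequency_filter(corpus_freq, doc_freq, num_docs, alpha=2, beta=0.3):
        """Drop rare patterns (fewer than alpha corpus occurrences) and
        overly general ones (in more than beta * C documents)."""
        return {p for p, n in corpus_freq.items()
                if n >= alpha and doc_freq.get(p, 0) <= beta * num_docs}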
<Section position="5" start_page="382" end_page="382" type="metho"> <SectionTitle> 4 Implementation </SectionTitle> <Paragraph position="0"> A number of pre-processing stages have to be applied to the documents so that the set of patterns can be extracted before learning takes place. Firstly, items belonging to semantic categories are identified by running the text through the named entity identifier in the GATE system (Cunningham et al., 2002). The corpus is then parsed, using a version of MINIPAR (Lin, 1999) adapted to process text marked with named entities, to produce dependency trees from which SVO-patterns are extracted.</Paragraph> <Paragraph position="1"> Active and passive voice are taken into account in MINIPAR's output, so the sentences &quot;COMPANY fired their C.E.O.&quot; and &quot;The C.E.O. was fired by COMPANY&quot; would yield the same triple, COMPANY+fire+ceo. The indirect object of ditransitive verbs is not extracted; these verbs are treated like transitive verbs for the purposes of this analysis.</Paragraph> <Paragraph position="2"> An implementation of the algorithm described in Section 3 was completed, in addition to an implementation of the document-centric algorithm described in Section 2.1. It is important to mention that this implementation is not identical to the one described by Yangarber et al. (2000). Their system makes some generalisations across pattern elements by grouping certain elements together. However, there is no difference between the expressiveness of the patterns learned by either approach and we do not believe this difference has any effect on the results of our experiments.</Paragraph> </Section> <Section position="6" start_page="382" end_page="383" type="metho"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> Various approaches have been suggested for the evaluation of automatic IE pattern acquisition.</Paragraph> <Paragraph position="1"> Riloff (1996) judged the precision of learned patterns by reviewing them manually. Yangarber et al. (2000) developed an indirect method which allowed automatic evaluation. In addition to learning a set of patterns, their system also notes the relevance of documents based on the current set of accepted patterns. Assuming the subset of documents relevant to a particular IE scenario is known, it is possible to use these relevance judgements to determine how accurately a given set of patterns can discriminate the relevant documents from the irrelevant. This evaluation is similar to the &quot;text-filtering&quot; sub-task used in the sixth Message Understanding Conference (MUC-6) (1995), in which systems were evaluated according to their ability to identify the documents relevant to the extraction task. The document filtering evaluation technique was used here to allow comparison with previous studies.</Paragraph> <Paragraph position="2"> Identifying the documents containing relevant information can be considered a preliminary stage of an IE task. A further step is to identify the relevant sentences within those documents. This &quot;sentence filtering&quot; task is a more fine-grained evaluation and is likely to provide more information about how well a given set of patterns will perform as part of an IE system. Soderland (1999) developed a version of the MUC-6 corpus in which events are marked at the sentence level. The set of patterns learned by the algorithm after each iteration can be compared against this corpus to determine how accurately they identify the relevant sentences for this extraction task.</Paragraph> <Section position="1" start_page="382" end_page="383" type="sub_section"> <SectionTitle> 5.1 Evaluation Corpus </SectionTitle> <Paragraph position="0"> The evaluation corpus used for the experiments was compiled from the training and testing corpus used in MUC-6, where the task was to extract information about the movements of executives from newswire texts. A document is relevant if it has a filled template associated with it. 590 documents from a version of the MUC-6 evaluation corpus described by Soderland (1999) were used.</Paragraph> <Paragraph position="1"> After the pre-processing stages described in Section 4, the MUC-6 corpus produced 15,407 pattern tokens from 11,294 different types. 10,512 patterns appeared just once; these were effectively discarded, since our learning algorithm only considers patterns which occur at least twice (see Section 3.3).</Paragraph> <Paragraph position="2"> The document-centric approach benefits from a large corpus containing a mixture of relevant and irrelevant documents. We provided this using a subset of the Reuters Corpus Volume I (Rose et al., 2002) which, like the MUC-6 corpus, consists of newswire texts. 3000 documents relevant to the management succession task (identified using document metadata) and 3000 irrelevant documents were used to produce the supplementary corpus. This supplementary corpus yielded 126,942 pattern tokens and 79,473 types, with 14,576 of these appearing more than once. Adding the supplementary corpus to the data set used by the document-centric approach led to an improvement of around 15% on the document filtering task and over 70% for sentence filtering. It was not used for the semantic similarity algorithm since it provided no benefit there.</Paragraph> <Paragraph position="3"> The set of seed patterns listed in Table 1 is indicative of the management succession extraction task and was used for these experiments.</Paragraph> </Section> </Section> </Paper>
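Finally, a short sketch of the document-filtering evaluation described in Section 5: a document is predicted relevant if it contains at least one accepted pattern, and predictions are scored against gold relevance judgements. The data structures are illustrative, not taken from the MUC-6 tooling.

    def document_filtering_scores(doc_patterns, accepted, gold_relevant):
        """doc_patterns: doc id -> set of patterns occurring in the doc;
        accepted: learned pattern set; gold_relevant: relevant doc ids."""
        predicted = {d for d, pats in doc_patterns.items() if pats & accepted}
        tp = len(predicted & gold_relevant)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold_relevant) if gold_relevant else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

The same computation applied to sentence-level units gives the sentence-filtering evaluation.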