<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2207"> <Title>A Hybrid Approach for the Acquisition of Information Extraction Patterns</Title> <Section position="3" start_page="48" end_page="50" type="metho"> <SectionTitle> 2 The Pattern Acquisition Framework </SectionTitle> <Paragraph position="0"> In this section we introduce a modular pattern acquisition framework that co-trains two different views of the document collection: the first view uses the collection words to train a text categorization algorithm, while the second view bootstraps a decision list learner that uses all syntactico-semantic patterns as features. The rules acquired by the latter algorithm, of the form p → y, where p is a pattern and y is a domain label, are the output of the overall system. The system can be customized with several pattern selection strategies that dramatically influence the quality and order of the acquired rules.</Paragraph> <Section position="1" start_page="48" end_page="48" type="sub_section"> <SectionTitle> 2.1 Co-training Text Categorization and Pattern Acquisition </SectionTitle> <Paragraph position="0"> Given two views of a classification task, co-training (Blum and Mitchell, 1998) bootstraps a separate classifier for each view as follows: (1) it initializes both classifiers with the same small amount of labeled data (i.e. seed documents in our case); (2) it repeatedly trains both classifiers using the currently labeled data; and (3) after each learning iteration, the two classifiers share all or a subset of the newly labeled examples (documents in our particular case).</Paragraph> <Paragraph position="1"> The intuition is that each classifier provides new, informative labeled data to the other classifier. If the two views are conditionally independent and the two classifiers generally agree on the unlabeled data, they will have a low generalization error. In this paper we focus on a "naive" co-training approach, which trains a different classifier in each iteration and feeds its newly labeled examples to the other classifier. This approach was shown to perform well on real-world natural language problems (Collins and Singer, 1999).</Paragraph> <Paragraph position="2"> Figure 1 illustrates the co-training framework used in this paper. The feature space of the first view contains only lexical information, i.e. the collection words, and uses Expectation Maximization (EM) as its classifier (Dempster et al., 1977). EM is actually a class of iterative algorithms that find maximum likelihood estimates of parameters using probabilistic models over incomplete data (e.g. both labeled and unlabeled documents) (Dempster et al., 1977). EM was theoretically proven to converge to a local maximum of the parameters' log likelihood. Furthermore, empirical experiments showed that EM has excellent performance for lightly-supervised text classification (Nigam et al., 2000). The EM algorithm used in this paper estimates its model parameters using the Naive Bayes (NB) assumptions, similarly to (Nigam et al., 2000). From this point forward, we refer to this instance of the EM algorithm as NB-EM.</Paragraph> <Paragraph position="3"> The feature space of the second view contains the syntactico-semantic patterns, generated using the procedure detailed in Section 3.2. The second learner is the actual pattern acquisition algorithm, implemented as a bootstrapped decision list classifier.</Paragraph> <Paragraph position="4"> The co-training algorithm introduced in this paper interleaves one iteration of the NB-EM algorithm with one iteration of the pattern acquisition algorithm. If one classifier converges faster (e.g. NB-EM typically converges in under 20 iterations, whereas the acquisition algorithm learns new patterns for hundreds of iterations), we continue bootstrapping the other classifier alone.</Paragraph> </Section>
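The interleaved loop described above can be summarized in a short sketch. This is a minimal illustration only: the nb_em_step and pattern_step callables, which each run one learning iteration over their view and return any newly labeled documents, are hypothetical names of ours, not functions from the paper.

    def co_train(seed_labels, max_iterations, nb_em_step, pattern_step):
        """Naive co-training: alternate one NB-EM iteration with one
        pattern-acquisition iteration, sharing newly labeled documents."""
        labeled = dict(seed_labels)  # document id -> domain label
        for _ in range(max_iterations):
            progress = False
            # Each step sees all documents labeled so far and may label more.
            for step in (nb_em_step, pattern_step):
                newly_labeled = step(labeled)
                if newly_labeled:
                    labeled.update(newly_labeled)
                    progress = True
            # If one learner has converged, the other keeps bootstrapping
            # alone; stop only when neither view labels anything new.
            if not progress:
                break
        return labeled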
<Section position="2" start_page="48" end_page="49" type="sub_section"> <SectionTitle> 2.2 The Text Categorization Algorithm </SectionTitle> <Paragraph position="0"> Figure 2: The generic pattern acquisition algorithm.
1. Initialization:
* Initialize the set of labeled examples with n labeled seed documents of the form (d_i, y_i). y_i is the label of the ith document d_i. Each document d_i contains a set of patterns {p_i1, p_i2, ..., p_im}.
* Initialize the list of learned rules R = {}.
2. Loop:
* For each label y, select a small set of pattern rules r = p → y, r ∉ R.
* Append all selected rules r to R.
* For all non-seed documents d that contain a pattern in R, set label(d) = argmax_{p,y} strength(p, y), computed over the rules p → y in R whose pattern p occurs in d.
3. Termination condition:
* Stop if no rules were selected or the maximum number of iterations was reached.</Paragraph> <Paragraph position="1"> The parameters of the generative NB model, θ̂, include the probability of seeing a given category, P(c|θ̂), and the probability of seeing a word given a category, P(w|c; θ̂). We calculate both similarly to Nigam (2000). Using these parameters, the word independence assumption typical of the Naive Bayes model, and the Bayes rule, the probability that a document d has a given category c is calculated as:
P(c|d; θ̂) = P(c|θ̂) ∏_{i=1}^{|d|} P(w_i|c; θ̂) / Σ_{c'} P(c'|θ̂) ∏_{i=1}^{|d|} P(w_i|c'; θ̂)</Paragraph> </Section>
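As a concrete illustration of the formula above, a minimal sketch of the NB posterior computation, assuming log-space parameter tables log_prior and log_cond produced by an NB-EM trainer (the helper names and data layout are our assumptions, not the paper's):

    import math

    def nb_posterior(doc_words, log_prior, log_cond):
        # log_prior[c] = log P(c | theta); log_cond[c][w] = log P(w | c, theta),
        # assumed smoothed so every collection word has an estimate.
        log_scores = {}
        for c in log_prior:
            log_scores[c] = log_prior[c] + sum(log_cond[c][w] for w in doc_words)
        # Normalize in log space: this realizes the denominator of the
        # equation above without numeric underflow.
        m = max(log_scores.values())
        exp_scores = {c: math.exp(s - m) for c, s in log_scores.items()}
        z = sum(exp_scores.values())
        return {c: v / z for c, v in exp_scores.items()}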
<Section position="3" start_page="49" end_page="50" type="sub_section"> <SectionTitle> 2.3 The Pattern Acquisition Algorithm </SectionTitle> <Paragraph position="0"> The lightly-supervised pattern acquisition algorithm iteratively learns domain-specific IE patterns from a small set of labeled documents and a much larger set of unlabeled documents. During each learning iteration, the algorithm acquires a new set of patterns and labels more documents based on the new evidence. The algorithm output is a list R of rules p → y, where p is a pattern in the set of patterns P, and y a category label in Y = {1...k}, k being the number of categories in the document collection. The list of acquired rules R is sorted in descending order of rule importance to guarantee that the most relevant rules are accessed first. This generic bootstrapping algorithm is formalized in Figure 2.</Paragraph> <Paragraph position="1"> Previous studies called the class of algorithms illustrated in Figure 2 "cautious" or "sequential" because in each iteration they acquire one rule or a small set of rules (Abney, 2004; Collins and Singer, 1999). This strategy stops the algorithm from becoming over-confident, an important restriction for an algorithm that learns from large amounts of unlabeled data. This approach was empirically shown to perform better than a method that in each iteration acquires all rules that match a certain criterion (e.g. the corresponding rule has a strength over a certain threshold).</Paragraph> <Paragraph position="2"> The key element where most instances of this algorithm vary is the select procedure, which decides which rules are acquired in each iteration. Although several selection strategies have been previously proposed for various NLP problems, to our knowledge no existing study performs an empirical analysis of such strategies in the context of the acquisition of IE patterns. For this reason, we implement several selection methods in our system (described in Section 2.4) and evaluate their performance in Section 5.</Paragraph> <Paragraph position="3"> The label of each collection document is given by the strength of its patterns. Similarly to (Collins and Singer, 1999; Yarowsky, 1995), we define the strength of a pattern p in a category y as the precision of p in the set of documents labeled with category y, estimated using Laplace smoothing:
strength(p, y) = (count(p, y) + ε) / (count(p) + kε)
where count(p, y) is the number of documents labeled y containing pattern p, count(p) is the overall number of labeled documents containing p, and k is the number of domains. For all experiments presented here we used ε = 1 (a code sketch of this computation, together with the Figure 2 loop, follows this subsection).</Paragraph> <Paragraph position="4"> Another point where acquisition algorithms differ is the initialization procedure: some start with a small number of hand-labeled documents (Riloff, 1996), as illustrated in Figure 2, while others start with a set of seed rules (Yangarber et al., 2000; Yangarber, 2003). However, these approaches are conceptually similar: the seed rules are simply used to generate the seed documents.</Paragraph> <Paragraph position="5"> This paper focuses on the framework introduced in Figure 2 for two reasons: (a) "cautious" algorithms were shown to perform best for several NLP problems (including the acquisition of IE patterns), and (b) it has nice theoretical properties: Abney (2004) showed that, regardless of the selection procedure, "sequential" bootstrapping algorithms converge to a local minimum of K, where K is an upper bound on the negative log likelihood of the data. Obviously, the quality of the local minimum discovered is highly dependent on the selection procedure, which is why we believe an evaluation of several pattern selection strategies is important.</Paragraph> </Section>
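A compact sketch of the Figure 2 loop together with the Laplace-smoothed strength defined above. This is illustrative only: the select procedure is passed in as a callable (Section 2.4 defines four alternatives), and the data layout and variable names are our assumptions.

    def strength(p, y, labels, doc_patterns, k, eps=1.0):
        # Laplace-smoothed precision of p in category y (eps = 1 in the paper).
        count_py = sum(1 for d, lab in labels.items()
                       if lab == y and p in doc_patterns[d])
        count_p = sum(1 for d in labels if p in doc_patterns[d])
        return (count_py + eps) / (count_p + k * eps)

    def acquire(seed_labels, doc_patterns, categories, select, max_iterations=500):
        labels = dict(seed_labels)   # document id -> category label
        rules = []                   # ordered list of (pattern, label) rules
        k = len(categories)
        for _ in range(max_iterations):
            new_rules = select(labels, doc_patterns, categories, rules)
            if not new_rules:        # termination: nothing left to acquire
                break
            rules.extend(new_rules)
            # Relabel every non-seed document that matches an acquired rule.
            for d, patterns in doc_patterns.items():
                if d in seed_labels:
                    continue
                matches = [(p, y) for p, y in rules if p in patterns]
                if matches:
                    labels[d] = max(
                        matches,
                        key=lambda py: strength(py[0], py[1], labels,
                                                doc_patterns, k),
                    )[1]
        return rules, labels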
<Section position="4" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 2.4 Selection Criteria </SectionTitle> <Paragraph position="0"> The pattern selection component, i.e. the select procedure of the algorithm in Figure 2, consists of the following: (a) for each category y, all patterns p are sorted in descending order of their scores in the current category, score(p, y), and (b) for each category the top k patterns are selected. For all experiments in this paper we have used k = 3. We provide four different implementations of the pattern scoring function score(p, y), according to four different selection criteria (summarized in the code sketch after this subsection).</Paragraph> <Paragraph position="1"> The first criterion was developed specifically for the pattern acquisition task (Riloff, 1996) and has been used in several other pattern acquisition systems (Yangarber et al., 2000; Yangarber, 2003; Stevenson and Greenwood, 2005). The intuition behind it is that a qualitative pattern is yielded by a compromise between pattern precision (which is a good indicator of relevance) and pattern frequency (which is a good indicator of coverage). Furthermore, the criterion considers only patterns that are positively correlated with the corresponding category, i.e. their precision is higher than 50%. The Riloff score of a pattern p in a category y is formalized as:
score(p, y) = prec(p, y) × log₂ count(p, y), if prec(p, y) > 0.5, and 0 otherwise,
where prec(p, y) is the raw precision of pattern p in the set of documents labeled with category y.</Paragraph> <Paragraph position="2"> The second criterion was used in a lightly-supervised NE recognizer (Collins and Singer, 1999). Unlike the previous criterion, which combines relevance and frequency in the same scoring function, Collins considers only patterns whose raw precision is over a hard threshold T and ranks them by their global coverage:
score(p, y) = count(p, y), if prec(p, y) > T, and 0 otherwise.</Paragraph> <Paragraph position="3"> The χ² score measures the lack of independence between a pattern p and a category y. It is computed using a two-way contingency table of p and y, where a is the number of times p and y co-occur, b is the number of times p occurs without y, c is the number of times y occurs without p, and d is the number of times neither p nor y occur. The number of documents in the collection is n = a + b + c + d. Similarly to the first criterion, we consider only patterns positively correlated with the corresponding category:
score(p, y) = n(ad − bc)² / ((a + b)(a + c)(b + d)(c + d)), if p and y are positively correlated (ad > bc), and 0 otherwise.
The χ² statistic was previously reported to be the best feature selection strategy for text categorization (Yang and Pedersen, 1997).</Paragraph> <Paragraph position="4"> Mutual information (MI) is a well-known information theory criterion that measures the degree of dependence between two variables, in our case a pattern p and a category y (Yang and Pedersen, 1997). Using the same contingency table introduced above, the MI criterion is estimated as:
MI(p, y) = log (P(p ∧ y) / (P(p) P(y))) ≈ log (a × n / ((a + c)(a + b))).</Paragraph> </Section> </Section>
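The four criteria can be written down compactly from the contingency counts a, b, c, d defined above. A sketch under those definitions; the value T = 0.95 is our placeholder for the unstated hard threshold, not a number from this paper.

    import math

    def prec(a, b):
        # Raw precision of p in category y: count(p, y) / count(p).
        return a / (a + b) if a + b else 0.0

    def riloff_score(a, b):
        # RlogF: precision times log2 frequency, positively correlated only.
        p = prec(a, b)
        return p * math.log2(a) if p > 0.5 and a > 0 else 0.0

    def collins_score(a, b, T=0.95):  # T is a hypothetical threshold value
        # Hard precision filter, then rank by global coverage.
        return a if prec(a, b) > T else 0.0

    def chi2_score(a, b, c, d):
        if a * d <= b * c:  # keep only positively correlated patterns
            return 0.0
        n = a + b + c + d
        denom = (a + b) * (a + c) * (b + d) * (c + d)
        return n * (a * d - b * c) ** 2 / denom if denom else 0.0

    def mi_score(a, b, c, d):
        # log( P(p, y) / (P(p) P(y)) ), estimated from the table.
        n = a + b + c + d
        return math.log(a * n / ((a + c) * (a + b))) if a else float("-inf")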
<Section position="4" start_page="50" end_page="51" type="metho"> <SectionTitle> 3 The Data </SectionTitle> <Section position="1" start_page="50" end_page="51" type="sub_section"> <SectionTitle> 3.1 The Document Collections </SectionTitle> <Paragraph position="0"> For all experiments reported in this paper we used the following three document collections: (a) the AP collection is the Associated Press (year 1999) subset of the AQUAINT collection (LDC catalog number LDC2002T31); (b) the LATIMES collection is the Los Angeles Times subset of the TREC collection; and (c) the REUTERS collection is the Reuters-21578 document collection. Similarly to previous work, for the REUTERS collection we used the ModApte split and selected the ten most frequent categories (Nigam et al., 2000). Due to memory limitations on our test machines, we reduced the size of the AP and LATIMES collections to their first 5,000 documents (the complete collections contain over 100,000 documents).</Paragraph> <Paragraph position="1"> The collection words were pre-processed as follows: (i) stop words and numbers were discarded; (ii) all words were converted to lower case; and (iii) terms that appear in a single document were removed. Table 1 lists the collection characteristics after pre-processing.</Paragraph> </Section> <Section position="2" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 3.2 Pattern Generation </SectionTitle> <Paragraph position="0"> In order to extract the set of patterns available in a document, each collection document undergoes the following processing steps: (a) we recognize and classify named entities (we identify six categories: persons, locations, organizations, other names, and temporal and numerical expressions), and (b) we generate full parse trees of all document sentences using a probabilistic context-free parser.</Paragraph> <Paragraph position="1"> Following the above processing steps, we extract Subject-Verb-Object (SVO) tuples using a series of heuristics, e.g.: (a) nouns preceding active verbs are subjects, (b) nouns directly attached to a verb phrase are objects, and (c) nouns attached to the verb phrase through a prepositional attachment are indirect objects. Each tuple element is replaced with either its head word, if its head word is not included in a NE, or with the NE category otherwise. For indirect objects we additionally store the accompanying preposition. Lastly, each tuple containing more than two elements is generalized by maintaining only subsets of two and three of its elements and replacing the others with a wildcard (a short sketch of this generalization step follows at the end of this section).</Paragraph> <Paragraph position="2"> Table 2 lists the patterns extracted from one sample sentence. [Table 2: patterns extracted from the sample text "The Minnesota Vikings beat the Arizona Cardinals in yesterday's game."; in the pattern notation, s stands for subject, v for verb, o for object, and io for indirect object; the pattern list itself did not survive extraction.] As Table 2 hints, the system generates a large number of candidate patterns. It is the task of the pattern acquisition algorithm to extract only the relevant ones from this complex search space.</Paragraph> </Section> </Section>
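A minimal sketch of the tuple generalization described in Section 3.2. The (role, filler) representation and the "*" wildcard symbol are our illustration choices, not notation from the paper.

    from itertools import combinations

    def generalize(tuple_elems):
        # tuple_elems: ordered (role, filler) pairs, e.g.
        # [("s", "ORG"), ("v", "beat"), ("o", "ORG"), ("io", "in game")]
        n = len(tuple_elems)
        if n <= 2:
            return {tuple(tuple_elems)}  # nothing to generalize
        patterns = set()
        for size in (2, 3):
            if size > n:
                continue
            # Keep each subset of `size` elements; wildcard the rest.
            for keep in combinations(range(n), size):
                patterns.add(tuple(
                    elem if i in keep else (elem[0], "*")
                    for i, elem in enumerate(tuple_elems)
                ))
        return patterns

For a four-element tuple this yields ten candidate patterns (six with two wildcards, four with one), which matches the observation that the generator produces a large candidate space.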
<Section position="5" start_page="51" end_page="52" type="metho"> <SectionTitle> 4 The Evaluation Procedures </SectionTitle> <Section position="1" start_page="51" end_page="52" type="sub_section"> <SectionTitle> 4.1 The Indirect Evaluation Procedure </SectionTitle> <Paragraph position="0"> The goal of our evaluation procedure is to measure the quality of the acquired patterns. Intuitively, the learned patterns should have high coverage and low ambiguity. We indirectly measure the quality of the acquired patterns using a text categorization strategy: we feed the acquired rules to a decision-list classifier, which is then used to classify a new set of documents. The classifier assigns to each document the category label given by the first rule whose pattern matches. Since we expect higher-quality patterns to appear higher in the rule list, the decision-list classifier never changes the category of an already-labeled document.</Paragraph> <Paragraph position="1"> The quality of the generated classification is measured using micro-averaged precision and recall:
P = Σ_{i=1}^{q} TP_i / Σ_{i=1}^{q} (TP_i + FP_i), R = Σ_{i=1}^{q} TP_i / Σ_{i=1}^{q} (TP_i + FN_i),
where q is the number of categories in the document collection, and TP_i, FP_i, and FN_i are the numbers of true positives, false positives, and false negatives in category i.</Paragraph> <Paragraph position="2"> For all experiments and all collections, with the exception of REUTERS, which has a standard document split for training and testing, we used 5-fold cross validation: we randomly partitioned the collections into 5 sets of equal sizes, and reserved a different one for testing in each fold.</Paragraph> <Paragraph position="3"> We have chosen this evaluation strategy because this indirect approach was shown to correlate well with a direct evaluation, where the learned patterns were used to customize an IE system (Yangarber et al., 2000). For this reason, much of the following work on pattern acquisition has used this approach as a de facto evaluation standard (Yangarber, 2003; Stevenson and Greenwood, 2005). Furthermore, given the high number of domains and patterns (we evaluate on 25 domains), an evaluation by human experts is extremely costly. Nevertheless, to show that the proposed indirect evaluation correlates well with a direct evaluation, two human experts have evaluated the patterns in several domains. The direct evaluation procedure is described next.</Paragraph> </Section> <Section position="2" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 4.2 The Direct Evaluation Procedure </SectionTitle> <Paragraph position="0"> The task of manually deciding whether an acquired pattern is relevant for a given domain is not trivial, mainly due to the ambiguity of the patterns. Thus, this process should be carried out by more than one expert, so that the relevance of the ambiguous patterns can be agreed upon. For example, the patterns s(ORG) v(score) o(goal) and s(PER) v(lead) io(with point) are clearly relevant only for the sports domain, whereas the patterns v(sign) io(as agent) and o(title) io(in DATE) might be regarded as relevant for other domains as well.</Paragraph> <Paragraph position="1"> The specific procedure used to manually evaluate the patterns is the following: (1) two experts separately evaluate the acquired patterns for the considered domains and collections; and (2) the results of both evaluations are compared. For any disagreement, we have opted for a strict evaluation: all the occurrences of the corresponding pattern are looked up in the collection and, whenever at least one pattern occurrence belongs to a document assigned to a different domain than the domain in question, the pattern is considered not relevant.</Paragraph> <Paragraph position="2"> Both the ambiguity and the high number of the extracted patterns have prevented us from performing an exhaustive direct evaluation. For this reason, only the top (most relevant) 100 patterns have been evaluated, for one domain per collection. The results are detailed in Section 5.2.</Paragraph> </Section> </Section> </Paper>
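A minimal sketch of the micro-averaged precision and recall from Section 4.1, assuming per-category lists of true positive, false positive, and false negative counts (the function name and interface are ours):

    def micro_pr(tp, fp, fn):
        # tp, fp, fn: per-category counts, indexed by category (length q).
        precision = sum(tp) / (sum(tp) + sum(fp))
        recall = sum(tp) / (sum(tp) + sum(fn))
        return precision, recall

For example, micro_pr([5, 3], [1, 0], [2, 1]) returns (8/9, 8/11): the counts are pooled across categories before the ratios are taken, which is what distinguishes micro- from macro-averaging.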