<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2075"> <Title>Integrating Pattern-based and Distributional Similarity Methods for Lexical Entailment Acquisition</Title> <Section position="5" start_page="91904" end_page="91904" type="metho"> <SectionTitle> 3 An Integrated Approach for Lexical Entailment Acquisition </SectionTitle> <Paragraph position="0"> This section describes our integrated approach for acquiring lexical entailment relationships, applied to common nouns. The algorithm receives a target term as input and aims to acquire a set of terms that either entail or are entailed by it. We denote a pair consisting of the input target term and an acquired entailing/entailed term as an entailment pair. Entailment pairs are directional, as in bank → company.</Paragraph> <Paragraph position="1"> Our approach applies a supervised learning scheme, using SVM, to classify candidate entailment pairs as correct or incorrect. The SVM training phase is applied to a small constant number of training pairs, yielding a classification model that is then used to classify new test entailment pairs. The designated training set is also used to tune some additional parameters of the method. 
Overall, the method consists of the following main components: 1: Acquiring candidate entailment pairs for the input term by pattern-based and distributional similarity methods (Section 3.2); 2: Constructing a feature set for all candidates based on pattern-based and distributional information (Section 3.3); 3: Applying SVM training and classification to the candidate pairs (Section 3.4).</Paragraph> <Paragraph position="2"> The first two components, candidate pair acquisition and feature collection, utilize a generic module for pattern-based extraction from the web, which is described first in Section 3.1.</Paragraph> <Section position="1" start_page="91904" end_page="91904" type="sub_section"> <SectionTitle> 3.1 Pattern-based Extraction Module </SectionTitle> <Paragraph position="0"> The general pattern-based extraction module receives as input a set of lexical-syntactic patterns (as in Table 1) and either a target term or a candidate pair of terms. It then searches the web for occurrences of the patterns with the input term(s). A small set of effective queries is created for each pattern-terms combination, aiming to retrieve as much relevant data with as few queries as possible.</Paragraph> <Paragraph position="1"> Each pattern has two variable slots to be instantiated by candidate terms for the sought relation. (The patterns, listed in Table 1, are based on (Hearst, 1992) and (Pantel et al., 2004); capitalized terms indicate variables, and pl and sg stand for plural and singular forms.)</Paragraph> <Paragraph position="2"> Accordingly, the extraction module can be used in two modes: (a) receiving a single target term as input and searching for instantiations of the other variable to identify candidate related terms (as in Section 3.2); (b) receiving a candidate pair of terms for the relation and searching pattern instances with both terms, in order to validate and collect information about the relationship between the terms (as in Section 3.3). 
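The two querying modes can be illustrated with a minimal sketch. The helper names and the query template below are hypothetical, not the authors' implementation; the real module generates a small set of wildcard variants per pattern, as described above.

```python
# Sketch of composing wildcard web queries for a pattern such as
# "such NP1 as NP2", in the two modes: (a) one slot instantiated by the
# target term, (b) both slots instantiated by a candidate pair.

def term_disjunction(term):
    """OR-disjunction of singular and plural surface forms (naive +s plural)."""
    return f"({term} OR {term}s)"

def single_term_queries(pattern, target, slot):
    """Mode (a): the target fills one slot; the other slot stays as
    wildcards so new candidate terms can match it."""
    queries = []
    for wildcards in ("*", "**", "***"):
        filled = dict(NP1=wildcards, NP2=wildcards)
        filled[slot] = term_disjunction(target)
        queries.append(pattern.format(**filled))
    return queries

def pair_queries(pattern, entailed, entailing):
    """Mode (b): both slots instantiated, to validate a candidate pair."""
    return [pattern.format(NP1=term_disjunction(entailed),
                           NP2=term_disjunction(entailing))]

pattern = '"such {NP1} as {NP2}"'
print(single_term_queries(pattern, "war", "NP2")[1])
# '"such ** as (war OR wars)"'
print(pair_queries(pattern, "struggle", "war")[0])
# '"such (struggle OR struggles) as (war OR wars)"'
```

In practice the module also varies wildcard placement around the instantiated terms, so that modifiers and multi-word terms can intervene.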
Google proximity search1 provides a useful tool for these purposes, as it allows using a wildcard which may match either an un-instantiated term or optional words such as modifiers. For example, the query &quot;such ** as *** (war OR wars)&quot; is one of the queries created for the input pattern such NP1 as NP2 and the input target term war, allowing new terms to match the first pattern variable. For the candidate entailment pair war - struggle, the first variable is instantiated as well. The corresponding query would be: &quot;such * (struggle OR struggles) as *** (war OR wars)&quot;. This technique allows matching terms that are sub-parts of more complex noun phrases, as well as multi-word terms.</Paragraph> <Paragraph position="3"> The automatically constructed queries, covering the possible combinations of multiple wildcards, are submitted to Google2, and a specified number of snippets is downloaded while avoiding duplicates. The snippets are passed through a word splitter and a sentence segmenter3, filtering out individual sentences that do not contain all search terms. Next, the sentences are processed with the OpenNLP4 POS tagger and NP chunker. Finally, pattern-specific regular expressions over the chunked sentences are applied to verify that the instantiated pattern indeed occurs in the sentence, and to identify variable instantiations. On average, this method extracted more than 3300 relationship instances for every 1MB of downloaded text, almost a third of which contained multi-word terms.</Paragraph> </Section> <Section position="2" start_page="91904" end_page="91904" type="sub_section"> <SectionTitle> 3.2 Candidate Acquisition </SectionTitle> <Paragraph position="0"> Given an input target term, we first employ pattern-based extraction to acquire entailment pair candidates and then augment the candidate set with pairs obtained through distributional similarity. 
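The overall acquisition flow for one target term can be sketched as follows. All names here are illustrative placeholders; the actual directional extraction and the distributional thresholding are detailed in Sections 3.2.1 and 3.2.2.

```python
# Illustrative sketch of unifying candidates for one target term:
# directional pairs from pattern queries, plus non-directional
# distributional candidates expanded into both directions.

def acquire_candidates(target, pattern_candidates, distributional_candidates):
    """pattern_candidates: dict with sets of terms found to entail the
    target ("entails") or be entailed by it ("entailed_by").
    distributional_candidates: similar words above the tuned threshold."""
    pairs = set()
    for other in pattern_candidates["entails"]:
        pairs.add((other, target))          # other entails target
    for other in pattern_candidates["entailed_by"]:
        pairs.add((target, other))          # target entails other
    # Distributional candidates are non-directional: add both directions,
    # unless the pair was already proposed by the pattern-based step.
    for other in distributional_candidates:
        if (other, target) not in pairs and (target, other) not in pairs:
            pairs.add((other, target))
            pairs.add((target, other))
    return pairs

pairs = acquire_candidates(
    "company",
    {"entails": {"bank"}, "entailed_by": {"organization"}},
    {"firm"},
)
```

This mirrors the design choice above: pattern-based evidence fixes a direction, while each remaining distributional candidate contributes two directional pairs to be judged by the classifier.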
At the candidate acquisition phase, pattern instances are searched with one input target term, looking for instantiations of the other pattern variable to become the candidate related term (the first querying mode described in Section 3.1). We construct two types of queries, in which the target term is either the first or the second variable in the pattern, corresponding to finding either entailing or entailed terms that instantiate the other variable.</Paragraph> <Paragraph position="1"> In the candidate acquisition phase we utilized patterns 1-8 in Table 1, which we empirically found to be most suitable for identifying directional lexical entailment pairs. Patterns 9-11 are not used at this stage, as they produce too much noise when searched with only one instantiated variable. About 35 queries are created for each target term, in each entailment direction, for each of the 8 patterns. For every query, the first n snippets are downloaded (we used n=50). The downloaded snippets are processed as described in Section 3.1, and candidate related terms are extracted, yielding candidate entailment pairs with the input target term.</Paragraph> <Paragraph position="2"> Quite often the entailment relation holds between multi-word noun phrases rather than merely between their heads. For example, trade center lexically entails shopping complex, while center does not necessarily entail complex. On the other hand, many complex multi-word noun phrases are too rare to make a statistically based decision about their relation with other terms.</Paragraph> <Paragraph position="3"> Hence, we apply the following two criteria to balance these constraints: 1. For the entailing term we extract only the complete noun chunk which instantiates the pattern. For example, we extract housing project - complex, but do not extract project as entailing complex, since the head noun alone is often too general to entail the other term.</Paragraph> <Paragraph position="4"> 2. 
For the entailed term we extract both the complete noun phrase and its head, in order to create two separate candidate entailment pairs with the entailing term, which are eventually judged according to their overall statistics.</Paragraph> <Paragraph position="5"> As it turns out, a large portion of the extracted pairs constitutes trivial hyponymy relations, where one term is a modified version of the other, like low interest loan - loan. These pairs were removed, along with numerous pairs including proper nouns, following the goal of learning entailment relationships for distinct common nouns.</Paragraph> <Paragraph position="6"> Finally, we filter out the candidate pairs whose frequency in the extracted patterns is less than a threshold, which was set empirically to 3. Using a lower threshold yielded poor precision, while a threshold of 4 decreased recall substantially with little effect on precision.</Paragraph> </Section> <Section position="3" start_page="91904" end_page="91904" type="sub_section"> <SectionTitle> 3.2.2 Distributional Similarity Candidates </SectionTitle> <Paragraph position="0"> As mentioned in Section 2, we employ the distributional similarity measure of Geffet and Dagan (2004) (denoted here GD04 for brevity), which was found effective for extracting non-directional lexical entailment pairs. Using local corpus statistics, this algorithm produces for each target noun a scored list of up to a few hundred words with positive distributional similarity scores.</Paragraph> <Paragraph position="1"> Next we need to determine an optimal threshold for the similarity score, considering words above it as likely entailment candidates. To tune this threshold we followed the original methodology used to evaluate GD04. First, the top-k (k=40) similarities of each training term are manually annotated by the lexical entailment criterion (see Section 4.1). 
Then, the similarity value which yields the maximal micro-averaged F1 score is selected as the threshold, suggesting an optimal recall-precision tradeoff. The selected threshold is then used to filter the candidate similarity lists of the test words.</Paragraph> <Paragraph position="2"> Finally, we remove all entailment pairs that already appear in the candidate set of the pattern-based approach, in either direction (recall that the distributional candidates are non-directional).</Paragraph> <Paragraph position="3"> Each of the remaining candidates generates two directional pairs, which are added to the unified candidate set of the two approaches.</Paragraph> </Section> <Section position="4" start_page="91904" end_page="91904" type="sub_section"> <SectionTitle> 3.3 Feature Construction </SectionTitle> <Paragraph position="0"> Next, each candidate is represented by a set of features suitable for supervised classification. To this end we developed a novel feature set based on both pattern-based and distributional data.</Paragraph> <Paragraph position="1"> To obtain pattern statistics for each pair, the second mode of the pattern-based extraction module is applied (see Section 3.1). Since in this case both variables in the pattern are instantiated by the terms of the pair, we could use all eleven patterns in Table 1, creating a total of about 55 queries per pair and downloading m=20 snippets for each query. The downloaded snippets are processed as described in Section 3.1 to identify pattern matches and obtain relevant statistics for feature scores.</Paragraph> <Paragraph position="2"> Following is the list of feature types computed for each candidate pair. The feature set was designed specifically to capture the complementary information of the two methods.</Paragraph> <Paragraph position="3"> Conditional Pattern Probability: This type of feature is created for each of the 11 individual patterns. 
The feature value is the estimated conditional probability of the pattern being matched in a sentence, given that the pair of terms appears in the sentence (calculated as the fraction of pattern matches for the pair among all unique sentences that contain the pair). This feature yields normalized scores for pattern matches regardless of the number of snippets retrieved for the given pair. This normalization is important in order to put candidate pairs identified through either the pattern-based or the distributional approach on equal footing, since the latter tend to occur less frequently in patterns.</Paragraph> <Paragraph position="4"> Aggregated Conditional Pattern Probability: This single feature is the conditional probability that any of the patterns match in a retrieved sentence, given that the two terms appear in it. It is calculated like the previous feature, with counts aggregated over all patterns, and aims to capture the overall appearance of the pair in patterns, regardless of the specific pattern.</Paragraph> <Paragraph position="5"> Conditional List-Pattern Probability: This feature was designed to eliminate the typical non-entailing cases of co-hyponyms (words sharing the same hypernym), which nevertheless tend to co-occur in entailment patterns. We therefore also check for pairs' occurrences in lists, using appropriate list patterns, expecting that correct entailment pairs would not co-occur in lists. The probability estimate, calculated like the previous one, is expected to serve as a negative feature for the learning model.</Paragraph> <Paragraph position="6"> Relation Direction Ratio: The value of this feature is the ratio between the overall number of pattern matches for the pair and the number of pattern matches for the reversed pair (a pair created with the same terms in the opposite entailment direction). 
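Two of these feature computations can be sketched directly from their definitions. The counts below are hypothetical inputs, and the +1 smoothing in the direction ratio is an assumption of this sketch, not stated in the text.

```python
# Sketch of the conditional pattern probability and the relation
# direction ratio, computed from per-pair counts collected over
# the retrieved, regex-verified sentences.

def conditional_pattern_prob(pattern_matches, pair_sentences):
    """P(pattern matched | both terms in sentence): fraction of pattern
    matches among all unique sentences containing the pair."""
    if pair_sentences == 0:
        return 0.0
    return pattern_matches / pair_sentences

def direction_ratio(forward_matches, reversed_matches):
    """Ratio of pattern matches for the pair vs. the reversed pair;
    +1 smoothing (an assumption here) avoids division by zero."""
    return (forward_matches + 1) / (reversed_matches + 1)

# Hypothetical counts for one candidate pair:
features = {
    "cond_prob_such_as": conditional_pattern_prob(12, 40),   # 0.3
    "direction_ratio": direction_ratio(12, 3),               # 3.25
}
```

A direction ratio well above 1 indicates that the pair appears in the patterns far more often in the proposed direction than in the reverse one.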
We found that this feature strongly correlates with entailment likelihood.</Paragraph> <Paragraph position="7"> Interestingly, it does not deteriorate performance for synonymous pairs.</Paragraph> <Paragraph position="8"> Distributional Similarity Score: The GD04 similarity score of the pair was used as a feature. We also attempted adding Lin's (1998) similarity scores, but they appeared to be redundant.</Paragraph> <Paragraph position="9"> Intersection Feature: A binary feature indicating candidate pairs acquired by both methods, which was found to indicate higher entailment likelihood. In summary, the above feature types utilize mutually complementary pattern-based and distributional information. Using cross-validation over the training set we verified that each feature makes a marginal contribution to performance when added on top of the remaining features.</Paragraph> </Section> <Section position="5" start_page="91904" end_page="91904" type="sub_section"> <SectionTitle> 3.4 Training and Classification </SectionTitle> <Paragraph position="0"> In order to systematically integrate the different feature types we used the state-of-the-art supervised classifier SVMlight (Joachims, 1999) for entailment pair classification. Using 10-fold cross-validation over the training set we obtained the SVM configuration that yields an optimal micro-averaged F1 score. Through this optimization we chose the RBF kernel function and obtained optimal values for the J, C and the RBF Gamma parameters. 
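The tuning step can be sketched with scikit-learn's SVC standing in for SVMlight; a class-weight ratio here roughly plays the role of SVMlight's J (cost-factor) parameter. The data and parameter grids are illustrative, not the paper's settings.

```python
# Sketch: RBF-kernel SVM with C, gamma, and a class-weight ratio tuned
# by 10-fold cross-validation on micro-averaged F1 (toy data).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))                # 6 features per candidate pair
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # toy correct/incorrect labels

param_grid = {
    "C": [0.1, 1, 10],
    "gamma": [0.01, 0.1, 1],
    "class_weight": [None, {1: 2}],         # rough analogue of SVMlight's J
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10, scoring="f1_micro")
search.fit(X, y)
model = search.best_estimator_              # used to classify test pairs
```

The fitted model is then applied to the feature vectors of the test candidate pairs, and the pairs it labels positive form the method's output.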
The candidate test pairs classified as correct entailments constitute the output of our integrated method.</Paragraph> </Section> </Section> <Section position="6" start_page="91904" end_page="91904" type="metho"> <SectionTitle> 4 Empirical Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="91904" end_page="91904" type="sub_section"> <SectionTitle> 4.1 Data Set and Annotation </SectionTitle> <Paragraph position="0"> We utilized the experimental data set from Geffet and Dagan (2004). The dataset includes the similarity lists calculated by GD04 for a sample of 30 target (common) nouns, computed from an 18 million word subset of the Reuters corpus5. We randomly picked a small set of 10 terms for training, leaving the remaining 20 terms for testing.</Paragraph> <Paragraph position="1"> Then, the set of entailment pair candidates for all nouns was created by applying the filtering method of Section 3.2.2 to the distributional similarity lists, and by extracting pattern-based candidates from the web as described in Section 3.2.1.</Paragraph> </Section> </Section> <Section position="7" start_page="91904" end_page="91904" type="metho"> <SectionTitle> 5 Reuters Corpus, Volume 1, English Language, 1996-08-20 to 1997-08-19. </SectionTitle> <Paragraph position="0"> Gold standard annotations for entailment pairs were created by three judges. The judges were guided to annotate as &quot;Correct&quot; the pairs conforming to the lexical entailment definition, which was reflected in two operational tests: i) Word meaning entailment: whether the meaning of the first (entailing) term implies the meaning of the second (entailed) term under some common sense of the two terms; and ii) Substitutability: whether the first term can substitute for the second term in some natural contexts, such that the meaning of the modified context entails the meaning of the original one. 
The obtained Kappa values (varying between 0.7 and 0.8) correspond to substantial agreement on the task.</Paragraph> </Section> </Paper>