<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1006"> <Title>Learning to recognize features of valid textual entailments</Title> <Section position="4" start_page="0" end_page="41" type="metho"> <SectionTitle> 2 Approaching a robust semantics </SectionTitle> <Paragraph position="0"> In this section we give a unifying overview of current work on robust textual inference, present fundamental limitations of current methods, and then outline our approach to resolving them. Nearly all current textual inference systems use a single-stage matching/proof process, and differ mainly in the sophistication of the matching stage.</Paragraph> </Section> <Section position="5" start_page="41" end_page="43" type="metho"> <SectionTitle> ID | Text | Hypothesis | Entailed </SectionTitle> <Paragraph position="0"> 59 | Two Turkish engineers and an Afghan translator kidnapped in December were freed Friday. | translator kidnapped in Iraq | no</Paragraph> <Paragraph position="1"> 98 | Sharon warns Arafat could be targeted for assassination. | prime minister targeted for assassination | no</Paragraph> <Paragraph position="2"> 152 | Twenty-five of the dead were members of the law enforcement agencies and the rest of the 67 were civilians. | 25 of the dead were civilians. | no</Paragraph> <Paragraph position="3"> 231 | The memorandum noted the United Nations estimated that 2.5 million to 3.5 million people died of AIDS last year. | Over 2 million people died of AIDS last year. | yes</Paragraph> <Paragraph position="4"> 971 | Mitsubishi Motors Corp.'s new vehicle sales in the US fell 46 percent in June. | Mitsubishi sales rose 46 percent. | no</Paragraph> <Paragraph position="5"> 1806 | Vanunu, 49, was abducted by Israeli agents and convicted of treason in 1986 after discussing his work as a mid-level Dimona technician with Britain's Sunday Times newspaper. | Vanunu's disclosures in 1968 led experts to conclude that Israel has a stockpile of nuclear warheads. | no</Paragraph> <Paragraph position="6"> 2081 | The main race track in Qatar is located in Shahaniya, on the Dukhan Road. | Qatar is located in Shahaniya. | no</Paragraph> <Paragraph position="7"> Though most problems shown have answer no, the data set is actually balanced between yes and no.</Paragraph> <Paragraph position="8"> The simplest approach is to base the entailment prediction on the degree of semantic overlap between the text and hypothesis, using models based on bags of words, bags of n-grams, TF-IDF scores, or something similar (Jijkoun and de Rijke, 2005). Such models have serious limitations: semantic overlap is typically a symmetric relation, whereas entailment is clearly not, and, because overlap models do not account for syntactic or semantic structure, they are easily fooled by examples like ID 2081.</Paragraph> <Paragraph position="9"> A more structured approach is to formulate the entailment prediction as a graph matching problem (Haghighi et al., 2005; de Salvo Braz et al., 2005). In this formulation, sentences are represented as normalized syntactic dependency graphs (like the one shown in figure 1), and entailment is approximated with an alignment between the graph representing the hypothesis and a portion of the corresponding graph(s) representing the text. Each possible alignment of the graphs has an associated score, and the score of the best alignment is used as an approximation to the strength of the entailment: a better-aligned hypothesis is assumed to be more likely to be entailed. To enable incremental search, alignment scores are usually factored as a combination of local terms, corresponding to the nodes and edges of the two graphs.</Paragraph>
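To make the factored form concrete, the following is a minimal sketch (ours, not the authors' released code) of a locally decomposable alignment score; node_score and edge_score are assumed stubs standing in for the lexical-similarity and path-shape scorers discussed later in the paper.

```python
# Minimal sketch of a locally decomposable alignment score: the total is
# a sum of per-node and per-edge terms, so an incremental search can
# score partial alignments cheaply. node_score and edge_score are
# assumed stubs for the lexical-similarity and path-shape scorers.

def score_alignment(alignment, hyp_edges, node_score, edge_score):
    """alignment: dict mapping each hypothesis node to a text node or None.
    hyp_edges: iterable of (src, dst, relation) hypothesis edges."""
    total = 0.0
    for h_node, t_node in alignment.items():   # node terms
        total += node_score(h_node, t_node)
    for h_src, h_dst, rel in hyp_edges:        # edge terms
        total += edge_score(alignment.get(h_src), alignment.get(h_dst), rel)
    return total
```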
<Paragraph position="10"> Unfortunately, even with factored scores, the problem of finding the best alignment of two graphs is NP-complete, so exact computation is intractable. Authors have therefore proposed a variety of approximate search techniques. Haghighi et al. (2005) divide the search into two steps: in the first step they consider node scores only, which relaxes the problem to a weighted bipartite graph matching that can be solved in polynomial time; in the second step they add the edge scores and hill-climb the alignment via an approximate local search.</Paragraph> <Paragraph position="11"> A third approach, exemplified by Moldovan et al. (2003) and Raina et al. (2005), is to translate dependency parses into neo-Davidsonian-style quasi-logical forms, and to perform weighted abductive theorem proving in the tradition of Hobbs et al. (1988). Unless supplemented with a knowledge base, this approach is actually isomorphic to the graph matching approach. For example, the graph in figure 1 might generate the quasi-LF rose(e1), nsubj(e1, x1), sales(x1), nn(x1, x2), Mitsubishi(x2), dobj(e1, x3), percent(x3), num(x3, x4), 46(x4). There is a term corresponding to each node and arc, and the resolution steps at the core of weighted abductive theorem proving consider matching an individual node of the hypothesis (e.g. rose(e1)) with something from the text (e.g. fell(e1)), just as in the graph-matching approach. The two models become distinct when there is a good supply of additional linguistic and world knowledge axioms, as in Moldovan et al. (2003) but not Raina et al. (2005). Then the theorem prover may generate intermediate forms in the proof, but, nevertheless, individual terms are resolved locally without reference to global context.</Paragraph> <Paragraph position="12"> Finally, a few efforts (Akhmatova, 2005; Fowler et al., 2005; Bos and Markert, 2005) have tried to translate sentences into formulas of first-order logic, in order to test logical entailment with a theorem prover. While in principle this approach does not suffer from the limitations we describe below, in practice it has not borne much fruit. Because few problem sentences can be accurately translated to logical form, and because logical entailment is a strict standard, recall tends to be poor.</Paragraph> <Paragraph position="13"> The simple graph matching formulation of the problem belies three important issues. First, the above systems assume a form of upward monotonicity: if a good match is found with a part of the text, other material in the text is assumed not to affect the validity of the match. But many situations lack this upward monotone character. Consider variants on ID 98. Suppose the hypothesis were Arafat targeted for assassination. This would allow a perfect graph match or zero-cost weighted abductive proof, because the hypothesis is a subgraph of the text. However, this conclusion would be incorrect, because it ignores the modal operator could. Information that changes the validity of a proof can also exist outside a matching clause. Consider the alternate text Sharon denies Arafat is targeted for assassination.[1]</Paragraph> <Paragraph position="14"> [1] This is the same problem labeled and addressed as context in Tatu and Moldovan (2005).</Paragraph> <Paragraph position="15"> The second issue is the assumption of locality. Locality is needed to allow practical search, but many entailment decisions rely on global features of the alignment, and thus do not naturally factor by nodes and edges. To take just one example, dropping a restrictive modifier preserves entailment in a positive context, but not in a negative one: Dogs barked loudly entails Dogs barked, but No dogs barked loudly does not entail No dogs barked.</Paragraph>
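To see why such a check resists factorization, consider a small sketch (a hypothetical illustration with invented names, not the authors' code): the safety of dropping a modifier depends on the polarity of the whole clause, a global property of the aligned region rather than of any single node or edge.

```python
# Illustration of a check that does not factor into per-node or per-edge
# terms: whether dropping a restrictive modifier preserves truth depends
# on the polarity of the entire clause it is dropped from. All names
# here are hypothetical.

NEGATIVE_MARKERS = {"no", "not", "n't", "never", "few", "without"}

def clause_polarity(clause_lemmas):
    """Global property of a clause: negative if any marker occurs in it."""
    return "negative" if NEGATIVE_MARKERS & set(clause_lemmas) else "positive"

def modifier_drop_preserves_entailment(clause_lemmas):
    # "Dogs barked loudly" entails "Dogs barked" (positive clause: safe),
    # but "No dogs barked loudly" does not entail "No dogs barked".
    return clause_polarity(clause_lemmas) == "positive"

# modifier_drop_preserves_entailment(["dog", "bark", "loudly"])       -> True
# modifier_drop_preserves_entailment(["no", "dog", "bark", "loudly"]) -> False
```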
<Paragraph position="16"> These more global phenomena cannot be modeled with a factored alignment score.</Paragraph> <Paragraph position="17"> The last issue arising in the graph matching approaches is the inherent confounding of alignment and entailment determination. The way to show that one graph element does not follow from another is to make the cost of aligning them high. However, since we are embedded in a search for the lowest-cost alignment, a high cost will just cause the system to choose an alternate alignment rather than to recognize a non-entailment. In ID 152, we would like the hypothesis to align with the first part of the text, so as to be able to prove that civilians are not members of law enforcement agencies and conclude that the hypothesis does not follow from the text. But a graph-matching system will try to obtain non-entailment by making the matching cost between civilians and members of law enforcement agencies very high. The likely result is instead that the final part of the hypothesis will align with were civilians at the end of the text, assuming that we allow an alignment with "loose" arc correspondence.[2] Under this candidate alignment, the lexical alignments are perfect, and the only imperfection is that the subject arc of were is mismatched between the two graphs. A robust inference guesser will still likely conclude that there is entailment.</Paragraph> <Paragraph position="18"> [2] Robust systems need to allow matches with imperfect arc correspondence. For instance, given Bill went to Lyons to study French farming practices, we would like to be able to conclude that Bill studied French farming, despite the structural mismatch.</Paragraph> <Paragraph position="19"> We propose that all three problems can be resolved in a two-stage architecture, where the alignment phase is followed by a separate phase of entailment determination. Although developed independently, the same division between alignment and classification has also been proposed by Marsi and Krahmer (2005), whose system is developed and evaluated on parallel translations into Dutch. Their classification phase features an output space of five semantic relations, and performs well at distinguishing entailing sentence pairs.</Paragraph> <Paragraph position="20"> Finding aligned content can be done by any search procedure. Compared to previous work, we emphasize structural alignment and seek to ignore issues like polarity and quantity, which can be left to a subsequent entailment decision. For example, the scoring function is designed to encourage antonym matches and to ignore the negation of verb predicates. The ideas clearly generalize to evaluating several alignments, but we have so far worked with just the one-best alignment. Given a good alignment, the determination of entailment reduces to a simple classification decision. The classifier is built over features designed to recognize patterns of valid and invalid inference. Weights for the features can be hand-set, or chosen to minimize a relevant loss function on training data using standard techniques from machine learning. Because we already have a complete alignment, the classifier's decision can be conditioned on arbitrary global features of the aligned graphs, and it can detect failures of monotonicity.</Paragraph>
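To summarize the proposed architecture, here is a schematic sketch (our outline with hypothetical component names, not the released system): analysis and alignment come first, and entailment is a separate linear classification over global features of the aligned pair.

```python
# Schematic of the proposed two-stage architecture: alignment first,
# then entailment as a classification over features of the aligned
# pair. All component names here are hypothetical stand-ins.

def predict_entailment(text, hypothesis, parse, align, feature_fns, weights):
    t_graph = parse(text)                 # linguistic analysis
    h_graph = parse(hypothesis)
    alignment = align(h_graph, t_graph)   # stage 1: one-best alignment
    # Stage 2: features may inspect the whole aligned pair, so polarity,
    # modality, and other monotonicity-breaking context can be consulted.
    feats = [f(h_graph, t_graph, alignment) for f in feature_fns]
    score = sum(w * x for w, x in zip(weights, feats))
    return score > 0.0                    # linear decision boundary
```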
</Section> <Section position="6" start_page="43" end_page="44" type="metho"> <SectionTitle> 3 System </SectionTitle> <Paragraph position="0"> Our system has three stages: linguistic analysis, alignment, and entailment determination.</Paragraph> <Section position="1" start_page="43" end_page="43" type="sub_section"> <SectionTitle> 3.1 Linguistic analysis </SectionTitle> <Paragraph position="0"> Our goal in this stage is to compute linguistic representations of the text and hypothesis that contain as much information as possible about their semantic content. We use typed dependency graphs, which contain a node for each word and labeled edges representing the grammatical relations between words. Figure 1 gives the typed dependency graph for ID 971. This representation contains much of the information about words and the relations between them, and is relatively easy to compute from a syntactic parse. However, many semantic phenomena are not represented properly; particularly egregious is the inability to represent quantification and modality.</Paragraph> <Paragraph position="1"> We parse input sentences to phrase structure trees using the Stanford parser (Klein and Manning, 2003), a statistical syntactic parser trained on the Penn Treebank. To ensure correct parsing, we pre-process the sentences to collapse named entities into new dedicated tokens. Named entities are identified by a CRF-based NER system, similar to that described by McCallum and Li (2003). After parsing, contiguous collocations which appear in WordNet (Fellbaum, 1998) are identified and grouped. We convert the phrase structure trees to typed dependency graphs using a set of deterministic hand-coded rules (de Marneffe et al., 2006). In these rules, heads of constituents are first identified using a modified version of the Collins head rules that favors semantic heads (such as lexical verbs rather than auxiliaries), and dependents of heads are typed using tregex patterns (Levy and Andrew, 2006), an extension of the tgrep pattern language. The nodes in the final graph are then annotated with their associated word, part of speech (given by the parser), lemma (given by a finite-state transducer described by Minnen et al. (2001)), and named-entity tag.</Paragraph> </Section>
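As a concrete picture of what this stage produces, here is a minimal sketch of a typed dependency graph for the hypothesis of ID 971; the data structures and field names are our own simplification, not the system's API.

```python
# Minimal data structures for a typed dependency graph: one node per
# word (with POS, lemma, and NE tag) and labeled edges for grammatical
# relations. The example encodes part of Figure 1 (hypothesis of ID 971).
from dataclasses import dataclass, field

@dataclass
class Node:
    word: str
    pos: str
    lemma: str
    ne_tag: str = "O"

@dataclass
class DepGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (head_idx, dep_idx, relation)

# "Mitsubishi sales rose 46 percent."
hyp = DepGraph(
    nodes=[Node("Mitsubishi", "NNP", "Mitsubishi", "ORGANIZATION"),
           Node("sales", "NNS", "sale"),
           Node("rose", "VBD", "rise"),
           Node("percent", "NN", "percent"),
           Node("46", "CD", "46")],
    edges=[(2, 1, "nsubj"),   # rose -nsubj-> sales
           (1, 0, "nn"),      # sales -nn-> Mitsubishi
           (2, 3, "dobj"),    # rose -dobj-> percent
           (3, 4, "num")],    # percent -num-> 46
)
```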
<Section position="2" start_page="43" end_page="43" type="sub_section"> <SectionTitle> 3.2 Alignment </SectionTitle> <Paragraph position="0"> The purpose of the second phase is to find a good partial alignment between the typed dependency graphs representing the hypothesis and the text. An alignment consists of a mapping from each node (word) in the hypothesis graph to a single node in the text graph, or to null.[3] Figure 1 gives the alignment for ID 971.</Paragraph> <Paragraph position="1"> [3] The restriction to single-node targets is mitigated by the fact that many multiword expressions (e.g. named entities, noun compounds, multiword prepositions) have been collapsed into single nodes during linguistic analysis.</Paragraph> <Paragraph position="2"> The space of alignments is large: there are O((m + 1)^n) possible alignments for a hypothesis graph with n nodes and a text graph with m nodes, since each hypothesis node may map to any of the m text nodes or to null. We define a measure of alignment quality, and a procedure for identifying high-scoring alignments. We choose a locally decomposable scoring function, such that the score of an alignment is the sum of the local node and edge alignment scores. Unfortunately, there is no known polynomial-time algorithm for finding the exact best alignment. Instead we use an incremental beam search, combined with a node-ordering heuristic, to do approximate global search in the space of possible alignments; a simplified sketch appears at the end of this section. We have experimented with several alternative search techniques, and found that solution quality is not very sensitive to the specific search procedure used.</Paragraph> <Paragraph position="3"> Our scoring measure is designed to favor alignments which align semantically similar subgraphs, irrespective of polarity. For this reason, nodes receive high alignment scores when the words they represent are semantically similar. Synonyms and antonyms receive the highest score, and unrelated words receive the lowest. Our hand-crafted scoring metric takes into account the word, the lemma, and the part of speech, and searches for word relatedness using a range of external resources, including WordNet, precomputed latent semantic analysis matrices, and special-purpose gazettes. Alignment scores also incorporate local edge scores, which are based on the shape of the paths between nodes in the text graph that correspond to adjacent nodes in the hypothesis graph. Preserved edges receive the highest score, and longer paths receive lower scores.</Paragraph>
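Here is the promised simplified sketch of the beam search (our reconstruction, with node scores only for brevity; the actual procedure also incorporates the edge scores and node-ordering heuristic described above, and `candidates` and `node_score` are assumed stubs).

```python
# Simplified sketch of incremental beam search over alignments:
# hypothesis nodes are aligned one at a time, keeping only the
# `beam_width` best partial alignments at each step. Because the score
# is locally decomposable, partial scores are updated incrementally.

def beam_search_alignment(hyp_nodes, candidates, node_score, beam_width=10):
    beam = [({}, 0.0)]                       # (partial alignment, score)
    for h in hyp_nodes:                      # heuristic node order assumed
        extended = []
        for partial, score in beam:
            for t in candidates(h):          # possible text nodes, or None
                new = dict(partial)
                new[h] = t
                extended.append((new, score + node_score(h, t)))
        extended.sort(key=lambda pair: pair[1], reverse=True)
        beam = extended[:beam_width]         # prune to the best few
    best_alignment, best_score = beam[0]
    return best_alignment, best_score
```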
</Section> <Section position="3" start_page="43" end_page="44" type="sub_section"> <SectionTitle> 3.3 Entailment determination </SectionTitle> <Paragraph position="0"> In the final stage of processing, we make a decision about whether or not the hypothesis is entailed by the text, conditioned on the typed dependency graphs as well as the best alignment between them. Because we have a data set of examples that are labeled for entailment, we can use techniques from supervised machine learning to learn a classifier. We adopt the standard approach of defining a featural representation of the problem and then learning a linear decision boundary in the feature space. We focus here on the learning methodology; the next section covers the definition of the set of features. Defined in this way, the task is amenable to any statistical learning algorithm, such as support vector machines, logistic regression, or naive Bayes. We used a logistic regression classifier with a Gaussian prior for regularization.</Paragraph> <Paragraph position="1"> We also compare our learning results with those achieved by hand-setting the weight parameters for the classifier, effectively incorporating strong prior (human) knowledge into the choice of weights.</Paragraph> <Paragraph position="2"> An advantage of using statistical classifiers is that they can be configured to output a probability distribution over possible answers rather than just the most likely answer. This allows us to obtain confidence estimates for computing a confidence weighted score (see section 5). A major concern in applying machine learning techniques to this classification problem is the relatively small size of the training set, which can lead to overfitting. We address this by keeping the feature dimensionality small, and by using high regularization penalties in training.</Paragraph> </Section> </Section> <Section position="7" start_page="44" end_page="45" type="metho"> <SectionTitle> 4 Feature representation </SectionTitle> <Paragraph position="0"> In the entailment determination phase, the entailment problem is reduced to a representation as a vector of 28 features, over which the statistical classifier described above operates. These features try to capture salient patterns of entailment and non-entailment, with particular attention to contexts which reverse or block monotonicity, such as negations and quantifiers. This section describes the most important groups of features.</Paragraph> <Paragraph position="1"> Polarity features. These features capture the presence (or absence) of linguistic markers of negative-polarity contexts in both the text and the hypothesis, such as simple negation (not), downward-monotone quantifiers (no, few), restricting prepositions (without, except), and superlatives (tallest).</Paragraph> <Paragraph position="2"> Adjunct features. These indicate the dropping or adding of syntactic adjuncts when moving from the text to the hypothesis. For the common case of restrictive adjuncts, dropping an adjunct preserves truth (Dogs barked loudly ⊨ Dogs barked), while adding an adjunct does not (Dogs barked ⊭ Dogs barked today). However, in negative-polarity contexts (such as No dogs barked), this heuristic is reversed: adjuncts can safely be added, but not dropped. For example, in ID 59, the hypothesis aligns well with the text, but the addition of in Iraq indicates non-entailment. We identify the "root nodes" of the problem: the root node of the hypothesis graph and the corresponding aligned node in the text graph. Using dependency information, we identify whether adjuncts have been added or dropped. We then determine the polarity (negative context, positive context, or restrictor of a universal quantifier) of the two root nodes to generate features accordingly; a simplified sketch follows.</Paragraph>
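The sketch below is a simplified illustration of the adjunct heuristic (our own reconstruction, not the system's feature code); it reduces the paragraph above to two flags conditioned on root polarity, and omits the universal-quantifier case.

```python
# Simplified sketch of the adjunct features: dropping a restrictive
# adjunct is safe in a positive context and adding one is not, with
# the pattern reversed under negative polarity.

def adjunct_features(num_added, num_dropped, root_polarity):
    """num_added/num_dropped: adjuncts added to or dropped from the
    hypothesis relative to the aligned text; root_polarity is the
    polarity at the root nodes, here just "positive" or "negative"."""
    if root_polarity == "positive":
        # Dogs barked loudly entails Dogs barked, but adding is unsafe.
        return {"suspicious_add": num_added > 0, "suspicious_drop": False}
    else:
        # In negative contexts (No dogs barked) the heuristic reverses.
        return {"suspicious_add": False, "suspicious_drop": num_dropped > 0}

# ID 59: "in Iraq" is added to the hypothesis in a positive context,
# so suspicious_add fires, indicating likely non-entailment.
```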
<Paragraph position="3"> Antonymy features. Entailment problems might involve antonymy, as in ID 971. We check whether an aligned pair of text/hypothesis words appears to be antonymous by consulting a precomputed list of about 40,000 antonymous and other contrasting pairs derived from WordNet. For each antonymous pair, we generate one of three boolean features, indicating whether (i) the words appear in contexts of matching polarity, (ii) only the text word appears in a negative-polarity context, or (iii) only the hypothesis word does.</Paragraph> <Paragraph position="4"> Modality features. Modality features capture simple patterns of modal reasoning, as in ID 98, which illustrates the heuristic that possibility does not entail actuality. According to the occurrence (or not) of predefined modality markers, such as must or maybe, we map the text and the hypothesis to one of six modalities: possible, not possible, actual, not actual, necessary, and not necessary. The text/hypothesis modality pair is then mapped into one of the following entailment judgments: yes, weak yes, don't know, weak no, or no. For example: (not possible ⊨ not actual)? = yes, while (possible ⊨ necessary)? = weak no.</Paragraph> <Paragraph position="5"> Factivity features. The context in which a verb phrase is embedded may carry semantic presuppositions, giving rise to (non-)entailments such as The gangster tried to escape ⊭ The gangster escaped. This pattern of entailment, like others, can be reversed by negative-polarity markers (The gangster managed to escape ⊨ The gangster escaped, while The gangster didn't manage to escape ⊭ The gangster escaped). To capture these phenomena, we compiled small lists of "factive" and non-factive verbs, clustered according to the kinds of entailments they create. We then determine the class to which the parent of the text node aligned with the hypothesis root belongs. If the parent is not in the list, we only check whether the embedding context is affirmative or negative.</Paragraph> <Paragraph position="6"> Quantifier features. These features are designed to capture entailment relations among simple sentences involving quantification, such as Every company must report ⊨ A company must report (or The company, or IBM). No attempt is made to handle multiple quantifiers or scope ambiguities. Each quantifier found in an aligned pair of text/hypothesis words is mapped into one of five quantifier categories: no, some, many, most, and all. The no category is set apart, while an ordering is defined over the other four categories. The some category also includes definite and indefinite determiners and small cardinal numbers. A crude attempt is made to handle negation by interchanging no and all in the presence of negation. Features are generated from the categories of both hypothesis and text.</Paragraph> <Paragraph position="7"> Number, date, and time features. These are designed to recognize (mis-)matches between numbers, dates, and times, as in IDs 1806 and 231. We do some normalization (e.g. of date representations) and have a limited ability to do fuzzy matching. In ID 1806, the mismatched years are correctly identified. Unfortunately, in ID 231 the significance of over is not grasped, and a mismatch is reported.</Paragraph> <Paragraph position="8"> Alignment features. Our feature representation includes three real-valued features intended to represent the quality of the alignment: score is the raw score returned from the alignment phase, while goodscore and badscore try to capture whether the alignment score is "good" or "bad" by computing the sigmoid function of the distance between the alignment score and hard-coded "good" and "bad" reference values; a sketch follows.</Paragraph>
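As a concrete reading of these three features, here is a minimal sketch; the reference constants are invented for illustration, since the actual hard-coded values are not given here.

```python
# Sketch of the three real-valued alignment features: the raw score and
# two sigmoid-squashed comparisons against hard-coded "good" and "bad"
# reference values. GOOD_REF and BAD_REF are invented for illustration.
import math

GOOD_REF = 5.0   # hypothetical "good alignment" reference score
BAD_REF = -5.0   # hypothetical "bad alignment" reference score

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def alignment_features(raw_score):
    return {
        "score": raw_score,
        "goodscore": sigmoid(raw_score - GOOD_REF),
        "badscore": sigmoid(raw_score - BAD_REF),
    }
```
</Section> </Paper>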