<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1205">
  <Title>Recognizing Paraphrases and Textual Entailment using Inversion Transduction Grammars</Title>
  <Section position="3" start_page="25" end_page="26" type="metho">
    <SectionTitle>
2 Inversion Transduction Grammars
</SectionTitle>
    <Paragraph position="0"> Formally, ITGs can be defined as the restricted subset of syntax-directed transduction grammars or SDTGs Lewis and Stearns (1968) where all of the rules are either of straight or inverted orientation. Ordinary SDTGs allow any permutation of the symbols on the right-hand side to be specified when translating from the input language to the output language. In contrast, ITGs only allow two out of the possible permutations. If a rule is straight, the order of its right-hand symbols must be the same for both language. On the other hand, if a rule is inverted, then the order is left-to-right for the input language and right-to-left for the output language. Since inversion is permitted at any level of rule expansion, a derivation may intermix productions of either orientation within the parse tree.</Paragraph>
    <Paragraph position="1"> The ability to compose multiple levels of straight and inverted constituents gives ITGs much greater expressiveness than might seem at first blush.</Paragraph>
    <Paragraph position="2"> A simple example may be useful to fix ideas. Consider the following pair of parse trees for sentence translations: [[[The Authority]NP [will [[be accountable]VV [to [the [[Financial Secretary]NN ]NNN ]NP ]PP ]VP</Paragraph>
    <Paragraph position="4"> Even though the order of constituents under the inner VP is inverted between the languages, an ITG can capture the common structure of the two sentences. This is compactly shown by writing the parse tree together for both sentences with the aid of an &lt;&gt; angle bracket notation marking parse tree nodes that instantiate rules of</Paragraph>
    <Paragraph position="6"> In a weighted or stochastic ITG (SITG), a weight or a probability is associated with each rewrite rule. Following the standard convention, we use a and b to denote probabilities for syntactic and lexical rules, respectively.</Paragraph>
    <Paragraph position="7"> For example, the probability of the rule NN 0.4- [A N] is aNN-[A N] = 0.4. The probability of a lexical rule A 0.001x/y is bA(x,y) = 0.001. Let W1,W2 be the vocabulary sizes of the two languages, and N = {A1,...,AN} be the set of nonterminals with indices 1,...,N.</Paragraph>
    <Paragraph position="8"> Wu (1997) also showed that ITGs can be equivalently be defined in two other ways. First, ITGs can be defined as the restricted subset of SDTGs where all rules are of rank 2. Second, ITGs can also be defined as the restricted subset of SDTGs where all rules are of rank 3.</Paragraph>
    <Paragraph position="9"> Polynomial-time algorithms are possible for various tasks including translation using ITGs, as well as bilingual parsing or biparsing, where the task is to build the highest-scored parse tree given an input bi-sentence.</Paragraph>
    <Paragraph position="10"> For present purposes we can employ the special case of Bracketing ITGs, where the grammar employs only one single, undistinguished &amp;quot;dummy&amp;quot; nonterminal category for any non-lexical rule. Designating this category A, a Bracketing ITG has the following form (where, as usual, lexical transductions of the form A - e/f may possibly be singletons of the form A - e/epsilon1 or A - epsilon1/f).</Paragraph>
    <Paragraph position="11">  The simplest class of ITGs, Bracketing ITGs, are particularly interesting in applications like paraphrasing, because they impose ITG constraints in language-independent fashion, and in the simplest case do not require any language-specific linguistic grammar or training. In Bracketing ITGs, the grammar uses only a single, undifferentiated non-terminal (Wu, 1995). The key modeling property of Bracketing ITGs that is most relevant to paraphrase recognition is that they assign strong preference to candidate paraphrase pairs in which nested constituent subtrees can be recursively aligned with a minimum of constituent boundary violations. Unlike language-specific linguistic approaches, however, the shape of the trees are driven in unsupervised fashion by the data. One way to view this is that the trees are hidden explanatory variables. This not only provides significantly higher robustness than more highly constrained manually constructed grammars, but also makes the model widely applicable across languages in economical fashion without a large investment in manually constructed resources.</Paragraph>
    <Paragraph position="12"> Moreover, for reasons discussed by Wu (1997), ITGs possess an interesting intrinsic combinatorial property of permitting roughly up to four arguments of any frame to be transposed freely, but not more. This matches suprisingly closely the preponderance of linguistic verb frame theories from diverse linguistic traditions that all allow up to four arguments per frame. Again, this property emerges naturally from ITGs in language-independent fashion, without any hardcoded language-specific knowledge. This further suggests that ITGs should do well at picking out paraphrase pairs where the order of up to four arguments per frame may vary freely between the two strings. Conversely, ITGs should do well at rejecting pairs where (1) too many words in one sentence  find no correspondence in the other, (2) frames do not nest in similar ways in the candidate sentence pair, or (3) too many arguments must be transposed to achieve an alignment--all of which would suggest that the sentences probably express different ideas.</Paragraph>
    <Paragraph position="13"> As an illustrative example, in common similarity models, the following pair of sentences (found in actual data arising in our experiments below) would receive an inappropriately high score, because of the high lexical similarity between the two sentences: Chinese president Jiang Zemin arrived in Japan today for a landmark state visit .</Paragraph>
    <Paragraph position="15"> (Jiang Zemin will be the first Chinese national president to pay a state vist to Japan.) However, the ITG based model is sensitive enough to the differences in the constituent structure (reflecting underlying differences in the predicate argument structure) so that our experiments show that it assigns a low score. On the other hand, the experiments also show that it successfully assigns a high score to other candidate bisentences representing a true Chinese translation of the same English sentence, as well as a true English translation of the same Chinese sentence.</Paragraph>
    <Paragraph position="16"> We investigate a model for the paraphrase recognition problem that employ simple generic Bracketing ITGs.</Paragraph>
    <Paragraph position="17"> The experimental results show that, even in the absence of any thesaurus to accommodate lexical variation between the two strings, the Bracketing ITG's structure matching bias alone produces a significant improvement in average precision.</Paragraph>
  </Section>
  <Section position="4" start_page="26" end_page="26" type="metho">
    <SectionTitle>
3 Scoring Method
</SectionTitle>
    <Paragraph position="0"> All words of the vocabulary are included among the lexical transductions, allowing exact word matches between the two strings of any candidate paraphrase pair.</Paragraph>
    <Paragraph position="1"> Each candidate pair of the test set was scored via the ITG biparsing algorithm, which employs a dynamic programming approach as follows.Let the input English sentence be e1,..., eT and the corresponding input Chinese sentence be c1,..., cV . As an abbreviation we write es..t for the sequence of words es+1, es+2,..., et, and similarly for cu..v; also, es..s = epsilon1 is the empty string. It is convenient to use a 4-tuple of the form q = (s,t,u,v) to identify each node of the parse tree, where the sub-strings es..t and cu..v both derive from the node q. Denote the nonterminal label on q by lscript(q). Then for any</Paragraph>
    <Paragraph position="3"> as the maximum probability of any derivation from i that successfully parses both es..t and cu..v. Then the best parse of the sentence pair has probability d0,T,0,V (S).</Paragraph>
    <Paragraph position="4"> The algorithm computes d0,T,0,V (S) using the following recurrences. Note that we generalize argmax to the case where maximization ranges over multiple indices, by making it vector-valued. Also note that [] and &lt;&gt; are simply constants, written mnemonically. The condition (S[?]s)(t[?]S)+(U [?]u)(v[?]U) negationslash= 0 is a way to specify that the substring in one but not both languages may be split into an empty string epsilon1 and the substring itself; this ensures that the recursion terminates, but permits words that have no match in the other language to map to an epsilon1 instead.</Paragraph>
  </Section>
  <Section position="5" start_page="26" end_page="27" type="metho">
    <SectionTitle>
1. Initialization
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> 3. Reconstruction Initialize by setting the root of the  parse tree to q1 = (0,T,0,V ) and its nonterminal label to lscript(q1) = S. The remaining descendants in the optimal parse tree are then given recursively for any</Paragraph>
    <Paragraph position="4"> As mentioned earlier, biparsing for ITGs can be accomplished efficiently in polynomial time, rather than the exponential time required for classical SDTGs. The result in Wu (1997) implies that for the special case of Bracketing ITGs, the time complexity of the algorithm is ThparenleftbigT3V 3parenrightbig where T and V are the lengths of the two sentences. This is a factor of V 3 more than monolingual chart parsing, but has turned out to remain quite practical for corpus analysis, where parsing need not be real-time.</Paragraph>
    <Paragraph position="5"> The ITG scoring model can also be seen as a variant of the approach described by Leusch et al. (2003), which allows us to forego training to estimate true probabilities; instead, rules are simply given unit weights. The ITG scores can be interpreted as a generalization of classical Levenshtein string edit distance, where inverted block transpositions are also allowed. Even without probability estimation, Leusch et al. found excellent correlation with human judgment of similarity between translated paraphrases. null</Paragraph>
  </Section>
  <Section position="6" start_page="27" end_page="27" type="metho">
    <SectionTitle>
4 Experimental Results--Paraphrase
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
Recognition
</SectionTitle>
      <Paragraph position="0"> Our objective here was to isolate the effect of the ITG constraint bias. No training was performed with the available development sets. Rather, the aim was to establish foundational baseline results, to see in this first round of paraphrase recognition experiments what results could be obtained with the simplest versions of the ITG models.</Paragraph>
      <Paragraph position="1"> The MSR Paraphrase Corpus test set consists of 1725 candidate paraphrase string pairs, each annotated for semantic equivalence by two or three human collectors.</Paragraph>
      <Paragraph position="2"> Within the test set, 66.5% of the examples were annotated as being semantically equivalent. The corpus was originally generated via a combination of automatic filtering methods, making it difficult to make specific claims about distributional neutrality, due to the arbitrary nature of the example selection process.</Paragraph>
      <Paragraph position="3"> The ITG scoring model produced an uninterpolated average precision (also known as confidence weighted score) of 76.1%. This represents an improvement of roughly 10% over the random baseline. Note that this improvement can be achieved with no thesaurus or lexical similarity model, and no parameter training.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="27" end_page="29" type="metho">
    <SectionTitle>
5 Experimental Results--Textual
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
Entailment Recognition
</SectionTitle>
      <Paragraph position="0"> The experimental procedure for the monolingual textual entailment recognition task is the same as for paraphrase recognition, except that one string serves as the Text and the other serves as the Hypothesis.</Paragraph>
      <Paragraph position="1"> Results on the textual entailment recognition task are consistent with the above paraphrase recognition results.</Paragraph>
      <Paragraph position="2"> For the PASCAL RTE challenge datasets, across all sub-sets overall, the model produced a confidence-weighted score of 54.97% (better than chance at the 0.05 level). All examples were labeled, so precision, recall, and f-score are equivalent; the accuracy was 51.25%.</Paragraph>
      <Paragraph position="3"> For the RTE task we also investigated a second variant of the model, in which a list of 172 words from a stoplist was excluded from the lexical transductions. The motivation for this model was to discount the effect of words such as &amp;quot;the&amp;quot; or &amp;quot;of&amp;quot; since, more often than not, they could be irrelevant to the RTE task.</Paragraph>
      <Paragraph position="4"> Surprisingly, the stoplisted model produced worse results. The overall confidence-weighted score was 53.61%, and the accuracy was 50.50%. We discuss the reasons below in the context of specific subsets.</Paragraph>
      <Paragraph position="5"> As one might expect, the Bracketing ITG models performed better on the subsets more closely approximating the tasks for which Bracketing ITGs were designed: comparable documents (CD), paraphrasing (PP), and information extraction (IE). We will discuss some important caveats on the machine translation (MT) and reading comprehension (RC) subsets. The subsets least close to the Bracketing ITG models are information retrieval (IR) and question answering (QA).</Paragraph>
    </Section>
    <Section position="2" start_page="27" end_page="28" type="sub_section">
      <SectionTitle>
5.1 Comparable Documents (CD)
</SectionTitle>
      <Paragraph position="0"> The CD task definition can essentially be characterized as recognition of noisy word-aligned sentence pairs. Among all subsets, CD is perhaps closest to the noisy word alignment task for which Bracketing ITGs were originally developed, and indeed produced the best results for both of the Bracketing ITG models. The basic model produced a confidence-weighted score of 79.88% (accuracy 71.33%), while the stoplisted model produced an essentially unchanged confidence-weighted score of 79.83%  (accuracy 70.00%).</Paragraph>
      <Paragraph position="1"> The results on the RTE Challenge datasets closely reflect the larger-scale findings of Wu and Fung (2005), who demonstrate that an ITG based model yields far more accurate extraction of parallel sentences from quasicomparable non-parallel corpora than previous state-of-the-art methods. Wu and Fung's results also use the evaluation metric of uninterpolated average precision (i.e., confidence-weighted score).</Paragraph>
      <Paragraph position="2"> Note also that we believe the results here are artificially lowered by the absence of any thesaurus, and that significantly further improvements would be seen with the addition of a suitable thesaurus, for reasons discussed below under the MT subsection.</Paragraph>
    </Section>
    <Section position="3" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
5.2 Paraphrase Acquisition (PP)
</SectionTitle>
      <Paragraph position="0"> The PP task is also close to the task for which Bracketing ITGs were originally developed. For the PP task, the basic model produced a confidence-weighted score of 57.26% (accuracy 56.00%), while the stoplisted model produced a lower confidence-weighted score of 51.65% (accuracy 52.00%). Unlike the CD task, the greater importance of function words in determining equivalent meaning between paraphrases appears to cause the degradation in the stoplisted model.</Paragraph>
      <Paragraph position="1"> The effect of the absence of a thesaurus is much stronger for the PP task as opposed to the CD task. Inspection of the datasets reveals much more lexical variation between paraphrases, and shows that cases where lexis does not vary are generally handled accurately by the Bracketing ITG models. The MT subsection below discusses why a thesaurus should produce significant improvement. null</Paragraph>
    </Section>
    <Section position="4" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
5.3 Information Extraction (IE)
</SectionTitle>
      <Paragraph position="0"> The IE task presents a slight issue of misfit for the Bracketing ITG models, but yielded good results anyhow. The basic Bracketing ITG model attempts to align all words/collocations between the two strings. However, for the IE task in general, only a substring of the Text should be aligned to the Hypothesis, and the rest should be disregarded as &amp;quot;noise&amp;quot;. We approximated this by allowing words to be discarded from the Text at little cost, by using parameters that impose only a small penalty on null-aligned words from the Text. (As a reasonable first approximation, this characterization of the IE task ignores the possibility of modals, negation, quotation, and the like in the Text.) Despite the slight modeling misfit, the Bracketing ITG models produced good results for the IE subset. The basic model produced a confidence-weighted score of 59.92% (accuracy 55.00%), while the stoplisted model produced a lower confidence-weighted score of 53.63% (accuracy 51.67%). Again, the lower score of the stoplisted model appears to arise from the greater importance of function words in ensuring correct information extraction, as compared with the CD task.</Paragraph>
    </Section>
    <Section position="5" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
5.4 Machine Translation (MT)
</SectionTitle>
      <Paragraph position="0"> One exception to expectations is the machine translation subset, a task for which Bracketing ITGs were developed. The basic model produced a confidence-weighted score of 34.30% (accuracy 40.00%), while the stoplisted model produced a comparable confidence-weighted score of 35.96% (accuracy 39.17%).</Paragraph>
      <Paragraph position="1"> However, the performance here on the machine translation subset cannot be directly interpreted, for two reasons. null First, the task as defined in the RTE Challenge datasets is not actually crosslingual machine translation, but rather evaluation of monolingual comparability between an automatic translation and a gold standard human translation. This is in fact closer to the problem of defining a good MT evaluation metric, rather than MT itself. Leusch et al. (2003 and personal communication) found that Bracketing ITGs as an MT evaluation metric show excellent correlation with human judgments.</Paragraph>
      <Paragraph position="2"> Second, no translation lexicon or equivalent was used in our model. Normally in translation models, including ITG models, the translation lexicon accommodates lexical ambiguity, by providing multiple possible lexical choices for each word or collocation being translated. Here, there is no second language, so some substitute mechanism to accommodate lexical ambiguity would be needed.</Paragraph>
      <Paragraph position="3"> The most obvious substitute for a translation lexicon would be a monolingual thesaurus. This would allow matching synonomous words or collocations between the Text and the Hypothesis. Our original thought was to incorporate such a thesaurus in collaboration with teams focusing on creating suitable thesauri, but time limitations prevented completion of these experiments. Based on our own prior experiments and also on Leusch et al.'s experiences, we believe this would bring performance on the MT subset to excellent levels as well.</Paragraph>
    </Section>
    <Section position="6" start_page="28" end_page="29" type="sub_section">
      <SectionTitle>
5.5 Reading Comprehension (RC)
</SectionTitle>
      <Paragraph position="0"> The reading comprehension task is similar to the information extraction task. As such, the Bracketing ITG model could be expected to perform well for the RC subset. However, the basic model produced a confidence-weighted score of just 49.37% (accuracy 47.14%), and the stoplisted model produced a comparable confidence-weighted score of 47.11% (accuracy 45.00%).</Paragraph>
      <Paragraph position="1"> The primary reason for the performance gap between the RC and IE domains appears to be that RC is less news-oriented, so there is less emphasis on exact lexical choices such as named entities. This puts more weight on  the importance of a good thesaurus to recognize lexical variation. For this reason, we believe the addition of a thesaurus would bring performance improvements similar to the case of MT.</Paragraph>
    </Section>
    <Section position="7" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
5.6 Information Retrieval (IR)
</SectionTitle>
      <Paragraph position="0"> The IR task diverges significantly from the tasks for which Bracketing ITGs were developed. The basic model produced a confidence-weighted score of 43.14% (accuracy 46.67%), while the stoplisted model produced a comparable confidence-weighted score of 44.81% (accuracy 47.78%).</Paragraph>
      <Paragraph position="1"> Bracketing ITGs seek structurally parallelizable substrings, where there is reason to expect some degree of generalization between the frames (heads and arguments) of the two substrings from a lexical semantics standpoint. In contrast, the IR task relies on unordered keywords, so the effect of argument-head binding cannot be expected to be strong.</Paragraph>
    </Section>
    <Section position="8" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
5.7 Question Answering (QA)
</SectionTitle>
      <Paragraph position="0"> The QA task is extremely free in the sense that questions can differ significantly from the answers in both syntactic structure and lexis, and can also require a significant degree of indirect complex inference using real-world knowledge. The basic model produced a confidence-weighted score of 33.20% (accuracy 40.77%), while the stoplisted model produced a significantly better confidence-weighted score of 38.26% (accuracy 44.62%).</Paragraph>
      <Paragraph position="1"> Aside from adding a thesaurus, to properly model the QA task, at the very least the Bracketing ITG models would need to be augmented with somewhat more linguistic rules that include a proper model for wh- words in the Hypothesis, which otherwise cannot be aligned to the Text. In the Bracketing ITG models, the stoplist appears to help by normalizing out the effect of the wh- words.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML