Variants of tree similarity in a Question Answering task

2 Distance Measures

In pursuing such a similarity-based approach to question answering, the key decisions to be made are the representations of the questions and answers and, relatedly, the distance measures between them. We will primarily be concerned with measures which refer to a linguistic structure assigned to a word sequence (variants of tree-distance), but we will also consider string distance.

2.1 Tree Measures

Following Zhang and Shasha (1989), one can arrive at tree-distance in the following way. Given source and target ordered, labelled trees, S and T, consider the set H(S,T) of all 1-to-1 partial maps s from S into T which are homomorphisms preserving left-to-right order and ancestry. Let the alignment s' be the enlargement of the map s with pairs (Si, λ) for nodes Si ∉ dom(s) and (λ, Tj) for nodes Tj ∉ ran(s). Let D define deletion costs for the (Si, λ), I insertion costs for the (λ, Tj), and R replacement costs for the (Si, Tj) which pair nodes with non-identical labels. A total cost C(s') for the alignment can then be defined as the sum of these component costs, and the tree distance is the cost of the least-cost map:

    Δ(S,T) = min{ C(s') : s ∈ H(S,T) }

For any 3 trees T1, T2, T3, the triangle inequality holds: Δ(T1,T3) ≤ Δ(T1,T2) + Δ(T2,T3). Briefly, the argument is as follows. Given mappings s ∈ H(T1,T2) and t ∈ H(T2,T3), the composition s∘t is in H(T1,T3), so (s∘t)' is an alignment between T1 and T3, and Δ(T1,T3) ≤ C((s∘t)'). The cost of the composition is less than the sum of the costs of the composed maps: s's insertions and replacements contribute only if they fall in dom(t), and t's deletions and replacements contribute only if they act on ran(s).

From this basic definition, one can depart in a number of directions. First of all, there is a part-vs-whole dimension of variation. Where Δ(S,T) gives the cost of aligning the whole source tree S with the target T, one can consider variants where one minimises over a set of sub-parts of S; this is equivalent to letting all but the nodes belonging to the chosen sub-part delete at zero cost. (Minimising also over sub-parts of the target does not give an interesting notion, as the minimum will then inevitably involve at most one node of source and target.) Let d(S,T) be the sub-tree distance, and let d⃗(S,T) be the sub-traversal distance, in which sub-traversals of the left-to-right, post-order traversal of S are considered. As for Δ, the triangle inequality holds for d and d⃗; one needs to extend the notion of alignment with a set of free deletions. Unlike Δ, d and d⃗ are not symmetric.

All of Δ, d and d⃗ are implicitly parametrised by the cost functions D, I and R. In the work below, four other parameters are explored.

Node weighting W: this is a function which assigns a real-number weight to each node; the cost functions then refer to the weights. In the experiments reported below, D(Si) = W(Si), I(Tj) = W(Tj), and R(Si,Tj) = (W(Si) + W(Tj))/2 when Si and Tj have unequal labels (and 0 otherwise).
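To make this concrete, the following is a minimal Python sketch (ours, not the paper's implementation) of Δ as a recursive distance on ordered forests, using the weight-based costs just described. It is a direct, exponential-time rendering of the definition; the Zhang-Shasha algorithm summarised in the Appendix computes the same quantity by dynamic programming.

    # Illustrative sketch of the tree-distance definition; exponential time.
    class Node:
        def __init__(self, label, children=(), weight=1.0):
            self.label, self.children, self.weight = label, tuple(children), weight

    def D(s): return s.weight                      # deletion cost of a source node
    def I(t): return t.weight                      # insertion cost of a target node
    def R(s, t):                                   # replacement cost
        return 0.0 if s.label == t.label else (s.weight + t.weight) / 2

    def forest_dist(F, G):
        # F, G are tuples of roots of ordered forests; work on the rightmost roots.
        if not F and not G:
            return 0.0
        if not G:
            return D(F[-1]) + forest_dist(F[:-1] + F[-1].children, G)
        if not F:
            return I(G[-1]) + forest_dist(F, G[:-1] + G[-1].children)
        s, t = F[-1], G[-1]
        return min(
            D(s) + forest_dist(F[:-1] + s.children, G),   # delete s
            I(t) + forest_dist(F, G[:-1] + t.children),   # insert t
            R(s, t)                                       # map s to t:
                + forest_dist(F[:-1], G[:-1])             # remaining trees align
                + forest_dist(s.children, t.children),    # descendants align
        )

    def tree_dist(S, T):                           # Delta(S, T)
        return forest_dist((S,), (T,))

    S = Node("s", (Node("np", (Node("john"),)), Node("vp", (Node("runs"),))))
    T = Node("s", (Node("np", (Node("mary"),)), Node("vp", (Node("runs"),))))
    print(tree_dist(S, T))                         # 1.0: relabel john -> mary

Mapping s to t while preserving order and ancestry corresponds to the third branch of the minimisation: the rightmost roots are paired, and their child forests must then align with each other.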
The experiments reported below use two weighting functions, STR and LEX.

STR assigns weights according to the syntactic structure, via a classification of nodes as heads vs. complements vs. adjuncts vs. the rest, with adjuncts given essentially 1/5th the weight of heads and complements, and other daughters 1/2, via essentially the following top-down algorithm (a runnable rendering is sketched at the end of this subsection):

    Str(node, rank):
        assign weight 1/rank to node
        for each daughter d of node {
            if (d is head or complement) { Str(d, rank) }
            else if (d is adjunct)       { Str(d, rank * 5) }
            else                         { Str(d, rank * 2) }
        }

Target wild cards T(*): this is a function which classifies certain target sub-trees as wild cards. If source Si is mapped to target Tj, and Tj is the root of a wild-card tree, all nodes within the Si sub-tree can be deleted at zero cost, and all those within the Tj sub-tree can be inserted at zero cost. A wild-card np tree might be put in the position of the gap in wh-questions, allowing, for example, 'what is memory allocation' to closely match any sentence with 'memory allocation' as its object, no matter what its subject; see Figure 3.

Source self-effacers S/λ: this is a function which classifies certain source sub-trees as self-effacers. Such trees can be deleted in their entirety at zero cost. If S/λ classifies all source sub-trees as self-effacing, then Δ(S/λ) will coincide with the notion of 'tree-distance with Cut' given in Zhang and Shasha (1989).

Target self-inserters λ/T: this is a function which classifies certain target sub-trees as self-inserters. Such trees can be inserted in their entirety at zero cost. A candidate might be optional adjuncts. (A target wild-card is thus somewhat like a target self-inserter, but one which also licenses the classification of a matched source sub-tree as a self-effacer.)
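The Str algorithm above can be transcribed directly into runnable form. In this sketch the node interface (.role, .children) and the multiplicative treatment of rank for adjuncts and other daughters are our assumptions; the text fixes only the top-down recursion and the 1/5 and 1/2 ratios.

    # Hypothetical rendering of STR; roles are "head", "complement",
    # "adjunct" or "other".
    class RoleNode:
        def __init__(self, label, role="other", children=()):
            self.label, self.role, self.children = label, role, tuple(children)

    def str_weights(node, rank=1.0, weights=None):
        # Top-down: the node gets weight 1/rank, then the daughters are ranked.
        if weights is None:
            weights = {}
        weights[node] = 1.0 / rank
        for d in node.children:
            if d.role in ("head", "complement"):
                str_weights(d, rank, weights)      # full weight of the parent
            elif d.role == "adjunct":
                str_weights(d, rank * 5, weights)  # 1/5 the head/complement weight
            else:
                str_weights(d, rank * 2, weights)  # 1/2 the head/complement weight
        return weights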
2.2 Sequence Measures

The tree-distance measures work with an elaboration of the original questions and answers. Levenshtein (1966) defined the one-dimensional precursor of tree distance, which works directly on the two word sequences for the answer and question. For two sequences s and t, and vertical (or horizontal) tree encodings ltree(s) and ltree(t), if we define P(s,t) as Δ(ltree(s), ltree(t)) and p(s,t) as d⃗(ltree(s), ltree(t)), then P and p coincide with the standard sequence edit distance and sub-sequence edit distance. As special cases of Δ and d⃗, P and p inherit the triangle inequality property.

To illustrate some of the tree-distance definitions: in the following example (whole-tree matching, dist = 3.0), a Δ distance of 3 between two trees is obtained, assuming unit costs for deletions (shown in red and double outline), insertions (shown in green and double outline), and substitutions (shown in blue and linked with an arrow). Note also in this picture that nodes which are mapped without a relabelling are shown at the same horizontal level, with no linking arrow.

Figure 1 shows a sub-tree example, for d. The source-tree nodes which do not belong to the chosen sub-tree are shown in grey. The lowest vp sub-tree in the source is selected and mapped to the vp in the target. The remaining target nodes must be inserted, but this costs less than a match which starts higher and necessitates some deletions and substitutions.

Figure 2 shows a sub-tree example where the structural weighting STR has been used: the size of a node reflects its weight. Four of the nodes in the source represent the use of an auxiliary verb and receive low weight, changing the optimum match to one covering the whole source tree. There is some price paid in matching the dissimilar subject nps.

Figure 3 shows a wild-card example: in the subject position of the query there is a sub-tree which is classified as a wild-card np tree, and it matches at zero cost.

The basis of the algorithm used to calculate Δ is the Zhang-Shasha algorithm (Zhang and Shasha, 1989); the Appendix summarises it. The implementation is based on code implementing Δ (Fontana et al., 2004), adapted to allow for the d and d⃗ variants and the T(*), S/λ and λ/T parameters, and to generate human-readable displays of the alignments (such as those seen in Figures 1, 2 and 3).

2.3 Order-invariant Measures

Assessing answer/question similarity by variants of tree distance or sequence edit distance means that distance will not be word-order invariant. There are also measures which are word-order invariant, sometimes called token-based measures. These measures are usually couched in a vector representation of questions and answers, where the vector dimensions are words from some chosen enumeration of words (see Salton and Lesk, 1968). In the simplest case, the values on each dimension are in {0,1}, denoting presence or absence of a word. If · is the vector product and aw is the set of words in a sequence a, then a⃗ · b⃗ = |aw ∩ bw| for the binary vectors a⃗, b⃗ representing aw, bw. Three well-known measures based on this are given below, both in terms of vectors and, for binary vectors, in the equivalent formulation with sets:

    Dice(a,b)    = 2(a⃗ · b⃗) / (|a⃗|² + |b⃗|²)         = 2|aw ∩ bw| / (|aw| + |bw|)
    Jaccard(a,b) = (a⃗ · b⃗) / (|a⃗|² + |b⃗|² − a⃗ · b⃗)  = |aw ∩ bw| / |aw ∪ bw|
    Cosine(a,b)  = (a⃗ · b⃗) / (|a⃗||b⃗|)               = |aw ∩ bw| / (|aw|^0.5 |bw|^0.5)

These measure similarity, not difference, ranging from 1 for identical aw, bw to 0 for disjoint. In the binary case, Dice/Jaccard similarity can be related to the alignment-based, difference-counting perspective of the edit distances. If we define Pw(a,b) as |aw ∪ bw| − |aw ∩ bw|, the size of the symmetric difference between aw and bw (equivalently, |a⃗ − b⃗|²), this can be seen as a set-based version of edit distance which (i) considers mappings on the sets of words aw, bw, not the sequences a, b, and (ii) sets the replacement cost to infinity. A difference measure (ranging from 0 for identical aw, bw to 1 for disjoint) results if Pw(a,b) is divided by |aw| + |bw| (resp. |aw ∪ bw|), and this difference measure gives the reverse of a ranking by Dice (resp. Jaccard) similarity.

The Cosine is a measure of the angle between the vectors a⃗ and b⃗, and is not relatable in the binary case to the alignment-based, difference-counting perspective of the edit distances: dividing Pw(a,b), the symmetric difference, by |aw|^0.5 |bw|^0.5 does not give a measure with maximum value 1 in the disjoint case, and does not give the reverse of a ranking by Cosine similarity. Below we shall use θ to denote the Cosine distance.
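In the binary case, the three measures and the set-based edit distance Pw can be sketched in a few lines of Python (the word sets here are illustrative):

    from math import isclose

    def dice(aw, bw):    return 2 * len(aw & bw) / (len(aw) + len(bw))
    def jaccard(aw, bw): return len(aw & bw) / len(aw | bw)
    def cosine(aw, bw):  return len(aw & bw) / (len(aw) ** 0.5 * len(bw) ** 0.5)
    def p_w(aw, bw):     return len(aw | bw) - len(aw & bw)  # symmetric difference

    q = set("what is memory allocation".split())
    a = set("malloc performs memory allocation".split())
    # normalising Pw gives the reverse of the Dice (resp. Jaccard) ranking:
    assert isclose(p_w(q, a) / (len(q) + len(a)), 1 - dice(q, a))
    assert isclose(p_w(q, a) / len(q | a), 1 - jaccard(q, a))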
3 The Question Answering Tasks

For a given representation r (parse trees, word sequences, etc.) and distance measure d, we shall generically take a Question Answering by Distance (QAD) task to be given by a set of queries Q and, for each query q, a corpus of potential answer sentences CORq. For each a ∈ CORq, the system determines d(r(a), r(q)), the distance between the representations of a and q, and then uses this to sort CORq into Aq. This sorting is evaluated in the following way. If ac ∈ Aq is the correct answer, then the correct-answer-rank is the rank of ac in Aq:

    |{a ∈ Aq : d(r(a), r(q)) ≤ d(r(ac), r(q))}|

whilst the correct-answer-cutoff is the proportion of Aq cut off by the correct answer ac:

    |{a ∈ Aq : d(r(a), r(q)) ≤ d(r(ac), r(q))}| / |Aq|

Lower values of these connote better performance. Another figure of merit is the reciprocal correct-answer-rank; higher values of this connote better performance.

Note that the notion of answerhood here does not require answers to be the sub-sentential phrases associated with wh-phrases in the question; also, not all the questions are wh-questions. Note also that the set of candidate answers CORq is sorted by the answer-to-query distance, d(r(a), r(q)), not the query-to-answer distance, d(r(q), r(a)). The intuition is that the queries are short and the answers longer, with a sub-part that really contains the answer.

The performance of some of the above-mentioned distance measures has been measured on two examples of QAD tasks.

GNU Library Manual QAD task: in this case Q is a set of 88 hand-created queries, and CORq, shared by all the queries, is the set of sentences of the manual of the GNU C Library (|CORq| ≈ 31,000).

TREC 11 QAD task: in this case Q was the 500 questions of the TREC 11 QA track (Voorhees and Buckland, 2002), whose answers are drawn from a large corpus of newspaper articles. CORq was taken to be the sentences of the top 50 of the top-1000 ranking of articles provided by TREC 11 for each question (|CORq| ≈ 1000). Answer correctness was determined using the TREC 11 answer regular expressions.
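The evaluation quantities of this section can be sketched as follows (function names are ours): dists maps each candidate answer in Aq to d(r(a), r(q)), and ac is the correct answer.

    def correct_answer_rank(dists, ac):
        return sum(1 for v in dists.values() if v <= dists[ac])

    def correct_answer_cutoff(dists, ac):
        return correct_answer_rank(dists, ac) / len(dists)

    def mean_reciprocal_rank(queries):
        # queries: a list of (dists, ac) pairs, one per query
        return sum(1 / correct_answer_rank(d, ac) for d, ac in queries) / len(queries)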
For the tree-distance measures, two parsing systems have been used. For convenience of reference, we will call the first the trinity parser. This is a home-grown parser combining a disambiguating part-of-speech tagger with a bottom-up chart parser, referring to CFG-like syntax rules and a subcategorisation system somewhat in the style of categorial grammar. Right-branching analyses are preferred, and a final selection of edges from all those available is made using a leftmost/longest selection strategy; there is always an output, regardless of whether there is a single input-encompassing edge. Preterminal node labels are a combination of a main functor with other feature terms, but the replacement cost function R is set to ignore the feature terms. Terminal node labels are base forms of words, not inflected forms. For the structural weighting algorithm STR, the necessary node distinctions are furnished directly by the parser for vps, and by a small set of structure-matching rules for other structures (nps, pps, etc.). The structures output for wh-questions are essentially deep structures, re-ordering auxiliary inversion and placing a tree in the position of the gap.

The Collins parser (Collins, 1999) (Model 3 variant) is a probabilistic parser, using a model of trees as built top-down with a repertoire of moves learnt from the Penn Treebank. The preterminal node labels are a combination of a Penn Treebank label with other information pertaining to the head/complement/adjunct distinction, but the replacement cost function R is set to ignore all but the Penn Treebank label. Terminal node labels are inflected forms of words, not base forms. For the structural weighting algorithm STR, the necessary node distinctions are furnished directly by the parser. For the question parses, a set of transformations is applied to the parses directly given by the parser which, comparably to the trinity parser, re-order auxiliary inversion and place a tree in the position of the gap.

4 Relating Parse Quality to Retrieval Performance

As a kind of sanity check on the idea of using syntactic structures in retrieving answers, we performed some experiments in which we varied the sophistication of the parse trees that the parsers could produce, the expectation being that the less sophisticated the parse, the less successful question-answering performance would be. The left-hand data in Table 1 refer to various reductions of the linguistic knowledge bases of the trinity parser (thin50 = random removal of a 50% subset, manual = manual removal of a subset, flat = entirely flat parses, gold = hand-correction of query parses and their correct answers). The right-hand data in Table 1 refer to experiments in which the repertoire of moves available to the Collins parser, as defined by its grammar file, was reduced to different-sized random subsets of itself.

Figure 4 shows the empirical cumulative density function (ecdf) of the correct-answer-cutoff obtained with the weighted sub-tree-with-wild-cards measure on the Library task: for each possible value x of correct-answer-cutoff, it plots the proportion of queries whose correct-answer-cutoff is ≤ x. What these experiments show is that question-answering performance is a function of the sophistication of the parses that the parsers are able to produce.
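For reference, an ecdf of the kind plotted in Figure 4 can be computed from the per-query cutoffs as follows (a minimal sketch):

    def ecdf(cutoffs):
        # (x, y) points: y = proportion of queries with correct-answer-cutoff <= x
        xs = sorted(cutoffs)
        n = len(xs)
        return [(x, (i + 1) / n) for i, x in enumerate(xs)
                if i + 1 == n or xs[i + 1] != x]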
5 Comparing Distance Measures

Table 2 gives results on the Library task, using the trinity parser, for some variations of the distance measure, reporting for each distance type the 1st quartile, median and mean of the correct-answer-cutoff, together with the mean reciprocal rank (mrr).

Considering the results in Table 2, the best-performing measure (mrr = 0.27) was the sub-traversal distance d⃗, assigning weights structurally using STR, with lexical emphasis LEX, and treating a gap position as an np wild card. This slightly outperforms the sub-tree measure d (mrr = 0.25).

An alternative approach to discounting parts of the answer tree, allowing any sub-tree of the answer the option to delete for free (Δ(W = Str·Lex, T(*) = np gap, S/λ = all)), performs considerably worse (mrr = 0.16). Presumably this is because it is too eager to assemble the query tree from disparate parts of the answer tree. By comparison, d⃗ and d can only assemble the query tree from parts of the answer tree that are more closely connected.

The tree-distance measures (d⃗, d) using structural weights, lexical emphasis and wild cards (mrr = 0.27) outperform the sub-sequence measure p (mrr = 0.197). They also outperform the cosine measure θ (mrr = 0.190). But p and θ either outperform or perform at about the same level as the tree-distance measures if the lexical emphasis is removed (see d(W = Str, T(*) = np gap), mrr = 0.160).

The tree-distance measure d works better if structural weighting is used (mrr = 0.09) than if it is not (mrr = 0.04), and better with wild cards (see d(W = Str, T(*) = np gap), mrr = 0.160) than without (see d(W = Str), mrr = 0.090).

Table 3 gives some results on the TREC 11 task, using the Collins parser. Fewer comparisons have been made here. The sub-traversal measure using structural weighting, lexical emphasis and wild cards performs better (mrr = 0.150) than the sub-sequence measure (mrr = 0.09), which in turn performs better than the basic sub-traversal measure, without structural weighting, lexical emphasis or wild cards (mrr = 0.076). The cosine distance θ performed best.

6 Discussion

For the parsers used, one could easily have two sentences with completely different words and very different meanings but the same preterminal syntactic structure: the preterminal syntactic structure is not a function of the meaning. Given this, it is perhaps not surprising that there will be cases that the sequence distance easily spots as dissimilar, but which the tree-distance measure, without any lexical emphasis, will regard as quite similar; this perhaps explains why, without any lexical emphasis, the tree-distance measure performs at a similar level to, or worse than, the sub-sequence distance measure.

With some kind of lexical emphasis in place, the tree-distance measures outperform the sub-sequence measures. We can speculate as to the reason for this. There are two kinds of case where the tree-distance measures could be expected to spot a similarity which the sequence-distance measures will fail to spot. One is when the question and answer are more or less similar on their head words, but differ in determiners, auxiliaries and adjuncts. The sequence-distance measure will pay more of a price for these differences than the structurally weighted tree-distance.
Another kind of case is when the answer supplies words which match a wild card in the middle of the query tree, as might happen, for example, in:

    Q: what do child processes inherit from their parent processes
    A: a child process inherits the owner and permissions from the ancestor process

The tree-distance measures will see these as similar, but the sub-sequence measure will pay a large price for the words in the answer that match the gap position in the query. Thus one can argue that the use of structural weighting, and of wild-card trees in the query analysis, will tend to equate things which the sequence distance sees as dissimilar.

Another possible reason that the tree-distance measure outperforms the sub-sequence measure is that it may be able to distinguish things which the sequence distance will tend to treat as equivalent. A question might make the thematic role of some entity very clear, but use very few significant words, as in: what does malloc do? Using tree distance will favour answer sentences with malloc as the subject, such as malloc returns a null pointer. The basic problem for the sequence distance here is that it does not have much to work with, and it will only be able to partition the answer set into a small set of equivalence classes.

These are speculations as to why tree-distance would outperform sequence distance. Whether the equating and discriminating advantages which should theoretically accrue to d and d⃗ actually do so will depend on the accuracy of the parsing: if there is too much bad parsing, then we will be equating that which we should keep apart, and discriminating that which we should equate.

In the two tasks, the relationship between the tree-distance measures and the order-invariant cosine measure worked out differently. The reasons for this are not clear at the moment. One possibility is that our use of the Collins parser has not yet resulted in good enough parses, especially question parses; recall that the indication from Section 4 was that improved parse quality gives better retrieval performance. It is also possible that, relative to the queries in the Library task, the amount of word-order permutation between question and answer is greater in the TREC task. This is also indicated by the fact that on the TREC task the sub-sequence measure p falls considerably behind the cosine measure θ, whereas on the Library task they perform at similar levels.

Some other researchers have also looked at the use of tree-distance measures in semantically oriented tasks. Punyakanok et al. (2004) report work using tree-distance to do question answering on the TREC 11 data. Their work differs from that presented here in several ways. They take the parse trees output by the Collins parser and convert them into dependency trees between the leaves. They compute the distance from query to answer, rather than from answer to query, using essentially the variant of tree-distance that allows arbitrary sub-trees of the target to insert at zero cost. Presumably this directionality difference is not a significant one, and with distances calculated from answers to queries, this would correspond to the variant that allows arbitrary source sub-trees to delete at zero cost.
Their cost functions are parameterised to refer, in the case of wild-card replacements, to (i) information derived from named-entity recognisers, so that different kinds of wh wild-cards can be given low-cost replacement with vocabulary categorised as belonging to the right kind by NE recognition, and (ii) base-form information.

There is no way to make a numerical comparison, because they took a different answer corpus CORq (the articles containing the answers suggested by TREC 11 participants) and a different criterion of correctness: an answer was correct if it belonged to an article which the TREC 11 adjudicators judged to contain a correct answer.

Their adaptation of cost functions to refer to essentially semantic annotations of tree nodes is an avenue we intend to explore in future work. What this paper has sought to do is to investigate intrinsic syntactic parameters that might influence performance. The hope is that these parameters still play a role in an enriched system.