Proceedings of the Workshop on Linguistic Distances, pages 100–108,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Variants of tree similarity in a Question Answering task
Martin Emms
Dept of Computer Science
Trinity College
Ireland
Abstract
The results of experiments on the appli-
cation of a variety of distance measures
to a question-answering task are reported.
Variants of tree-distance are considered,
including whole-vs-sub tree, node weight-
ing, wild cards and lexical emphasis. We
derive string-distance as a special case
of tree-distance and show that a particu-
lar parameterisation of tree-distance out-
performs the string-distance measure.
1 Introduction
This paper studies the deployment in a ques-
tion answering task of methods which assess the
similarity of question and answer representations.
Given questions such as
Q1 what does malloc return ?
Q2 What year did poet Emily Dickinson die?
and a collection of sentences (eg. a computer man-
ual, a corpus of newspaper articles), the task is to
retrieve the sentences that answer the question, eg.
A1 the malloc function returns a null pointer
A2 In 1886 , poet Emily Dickinson died in Amherst , Mass
One philosophy for finding answers to ques-
tions would be to convert questions and candidate
answers into logical forms and to compute answer-
hood by apply theorem-proving methods. Another
philosophy is to assume that the answers are simi-
lar to the questions, where similarity might be de-
fined in many different ways. While not all an-
swers to all questions will be similar, there’s an
intuition that most questions can be answered in
a way which shares quite a bit with the question,
and that accordingly with a large enough corpus, a
similarity-based approach could be fruitful.
2 Distance Measures
In pursuing such a similarity-based approach to
question-answering, the key decisions to be made
are the representations of the questions and an-
swers, and relatedly, distance measures between
them.
We will primarily be concerned with measures
which refer to a linguistic structure assigned to a
word sequence – variants of tree-distance, but we
will also consider string-distance.
2.1 Tree Measures
Following (Zhang and Shasha, 1989), one can ar-
rive at tree-distance in the following way. Given
source and target ordered, labelled trees, S and
T, consider the set H(S,T) of all 1-to-1 par-
tial maps, σ, from S into T, which are homo-
morphisms preserving left-to-right order and an-
cestry1. Let the alignment, σ′, be the enlarg-
ment of the map σ with pairs (Si,λ) for nodes
Si negationslash∈ dom(σ) and (λ,Tj) for nodes Tj negationslash∈ ran(σ).
Let D define deletion costs for the (Si,λ), I inser-
tion costs for the (λ,Tj), and R replacement costs
for the (Si,Tj) which represent nodes with non-
identical labels. Then a total cost for the align-
ment, C(σ′) can be defined as the sum of these
components costs, and the tree distance can then
be defined as the cost of the least-cost map:
∆(S,T) = min({C(σ′) | σ ∈ H(S,T)})
For any 3 trees, T1, T2, T3, the triangle inequal-
ity holds ∆(T1,T3) ≤ ∆(T1,T2) + ∆(T2,T3).
1If Tj
1 = σ(Si1) and Tj2 = σ(Si2) then (i) Si1 is to theleft of S
i2 iff Tj1 is to the left of Tj2 and (ii) Si1 is a descen-
dant of Si2 iff Tj1 is a descendant of Tj2, with descendency
understood as the transitive closure of the daugher-mother re-
lation.
100
Briefly the argument is as follows. Given map-
pings σ ∈ H(T1,T2), and τ ∈ H(T2,T3), σ◦τ ∈
H(T1,T3)2, so (σ ◦ τ)′ is an alignment between
T1 and T3, and ∆(T1,T3) ≤ C((σ ◦ τ)′). The
cost of the composition is less than the sum of the
costs of the composed maps: σ’s insertions and re-
placements contribute only if they fall in dom(τ),
τ’s deletions and replacements contribute only if
they act on ran(σ).
From this basic definition, one can depart in
a number of directions. First of all, there is a
part-vs-whole dimension of variation. Where
∆(S,T) gives the cost of aligning the whole
source tree S with the target T, one can consider
variants where one minimises over a set of sub-
parts of S. This is equivalent to letting all but the
nodes belonging to the chosen sub-part to delete
at zero cost3. Let δ(S,T) be the sub-tree dis-
tance. Let vectorδ(S,T), be the sub-traversal distance,
in which sub-traversals of the left-to-right, post-
order traversal of S are considered. As for ∆, the
triangle inequality holds for δ and vectorδ – one needs
to extend the notion of alignment with a set of free
deletions. Unlike ∆, δ and vectorδ are not symmetric.
All of ∆, δ and vectorδ are implicitly parametrised by
the cost functions, D, I and R. In the work below
4 other parameters are explored
Node weighting W: this is a function which
assigns a real-number weight to each each
node. The cost function then refers to the
weights. In experiments reported below,
Dw((Si,w),λ) = w, Iw(λ,(Tj,w)) = w,
Rw((Si,ws),(Tj,wt)) = max(ws,wt), if
Si and Tj have unequal labels. The experi-
ments reported below use 2 weighting func-
tion STR, and LEX. STR assign weights
according to the syntactic structure, via a
classification of nodes as heads vs. comple-
ments vs. adjuncts vs. the rest, with es-
sentially adjuncts given 1/5th the weights of
heads and complements, and other daughters
1/2, via essentially the following top-down
algorithm:
Str(node,rank) :
assign weight 1/rank to node
for each daughter d
2∀x ∈ T1∀z ∈ T3((x,z) ∈ σ ◦ τ iff ∃y ∈ T2((x,y) ∈
σ,(y,z) ∈ τ)
3Note that if one minimises also over sub-parts of the tar-
get, you do not get an interesting notion, as the minimum will
inevitably involve at most one node of source and target.
if (d is head or complement){
assign weight = 1/rank,
Str(rank,d)}
else if (d is adjunct) {
assign weight = 1/(5 × rank),
Str(5∗ rank,d)}
else {
assign weight = 1/(2 × rank)
Str(2∗ rank,d)}
LEX is a function which can be composed
with STR, and scales up the weights of leaf
nodes by a factor of 3.
Target wild cards T(∗): this is a function which
classifies certain target sub-trees as wild-
card. If source Si is mapped to target Tj, and
Tj is the root of a wild-card tree, all nodes
within the Si sub-tree can be deleted for 0
cost, and all those within the Tj sub-tree can
be inserted for 0 cost. A wild card np tree
might can be put in the position of the gap in
wh-questions, allowing for example what is
memory allocation, to closely match any sen-
tences with memory allocation as their ob-
ject, no matter what their subject – see Fig-
ure 3.
Source self-effacers S/λ: this is a function
which classifies source sub-trees as self-
effacers. Such trees can be deleted in
their entirety for zero cost. If S/λ clas-
sifies all source sub-trees as self-effacing,
then ∆(S/λ) will coincide with notion of
’tree-distance with Cut’ given in (Zhang and
Shasha, 1989).
Target self-inserters λ/T: this is a function
which classifies certain target sub-trees as
self-inserters. Such trees can be inserted in
their entirety for zero cost. A candidate might
be optional adjuncts.4
2.2 Sequence Measures
The tree-distance measures work with an elabora-
tion of the original questions and answers. (Lev-
enshtein, 1966) defined the 1 dimensional precur-
sor of tree distance, which works directly on the
2 word sequences for the answer and question.
For two sequences, s, t, and vertical (or hori-
zontal) tree encodings l tree(s) and l tree(t), if
4Thus a target wild-card is somewhat like a target self-
effacer, but one which also licenses the classification of a
matched source sub-tree as a being self-effacer.
101
process
n
np
s
vp
rhs
lhs be rhs
lhs
vp
call be
memory
n
n
n
allocation
np something
pro
np
s
vp
lhs rhs
memory
n
n
n
allocation
np
sub tree matching dist=5.0
Figure 1: Sub tree example
we define Π(s,t), as ∆(l tree(s),l tree(t)), and
π(s,t), as vectorδ(l tree(s),l tree(t)), then Π and π
coincide with the standard sequence edit distance
and sub-sequence edit distance. As special cases
of ∆ and δ, Π and π inherit the triangle inequality
property.
To illustrate some of the tree-distance defini-
tions, in the following example, a ∆ distance of
3 between 2 trees is obtained, assuming unit costs
for deletions (shown in red and double outline), in-
sertions (shown in green and double outline), and
substitutions (shown in blue and linked with an ar-
row):
a
b
b
a
a
c
b
a
b b a
a
whole tree matching dist=3.0
Note also in this picture that nodes that are mapped
without a relabelling are shown at the same hori-
zontal level, with no linking arrow.
Figure 1 shows a sub-tree example – δ. The
source tree nodes which do not belong to the cho-
sen sub-tree are shown in grey. The lowest vp sub-
tree in the source is selected, and mapped to the
vp in the target. The remaining target nodes must
be inserted, but this costs less than a match which
starts higher and necessitates some deletions and
substitutions.
Figure 2 shows a sub-tree example where the
structural weighting STR has been used: size of
a node reflects the weight. 4 of the nodes in the
source represent the use of an auxiliary verb, and
receive low weight, changing the optimum match
to one covering the whole source tree. There is
some price paid in matching the dissimilar subject
nps.
process something
n pro
np
s
vp
rhs
lhs be rhs
lhs
vp
call be
memory
n
n
n
allocation
np
np
s
vp
lhs rhs
memory
n
n
n
allocation
np
sub tree matching dist=3.6
Figure 2: Structurally weighted example
Figure 3 continues the example, but this time
in the subject position there is a sub-tree which is
classified as a wild-card np tree, and it matches at
0 cost with the subject np in the source tree.
process
n
np np_wild
s
vp
rhs
lhs be rhs
lhs
vp
call be
memory
n
n
n
allocation
np
something
pro
s
vp
lhsrhs
memory
n
n
n
allocation
np
sub tree matching dist=1.6
Figure 3: Wild-card example
The basis of the algorithm used to calculate ∆
is the ZhangShasha algorithm (Zhang and Shasha,
1989): the Appendix summarises it. The im-
102
plementation is based on code implementing ∆
(Fontana et al., 2004), adapting it to allowing for
the δ and vectorδ variants and T(∗), S/λ, and λ/T pa-
rameters, and to generate the human-readable dis-
plays of the alignments (such as seen in figures 1,2
and 3).
2.3 Order invariant measures
Assessing answer/question similarity by variants
of tree distance or sequence edit-distance, means
that distance will not be word-order invariant.
There are also measures which are word-order in-
variant, sometimes called token-based measures.
These measures are usually couched in a vector
representation of questions and answers, where
vector dimensions are words from (some cho-
sen enumeration) of words (see (Salton and Lesk,
1968)). In the simplest case the values on each
dimensions are in {0,1}, denoting presence or ab-
sence of a word. If • is vector product and aw
is the set of words in a sequence a, then vectora •vectorb =
|aw ∩ bw|, for the binary vectors representing aw,
bw. Three well known measures based on this are
given below, both in terms vectors, and for binary
vectors, the equivalent formulation with sets:
Dice 2(vectora•vectorb)/(vectora•vectora) + (vectorb•vectorb))
= 2(|aw ∩bw|)/(|aw|+|bw|)
Jaccard (vectora•vectorb)/(vectora •vectora) +vectorb•vectorb−vectora•vectorb)
= (|aw ∩bw|)/(|aw ∪bw)
Cosine (vectora•vectorb)/(vectora •vectora).5(vectorb•vectorb).5
= (|aw ∩bw|)/((|aw|)0.5(|bw|)0.5)
These measure similarity, not difference, ranging
for 1 for identical aw,bw, to 0 for disjoint. In
the binary case, Dice/Jaccard similarity can be
related to the alignment-based, difference count-
ing perspective of the edit-distances. If we de-
fine Πw(a,b) as |aw ∪ bw|−|aw ∩ bw| – the size
of the symmetric difference between aw and bw –
this can be seen as a set-based version of edit dis-
tance5, which (i) considers mappings on the sets of
words, aw, bw, not the sequences a, b, and (ii) sets
replacement cost to infinity. A difference measure
(ranging from 0 for identical aw,bw to 1 for dis-
joint) results if Πw(a,b) is divided by |aw| +|bw|
(resp. |aw ∪bw|) and this difference measures will
give the reverse of a ranking by Dice (resp. Jac-
card) similarity.
The Cosine is a measure of the angle be-
tween the vectors vectora,vectorb, and is not relatable in the
5Πw(a,b) could be equivalently defined as |(vectora −vectorb)|2
binary-case to the alignment-based, difference-
counting perspective of the edit-distances: di-
viding Πw(a,b), the symmetric difference, by
|aw|.5|bw|.5 does not give a measure with maxi-
mum value 1 for the disjoint case, and does not
give the reverse of a ranking by Cosine similarity.6
Below we shall use θ to denote the Cosine dis-
tance.
3 The Question Answering Tasks
For a given representation r (parse trees, word se-
quences etc.), and distance measure d, we shall
generically take a Question Answering by Dis-
tance (QAD) task to be given by a set of queries,
Q, and for each query q, a corpus of potential an-
swer sentences, CORq. For each a ∈ CORq, the
system determines d(r(a),r(q)), the distance be-
tween the representations of a and q, then uses this
to sort CORq into Aq. This sorting is then evalu-
ated in the following way. If ac ∈ Aq is the correct
answer, then the correct-answer-rank is the rank
of ac in Aq:
| {a ∈ Aq : d(r(a),r(q)) ≤ d(r(ac),r(q))} |
whilst the correct-answer-cutoff is the proportion
of Aq cut off by the correct answer ac:
| {a ∈ Aq : d(r(a),r(q)) ≤ d(r(ac),r(q))} | / | Aq |
Lower values for this connote better performance.
Another figure of merit is the reciprocal correct-
answer-rank. Higher values of this connote better
performance.
Note the notion of answerhood is not one requir-
ing answers to be the sub-sentential phrases asso-
ciated with wh-phrases in the question. Also not
all the questions are wh-questions.
Note also that the set of candidate answers
CORq is sorted by the answer-to-query distance,
d(r(a),r(q)), not the query-to-answer distance,
d(r(q),r(a)). The intuition is that the queries are
short and the answers longer, with sub-part that re-
ally contains the answer.
The performance of some of the above men-
tioned distance measures on 2 examples of QAD
tasks has been measured:
GNU Library Manual QAD Task: in
this case Q is a set of 88 hand-created
6if the vectors are normalised by their length, then you
can show |(vectora/|vectora|−vectorb/|vectorb|)|2 reverses the Cosine ranking
103
queries, and CORq, shared by all the
queries, is the sentences of the manual
of the GNU C Library7 (| CORq |≈
31,000).
The TREC 11 QAD task: In this
case Q was the 500 questions of the
TREC11 QA track (Voorhees and Buck-
land, 2002), whose answers are drawn
from a large corpus of newspaper arti-
cles. CORq was taken to be the sen-
tences of the top 50 from the top-1000
ranking of articles provided by TREC11
for each question (| CORq |≈ 1000).
Answer correctness was determined us-
ing the TREC11 answer regular expres-
sions.
For the tree-distance measures, 2 parsing sys-
tems have been used. For convenience of refer-
ence, we will call the first parser, the trinity parser.
This is a home-grown parser combining a disam-
biguating part-of-speech tagger with a bottom-up
chartparser, refering to CFG-like syntax rules and
a subcategorisation system somewhat in a catego-
rial grammar style. Right-branching analyses are
prefered and a final selection of edges from all
available is made using a leftmost/longest selec-
tion strategy – there is always an output regardless
of whether there is a single input-encompassing
edge. Preterminal node labels are a combination
of a main functor with other feature terms, but the
replacement cost function R is set to ignore the
feature terms. Terminal node labels are base forms
of words, not inflected forms. For the structural
weighting algorithm, STR, the necessary node
distinctions are furnished directly by the parser for
vp, and by a small set of structure matching rules
for other structures (nps, pps etc). The structures
output for wh-questions are essentially deep struc-
tures, re-ordering an auxiliary inversion, and plac-
ing a tree in the position of a gap.
The Collins parser (Collins, 1999) (Model 3
variant) is a probabilistic parser, using a model of
trees as built top-down with a repertoire of moves,
learnt from the Penn Treebank. The preterminal
node labels are a combination of a Penn Tree-
bank label with other information pertaining to the
head/complement/adjunct distinction, but the re-
placement cost function R is set to ignore all but
the Penn Treebank part of the label. The termi-
7http://www.gnu.org
nal node labels are inflected forms of words, not
base forms. For the structural weighting algo-
rithm, STR, the necessary node distinctions are
furnished directly by the parser. For the question
parses, a set of transformations is applied to the
parses directly given by the parser, which compa-
rable to the trinity parser, re-order auxiliary inver-
sion, and place a tree in the position of a gap.
4 Relating Parse Quality to Retrieval
Performance
As a kind of sanity-check on the idea of the us-
ing syntactic structures in retrieving answers, we
performed some experiments in which we var-
ied the sophistication of the parse trees that the
parsers could produce, the expectation being that
the less sophisticated the parse, the less successful
would be question-answering performance. The
left-hand data in Table 1 refers to various reduc-
tions of the linguistic knowledge bases of the trin-
ity parser(thin50 = random removal of 50% subset,
manual = manual removal of a subset, flat = en-
tirely flat parses, gold = hand-correction of query
parses and their correct answers). The right-hand
data in Table 1 refers to experiments in which the
repertoire of moves available to the Collins parser,
as defined by its grammar file, was reduced to dif-
ferent sized random subsets of itself.
Figure 4 shows the empirical cumulative den-
sity function (ecdf) of the correct-answer-cutoff
obtained with the weighted sub-tree with wild
cards measure. For each possible value c of
correct-answer-cutoff, it plots the percentage of
queries with a correct-answer-cutoff ≤ c.
0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.4
0.8
gold
full
thin50
manual
flat
Figure 4: Success vs Cut-off for different parse settings:
x = correct-answer-cutoff, y = proportion of queries whose
correct-answer-cutoff ≤ x (ranking by weighted sub-tree
with wild cards) (Library task)
What these experiments show is that the ques-
104
Table 1: Distribution of Correct Cutoff across query set Q in different parse settings. Left-hand data =
GNU task, trinity parser, right-hand data = TREC11 task, Collins parser
Parsing 1st Qu. Median Mean 3rd Qu.
flat 0.1559 0.2459 0.2612 0.3920
manual 0.0215 0.2103 0.2203 0.3926
thin50 0.01418 0.02627 0.157 0.2930
full 0.00389 0.04216 0.1308 0.2198
gold 0.00067 0.0278 0.1087 0.1669
Parsing 1st Qu. Median Mean 3rd Qu.
55 0.3157 0.6123 0.5345 0.766400
75 0.02946 0.1634 0.2701 0.4495
85 0.0266 0.1227 0.2501 0.4380
100 0.01256 0.08306 0.2097 0.2901
tion answering performance is a function of the so-
phistication of the parses that the parsers are able
to produce.
5 Comparing Distance Measures
Table 2 gives results on the Library task, using the
trinity parser, for some variations of the distance
measure.
Considering the results in 2, the best perform-
ing measure (mrr = 0.27) was the sub-traversal
distance, vectorδ, assigning weights structurally using
STR, with lexical emphasis LEX, and treating a
gap position as an np wild card. This slightly out
performs the sub-tree measure, δ (mrr = 0.25).
An alternative approach to discounting parts of
the answer tree, allowing any sub-tree of the an-
swer the option to delete for free (∆(W = Str ◦
Lex,T(∗) = np gap,S/λ = ∀)) performs con-
siderably worse (mrr = 0.16). Presumably this is
because it is too enthusiastic to assemble the query
tree from disparate parts of the answer tree. By
comparison, vectorδ and δ can only assembly the query
tree from parts of the answer tree that are more
closely connected.
The tree-distance measures (vectorδ, δ) using struc-
tural weights, lexical emphasis and wild cards
(mrr = 0.27) out-perform the sub-sequence mea-
sure, π (mrr = 0.197). It also out-performs the
cosine measure, θ (mrr = 0.190). But π and θ
either out-perform or perform at about the same
level as the tree-distance measure if the lexical
emphasis is removed (see δ(W = Str,T(∗) =
np gap), mrr = 0.160).
The tree-distance measure δ works better if
structural weighting is used (mrr = 0.09) than
if it is not (mrr = 0.04).
The tree-distance measure δ works better with
wild-cards (see δ(W = Str,T(∗) = np gap),
mrr = 0.160, than without (see δ(W = Str),
mrr = 0.090).
Table 3 gives some results on the TREC11 task,
using the Collins parser. Fewer comparisons have
been made here.
The sub-traversal measure, using structural
weighting, lexical emphasis, and wild-cards per-
forms better (mrr = 0.150) than the sub-sequence
measure (mrr = 0.09), which in turn performs
better than the basic sub-traversal measure, with-
outh structural weighting, lexical emphasis or
wild-cards (mrr = 0.076). The cosine distance,
θ, performed best.
6 Discussion
For the parsers used, you could easily have 2
sentences with completely different words, and
very different meanings, but which would have the
same pre-terminal syntactic structure: the preter-
minal syntactic structure is not a function of the
meaning. Given this, it is perhaps not surpris-
ing that there will be cases that the sequence dis-
tance easily spots as dissimilar, but which the tree
distance measure, without any lexical emphasis,
will regard as quite similar, and this perhaps ex-
plains why, without any lexical emphasis, the tree-
distance measure performs at similar level to, or
worse than, the sub-sequence distance measure.
With some kind of lexical emphasis in place,
the tree-distance measures out-perform the sub-
sequence measures. We can speculate as to the
reason for this. There are two kinds of case
where the tree-distance measures could be ex-
pected to spot a similarity which the sequence-
distance measures will fail to spot. One is when
the question and answer are more or less simi-
lar on their head words, but differ in determiners,
auxiliaries and adjuncts. The sequence distance
measure will pay more of a price for these differ-
ences than the structurally weighted tree-distance.
Another kind of case is when the answer supplies
words which match a wild-card in the middle of
the query tree, as might happen for example in:
Q: what do child processes inherit from their par-
ent processes
A: a child process inherits the owner and permis-
sions from the ancestor process
105
Table 2: For different distance measures (Library task, trinity parser), distrution of correct-answer-
cutoff, mean reciprocal rank mrr
cutoff
distance type 1st Qu. Median Mean mrr
vectorδ(W = Str ◦Lex,T(∗) = np gap) 8.630-05 8.944-04 2.460-02 0.270
δ(W = Str ◦Lex,T(∗) = np gap) 9.414e-05 1.428e-03 7.133e-02 0.255
π bases 1.569e-04 2.087e-03 5.181e-02 0.197
θ bases 1.569e-04 8.630e-04 1.123e-02 0.190
∆(W = Str ◦Lex,T(∗) = np gap,S/λ = ∀) 4.080e-04 9.352-03 5.853-02 0.160
δ(W = Str,T(∗) = np gap) 3.923e-04 1.964e-02 1.162e-01 0.160
δ(W = Str) 5.060e-03 3.865e-02 1.303e-01 0.090
δ 1.324e-03 1.046e-01 1.852e-01 0.040
∆ 8.398e-02 2.633e-01 3.531e-01 0.003
Table 3: For different distance measures (TREC task, collins parser) the distribution of correct-answer-
cutoff and mean reciprocal rank (mrr)
cutoff
distance type 1st Qu. Median Mean mrr
θ forms 7.847e-03 2.631e-02 1.068e-01 0.167
vectorδ(W = Str ◦Lex,T(∗) = np gap) 8.452e-03 4.898e-02 1.558e-01 0.150
π forms 2.113e-02 7.309-02 2.051e-01 0.092
vectorδ 1.815e-02 1.030e-01 3.269e-01 0.076
The tree-distance measures will see these as
similar, but the sub-sequence measure will pay a
large price for words in the answer that match the
gap position in the query. Thus one can argue that
the use of structural weighting, and wild-card trees
in the query analysis will tend to equate things
which the sequence distance sees as dissimilar.
Another possible reason that the tree-distance
measure out-performs the sub-sequence measure
is that it may be able to distinguish things which
the sequence distance will tend to treat as equiva-
lent. A question might make the thematic role of
some entity very clear, but use very few significant
words as in:
what does malloc do ?
Using tree distance will favour answer sen-
tences with malloc as the subject, such as mal-
loc returns a null pointer. The basic problem for
the sequence distance here is that it does not have
much to work with and will only be able to parti-
tion the answer set into a small set of equivalence
classes.
These are speculations as to why tree-distance
would out-perform sequence distance. Whether
these equating and discriminating advantages
which theoretically should accrue to δ, vectorδ actually
will do so, will depend on the accuracy of the pars-
ing: if there is too much bad parsing, then we will
be equating that which we should keep apart, and
discriminating that which we should equate.
In the two tasks, the relationship between the
tree-distance measures and the order-invariant co-
sine measure worked out differently. The reasons
for this are not clear at the moment. One pos-
sibility is that our use of the Collins parser has
not yet resulted in good enough parses, especially
question parses – recall that the indication from
4 was that improved parse quality will give better
retrieval performance. Also it is possible that rel-
ative to the queries in the Library task, the amount
of word-order permutation between question and
answer is greater in the TREC task. This is also
indicated by the fact that on the TREC task, the
sub-sequence measure, π, falls considerably be-
hind the cosine measure, θ, whereas for the Li-
brary task they perform at similar levels.
Some other researchers have also looked at
the use of tree-distance measures in semantically-
oriented tasks. Punyakonok(2004) report work
106
using tree-distance to do question-answering on
the TREC11 data. Their work differs from that
presented here in several ways. They take the
parse trees which are output by Collins parser and
convert them into dependency trees between the
leaves. They compute the distance from query to
the answer, rather than from answer to query, us-
ing essentially the variant of tree-distance that al-
lows arbitrary sub-trees of the target to insert for
zero-cost. Presumably this directionality differ-
ence is not a significant one, and with distances
calculated from answers to queries, this would cor-
respond to the variant that allows arbitrary source
sub-trees to delete with zero cost. The cost func-
tions are parameterised to refer in the case of wild-
card replacements to (i) information derived from
Named Entity recognisers so different kinds of wh
wild-cards can be given low-cost replacment with
vocabulary categorised as belong to the right kind
by NE recognition and (ii) base-form information.
There is no way to make a numerical compar-
ison because they took a different answer corpus
CORq – the articles containing the answers sug-
gested by TREC11 participants – and a different
criterion of correctness – an answer was correct if
it belonged to an article which the TREC11 adju-
dicators judges to contain a correct answer.
Their adaptation of cost functions to refer to es-
sentially semantic annotations of tree nodes is an
avenue we intend to explore in future work. What
this paper has sought to do is to investigate intrin-
sic syntactic parameters that might influence per-
formance. The hope is that these parameters still
play a role in an enriched system.
7 Conclusion and Future Work
For two different parsers, and two different
question-answering tasks, we have shown that im-
proved parse quality leads to better performance,
and that a tree-distance measure out-performs a
sequence distance measure. We have focussed on
intrinsic, syntactic properties of parse-trees. It is
not realistic to expect that exclusively using tree-
distance measures in this rather pure way will give
state-of-the-art question-answering performance,
but the contribution of this paper is the (start of
an) exporation of the syntactic parameters which
effect the use of tree-distance in question answer-
ing. More work needs to be done in systematically
varying the parsers, question-answering tasks, and
parametrisations of tree-distance over all the pos-
sibilities.
There are many possibilities to be explored in-
volving adapting cost functions to enriched node
descriptions. Already mentioned above, is the pos-
sibility to involve semantic information in the cost
functions. Another avenue is introducing weight-
ings based on corpus-derived statistics, essentially
making the distance comparision refer to extrin-
sic factors. One open question is whether anal-
ogously to idf, cost functions for (non-lexical)
nodes should depend on tree-bank frequencies.
Another question needing further exploration is
the dependency-vs-constituency contrast. Interest-
ingly Punyakonok(2004) themselves speculate:
each node in a tree represents only a
word in the sentence; we believe that ap-
propriately combining nodes into mean-
ingful phrases may allow our approach
to perform better.
We found working with constituency trees that
it was the sub-traversal distance measure that per-
formed best, and it needs to be seen whether this
holds also for dependency trees. Also to be ex-
plored is the role of structural weighting in a sys-
tem using dependency trees.
A final speculation that it would be interesting
to explore is whether one can use feed-back from
performance on a QATD task as a driver in the
machine-learning of probabilities for a parser, in
an approach analogous to the use of the language-
model in parser training.

References
Michael Collins. 1999. Head-driven statistical models
for natural language parsing. Ph.D. thesis, Univer-
sity of Pennsylvania.
Walter Fontana, Ivo L. Hofacker, and Pe-
ter F. Stadler. 2004. Vienna rna package.
www.tbi.univie.ac.at/˜ivo/RNA.
V. I. Levenshtein. 1966. Binary codes capable of cor-
recting insertions and reversals. Sov. Phys. Dokl,
10:707–710.
Vasin Punyakanok, Dan Roth, and Wen tau Yih. 2004.
Natural language inference via dependency tree
mapping: An application to question answering.
Computational Linguistics.
G. Salton and M. E. Lesk. 1968. Computer evala-
tion of indexing and text processing. Journal of the
ACM, 15:8–36, January.
Ellen Voorhees and Lori Buckland, editors. 2002. The
Eleventh Text REtrieval Conference (TREC 2002).
Department of Commerce, National Institute of
Standards and Technology.
K. Zhang and D. Shasha. 1989. Simple fast algorithms
for the editing distance between trees and related
problems. SIAM Journal of Computing, 18:1245–
1262.
