Error-tolerant Tree Matching 
Kemal Oflazer 
Department of Computer Engineering and Information Science, 
Bilkent University, Ankara, TR-06533, ~1 urkey 
ko@cs, b ilkent, edu. t r 
Abstract 
This paper presents an efficient algo- 
rithm for retrieving from a database of 
trees, all trees that match a given query 
tree appro,imately, that is, within a cer- 
tain error tolerance. It has natural lan- 
guage processing applications in search- 
ing for matches in example-based trans- 
lation systems, and retrieval from lexi- 
cal databases containing entries of com- 
plex feature structures. The algorithm 
has been implemented on SparcStations, 
and for large randomly generated syn- 
thetic tree databases (some having tens 
of thousands of trees) it can associatively 
search \[or trees with a small error, in a 
matter of tenths of a second to few sec- 
onds. 
1 Introduction 
Recent approaches in machine translation known 
as example-based translation rely on searching a 
database of previous translations of sentences or 
fragments of sentences, and composing a trans- 
lation from the translations of any matching ex- 
amples (Sato and Nagao, 1!)90; Nirenburg, Beale 
and l)omasnhev, 1994). The example database 
may consist, of paired text fragments, or trees as 
in Sat() and Nagao (1990). Most often, exact 
matches for new sentences or fragments will not 
be in the database, and one has to consider exam- 
pies that are "similar" to the sentence or fragment 
in question. This involves associatively searching 
through the database, tbr trees that are "close" to 
the query tree. This paper addresses the compu- 
tational problem o\[ retrieving trees that are close 
to a given query tree in terms of a certain distance 
metric. 
The paper first presents the approximate tree 
matching problem in an abstract setting and 
presents an algorithm for approximate associative 
tree matching. The Mgorithm relies on lineariz- 
ing the trees and then representing the complete 
database of trees as atrie structure which can 
be efficiently searched. The problem then reduces 
to sequence correction problem akin to standard 
spelling correction problem. The trie is then used 
with an approximate finite state recognition al- 
gorithm close to a query tree. Following some ex- 
perimental results from a number of synthetic tree 
databases, the paper ends with conclusions. 
2 Approximate Tree Matching 
In this paper we consider the problem of searching 
in a database of trees, all trees that are "close" 
to a given query tree, where closeness is defined 
in terms of an error metric. The trees that we 
consider have labeled terminal and non-terminal 
nodes. We assume that all immediate children of 
a given node have unique labels, and that a total 
ordering on these labels is defined. We consider 
two trees close if we can 
• add/delete a small number of leaves to/from 
one of the trees, and/or 
• change the label of a small number of leaves 
in one of the trees 
to get the second tree. A pair of such "close" trees 
is depicted in Fignre 1. 
2.1 Linearization of trees 
Before proceeding any fllrther we would like to 
define the terminology we will be using in the fob 
lowing sections: We identify each leaf node in a 
tree with an ordered vertex list (re, vl, v2, ..., vd) 
where each vi is the label of a vertex from the root 
v0 to the leaf Vd at depth d, and :{'or i > 0, vi is the 
parent of vi+ L. A tree with n leaves is represented 
by a vertex list sequence. VLS =.Vi,V'e,...,¼, 
where each V~. = v3o, v{, v~, v~, . •., va,;, corresponds 
to a vertex list for a leaf at level dj. This se- 
quence is constructed by taking into account the 
total order on the labels at every level, that is, 
17i is lexico.qraphically less than Vi+l, based on the 
total ordering of the vertex labels. For instance, 
the first tree in Fignre 1 would be represented by 
the vertex list sequence: 
860 
S 
NP 
I)ct N P V 
I ~ I A A(I.j N chased 
I I black cat 
VP 
NP 
Det N P 
the A<I.j N 
I I little mouse 
j~-.. 
N \]) V P 
\])ct NI ) I \] V NP 
A IN I J~ 
l ;,t;<: / l)ct N P 
cat I //~- 
the Adj -N 
I I 
brown \]IIO/ISC 
I,'igure 1: Trees tha.t ;I.Fe cl >S<; to eac.h other. 
((S,NP,Det,a), 
(S,NP,NP,Adj,black), 
(S,NP,NP,N,cat), 
(S,VP,NP,Det,the), 
(S,VP,NP,NP,Adj,little), 
(S,VP,NP,NP,N, mouse), 
(S,VP,V,chased) 
assuming the normal lexicogra.phie ordering on 
i,o(lc na~lles. 
2.2 Distance between two trees 
We deline the distan<'e 1)etween two trees aeeor(1- 
ing to the struchrral diJl~,,,'cnces or differences in 
leaf labels. We consider an extra or a missing leaf 
a.s a structural change. If, however, both trees 
Itave a leaw~s whose vertex lists match in all but 
the last (leaf vertex) lat>e\], we. <:onsi<ler this as a 
dil\[erence in leaf lal>cls. For instance, in I,'igm'e 2, 
(ihere is extra, leaf in tree (I)) in <,Oml)a.rison to the 
tree in (a), while tree (c) has a leaf label diffc,,- 
ence. We a.sso<:iate the f'ollowing costs associated 
with these <lifl'erences: 
• If I>oth trees have a. lea\[' whose verl;ex list 
matches in all but the last (leaf w.'l:tex) ta-- 
bel, we assign a label <lill~rence error of C. 
• \[\[' a certa,in leaf is missing in one of" the trees 
but exists in the other one, then we assign a 
<:ost S for this a structural dilI'erence. 
We <'urrently treat all structural or leaf label <\]if:- 
fere,<:es as incurring a. cost that is indel>endent of 
the tree level at whi<'h \];he difference takes i)lacc. 
;t 
b e 
;-~ c k 
I 
X 
O0 
h (1 e 
• ~ C k 
I 
X 
(h) 
& 
\]) C jq--.... 
a f k 
I 
X 
0) 
Figure 2: Structural and lea\[ label <lifl'erences he= 
tween trees. 
If, however, ditl~rences that ar0. closer to the root 
of the tree are considered to b(' more serious than 
differences further away \[\]:on~ the root, it is \]>os-- 
sible to mo<lify the formulation to take this into 
~tCCOtl nt. 
2,3 Conw'xting a set of trees into a the. 
A h'ee database l) <:onsists of a set o\[' trees 
'/~, "1~, •.., 5/~., ea.ch "1) being a vertex list sequ<mce 
for a tree. Once we convert all the trees to a linear 
form, we haw: a set; o\[" vertex list sequences. We 
can convert this set into a trie data structure. This 
trie will compress ;-'~l\]y l>ossible redundancies in the 
prefixes of the vertex list; sequences to achieve a. 
certain ('ompa<'tion which hell>s during searching. \] 
For insta.nce, the three trees in F\[gttre 2 can 
I>e re4>resente<l as a trie as shown in Figm'e, 3. 
The edge labels along the t>ath to a h'af when 
concat<'.nate<l in order gives the vertex list se- 
quence for a tree, e.g., ((a,b,a,x), (a,b,c), 
(a,b,k), (a,e)) repr<;sents the tree (a) il) Fig- 
ure ~. 
t Note that i~ is possible to obtain more spa<:c re- 
duction by aJso sharing any common postflxes of Lhe 
vertex labe\] sequences using a directed acy<:lic graph 
representation and not a. trie, but this does not ira- 
prow:' the execution time. 
861 
l a,I),&,x) 
b 2 4.b, 
'1'1 eea I Tree c 
I 
'l'ree b 
Figure 3: 'l'rie representation of the 3 trees in Fig- 
ure. 2 
2.4 Error-tolerant, matching in the trie 
Our concern in this work is not the exact match 
of trees but rather approximate match. Given the 
vertex list sequence for a query tree, exact match 
over the trie can be performed using the stan- 
dard t;ech niques by fbllowing the edge labeled with 
next vertex list until a loft in the trie is reached, 
~-md the query vertex label sequence is exhausted. 
For approximate tree matching, we use the error- 
tolerant approximate tinite-state recognition al- 
gorithm (Oflazer, 1996), which tinds all strings 
within a giwm error threshold of some string in 
the regular set accepted by the underlying finite- 
state acceptor. An adaptation of this algorithm 
will be briefly summarized here. 
hh:ror-tolerant matching of vertex list sequences 
requires an errol: inetric for measuring how rnuch 
two such sequences deviate from each other. The 
distance between two sequences measures the min- 
imum number of insertions, deletions and leaf la- 
bel changes that are necessary to convert one tree 
into another. It should be noted that this is dif- 
ferent fl:om the error metric defined by (Wang el 
M., 1994). 
Let Z = Z1, Z.~,..., Zp, denote a generic vertex 
list sequence of p vertex lists. Z\[j\] denotes the ini- 
tim subsequence of Z up to and including the ju~ 
leaf label. We will use X (of length rn) to denote 
the query vertex list sequence, and Y (of length n) 
to denote the sequence that is a (possibly pattie.I) 
candidate vertex list sequence (from (;he database 
of trees). 
Given two vertex list sequences X and Y, 
the distance, disffX\[m\], Y\[n\]), computed accord- 
ing to the recurrence below, gives the minimum 
number of leaf insertions, deletions or lea\[' label 
(:hai~ges necessary to change one tree to the other. 
dist(X\[m\], Y\[n\]) = dist(X\[m- 1\], Y\[n- 1\]) 
if x,~, = y,,. 
(last vertex lists a.re sa.me) 
: ,ti.<x\[.~ - l\], z\[,~ - ,\]) + c' 
if x., a, nd y,~ 
differ only ~tt the 
lea.f l~tbel 
= dist(X\[rn - 11, Y\[n\]) + ,'-,' 
if y,, < x,,(lexicographica.lly) 
X is missing leaf #,,. 
= ,ti,~t(X\[,,4, Z b - I\]) +,S' 
if xm < y~(lexicogra.phica.lly) 
X has ~n extra lc~ff a: .... 
Boundary Conditions 
dist(X\[O\],Z\[n\]) = ,~. S 
dist(X\[m\],Y\[O\]) : m.,5' 
For a tree database D and at distance threshold 
t > O, we consider a query tree represented by a 
wertex list sequence X\[m\] (not in the database) to 
match the database with an error of t, if the set 
C : {r\[",\]l Y\[",\] < 10 and distX\[,,~\], Yb\]) -< t} 
is not empty. 
2.5 An algorithm for approximate tree 
mat ehing 
Standard searching with a trie corresponds to 
traversing a path starting t}om the start node (o\[' 
the trie), to one of tlle lea\[' nodes (of the trie), so 
that the concatenation of the labels on the arcs 
along this path matches the input vertex list se- 
quence. For error-tolerant matching, one needs to 
lind all paths from the start node lo one of the 
final nodes, so lhat wh.en lhe labels on the edges 
along a path are concatenated, lhc resulting "verlea; 
list sequence is within a given dislance lh, rcshold 
t, of the query vertex list sequence. 
This search has to be very fast if apl)roximate 
matching is to be of any practical use. This means 
that paths in the trie that can lead to no solutions 
have to be pruned so that the search can be lim- 
ited to a very small percentage of the search space. 
We need to make sure that any candidate (1)re- 
fix) vertex list sequence that is generated as the 
search is being p'erfbrmed, does not deviate from 
certain initial subsequences of" the query sequence 
by more than the allowed threshold. To detect 
such cases, we use the notion ol'a cnl-off distance. 
The cut-off distance measures the minimum dis- 
lance between an initial subsequence of the query 
862 
sequence sequel:it(% a.nd the (possibly partial) can- 
(lidate soqtlOll(-(L I,et Y he ;~ l)a.rtial candi(lato se- 
<,llleric(~ whose lmagth is n, and le, t X be tl/c query 
soqll('\[lC(~ O\[ hmgth m.. I,c't l= lnin(l,n,- LZ/M\]) 
a,,(i ,, = ,~,~×(,,,, ,+ + \[Z/iV/i) wl, e,:o a4 is ti,(, ((,so 
of ittsol:tions nnd deloi;ions. 'l'h(~ cut+ol+f distance 
c,,:d/.s+t(X\[r,,\], +7\[,,\]) is defino(l a.s 
..:d/.~(X\[,,.\], r\[,,\]) : mh, d.i.v,(x \[:\], <\[,,\]). /<i<.u 
Note l;hat; ex('ept; at the boulldarios, the iuitial 
subscquonces of the query soquence X considorexl 
,,.,; or ,(;,,gt,, EWe41 i,o ,o.gt, ,, + \[:/A4\]. A,,y 
initial sul)scquonce of X shortor tha.tl I ,loods .,oro 
IJmn LI/M\] l,~af nodo i(isertions, ;rod nny itiitial 
stll)string of X loilger tha.n "u ro(ltfires nlore t\[ta.n 
I-\[I/M\] h',a.f no(h: (\[cletions, to a.t bast equal Y in 
Iougth, violating the dist;mee constrMnt. 
Givcu a. vcrl.ex list se, qlw, n(:o X (corresponding 
to a, (tuery l;reo), a lm.rtial ca.ndidate seqllenco Y 
is geuorat(xl I)y su(:c(~ssively (:ollcaten;~ting labels 
a.loug tire ~u'cs as tt:+msitions a, ro tn~t(le, sta.rting 
with l;ho start state. Wltolmvor wo extcn(t Y go- 
ing a, long the trio, we chock if the cut-off distmwo 
of X and the i~artial Y is within the botu,I slu'c- 
ifiod by the threshold /. If the c,t-olf distnllco 
goes l)oyoud I, ho throshol(l, the lasl; edgo is Imcl(o<l 
off' to tim source nodo (in p+u'a,lM with the short- 
(',hint 0\[' Y) ~,i(l some other o(Ige is t;ried, l\]a.cl¢- 
tr:-tcMng is t:e(:ursively apl)lie(l when tit('\] semr(:h can 
.c.1; l)e contimled from tlmt nodo. If, during tho 
('c.l)~sl, ruetion of Y, a, tormin;q node (which ma.y 
or llmy .ot l)o a, leaf of the trie) is reached with- 
out viola.tittg l, hc (:utoff (listan(:e co.stmail,t, a,n(I 
d',:.~t(X\[,,4, Y\[,\]) < t at t,i~t poll,,, tti,;,i V is ++ 
tr(+c in t.h(" (l+~ta.l>asc t.hat tt,aJ.chos th(' iNl)Ut (It,ory 
S(XlUOnCC. 2 
I)(!noLhig tile nodes of the trio }>y subs('rilfl;e(l (l'S 
(qo being the inil;ial uode (e.g., top node in Figure 
3)) a,n(| the la.bols of tl:lo edges l)y V, a, nd denoting 
by 8(qi, 17) the taodo in IJto t, rie that oiie Ca, ll reach 
\[\[rOlll llo(l(' qi with edgo la, bol V ((|ellotillg :,t vortex 
list,), wo l)rcsettt, in I,'igurc /I, the a, lgorithut \[Lr 
gonera, thlg a.ll Y's by a (slightly tnodifiod) dopllh- 
first probing or l, he trio. '\['he cru(:ia.\] point ill l;his 
a, lgoril, hln is tha, t tile cut-eli (listauco conil)ut;t,l;ion 
c;i,o be per\['ortncd very ofticiontly by ui~tintainhig 
;1 ilia, trix II whhth is ;i, al ill, \])y It illa.i, rix with el- 
,.,,,o,,t//(i,j) = d/,~:.(x\[,:\], Y\[./\]) (,),t +.t<i (:i,at.g, 
1992). We (:~ttt t-tote that the (',OlUt)Ul;~tion or l, he 
olettic;nt II (i + 1, j q- 1 ) recursJvely de.ponds on only 
//(i, j), II (i, .7-i-i), 11 (i+ 1, j) f,.o,u the earlier dof 
initiou of tlic edit disl,anco (see l,'iguro 5.) \])uring 
the dopl, h first, so~u'c,}i ()t' the t('i% ont;rios in cohnnn 
'n, o1' the lna, trix 11 \]ia,vo I;o I)o (ro)contl:)ut;ed , ottiy 
when the ('an(li(hd;e sl;rilig is o\[ Ioligth n. \])urhlg 
Imx'kt, rax:king, I, ho entries for the last coititlill are 
2Nol, e tl,;Lt wc ha, vc to do this chock since we may 
coinc to other irreleva.nt, tcrminat nodes during I.he 
,q(}aYch. 
/*push empty candidate, and start 
node to start search */ 
P,t.~h.(( ' , q0 )) 
while stack not empgy 
begin 
l,op((Y',qi)) /* pop partial sequence Y' 
and the node */ 
for all qj and V such that 6(qi, 'l/) : qa 
begin /* extend the candidate sequence */ 
Y = conc:~tt(Y', V) 
/* u is the current length o:17 Y */ 
/* check if g has deviated too much, 
if not push */ 
i* .',a.,Zi.~l(X\['.,.\], Yb\]) -< t then p',t.~h((< q,)) 
/* also see if we are at a filial state */ 
i* ,Z/s~.(X\[,,,,\], Y\[,,.\]) < : and 
q,i is a terminal node then output V 
end 
end 
Figure 4: Algoridmi for error-tolerant recognition 
o\[' vertex lis~ sequences 
: 
ii "i},:i) ,('/,.i + l) .. .(/+J,j) ,:(/+~,/+:l) 
Figure 5: (k)uqmt;M;iou of the elo.nionts of t,ho II 
maJ, rix. 
disca.rdod, but the entries in prior c ohumls m:o still 
valid. Thus all enl;ries required by It (i + 1, ,7" -I- 1), 
except I\[(i, j + \]), axe nlre~dy awLihtble in the ma- 
trix in cohlmns i - \] a.nd i. The conaputation of 
c'uldisl,(X\[,t,\], Y\[n\])invo|vcs ~ loop in whioh the 
minhntun is colul)uted. 'l'his loop (hldexing along 
coh,m,, .7' + 1) co,np,,tos ll(i,j + 1) bcfo,'e it is 
needed for the computaiAon ol7 ll(i + l,j + 1). 
3 Experiniental Results 
W(; hamo cxperinl(;nLed with 3 synthcticly goner 
a.tod sots of trees with the propeJ'tios given in 'l'+> 
I)lc 1. lit this l;Mqe, {.tie third cohllilll (label ALl') 
gives the ~tvera.gc rat, to of the vertices at each level 
which a.ro ra.ndomly soie(;ted as lea\[ vortices in ;t 
tree. '\['hc' \['ourth column gives the trl~xinmirl nltni- 
bet of children that a uon-lea£ node lna.y h~tvo. 
Tile \[a.st column gives the maxinnnn depth of the 
trees in rite, t, d~ttal)~LSO. 
From I, heso synthetic ,.I,nb'-dm, ses, we ra.ndo\]nly 
oxtra.ctod 100 trees arid the, perturbed thcnl 
with ramlom leaf deletions, insertions and la.bol 
changes so that l;\]ioy were o\[' some (listmlce l'ron~ ~t 
863 
Database 
Number |ALP Max I Max 
of ~ Children Depth 
Trees 
l,ooo ! 1/3 8 I 5 
10,000 \[ 1/2 16 I 5 a0,000/1/2 s__L__A 
Table l: Properties of the synthetic databases of 
rees 
tree in the originaJ tree. We used thresholds t = 2 
and t = 4, allowing an error of C = 1 for each le~ff 
label change and an error of S = 2 for each inser- 
tion or deletion (see Section 2.2). We then ran ore' 
algoridnn on these data sets and obtained perfof 
mance information. All runs were performed on a 
Sun SpareStation 20/61 with 128M real memory. 
The results are presented in '\].'able 2. It can be 
Data-\] Thres- 
base hold 
1 2 
4 
2 2 
4 
3 2 
4 
Avg. \[ Avg. Avg. 
I,eaves/ \]Search 'I~'ees 
Query Time Found/ 
'\['ree (Msec) Query 
12.00 65 1.96 
12.42 81 16.65 
24.65 990 3.32 
25.62 1,659 31.59 
10.45 2,550 13.63 
10/15 3,492 68.62 
'Fable 2: Performance results for the approximate 
tree matching algorithm. 
seen that the approximate search algorithm is very 
fast for the set; of synthetic tree d;~tabases that we 
have experimented with. It certainly is also possi- 
ble that additional space savings can be achieved 
if directed acyclic graphs can be used to represent 
the tree database taking into account both com- 
lnon prefixes and common suffixes of vertex list; 
sequences. 
5 Acknowledgments 
This research was in part funded by a NATO Sci- 
ence for Stability Phase III Project Grant - TU- 
LANGUAGE. 
References 
M .W. Du and S. C. Chang. \]992. A model and 
a fast Mgorithm for multiple errors spelling cor- 
rection. Acta lnformatica, 29:281- 302. 
Sergei Nirenbm:g, Stephen Beale, and Constantine 
Domashnev. 1994. A Full-text l,\]xperiment in 
Example-based 'l}anslation. In Proceedings of 
lhe International Conference on New Methods 
in Language Processing, Manchester, UK, Pages 
78 87. 
Kemal Oflazer. 1996. Error-tolerant Finite-state 
Recognition with Applications to Morphologi- 
cal Analysis and Spelling Correction, Compu- 
tational Linguistics, Vo1:22, No:l. 
Satoshi Sat() and Makoto Nagao. 1990. rIbwards 
Memory-based Translation. In Proceedings of 
COLING'90 Vol.3, Pages 247 252. 
Jason Tsong4,i Wang, Kaizhong Zhang, Karpjoo 
Jeong, and Dennis Shasha. 1994. A System \['or 
Approximate 'Dee Matching. In IEEI'; 7}'ansac- 
tions of Knowledge and Data Engineering Vol. 
6, No. 4, August, Pages 559 570. 
4 Conclusions 
'l'his paper has presented an algorithm ibr ap- 
proximate associative tree matching that can be 
used in example-based machine translation appli- 
cations. The algorithm et\[iciently searches in a 
database of trees, all trees that are "(:lose" to a 
given query tree. The algorithm has been ilnple- 
mented on Sun Sparcstations, and experiments on 
rather large synthetic tree database indicate that 
i t ('an perform N)proximate nmt ehes within tenths 
of a second to few seconds depending on the size 
ot' the. database and the error that the search is 
allowed to consider. 
864 
