Deriving Transfer Rules from Dominance-Preserving Alignments 
Adam Meyers, Roman Yangarber, Ralph Grishman, 
Catherine Macleod, Antonio Moreno-Sandoval†
New York University
715 Broadway, 7th Floor, NY, NY 10003, USA
†Universidad Autónoma de Madrid
Cantoblanco, 28049-Madrid, SPAIN
meyers/roman/grishman/macleod@cs.nyu.edu
sandoval@lola.lllf.uam.es
1 Introduction 
Automatic acquisition of translation rules from parallel sentence-aligned text takes a variety of forms. Some machine translation (MT) systems treat aligned sentences as unstructured word sequences. Other systems, including our own ((Grishman, 1994) and (Meyers et al., 1996)), syntactically analyze (parse) sentences before acquiring transfer rules (cf. (Kaji et al., 1992), (Matsumoto et al., 1993), and (Kitamura and Matsumoto, 1995)). This has the advantage of acquiring structural as well as lexical correspondences. A syntactically analyzed, aligned corpus may serve as an example base for a form of example-based MT (cf. (Sato and Nagao, 1990), (Kaji et al., 1992), and (Furuse and Iida, 1994)).
This paper¹ describes: (1) an efficient algorithm for aligning a pair of source/target language parse trees; and (2) a procedure for deriving transfer rules from this alignment. Each transfer rule consists of a pair of tree fragments derived by "cutting up" the source and target trees. A set of transfer rules whose left-hand sides match a source language parse tree is used to generate a target language parse tree from their set of right-hand sides, which is a translation of the source tree. This technique resembles work on MT using synchronous Tree-Adjoining Grammars (cf. (Abeille et al., 1990)).
The Proteus translation system learns transfer rules from pairs of aligned source and target regularized parses, Proteus's representation of predicate argument structure (cf. Figure 1).² Then it uses these transfer rules to map source language regularized parses generated by our source language parser into target language regularized parses. Finally, a generator converts target regularized parses into target language sentences.
¹ We thank Cristina Olmeda Moreno for work on parsing our Spanish text. This research was supported by National Science Foundation Grant IRI-9303013.
² Regularized parses (henceforth, "parse trees") are like F-structures of Lexical Functional Grammar (LFG), except that a dependency structure is used.
An alignment f is a 1-to-1 partial mapping from source nodes to target nodes. We consider only alignments which preserve the dominance relationship: If node a dominates node b in the source tree, then f(a) dominates f(b) in the target tree. In Figure 1, source nodes A, B, C and D map to the corresponding target nodes, marked with a prime, e.g., f(A) = A'. The alignment may be represented by the set {(A, A'), (B, B'), (C, C'), (D, D')}. We can assign a score to each alignment f, based on the (weighted) number of pairs in f; finding the best alignment translates into finding the alignment with the highest score. Our algorithms are based on (Farach et al., 1995) and related work.
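The dominance constraint can be checked mechanically. The following is a minimal Python sketch, not the Proteus implementation: the parent-pointer encoding, the function names, and the tree shapes (read off the running example of Figure 1, with E = calcular and F = trabajo) are assumptions made for illustration.

```python
from itertools import permutations

# Parent pointers for the two trees of Figure 1 (assumed shapes):
# source D(A, E(B, C(F))), target D'(A', B', C').
SOURCE_PARENT = {"A": "D", "E": "D", "B": "E", "C": "E", "F": "C"}
TARGET_PARENT = {"A'": "D'", "B'": "D'", "C'": "D'"}


def dominates(parent, a, b):
    """Return True if node a is a proper ancestor of node b."""
    node = parent.get(b)
    while node is not None:
        if node == a:
            return True
        node = parent.get(node)
    return False


def preserves_dominance(alignment, src_parent, tgt_parent):
    """Check that a set of (source, target) pairs is 1-to-1 and that
    a dominating b implies f(a) dominating f(b)."""
    f = dict(alignment)
    one_to_one = len(f) == len(alignment) and len(set(f.values())) == len(f)
    return one_to_one and all(
        dominates(tgt_parent, f[a], f[b])
        for a, b in permutations(f, 2)
        if dominates(src_parent, a, b)
    )


figure1 = {("A", "A'"), ("B", "B'"), ("C", "C'"), ("D", "D'")}
print(preserves_dominance(figure1, SOURCE_PARENT, TARGET_PARENT))

# E dominates B in the source, but A' does not dominate B' in the target.
violating = {("E", "A'"), ("B", "B'")}
print(preserves_dominance(violating, SOURCE_PARENT, TARGET_PARENT))
```

The first call accepts the alignment of Figure 1; the second rejects a hypothetical alignment that maps a dominating source node to a sibling in the target tree.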
We needed efficient alignment algorithms because: (1) corpus-based training requires processing a lot of text; and (2) an exhaustive search of all alignments is too computationally expensive for realistically sized parse trees.
Eliminating dominance violations greatly reduced our search space. Similar work (e.g., (Matsumoto et al., 1993)) considers all possible matches. Although our system cannot account for actual dominance violations in a given bitext, there are no such violations in our corpus, and many hypothetical cases can be avoided by adopting the appropriate grammar. Cases of adjuncts aligning with heads and vice versa are not dominance violations if we replace our dependency analysis with one in which internal nodes have category labels and the head constituents are marked by HEAD arcs, and we assume the following Categorial Grammar (CG) style analyses. Suppose that verb (V1) maps to adverb (A'1) and adverb (A2) maps to verb (V'2), where A2 modifies V1 and A'1 modifies V'2. We assume the following structures: [VP [VP1 V1 ...] A2] and [VP [VP2 V'2 ...] A'1]. No dominance violation exists because no dominance relation holds between V1 and A2 or V'2 and A'1. Y. Matsumoto (p.c.) notes that the subordinate clause of a source sentence may align with the main clause of a target sentence and vice versa, e.g., X after Y aligns with Y' before X', where X, X', Y and Y' are all clauses. Assuming a CG style analysis, [S X [after Y]] aligns with [S Y' [before X']] with no dominance violations.
[Figure 1: A Pair of Aligned Trees. Source tree: "Excel vuelve a calcular valores en libro de trabajo"; target tree: "Excel recalculates values in workbook".]
2 The Least-Common-Ancestor Constraint
Our earlier tree alignment algorithms (cf. (Meyers et al., 1996)) were designed to produce alignments which preserve the least common ancestor relationship: If nodes a and b map into nodes a' = f(a) and b' = f(b), then f(LCA(a, b)) = LCA(f(a), f(b)) = LCA(a', b'). The least common ancestor (LCA) of a and b is the lowest node in the tree dominating both a and b. The LCA-preserving approach imposes limitations on the quality of the resulting alignments. In Figure 1, the LCA-preserving algorithm will match node E with node D' and report that as the best match overall. The score S(D, D') would take into account only the match (E, D'), which in turn includes (B, B') and (C, C'). (S(D, D') would be penalized for collapsing the arc from D to E.)
We seek a better alignment scheme, in which the score S(D, D') could benefit from S(A, A'). We are willing to pay a small penalty to collapse the path from D to E, and align the resulting structure. This leads to new algorithms where the LCA-preserving restriction is replaced by the weaker, dominance-preserving constraint. The rationale behind allowing an edge, say (v, u), to be collapsed when matching two nodes v and v' is that we may find some children of u which correspond well to some children of v', while other children of v correspond well to other children of v'. (This is not possible if LCAs are preserved.) The algorithm relies on the assumption that two different children of v will not match well with the same child of v'.
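The difference between the two constraints can be illustrated on the trees of Figure 1. The sketch below is only an illustration under stated assumptions: the tree shapes are read off the running example, the helper names are hypothetical, and an LCA that has no image under the partial mapping f is treated as a violation, which is one possible reading of the constraint.

```python
# Parent pointers for Figure 1 (assumed shapes).
SOURCE_PARENT = {"A": "D", "E": "D", "B": "E", "C": "E", "F": "C"}
TARGET_PARENT = {"A'": "D'", "B'": "D'", "C'": "D'"}


def path_to_root(parent, node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def lca(parent, a, b):
    """Lowest node dominating both a and b (a node counts as its own ancestor)."""
    above_a = set(path_to_root(parent, a))
    return next(n for n in path_to_root(parent, b) if n in above_a)


def preserves_lca(alignment, src_parent, tgt_parent):
    """True if f(LCA(a, b)) == LCA(f(a), f(b)) for all aligned a and b.
    An unaligned LCA (f undefined) counts as a violation (assumption)."""
    f = dict(alignment)
    return all(
        f.get(lca(src_parent, a, b)) == lca(tgt_parent, f[a], f[b])
        for a in f
        for b in f
    )


lca_style = {("E", "D'"), ("B", "B'"), ("C", "C'")}
dominance_style = {("D", "D'"), ("A", "A'"), ("B", "B'"), ("C", "C'")}
print(preserves_lca(lca_style, SOURCE_PARENT, TARGET_PARENT))
print(preserves_lca(dominance_style, SOURCE_PARENT, TARGET_PARENT))
```

Under this reading, matching E with D' satisfies the LCA constraint, while the alignment that also covers (A, A') does not, because LCA(B, C) = E has no image.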
3 The Dominance-Preserving Algorithm
Let T and T' be the source and the target trees. We use a dynamic programming algorithm to compute, in a bottom-up fashion, the scores for matching each node in T against each node in T'. There are O(n^2) such scores, where n = max(|T|, |T'|) is the number of nodes in the trees. Let d(v) be the degree of a node v. We denote the children of v by v_i, i = 1, ..., d(v), and the arc (v, v_i) by $\vec{v}_i$.
For all pairs of nodes v ∈ T and v' ∈ T', the algorithm computes the score function S(v, v'). S(v, v') corresponds to the best match found between the subtrees rooted at v in T and at v' in T'. The values of S are stored in a |T| × |T'| matrix, also denoted by S. Initially, we fill the matrix S with undefined values, and invoke the procedure SCOREdom, described below, to compute S(root(T), root(T')), the score for matching the root nodes of the trees. During the computation of the score for the roots, the procedure recursively finds the best-scoring matches for all the nodes in the trees. This yields the best alignment of the entire trees.
Table 1(a) shows the values of S for the trees in Figure 1. Whenever we compute a score for internal nodes, we also record the best way of pairing up their children in Table 1(b).³ The alignment, implicit in these children pairings, is used in a later phase (Section 4) to recover the alignment for the entire trees.
³ Children pairings include child/child pairs and parent/child pairs: (D, D')'s pairing is {(A, A'), (E, D')}.
Procedure SCOREdom: For a pair of nodes (v, v'), recursively compute the score S(v, v'):
Construct an intermediate child-scoring matrix M = M(v, v') for the children of v and v'; the dimensions of M are (d(v) + 1) × (d(v') + 1). That is, the number of rows in M is one more than the number of children of v, and the number of columns is one more than the number of children of v'. We label row d(v) + 1 and column d(v') + 1 with a "*". Fill the matrix M:
1. For all i, j, where 1 ≤ i ≤ d(v) and 1 ≤ j ≤ d(v'), compute the corresponding entry in M:
   M_{ij} = S(v_i, v'_j) + Lex_{arc}(\vec{v}_i, \vec{v}'_j)
   The function Lex_node(v, v') ≥ 0 (used below) is the quality of translation, i.e. the measure of how closely the label (word) at source node v corresponds to the label at target node v' in the bilingual dictionary, and Lex_arc($\vec{v}$, $\vec{v}'$) ≥ 0 is the corresponding measure for arc labels.
2. Fill the last column as follows: for all i, where 1 ≤ i ≤ d(v), compute the entries:
   M_{i*} = S(v_i, v') - Pen(\vec{v}_i)
   Pen($\vec{v}_i$) ≥ 0 is the penalty for collapsing the edge $\vec{v}_i$, which depends on the value of the label of that edge.
3. Symmetrically, for all j such that 1 ≤ j ≤ d(v'), fill the last row with the entries:
   M_{*j} = S(v, v'_j) - Pen(\vec{v}'_j)
4. The entry M_{**} is disfavored: M_{**} = -\infty
For example, during the calculation of the scores S(D, D') and S(E, D') from Table 1, the corresponding matrices M(D, D') and M(E, D') are filled in as in Table 2. The proper values for the parameter functions used above, such as the penalty function Pen and the translation measures, are chosen empirically, and constitute the tunable parameters of the procedure. Normally, we will expect that the values of Lex_node will be much larger than the values of Lex_arc and Pen. In the example we used the following settings:
1. Lex_node = 100 for an exact translation, as for (A, A'), (B, B') and (C, C'), and 0 otherwise.
2. all values of Lex_arc are set to zero
3. all penalties Pen are set to 1
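The layout of the child-scoring matrix can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `child_scoring_matrix`, the child-list encoding of Figure 1, and the use of a lookup `S` built from the nonzero scores of Table 1(a) in place of the recursive calls are all assumptions, so only entries derived from those listed scores are meaningful here.

```python
import math

# Figure 1 as child lists (assumed shapes).
SRC = {"D": ["A", "E"], "E": ["B", "C"], "C": ["F"], "A": [], "B": [], "F": []}
TGT = {"D'": ["A'", "B'", "C'"], "A'": [], "B'": [], "C'": []}

PEN = 1  # setting 3: every collapsed edge costs 1

# Stand-in for the recursive scores: nonzero entries of Table 1(a).
TABLE_1A = {("A", "A'"): 100, ("B", "B'"): 100, ("C", "C'"): 100,
            ("D", "D'"): 299, ("E", "D'"): 200}


def S(v, w):
    return TABLE_1A.get((v, w), 0)


def lex_arc(src_child, tgt_child):
    return 0  # setting 2: arc labels contribute nothing


def child_scoring_matrix(v, w, src_children, tgt_children, score):
    """Build the (d(v) + 1) x (d(w) + 1) matrix M(v, w); the last row
    and the last column play the role of the "*" row and column."""
    rows, cols = src_children[v], tgt_children[w]
    # Steps 1 and 2: child/child entries, then the "*" column entry.
    M = [[score(r, c) + lex_arc(r, c) for c in cols] + [score(r, w) - PEN]
         for r in rows]
    # Steps 3 and 4: the "*" row, ending in the disfavored corner.
    M.append([score(v, c) - PEN for c in cols] + [-math.inf])
    return M


M = child_scoring_matrix("D", "D'", SRC, TGT, S)
for row in M:
    print(row)
```

Row 2 (child E) ends in S(E, D') - Pen = 199, the entry that lets S(D, D') benefit from collapsing the edge from D to E.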
Now, using the values in M, compute the score for matching v and v':

    S(v, v') = Lex_{node}(v, v') + \max_{P \in \mathcal{P}} \sum_{(i,j) \in P} M_{ij}    (1)

Here P is a legitimate pairing of v and its children against v' and its children. A legitimate pairing P is a set of elements of the matrix M that conform to the following conditions:
1. each row and each column of M may contribute at most one element to P, except that the row and the column labeled * may contribute more than one element to P
2. if P contains an element M_ij corresponding to the node pair (w, w'), and some child node u appears in the Children-Pairing for (w, w'), then the row or column of u may not contribute any elements to P.
We use $\mathcal{P} = \mathcal{P}(v, v')$ to denote the set of all legitimate pairings. There are O(d!) such pairings, where d is the greater of the degrees of v and v'. The summation in (1) ranges over all the pairs (i, j) that appear in a legitimate pairing P ∈ $\mathcal{P}(v, v')$. We evaluate this summation for all O(d!) legitimate pairings in $\mathcal{P}$, and then select the pairing P_best with the maximum score. P_best is then stored in the Children-Pairing matrix entry for (v, v').
Table 2 shows how scores are calculated. The best score for S(E, D') is 200, the sum of the scores for (B, B') and (C, C'). S(D, D') = 299 = S(A, A') + S(E, D') - 1, where 1 is the penalty for collapsing the edge from D to E.
(a) Final score matrix

Source    Target nodes
nodes     A'     B'     C'     D'
A         100    0      0      0
B         0      100    0      0
C         0      0      100    0
D         0      0      0      299
E         0      0      0      200
F         0      0      0      0

(b) Children-pairing matrix (non-empty entries)

(D, D'):  (A, A') (E, D')
(E, D'):  (B, B') (C, C')

Table 1: (a) A Final Score Matrix; (b) Children-Pairing Matrix

Source      Target children
children    1: A'    2: B'    3: C'    *: D'
1: B        0        100      0        99
2: C        0        0        100      99
*: E        0        99       99       -∞
The score S(E, D') = 100 + 100 = 200

Source      Target children
children    1: A'    2: B'    3: C'    *: D'
1: A        100      0        0        99
2: E        0        99       99       199
*: D        99       98       98       -∞
The score S(D, D') = 199 + 100 = 299

Table 2: Computing Child-Scoring Matrices

We can reduce the computation time of the max term in (1) if we do not consider all O(d!) pairings of the children of v and v'. Instead of exhaustively computing the maximal-scoring pairing P_best in (1), we can build it in a greedy fashion: successively choose the d highest-scoring, mutually disjoint pairs from the O(d^2) possible pairs of children of v and v'.
1. Initialize the set of highest scoring pairs: P_best ← ∅
2. P_best ← P_best ∪ {(i, j)}, where M_ij is the next largest entry in the matrix that satisfies both conditions 1 and 2 of legitimate pairings
3. Repeat the above step until no more pairs can be added to P_best, at most d times, where d = min(d(v), d(v')).
4. Compute the result:
   S(v, v') = Lex_{node}(v, v') + \sum_{(i,j) \in P_{best}} M_{ij}
The greedy algorithm aligns trees with n nodes and maximal degree d in O(n^2 d^2) time.
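The greedy procedure can be sketched in Python as follows. This is a minimal illustration, not the Proteus implementation: the child-list encoding of Figure 1, the names `score` and `pairing`, memoization via `lru_cache` in place of the explicit matrix S, and the choice to stop at non-positive entries are all assumptions. Lex_arc is omitted because it is zero in the example settings.

```python
from functools import lru_cache

# Figure 1 as child lists (assumed shapes) and the example's settings.
SRC = {"D": ["A", "E"], "E": ["B", "C"], "C": ["F"], "A": [], "B": [], "F": []}
TGT = {"D'": ["A'", "B'", "C'"], "A'": [], "B'": [], "C'": []}
EXACT = {("A", "A'"), ("B", "B'"), ("C", "C'")}
PEN, STAR = 1, "*"

pairing = {}  # Children-Pairing matrix: (v, v') -> chosen node pairs


def lex_node(v, w):
    return 100 if (v, w) in EXACT else 0


@lru_cache(maxsize=None)
def score(v, w):
    """Greedy variant of SCOREdom for source node v and target node w."""
    rows, cols = SRC[v], TGT[w]
    # Entries of M as (value, row, column, node pair). The corner entry
    # M** = -infinity can never be chosen, so it is left out.
    entries = [(score(r, c), r, c, (r, c)) for r in rows for c in cols]
    entries += [(score(r, w) - PEN, r, STAR, (r, w)) for r in rows]
    entries += [(score(v, c) - PEN, STAR, c, (v, c)) for c in cols]
    entries.sort(key=lambda e: e[0], reverse=True)

    used_rows, used_cols = set(), set()      # condition 1
    barred_rows, barred_cols = set(), set()  # condition 2
    best, total = [], 0
    limit = min(len(rows), len(cols))
    for value, r, c, pair in entries:
        # Assumption: a non-positive entry cannot improve the sum.
        if len(best) == limit or value <= 0:
            break
        sub_rows = {x for x, _ in pairing[pair]} & set(rows)
        sub_cols = {y for _, y in pairing[pair]} & set(cols)
        if r in used_rows | barred_rows or c in used_cols | barred_cols:
            continue
        if sub_rows & used_rows or sub_cols & used_cols:
            continue
        best.append(pair)
        total += value
        if r != STAR:
            used_rows.add(r)
        if c != STAR:
            used_cols.add(c)
        barred_rows |= sub_rows
        barred_cols |= sub_cols
    pairing[(v, w)] = best
    return lex_node(v, w) + total


print(score("D", "D'"), pairing[("D", "D'")])
print(score("E", "D'"), pairing[("E", "D'")])
```

On the running example this sketch reproduces S(D, D') = 299 and S(E, D') = 200, together with the children pairings given in footnote 3 and Table 1(b). Intermediate scores that the paper's tables do not spell out may differ from this sketch.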
4 Acquiring Transfer Rules 
This section describes the procedure for deriving 
transfer rules from aligned parse trees. 
First, the best-scoring alignment is recovered from the Children-Pairing matrix (Table 1(b)).⁴ Start by including the root node-pair in the alignment (here (D, D')). Then, for each pair (v, v') already in the alignment, repeat the following steps, until no more pairs can be added to the alignment: (1) look up the Children-Pairing for (v, v'); (2) for each pair in the children-pairing, if it does not include either v or v', add the pair to the alignment (e.g. (A, A'), etc.).
⁴ When sentences in the bitext have multiple parses, we align structure-sharing forests of trees. If one pair of trees has the highest scoring alignment, we acquire transfer rules from that alignment. When more than one pair of trees tie for the highest score, we acquire transfer rules from the set of pairs of aligned subtrees which are shared by each of these high scoring alignments.
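The recovery step can be sketched as a short traversal of the Children-Pairing entries. This is an illustration under assumptions: the dictionary encoding and the name `recover_alignment` are hypothetical, and the sketch assumes that a parent/child pair such as (E, D') is not added to the alignment itself but that its own children pairing is still followed, which is how (B, B') and (C, C') are reached in the running example.

```python
# Non-empty Children-Pairing entries for the running example (Table 1(b)).
CHILDREN_PAIRING = {
    ("D", "D'"): [("A", "A'"), ("E", "D'")],
    ("E", "D'"): [("B", "B'"), ("C", "C'")],
}


def recover_alignment(root_pair, children_pairing):
    """Collect aligned node pairs, starting from the pair of roots."""
    alignment = [root_pair]

    def expand(v, w):
        for x, y in children_pairing.get((v, w), []):
            # A pair that repeats v or w records a collapsed edge,
            # so it is not an alignment pair in its own right.
            if x != v and y != w:
                alignment.append((x, y))
            expand(x, y)

    expand(*root_pair)
    return alignment


final_alignment = recover_alignment(("D", "D'"), CHILDREN_PAIRING)
print(final_alignment)
```

For the example entries, the traversal collects (D, D'), (A, A'), (B, B') and (C, C'), and leaves E unaligned.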
In the running example, the final alignment (FA) is {(D, D'), (A, A'), (B, B'), (C, C')}. Based on this alignment we can "chop up" the trees into fragments, or substructures ((Matsumoto et al., 1993)), where each substructure of a tree is a connected group of nodes in the tree, together with their joining arcs. In Figure 1, dashed arrows connect aligned pairs of source and target substructures. These correspondences become our transfer rules.
For each pair of aligned nodes (v, v') in FA, there is a pair of substructures in Figure 1 such that v and v' are the roots of the source and target substructures. These substructures include all unaligned source and target nodes v_u and v'_u below v and v' which have no intervening aligned nodes y or y' dominating v_u or v'_u.
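The "chopping" of a tree into substructures can be sketched as follows for the source side. The child-list encoding of the source tree of Figure 1 and the function name `substructure` are assumptions for illustration; arcs are omitted to keep the sketch short.

```python
# Source tree of Figure 1 as child lists (assumed shape).
SRC = {"D": ["A", "E"], "E": ["B", "C"], "C": ["F"]}


def substructure(root, children, aligned):
    """Return root plus every unaligned node below it that is reachable
    without passing through another aligned node."""
    nodes = [root]
    for child in children.get(root, []):
        if child not in aligned:
            nodes += substructure(child, children, aligned)
    return nodes


aligned_source = {"D", "A", "B", "C"}  # source side of the final alignment
fragments = {v: substructure(v, SRC, aligned_source)
             for v in sorted(aligned_source)}
print(fragments)
```

With the final alignment of the running example, the unaligned node E joins the fragment rooted at D and F joins the fragment rooted at C, so that volver/calcular and libro/trabajo each form one source substructure.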
The transfer rules derived from Figure 1 may be written as follows:
1. <root : Excel> → <root : Excel>
2. <root : valores> → <root : values>
3. <root : libro, de : trabajo> → <root : workbook>
4. <root : volver, subj : x1, a : <root : calcular, obj : x2, en : x3>> → <root : recalculate, subj : Tr(x1), obj : Tr(x2), in : Tr(x3)>
Each substructure is represented as a list containing a root lexical item, and a set of arc-value pairs. An arc (role) al with head (value) h is written as al : h, where h is a fixed label (word), a substructure or a variable. If the source substructure has n of the leaves labeled with variables x1, ..., xn, the target will have n of the leaves labeled with Tr(x1), ..., Tr(xn), where Tr(x) is the lexical translation function. This general structure allows us to capture relations between multi-word expressions in the source and target languages.
5 Translation 
The described procedure for acquisition of transfer rules from corpora is the basis for our translation system. A large collection of transfer rules is collected from a training corpus. When new text is to be translated, it is first parsed. The source tree is matched against the left-hand sides of the transfer rules which have been collected. If a set of transfer rules whose left-hand sides match the parse tree is found, the corresponding target structure is generated from the right-hand sides of these transfer rules. Typically, several sets of transfer rules meet this criterion. They are ranked by their frequency in the training corpus. Once a target tree has been produced, it is converted to a word sequence by a target language generator. We have applied this approach to the translation of Microsoft Help files in English and Spanish. The sentences are moderately simple and quite parallel in structure, which has made the corpus suitable for our initial system development. To date, we have been using a training corpus of about 1,000 sentences, and a test corpus of about 100 sentences.
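The matching of left-hand sides and the generation of a target tree can be sketched on the four rules derived from Figure 1. This is a simplified illustration, not the system described here: the dictionary encoding of substructures, the `Var` class, and the function names are assumptions; the first matching rule is used instead of ranking rule sets by frequency; and Tr(x) is modeled by translating the bound subtree recursively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str  # x1, x2, ... on the left; Tr(x1), ... on the right


def leaf(word):
    return {"root": word}


RULES = [  # (source substructure, target substructure), after rules 1-4
    (leaf("Excel"), leaf("Excel")),
    (leaf("valores"), leaf("values")),
    ({"root": "libro", "de": "trabajo"}, leaf("workbook")),
    ({"root": "volver", "subj": Var("x1"),
      "a": {"root": "calcular", "obj": Var("x2"), "en": Var("x3")}},
     {"root": "recalculate", "subj": Var("x1"), "obj": Var("x2"),
      "in": Var("x3")}),
]


def match(pattern, tree, bindings):
    """Bind the variables of pattern to subtrees of tree; False on mismatch."""
    if pattern["root"] != tree["root"] or pattern.keys() != tree.keys():
        return False
    for arc, value in pattern.items():
        if arc == "root":
            continue
        if isinstance(value, Var):
            bindings[value.name] = tree[arc]
        elif isinstance(value, str):  # fixed label (word)
            if tree[arc] != leaf(value):
                return False
        elif not match(value, tree[arc], bindings):
            return False
    return True


def build(template, bindings):
    """Instantiate a right-hand side, translating each bound subtree."""
    out = {"root": template["root"]}
    for arc, value in template.items():
        if arc == "root":
            continue
        if isinstance(value, Var):
            out[arc] = translate(bindings[value.name])
        elif isinstance(value, str):
            out[arc] = leaf(value)
        else:
            out[arc] = build(value, bindings)
    return out


def translate(tree):
    for lhs, rhs in RULES:
        bindings = {}
        if match(lhs, tree, bindings):
            return build(rhs, bindings)
    raise LookupError("no transfer rule matches " + tree["root"])


source = {"root": "volver", "subj": leaf("Excel"),
          "a": {"root": "calcular", "obj": leaf("valores"),
                "en": {"root": "libro", "de": leaf("trabajo")}}}
print(translate(source))
```

Applied to the source parse of the running example, rule 4 matches at the root and rules 1 to 3 translate the three bound subtrees, giving a target tree rooted at recalculate.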
6 Evaluation 
Real evaluation of performance of MT systems is time consuming and subjective. Nevertheless, some evaluation system is needed to ensure that incremental changes are for the better, or at least are not detrimental. We measured the success of our translation by how closely we reproduced Microsoft's English (target language) text. Our evaluation procedure computes the ratio between (a) the size of the complement of the intersection of the sets of words in our translation and in the actual Microsoft sentence, i.e. the number of words found in only one of the two; and (b) the combined lengths of these two sentences. An exact translation gives a score of 0. If the system generates the sentence "A B C D E" and the actual sentence is "A B C F", the score is 3/9 (the length of D E F divided by the combined lengths of A B C D E and A B C F). The dominance-preserving version of the program produced output for 88 out of 91 test sentences. The average score for these 88 sentences was 0.29: 0.21 due to incorrect word matches and 0.08 due to failure to translate because insufficient confidence levels were reached. The LCA-preserving version produced output for only 83 sentences with an average score of over 0.30: about 0.23 due to incorrect word matches and about 0.08 due to insufficient confidence levels. This crude scoring technique suggests that the dominance-preserving algorithm improved our results: more sentences were translated with higher quality. One limitation of this scoring technique is that paraphrases are penalized. An imperfect score (even 0.20) may signify an adequate translation.
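The scoring measure can be written down in a few lines. The sketch below is one reading of the description above: the function name `mismatch_score` is hypothetical, words are split on whitespace, and unmatched words are counted as set members, so repeated words are counted once.

```python
def mismatch_score(ours, reference):
    """Words found in only one of the two sentences, divided by the
    combined length of both sentences (0 means an exact match)."""
    ours_words, ref_words = ours.split(), reference.split()
    unmatched = set(ours_words) ^ set(ref_words)  # symmetric difference
    return len(unmatched) / (len(ours_words) + len(ref_words))


example = mismatch_score("A B C D E", "A B C F")
print(example)
```

For the example in the text, the unmatched words are D, E and F and the combined length is 9, so the function returns 3/9.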

References 
A. Abeille, Y. Schabes, and A. K. Joshi. 1990. Using Lexicalized Tags for Machine Translation. In COLING90.
M. Farach, T. M. Przytycka, and M. Thorup. 1995. On the agreement of many trees. Information Processing Letters, 55:297-301.
O. Furuse and H. Iida. 1994. Constituent Boundary Parsing for Example-Based Machine Translation. In COLING94.
R. Grishman. 1994. Iterative Alignment of Syntactic Structures for a Bilingual Corpus. In Proceedings of the Second Annual Workshop for Very Large Corpora, Tokyo.
H. Kaji, Y. Kida, and Y. Morimoto. 1992. Learning Translation Templates from Bilingual Text. In COLING92.
M. Kitamura and Y. Matsumoto. 1995. A Machine Translation System based on Translation Rules Acquired from Parallel Corpora. In RANLP95.
Y. Matsumoto, H. Ishimoto, T. Utsuro, and M. Nagao. 1993. Structural Matching of Parallel Texts. In ACL93.
A. Meyers, R. Yangarber, and R. Grishman. 1996. Alignment of Shared Forests for Bilingual Corpora. In COLING96, pages 460-465.
S. Sato and M. Nagao. 1990. Toward Memory-based Translation. In COLING90, volume 3, pages 247-252.
