INVITED TALK 
Head Automata and Bilingual Tiling: 
Translation with Minimal Representations 
Hiyan Alshawi 
AT&T Research 
600 Mountain Avenue, Murray Hill, NJ 07974, USA 
hiyan@research.att.com 
Abstract 
We present a language model consisting of 
a collection of costed bidirectional finite 
state automata associated with the head 
words of phrases. The model is suitable 
for incremental application of lexical asso- 
ciations in a dynamic programming search 
for optimal dependency tree derivations. 
We also present a model and algorithm 
for machine translation involving optimal 
"tiling" of a dependency tree with entries 
of a costed bilingual lexicon. Experimen- 
tal results are reported comparing methods 
for assigning cost functions to these mod- 
els. We conclude with a discussion of the 
adequacy of annotated linguistic strings as 
representations for machine translation. 
1 Introduction 
Until the advent of statistical methods in the main- 
stream of natural language processing, syntactic 
and semantic representations were becoming pro- 
gressively more complex. This trend is now revers- 
ing itself, in part because statistical methods re- 
duce the burden of detailed modeling required by 
constraint-based grammars, and in part because sta- 
tistical models for converting natural language into 
complex syntactic or semantic representations are not
well understood at present. At the same time, lex- 
ically centered views of language have continued to 
increase in popularity. We can see this in lexical- 
ized grammatical theories, head-driven parsing and 
generation, and statistical disambiguation based on 
lexical associations. 
These themes -- simple representations, statisti- 
cal modeling, and lexicalism -- form the basis for 
the models and algorithms described in the bulk of 
this paper. The primary purpose is to build effec- 
tive mechanisms for machine translation, the oldest 
and still the most commonplace application of non- 
superficial natural language processing. A secondary 
motivation is to test the extent to which a non-trivial 
language processing task can be carried out without 
complex semantic representations. 
In Section 2 we present reversible mono-lingual 
models consisting of collections of simple automata 
associated with the heads of phrases. These head 
automata are applied by an algorithm with admissi- 
ble incremental pruning based on semantic associa- 
tion costs, providing a practical solution to the prob- 
lem of combinatoric disambiguation (Church and 
Patil 1982). The model is intended to combine the 
lexical sensitivity of N-gram models (Jelinek et al. 
1992) and the structural properties of statistical con- 
text free grammars (Booth 1969) without the com- 
putational overhead of statistical lexicalized tree- 
adjoining grammars (Schabes 1992, Resnik 1992). 
For translation, we use a model for mapping de- 
pendency graphs written by the source language 
head automata. This model is coded entirely as 
a bilingual lexicon, with associated cost parame- 
ters. The transfer algorithm described in Section 4 
searches for the lowest cost 'tiling' of the target 
dependency graph with entries from the bilingual 
lexicon. Dynamic programming is again used to 
make exhaustive search tractable, avoiding the com- 
binatoric explosion of shake-and-bake translation 
(Whitelock 1992, Brew 1992). 
In Section 5 we present a general framework for as- 
sociating costs with the solutions of search processes, 
pointing out some benefits of cost functions other 
than log likelihood, including an error-minimization 
cost function for unsupervised training of the pa- 
rameters in our translation application. Section 6 
briefly describes an English-Chinese translator em- 
ploying the models and algorithms. We also present 
experimental results comparing the performance of 
different cost assignment methods. 
Finally, we return to the more general discussion 
of representations for machine translation and other 
natural language processing tasks, arguing the case 
for simple representations close to natural language 
itself. 
2 Head Automata Language Models 
2.1 Lexical and Dependency Parameters
Head automata mono-lingual language models con- 
sist of a lexicon, in which each entry is a pair (w, m) 
of a word w from a vocabulary V and a head au- 
tomaton m (defined below), and a parameter table 
giving an assignment of costs to events in a genera- 
tive process involving the automata. 
We first describe the model in terms of the familiar 
paradigm of a generative statistical model, present- 
ing the parameters as conditional probabilities. This 
gives us a stochastic version of dependency grammar 
(Hudson 1984). 
Each derivation in the generative statistical model 
produces an ordered dependency tree, that is, a tree 
in which nodes dominate ordered sequences of left 
and right subtrees and in which the nodes have la- 
bels taken from the vocabulary V and the arcs have 
labels taken from a set R of relation symbols. When 
a node with label w immediately dominates a node 
with label w' via an arc with label r, we say that 
w' is an r-dependent of the head w. The interpre- 
tation of this directed arc is that relation r holds 
between particular instances of w and w'. (A word 
may have several or no r-dependents for a particular 
relation r.) A recursive left-parent-right traversal of 
the nodes of an ordered dependency tree for a deriva- 
tion yields the word string for the derivation. 
A head automaton m of a lexical entry (w, m) de- 
fines possible ordered local trees immediately dom- 
inated by w in derivations. Model parameters for 
head automata, together with dependency parame- 
ters and lexical parameters, give a probability dis- 
tribution for derivations. 
A dependency parameter 
$$P(\downarrow, w' \mid w, r')$$
is the probability, given a head w with a dependent 
arc with label r', that w' is the r'-dependent for this 
arc. 
A lexical parameter 
$$P(m, q \mid r, \downarrow, w)$$
is the probability that a local tree immediately dom- 
inated by an r-dependent w is derived by starting 
in state q of some automaton m in a lexical entry
(w, m). The model also includes lexical parameters
$$P(w, m, q \mid \triangleright)$$
for the probability that w is the head word for an 
entire derivation initiated from state q of automaton 
m. 
2.2 Head Automata 
A head automaton is a weighted finite state machine 
that writes (or accepts) a pair of sequences of rela- 
tion symbols from R: 
$$(\langle r_1 \cdots r_k \rangle, \langle r_{k+1} \cdots r_n \rangle).$$
These correspond to the relations between a head 
word and the sequences of dependent phrases to its 
left and right (see Figure 1). The machine consists 
of a finite set $q_0, \ldots, q_s$ of states and an action table
specifying the finite cost (non-zero probability)
actions the automaton can undergo. 
There are three types of action for an automaton 
m: left transitions, right transitions, and stop ac- 
tions. These actions, together with associated prob- 
abilistic model parameters, are as follows. 
Figure 1: Head automaton $m$ scans left and right
sequences of relations $r_i$ for dependents $w_i$ of $w$.
• Left transition: if in state $q_{i-1}$, m can write
a symbol r onto the right end of the current
left sequence and enter state $q_i$ with probability
$P(\leftarrow, q_i, r \mid q_{i-1}, m)$.
• Right transition: if in state $q_{i-1}$, m can write
a symbol r onto the left end of the current
right sequence and enter state $q_i$ with probability
$P(\rightarrow, q_i, r \mid q_{i-1}, m)$.
• Stop: if in state q, m can stop with probability
$P(\square \mid q, m)$, at which point the sequences are
considered complete.
For a consistent probabilistic model, the probabili- 
ties of all transitions and stop actions from a state q 
must sum to unity. Any state of a head automaton 
can be an initial state, the probability of a partic- 
ular initial state in a derivation being specified by 
lexical parameters. A derivation of a pair of symbol
sequences thus corresponds to the selection of an
initial state, a sequence of zero or more transitions
(writing the symbols) and a stop action. The probability,
given an initial state q, that automaton m
will generate a pair of sequences, i.e.
$$P(\langle r_1 \cdots r_k \rangle, \langle r_{k+1} \cdots r_n \rangle \mid m, q),$$
is the product of the probabilities of the actions 
taken to generate the sequences. The case of zero 
transitions will yield empty sequences, correspond- 
ing to a leaf node of the dependency tree. 
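The action table and the product-of-actions probability can be illustrated with a minimal sketch. The class name `HeadAutomaton`, the dictionary layout, the state names, and all probability values below are assumptions made for illustration; for simplicity the sketch also assumes at most one successor state per (state, direction, relation) triple, which the model itself does not require.

```python
from dataclasses import dataclass, field


@dataclass
class HeadAutomaton:
    # transitions[(state, direction, relation)] = (next_state, probability)
    transitions: dict = field(default_factory=dict)
    # stop[state] = probability of the stop action in that state
    stop: dict = field(default_factory=dict)

    def run(self, start, actions):
        """Return (probability, left_sequence, right_sequence) for a list of
        (direction, relation) actions followed by a stop action."""
        state, prob = start, 1.0
        left, right = [], []
        for direction, relation in actions:
            key = (state, direction, relation)
            if key not in self.transitions:
                return 0.0, None, None      # action absent from the action table
            state, p = self.transitions[key]
            prob *= p
            if direction == "L":
                left.append(relation)       # right end of the left sequence
            else:
                right.insert(0, relation)   # left end of the right sequence
        return prob * self.stop.get(state, 0.0), left, right


# Hypothetical automaton for a verb that may be transitive or intransitive.
# Outgoing probabilities from each state sum to one.
verb = HeadAutomaton(
    transitions={("q0", "L", "subj"): ("q1", 0.9),
                 ("q1", "R", "obj"): ("q2", 0.8)},
    stop={"q0": 0.1, "q1": 0.2, "q2": 1.0},
)
prob, left, right = verb.run("q0", [("L", "subj"), ("R", "obj")])
```

Calling `verb.run("q0", [])` gives the zero-transition case, whose empty sequences correspond to a leaf node.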
From a linguistic perspective, head automata al- 
low for a compact, graded, notion of lexical subcate- 
gorization (Gazdar et al. 1985) and the linear order 
of a head and its dependent phrases. Lexical param- 
eters can control the saturation of a lexical item (for 
example a verb that is both transitive and intran- 
sitive) by starting the same automaton in different 
states. Head automata can also be used to code a 
grammar in which states of an automaton for word 
w correspond to X-bar levels (Jackendoff 1977) for
phrases headed by w. 
Head automata are formally more powerful than 
finite state automata that accept regular languages 
in the following sense. Each head automaton defines 
a formal language with alphabet R whose strings are 
the concatenation of the left and right sequence pairs 
written by the automaton. The class of languages 
defined in this way clearly includes all regular lan- 
guages, since strings of a regular language can be 
generated, for example, by a head automaton that 
only writes a left sequence. Head automata can also 
accept some non-regular languages requiring coordi- 
nation of the left and right sequences, for example 
the language $a^n b^n$ (requiring two states), and the
language of palindromes over a finite alphabet. 
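As a rough illustration of the two-state automaton for $a^n b^n$, the sketch below simulates an unweighted head automaton over every split of a string into a left and a right sequence. The function name `accepts`, the tuple encoding of transitions, and the state names are assumptions, not notation from the model.

```python
def accepts(transitions, stop_states, start, string):
    """transitions: set of (state, direction, symbol, next_state) tuples.
    Left transitions extend the left sequence at its right end; right
    transitions extend the right sequence at its left end."""
    for split in range(len(string) + 1):
        left, right = string[:split], string[split:]
        # Agenda items: (state, left symbols written, right symbols written).
        agenda, seen = [(start, 0, 0)], set()
        while agenda:
            item = agenda.pop()
            if item in seen:
                continue
            seen.add(item)
            state, li, ri = item
            if li == len(left) and ri == len(right) and state in stop_states:
                return True
            for s, direction, symbol, nxt in transitions:
                if s != state:
                    continue
                if direction == "L" and li < len(left) and left[li] == symbol:
                    agenda.append((nxt, li + 1, ri))
                if (direction == "R" and ri < len(right)
                        and right[len(right) - 1 - ri] == symbol):
                    agenda.append((nxt, li, ri + 1))
    return False


# Two states: q0 writes 'a' on the left, q1 writes 'b' on the right.
ANBN = {("q0", "L", "a", "q1"), ("q1", "R", "b", "q0")}
```

Because a stop action is only available in `q0`, every `a` written on the left is paired with a `b` on the right, which is the coordination between the two sequences that a single-sequence finite state machine cannot express.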
2.3 Derivation Probability 
Let the probability of generating an ordered depen- 
dency subtree D headed by an r-dependent word w 
be $P(D \mid w, r)$. The recursive process of generating
this subtree proceeds as follows: 
1. Select an initial state q of an automaton m for
w with lexical probability $P(m, q \mid r, \downarrow, w)$.
2. Run the automaton m with initial state q to
generate a pair of relation sequences with probability
$P(\langle r_1 \cdots r_k \rangle, \langle r_{k+1} \cdots r_n \rangle \mid m, q)$.
3. For each relation $r_i$ in these sequences, select a
dependent word $w_i$ with dependency probability
$P(\downarrow, w_i \mid w, r_i)$.
4. For each dependent $w_i$, recursively generate a
subtree with probability $P(D_i \mid w_i, r_i)$.
We can now express the probability P(Do) for an 
entire ordered dependency tree derivation Do headed 
by a word w0 as 
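$$P(D_0) = P(w_0, m_0, q_0 \mid \triangleright)\; P(\langle r_1 \cdots r_k \rangle, \langle r_{k+1} \cdots r_n \rangle \mid m_0, q_0) \prod_{1 \le i \le n} P(\downarrow, w_i \mid w_0, r_i)\, P(D_i \mid w_i, r_i).$$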
In the translation application we search for the high- 
est probability derivation (or more generally, the N- 
highest probability derivations). For other purposes, 
the probability of strings may be of more interest. 
The probability of a string according to the model is 
the sum of the probabilities of derivations of ordered 
dependency trees yielding the string. 
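The recursive definition of derivation probability can be sketched as follows. The tree encoding, the parameter-table layout (`start`, `lexical`, `sequence`, `dependency`), the automaton names, and all numeric values are hypothetical; the sequence probability is looked up directly rather than computed by running an automaton.

```python
def tree_probability(tree, params, relation=None):
    """tree = (word, automaton, start_state, left_deps, right_deps), where
    each dependent list holds (relation, subtree) pairs in surface order."""
    word, m, q, left, right = tree
    if relation is None:
        p = params["start"][(word, m, q)]               # P(w, m, q | start)
    else:
        p = params["lexical"][(m, q, relation, word)]   # P(m, q | r, down, w)
    rels = (tuple(r for r, _ in left), tuple(r for r, _ in right))
    p *= params["sequence"][(m, q, rels)]               # P(<left>, <right> | m, q)
    for r, sub in left + right:
        p *= params["dependency"][(sub[0], word, r)]    # P(down, w' | w, r)
        p *= tree_probability(sub, params, r)           # P(D_i | w_i, r_i)
    return p


# Invented parameters for the two-word derivation "show flights".
params = {
    "start": {("show", "m_show", "q0"): 0.5},
    "lexical": {("m_noun", "q0", "obj", "flights"): 1.0},
    "sequence": {("m_show", "q0", ((), ("obj",))): 0.8,
                 ("m_noun", "q0", ((), ())): 0.5},
    "dependency": {("flights", "show", "obj"): 0.25},
}
leaf = ("flights", "m_noun", "q0", [], [])
root = ("show", "m_show", "q0", [], [("obj", leaf)])
p_root = tree_probability(root, params)
```

The probability of a string would then be the sum of `tree_probability` over all derivations yielding that string.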
In practice, the number of parameters in a head 
automaton language model is dominated by the dependency
parameters, that is, $O(|V|^2 |R|)$ parameters.
This puts the size of the model somewhere
between that of a 2-gram and a 3-gram model. The similarly
motivated link grammar model (Lafferty, Sleator
and Temperley 1992) has $O(|V|^3)$ parameters. Unlike
simple N-gram models, head automata models
yield an interesting distribution of sentence lengths. 
For example, the average sentence length for Monte- 
Carlo generation with our probabilistic head au- 
tomata model for ATIS was 10.6 words (the average 
was 9.7 words for the corpus it was trained on). 
3 Analysis and Generation 
3.1 Analysis 
Head automaton models admit efficient lexically 
driven analysis (parsing) algorithms in which par- 
tial analyses are costed incrementally as they are 
constructed. Put in terms of the traditional parsing 
issues in natural language understanding, "seman- 
tic" associations coded as dependency parameters 
are applied at each parsing step allowing semanti- 
cally suboptimal analyses to be eliminated, so the 
analysis with the best semantic score can be identi- 
fied without scoring an exponential number of syn- 
tactic parses. Since the model is lexical, linguistic 
constructions headed by lexical items not present in 
the input are not involved in the search the way 
they are with typical top-down or predictive parsing 
strategies. 
We will sketch an algorithm for finding the lowest 
cost ordered dependency tree derivation for an input 
string in polynomial time in the length of the string. 
In our experimental system we use a more general 
version of the algorithm to allow input in the form 
of word lattices. 
The algorithm is a bottom-up tabular parser 
(Younger 1967, Earley 1970) in which constituents
are constructed "head-outwards" (Kay 1989, Satta
and Stock 1989). Since we are analyzing bottom- 
up with generative model automata, the algorithm 
'runs' the automata backwards. Edges in the parsing 
lattice (or "chart") are tuples representing partial or 
complete phrases headed by a word w from position 
i to position j in the string: 
(w,t,i,j,m,q,c). 
Here m is the head automaton for w in this deriva- 
tion; the automaton is in state q; t is the dependency 
tree constructed so far, and c is the cost of the partial
derivation. We will use the notation $C(x \mid y)$ for
the cost of a model event with probability $P(x \mid y)$;
the assignment of costs to events is discussed in Section 5.
Initialization: For each word w in the input between
positions i and j, the lattice is initialized with
phrases
$$(w, \{\}, i, j, m, q_f, c_f)$$
for any lexical entry (w, m) and any final state $q_f$ of
the automaton m in the entry. A final state is one
for which the stop action cost $c_f = C(\square \mid q_f, m)$ is
finite.
Transitions: Phrases are combined bottom-up to 
form progressively larger phrases. There are two 
types of combination corresponding to left and right 
transitions of the automaton for the word acting as 
the head in the combination. We will specify left 
combination; right combination is the mirror im- 
age of left combination. If the lattice contains two 
phrases abutting at position k in the string: 
$$(w_1, t_1, i, k, m_1, q_1, c_1)$$
$$(w_2, t_2, k, j, m_2, q_2, c_2),$$
and the parameter table contains the following finite
cost parameters (a left r-transition of $m_2$, a lexical
parameter for $w_1$, and an r-dependency parameter):
$$c_3 = C(\leftarrow, q_2, r \mid q'_2, m_2)$$
$$c_4 = C(m_1, q_1 \mid r, \downarrow, w_1)$$
$$c_5 = C(\downarrow, w_1 \mid w_2, r),$$
then build a new phrase headed by $w_2$ with a tree $t'_2$
formed by adding $t_1$ to $t_2$ as an r-dependent of $w_2$:
$$(w_2, t'_2, i, j, m_2, q'_2, c_1 + c_2 + c_3 + c_4 + c_5).$$
When no more combinations are possible, for each 
phrase spanning the entire input we add the appro- 
priate start of derivation cost to these phrases and 
select the one with the lowest total cost. 
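Left combination can be sketched as a function over chart edges. The `Phrase` tuple, the key layout of the `costs` table, and the example values are assumptions made for illustration; a missing table entry stands for an infinite cost.

```python
from collections import namedtuple

Phrase = namedtuple("Phrase", "word tree i j automaton state cost")


def left_combine(p1, p2, r, q2_prev, costs):
    """Attach p1 as an r-dependent to the left of p2, running the head
    automaton of p2 backwards from its current state to q2_prev."""
    if p1.j != p2.i:
        return None                                       # phrases must abut
    keys = [
        ("left", p2.automaton, q2_prev, r, p2.state),     # left r-transition
        ("lex", p1.automaton, p1.state, r, p1.word),      # lexical parameter
        ("dep", p1.word, p2.word, r),                     # dependency parameter
    ]
    if any(k not in costs for k in keys):
        return None                                       # some cost is infinite
    cost = p1.cost + p2.cost + sum(costs[k] for k in keys)
    tree = ((r, p1.word, p1.tree),) + p2.tree             # new outermost left dependent
    return Phrase(p2.word, tree, p1.i, p2.j, p2.automaton, q2_prev, cost)


# Invented example: "cheap" attaches to "flights" as a modifier.
costs = {
    ("left", "m_noun", "q0", "mod", "q1"): 0.5,
    ("lex", "m_adj", "qa", "mod", "cheap"): 0.25,
    ("dep", "cheap", "flights", "mod"): 0.25,
}
cheap = Phrase("cheap", (), 0, 1, "m_adj", "qa", 1.0)
flights = Phrase("flights", (), 1, 2, "m_noun", "q1", 2.0)
combined = left_combine(cheap, flights, "mod", "q0", costs)
```

Right combination would mirror this function, with the dependent phrase on the right and a right transition cost in place of the left one.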
Pruning: The dynamic programming condition for 
pruning suboptimal partial analyses is as follows. 
Whenever there are two phrases 
$$p = (w, t, i, j, m, q, c)$$
$$p' = (w, t', i, j, m, q, c'),$$
and $c'$ is greater than c, then we can remove $p'$ because
for any derivation involving $p'$ that spans the
entire string, there will be a lower cost derivation 
involving p. This pruning condition is effective at 
curbing a combinatorial explosion arising from, for 
example, prepositional phrase attachment ambigui- 
ties (coded in the alternative trees t and t'). 
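The pruning condition amounts to keeping one lowest-cost phrase per key (w, i, j, m, q). A minimal sketch, assuming a hypothetical `Phrase` tuple for chart edges and invented example costs:

```python
from collections import namedtuple

Phrase = namedtuple("Phrase", "word tree i j automaton state cost")


def prune(phrases):
    """Keep only the cheapest phrase for each (word, i, j, automaton, state);
    phrases that differ only in tree and cost compete with each other."""
    best = {}
    for p in phrases:
        key = (p.word, p.i, p.j, p.automaton, p.state)
        if key not in best or p.cost < best[key].cost:
            best[key] = p
    return list(best.values())


# Two alternative attachments over the same span, plus an unrelated edge.
edges = [
    Phrase("flights", ("attachment A",), 0, 2, "m", "q", 3.0),
    Phrase("flights", ("attachment B",), 0, 2, "m", "q", 2.5),
    Phrase("flights", (), 1, 2, "m", "q", 1.0),
]
kept = prune(edges)
```

In a full parser the same test would typically be applied as each new phrase is added to the chart rather than in a separate pass.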
The worst case asymptotic time complexity of the 
analysis algorithm is $O(\min(n^2, |V|^2)\, n^3)$, where n is
the length of an input string and $|V|$ is the size of
the vocabulary. This limit can be derived in a simi- 
lar way to cubic time tabular recognition algorithms 
for context free grammars (Younger 1967) with the 
grammar related term being replaced by the term 
$\min(n^2, |V|^2)$ since the words of the input sentence
also act as categories in the head automata model. 
In this context "recognition" refers to checking that 
the input string can be generated from the grammar. 
Note that our algorithm is for analysis (in the sense 
of finding the best derivation) which, in general, is 
a higher time complexity problem than recognition. 
3.2 Generation 
By generation here we mean determining the low- 
est cost linear surface ordering for the dependents of 
each word in an unordered dependency structure re- 
sulting from the transfer mapping described in Sec- 
tion 4. In general, the output of transfer is a de- 
pendency graph and the task of the generator in- 
volves a search for a backbone dependency tree for 
the graph, if necessary by adding dependency edges 
to join up unconnected components of the graph. 
For each graph component, the main steps of the 
search process, described non-deterministically, are 
1. Select a node with word label w having a finite
start of derivation cost $C(w, m, q \mid \triangleright)$.
2. Execute a path through the head automaton m
starting at state q and ending at state q' with a
finite stop action cost $C(\square \mid q', m)$. When making
a transition with relation $r_i$ in the path, select
a graph edge with label $r_i$ from w to some
previously unvisited node $w_i$ with finite dependency
cost $C(\downarrow, w_i \mid w, r_i)$. Include the cost of
the transition (e.g. $C(\rightarrow, q_i, r_i \mid q_{i-1}, m)$) in the
running total for this derivation.
3. For each dependent node $w_i$, select a lexical entry
with cost $C(m_i, q_i \mid r_i, \downarrow, w_i)$, and recursively
apply the machine $m_i$ from state $q_i$ as in step
2.
4. Perform a left-parent-right traversal of the
nodes of the resulting dependency tree, yielding
a target string.
The target string resulting from the lowest cost tree 
that includes all nodes in the graph is selected as the 
translation target string. The independence assump- 
tions implicit in head automata models mean that 
we can select lowest cost orderings of local depen- 
dency trees, below a given relation r, independently 
in the search for the lowest cost derivation. 
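Selecting the lowest cost ordering of one local tree can be sketched as a memoized search over automaton paths that consume the bag of dependent relations. The function name, the table layout, and the costs are assumptions; the sketch tracks relation labels only, not the dependent words or the lexical costs of step 3.

```python
from functools import lru_cache


def best_local_order(transitions, stop_cost, start, relations):
    """transitions: {state: [(direction, relation, next_state, cost), ...]}
    stop_cost: {state: finite stop cost}. Returns (cost, left_seq, right_seq)
    for the cheapest path that writes exactly the given bag of relations."""
    @lru_cache(maxsize=None)
    def search(state, remaining):
        best = (float("inf"), (), ())
        if not remaining and state in stop_cost:
            best = (stop_cost[state], (), ())
        for direction, rel, nxt, cost in transitions.get(state, ()):
            if rel not in remaining:
                continue
            idx = remaining.index(rel)
            rest = remaining[:idx] + remaining[idx + 1:]
            sub_cost, left, right = search(nxt, rest)
            total = cost + sub_cost
            if total < best[0]:
                if direction == "L":
                    # Earlier left symbols sit further from the head.
                    best = (total, (rel,) + left, right)
                else:
                    # Earlier right symbols also sit further from the head.
                    best = (total, left, right + (rel,))
        return best

    return search(start, tuple(sorted(relations)))


# Hypothetical verb automaton preferring subject-left, object-right.
transitions = {
    "q0": [("L", "subj", "q1", 0.1), ("R", "obj", "q1", 2.0)],
    "q1": [("R", "obj", "q2", 0.2), ("L", "subj", "q2", 3.0)],
}
cost, left, right = best_local_order(transitions, {"q2": 0.0}, "q0", ["obj", "subj"])
```

Each transition removes one relation from the bag, so the recursion terminates even if the automaton itself contains cycles.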
When the generator is used as part of the trans- 
lation system, the dependency parameter costs are 
not, in fact, applied by the generator. Instead, be- 
cause these parameters are independent of surface 
order, they are applied earlier by the transfer com- 
ponent, influencing the choice of structure passed to 
the generator. 
4 Transfer Maps 
4.1 Transfer Model Bilingual Lexicon 
The transfer model defines possible mappings, with 
associated costs, of dependency trees with source- 
language word node labels into ones with target- 
language word labels. Unlike the head automata 
monolingual models, the transfer model operates 
with unordered dependency trees, that is, it treats 
the dependents of a word as an unordered bag. The 
model is general enough to cover the common trans- 
lation problems discussed in the literature (e.g. Lin- 
dop and Tsujii 1991 and Dorr 1994) including many- 
to-many word mapping, argument switching, and 
head switching. 
A transfer model consists of a bilingual lexicon 
and a transfer parameter table. The model uses de- 
pendency tree fragments, which are the same as un- 
ordered dependency trees except that some nodes 
may not have word labels. In the bilingual lexicon, 
an entry for a source word wi (see top portion of 
Figure 2) has the form 
$$(w_i, H_i, n_i, G_i, f_i)$$
where Hi is a source language tree fragment, ni (the 
primary node) is a distinguished node of Hi with 
label wi, Gi is a target tree fragment, and fi is a 
mapping function, i.e. a (possibly partial) function 
from the nodes of Hi to the nodes of Gi. 
The transfer parameter table specifies costs for 
the application of transfer entries. In a context- 
independent model, each entry has a single cost pa- 
rameter. In context-dependent transfer models, the 
cost function takes into account the identities of the 
labels of the arcs and nodes dominating wi in the 
source graph. (Context dependence is discussed fur- 
ther in Section 5.) The set of transfer parameters 
may also include costs for the null transfer entries 
for wi, for use in derivations in which wi is trans- 
lated by the entry for another word v. For example, 
the entry for v might be for translating an idiom 
involving wi as a modifier. 
Each entry in the bilingual lexicon specifies a 
way of mapping part of a dependency tree, specifi- 
cally that part "matching" (as explained below) the 
source fragment of the entry, into part of a target 
graph, as indicated by the target fragment. Entry 
mapping functions specify how the set of target frag- 
ments for deriving a translation are to be combined: 
whenever an entry is applied, a global node-mapping 
function is extended to include the entry mapping 
function. 
4.2 Matching, Tiling, and Derivation 
Transfer mapping takes a source dependency tree S 
from analysis and produces a minimum cost deriva- 
tion of a target graph T and a (possibly partial) 
function f from source nodes to target nodes. In 
fact, the transfer model is applicable to certain types 
of source dependency graphs that are more general 
than trees, although the version of the head au- 
tomata model described here only produces trees. 
We will say that a tree fragment H matches an 
unordered dependency tree S if there is a function 
g (a matching function) from the nodes of H to the 
nodes of S such that 
• g is a total one-one function; 
• if a node n of H has a label, and that label is 
word w, then the word label for g(n) is also w; 
• for every arc in H with label r from node $n_1$ to
node $n_2$, there is an arc with label r from $g(n_1)$
to $g(n_2)$.
Unlike first order unification, this definition of 
matching is not commutative and is not determinis- 
tic in that there may be multiple matching functions 
for applying a bilingual entry to an input source tree. 
A particular match of an entry against a dependency 
tree can be represented by the matching function g, 
a set of arcs A in S, and the (possibly context de- 
pendent) cost c of applying the entry. 
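The three matching conditions can be checked by brute force over injective node assignments. This is only a sketch of the definition, not the unification-style matcher used in the system; the dictionary and arc encodings, node identifiers, and example words are assumptions.

```python
from itertools import permutations


def matches(frag_labels, frag_arcs, tree_labels, tree_arcs):
    """frag_labels / tree_labels: {node: word or None}; arcs are sets of
    (head_node, relation, dependent_node). Returns every matching function g
    as a dict from fragment nodes to tree nodes."""
    frag_nodes = list(frag_labels)
    results = []
    # Permutations of distinct tree nodes give total one-one functions.
    for image in permutations(tree_labels, len(frag_nodes)):
        g = dict(zip(frag_nodes, image))
        labels_ok = all(word is None or tree_labels[g[n]] == word
                        for n, word in frag_labels.items())
        arcs_ok = all((g[a], r, g[b]) in tree_arcs for a, r, b in frag_arcs)
        if labels_ok and arcs_ok:
            results.append(g)
    return results


# Source tree: show -obj-> flights -mod-> cheap
tree_labels = {1: "show", 2: "flights", 3: "cheap"}
tree_arcs = {(1, "obj", 2), (2, "mod", 3)}
# Fragment: "flights" with an unlabeled mod-dependent.
frag_labels = {"a": "flights", "b": None}
frag_arcs = {("a", "mod", "b")}
found = matches(frag_labels, frag_arcs, tree_labels, tree_arcs)
```

If the tree contained two `mod` dependents of "flights", the function would return two matching functions, which is the non-determinism noted above.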
A tiling of a source graph with respect to a transfer 
model is a set of entry matches 
$$\{(E_1, g_1, A_1, c_1), \ldots, (E_k, g_k, A_k, c_k)\}$$
which is such that 
Figure 2: Transfer matching and mapping functions
• k is the number of nodes in the source tree S.
• Each $E_i$, $1 \le i \le k$, is a bilingual entry
$(w_i, H_i, n_i, G_i, f_i)$ matching S with function $g_i$
(see Figure 2) and arcs $A_i$.
• For primary nodes $n_i$ and $n_j$ of two distinct
entries $E_i$ and $E_j$, $g_i(n_i)$ and $g_j(n_j)$ are distinct.
• The sets of edges $A_i$ form a partition of the
edges of S.
• The images $g_i(L_i)$ form a partition of the nodes
of S, where $L_i$ is the set of labeled source nodes
in the source fragment $H_i$ of $E_i$.
• $c_i$ is the cost of the match specified by the parameter
table.
A tiling of S yields a costed derivation of a target 
dependency graph T as follows: 
• The cost of the derivation is the sum of the costs 
ci for each match in the tiling. 
• The nodes and arcs of T are composed of the 
nodes and arcs of the target fragments Gi for 
the entries Ei. 
• Let $f_i$ and $f_j$ be the mapping functions for entries
$E_i$ and $E_j$. For any node n of S for which
target nodes $f_i(g_i^{-1}(n))$ and $f_j(g_j^{-1}(n))$ are defined,
these two nodes are identified as a single
node f(n) in T.
The merging of target fragment nodes in the last 
condition has the effect of joining the target frag- 
ments in a consistent fashion. The node mapping 
function f for the entire tree thus has a different 
role from the alignment function in the IBM statis- 
tical translation model (Brown et al. 1990, 1993); 
the role of the latter includes the linear ordering of 
words in the target string. In our approach, tar- 
get word order is handled exclusively by the target 
monolingual model. 
4.3 Transfer Algorithm 
The main transfer search is preceded by a bilingual 
lexicon matching phase. This leads to greater ef- 
ficiency as it avoids repeating matching operations 
during the search phase, and it allows a static analy- 
sis of the matching entries and source tree to identify 
subtrees for which the search phase can safely prune 
out suboptimal partial translations. 
Transfer Configurations In order to apply tar- 
get language model relation costs incrementally, we 
need to distinguish between complete and incom- 
plete arcs: an arc is complete if both its nodes have 
labels, otherwise it is incomplete. The output of the 
lexicon matching phase, and the partial derivations
manipulated by the search phase are both in the 
form of transfer configurations 
(S,R,T,P,f,c,I) 
where S is the set of source nodes and arcs con- 
sumed so far in the derivation, R the remaining 
source nodes and arcs, f the mapping function built 
so far, T the set of nodes and complete arcs of the 
target graph, P the set of incomplete target arcs, 
c the partial derivation cost, and I a set of source 
nodes for which entries have yet to be applied. 
Lexical matching phase The algorithm for lexi- 
cal matching has a similar control structure to stan- 
dard unification algorithms, except that it can result 
in multiple matches. We omit the details. The lex- 
icon matching phase returns, for each source node 
i, a set of runtime entries. There is one runtime 
entry for each successful match and possibly a null 
entry for the node if the word label for i is included 
in successful matches for other entries. Runtime en- 
tries are transfer configurations of the form 
$$(H_i, \emptyset, G_i, P_i, f_i, c_i, \{i\})$$
in which Hi is the source fragment for the entry with 
each node replaced by its image under the applica- 
ble matching function; Gi the target fragment for 
the entry, except for the incomplete arcs Pi of this 
fragment; fi the composition of mapping function 
for the entry with the inverse of the matching func- 
tion; ci the cost of applying the entry in the context 
of its match with the source graph plus the cost in 
the target model of the arcs in Gi. 
Transfer Search Before the transfer search 
proper, the resulting runtime entries together with 
the source graph are analyzed to determine decom- 
position nodes. A decomposition node n is a source 
tree node for which it is safe to prune suboptimal 
translations of the subtree dominated by n. Specifi- 
cally, it is checked that n is the root node of all source 
fragments Hn of runtime entries in which both n and 
its node label are included, and that fn(n) is not 
dominated by (i.e. not reachable via directed arcs 
from) another node in the target graph Gn of such 
entries. 
Transfer search maintains a set M of active runtime
entries. Initially, this is the set of runtime
entries resulting from the lexicon matching phase. 
Overall search control is as follows: 
1. Determine the set of decomposition nodes.
2. Sort the decomposition nodes into a list D such
that if $n_1$ dominates $n_2$ in S then $n_2$ precedes
$n_1$ in D.
3. If D is empty, apply the subtree transfer search
(given below) to S, return the lowest cost solution,
and stop.
4. Remove the first decomposition node n from D
and apply the subtree transfer search to the subtree
S' dominated by n, to yield solutions
$$(S', \emptyset, T', \emptyset, f', c', \emptyset).$$
5. Partition these solutions into subsets with the
same word label for the node f'(n), and select
the solution with lowest cost c' from each subset.
6. Remove from M the set of runtime entries for
nodes in S'.
7. For each selected subtree solution, add to M a
new runtime entry $(S', \emptyset, T', \emptyset, f', c', \{n\})$.
8. Repeat from step 3.
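Steps 2 and 5 of the overall search control can be sketched with two small helpers. The function names, the child-list encoding of the source tree, and the (label, cost, payload) encoding of solutions are assumptions; the subtree search itself is not shown.

```python
def order_decomposition_nodes(children, root, decomposition):
    """Post-order traversal of the source tree, keeping decomposition nodes,
    so that a dominated node always precedes the nodes dominating it."""
    ordered = []

    def visit(node):
        for child in children.get(node, ()):
            visit(child)
        if node in decomposition:
            ordered.append(node)

    visit(root)
    return ordered


def select_best_per_label(solutions):
    """solutions: iterable of (target_label, cost, payload). Keep the lowest
    cost solution for each target word label of the node f'(n)."""
    best = {}
    for label, cost, payload in solutions:
        if label not in best or cost < best[label][0]:
            best[label] = (cost, payload)
    return best


# Hypothetical source tree: node 1 dominates 2 and 4; node 2 dominates 3.
children = {1: [2, 4], 2: [3]}
D = order_decomposition_nodes(children, 1, {1, 2, 4})
chosen = select_best_per_label([("x", 2.0, "A"), ("x", 1.5, "B"), ("y", 3.0, "C")])
```

Keeping one solution per target word label, rather than one overall, preserves the alternatives that could still interact with target dependency costs higher in the tree.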
The subtree transfer search maintains a queue 
Q of configurations corresponding to partial deriva- 
tions for translating the subtree. Control follows a 
standard non-deterministic search paradigm: 
1. Initialize Q to contain a single configuration
$(\emptyset, R_0, \emptyset, \emptyset, \emptyset, 0, I_0)$ with the input subtree $R_0$
and the set of nodes $I_0$ in $R_0$.
2. If Q is empty, return the lowest cost solution
found and stop.
3. Remove a configuration (S, R, T, P, f, c, I) from
the queue.
4. If R is empty, add the configuration to the set
of subtree solutions.
5. Select a node i from I.
6. For each runtime entry $(H_i, \emptyset, G_i, P_i, f_i, c_i, \{i\})$
for i, if $H_i$ is a subgraph of R, add to Q a configuration
$(S \cup H_i,\; R - H_i,\; T \cup G_i \cup G',\; (P \cup P_i) - G',\; f \cup f_i,\; c + c_i + c_{G'},\; I - \{i\})$,
where G' is the set
of newly completed arcs (those in $P \cup P_i$ with
both node labels in $T \cup G_i \cup P \cup P_i$) and $c_{G'}$
is the cost of the arcs G' in the target language
model.
7. For any source node n for which f(n) and $f_i(n)$
are both defined, merge these two target nodes.
8. Repeat from step 2.
Keeping the arcs P separate in the configuration allows
efficient incremental application of target dependency
costs $c_{G'}$ during the search, so these costs
are taken into account in the pruning step of the
overall search control. This way we can keep the
benefits of monolingual/bilingual modularity (Isabelle
and Macklovitch 1986) without the computational
overhead of transfer-and-filter (Alshawi et
al. 1992).
It is possible to apply the subtree search directly 
to the whole graph starting with the initial runtime 
entries from lexical matching. However, this would 
result in an exponential search, specifically a search 
tree with a branching factor of the order of the num- 
ber of matching entries per input word. Fortunately, 
long sentences typically have several decomposition 
nodes, such as the heads of noun phrases, so the 
search as described is factored into manageable com- 
ponents. 
5 Cost Functions 
5.1 Costed Search Processes 
The head automata model and transfer model were 
originally conceived as probabilistic models. In order 
to take advantage of more of the information avail- 
able in our training data, we experimented with cost 
functions that make use of incorrect translations as 
negative examples and that also treat the correctness
of a translation hypothesis as a matter of degree. 
To experiment with different models, we imple- 
mented a general mechanism for associating costs to 
solutions of a search process. Here, a search process 
is conceptualized as a non-deterministic computa- 
tion that takes a single input string, undergoes a 
sequence of state transitions in a non-deterministic 
fashion, then outputs a solution string. Process 
states are distinct from, but may include, head au- 
tomaton states. 
A cost function for a search process is a real val- 
ued function defined on a pair of equivalence classes 
of process states. The first element of the pair, a 
context c, is an equivalence class of states before 
transitions. The second element, an event e, is an 
equivalence class of states after transitions. (The 
equivalence relations for contexts and events may 
be different.) We refer to an event-context pair as a 
choice, for which we use the notation 
$$(e \mid c)$$
borrowed from the special case of conditional prob- 
abilities. The cost of a derivation of a solution by 
the process is taken to be the sum of costs of choices 
involved in the derivation. 
We represent events and contexts by finite se- 
quences of symbols (typically words or relation sym- 
bols in the translation application). We write 
$$C(a_1 \cdots a_n \mid b_1 \cdots b_k)$$
for the cost of the event represented by $(a_1 \cdots a_n)$ in
the context represented by $(b_1 \cdots b_k)$.
"Backed off" costs can be computed by averag- 
ing over larger equivalence classes (represented by 
shorter sequences in which positions are eliminated 
systematically). A similar smoothing technique has 
been applied to the specific case of prepositional 
phrase attachment by Collins and Brooks (1995). 
We have used backed off costs in the translation ap- 
plication for the various cost functions described be- 
low. Although this resulted in some improvement in 
testing, so far the improvement has not been statis- 
tically significant. 
5.2 Model Cost Functions 
Taken together, the events, contexts, and cost func- 
tion constitute a process cost model, or simply a 
model. The cost function specifies the model param- 
eters; the other components are the model structure. 
We have experimented with a number of model 
types, including the following. 
Probabilistic model: In this model we assume a 
probability distribution on the possible events for a 
context, that is, 
$\sum_e P(e \mid c) = 1.$
The cost parameters of the model are defined as:
$C(e \mid c) = -\ln P(e \mid c).$
Given a set of solutions from executions of a process, 
let $n^+(e \mid c)$ be the number of times choice $(e \mid c)$ was
taken leading to acceptable solutions (e.g. correct
translations) and $n^+(c)$ be the number of times
context $c$ was encountered for these solutions. We can
then estimate the probabilistic model costs with
$C(e \mid c) \approx \ln(n^+(c)) - \ln(n^+(e \mid c)).$
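The estimate above, together with the additive derivation cost of Section 5.1, can be sketched as follows. This is an illustration under stated assumptions rather than the system's implementation: choices are represented as hypothetical (event, context) pairs, and each encounter of a context is assumed to contribute exactly one choice, so that the context count is the sum of its choice counts.

```python
import math
from collections import Counter


def probabilistic_costs(positive_choices):
    """Estimate C(e|c) ~ ln n+(c) - ln n+(e|c).

    `positive_choices` lists the (event, context) pairs taken in
    derivations that led to acceptable solutions.
    """
    choice_counts = Counter(positive_choices)
    context_counts = Counter(context for _event, context in positive_choices)
    return {
        (event, context): math.log(context_counts[context]) - math.log(count)
        for (event, context), count in choice_counts.items()
    }


def derivation_cost(costs, derivation):
    """Cost of a derivation: the sum of the costs of its choices."""
    return sum(costs[choice] for choice in derivation)


# Toy counts (not from the experiments): in context "pp", verb attachment
# was chosen three times and noun attachment once.
observed = [("attach_verb", "pp")] * 3 + [("attach_noun", "pp")]
toy_costs = probabilistic_costs(observed)
```

Because the costs are negated log relative frequencies, the values `exp(-C)` for a fixed context sum to one, and the more frequent choice receives the lower cost.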
Discriminative model: The costs in this model are 
likelihood ratios comparing positive and negative 
solutions, for example correct and incorrect trans- 
lations. (See Dunning 1993 on the application of 
likelihood ratios in computational linguistics.) Let 
$n^-(e \mid c)$ be the count for choice $(e \mid c)$ leading to
negative solutions. The cost function for the
discriminative model is estimated as
$C(e \mid c) \approx \ln(n^-(e \mid c)) - \ln(n^+(e \mid c)).$
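As a sketch, the discriminative estimate can be computed from two lists of choices, one from positive and one from negative solutions. The representation of choices as (event, context) pairs is an assumption, and so is the decision to skip choices with a zero count on either side; the text does not say how zero counts were handled.

```python
import math
from collections import Counter


def discriminative_costs(positive_choices, negative_choices):
    """Estimate C(e|c) ~ ln n-(e|c) - ln n+(e|c).

    Only choices observed in both positive and negative solutions get a
    cost here; zero counts would need smoothing, which is left out.
    """
    n_pos = Counter(positive_choices)
    n_neg = Counter(negative_choices)
    return {
        choice: math.log(n_neg[choice]) - math.log(n_pos[choice])
        for choice in n_pos.keys() & n_neg.keys()
    }


# Toy counts (not from the experiments).
good = [("attach_verb", "pp")] * 8 + [("attach_noun", "pp")] * 2
bad = [("attach_verb", "pp")] * 2 + [("attach_noun", "pp")] * 6
toy_costs = discriminative_costs(good, bad)
```

A choice seen mostly in positive solutions gets a negative cost, and one seen mostly in negative solutions gets a positive cost.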
Mean distance model: In the mean distance model, 
we make use of some measure of goodness of a solution
$t_s$ for some input $s$ by comparing it against an
ideal solution $\hat{t}_s$ for $s$ with a distance metric $h$:
$h(t_s, \hat{t}_s) \mapsto d$
in which $d$ is a non-negative real number. A parameter
for choice $(e \mid c)$ in the distance model
$C(e \mid c) = E_h(e \mid c)$
is the mean value of $h(t_s, \hat{t}_s)$ for solutions $t_s$
produced by derivations including the choice $(e \mid c)$.
Normalized distance model: The mean distance 
model does not use the constraint that a particular 
choice faced by a process is always a choice between 
events with the same context. It is also somewhat 
sensitive to peculiarities of the distance function h. 
With the same assumptions we made for the mean 
distance model, let 
$E_h(c)$
be the average of $h(t_s, \hat{t}_s)$ for solutions derived from
sequences of choices including the context $c$. The
cost parameter for $(e \mid c)$ in the normalized distance
model is
$C(e \mid c) = \frac{E_h(e \mid c)}{E_h(c)},$
that is, the ratio of the expected distance for deriva- 
tions involving the choice and the expected distance 
for all derivations involving the context for that 
choice. 
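The mean and normalized distance parameters can be sketched together. The input format is an assumption: each derivation is a pair of its (event, context) choices and the distance of its solution from the ideal solution. A derivation is counted once per distinct choice and once per distinct context it contains, which is also an assumption.

```python
from collections import defaultdict


def distance_costs(derivations):
    """Return (mean, normalized) distance costs keyed by (event, context).

    `derivations` is an iterable of (choices, distance) pairs, where
    `choices` is a list of (event, context) pairs and `distance` is the
    value of h for the solution produced by that derivation.
    """
    choice_sum, choice_n = defaultdict(float), defaultdict(int)
    context_sum, context_n = defaultdict(float), defaultdict(int)
    for choices, distance in derivations:
        for choice in set(choices):
            choice_sum[choice] += distance
            choice_n[choice] += 1
        for context in {context for _event, context in choices}:
            context_sum[context] += distance
            context_n[context] += 1

    # Mean distance model: expected distance given the choice.
    mean = {ch: choice_sum[ch] / choice_n[ch] for ch in choice_sum}

    # Normalized distance model: divide by expected distance given the context.
    normalized = {}
    for (event, context), value in mean.items():
        context_mean = context_sum[context] / context_n[context]
        if context_mean > 0:  # undefined when every distance is zero
            normalized[(event, context)] = value / context_mean
    return mean, normalized


# Toy data (not from the experiments): two derivations sharing context "c".
toy_mean, toy_normalized = distance_costs([
    ([("e1", "c")], 2.0),
    ([("e2", "c")], 6.0),
])
```

In the toy data the expected distance for context `"c"` is 4.0, so the normalized costs of the two competing events are 0.5 and 1.5. A value below one marks a choice that does better than average for its context.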
Reflexive Training: If we have a manually translated
corpus, we can apply the mean and normalized
distance models to translation by taking the
ideal solution $\hat{t}_s$ for translating a source string $s$ to
be the manual translation for $s$. In the absence of
good metrics for comparing translations, we employ
a heuristic string distance metric to compare word
selection and word order in $t_s$ and $\hat{t}_s$.
In order to train the model parameters without 
a manually translated corpus, we use a "reflexive" 
training method (similar in spirit to the "wake- 
sleep" algorithm, Hinton et al. 1995). In this 
method, our search process translates a source sentence
$s$ to $t_s$ in the target language and then translates
$t_s$ back to a source language sentence $s'$. The
original sentence $s$ can then act as the ideal solution
of the overall process. For this training method
to be effective, we need a reasonably good initial
model, i.e. one for which the distance $h(s, s')$ is
inversely correlated with the probability that $t_s$ is a
good translation of $s$.
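The round trip used in reflexive training can be sketched as below. The two translation functions are hypothetical placeholders passed in by the caller, and a word-level edit distance stands in for the heuristic string distance metric, whose exact form is not given in the text.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over word tokens.

    A stand-in for the heuristic metric h; it is sensitive to both word
    selection and word order, but it is not the metric used in the system.
    """
    x, y = a.split(), b.split()
    previous = list(range(len(y) + 1))
    for i, word_x in enumerate(x, 1):
        current = [i]
        for j, word_y in enumerate(y, 1):
            current.append(min(
                previous[j] + 1,                       # delete word_x
                current[j - 1] + 1,                    # insert word_y
                previous[j - 1] + (word_x != word_y),  # substitute or match
            ))
        previous = current
    return previous[-1]


def reflexive_examples(sentences, to_target, to_source,
                       distance=word_edit_distance):
    """Yield (s, t_s, s', h(s, s')) for each source sentence s.

    `to_target` and `to_source` are caller-supplied translation functions
    (hypothetical here); s itself serves as the ideal solution.
    """
    for s in sentences:
        t_s = to_target(s)
        s_back = to_source(t_s)
        yield s, t_s, s_back, distance(s, s_back)
```

The distances produced this way could then be attached to the choices made in the two derivations and used to estimate mean or normalized distance costs, with no manually translated corpus required.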
6 Experimental System 
We have built an experimental translation system 
using the monolingual and translation models de- 
scribed in this paper. The system translates sen- 
tences in the ATIS domain (Hirschman et al. 1993) 
between English and Mandarin Chinese. The trans- 
lator is in fact a subsystem of a speech translation 
prototype, though the experiments we describe here 
are for transcribed spoken utterances. (We infor- 
mally refer to the transcribed utterances as sen- 
tences.) The average time taken for translation of 
sentences (of unrestricted length) from the ATIS cor- 
pus was around 1.7 seconds with approximately 0.4 
seconds being taken by the analysis algorithm and 
0.7 seconds by the transfer algorithm. 
English and Chinese lexicons of around 1200 and 
1000 words respectively were constructed. Alto- 
gether, the entries in these lexicons made reference 
to around 200 structurally distinct head automata. 
The transfer lexicon contained around 3500 paired 
graph fragments, most of which were used in both 
transfer directions. With this model structure, we 
tried a number of methods for assigning cost func- 
tions. The nature of the training methods and their 
corresponding cost functions meant that different 
amounts of training data could be used, as discussed 
further below. 
The methods make use of a supervised training 
set and an unsupervised training set, both sets be- 
ing chosen at random from the 20,000 or so ATIS 
sentences available to us. The supervised training 
set comprised around 1950 sentences. A subcollec- 
tion of 1150 of these sentences were translated by the 
system, and the resulting translations manually clas- 
sified as 'good' (800 translations) or 'bad' (350 trans- 
lations). The remaining 800 supervised training set 
sentences were hand-tagged for prepositional attach- 
ment points. (Prepositional phrase attachment is a 
major cause of ambiguity in the ATIS corpus, and 
moreover can affect English-Chinese translation, see 
Chen and Chen 1992.) The attachment informa- 
tion was used to generate additional negative and 
positive counts for dependency choices. The un- 
supervised training set consisted of approximately 
13,000 sentences; it was used for automatic training 
(as described under 'Reflexive Training' above) by 
translating the sentences into Chinese and back to 
English. 
A. Qualitative Baseline: In this model, all choices 
were assigned the same cost except for irregular 
events (such as unknown words or partial analy- 
ses) which were all assigned a high penalty cost. 
This model gives an indication of performance based 
solely on model structure. 
B. Probabilistic: Counts for choices leading to good 
translations for sentences of the supervised train- 
ing corpus, together with counts from the manually 
assigned attachment points, were used to compute 
negated log probability costs. 
C. Discriminative: The positive counts as in the 
probabilistic method, together with corresponding 
negative counts from bad translations or incorrect 
attachment choices, were used to compute log likeli- 
hood ratio costs. 
D. Normalized Distance: In this fully automatic 
method, normalized distance costs were computed 
from reflexive translation of the sentences in the un- 
supervised training corpus. The translation runs 
were carried out with parameters from method A. 
E. Bootstrapped Normalized Distance: The same as 
method D except that the system used to carry out 
the reflexive translation was running with parame- 
ters from method C. 
Table 1 shows the results of evaluating the per- 
formance of these models for translating 200 unre- 
stricted length ATIS sentences into Chinese. This 
was a previously unseen test set not included in 
any of the training sets. Two measures of transla- 
tion acceptability are shown, as judged by a Chinese 
speaker. (In separate experiments, we verified that 
the judgments of this speaker were near the average 
of five Chinese speakers). The first measure, "mean- 
ing and grammar", gives the percentage of sentence 
translations judged to preserve meaning without the 
introduction of grammatical errors. For the second 
measure, "meaning preservation", grammatical errors
were allowed if they did not interfere with meaning
(in the sense of misleading the hearer).

Table 1: Translation performance of different cost
assignment methods

Method   Meaning and    Meaning
         Grammar (%)    Preservation (%)
A        29             71
D        37             71
B        46             82
C        52             83
E        54             83

In the table, we have grouped together methods A and D for
which the parameters were derived without human
supervision effort, and methods B, C, and E which 
depended on the same amount of human supervision 
effort. This means that side by side comparison of 
these methods has practical relevance, even though 
the methods exploited different amounts of data. In 
the case of E, the supervision effort was used only 
as an oracle during training, not directly in the cost 
computations. 
We can see from Table 1 that the choice of method 
affected translation quality (meaning and grammar) 
more than it affected preservation of meaning. A 
possible explanation is that the model structure was 
adequate for most lexical choice decisions because of 
the relatively low degree of polysemy in the ATIS 
corpus. For the stricter measure, the differences 
were statistically significant, according to the sign 
test at the 5% significance level, for the following 
comparisons: C and E each outperformed B and D, 
and B and D each outperformed A. 
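For reference, a paired sign test of this kind can be sketched as follows. The per-sentence judgments are not reported here, so the counts in the example are made up, and the two-sided form of the test is an assumption. For a pair of methods, `wins_a` is the number of test sentences judged acceptable for the first method only and `wins_b` the number acceptable for the second only; sentences on which the two methods agree are discarded.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided sign test p-value for paired binary outcomes.

    Under the null hypothesis each untied sentence is equally likely to
    favor either method, so the smaller win count follows Binomial(n, 0.5).
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: method X wins on 9 untied sentences, method Y on 1.
p_value = sign_test_p_value(9, 1)
significant_at_5_percent = p_value < 0.05
```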
7 Language Processing and 
Semantic Representations 
The translation system we have described employs 
only simple representations of sentences and phrases. 
Apart from the words themselves, the only sym- 
bols used are the dependency relations R. In our 
experimental system, these relation symbols are 
themselves natural language words, although this 
is not a necessary property of our models. Infor- 
mation coded explicitly in sentence representations 
by word senses and feature constraints in our pre- 
vious work (Alshawi 1992) is implicit in the mod- 
els used to derive the dependency trees and trans- 
lations. In particular, dependency parameters and 
context-dependent transfer parameters give rise to 
an implicit, graded notion of word sense. 
For language-centered applications like transla- 
tion or summarization, for which we have a large 
body of examples of the desired behavior, we can 
think of the task in terms of the formal problem of 
modeling a relation between strings based on examples
of that relation. By taking this viewpoint, we
seem to be ignoring the intuition that most interest- 
ing natural language processing tasks (translation, 
summarization, interfaces) are semantic in nature. 
It is therefore tempting to conclude that an adequate 
treatment of these tasks requires the manipulation 
of artificial semantic representation languages with 
well-understood formal denotations. While the in- 
tuition seems reasonable, the conclusion might be 
too strong in that it rules out the possibility that 
natural language itself is adequate for manipulating 
semantic denotations. After all, this is the primary 
function of natural language. 
The main justification for artificial semantic rep- 
resentation languages is that they are unambiguous 
by design. This may not be as critical, or useful, 
as it might first appear. While it is true that nat- 
ural language is ambiguous and under-specified out 
of context, this uncertainty is greatly reduced by 
context to the point where further resolution (e.g. 
full scoping) is irrelevant to the task, or even the 
intended meaning. The fact that translation is in- 
sensitive to many ambiguities motivated the use of 
unresolved quasi-logical form for transfer (Alshawi 
et al. 1992). 
To the extent that contextual resolution is neces- 
sary, context may be provided by the state of the lan- 
guage processor rather than complex semantic rep- 
resentations. Local context may include the state of 
local processing components (such as our head au- 
tomata) for capturing grammatical constraints, or 
the identity of other words in a phrase for capturing 
sense distinctions. For larger scale context, I have 
argued elsewhere (Alshawi 1987) that memory ac- 
tivation patterns resulting from the process of car- 
rying out an understanding task can act as global 
context without explicit representations of discourse. 
Under this view, the challenge is how to exploit con- 
text in performing a task rather than how to map 
natural language phrases to expressions of a formal- 
ism for coding meaning independently of context or 
intended use. 
There is now greater understanding of the formal 
semantics of under-specified and ambiguous repre- 
sentations. In Alshawi 1996, I provide a denota- 
tional semantics for a simple under-specified lan- 
guage and argue for extending this treatment to a 
formal semantics of natural language strings as ex- 
pressions of an under-specified representation. In 
this paradigm, ordered dependency trees can be 
viewed as natural language strings annotated so that 
some of the implicit relations are more explicit. A 
milder form of this kind of annotation is a bracketed 
natural language string. We are not advocating an 
approach in which linguistic structure is ignored (as 
it is in the IBM translator described by Brown et 
al. 1990), but rather one in which the syntactic and 
semantic structure of a string is implicit in the way 
it is processed by an interpreter. 
One important advantage of using representations 
that are close to natural language itself is that it re- 
duces the degrees of freedom in specifying language 
and task models, making these models easier to ac- 
quire automatically. With these considerations in 
mind, we have started to experiment with a version 
of the translator described here with even simpler 
representations and for which the model structure, 
not just the parameters, can be acquired automati- 
cally. 
Acknowledgments 
The work on cost functions and training methods 
was carried out jointly with Adam Buchsbaum who 
also customized the English model to ATIS and in- 
tegrated the translator into our speech translation 
prototype. Jishen He constructed the Chinese ATIS 
language model and bilingual lexicon and identified 
many problems with early versions of the transfer 
component. I am also grateful for advice and help 
from Don Hindle, Fernando Pereira, Chi-Lin Shih, 
Richard Sproat, and Bin Wu. 

References 
Alshawi, H. 1987. Memory and Context for Language 
Interpretation. Cambridge University Press, Cambridge, 
England. 
Alshawi, H. 1996. "Underspecified First Order Log- 
ics". In Semantic Ambiguity and Underspecification, 
edited by K. van Deemter and S. Peters, CSLI Publi- 
cations, Stanford, California. 
Alshawi, H. 1992. The Core Language Engine. MIT 
Press, Cambridge, Massachusetts. 
Alshawi, H., D. Carter, B. Gamback and M. Rayner. 
1992. "Swedish-English QLF Translation". In H. Alshawi
(ed.) The Core Language Engine. MIT Press,
Cambridge, Massachusetts. 
Booth, T. 1969. "Probabilistic Representation of Formal
Languages". Tenth Annual IEEE Symposium on
Switching and Automata Theory. 
Brew, C. 1992. "Letting the Cat out of the Bag: Generation
for Shake-and-Bake MT". Proceedings of COLING92,
the International Conference on Computational
Linguistics, Nantes, France. 
Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra,
F. Jelinek, J. Lafferty, R. Mercer and P. Roossin. 1990.
"A Statistical Approach to Machine Translation". Com- 
putational Linguistics 16:79-85. 
Brown, P.F., S.A. Della Pietra, V.J. Della Pietra, and 
R.L. Mercer. 1993. "The Mathematics of Statistical 
Machine Translation: Parameter Estimation". Compu- 
tational Linguistics 19:263-312. 
Chen, K.H. and H. H. Chen. 1992. "Attachment and 
Transfer of Prepositional Phrases with Constraint Prop- 
agation". Computer Processing of Chinese and Oriental 
Languages, Vol. 6, No. 2, 123-142. 
Church, K. and R. Patil. 1982. "Coping with Syntactic
Ambiguity or How to Put the Block in the Box on the 
Table". Computational Linguistics 8:139-149. 
Collins, M. and J. Brooks. 1995. "Prepositional 
Phrase Attachment through a Backed-Off Model." Pro- 
ceedings of the Third Workshop on Very Large Corpora, 
Cambridge, Massachusetts, ACL, 27-38. 
Dorr, B.J. 1994. "Machine Translation Divergences: 
A Formal Description and Proposed Solution". Compu- 
tational Linguistics 20:597-634. 
Dunning, T. 1993. "Accurate Methods for Statistics of
Surprise and Coincidence". Computational Linguistics
19:61-74.
Earley, J. 1970. "An Efficient Context-Free Parsing
Algorithm". Communications of the ACM 14: 453-60. 
Gazdar, G., E. Klein, G.K. Pullum, and I.A. Sag.
1985. Generalised Phrase Structure Grammar. Black- 
well, Oxford. 
Hinton, G.E., P. Dayan, B.J. Frey and R.M. Neal. 
1995. "The 'Wake-Sleep' Algorithm for Unsupervised 
Neural Networks". Science 268:1158-1161. 
Hudson, R.A. 1984. Word Grammar. Blackwell, Ox- 
ford. 
Hirschman, L., M. Bates, D. Dahl, W. Fisher, J. Garo- 
folo, D. Pallett, K. Hunicke-Smith, P. Price, A. Rud- 
nicky, and E. Tzoukermann. 1993. "Multi-Site Data 
Collection and Evaluation in Spoken Language Under- 
standing". In Proceedings of the Human Language Tech- 
nology Workshop, Morgan Kaufmann, San Francisco, 
19-24. 
Isabelle, P. and E. Macklovitch. 1986. "Transfer and 
MT Modularity", Eleventh International Conference on 
Computational Linguistics, Bonn, Germany, 115-117. 
Jackendoff, R.S. 1977. X-bar Syntax: A Study 
of Phrase Structure. MIT Press, Cambridge, Mas- 
sachusetts. 
Jelinek, F., R.L. Mercer and S. Roukos. 1992. "Prin- 
ciples of Lexical Language Modeling for Speech Recog- 
nition". In S. Furui and M.M. Sondhi (eds.), Advances 
in Speech Signal Processing, Marcel Dekker, New York. 
Lafferty, J., D. Sleator and D. Temperley. 1992.
"Grammatical Trigrams: A Probabilistic Model of Link
Grammar". In Proceedings of the 1992 AAAI Fall Symposium
on Probabilistic Approaches to Natural Language,
89-97. 
Kay, M. 1989. "Head Driven Parsing". In Proceedings
of the Workshop on Parsing Technologies, Pittsburgh,
1989.
Lindop, J. and J. Tsujii. 1991. "Complex Transfer
in MT: A Survey of Examples". Technical Report 91/5, 
Centre for Computational Linguistics, UMIST, Manch- 
ester, UK. 
Resnik, P. 1992. "Probabilistic Tree-Adjoining Grammar
as a Framework for Statistical Natural Language
Processing". In Proceedings of COLING-92, Nantes,
France, 418-424. 
Satta, G. and O. Stock. 1989. "Head-Driven Bidirectional
Parsing". In Proceedings of the Workshop on
Parsing Technologies, Pittsburgh, 1989.
Schabes, Y. 1992. "Stochastic Lexicalized Tree-Adjoining
Grammars". In Proceedings of COLING-92,
Nantes, France, 426-432. 
Whitelock, P.J. 1992. "Shake-and-Bake Translation". 
Proceedings of COLING92, the International Conference 
on Computational Linguistics, Nantes, France. 
Younger, D. 1967. "Recognition and Parsing of
Context-Free Languages in Time $n^3$". Information and
Control 10:189-208.
