Qualitative and Quantitative Models 
of Speech Translation 
Hiyan Alshawi 
AT&T Bell Laboratories
600 Mountain Avenue 
Murray Hill, NJ 07974, USA 
hiyan@research.att.com
Abstract 
This paper compares a qualitative reasoning model of 
translation with a quantitative statistical model. We 
consider these models within the context of two hy- 
pothetical speech translation systems, starting with a 
logic-based design and pointing out which of its char- 
acteristics are best preserved or eliminated in moving 
to the second, quantitative design. The quantitative 
language and translation models are based on relations 
between lexical heads of phrases. Statistical parame- 
ters for structural dependency, lexical transfer, and lin- 
ear order are used to select a set of implicit relations 
between words in a source utterance, a corresponding 
set of relations between target language words, and the 
most likely translation of the original utterance. 
1. Introduction 
In recent years there has been a resurgence of interest in 
statistical approaches to natural language processing. 
Such approaches are not new, witness the statistical 
approach to machine translation suggested by Weaver 
(1955), but the current level of interest is largely due 
to the success of applying hidden Markov models and 
N-gram language models in speech recognition. This 
success was directly measurable in terms of word recog- 
nition error rates, prompting language processing re- 
searchers to seek corresponding improvements in per- 
formance and robustness. A speech translation system, 
which by necessity combines speech and language tech- 
nology, is a natural place to consider combining the sta- 
tistical and conventional approaches and much of this 
paper describes probabilistic models of structural lan- 
guage analysis and translation. Our aim will be to pro- 
vide an overall model for translation with the best of 
both worlds. Various factors will lead us to conclude 
that a lexicalist statistical model with dependency re- 
lations is well suited to this goal. 
As well as this quantitative approach, we will consider 
a constraint/logic based approach and try to distinguish 
characteristics that we wish to preserve from those that 
are best replaced by statistical models. Although per- 
haps implicit in many conventional approaches to trans- 
lation, a characterization in logical terms of what is be- 
ing done is rarely given, so we will attempt to make 
that explicit here, more or less from first principles.
Before proceeding, I will first examine some fashion-
able distinctions in section 2 in order to clarify the is-
sues involved in comparing these approaches. I will
attempt to argue that the important distinction is not 
so much a rational-empirical or symbolic-statistical dis- 
tinction but rather a qualitative-quantitative one. This 
is followed by discussion of the logic-based model in 
section 3, the overall quantitative model in section 4, 
monolingual models in section 5, translation models 
in section 6, and some conclusions in section 7. We 
concentrate throughout on what information about lan-
guage and translation is coded and how it is expressed
as logical constraints or statistical parameters. Al- 
though important, we will say little about search al- 
gorithms, rule acquisition, or parameter estimation. 
2. Qualitative and Quantitative Models 
One contrast often taken for granted is the identifica- 
tion of a 'statistical-symbolic' distinction in language 
processing as an instance of the empirical vs. rational 
debate. I believe this contrast has been exaggerated,
though historically it has had some validity in terms
of accepted practice. Rule based approaches have be- 
come more empirical in a number of ways: First, a more 
empirical approach is being adopted to grammar devel- 
opment whereby the rule set is modified according to 
its performance against corpora of natural text (e.g. 
Taylor, Grover and Briscoe 1989). Second, there is a
class of techniques for learning rules from text, a recent 
example being Brill 1993. Conversely, it is possible to 
imagine building a language model in which all prob- 
abilities are estimated according to intuition without 
reference to any real data, giving a probabilistic model
that is not empirical. 
Most language processing labeled as statistical in- 
volves associating real-number valued parameters to 
configurations of symbols. This is not surprising given 
that natural language, at least in written form, is explic- 
itly symbolic. Presumably, classifying a system as sym- 
bolic must refer to a different set of (internal) symbols, 
but even this does not rule out many statistical sys-
tems modeling events involving nonterminal categories
and word senses. Given that the notion of a symbol,
let alone an 'internal symbol', is itself a slippery one, it
may be unwise to build our theories of language, or even
the way we classify different theories, on this notion.
Instead, it would seem that the real contrast driving 
the shift towards statistics in language processing is a 
contrast between qualitative systems dealing exclusively 
with combinatoric constraints, and quantitative systems 
that involve computing numerical functions. This bears 
directly on the problems of brittleness and complexity
that discrete approaches to language processing share
with, for example, reasoning systems based on tradi-
tional logical inference. It relates to the inadequacy of
the dominant theories in linguistics to capture 'shades' 
of meaning or degrees of acceptability which are often 
recognized by people outside the field as important in- 
herent properties of natural language. The qualitative- 
quantitative distinction can also be seen as underlying 
the difference between classification systems based on
feature specifications, as used in unification formalisms
(Shieber 1986), and clustering based on a variable de-
gree of granularity (e.g. Pereira, Tishby and Lee 1993).
It seems unlikely that these continuously variable as-
pects of fluent natural language can be captured by a
purely combinatoric model. This naturally leads to the
question of how best to introduce quantitative model-
ing into language processing. It is not, of course, nec-
essary for the quantities of a quantitative model to be
probabilities. For example, we may wish to define real- 
valued functions on parse trees that reflect the extent 
to which the trees conform to, say, minimal attachment
and parallelism between conjuncts. Such functions have
been used in tandem with statistical functions in ex-
periments on disambiguation (for instance Alshawi and
Carter 1994). Another example is connection strengths
in neural network approaches to language processing,
though it has been shown that certain networks are
effectively computing probabilities (Richard and Lipp-
mann 1991).
Nevertheless, probability theory does offer a coher- 
ent and relatively well understood framework for select- 
ing between uncertain alternatives, making it a natural 
choice for quantitative language processing. The case 
for probability theory is strengthened by a well devel-
oped empirical methodology in the form of statistical
parameter estimation. There is also the strong connec-
tion between probability theory and the formal theory
of information and communication, a connection that
has been exploited in speech recognition, for example
using the concept of entropy to provide a motivated way
of measuring the complexity of a recognition problem
(Jelinek et al. 1992).
Even if probability theory remains, as it currently
is, the method of choice in making language processing
quantitative, this still leaves the field wide open in terms
of carving up language processing into an appropriate
set of events for probability theory to work with. For
translation, a very direct approach using parameters
based on surface positions of words in source and target
sentences was adopted in the Candide system (Brown
et al. 1990). However, this does not capture important
structural properties of natural language. Nor does it 
take into account generalizations about translation that 
are independent of the exact word order in source and 
target sentences. Such generalizations are, of course, 
central to qualitative structural approaches to transla-
tion (e.g. Isabelle and Macklovitch 1986, Alshawi et al.
1992).
The aim of the quantitative language and translation 
models presented in sections 5 and 6 is to employ proba-
bilistic parameters that reflect linguistic structure with-
out discarding rich lexical information or making the
models too complex to train automatically. In terms of 
a traditional classification, this would be seen as a 'hy- 
brid symbolic-statistical' system because it deals with 
linguistic structure. From our perspective, it can be 
seen as a quantitative version of the logic-based model 
because both models attempt to capture similar infor- 
mation (about the organization of words into phrases 
and relations holding between these phrases or their ref- 
erents), though the tools of modeling are substantially 
different. 
3. Dissecting a Logic-Based System 
We now consider a hypothetical speech translation sys- 
tem in which the language processing components fol- 
low a conventional qualitative transfer design. Al- 
though hypothetical, this design and its components are 
similar to those used in existing database query (Rayner 
and Alshawi 1992) and translation systems (Alshawi et 
al. 1992). More recent versions of these systems have
been gradually taking on a more quantitative flavor, 
particularly with respect to choosing between alterna- 
tive analyses, but our hypothetical system will be more 
purist in its qualitative approach. 
The overall design is as follows. We assume that 
a speech recognition subsystem delivers a list of text 
strings corresponding to transcriptions of an input ut- 
terance. These recognition hypotheses are passed to a 
parser which applies a logic-based grammar and lexicon 
to produce a set of logical forms, specifically formulas 
in first order logic corresponding to possible interpreta- 
tions of the utterance. The logical forms are filtered by 
contextual and word-sense constraints, and one of them 
is passed to the translation component. The translation 
relation is expressed by a set of first order axioms which 
are used by a theorem prover to derive a target language 
logical form that is equivalent (in some context) to the 
source logical form. A grammar for the target language
is then applied to the target form, generating a syntax
tree whose fringe is passed to a speech synthesizer.
Taking the various components in turn, we make a
note of undesirable properties that might be improved 
by quantitative modeling. 
Analysis and Generation 
A grammar, expressed as a set of syntactic rules (ax-
ioms) $G_{syn}$ and a set of semantic rules (axioms) $G_{sem}$, is
used to support a relation $form$ holding between strings
$s$ and logical forms $\phi$ expressed in first order logic:
$$G_{syn} \cup G_{sem} \models form(s, \phi).$$
The relation $form$ is many-to-many, associating a
string with linguistically possible logical form interpre-
tations. In the analysis direction, we are given $s$ and
search for logical forms $\phi$, while in generation we search
for strings $s$ given $\phi$.
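The two directions of use can be illustrated with a small sketch. It is only a toy, assuming the relation is given extensionally as a finite set of pairs rather than derived from grammar axioms; the names `FORM`, `analyse`, and `generate`, and the example strings and logical forms, are hypothetical.

```python
from typing import Set, Tuple

# Toy, extensional stand-in for the relation form(s, phi).
# Each pair is (string, logical form); the entries are invented for illustration.
FORM: Set[Tuple[str, str]] = {
    ("every dog chased a cat",
     "forall x. dog(x) -> exists y. cat(y) & chase(x, y)"),
    ("every dog chased a cat",
     "exists y. cat(y) & forall x. dog(x) -> chase(x, y)"),
    ("a cat was chased by every dog",
     "exists y. cat(y) & forall x. dog(x) -> chase(x, y)"),
}


def analyse(s: str) -> Set[str]:
    """Analysis direction: given a string, collect its logical forms."""
    return {phi for (t, phi) in FORM if t == s}


def generate(phi: str) -> Set[str]:
    """Generation direction: given a logical form, collect its strings."""
    return {s for (s, p) in FORM if p == phi}


readings = analyse("every dog chased a cat")
paraphrases = generate("exists y. cat(y) & forall x. dog(x) -> chase(x, y)")
```

Here one string maps to two logical forms and one logical form maps to two strings, which is the many-to-many behavior described above.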
For analysis and generation, we are treating strings 
$s$ and logical forms $\phi$ as object level entities. In inter-
pretation and translation, we will move down from this
meta-level reasoning to reasoning with the logical forms 
as propositions. 
The list of text strings handed by the recognizer to
the parser can be assumed to be ordered in accordance 
with some acoustic scoring scheme internal to the rec- 
ognizer. The magnitude of the scores is ignored by our 
qualitative language processor; it simply processes the 
hypotheses one at a time until it finds one for which 
it can produce a complete logical form interpretation 
that passes grammatical and interpretation constraints, 
at which point it discards the remaining hypotheses. 
Clearly, discarding the acoustic score and taking the 
first hypothesis that satisfies the constraints may lead 
to an interpretation that is less plausible than one deriv- 
able from a hypothesis further down in the recognition 
list. But there is no point in processing these later 
hypotheses since we will be forced to select one inter-
pretation essentially at random.
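The selection strategy just described can be sketched as follows. This is a minimal illustration rather than a description of any implemented system: the function `select_first_interpretable`, the toy interpreter `toy_interpret`, and the example hypotheses are all hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def select_first_interpretable(
    hypotheses: Iterable[str],
    interpret: Callable[[str], List[str]],
) -> Optional[Tuple[str, str]]:
    """Return the first hypothesis that has an interpretation passing all constraints.

    `hypotheses` is assumed to be ordered best-first by the recognizer. The
    acoustic scores themselves are never consulted, as in the qualitative design.
    """
    for hyp in hypotheses:
        forms = interpret(hyp)
        if forms:
            # If several forms survive, there is no principled basis for
            # preferring one, so taking the first is an arbitrary choice.
            return hyp, forms[0]
    return None


def toy_interpret(text: str) -> List[str]:
    """Hypothetical stand-in for parsing plus constraint filtering."""
    known = {
        "show me fights": ["show(fights)"],
        "show me flights": ["show(flights)"],
    }
    return known.get(text, [])


n_best = ["show me lights", "show me fights", "show me flights"]
result = select_first_interpretable(n_best, toy_interpret)
```

In this toy run the second hypothesis is accepted and the third is never examined, even though it might be the more plausible reading in context.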
Syntax The syntactic rules in $G_{syn}$ relate 'category'
predicates $c_0, c_1, c_2$ holding of a string and two spanning
substrings (we limit the rules here to two daughters for
simplicity):
$$c_0(s_0) \wedge daughters(s_0, s_1, s_2) \leftarrow c_1(s_1) \wedge c_2(s_2) \wedge (s_0 = concat(s_1, s_2))$$
(Here, and subsequently, variables like $s_0$ and $s_1$ are
implicitly universally quantified.) $G_{syn}$ also includes
lexical axioms for particular strings $w$ consisting of sin-
gle words:
$$c_1(w), \ldots$$
For a feature-based grammar, these rules can in-
clude conjuncts constraining the values, $a_1, a_2, \ldots$, of
discrete-valued functions $f$ on the strings:
$$f(w) = a_1, \quad f(s_0) = f(s_1).$$
The main problem here is that such grammars have
no notion of a degree of grammatical acceptability - a
sentence is either grammatical or ungrammatical. For
small grammars this means that perfectly acceptable
strings are often rejected; for large grammars we get a
vast number of alternative trees so the chance of select-
ing the correct tree for simple sentences can get worse
as the grammar coverage increases. There is also the
problem of requiring increasingly complex feature sets
to describe idiosyncrasies in the lexicon.
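The all-or-nothing behavior of such a grammar can be seen in a small sketch. It assumes binary rules and lexical axioms as above and uses a CKY-style chart, which is a standard technique chosen here for illustration and not something the logic-based design prescribes; the function `categories` and the toy lexicon and rules are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def categories(
    words: List[str],
    lexicon: Dict[str, Set[str]],
    rules: Dict[Tuple[str, str], Set[str]],
) -> Set[str]:
    """Return the categories derivable for the whole string (empty set if none)."""
    n = len(words)
    if n == 0:
        return set()
    chart: Dict[Tuple[int, int], Set[str]] = {}
    # Lexical axioms: categories of single-word strings.
    for i, w in enumerate(words):
        chart[(i, i + 1)] = set(lexicon.get(w, set()))
    # Binary rules: c0 holds of a span if c1 and c2 hold of two adjacent subspans.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            k = i + width
            cell: Set[str] = set()
            for j in range(i + 1, k):
                for c1 in chart[(i, j)]:
                    for c2 in chart[(j, k)]:
                        cell |= rules.get((c1, c2), set())
            chart[(i, k)] = cell
    return chart[(0, n)]


toy_lexicon = {"dogs": {"NP"}, "bark": {"VP"}}
toy_rules = {("NP", "VP"): {"S"}}

accepted = categories(["dogs", "bark"], toy_lexicon, toy_rules)
rejected = categories(["bark", "dogs"], toy_lexicon, toy_rules)
```

The result is either a set of categories or nothing at all; there is no score that would let a near-miss string be ranked above a hopeless one.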
Semantics Semantic grammar axioms belonging to
$G_{sem}$ specify a 'composition' function $g$ for deriving a
logical form for a phrase from those for its subphrases:
$$form(s_0, g(\phi_1, \phi_2)) \leftarrow daughters(s_0, s_1, s_2) \wedge c_1(s_1) \wedge c_2(s_2) \wedge c_0(s_0) \wedge form(s_1, \phi_1) \wedge form(s_2, \phi_2)$$
The interpretation rules for strings bottom out in a set
of lexical semantic rules associating words with pred-
icates ($p_1, p_2, \ldots$) corresponding to 'word senses'. For
a particular word and syntactic category, there will be
a (small, possibly empty) finite set of such word sense
predicates:
$$c_i(w) \rightarrow form(w, p_i^1), \quad \ldots, \quad c_i(w) \rightarrow form(w, p_i^m).$$
First order logic was assumed as the semantic repre-
sentation language because it comes with well under-
stood, if not very practical, inferential machinery for
constraint solving. However, applying this machinery
requires making logical forms fine grained to a degree
often not warranted by the information the speaker of
an utterance intended to convey. An example of this is
explicit scoping which leads (again) to large numbers of
alternatives which the qualitative model has difficulty
choosing between. Also, many natural language sen-
tences cannot be expressed in first order logic without
resort to elaborate formulas requiring complex seman-
tic composition rules. These rules can be simplified by
using a higher order logic but at the expense of even
less practical inferential machinery.
In applying the grammar in generation we are 
faced with the problem of balancing over and under- 
generation by tweaking grammatical constraints, there 
being no way to prefer fully grammatical target sen- 
tences over more marginal ones. Qualitative approaches 
to grammar tend to emphasize the ability to capture
generalizations as the main measure of success in lin-
guistic modeling. This might explain why producing
appropriate lexical collocations is rarely addressed seri-
ously in these models, even though lexical collocations
are important for fluent generation. The study of col-
locations for generation fits in more naturally with sta-
tistical techniques, as illustrated by Smadja and McK-
eown (1990).
Interpretation 
In the logic-based model, interpretation is the process
of identifying, from the possible interpretations $\phi$ of $s$ for
which $form(s, \phi)$ holds, ones that are consistent with the
context of interpretation. We can state this as follows:
$$R \cup S \cup A \models \phi.$$
Here, we have separated the context into a contingent
set of contextual propositions $S$ and a set $R$ of (mono-
lingual) 'meaning postulates', or selectional restrictions,
that constrain the word sense predicates in all contexts.
$A$ is a set of assumptions sufficient to support the in-
terpretation $\phi$ given $S$ and $R$. In other words, this is
'interpretation as abduction' (Hobbs et al. 1988), since
abduction, not deduction, is needed to arrive at the
assumptions $A$.
The most common types of meaning postulates in $R$
are those for restriction, hyponymy, and disjointness,
expressed as follows:
$p_1(x_1, x_2) \rightarrow p_2(x_1)$   restriction;
$p_2(x) \rightarrow p_3(x)$   hyponymy;
$\neg(p_3(x) \wedge p_4(x))$   disjointness.
Although there are compilation techniques (e.g. Mel-
lish 19~) which allow selectional constraints stated in
this fashion to be implemented efficiently, the scheme
is problematic in other respects. To start with, the as-
sumption of a small set of senses for a word is at best
awkward because it is difficult to arrive at an optimal
granularity for sense distinctions. Disambiguation with
selectional restrictions expressed as meaning postulates
is also problematic because it is virtually impossible to
devise a set of postulates that will always filter all but
one alternative. We are thus forced to under-filter and
make an arbitrary choice between remaining alterna- 
tives. 
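A rough sketch of sense filtering with hyponymy and disjointness postulates is given below. It is a propositional simplification that assumes unary sense predicates applied to a single entity, and it is not the compilation technique cited above; the function names and the toy postulates about the word "date" are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def closure(sense: str, hyponymy: Dict[str, List[str]]) -> Set[str]:
    """All predicates implied by `sense` via hyponymy postulates p(x) -> q(x)."""
    seen = {sense}
    stack = [sense]
    while stack:
        p = stack.pop()
        for q in hyponymy.get(p, ()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def consistent(required: str, sense: str,
               hyponymy: Dict[str, List[str]],
               disjoint: Iterable[Tuple[str, str]]) -> bool:
    """True if `required` and `sense` can hold of one entity without
    violating a disjointness postulate not(p(x) & q(x))."""
    implied = closure(required, hyponymy) | closure(sense, hyponymy)
    return not any(a in implied and b in implied for a, b in disjoint)


def filter_senses(senses: List[str], required: str,
                  hyponymy: Dict[str, List[str]],
                  disjoint: List[Tuple[str, str]]) -> List[str]:
    """Keep the senses compatible with a restriction on the argument position."""
    return [s for s in senses if consistent(required, s, hyponymy, disjoint)]


toy_hyponymy = {
    "date_fruit": ["food"], "date_day": ["time"],
    "food": ["physical"], "time": ["abstract"],
}
toy_disjoint = [("physical", "abstract")]

# A restriction such as eat(x1, x2) -> food(x2) requires the object to be food.
kept = filter_senses(["date_fruit", "date_day"], "food", toy_hyponymy, toy_disjoint)
# A weaker restriction filters nothing, leaving an arbitrary choice.
unfiltered = filter_senses(["date_fruit", "date_day"], "entity", toy_hyponymy, toy_disjoint)
```

The second call shows the under-filtering problem: when the postulates do not separate the senses, both survive and nothing in the qualitative model ranks them.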
Logic based translation 
In both the quantitative and qualitative models we take
a transfer approach to translation. We do not depend
on interlingual symbols, but instead map a representa-
tion with constants associated with the source language
into a corresponding expression with constants from the
target language. For the qualitative model, the opera-
tive notion of correspondence is based on logical equiva-
lence and the constants are source word sense predicates
$p_1, p_2, \ldots$ and target sense predicates $q_1, q_2, \ldots$.
More specifically, we will say the translation relation
between a source logical form $\phi_s$ and a target logical
form $\phi_t$ holds if we have
$$R \cup S \cup A' \models (\phi_s \leftrightarrow \phi_t)$$
where $R$ is a set of monolingual and bilingual mean-
ing postulates, and $S$ is a set of formulas characterizing
the current context. $A'$ is a set of assumptions that
includes the assumptions $A$ which supported $\phi_s$. Here
bilingual meaning postulates are first order axioms re-
lating source and target sense predicates. A typical
bilingual postulate for translating between $p_1$ and $q_1$
might be of the form:
$$p_5(x_1) \rightarrow (p_1(x_1, x_2) \leftrightarrow q_1(x_1, x_2)).$$
The need for the assumptions $A'$ arises when a source
language word is vaguer than its possible translations
in the target language, so different choices of target
words will correspond to translations under different
assumptions. For example, the condition $p_5(x_1)$ above
might be proved from the input logical form, or it might
need to be assumed.
In the general case, finding solutions (i.e. $A'$, $\phi_t$ pairs)
for the abductive schema is an undecidable theorem 
proving problem. This can be alleviated by placing re- 
strictions on the form of meaning postulates and input 
formulas and using heuristic search methods. Although 
such an approach was applied with some success in 
a limited-domain system translating logical forms into 
database queries (Rayner and Alshawi 1992), it is likely 
to be impractical for language translation with tens of 
thousands of sense predicates and related axioms. 
Setting aside the intractability issue, this approach 
does not offer a principled way of choosing between al- 
ternative solutions proposed by the prover. One would 
like to prefer solutions with 'minimal' sets of assump- 
tions, but it is difficult to find motivated definitions for 
this minimization in a purely qualitative framework. 
4. Quantitative Model Components 
Moving to a Quantitative Model 
In moving to a quantitative architecture, we propose to 
retain many of the basic characteristics of the qualita- 
tive model: 
• A transfer organization with analysis, transfer, and 
generation components. 
• Monolingual models that can be used for both anal- 
ysis and generation. 
• Translation models that exclusively code contrastive 
(cross-linguistic) information. 
• Hierarchical phrases capturing recursive linguistic 
structure. 
Instead of feature based syntax trees and first-order 
logical forms we will adopt a simpler, monostratal rep- 
resentation that is more closely related to those found 
in dependency grammars (e.g. Hudson 1984). Depen- 
dency representations have been used in large scale 
qualitative machine translation systems, notably by 
McCord (1988). The notion of a lexical 'head' of a 
phrase is central to these representations because they 
concentrate on relations between such lexical heads. In 
our case, the dependency representation is monostratal 
in that the relations may include ones normally classi-
fied as belonging to syntax, semantics or pragmatics.
One salient property of our language model is that it 
is strongly lexical: it consists of statistical parameters 
associated with relations between lexical items and the 
number and ordering of dependents of lexical heads. 
This lexical anchoring facilitates statistical training and 
sensitivity to lexical variation and collocations. In order 
to gain the benefits of probabilistic modeling, we replace 
the task of developing large rule sets with the task of 
estimating large numbers of statistical parameters for 
the monolingual and translation models. This gives rise 
to a new cost trade-off in human annotation/judgement 
versus barely tractable fully automatic training. It also 
necessitates further research on lexical similarity and 
clustering (e.g. Pereira, Tishby and Lee 1993, Dagan, 
Marcus and Markovitch 1993) to improve parameter 
estimation from sparse data. 
Translation via Lexical Relation Graphs 
The model associates phrases with relation graphs. A 
relation graph is a directed labeled graph consisting of 
a set of relation edges. Each edge has the form of an 
atomic proposition 
$$r(w_i, w_j)$$
where $r$ is a relation symbol, $w_i$ is the lexical head of
a phrase and $w_j$ is the lexical head of another phrase
(typically a subphrase of the phrase headed by $w_i$). The
nodes $w_i$ and $w_j$ are word occurrences representable by
a word and an index, the indices uniquely identifying 
particular occurrences of the words in a discourse or 
corpus. The set of relation symbols is open ended, but 
the first argument of the relation is always interpreted 
as the head and the second as the dependent with re- 
spect to this relation. The relations in the models for 
the source and target languages need not be the same,
or even overlap. To keep the language models simple, 
we will mainly restrict ourselves here to dependency 
graphs that are trees with unordered siblings. In partic- 
ular, phrases will always be contiguous strings of words 
and dependents will always be heads of subphrases. 
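As a concrete picture of this representation, the following sketch encodes word occurrences and relation edges and checks the tree restriction. The class names `Node` and `Edge`, the function `is_tree`, and the relation labels `subj` and `obj` are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class Node:
    word: str
    index: int  # identifies this occurrence of the word in the discourse or corpus


@dataclass(frozen=True)
class Edge:
    relation: str    # relation symbol r, drawn from an open-ended set
    head: Node       # first argument: the head
    dependent: Node  # second argument: the dependent


def is_tree(edges: FrozenSet[Edge], root: Node) -> bool:
    """True if every node other than `root` has exactly one head and is
    reachable from `root` (sibling order is not represented at all)."""
    nodes: Set[Node] = {root}
    children: Dict[Node, List[Node]] = {}
    dependents: List[Node] = []
    for e in edges:
        nodes.update((e.head, e.dependent))
        children.setdefault(e.head, []).append(e.dependent)
        dependents.append(e.dependent)
    if root in dependents or len(dependents) != len(set(dependents)):
        return False
    seen: Set[Node] = set()
    stack = [root]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(children.get(n, []))
    return seen == nodes


john, likes, mary = Node("John", 1), Node("likes", 2), Node("Mary", 3)
graph = frozenset({Edge("subj", likes, john), Edge("obj", likes, mary)})
```

Storing edges in a set reflects the fact that siblings are unordered; surface order is left to the generation model.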
Ignoring algorithmic issues relating to compactly rep- 
resenting and efficiently searching the space of alterna- 
tive hypotheses, the overall design of the quantitative 
system is as follows. The speech recognizer produces 
a set of word-position hypotheses (perhaps in the form 
of a word lattice) corresponding to a set of string hy- 
potheses for the input. The source language model is 
used to compute a set of possible relation graphs, with 
associated probabilities, for each string hypothesis. A 
probabilistic graph translation model then provides, for 
each source relation graph, the probabilities of deriving 
corresponding graphs with word occurrences from the 
target language. These target graphs include all the 
words of possible translations of the utterance hypothe- 
ses but do not specify the surface order of these words. 
Probabilities for different possible word orderings are 
computed according to ordering parameters which form 
part of the target language model. 
In the following section we explain how the probabil- 
ities for these various processing stages are combined to 
select the most likely target word sequence. This word 
sequence can then be handed to the speech synthesizer. 
For tighter integration between generation and synthe-
sis, information about the derivation of the target ut-
terance can also be passed to the synthesizer.
Integrated Statistical Model 
The probabilities associated with phrases in the above
description are computed according to the statistical
models for analysis, translation, and generation. In this
section we show the relationship between these mod-
els to arrive at an overall statistical model of speech
translation. We are not considering training issues in
this paper, though a number of now familiar techniques 
ranging from methods for maximum likelihood estima- 
tion to direct estimation using fully annotated data are 
applicable. 
The objects involved in the overall model are as fol-
lows (we omit target speech synthesis under the as-
sumption that it proceeds deterministically from a tar-
get language word string):
• $A_s$: (acoustic evidence for) source language speech
• $W_s$: source language word string
• $W_t$: target language word string
• $C_s$: source language relation graph
• $C_t$: target language relation graph
Given a spoken input in the source language, we 
wish to find a target language string that is the most 
likely translation of the input. We are thus interested
in the conditional probability of $W_t$ given $A_s$. This
conditional probability can be expressed as follows (cf.
Chang and Su 1993):
$$P(W_t \mid A_s) = \sum_{W_s, C_s, C_t} P(W_s \mid A_s)\, P(C_s \mid W_s, A_s)\, P(C_t \mid C_s, W_s, A_s)\, P(W_t \mid C_t, C_s, W_s, A_s).$$
We now apply some simplifying independence as-
sumptions concerning relation graphs. Specifically, that
their derivation from word strings is independent of 
acoustic information; that their translation is indepen- 
dent of the original words and acoustics involved; and 
that target word string generation from target relation 
edges is independent of the source language representa-
tions. The extent to which these (Markovian) assump-
tions hold depends on the extent to which relation edges
represent all the relevant information for translation. 
In particular it means they should express aspects of 
surface relevant to meaning, such as topicalization, as 
well as predicate argument structure. In any case, the 
simplifying assumptions give the following: 
$$P(W_t \mid A_s) \simeq \sum_{W_s, C_s, C_t} P(W_s \mid A_s)\, P(C_s \mid W_s)\, P(C_t \mid C_s)\, P(W_t \mid C_t).$$
This can be rewritten with two applications of Bayes'
rule:
$$\sum_{W_s, C_s, C_t} P(A_s \mid W_s)\, (1/P(A_s))\, P(W_s \mid C_s)\, P(C_s)\, P(C_t \mid C_s)\, P(W_t \mid C_t).$$
Since $A_s$ is given, $1/P(A_s)$ is a constant which can be
ignored in finding the maximum of $P(W_t \mid A_s)$. Deter-
mining $W_t$ that maximizes $P(W_t \mid A_s)$ therefore involves
the following factors:
• $P(A_s \mid W_s)$: source language acoustics
• $P(W_s \mid C_s)$: source language generation
• $P(C_s)$: source content relations
• $P(C_t \mid C_s)$: source to target transfer
• $P(W_t \mid C_t)$: target language generation
We assume that the speech recognizer provides acous-
tic scores proportional to $P(A_s \mid W_s)$ (or logs thereof).
Such scores are normally computed by speech recogni-
tion systems, although they are usually also multiplied
by word-based language model probabilities $P(W_s)$
which we do not require in this application context.
Our approach to language modeling, which covers the
content analysis and language generation factors, is pre-
sented in section 5 and the transfer probabilities fall
under the translation model of section 6.
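How the five factors combine can be sketched in a few lines. The sketch assumes that each candidate derivation, that is each combination of source string, source graph, target graph, and target string, already comes with the log of each factor; it says nothing about how those values are estimated or how the space is searched. The names `Derivation`, `log_sum_exp`, and `best_translation`, and all numbers, are made up for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Derivation(NamedTuple):
    target_words: str          # W_t
    log_acoustic: float        # log P(A_s | W_s), up to an additive constant
    log_src_generation: float  # log P(W_s | C_s)
    log_src_content: float     # log P(C_s)
    log_transfer: float        # log P(C_t | C_s)
    log_tgt_generation: float  # log P(W_t | C_t)


def log_sum_exp(values: List[float]) -> float:
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def best_translation(derivations: Iterable[Derivation]) -> Tuple[str, Dict[str, float]]:
    """Sum over W_s, C_s, C_t for each W_t, then pick the W_t with the largest total."""
    by_target: Dict[str, List[float]] = defaultdict(list)
    for d in derivations:
        by_target[d.target_words].append(sum(d[1:]))  # product of the five factors
    scores = {w: log_sum_exp(v) for w, v in by_target.items()}
    return max(scores, key=scores.get), scores


def logs(*probs: float) -> List[float]:
    return [math.log(p) for p in probs]


candidates = [
    Derivation("target A", *logs(0.5, 0.4, 0.1, 0.6, 0.9)),
    Derivation("target A", *logs(0.3, 0.5, 0.1, 0.6, 0.9)),
    Derivation("target B", *logs(0.5, 0.4, 0.1, 0.3, 0.8)),
]
best, scores = best_translation(candidates)
```

Working in log space keeps the products of many small probabilities from underflowing, and it matches the remark that recognizers may supply logs of the acoustic scores.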
Finally note that by another application of Bayes'
rule we can replace the two factors $P(C_s) P(C_t \mid C_s)$ by
$P(C_t) P(C_s \mid C_t)$ without changing other parts of the
model. This latter formulation allows us to apply con-
straints imposed by the target language model to fil-
ter inappropriate possibilities suggested by analysis and
transfer. In some respects this is similar to Dagan and
Itai's (1994) approach to word sense disambiguation us-
ing statistical associations in a second language.
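For reference, the replacement rests on both products being equal to the joint probability of the two graphs. Written out with the same symbols as the factor list above, and offered only as a sketch of the substitution, the quantity to be maximized becomes:

```latex
\begin{align*}
P(C_s)\,P(C_t \mid C_s) &= P(C_s, C_t) = P(C_t)\,P(C_s \mid C_t), \\
\max_{W_t} P(W_t \mid A_s) &\;\propto\; \max_{W_t} \sum_{W_s, C_s, C_t}
  P(A_s \mid W_s)\,P(W_s \mid C_s)\,P(C_s \mid C_t)\,P(C_t)\,P(W_t \mid C_t).
\end{align*}
```

In this form $P(C_t)$ is the content model of the target language, so target-side preferences enter the score directly.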
5. Language Models 
Language Production Model 
Our language model can be viewed in terms of a proba-
bilistic generative process based on the choice of lexical
"heads" of phrases and the recursive generation of sub-
phrases and their ordering. For this purpose, we can de-
fine the head word of a phrase to be the word that most
strongly influences the way the phrase may be com-
bined with other phrases. This notion has been central
to a number of approaches to grammar for some time,
including theories like dependency grammar (Hudson
1976, 1990) and HPSG (Pollard and Sag 1987). More
recently, the statistical properties of associations be-
tween words, and more particularly heads of phrases,
have become an active area of research (e.g. Chang, Luo,
and Su 1992; Hindle and Rooth 1993).
The language model factors the statistical derivation
of a sentence with word string $W$ as follows:
$$P(W) = \sum_C P(C)\, P(W \mid C)$$
where $C$ ranges over relation graphs. The content
model, $P(C)$, and generation model, $P(W \mid C)$, are com-
ponents of the overall statistical model for spoken lan-
guage translation given earlier. This decomposition of $P(W)$
can be viewed as first deciding on the content of
a sentence, formulated as a set of relation edges accord-
ing to a statistical model for $P(C)$, and then deciding
on word order according to $P(W \mid C)$.
Of course, this decomposition simplifies the realities 
of language production in that real language is always 
generated in the context of some situation S (real or 
imaginary), so a more comprehensive model would be 
concerned with $P(C \mid S)$, i.e. language production in
context. This is less important, however, in the trans-
lation setting since we produce $C_t$ in the context of a
source relation graph $C_s$, and we assume the availability
of a model for $P(C_t \mid C_s)$.
Content Derivation Model 
The model for deriving the relation graph of a phrase 
is taken to consist of choosing a lexical head h0 for the 
phrase (what the phrase is 'about') followed by a series 
of 'node expansion' steps. An expansion step takes a 
node and chooses a possibly empty set of edges (relation 
labels and ending nodes) starting from that node. Here 
we consider only the case of relation graphs that are 
trees with unordered siblings. 
To start with, let us take the simplified case where a
head word $h$ has no optional or duplicated dependents
(i.e. exactly one for each relation). There will be a set
of edges
$$E(h) = \{r_1(h, w_1), r_2(h, w_2), \ldots, r_k(h, w_k)\}$$
corresponding to the local tree rooted at $h$ with depen-
dent nodes $w_1 \ldots w_k$. The set of relation edges for the
entire derivation is the union of these local edge sets.
To determine the probability of deriving a relation
graph $C$ for a phrase headed by $h_0$ we make use of
parameters ('dependency parameters')
$$P(r(h, w) \mid h, r)$$
for the probability, given a node $h$ and a relation $r$,
that $w$ is an $r$-dependent of $h$. Under the assumption
that the dependents of a head are chosen independently
from each other, the probability of deriving $C$ is:
$$P(C) = P(Top(h_0)) \prod_{r(h, w) \in C} P(r(h, w) \mid h, r)$$
where $P(Top(h_0))$ is the probability of choosing $h_0$ to
start the derivation.
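For the simplified case, the product can be computed as in the sketch below. It assumes the parameters are available as lookup tables; the function `log_prob_graph`, the table layout, and the probability values are hypothetical.

```python
import math
from typing import Dict, Iterable, Tuple

# An edge r(h, w) is written as the triple (relation, head, dependent).
EdgeT = Tuple[str, str, str]


def log_prob_graph(
    root: str,
    edges: Iterable[EdgeT],
    top_prob: Dict[str, float],
    dep_prob: Dict[EdgeT, float],
) -> float:
    """log P(C) = log P(Top(h0)) + sum over edges of log P(r(h, w) | h, r).

    `dep_prob[(r, h, w)]` stands for the dependency parameter P(r(h, w) | h, r).
    """
    logp = math.log(top_prob[root])
    for edge in edges:
        logp += math.log(dep_prob[edge])
    return logp


toy_top = {"likes": 0.01}
toy_dep = {("subj", "likes", "John"): 0.02, ("obj", "likes", "Mary"): 0.05}
toy_edges = [("subj", "likes", "John"), ("obj", "likes", "Mary")]

logp = log_prob_graph("likes", toy_edges, toy_top, toy_dep)
```

With these made-up values the graph probability is 0.01 × 0.02 × 0.05, that is $10^{-5}$.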
If we now remove the assumption made earlier that 
there is exactly one r-dependent of a head, we need to 
elaborate the derivation model to include choosing the 
number of such dependents. We model this by param- 
eters 
P(N(r,n)lh) 
6 
that is, the I)rol)aifility that head h h+~ n r-dep(m(lents. 
We will r,ffer t,o t,|lis I)robability ;m a '(let, all parameter'. 
Our previous assmnption amounted to stating that this 
was always 1 for n = 1 or for n = 0. Detail parameters 
allow us to model, for example, the number of adjectival 
modifiers of a noun or the 'degree' to which a particular 
argument of a verb is optional. The probability of an 
expansion of h giving rise to local edges E(h) is now: 
$P(E(h) \mid h) = \prod_r P(N(r, n_r) \mid h) \, k(n_r) \prod_{1 \le i \le n_r} P(r(h, w_i^r) \mid h, r)$
where r ranges over the set of relation labels and h has
$n_r$ r-dependents $w_1^r \ldots w_{n_r}^r$. $k(n_r)$ is a combinatoric
constant for taking account of the fact that we are not
distinguishing permutations of the dependents (e.g. there
are $n_r!$ permutations of the r-dependents of h if these
dependents are all distinct).
So if h0 is the root of a tree C, we have 
$P(C) = P(Top(h_0)) \prod_{h \in heads(C)} P(E_C(h) \mid h)$
where heads(C) is the set of nodes in C and $E_C(h)$ is
the set of edges headed by h in C.
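
The following sketch shows one way the tree probability with detail parameters could be evaluated. It rests on several assumptions that go beyond the text: nodes are identified by their words (so each word occurs once), the combinatoric constant is taken to be the number of distinct orderings of the multiset of r-dependents (which equals n_r! when they are all distinct), and the names `orderings`, `expansion_probability` and `tree_probability` as well as all numbers are hypothetical.

```python
from collections import Counter
from math import factorial, prod

def orderings(words):
    """Assumed form of k(n_r): distinct orderings of a multiset of dependents."""
    counts = Counter(words)
    return factorial(len(words)) // prod(factorial(c) for c in counts.values())

def expansion_probability(h, deps, relations, p_detail, p_dep):
    """P(E(h) | h), where deps maps a relation label to its list of dependents."""
    total = 1.0
    for r in relations:
        ws = deps.get(r, [])
        # detail parameter P(N(r, n_r) | h) times the combinatoric constant
        total *= p_detail[(h, r, len(ws))] * orderings(ws)
        # dependency parameters P(r(h, w) | h, r) for each r-dependent
        total *= prod(p_dep[(r, h, w)] for w in ws)
    return total

def tree_probability(root, tree, relations, p_top, p_detail, p_dep):
    """P(C) = P(Top(h0)) * product over nodes h of P(E_C(h) | h)."""
    return p_top[root] * prod(
        expansion_probability(h, deps, relations, p_detail, p_dep)
        for h, deps in tree.items()
    )

# Toy example: "a cheap direct flight", with invented parameter values.
relations = ["det", "mod"]
tree = {"flight": {"det": ["a"], "mod": ["cheap", "direct"]},
        "a": {}, "cheap": {}, "direct": {}}
p_top = {"flight": 0.01}
p_detail = {("flight", "det", 1): 0.8, ("flight", "mod", 2): 0.1}
for leaf in ("a", "cheap", "direct"):
    for r in relations:
        p_detail[(leaf, r, 0)] = 1.0  # leaves take no dependents in this toy model
p_dep = {("det", "flight", "a"): 0.5,
         ("mod", "flight", "cheap"): 0.2,
         ("mod", "flight", "direct"): 0.1}

p_tree = tree_probability("flight", tree, relations, p_top, p_detail, p_dep)
```

In this toy case the two distinct modifiers contribute a factor of 2 through the combinatoric constant, since the unordered tree does not distinguish which modifier was chosen first.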
The above formulation is only an approximation for 
relation graphs that are not trees because the indepen- 
dence assumptions which allow the dependency param- 
eters to be simply multiplied together no longer hold 
for the general case. Dependency graphs with cycles do 
arise as the most natural analyses of certain linguistic 
constructions, but calculating their probabilities on a 
node by node basis as above may still provide proba- 
bility estimates that are accurate enough for practical 
purposes. 
Generation Model 
We now return to the generation model $P(W \mid C)$. As
mentioned earlier, since C includes the words in W and 
a set of relations between them, the generation model 
is concerned only with surface order. One possibility is 
to use 'bi-relation' parameters for the probability that 
an $r_i$-dependent immediately follows an $r_j$-dependent.
This approach is problematic for our overall statistical
model because such parameters are not independent
from the 'detail' parameters specifying the number of 
r-dependents of a head. 
We therefore adopt the use of 'sequencing' parame- 
ters, these being probabilities of particular orderings of 
dependents given that the multiset of dependency rela- 
tions is known. We let the identity relation e stand for 
the head itself. Specifically, we have parameters 
$P(s \mid M(s))$
where s is a sequence of relation labels including an oc- 
currence of e and M(s) is the multiset for this sequence. 
For a head h in a relation graph C, let $s_{WCh}$ be the sequence
of dependent relations induced by a particular
word string W generated from C. We now have
$P(W \mid C) = \prod_{h \in W} \left( \prod_r \frac{1}{n_r!} \right) P(s_{WCh} \mid M(s_{WCh}))$
where h ranges over all the heads in C, and $n_r$ is the
number of occurrences of r in $s_{WCh}$, assuming that all
orderings of the $n_r$ r-dependents are equally likely. We can
thus use these sequencing parameters directly in our 
overall model. 
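
A minimal sketch of the ordering probability follows, assuming that each head is associated with the sequence of relation labels observed in the word string (with 'e' marking the head itself) and that the sequencing parameters are stored in a dictionary keyed by that sequence. Since M(s) is determined by s, the dictionary value for s stands for P(s | M(s)). The function name and the numbers are hypothetical.

```python
from collections import Counter
from math import factorial, prod

def ordering_probability(sequences, p_seq):
    """P(W | C) as a product over heads.

    sequences maps each head to its tuple of relation labels in surface order;
    p_seq[s] is the sequencing parameter P(s | M(s)).
    """
    total = 1.0
    for h, s in sequences.items():
        counts = Counter(s)  # n_r for each label r occurring in s
        # all orderings of the n_r r-dependents are assumed equally likely
        uniform = prod(1 / factorial(n) for n in counts.values())
        total *= uniform * p_seq[s]
    return total

# Toy example for the string "a cheap direct flight" (invented values).
sequences = {
    "flight": ("det", "mod", "mod", "e"),
    "a": ("e",), "cheap": ("e",), "direct": ("e",),
}
p_seq = {("det", "mod", "mod", "e"): 0.9, ("e",): 1.0}

p_order = ordering_probability(sequences, p_seq)
```

Here the two 'mod' dependents contribute a factor of 1/2!, reflecting the uniform choice of which modifier fills which 'mod' position.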
To summarize, our monolingual models are specified
by:
• topmost head parameters $P(Top(h))$
• dependency parameters $P(r(h, w) \mid h, r)$
• detail parameters $P(N(r, n) \mid h)$
• sequencing parameters $P(s \mid M(s))$
The overall model splits the contributions of content $P(C)$
and ordering $P(W \mid C)$. However, we may also
want a model for $P(W)$, for example for pruning speech
recognition hypotheses. Combining our content and ordering
models we get:
$P(W) = \sum_C P(C) P(W \mid C)$
$= \sum_C P(Top(h_C)) \prod_{h \in W} P(s_{WCh} \mid h) \prod_{r(h,w) \in E_C(h)} P(r(h, w) \mid h, r)$
The parameters $P(s \mid h)$ can be derived by combining
sequencing parameters with the detail parameters for
h.
6. Translation Model 
Mapping Relation Graphs 
As already mentioned, the translation model defines
mappings between relation graphs $C_s$ for the source
language and $C_t$ for the target language. A direct
(though incomplete) justification of translation via relation
graphs may be based on a simple referential view
of natural language semantics. Thus nominals and 
their modifiers pick out entities in a (real or imagi- 
nary) world, verbs and their modifiers refer to actions 
or events in which the entities participate in roles in- 
dicated by the edge relations. Under this view, the 
purpose of the translation mapping is to determine a
target language relation graph that provides the best 
approximation to the referential function induced by 
the source relation graph. We call this approximating 
referential equivalence. 
This referential view of semantics is not adequate for 
taking account of much of the complexity of natural 
language including many aspects of quantification, dis- 
tributivity and modality. This means it cannot capture 
some of the subtleties that a theory based on logical 
equivalence might be expected to. On the other hand, 
when we proposed a logic based approach as our quali- 
tative model, we had to restrict it to a simple first order 
logic anyway for computational reasons, and even then 
it did not appear to be practical. Thus using the more 
impoverished lexical relations representation may not
be costing us much in practice.
One aspect of the representation that is particularly 
useful in the translation application is its convenience 
for partial and/or incremental representation of content:
we can refine the representation by the addition of further
edges. A fully specified denotation of the meaning
of a sentence is rarely required for translation, and as
we pointed out when discussing logic representations, a
complete specification may not have been intended by
the speaker. Although we have not provided a denotational
semantics for sets of relation edges, we anticipate
that this will be possible along the lines developed in
monotonic semantics (Alshawi and Crouch 1992).
Translation Parameters 
To be practical, a model for $P(C_t \mid C_s)$ needs to decompose
the source and target graphs $C_s$ and $C_t$ into subgraphs
small enough that subgraph translation parameters
can be estimated. We do this with the help of 'node
alignment relations' between the nodes of these graphs.
These alignment relations are similar in some respects
to the alignments used by Brown et al. (1990) in their
surface translation model. The translation probability
is then the sum of probabilities over different alignments
f:
$P(C_t \mid C_s) = \sum_f P(C_t, f \mid C_s)$.
There are different ways to model $P(C_t, f \mid C_s)$ corresponding
to different kinds of alignment relations and
different independence assumptions about the translation
mapping.
For our quantitative design, we adopt a simple model
in which lexical and relation (structural) probabilities
are assumed to be independent. In this model the alignment
relations are functions from the word occurrence
nodes of $C_t$ to the word occurrences of $C_s$. The idea
is that $f(v_j) = w_i$ means that the source word occurrence
$w_i$ 'gave rise' to the target word occurrence $v_j$.
The inverse relation $f^{-1}$ need not be a function, allowing
different numbers of words in the source and target
sentences.
We decompose $P(C_t, f \mid C_s)$ into 'lexical' and 'structural'
probabilities as follows:
$P(C_t, f \mid C_s) = P(N_t, f \mid N_s) \, P(E_t \mid N_t, f, C_s)$
where $N_t$ and $N_s$ are the node sets for $C_t$ and $C_s$ respectively,
and $E_t$ is the set of edges for the target graph.
The first factor $P(N_t, f \mid N_s)$ is the lexical component
in that it does not take into account any of the relations
in the source graph $C_s$. This lexical component is the
product of alignment probabilities for each node of $N_s$:
$P(N_t, f \mid N_s) = \prod_{w_i \in N_s} P(f^{-1}(w_i) = \{v_i^1 \ldots v_i^k\} \mid w_i)$.
That is, the probability that f maps exactly the (possibly
empty) subset $\{v_i^1 \ldots v_i^k\}$ of $N_t$ to $w_i$. These sets are
assumed to be disjoint for different source graph nodes,
so we can replace the factors in the above product with
parameters:
$P(M \mid w)$
where w is a source language word and M is a multiset
of target language words.
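
To make the lexical component concrete, the sketch below groups target nodes by the source node they are aligned to and multiplies the corresponding multiset parameters. The representation of a multiset as a sorted tuple of words, the node identifiers, the function name `lexical_probability` and the parameter values are all assumptions introduced for illustration.

```python
from collections import defaultdict
from math import prod

def lexical_probability(source_words, target_words, alignment, p_lex):
    """Product over source nodes w of P(M | w).

    source_words / target_words map node ids to words;
    alignment maps each target node id to a source node id (the function f);
    p_lex maps (source word, sorted tuple of target words) to P(M | w).
    """
    aligned = defaultdict(list)
    for v, w in alignment.items():
        aligned[w].append(target_words[v])
    return prod(
        p_lex[(word, tuple(sorted(aligned[w])))]  # () is the empty multiset
        for w, word in source_words.items()
    )

# Toy example with invented values: "not" gives rise to two target words,
# while "the" gives rise to none.
source_words = {"s1": "not", "s2": "the"}
target_words = {"t1": "ne", "t2": "pas"}
alignment = {"t1": "s1", "t2": "s1"}
p_lex = {("not", ("ne", "pas")): 0.6, ("the", ()): 0.1}

p_lexical = lexical_probability(source_words, target_words, alignment, p_lex)
```

Because the alignment is a function from target nodes to source nodes, the sets of target words attached to different source nodes are disjoint by construction.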
We will derive a target set of edges Et of Ct by k 
derivation steps which partition the set of source edges 
$E_s$ into subgraphs $S_1 \ldots S_k$. These subgraphs give rise
to disjoint sets of relation edges T1 ... Tk which together 
form Et. The structural component of our translation 
model will be the sum of derivation probabilities for 
such an edge set Et. 
For simplicity, we assume here that the source graph 
$C_s$ is a tree. This is consistent with our earlier assumptions
about the source language model. We take our
partitions of the source graph to be the edge sets for
local trees. This ensures that the partitioning is
deterministic so the probability of a derivation is the 
product of the probabilities of derivation steps. More 
complex models with larger partitions rooted at a node 
are possible but these require additional parameters for 
partitioning. For the simple model it remains to specify 
derivation step probabilities. 
The probability of a derivation step is given by pa- 
rameters of the form: 
$P(T_i' \mid S_i', f_i')$
where $S_i'$ and $T_i'$ are unlabeled graphs and $f_i'$ is a node
alignment function from $T_i'$ to $S_i'$. Unlabeled graphs
are just like our relation edge graphs except that the 
nodes are not labeled with words (the edges still have 
relation labels). To apply a derivation step we need a 
notion of graph matching that respects edge labels: g is 
an isomorphism (modulo node labels) from a graph G 
to a graph H if g is a one-one and onto function from 
the nodes of G to the nodes of H such that 
$r(a, b) \in G$ iff $r(g(a), g(b)) \in H$.
The derivation step with parameter $P(T_i' \mid S_i', f_i')$ is
applicable to the source edges $S_i$, under the alignment
f, giving rise to the target edges $T_i$ if (i) there is an isomorphism
$h_i$ from $S_i'$ to $S_i$, (ii) there is an isomorphism
$g_i$ from $T_i$ to $T_i'$, and (iii) for any node v of $T_i$ it must be the
case that
$h_i(f_i'(g_i(v))) = f(v)$.
This last condition ensures that the target graph partitions
join up in a way that is compatible with the node
alignment f.
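
The matching conditions can be checked mechanically. The sketch below assumes that a graph is represented as a set of (relation, node, node) triples, that node sets are read off the edges (so graphs without edges are not covered), and that the candidate mappings are supplied as dictionaries rather than searched for. The names `is_isomorphism` and `step_applicable` and the toy graphs are hypothetical.

```python
def nodes(graph):
    """Nodes mentioned by the edges of a graph given as (relation, a, b) triples."""
    return {n for (_, a, b) in graph for n in (a, b)}

def is_isomorphism(g, G, H):
    """True if g is a one-one, onto map from nodes of G to nodes of H
    such that r(a, b) is in G iff r(g(a), g(b)) is in H."""
    if set(g) != nodes(G) or set(g.values()) != nodes(H):
        return False  # g must be defined on all of G and cover all of H
    if len(set(g.values())) != len(g):
        return False  # g must be one-one
    return {(r, g[a], g[b]) for (r, a, b) in G} == H

def step_applicable(S_i, T_i, S_pat, T_pat, f_pat, h, g, f):
    """Conditions (i)-(iii): h maps the source pattern onto S_i, g maps T_i onto
    the target pattern, and the pattern alignment agrees with the alignment f."""
    return (is_isomorphism(h, S_pat, S_i)
            and is_isomorphism(g, T_i, T_pat)
            and all(h[f_pat[g[v]]] == f[v] for v in nodes(T_i)))

# Toy example: a single 'obj' edge translated to a single 'obj' edge.
S_i, T_i = {("obj", "s1", "s2")}, {("obj", "t1", "t2")}
S_pat, T_pat = {("obj", "x", "y")}, {("obj", "u", "v")}
f_pat = {"u": "x", "v": "y"}
h = {"x": "s1", "y": "s2"}
g = {"t1": "u", "t2": "v"}

ok = step_applicable(S_i, T_i, S_pat, T_pat, f_pat, h, g, {"t1": "s1", "t2": "s2"})
clash = step_applicable(S_i, T_i, S_pat, T_pat, f_pat, h, g, {"t1": "s2", "t2": "s1"})
```

In the second call the node alignment swaps the two source nodes, so condition (iii) fails even though both isomorphisms hold.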
The factoring of the translation model into these
lexical and structural components means that it will 
overgenerate because these aspects are not indepen- 
dent in translation between real natural languages. It 
is therefore appropriate to filter translation hypotheses 
by rescoring according to the version of the overall statistical
model that included the factors $P(C_t) P(C_s \mid C_t)$
so that the target language model constrains the out- 
put of the translation model. Of course, in this case we 
need to model the translation relation in the 'reverse' 
direction. This can be done in a parallel fashion to the 
forward direction described above. 
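
As a rough sketch of this filtering step, candidate target graphs could be ranked by the sum of the log target language model score and the log reverse translation score. The scoring tables, the hypothesis labels and the function name `rescore` are placeholders assumed for this example; in a real system both scores would come from the models described above.

```python
import math

def rescore(hypotheses, p_target, p_reverse):
    """Rank hypotheses by log P(C_t) + log P(C_s | C_t), best first.

    p_target[c] stands for P(C_t) and p_reverse[c] for P(C_s | C_t),
    with the source graph C_s held fixed.
    """
    def score(c):
        return math.log(p_target[c]) + math.log(p_reverse[c])
    return sorted(hypotheses, key=score, reverse=True)

# Toy scores, invented for illustration.
p_target = {"A": 0.01, "B": 0.002}
p_reverse = {"A": 0.1, "B": 0.9}

ranked = rescore(["A", "B"], p_target, p_reverse)
```

Hypothesis "B" is less likely under the target language model alone but ranks first once the reverse translation probability is taken into account.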
7. Conclusions 
Our qualitative and quantitative models have a similar 
overall structure and there are clear parallels between 
the factoring of logical constraints and statistical pa- 
rameters, for example monolingual postulates and de- 
pendency parameters, bilingual postulates and trans- 
lation parameters. The parallelism would have been 
closer if we had adopted ID/LP style rules (Gazdar et 
al. 1985) in the qualitative model. However, we argued 
in section 3 that our qualitative model suffered from 
lack of robustness, from having only the crudest means 
for choosing between competing hypotheses, and from 
being computationally intractable for large vocabular- 
ies. 
The quantitative model is in a much better position 
to cope with these problems. It is less brittle because 
statistical associations have replaced constraints (feat- 
ural, selectional, etc.) that must be satisfied exactly. 
The probabilistic models give us a systematic and well 
motivated way of ranking alternative hypotheses. Com- 
putationally, the quantitative model lets us escape from 
the undecidability of logic-based reasoning. Because 
this model is highly lexical, we can hope that the in- 
put words will allow effective pruning by limiting the 
number of search paths having significantly high prob- 
abilities. 
We retained some of the basic assumptions about the 
structure of language when moving to the quantitative 
model. In particular, we preserved the notion of hierar- 
chical phrase structure. Relations motivated by depen- 
dency grammar made it possible to do this without giv- 
ing up sensitivity to lexical collocations which underpin 
simple statistical models like N-grams. The quantita- 
tive model also reduced overall complexity in terms of 
the sets of symbols used. In addition to words, it only 
required symbols for dependency relations, whereas the 
qualitative model required symbol sets for linguistic 
categories and features, and a set of word sense sym- 
bols. Despite their apparent importance to translation, 
the quantitative system can avoid the use of word sense 
symbols (and the problems of granularity they give rise 
to) by exploiting statistical associations between words 
in the target language to filter implicit sense choices. 
Finally, here is a summary of our reasons for com- 
bining statistical methods with dependency representa- 
tions in our language and translation models: 
• inherent lexical sensitivity of dependency representations,
facilitating parameter estimation;
• quantitative preference based on probabilistic derivation
and translation;
• incremental and/or partial specification of the content
of utterances, particularly useful in translation;
• decomposition of complex utterances through recursive
linguistic structure.
These factors suggest that dependency grammar will 
play an increasingly important role as language pro- 
cessing systems seek to combine both structural and 
collocational information.
Acknowledgements 
I am grateful to Fernando Pereira, Mike Riley, and Ido
Dagan for valuable discussions on the issues addressed
in this paper. Fernando Pereira and Ido Dagan also
provided helpful comments on a draft of the paper. 

References 
Alshawi, H., D. Carter, B. Gamback and M. Rayner.
1992. "Swedish-English QLF Translation". In H. Alshawi
(ed.) The Core Language Engine, Cambridge,
Mass.: MIT Press.
Alshawi, H. and R. Crouch. 1992. "Monotonic Seman- 
tic Interpretation". Proceedings of the 30th Annual 
Meeting of the Association for Computational Lin- 
guistics, Newark, Delaware. 
Alshawi, H. and D. Carter. 1994. "Training and Scaling
Preference Functions for Disambiguation". To
appear in Computational Linguistics.
Brill, E. 1993. "Automatic Grammar Induction and
Parsing Free Text: A Transformation-Based Approach".
Proceedings of the 31st Annual Meeting of
the Association for Computational Linguistics,
259-265.
Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F.
Jelinek, J. Lafferty, R. Mercer and P. Roossin. 1990.
"A Statistical Approach to Machine Translation".
Computational Linguistics 16:79-85.
Chang, J., Y. Luo, and K. Su. 1992. "GPSM: A Generalized
Probabilistic Semantic Model for Ambiguity
Resolution". Proceedings of the 30th Annual Meeting
of the Association for Computational Linguistics,
177-192.
Chang, J. and K. Su. 1993. "A Corpus-Based Statistics-
Oriented Transfer and Generation Model for Machine
Translation". Proceedings of the 5th International
Conference on Theoretical and Methodological Issues
in Machine Translation.
Dagan, I. and A. Itai. 1994. "Word Sense Disambiguation
Using a Second Language Monolingual Corpus".
To appear in Computational Linguistics.
Dagan, I., S. Marcus and S. Markovitch. 1993. "Contextual
Word Similarity and Estimation from Sparse
Data". Proceedings of the 31st Meeting of the Association
for Computational Linguistics, ACL, 164-171.
Gazdar, G., E. Klein, G.K. Pullum, and I.A. Sag. 1985.
Generalised Phrase Structure Grammar. Oxford:
Blackwell.
Hindle, D. and M. Rooth. 1993. "Structural Ambiguity
and Lexical Relations". Computational Linguistics
19:103-120.
Hobbs, J.R., M. Stickel, P. Martin and D. Edwards.
1988. "Interpretation as Abduction". Proceedings of
the 26th Annual Meeting of the Association for Computational
Linguistics, Buffalo, New York, 95-103.
Hudson, R.A. 1984. Word Grammar. Oxford: Blackwell.
Isabelle, P. and E. Macklovitch. 1986. "Transfer and
MT Modularity". Eleventh International Conference
on Computational Linguistics, Bonn, 115-117.
Jelinek, F., R.L. Mercer and S. Roukos. 1992. "Principles
of Lexical Language Modeling for Speech Recognition".
In S. Furui and M.M. Sondhi (eds.), Advances
in Speech Signal Processing, New York: Marcel
Dekker Inc.
Mellish, C.S. 1988. "Implementing Systemic Classification
by Unification". Computational Linguistics
14:40-51.
McCord, M. 1988. "A Multi-Target Machine Translation
System". Proceedings of the International
Conference on Fifth Generation Computer Systems,
Tokyo, Japan, 1141-1149.
Pereira, F., N. Tishby and L. Lee. 1993. "Distributional
Clustering of English Words". Proceedings
of the 31st Meeting of the Association for Computational
Linguistics, ACL, 183-190.
Pollard, C.J. and I.A. Sag. 1987. Information Based
Syntax and Semantics: Volume I - Fundamentals.
CSLI Lecture Notes, Number 13. Center for the
Study of Language and Information, Stanford, California.
Rayner, M. and H. Alshawi. 1992. "Deriving Database
Queries from Logical Forms by Abductive Definition
Expansion". Proceedings of the Third Conference on
Applied Natural Language Processing, Trento, Italy.
Richard, M.D. and R.P. Lippmann. 1991. "Neural Network
Classifiers Estimate Bayesian a posteriori Probabilities".
Neural Computation 3:461-483.
Shieber, S.M. 1986. An Introduction to Unification-
Based Approaches to Grammar. CSLI Lecture Notes,
Number 4. Center for the Study of Language and
Information, Stanford, California.
Smadja, F. and K. McKeown. 1990. "Automatically
Extracting and Representing Collocations for Language
Generation". In Proceedings of the 28th Annual
Meeting of the Association for Computational
Linguistics, Pittsburgh.
Taylor, L., C. Grover, and E.J. Briscoe. 1989. "The
Syntactic Regularity of English Noun Phrases". Proceedings
of the 4th European ACL Conference, 256-263.
Weaver, W. 1955. "Translation". In W. Locke and 
A. Booth (eds.), Machine Translation of Languages, 
Cambridge, Mass.: MIT Press. 
