A Statistical Theory of Dependency Syntax 
Christer Salnuelsson 
Xerox Resem:ch Centre Europe 
6, chemin de Maupertuis 
38240 Meylan, FRANCE 
Chris'cer. Samue:l.sson©xrce. xerox, com 
Abstract 
A generative statistical model of dependency syntax 
is proposed based on 'l'csniSre's classical theory. It 
provides a stochastic formalization of the original 
model of syntactic structure and augments it with 
a model of the string realization process, the latter 
which is lacking in TesniSre's original work. The 
resulting theory models crossing dependency links, 
discontinuous nuclei and string merging, and it has 
been given an etIicient computational rendering. 
1 Introduction 
The theory of dependency grammar culminated in 
the seminal book by Lncien TesniSre, (Tesnihre, 
1959), to which also today's leading scholars pay 
homage, see, e.g., (Mel'enk, 1987). Unfortunately, 
Tesnibre's book is only available in French, with a 
partial translation into German, and subsequent de- 
scriptions of his work reported in English, (Hays, 
196/1), (Gaifinan, 1965), (Robinson, 1970), ,etc., 
stray increasingly t'urther fi:om the original, see (En- 
gel, 1996) or (,15~rvinen, 1998) for an account of this. 
The first step when assigning a dependency de- 
scription to an input string is to segment the input 
string into nuclei. A nucleus can he a word, a part 
of a word, or a sequence of words and subwords, 
and these need not appear eontiguonsly in the input 
string. The best way to visualize this is perhaps the 
following: the string is tokenized into a sequence of 
tokens and each lmcleus consists of a subsequence of 
these tokens. Alternative readings may imply differ- 
ent ways of dividing the token sequence into nuclei, 
and segmenting the input string into nuclei is there- 
fore in general a nondeterministic process. 
The next step is to relate the nuclei to each other 
through dependency links, which are directed and 
typed. If there is a dependency link froln one nu- 
cleus to another, the former is called a dependent 
of the latter, and the latter a regent of the former. 
Theories of dependency syntax typically require that 
each nucleus, save a single root nucleus, is assigned 
a unique regent, and that there is no chain of de- 
pendency links that constitutes a cycle. This means 
that the dependency links establish a tree structure, 
main main 
ate ate 
~bj~d~j ~bj~su~i 
John beans beans John 
Fignre 1: Dependency trees for John ate beans. 
where each node is labeled by a nucleus. Thus, the 
label assigned to a node is a dependent of the label 
assigned to its parent node, and conversely, the label 
assigned to a node is the regent of the labels assigned 
to its child nodes. Figure 1 shows two dependency 
trees tbr the sentence John ate beans. 
In Tesni~re's dependency syntax, only the depen- 
dency strncture, not the order of the dependents, is 
represented by a dependency tree. This means that 
dependency trees are unordered, a.nd thus that the 
two trees of Figure 1 are equivalent. This also mea.ns 
that specitying the sm'face-string realization of a de- 
pendency description becomes a separate issue. 
We will model dependency desc,:iptions as two 
separate stochastic processes: one top-down process 
generating the tree structure T and one bottom-up 
process generating the surt3.ce string(s) S given the 
tree structure: 
1)(7, s) = \]'(7). p(s 1 7) 
This can be viewed as a variant of Shannon's noisy 
channel model, consisting of a  model of 
tree structures and a signal model converting trees 
to surface strings. In Section 2 we describe the top- 
down process generating tree structures and in Sec- 
tion 3 we propose a series of increasingly more so- 
phisticated bottom-up proccsses generating surl~ce 
strings, which resnlt in grammars with increasingly 
greater expressive power. Section el describes how 
the proposed stochastic model of dependency syn- 
tax was realized as a probabilistic chart parser. 
2 Generating Dependency Trees 
To describe a tree structure T, we will use a string 
notation, introduced in (Gorn, 1962), for the nodes 
684 
•/e 
mainl 
ate/I 
~bv d0bl 
John/ll beans/12 
l!'igure 2: Corn's tree notation tbr John ate beans. 
H £ S S 
e \[main\] s(1). 
\] ate \[subj,dobj\] s(ll) ate s(12) 
1:1 John ~ John 
12 beans 0 beans 
I,'igure 3: l)ependency encoding of John ale beans. 
of Clio tree, where the node name sl)ecifi0s the path 
fi'om the root no(le ~ to the node in question, 
If (Jj is a node of the tree T, 
with j C N+ and (/J E N~, 
then q5 is also a node of the trc'e T 
and ()j is a child of 4. 
ltere, N+ denotes the set of positive integers 
{1,2,...} and N~_ is the set of strings over N+. 'l'his 
lncans that the label assigned to node ()j is a de- 
pendent of the label assigned to node (J. The first 
dependency tree of Figure 1 is shown in l!'igure 2 
using this notation. 
We introduce three basic random variables, which 
incrementally generate the tree strucl;ure: 
• £(4)) = l assigns the \]al)el l to node 4), where 
l is a iitlc|etis, i.e., it; is drawn frol,-I the set of 
strings over t, he so.t of i, okens. 
• "D(Oj) = d indicates t.h(~ dep(:ndency type d link- 
ing the label of node OJ to its regent, the label 
Of node 4'" 
• V(¢,) = v indica.tes that node 7, has exactly v 
child nodes. 
Note the use of ~(~/J) = 0, rather than a partitioning 
of the labels into terlninal and nonterminal syml)ols, 
to indicate that ~ is a leaf node. 
l,et D be the (finite) set of l)ossible dependency 
types. We next introduce the composite variables 
.T(()) ranging over the power bag* N D, indicating 
the bag of dependency types of dJ's children: 
m(4,) = f = \[d,,...,dl,\] *> 
e~O P(c/,) = v A V:ic{i ..... v} D(6j) = dj 
Figure 3 encodes the dependency ti;ee ()1' Figure 2 
accordingly. We will ignore the last cohunn \['or now. 
1 A bag (mull'set) can contain several tokens of the Smlm 
type. We denote sets {...}, \]Jags \[...\] and ordered tuples {...), 
\]Jill, over\]o+'id O~ (~> etc, 
We introduce the probabilities 
s'~(~) - 
: P(~(0 : l) 
P~ if, j) = 
= l'(10j) = (~ I ~(4,) = t,~D(4,J) = 4d 
PT(0 = 
= P(f(0 = f l~(0 = l) 
Pm(~) = { tbr4J~-e } 
= s,(m(4,) = S I £(4>) -- l, 9(40 -- d) 
These l~robabilities are tyl)ically model parameters, 
o1' further decomposed into such. lJ~(@j) is the prob- 
ability of the label £(4~J) of a node given the label 
£(4') of its regent and the dependency type "D(0j) 
linking them. l{eh~ting Eft, j) and £(0) yiekls lexical 
collocation statistics and including D((~j) makes the 
collocation statistics lexical-fimetional. Pm(~0 is the 
probability of the bag of' dependency types Y(0) of 
a ,,ode Rive,, its label £(4J) and its relation D(#)) to 
its regent. This retleets the probability of the label's 
vM oncy, or lexieal-fimctional eoml)lement , and of op- 
1.ional adjuncts. Including D(q)) makes this proba- 
bility situa.ted in taking its current role into accounl.. 
These allow us to define the tree probal)ility 
~,(m) = II ~'~0/,)-s>(4,) 
,',,EAr 
wiiere the 1)roduct is taken over the set. of nodes .A/" 
of the tree. 
\¥e generate the random variables £ and S using 
a top-down stochastic process, where £(()) is goner- 
ated I)efore Y(O). The probal)ility of the condition- 
ing material of l~(Oj) is then known from Pc(O) and 
19((,), and that of Sg(4,j) is known froln \]'£(OJ) 
and lJ:n(O). F'igure 3 shows the process generating 
the dependency tree' of Figure 2 by reading the £ 
and .7:- colunms downwards in parallel, £ before Y: 
~,(~) =., m(O = r~,~\], 
£(1) = ate, Y(1)= \[subj,dobj\], 
~;(~) = Job,,, roll:i)= ~, 
~(~) = /,~,,,,~, m(12) = 0 
Consider calculating the l)robabilities at node 1: 
IJ~ (1) = 
= s'(/_;(l) = <,t~ I£(0 = .,'D(1) = main) 
P~(~) = 
= PlY(l)= \[subj,dobj\] \[ 
I C(I) = .t~,'D(~) = m~n) 
3 String Realization 
'\]'he string realization cannot be uniquely deter- 
mined from the tree structure. 'lb model the string- 
realization process, we introduce another fundamen- 
tal random w~riable $(()), which denotes the string 
685 
associated with node 0 and which should not be con- 
fused with the node label £(()). We will introduce 
yet another fundamental randoln variable \]v4(~)) in 
Section 3.2, when we accommodate crossing depen- 
dency links. In Section 3.1, we present a projectivc 
stochastic dependency gralnlnar with an expressive 
power not exceeding that of stochastic context-free 
grammars. 
3.1 Projective Dependency Grammars 
We let the stochastic process generating the £ and .7- 
vtu'iM)les be as described above. We then define tile 
stochastic string-realization process by letting tile 
8(~5) variables, given ¢'s label 1(40 and the bag of 
strings s(()j) of ~5's child nodes, randomly permute 
and concatenate them according to the probability 
distributions of the modeh 
Ps(0 = 
= s,(s(<.) = 401 c(O, 7(0, c(O) 
~'~(4,) = { for¢¢~ } 
= s'(s(¢,) = 4¢') I >frO, eft,), 7(¢), c(¢,)) 
where 
c(4,) = 0 \[~(<bJ)\] 
j=l 
8(~$) = adjoin(C(g,),l(¢)) 
adjoin(A,/3) = eoncat(permute(A U \[/3\])) 
The latter equations should be interpreted as defin- 
ing the randorn variable 8, rather than specifying its 
probability distribution or some possible outcome. 
This means that each dependent is realized adjacent 
to its regent, where wc allow intervening siblings, 
and that we thus stay within the expressive power 
of stochastic context-free grammars. 
We define the string-realization probability 
~beAr 
and the tree-string probability as 
P(7, s) = I"(7-)./,(s I 7) 
The stochastic process generating the tree struc- 
ture is as described above. We then generate the 
string variables S using a bottom-up stochastic pro- 
cess. Figure 3 also shows the process realizing the 
surface string John ate beans fl-om the dependency 
tree of Figure 2 by reading the S column upwards: 
,9(12) = bca,,s, S(11) = Job,,, 
,9(1) = s(11)ate s(12), S(e) = s(l). 
Consider cMeulating tile striug probability at node 
1. Ps is the probability of the particular permut~t- 
tion observed of the strings of the children and the 
?le 
did say/1 
~bj/'%sc0nj 
Mao,/l 1 that ate/l 2 
subj/"~ol0bj 
Johnll21 Whatbeans/122 
Figure 4: Del)endency tree for What beans did Mary 
say lhat John ate? 
1M)el of the node. To overcome the sparse-data prob- 
lem, we will generalize over the actual strings of tile 
children to their dependency types. For example, 
s(subj) denotes the string of the subject child, re- 
gardless of what it actually might be. 
J'~(1) = P(S(1) = s(subj) ate s(dobj) I 
I "D(1) = m~n, s:(1) = <,re, 
C(I) = \[s(subj), s(dobj)\]) 
This is the probability of the permutation 
(s(subj), ate, s(dobj)) 
of the bag 
\[s(subj), aic, s(dobj)\] 
given this bag and the fact that we wish to tbrm a 
main, declarative clause. This example highlights 
the relationship between the node strings and both 
Sallssure's notion of constituency and tile l)ositiolml 
schemata of, amongst others, l)idrichsen. 
3.2 Crossing Dependency Links 
To accommodate long-distance dependencies, we al- 
low a dependent to be realized adjacent to the la- 
bel of rely node that dominates it, immediately or 
not. For example, consider the dependency tree of 
Figure 4 tbr the sentence l/Vhat beans did Ma'Jw say 
that John ate? as encoded in Figure 5. Ilere, What 
beans is a dependent of that arc, which in turn is a 
dependent of did say, and What beans is realized be- 
tween did and sag. This phenomenon is called move- 
ment in conjunction with phrase-structure gram- 
m~rs. It makes the dependency grammar non- 
projective, since it creates crossing dependency links 
if the dependency trees also depict the word order. 
We introduce variables A//(~) that randomly se- 
lect from C(4)) a, subbag CM(4,) of strings passed up 
to ()'s regent: 
C(¢) = O(\[s(4)J)\] UCM(¢j)) 
j=l 
Gd</,) c_ c(4)) 
s'~(4,) = P(M(4)) = c'~,,(e) I 
I ~(<t,), s:(¢), fifO, c(6)) 
686 
N" £ be 
c v \[whq\] 
1 did say \[subj, sconj\] 
11 Mary (~ 
:12 thai ate \[subj ,dobj\] 
1.21 John 
122 What beans 0 
Figure 5: l)ependency encoding of What beans did 
Mary say that John ate? 
A/ M S 
di(/411) 
11 ~ Mary 
12 \[.s(122)\] that s(121) ate 
121 ~J John 
122 0 W/tat bem~s 
Fi.?;ure 6: Process generating What beans did Mary 
.sag that dohn ate? 
q'he rest of the strings, Cs(¢), are realized here: 
c:,;(¢) = c(¢) \ 
= = I be(C), c;,(¢)) 
s'(¢) = 
3,3 Discontinuous Nuch;i 
We generalize the scheme to discontinuous nuclei by 
allowing 8(¢) to i,mert the strings of C~.(~5) any- 
where in 1(¢): e 
adjoin(A, fl) = 
= V @ 
m 
j=l 
fl = b~ ...b,,, 
Tllis means that strings can only l)e inserted into an- 
cestor labels, ,lot into other strings, which enforces 
a. type of reverse islaml constraint. Note how in Fig- 
ure 6 John is inserted between that and ate to form 
the subordina, te clause that John atc. 
We define tile string-realization probability 
,b6 ar 
and again define the tree-string prol)ability 
~'(r,S) = l'(T).l'(a I T) 
2x -~ y indicates that x precedes y in the resulting per- 
mutation, q~snihre's original implicit definition of a nucleus 
actually does not require that the order be preserved when 
realizing it; if has catch is a nucleus, so is eaten h.as. This is 
obviously a useflfl feature for nlodeling verb chains in G erln&n 
subordinate clauses. 
'lb avoid derivational ambiguity when generating a 
tree-string pair, i.e., have more than one derivation 
generate tile same tree-string pair, we require that 
no string be realized adjacent to the string of any 
node it was passed u 1) through. This introduces the 
l)raetica.l problem of ensuring that zero probability 
mass is assigned to all derivations violating this con- 
straint. Otherwise, the result will be approxima.ting 
the parse probabi\]ity with a derivation probability, 
as described in detail in (Samuelsson, 2000) based on 
the seminal work of (Sima'an, 1996). Schemes like 
(Alshawi, 1996) tacitly make this approximation. 
The tree-structure variables £ and be are gener- 
ated just as before. ~¥e then generate the string vari- 
ables 8 and Ad using a bottom-up stochastic process, 
where M(¢)is generated before 8(¢). 'l.'he proba- 
bility of the eonditkming material of \]o~ (¢) is then 
known either from the top-down process or from 
I'M(¢j) and Pa(¢j), and that of INTO)is known 
either from the top-down process, or from 15v4(¢), \[)dgq(4)j) 
and 1~(¢j). The coherence of S(~) a.nd 
f14(~/)) is enforced by explicit conditioning. 
Figure 5 shows a top-down process generating the 
dependency tree of Figure <1; the columns £ and 
be should be read downwards in parallel, L; before 
b e. Figure 6 shows a bottom-up process generating 
the string l/Vhat beans did Mary say that dohn at(:? 
from the dependency description of Figure 5. The 
colltlll,lS d~v4 and S should be read upwards in paral- 
lel, 2t4 before $. 
3.4 String Merging 
We have increased the expressive power of our de- 
pendency gramma.rs by nlodifying tile S variables, 
i.e., by extending the adjoin opera.lion. In tile first 
version, the adjoin operation randomly permutes the 
node label and the strings of the child nodes, and 
concatenates the result. In the second version, it 
randondy inserts the strings of the child nodes, and 
any moved strings to be rea.lized at tile current node, 
into the node label. 
The adjoin operation can be fln:ther refined to al- 
low handling an even wider range of phenomena, 
such as negation in French. Here, the dependent 
string is merged with the label of the regent, as ne ... 
pas is wrapped around portions of the verb phrase, 
e.g., Ne me quitte pas!, see (Brel, 195.(t). Figure 7 
shows a dependency tree h)r this. In addition to this, 
the node labels may be linguistic abstractions, e.g. 
"negation", calling on the S variables also for their 
surface-string realization. 
Note that the expressive power of the grammar 
depends on the possible distributions of the string 
probabilities IN. Since each node label can be moved 
and realized at the root node, any  can be 
recognized to which the string probabilities allow as- 
signing the entire probablity mass, and the gralnmar 
will possess at least this expressive power. 
687 
!le imnl 
quitte/l ~g>~do~ 
Ne pas/l l me~12 
Figure 7: Dependency tree for Ne me quitte pas t 
4 A Computational Rendering 
A close approximation of the described stochastic 
model of dependency syntax has been realized as a 
type of prohabilistic bottom-up chart parser. 
4.1 Model Specialization 
The following modifications, which are really just 
specializations, were made to the proposed model 
for efficiency reasons and to cope with sparse data. 
According to Tesni6re, a nucleus is a unit that 
contains both tile syntactic and semantic head and 
that does not exhihit any internal syntactic struc- 
ture. We take the view that a nucleus consists of a 
content word, i.e., an open-class word, and all flmc- 
tion words adding information to it that could just as 
well have been realized morphologically. For exam- 
ple, the definite article associates definiteness with a. 
word, which conld just has well have been manifested 
in the word form, as it is done in North-Germanic 
s; a preposition could be realized as a loca.- 
tional or temporal inflection, as is done in Finnish. 
The longest nuclei we currently allow are verb chains 
of the form that have been eoten, as in John knows 
that lhe beans have been eaten. 
The 5 r variables were decomposed into generating 
the set of obligatory arguments, i.e., the valency or 
lexical complement, at once, as in the original model. 
Optional modifiers (adjuncts) are attached through 
one memory-less process tbr each modifier type, re- 
suiting in geometric distributions for these. This is 
the same separation of arguments and adjuncts as 
that employed by (Collins, 1997). However, the £ 
variables remained as described above, thus leaving 
the lexieal collocation statistics intact. 
The movement probability was divided into three 
parts: the probability of moving the string of a par- 
ticular argument dependent from its regent, that of 
a moved dependency type passing through a par- 
ticular other dependency type, and that of a de- 
pendency type landing beneath a particular other 
dependency type. The one type of movement that 
is not yet properly handled is assigning arguments 
and adjuncts to dislocated heads, as in What book 
did John read by Chomsky? 
The string-realization probability is a straight- 
forward generalization of that given at the end of 
Section 3.1, and they m:e defined through regu- 
lar expressions. Basically, each unmoved depen- 
dent string, each moved string landed at. the cur- 
?/e ?/e ynql ynql 
Did xxx/I Did eat/1 
s/uu bj/~'do.~ s/ubj/'~dobi 
John/l I beaus~12 Johull I xxx/12 
Figure 8: Dependency trees for Did John xxm beans? 
and Did John eat xxm? 
rent node, and each token of the nucleus labeling the 
current node are treated as units that are randomly 
permuted. Whenever possible, strings are general- 
ized to their dependency types, but accurately mod- 
elling dependent order in French requires inspecting 
tile actual strings of dependent clitics. Open-class 
words are typically generalized to their word class. 
String merging only applies to a small class of nuclei, 
where we treat tile individual tokens of the depen- 
dent string, which is typically its label, as separate 
units when perfornfing tile permutation. 
4.2 The Chart Parser 
The parsing algorithm, which draws on the Co&e- 
t(asanli-Younger (CI(Y) algorithm, see (Younger, 
1967), is formulated as a prohabilistic deduction 
scheme, which in turn is realized as an agenda-driven 
chart-pa.rser. The top-level control is similar to that 
of (Pereira and Shieher, 1987), pp. 196-210. The 
parser is implemented in Prolog, and it relies heav- 
ily on using set and bag operations as primitives, 
utilizing and extending existing SICStus libraries. 
The parser first nondeterministically segments the 
input string into nuclei, using a lexicon, and each 
possible lmcleus spawns edges tbr the initial chart. 
Due to discontinuous nuclei, each edge spans not a 
single pair of string positions, indicating its start 
and end position, \])tit a set of such string-position 
pairs, and we call this set an index. If the index 
is a singleton set, then it is continuous. We extend 
the notion of adjacent indices to be any two non- 
overlapping indices where one has a start position 
that equals an end position of the other. 
The lexicon contains intbrmation about the roles 
(dependency types linking it to its regent) and va- 
lencies (sets s of types of argument dependents) that 
are possible for each nucleus. These are hard con- 
straints. Unknown words are included in nuclei in a 
judicious way and the resulting nuclei are assigned 
all reasonable role/valency pairs in the lexicon. For 
example, the parser "correctly" analyzes tile sen- 
tences Did John xxx beans? and Did John eat xxx? 
as shown in Figure 8, where xxx' is not in the lexicon. 
For each edge added to the initial chart, the lexi- 
con predicts a single valency, but a set of alternative 
roles. Edges arc added to cover all possible valen- 
al)ue to the uniqueness principle of arguments, these are 
sets, rather than bags. 
688 
ties for each nucleus. The roles correspond to tim 
"goal" of dotted items used ill traditional cha.rt pars- 
ing, and the unfilled valency slots play the part of 
the "l)ody", i.e., the i)ortion of the \]{IlS \['ol\]owing 
the dot that renlailis to I)e found. If an argunl_ent is 
attached to the edge, the corresponding valency slot 
is filled in the resulting new odg(;; no arg~llnlont ea.ll 
be atta.ched to a.n edge llnless tllere is a (;orrespon(l- 
ing unfilled witency slot for it, or it is licensed by a 
lnoved arguln0nt, l,'or obvions reasons, the lexicon 
ca.nnot predict all possible combinations of adjuncts 
for each nuehms, and in fact predicts none at all. 
There will in general be nmltiple derivations of any 
edge with more than ()no del)endent , but the parser 
avoids adding dul)licate edges to tlt(? chart in the 
same way as a. traditional chart l)arser does. 
The l>arser enll)loys a. l)a(:ked l)arse foresl. (PI)I! ') 
to represent the set of all possible analyses and the 
i)robalfility of each analysis is recoverable I\]:om the 
PPI!' entries. Since optional inodifiers are not 1)re - 
dieted by the lexicon, the chart does not (:onl, a.ii~ 
any edges that ('orrespon(t directly to passive edges 
ill traditional chart parsing; at any point, an ad.lun('t 
C~ll always be added to an existing edge to form a 
new edge. In sonic sense, though, tile 1)1)1 '' nodes 
play tlie role all' passive edges, since the l)arser never 
attempts to combine two edges, only Olle ('xlgc and 
one I)l)l! ' lio(le, and the la.tter will a.lways 1)e a. de- 
pendent el'the fornier, directly, or indirectly tlirough 
the lists of n:iovcd dependents. 'l'he edge and l)l)l i' 
node to be Colnl)ined ai'e required to \]lave adjacent 
indices, and their lnlion is the index of tile now edge. 
The lnain point in using a l)acked parse forest, is to 
po'rI'orni local ~tiiil)iguity packing, which lneans a.b 
stracting over difl);ren(-es ill intc'rnal stlFlletlllye that 
do not lllalL, t(;r for fllrth(~,r \])arsilig. \¥hen attching a 
I)PF no(-l(~' to SOlllo edgc ;_is a direct or indirect de- 
pendent, the only relewuit teatnres are its index, its 
nucleus, its role a.nd its moved dependents. Oilier 
features necessary for recovering the comph;tc" anal- 
ysis are recorded in the P1)F entries of the node, bnt 
arc not used for parsing. 
To indicate the alternative that no more depen- 
dents are added to an edge, it is converted into a 
set of PPF updates, where each alternative role of 
the edge adds or updates one PPF entry. When 
doing this, any unfilled valency slots are added to 
the edge's set of moved arguments, which in turn is 
inherited by the resulting PPF update. '.\['lie edges 
are actually not assigned probabilities, since they 
contain enough information to derive the appropri- 
ate l)robabilities once they are converted into I)I)F 
entries. '1'o avoid the combinatorial explosion el' un- 
restricted string merging, we only allow edges with 
continuous indices to be converted into PI)I! ' 0ntries, 
with the exception of a very limited class of lexically 
signMed nnelei, snch as the nc pas, nc jamais, etc., 
scheme of French negation. 
4.3 Printing 
As Ot)l)osed to traditional chart parsing, meaningful 
upper and lower 1)ounds of the supply and demand 
for the dependency types C" the "goal" (roles) and 
"body" (wdency) of each edge can 1)e determined 
From the initial chart, which allows performing so- 
phis(,icated pruning. The basic idea is that if some 
edge is proposed with a role that is not sought out- 
side its index, this role can safely be removed. For 
example, the word me could potentially be an in- 
direct object, but if there is no other word in the 
inl)ut string that can have a.n indirect object as an 
argument, this alternative can be discarded. 
'Phis idea is generalized to a varia.nt of pigeonhole 
reasoning, in the v(;in of 
If wc select this role or edge, then ~here are 
by necessity too few or too many of some 
del)endcncy tyl)e sought or Cl'ered in the 
chart. 
or alternatively 
If wc select this nucleus or edge, then we 
cannot span the entire input string. 
Pruning is currently only al)plied to the initial chart 
to remove logically inq>ossible alternatives and used 
to filter out impossible edges produced in the predic- 
tion step. Nonetheless, it reduces parsing times by 
an order of magnitude or more tbr many of the test 
examples. \]t would however be possible to apply 
similar ideas to interniittently reinove alternatives 
that are known 1:o be suboptimal, or to \]leuristically 
prtllie unlik(;ly searcll branches. 
Discussion 
We have proposed a generative, statistical t.iieory 
of dependency syntax, based on TesniSrc's classical 
theory, that models crossing dependency links, dis- 
continuous nuclei and string merging. The key in- 
sight was to separate the tree-generation and string- 
realization processes. The model has been real- 
ized as a type of probabilistie chart parser. The 
only other high-fidelity computational rendering of 
Tesnitre's dependency syntax that we are aware of 
is that of (rl.'apanainen and J fi.rvinen, 1997), which is 
neither generative nor statistical. 
The stochastic model generating dependency trees 
is very similar to other statistical dependency mod- 
els, e.g., to that of (Alshawi, 1996). Formulating it 
using Gorn's notation and the L; and 2" variables, 
though, is concise, elegant; and novel. Nothing pre- 
vents conditioning the random variables on arbitrary 
portions of Clio 1)artial tree generated this far, using, 
e.g., maximum-entrol)y or decision-tree models to 
extract relevant t~atnres of it; there is no difference 
689 
in principle between our model and history-based 
parsing, see (Black el; al., 1993; Magerman, 1995). 
The proposed treatment of string realization 
through the use of the ,5 and A4 variables is also both 
truly novel and important. While phrase-structure 
grammars overemphasize word order by making the 
processes generating the S variables deterministic, 
Tesni6re treats string realization as a secondary is- 
sue. We tind a middle ground by nsing stochastic 
processes to generate the S and Ad variables, thus 
reinstating word order as a parameter of equal im- 
portance as, say, lexical collocation statistics. It is 
however not elevated to the hard-constraint status 
it enjoys in phrase-structure grammars. 
Due to the subordinate role of string realization in 
classical dependency grammar, the technical prob- 
lems related to incorporating movement into the 
string-realization process have not been investigated 
in great detail. Our use of the 54 variables is moti- 
vated partly by practical considerations, and partly 
by linguistic ones. The former in the sense that 
this allows designing efficient parsing algorithms for 
handling also crossing dependency links. The lat- 
ter as this gives us a quantitative handle on the 
empirically observed resistance against crossing de- 
pendency links. As TesniSre points out, there is 
locality in string realization in the sense that de- 
pendents tend to be realized adjacent to their re- 
gents. This fact is reflected by the model parame- 
ters, which also model, probabilistically, barrier con- 
straints, constraints on landing sites, etc. It is note- 
worthy that treating movelnent as in GPSG, with 
the use of the "slash" l~ature, see (Gazdar et al., 
1985), pp. 137-168, or as is done in (Collins, \]997), 
is the converse of that proposed here for dependency 
grammars: the tbrmer pass constituents down the 
tree, the 54 variables pass strings up the tree. 
The relationship between the proposed stochastic 
model of dependency syntax and a number of other 
prominent stochastic grammars is explored in detail 
in (Samuelsson, 2000). 

References 

Itiyan Alshawi. 1996. IIead automata and bilingual 
tiling: Translation with minimal representations. 
Procs. 3~th Annual Meeting of the Association for 
Computational Linguistics, pages 167-176. 

Ezra Black, Fred Jelinek, John Lafferty, David 
Magerman, l~obert Mercer, and Salim Roukos. 
1993. Towards history-based grammars: Using 
richer models tbr probabilistic parsing. Proes. 
28th Annual Meeting of the Association for Com- 
putational Linguistics, pages 31 37. 

Jacques Brel. 1959. Ne mc quitte pas. La Valse h 
Mille Temps (PIII 6325.205). 

Micha.el Collins. 1997. Three generative, lexical- 
ized models for statistical parsing. Procs. 35th 
Annual Meeting of the Association fog" Computa- 
tional Linguistics, pages 16-23. 

Ulrich Engel, 1996. Tcsni~rc Miflvcrstanden: Lu- 
cicn Tcsni~re - Syntaxc Structuralc et Opera- 
tion Mcntalcs. Aktcn des deulscl,-franzSsischen 
Kolloquiums anliifllich dcr 100 Wiedcrkchr seines 
Gebursttagcs, 5'trasbourg 1993, volume 348 of Lin- 
guistische Arbeiten, pages 53-61. Niedermcyer, 
Tiibingen. 

ltaim Gaiflnan. 1965. l)ependency systems and 
phrase-structure systems. Information and Con- 
trol, 8:304-337. 

Gerald Gazdar, Ewan Klein, Geoffrey K. Pullnm, 
and Ivan A. Sag. 1985. Generalized Phrase Struc- 
turc Grammar. Basil Blackwell l?ul)lishing, Ox- 
ford, England. Also published by Harvard Uni- 
versity Press, Cambridge, MA. 

Saul Gorn. 1962. Processors for infinite codes of 
shannon-fa.no type. Syrup. Math. Theory of Au- 
tomata. 

David Ilays. 1964. l)ependency theory: A formal- 
ism and some observations. Languagc, 40(4):511- 
525. 

Timo J'a.rvinen. 1998. Tcsnidre's 5'tg'uctural Syntax 
Rcworl¢ed. University of llelsinki, Itelsinki. 

David Ma.german. 1995. Statistical decision-tree 
models for parsing. Ib'ocs. 33rd Annual Meeting 
of the Association fog-" Computational Linguistics, 
pages 276 283. 

Igor Mel'Snk. 1987. Dependency 5'ynta.r:. State Uni- 
versity of New York Press, All)any. 

l?crnando Pereira and Stuart Shieber. 1987. Pro- 
log and Natural-Language Analysis. CSLI Lecture 
Note 10. 

Jane Robinson. 1970. l)ependency structures and 
transfornm.tional rules. Lcmguage, 46:259 285. 

Christer Samuelsson. 2000. A theory of stochastic 
grammars. In Proceedings oJNLI)-200(\], l)ages 92 
105. Springer Verlag. 

Khalil Sima'an. 1996. Computational complexity of 
probabilistic disambigua.tions by means of tree- 
grammars. Procs. 16lh International Confcrencc 
on Computational Linguistics, at the very end. 

Pasi Tapanainen and Timo JSrvinen. 1997. A non- 
projective dependency parser. Pgvcs. 5th Con- 
fercnce on Applied Natural Language Processing, 
pages 6zl 71. 

Lucien Tesnihre. 1959. lOldmcnts de Syntaxc Sh'uc- 
turalc. Libraire C. Klincksieck, Paris. 

l)avid 1I. Younger. 1967. R.ecognition and parsing 
of context-fi'ee s in time n a. Information 
and Control, 10(2):189 208. 
