Three New Probabilistic Models 
for Dependency Parsing: An Exploration* 
Jason M. Eisner 
CIS Department, University of Pelmsylvaifia. 
200 S. 33rd St., Philadelphia, PA 19104-6"{89, USA 
j eisner@linc, cis. upenn, edu 
Abstract 
Alter presenting a novel O(n a) parsing al- 
gorithm for dependency grammar, we de- 
velop three contrasting ways to stochasticize 
it. We propose (a) a lexical atfinity mode\] 
where words struggle to modify each other, 
(b) a sense tagging model where words tluc- 
tuate randomly in their selectional prefer- 
ences, and (e) a. generative model where 
the speaker fleshes ()tit each word's syntactic 
and concep{.ual structure without regard to 
the implications :for the hearer. W(! also give 
preliminary empirical results from evaluat- 
ing the three models' p;Lrsing performance 
on annotated Wall Street Journal trMning 
text (derived fi'om the Penn Treebank). in 
these results, the generative model performs 
significantly better than the others, and 
does about equally well at assigning pa.rt- 
of-speech tags. 
1 Introduction 
In recent years, the statistical parsing community 
has begun to reach out; for syntactic formalisms 
that recognize the individuality of words, l,ink 
grammars (Sleator and 'Pemperley, 1991) and lex- 
icalized tree-adjoining granunars (Schabes, 1992) 
have now received stochastic treatments. Other 
researchers, not wishing to abandon context-flee 
grammar (CI"G) but disillusioned with its lexica\] 
blind spot, have tried to re-parameterize stochas- 
tic CI"G in context-sensitive ways (Black et al., 
1992) or have augmented the formalism with lex- 
ical headwords (Magerman, 1995; Collins, 11996). 
In this paper, we 1)resent a \[lexible l)robat)ilistic 
parser that simultaneously assigns both part-of- 
sl)eech tags and a bare-bones dependency struc- 
ture (illustrate.d in l!'igure 1). The choice ot'a 
simple syntactic structure is deliberate: we would 
like to ask some basic questions about where h'x- 
ical relationships al)pear and how best, to exploit 
*This materia.l is based upon work supported un- 
der a National Science I%undation Graduate Fellow- 
ship, and has benefited greatly from discussions with 
Mike Collins, Dan M(:lame(l, Mitch Marcus and Ad- 
wait Ratnaparkhi. 
(a) Tile man in the coiner taught his dachsht,ld IO play golf I';OS 
DT NN IN DT NN VBD PP.P$ NN TO VH NN 
/¢ man N ~.. phty~ j J-y, .% 
(b) The ill __ ~ /.~dachshund It) golf 
.)f COfllel his 
file 
Figure 1: (a) A bare-l>ones dependen(-y parse. \]'\]a<:h 
word points to a single t)arent, the word it modities; 
the head of the sentence points to the EOS (end-of: 
sentence) ma.rk. Crossing links and cycles arc not al- 
lowed. (b) Constituent structure and sub(:ategoriza- 
tion may be highlighted by displaying the same de- 
pendencies as a lexical tree. 
them. It is uscflfl to look into thes0 basic ques- 
tions before trying to tine-tmm the performance of 
systems whose behavior is harder to understand. 1 
The main contribution of' the work is to I)ro- 
pose three distin('t, lexiealist hyl)otheses abou(. (,he 
probability space underlying seHl\]ence structure. 
We il\]ustrate how each hypothesis is (:xl)ressed in 
a depemteney framework, and how each can be 
used to guide our parser toward its favored so- 
lution. Finally, we point to experimental resul(;s 
that compare the three hypotheses' parsing per- 
formance on sentences fi:om the Wall ,b'treel dour- 
hal. \]'he parser is trained on an annol,ated corpus; 
no hand-written grammar is required. 
2 Probabilistic Dependencies 
It cannot be emphasized too strongly that a gram- 
marital rcprcsentalion (de4)endency parses, tag se- 
quen(-es, phrase-structure trees) does not entail 
any particular probability model. In principle, one 
couht model the distribution of dependency l)arses 
l()ur novel parsing algorithm a/so rescues depen 
dency from certain criticisins: "l)ependency granl- 
mars ...are not lexicM, and (as far ~ as we know) lacl( 
a parsing algorithm of efficiency compara.ble to link 
grammars." (LMferty et ;LI., 1992, p. 3) 
340 
in any uuml)er of sensible or perverse ways. 'l'h(~ 
choice of l;he right model is not a priori (A)vious. 
One way to huild a l)robabilistie grammar is to 
specify what sequences of moves (such as shift an(/ 
reduce) a parser is likely to make. It is reasonable 
to expect a given move to be correct about as 
often on test data. as on training data. This is 
tire philosophy behind stochastic CF(I (aelinek et 
a1.1992), "history-based" phrase-structure parsing 
(I-~lack et al., 1992), +m(I others. 
IIowever, i)rol)ability models derived from 
parsers sotnetimes focus on i,lci(lental prope.rties 
of the data. This utW be the case for (l,alli'.rty et 
M., 1992)'s model for link grammar, l\[' we were to 
adapt their top-(h)wn stochastic parsing str~tegy 
to the rather similar case of depen(lency gram- 
mar, we would find their elementary probabilities 
tabulating only non-intuitive aspects of the parse 
structure: 
Pr(word j is the rightmost pre-k chihl of word i 
\] i is a right-sl)ine st, rid, descendant of one of the 
left children of a token of word k, or else i is the 
parent of k, and i l)re(;edes j t)recerles k). :e 
While it is dearly necessary to decide whether j 
is a child of i, conditioning that (Iccision as alrove 
may not reduce its test entropy as mneh as a tnore 
linguistically perspienous condition woul(/. 
We believe it is ffttil,\['u\[ to de.sign prol>al)ility 
models indel)en(letrtly of tit(' pa.rser. In this see- 
lion, we will outline the three+ lexicalist, linguis- 
tically perspicuous, qualitatiw~ly different models 
that we have (leveloped a, nd tested. 
2.1 Model A: Bigram lexieal affinities 
N-gram tatters like (Church, 1988; .lelinek 1985; 
Kupiec 1992; Merialdo 1990) take the following 
view of \]row ~/, tagged sentctrce enters the worhl. 
I"irst, a se.(tuenee of tags is g('nexate.d aecordittg to 
a Markov l)rocess, with t.h(' random choice of e~ch 
tag conditioned ou the previous two tags. Second, 
a word is choseu conditional on each tag. 
Since our sentences have links as well as tags 
and words, suppose that afl;er the words are in- 
serte(l, each senl;ence passes through a third step 
that looks at each pair of words and ran(lotnly de- 
cides whether to link them. For the resulting sen- 
tences to resemble real tort)era, the. probability 
that word j gets linked to word i should b(' le:~:i- 
(:ally scnsilivc: it should depend on the (tag,word) 
pairs at both i and j. 
'Fhe probability of drawing a given parsed sen- 
(once froln the+ population may then be expressed 
2This correspouds to l,Mi'erty el, al.'s central st~ttis- 
tk: (p. 4), l'r(m +-I L, le, l,r), in the case where i's 
pa.rent is to the left el i. i,j, k correspond to L, W, R 
respectively. Owing to the particular re(:ursiw~ strat- 
egy the p~trscr uses to bre+tk up the s(!\[tl,(?n(:e, the 
statistic would be measured ~ttld utilized only under 
the condition (lescribed above. 
(a) Ihe \[nice of Ihc sRu:k 1%11 
I)T NN IN I)1' NN VIII) 
(b) tile price uf. the stock R'II 
\]YI" NN IN I)T NN Viii) 
t,'igure 3: (++)Th(, ,:orrect parse. (b) A cotnmon cr,or 
if the model ignores arity. 
as (1) in \[,'igure 2, where the random wMable 
Lij G {0, 1} is t iff word i is the parent of word j. 
Expression (1) assigns a probability to e.very 
possible tag-a.nd-link-annotated string, and these 
l)robabilities sunl to one. Many or the annotated 
strings exhibit violations such as crossing links 
and multiple parents which, iftheywcreallowed, 
wouhl let all the words express their lexical prefe.r- 
ences independently and situttlta.ne:ously. We SiAl)- 
ulate that the model discards fl'om the popula+tion 
tiny illegal structures that it generates; they do not 
appear in either training or test data. Therefore, 
the parser described below \[inds the likeliest le- 
gal structure: it maximizes the lexical preferences 
of (l) within the few hard liuguistic coush'ainls 
itnlrosed by the del)endency formalism. 
In practice, solrre generalization or "coarsen- 
lug" of the conditionM probabilities in (1) heaps 
to avoid tile e.ll~ets of undertrMning. For exalH- 
ph'., we folk)w standard prn(-tice (Church, 1988) in 
n-gram tagging hy using (3) to al)proxitllate the 
lit'st term in (2). I)ecisions al)out how much coars- 
enittg t,o lie are+ o1' great pra(-t, ieal interest, b ut t hey 
(lel)etM on the training corpus an(l tnay l)e olnit- 
ted from a eonc<'.t)tuM discussion of' the model. 
'Fhe model in (I) can be improved; it does not 
(:aptrlr(" the fact that words have arities. For ex- 
+Unl)h.' , lh.e price of lh.c sleek fell (l"igure 3a) will 
tyl>ically 1)e nlisanalyzed under this model. Since 
stocks often fall, .sleek has a greater affinity f<>r fl:ll 
than lbr @ llen<:e stock (as w<'.ll as price) will en<l 
tt\[) t>ointittg to the verl> ./'(ell (lqgure 31>), result, hit 
in a double subject for JNI and \[eavitlg of childless. 
'l'o Cal)i.nre word aril, ies an(l othe+r stil>cal,<,gr)riza- 
lion I'aets, we must recognize that the. chihh:ert of 
a word like J~ll are not in(le4)ende.nt of each other. 
'File sohttion is to nlodi/'y (t) slightly, further 
conditioning l,lj on the number and/or type of 
children of i that already sit between i and j. This 
means that in I, he parse of Figure 3b, the link price 
-+ \]?~11 will be sensitive to the fact that fell already 
has a ok)set chihl tagged as a noun (NN). Specif- 
ically, tire price --+ fell link will now be strongly 
disfavored in Figure '3b, since verbs rarely Lalw~ 
two N N del)endents to the left. By COllt;rast, price 
--> fell is unobjectionable in l!'igure 3a, rendering 
that parse more probable. (This change (;an be 
rellected in the conceptual model, by stating that 
tire l,ij decisions are Hla(le ill increasing order of 
link length li--Jl and are no longer indepen(lent.) 
2.2 Model B: Seleetional i)references 
In a legal dependency l)axse, every word except 
for the head of the setrtence (tile EOS mark) has 
341 
Pr'(words, tags, links) =/','(words, tags). Pr(link presences and absences I words, tags) (1) 
I-\[ I t om(i + 1), twom(i + 2)). I\] I two,.d(i), two,'dO)) ('e) 
l<i<n l <_i,j <n 
l'v(tword(i) \] tword(i + 1), tword(i + 2)) ~ l','(tag(i) I tag(i + 1), tag(i + 2)). P,'(word(i) I tag(/)) (a) 
Pr(words, tags, links) c~ Pr(words, tags, preferences) =/'r(words, tags). Pr(preferences \] words, t~gs) (4) 
\]-I l',.(twom(i) I two d(i + 1), t o,'d(i + 2)). H I two,.d(i)) 
1 <i<n t<i<n 
/ 1 +#right-kids(i) '~ 
Pv(words, t+gs, links)= II { 1-\[ P,.(two,.d(kid+(i))I t,gj +dd+_,(i) ),t+o,'d(i)) 
l<i<n \c=-(\]-k#left+kids(i)),eT~0 kid~q_ 1 if c < 0 
Figure 2: tligh-level views of model A (formuhrs I 3); model l:l (forinul;t 4); and model C (lbrmula, 5). If i and 
j are tokens, then tword(i) represents the pair (tag(i), word(i)), and L,j C {0, 1} i~ ~ ill" i is the p~m:nt of j. 
exactly one parent. Rather than having the model 
select a subset of the ~2 possible links, as in 
model A, and then discard the result unless each 
word has exactly one parent, we might restrict the 
model to picking out one parent per word to be- 
gin with. Model B generates a sequence of tagged 
words, then specifies a parent or more precisely, 
a type of parent for each word j. 
Of course model A also ends up selecting a par- 
ent tbr each word, but its calculation plays careful 
politics with the set of other words that happen to 
appear: in the senterl(;C: word j considers both the 
benefit of selecting i as a parent, and the costs of 
spurning all the other possible parents/'.Model B 
takes an appro;~ch at the opposite extreme, and 
simply has each word blindly describe its ideal 
parent. For example, price in Figure 3 might in- 
sist (with some probability) that it "depend on a 
verb to my right." To capture arity, words proba- 
bilistically specify their ideal children as well: fell 
is highly likely to want only one noun to its left. 
The form and coarseness of such specifications is 
a parameter of the model. 
When a word stochastically chooses one set of 
requirements on its parents and children, it is 
choosing what a link grammarian would call a dis- 
juuct (set of selectional preferences) for the word. 
We may thus imagine generating a Markov se- 
quence of tagged words as before, and then in- 
dependently "sense tagging" each word with a 
disjunct, a Choosing all the disjuncts does not 
quite specify a parse, llowever, if the disjuncts 
are sufficiently specific, it specifies at most one 
parse. Some sentences generated in this way are 
illegal because their disjuncts cannot be simulta- 
neously satisfied; as in model A, these sentences 
are said to be removed fi'om the population, and 
the probabilities renormalized. A likely parse is 
therefore one that allows a likely and consistent 
aln our implementation, the distribution over pos- 
sible disjuncts is given by a pair of Markov processes, 
as in model C. 
set of sells(', tags; its probability in the population 
is given in (4). 
2.3 Model C: Recursive generation 
The final model we prol)ose is a generation 
model, as opposed l;o the comprehension mo(l- 
els A and B (and to other comprehension modc, ls 
such as (l,afferty et al., 1992; Magerman, 1995; 
Collins, 1996)). r\]'he contrast recalls an ohl debate 
over spoken language, as to whether its properties 
are driven by hearers' acoustic needs (coml)rehen- 
sion) or speakers' articulatory needs (generation). 
Models A and B suggest that spe~kers produce 
text in such a way that the grammatical relations 
can be easily decoded by a listener, given words' 
preferences to associate with each other and tags' 
preferences to follow each other. But model C says 
that speakers' primary goal is to flesh out the syn 
tactic and conceptual structure \['or each word they 
utter, surrounding it with arguments, modifiers, 
and flmction words as appropriate. According to 
model C, speakers should not hesitate to add ex- 
tra prepositionM phrases to a noun, even if this 
lengthens some links that are ordinarily short, or 
leads to tagging or attachment mzJ)iguities. 
The generation process is straightforward. Each 
time a word i is added, it generates a Markov 
sequence of (tag,word) pairs to serve, as its left 
children, and an separate sequence of (tag,word) 
pairs as its right children. Each Markov process, 
whose probabilities depend on the word i and its 
tag, begins in a speciM STAI{T state; the symbols 
it generates are added as i's children, from closest 
to farthest, until it re~ches the STOP state, q'he 
process recurses for each child so generated. This 
is a sort of lexicalized context-free model. 
Suppose that the Markov process, when gem 
crating a child, remembers just the tag of the 
child's most recently generated sister, if any. Then 
the probability of drawing a given parse fi'om the 
population is (5), where kid(i, c) denotes the cth- 
closest right child of word i, and where kid(i, O) = 
START and kid(i, 1 + #,'ight-kids(i)) = STOP. 
342 
(a) 
(b) 
dachshund ovcr there can really phty 
dachshund ow:r there can really play 
I,'igure 4: Spans \])~u'ticipa, ting, in the (:orru(:l. i)a, rsc of 
7'h, at dachs/*und o'+wr there c(+u vcalhl ph+g golf~. (st) 
has one pa,rcnt, lcss cndwor(I; its sul)sl)+tn (b) lists two. 
(c < 0 in(h'xes l('ft children,) 'Fhis may bc 
thought o\[" as a, non-linca.r l;rigrrmt model, where 
each t;agg('d woM is genera, l,ed 1)ascd on the l)a.r 
('nl, 1,~gg(:d wor(l and ;t sistx'r tag. 'l'he links in the 
parse serve Lo pick o,tt; t, he r('Jev;mt t,rit:;t+a,n~s, and 
a.rc' chosen 1;o g('t; l,rigrams t, lml, ot)l, imiz(~ t, hc glohM 
t,a,gging. 'l'tt;tl; the liuks also ha.t)l)en t;o ;ulnot,;:d,('. 
useful setnant;ic rela, tions is, from this t>crsl)ective, 
quil.e a(-cidcn{,a,l. 
Note that the revised v(',rsiol~ of ulo(h:t A uses 
prol)a, bilit, ics /"@ink to chihl I child, I)arenl,, 
closer-('hihh:en), where n.)(le\] (; uses l'v(link 1,o 
child \] parent,, eloscr-chil(h'en). 'l'his is I)c(:;,.t~se 
model A assunw.s 1,lu~l, I,h('. (:hild was i)reviously 
gencrat, ed I)y a lin(;a,r l)roc('ss, aml all t;hal, is nec- 
ess+u'y is t,o li.k 1,o it,. Model (~ a, cl,ually g(,n(;ral,es 
t, he chihl in the process o\[' liuking to il,. 
3 Bottom-\[)i) Dependency Parsing 
lu this sec.tAon we sket(:h our dependel.'y l)m'sing 
;dg;oril, hnl: ~ novel dytmJni('.-l)rogr;mJndng m('.l,hod 
1,o assetnhle l, he mosl, l>rol)a,ble+ i)a.rse From the bet,- 
tom Ul). The algori@m ++(l(Is one link at a l, ime, 
nmking il; easy to multiply oul, the hie(lois' l)rolm 
hility l'a(:t, ors. It, also enforces I,hc special direc 
Lion;dil,y requiremenl~s of dependency gra.nnnar, 
1;he l)rohibitions on cycles mM nlultiple par('nl,s. 4 
'\['\]10 liic.t\]tod llsed is similar t;o tie C K Y met.hod 
of cont.exl,-fr('e l)~rsing, which combines aJIMys(:s 
of shorl, er substrings into analys<:s of progressively 
longer ones. Multiple a.na.lyses It;we l, hc s~tnm 
signature if t;hey are indistinguishal>le i, their 
M)ility to (;Otlll)ill(? wit,h other analyses; if so, the 
parser disca,rds all but, the higlmsl,-scoring one. 
CI,:Y t'cquit',;s ()(?,.:t~ ~) t.i,,,,' +utd O(,,.:'.~) sp+.'.,;, 
where n is the lenglih of 1,he s(mtcn(:c and ,s is a,n 
Upl)(;r bouiM on signal;ures 1)er subsl;ring. 
Let us consider dependency parsing in t;his 
framework. ()he mighl; guess that each substa'ing 
;mMysis shottld bc t+ lcxicM tree ;+ tagged he;ul- 
word plus aJl Icxical sulfl;rees dependc'nt, upon 
i/,. (See l"igure 111.) llowew, r, if a. o:/tst,il, cnt s 
• 11,Mmled depend(reties a,re possible, a.nd a minor 
va,ria, nt ha.ndles the sitnplcr (:~tse of link gra.tnltl;-u', hi- 
deed, abstra.ctly, the a.lgorithm rescmbies ;t c\](,.aamr, 
bottom-up vcrsiou of the top-down link gr~tmm~tr 
pa,rser develol)ed independently by (l,Ml:'crty et aJ., 
1992). 
...... ~fz_ ....... ~ ~ ......................... .+ _~.._._ ....... , %i.y- - ....< 
• • • • • • • 
~o d I a (loll s}fl!Slm,,) Ji, b(_right subspan) ', 
I"iglll'e 5' The ass,:,mbly of a span c from two sm:LIIcr 
spaus (a a,nd b) ~tml a cove.ring link. Only . is miuimal. 
probabilistic behavior depends on iL~ he.adword 
(;he lcxicMisL hypoiJmsis titan dilt'erent;ly hc~:uhxl 
a.na.lyses need dilt'erenI; sigmrtures. There a.re al. 
lca+sl, k of t,hcsc for a, s/ibst;rhl,~ Of le..e;IJI k, whence 
Ge houn(t ,,~ :: t: = ~(u), giving: ;i l, illm COml)lex- 
it,y of tl(,s). ((~ollins, 19.%)uses t,his t~(,'."-')a, lgo 
ril, lml (lireclJy (t,ogel,h('r wil, h l)runiug;). 
\'\% I)rOl)OSe a,u aJl,ermtl, ive a,I)l)roa.('h l, ha, I, I)re 
serves the OOP) hound, hls~ca(t of analyzing sul) 
st,ri.gs as lcxical t, rees that, will be linked t, ogoJ,her 
in(,o la, rgcr h'~xica, I l, rees, t, lic I)arsc, r will ana, lyze 
I,hc'ln a,s uon-const,itm'.nt, sl)a:n.s t;haJ, will he cou 
cat;cm~t,ed into larger spans. A Sl)a,n cousisl;s el' 
> :~ ;.t.i{.:e,l<; words; l,;~gs I'or a,ll these words cx 
(:el)l, possibly the last; ;t list, of all del.'mle.cy \]i, ks 
muong the words in l, hc Sl)an; and l)erha, l)S s()lue 
other inl'ornml,ic, n carried a, long in t;lu, siren's sig- 
naJ, mc. No cych's, n,ull, iph' l)a, rcnts, or (','ossi,tg 
liul.:s are Mlowed in the Sl)a.u, and each Jut,re'hal 
word of' l, he Sl>ml must ha, vc ~ Ira.rein iu the q);m+ 
Two sl>a, ns at<> illustraJ,ed in I"igure d, 'l'hese di- 
a,gra.nts a, rc I,yl)ica,l: a, Sl)a,n el" a (Icpendct.:y l)a+rsc 
may consist, of oil,her a I)a+rcn(,less endword and 
some o\[' its des(:cn<hmt,s on one side (l"igtu'c 4a), 
or two parent, less cndwords, with a.ll t,he right &" 
s('(mda, nLs oF(me and all l;hc M'I, dcscen(hml,s of I, Ii(~ 
el, her (lq,e;urc 4b). '1'tl(.' im, uilAon is I, haJ, L\]le. illl,('A' 
hal part; of a, span is gra, nmmtically iuert: excel)l, 
Ior tit(', cmlwords dachsh, u~td mid play, l;hc struc 
lure o1' ea,ch span is irrelewml, I,o t,\]1(; Sl>Cm's al)ility 
t,o cotnbinc iu ful,ure, so sl)a, ns with different inter- 
1ml strucl, tu'e ca,n colnlmte to bc t;hc I)est,-scoring 
span wil, h a, lm,rticula,r signal;urc. 
117 sl)an a, ctMs on t,he saanc word i l;\[ha, l, st,al'l,s 
span b, t,h(;n law I)a,rs(er tries l;o c(>ml>ine I,hc l, wo 
spans I)y cove, red-(-(mvatcnation (l"igur(; 5). 
The I,wo Col)ies of word i arc idc.nt, i\[ied, a, fl,er 
which a M'l,waM or rightwaM cove\]\['ing link is 
ol)l;ionMly added I)ct,wceu t,h(' c.dwor(ts of t,h0. ,.>.v 
sf)a,n. Any tlepcudcncy parse ca, n I)c built Ill:) hy 
eovered-coitca, tena, i;ion. When the l)a,rser covcrcd- 
('O\]lCaJ,enat,cs (~ trod b, it, ol)l, ains up to IJtrce new 
SlmUS (M't, wa, rd, right,war(I, and no coveritlg \]ink). 
The <'o',,ered-(:oncaJ,cnal,ion of (+ a.nd b, I'ornfing 
(', is 1)arrcd unh;ss it, tricots terra, in simple test;s: 
• . must, I)e minimal (not, itself expressihle ++s a 
concaLenal,ion of narrower spaus). This prcvenLs 
us from assend>ling c in umltiple ways. 
• Since tim overlapping word will bc int;ertta,l to c, 
it; Illll81\[, ha, ve ?g parenl; in cxa,(;L\]y oile of a told b. 
343 
H Pr(tword(i) I tword(i + 1), tword(i + 2)). H Pr(i has peels that j satisfies I tword(i), tword(j)) (6) 
k<_i<g k<i,j<g with i,j linked 
H Pr(Lij ItW°rd(i)' tword(j), tag'(next-closest-kid(i))). H Pr(LiJ ItW°rd(i)' tword(j),...) (7) 
k<_i,j<g with i,j linked k<i<(, (j<k or ~.<j) 
• c must not be given a covering link if either the 
leftmost word of a or the rightmost word of b has 
a parent. (Violating this condition leads to either 
multiple parents or link cycles.) 
Any sufficiently wide span whose left endword 
has a parent is a legal parse, rooted at the EOS 
mark (Figure 1). Note that a span's signature 
must specify whether its endwords have parents. 
4 Bottom-Up Probabilities 
Is this one parser really compatible with all three 
probability models? Yes, but for each model, we 
must provide a way to keep tr~tck of probabilities 
as we parse. Bear in mind that models A, B, and 
C do not themselves specify probabilities for all 
spans; intrinsically they give only probabilities for 
sentences. 
Model C. Define each span's score to be the 
product of all probabilities of links within the 
span. (The link to i from its eth child is asso- 
ciated with the probability Pr(...) in (5).) When 
spans a and b are combined and one more link is 
added, it is easy to compute the resulting span's 
score: score(a), score(b)./°r(covering link)) 
When a span constitutes a parse of the whole 
input sentence, its score as just computed proves 
to be the parse probability, conditional on the tree 
root EOS, under model C. The highest-probability 
parse can therefore be built by dynamic program- 
ming, where we build and retain the highest- 
scoring span of each signature. 
Model B. Taking the Markov process to gen- 
erate (tag,word) pairs from right to left, we let (6) 
define the score of a span from word k to word (?. 
The first product encodes the Markovian proba- 
bility that the (tag,word) pairs k through g- 1 are 
as claimed by the span, conditional on the appear- 
ance of specific (tag,word) pairs at g, ~+1. ~ Again, 
scores can be easily updated when spans combine, 
and the probability of a complete parse P, divided 
by the total probability of all parses that succeed 
in satisfying lexical preferences, is just P's score. 
Model A. Finally, model A is scored the same 
as model B, except for the second factor in (6), 
SThe third factor depends on, e.g., kid(i,c- 1), 
which we recover fl'om the span signature. Also, mat- 
ters are complicated slightly by the probabilities asso- 
ciated with the generation of STOP. 
6Different k-g spans have scores conditioned on dif- 
ferent hypotheses about tag(g) and tag(g + 1); their 
signatures are correspondingly different. Under model B, a k-.g 
span may not combine with an 6-~n span 
whose tags violate its assumptions about g and g + 1. 
11 A I ~1 c I c' T- x I~,~o1~  1.o I 
Non-punt 88.9 89.8 89.6 89.'1 89.8 77.J 
Nouns 90.1 89.8 90.2 90.4 90.0 S(;.2 
I,ex verbs 74.6 75.9 7."/.3 75.8 73.3 67.5 
'Fable t: Results of preliminary experiments: Per- 
centage of tokens correctly tagged by each model. 
which is replaced by the less obvious expression in 
(7). As usual, scores can be constructed from the 
bottom up (though tword(j) in the second factor 
of (7) is not available to the algorithm, j being 
outside the span, so we back off to word(j)). 
5 Empirical Comparison 
We have undertaken a careful study to compare 
these models' success at generalizing from train- 
ing data to test data. Full results on a moderate 
corpus of 25,000+ tagged, dependency-annotated 
Wall Street Journal sentences, discussed in (Eis- 
ner, 1996), were not complete hi; press time. How- 
ever, Tables 1 2 show pilot results for a small set 
of data drawn from that corpus. (The full resnlts 
show substantially better performance, e.g., 93% 
correct tags and 87% correct parents fbr model C, 
but appear qualitatively similar.) 
The pilot experiment was conducted on a subset 
of 4772 of the sentences comprising 93,a~0 words 
and punctuation marks. The corpus was derived 
by semi-automatic means from the Penn Tree- 
bank; only sentences without conjunction were 
available (mean length=20, max=68). A ran- 
domly selected set of 400 sentences was set aside 
for testing all models; the rest were used to esti- 
mate the model parameters. In the pilot (unlike 
the full experiment), the parser was instructed to 
"back oil"' from all probabilities with denomina- 
tors < 10. For this reason, the models were insen- 
sitive to most lexical distinctions. 
In addition to models A, B, and C, described 
above, the pilot experiment evaluated two other 
models for comparison. Model C' was a version 
of model C that ignored lexical dependencies be- 
tween parents and children, considering only de- 
pendencies between a parent's tag and a child's 
tag. This model is similar to the model nsed by 
stochastic CFG. Model X did the same n-gram 
tagging as models A and B (~. = 2 for the prelim- 
inary experiment, rather than n = 3), but did not 
assign any links. 
Tables 1 -2 show the percentage of raw tokens 
that were correctly tagged by each model, as well 
as the proportion that were correctly attached to 
344 
All tokons 
Ntlll-llllnc 
NOLIn8 
17~1 verbs 
\[ A t~- - (' C r- \[ 
L~5..,~ r 8.1S~,,a.~ 47.3 
~l r~ sA rr.~ I '~ 1 
~~-L40:,<~A_ - ~ ~_ 
'l'~d)le 2: \]{.csults of preliininary (,Xl)crimcnts: Per. 
contage of tokens corrc0Lly attached Lo their par- 
onl;s by each model. 
their parents. Per tagging, baseline per\[ol:lnance 
Wa, S I/leaSlli'ed by assigniug each word ill the test 
set its most frequent tag (i\[' any) \['roiii the train- 
lug set. Thc iinusually low I)aseliue t)crJ'orillance 
I:esults \['l'Olll kL conil)iuation of ;t sHiaJl l>ilot Lr;~ill- 
ing set and ;t Inil(lly (~xten(|e(I t~g set. 7 \Vc ol) 
served that hi the ka.ining set, detei:lniners n-lost 
colrinlonly pointed t.o the following; word, so as a 
parsing baseline, we linked every test dctcrnihler 
to the following word; likewise, wc linked every 
test prcpositior, to the preceding word, and so ()11, 
The l'JatterllS in the preliuli/lary data ~ti'e strik- 
ing, with w:rbs showing up as all aFea el (lil\[iculty, 
alld with SOllle \]tlodcis cl<;arly farillg bctter I,\[I;tll 
other. The siinplcst and \['astest uiodel, the l'(~cur-- 
siw ~, generation uiodel (7, did easily i.he bcsl. ,job 
of <'i-q)turing the dependency s/.ructurc ('l'able 2). 
It misattachcd t.hc fewest words, both overall aud 
in each categol:y. This suggcsts that sut)eategjo 
rization 1)rcferc\[lccs the only I'~Lctor ('onsidered 
by model (J I)lay a substantial role in I;he sti:uc- 
lure of Trcebank scntcn(-cs. (lndccd, tii(; erl;ors ill 
model I~, wliich pe:l:forHled worst across the bO~Lr(l, 
were very frequently arity erl:ors, where ttie desire 
of a chihl to ~Ltta(:h LO a 1)articular parent over-. 
calne the rchi(:i;ail(;e of tile \[)areiit to a(:(-el)t uiore 
children.) 
A good deal of the l,arsi0_g Sll(;(',ess of inoclel (7 
seems to h~ve arisen from its k/iowle(lgc, of individ-- 
tiff. words, as we cxpe(:ted. This is showfi by the 
vastly inl~rior l)Cl;forniaH('e o\[' I;}lc control, model 
(ft. On l;he ot\]ier hand, I)oth (7 an(l (J' were conl- 
petitivc with t\[10 oth0r UlOdCiS i~l; tagging. This 
shows that a t~Lg can 1)e predicted ~d)out as well 
\['rolri Lhe tags of its putative p;Lrel,t ;rod sil)\]in<g 
as it ('an fiX)ill the \[~ags O\[" string-a(lja(:cnt words, 
eVell when there is ('onsideral)le e/;l:OF ill dcterinin-- 
ing the parent and s\[bling. 
6 Conclusions 
I~arc-bories dependency grammar which requires 
1lO Ihik labels> no ~ralfliiiai', and ItO fll~S tO 
lirlderstand iS a clean tcstbcd for studying the 
lexical a\[liniLies of words. Wc believe filial; this 
iS all ill,per,all, line of ilivcstigative research> olle 
that is likely to produce both useful parsing tools 
and signilicaut insights ~tboilt language niodeling. 
7We llsed distinctive t~tgs for a,uxi\[ia,ry verbs ;-I, ll(I 
for words being used as noun modifiers (e.g., partici- 
ples), bec<xuse they ha.ve very ditferent subca.tcgoriz~> 
lion fra.mes. 
As a lirst step in the study of lexicM a@n- 
ity, we asked whether there was a "natural" way 
to stochasticize such ~ siint)le formMism a.s de- 
pendency, hi f~ct, wc have now exhibited three 
promising types of lnodel for this simple problem. 
Further, we have develol)cd a novel parsing algo- 
rithm to compare thesc hyt)otheses, with results 
tim, so far favor the spe;tker-oriented model C, 
eveu in written, edited Wall Slrcet dournal I~cxt. 
To our knowledge, the relative merits of speaker 
oricn/,cd V(~l'SilS hcarer-orienl,ed probed)ills,it syn- 
l.iL× ino(h;Is iiave uoL been investigated l)efore. 
ll, efel'ellces 
Ezra Bla.(:k, Fred ,lelinck, et a.1. 1992. Towards history- 
ba,sed gramnl~u:s: using richer mod(,.ls \[br probabilis- 
tic i,~trsing. \[u Fifth I)AI~,FA Worksh.op ou £'pecch 
and Natural Language, Arden (7onfcrcn(:c Ceutcr, 
llnrrim~m, New York, Febrl,u'y. 
\['(enne.th W. (3mr(:h. 1988. A stochastic parts pro- 
gi:ntll a, nd noun l)hra,se parser for unrestri(:tcd text. 
In /'roe, of the 2rid (;onf. on Applied Natural Lan- 
g'uage lJroccssing, 136 148, Austin, TX. Asso(:i~Lti(,n 
for ('~omput~Ltimml l,inguistics, Morristowu, N.I. 
Mi(:ha.el ./. (',ollins. 1996. A new statistical parser 
based on bigr~un lexi(:~fl del)cndeucies, h, l~rocc.cd - 
iTtfJS of tit(; 24th A CL, S~l, nt~,, (~171'Z, (\]A, July. 
Ja.sol! 1'3isner. 199(;. An empirical (:omp~H'ison of prob- 
~dfility nlodcls for dependeucy gl:a, lnnlaJ:. Teehnic;d 
ILeport IRCS 96 11, University of PennsylvaJtilt. 
I!'red .felinck. 1985. M~rkov sour(:e modeliug of text 
gener~Ltiou. In .I. Skwirzinski, editor, hnpact of IS"o- 
tossing 7~chniques ou (;ommunication, /)ordrc(:ht, 
l"red Jelinek, ,lohn 1). l,Mferty, aml Robert 1, Mercer. 
I.¢)92. I\]~si(: niethods el prob~dfilistic context-fre(,. 
~INI,I"I'IILI7S. lit £'pccch tlccoqnition and U~zdcrstand- 
ing: l¢ecent Advances, Trends, and Applications. 
.I. Knpie.c. 1!392. I{obust l)arDof-speech ta.gging us- 
ing a. hidden Ma, rkov model. (7omputcr £'pccch .rid 
Language, 6. 
.\]ohu t,~Lfferty, I)~ufiel Sle~ttor, ~uid I)~vy '\['cmperley. 
1992. (~ramm~LticaJ trigr~mm: A prob~bilistic model 
of link gr~mnnar In 15"oc. of the AAAI Conf. on 
t)robabilistic Approaches to Natural Language, Oct. 
l);wid M~tgerul~n. 1l!)95. St~ttisti(:~d decision-tree 
models for p~u'sing, in Proceedings of the 33rd An- 
'nual Meeting of the A CL, l~oston, MA. 
Igor A, Mel'(:uk. 1988. l)cpcndcncy Syntax: 7?worg 
and l'racticc. St~te University of New York Press. 
IL Meria.hlo. 1990. Tagging text with ;L probabilistic 
model, lu l~rocccdinw of the IBM Natural Language. 
17'L, Paris, Fra.nce, pp. 161-172. 
Yves S(:ha.bes. L992. Stochastic lexi(:alized tree- 
~tdjoining gra.mmars, lit l'rocccdings of C()lHNG'- 
92, Na.nl.es, I)')'auce, .lnly. 
I)nniel Sleator and Daxy Tcmperlcy. 1991. Pro:sing 
I",nglish with ~t I,iuk (h:,~mm~m Te(:hnicifl report 
CM U.-('S-91-196. (iS Dept., C~m,egic Melk)n tl uiv. 
345 
