Pearl: A Probabilistic Chart Parser*

David M. Magerman
CS Department
Stanford University
Stanford, CA 94305
magerman@cs.stanford.edu

Mitchell P. Marcus
CIS Department
University of Pennsylvania
Philadelphia, PA 19104
mitch@linc.cis.upenn.edu
Abstract 
This paper describes a natural language parsing algorithm for unrestricted text which uses a probability-based scoring function to select the "best" parse of a sentence. The parser, Pearl, is a time-asynchronous bottom-up chart parser with Earley-type top-down prediction which pursues the highest-scoring theory in the chart, where the score of a theory represents the extent to which the context of the sentence predicts that interpretation. This parser differs from previous attempts at stochastic parsers in that it uses a richer form of conditional probabilities based on context to predict likelihood. Pearl also provides a framework for incorporating the results of previous work in part-of-speech assignment, unknown word models, and other probabilistic models of linguistic features into one parsing tool, interleaving these techniques instead of using the traditional pipeline architecture. In preliminary tests, Pearl has been successful at resolving part-of-speech and word (in speech processing) ambiguity, determining categories for unknown words, and selecting correct parses first using a very loosely fitting covering grammar.¹
Introduction 
All natural language grammars are ambiguous. Even tightly fitting natural language grammars are ambiguous in some ways. Loosely fitting grammars, which are necessary for handling the variability and complexity of unrestricted text and speech, are worse. The standard technique for dealing with this ambiguity, pruning grammars by hand, is painful, time-consuming, and usually arbitrary. The solution which many people have proposed is to use stochastic models to train statistical grammars automatically from a large corpus.

*This work was partially supported by DARPA grant No. N00014-85-K0018, ONR contract No. N00014-89-C-0171 by DARPA and AFOSR jointly under grant No. AFOSR-90-0066, and by ARO grant No. DAAL 03-89-C0031 PRI. Special thanks to Carl Weir and Lynette Hirschman at Unisys for their valued input, guidance and support.

¹The grammar used for our experiments is the string grammar used in Unisys' PUNDIT natural language understanding system.
Attempts in applying statistical techniques to natural language parsing have exhibited varying degrees of success. These successful and unsuccessful attempts have suggested to us that:

• Stochastic techniques combined with traditional linguistic theories can (and indeed must) provide a solution to the natural language understanding problem.

• In order for stochastic techniques to be effective, they must be applied with restraint (poor estimates of context are worse than none[7]).

• Interactive, interleaved architectures are preferable to pipeline architectures in NLU systems, because they use more of the available information in the decision-making process.

We have constructed a stochastic parser, Pearl, which is based on these ideas.
The development of the Pearl parser is an effort to combine the statistical models developed recently into a single tool which incorporates all of these models into the decision-making component of a parser. While we have only attempted to incorporate a few simple statistical models into this parser, Pearl is structured in a way which allows any number of syntactic, semantic, and other knowledge sources to contribute to parsing decisions. The current implementation of Pearl uses Church's part-of-speech assignment trigram model, a simple probabilistic unknown word model, and a conditional probability model for grammar rules based on part-of-speech trigrams and parent rules.
By combining multiple knowledge sources and using a chart-parsing framework, Pearl attempts to handle a number of difficult problems. Pearl has the capability to parse word lattices, an ability which is useful in recognizing idioms in text processing, as well as in speech processing. The parser uses probabilistic training from a corpus to disambiguate between grammatically acceptable structures, such as determining
prepositional phrase attachment and conjunction scope. Finally, Pearl maintains a well-formed substring table within its chart to allow for partial parse retrieval. Partial parses are useful both for error-message generation and for processing ungrammatical or incomplete sentences.
In preliminary tests, Pearl has shown promising results in handling part-of-speech assignment, prepositional phrase attachment, and unknown word categorization. Trained on a corpus of 1100 sentences from the Voyager direction-finding system² and using the string grammar from the PUNDIT Language Understanding System, Pearl correctly parsed 35 out of 40, or 88%, of sentences selected from Voyager sentences not used in the training data. We will describe the details of this experiment later.

In this paper, we will first explain our contribution to the stochastic models which are used in Pearl: a context-free grammar with context-sensitive conditional probabilities. Then, we will describe the parser's architecture and the parsing algorithm. Finally, we will give the results of some experiments we performed using Pearl which explore its capabilities.
Using Statistics to Parse 
Recent work involving context-free and context-sensitive probabilistic grammars provides little hope for the success of processing unrestricted text using probabilistic techniques. Works by Chitrao and Grishman[3] and by Sharman, Jelinek, and Mercer[12] exhibit accuracy rates lower than 50% using supervised training. Supervised training for probabilistic CFGs requires parsed corpora, which is very costly in time and man-power[2].

In our investigations, we have made two observations which attempt to explain the lackluster performance of statistical parsing techniques:
• Simple probabilistic CFGs provide general information about how likely a construct is to appear anywhere in a sample of a language. This average likelihood is often a poor estimate of probability.

• Parsing algorithms which accumulate probabilities of parse theories by simply multiplying them over-penalize infrequent constructs.

Pearl avoids the first pitfall by using a context-sensitive conditional probability CFG, where the context of a theory is determined by the theories which predicted it and the part-of-speech sequences in the input sentence. To address the second issue, Pearl scores each theory by using the geometric mean of the contextual conditional probabilities of all of the theories which have contributed to that theory. This is equivalent to using the average of the logs of these probabilities.
²Special thanks to Victor Zue at MIT for the use of the speech data from MIT's Voyager system.
CFG with context-sensitive conditional 
probabilities 
In a very large parsed corpus of English text, one finds that the most frequently occurring noun phrase structure in the text is a noun phrase containing a determiner followed by a noun. Simple probabilistic CFGs dictate that, given this information, "determiner noun" should be the most likely interpretation of a noun phrase.

Now, consider only those noun phrases which occur as subjects of a sentence. In a given corpus, you might find that pronouns occur just as frequently as "determiner noun"s in the subject position. This type of information can easily be captured by conditional probabilities.

Finally, assume that the sentence begins with a pronoun followed by a verb. In this case, it is quite clear that, while you can probably concoct a sentence which fits this description and does not have a pronoun for a subject, the first theory which you should pursue is one which makes this hypothesis.

The context-sensitive conditional probabilities which Pearl uses take into account the immediate parent of a theory³ and the part-of-speech trigram centered at the beginning of the theory.
For example, consider the sentence:

My first love was named Pearl.
(no subliminal propaganda intended)

A theory which tries to interpret "love" as a verb will be scored based on the part-of-speech trigram "adjective verb verb" and the parent theory, probably "S → NP VP." A theory which interprets "love" as a noun will be scored based on the trigram "adjective noun verb." Although lexical probabilities favor "love" as a verb, the conditional probabilities will heavily favor "love" as a noun in this context.⁴
Using the Geometric Mean of Theory 
Scores 
According to probability theory, the likelihood of two independent events occurring at the same time is the product of their individual probabilities. Previous statistical parsing techniques apply this definition to the co-occurrence of two theories in a parse, and claim that the likelihood of the two theories being correct is the product of the probabilities of the two theories.
³The parent of a theory is defined as a theory with a CF rule which contains the left-hand side of the theory. For instance, if "S → NP VP" and "NP → det n" are two grammar rules, the first rule can be a parent of the second, since the left-hand side of the second, "NP," occurs in the right-hand side of the first rule.

⁴In fact, the part-of-speech tagging model which is also used in Pearl will heavily favor "love" as a noun. We ignore this behavior to demonstrate the benefits of the trigram conditioning.
This application of probability theory ignores two vital observations about the domain of statistical parsing:

• Two constructs occurring in the same sentence are not necessarily independent (and frequently are not). If the independence assumption is violated, then the product of individual probabilities has no meaning with respect to the joint probability of two events.

• Since statistical parsing suffers from sparse data, probability estimates of low frequency events will usually be inaccurate. Extreme underestimates of the likelihood of low frequency events will produce misleading joint probability estimates.

From these observations, we have determined that estimating joint probabilities of theories using individual probabilities is too difficult with the available data. We have found that the geometric mean of these probability estimates provides an accurate assessment of a theory's viability.
The Actual Theory Scoring Function 
In a departure from standard practice, and perhaps against better judgment, we will include a precise description of the theory scoring function used by Pearl. This scoring function tries to solve some of the problems listed in previous attempts at probabilistic parsing[4][12]:

• Theory scores should not depend on the length of the string which the theory spans.

• Sparse data (zero-frequency events) and even zero-probability events do occur, and should not result in zero scoring theories.

• Theory scores should not discriminate against unlikely constructs when the context predicts them.

The raw score of a theory, θ, is calculated by taking the product of the conditional probability of that theory's CFG rule given the context (where context is a part-of-speech trigram and a parent theory's rule) and the score of the trigram:

$$\mathrm{SC}_{\mathrm{raw}}(\theta) = \mathcal{P}(\mathrm{rule}_{\theta} \mid p_0 p_1 p_2,\ \mathrm{rule}_{\mathrm{parent}}) \times \mathrm{SC}(p_0 p_1 p_2)$$
Here, the score of a trigram is the product of the mutual information of the part-of-speech trigram,⁵ p₀p₁p₂, and the lexical probability of the word at the location of p₁ being assigned that part-of-speech p₁.⁶ In the case of ambiguity (part-of-speech ambiguity or multiple parent theories), the maximum value of this product is used. The score of a partial theory or a complete theory is the geometric mean of the raw scores of all of the theories which are contained in that theory.

⁵The mutual information of a part-of-speech trigram, p₀p₁p₂, is defined to be $\frac{\mathcal{P}(p_0 p_1 p_2)}{\mathcal{P}(p_0\,x\,p_2)\,\mathcal{P}(p_1)}$, where x is any part-of-speech. See [4] for further explanation.

⁶The trigram scoring function actually used by the parser is somewhat more complicated than this.
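A hedged sketch of the raw-score computation follows (the table arguments and the marginal P(p₀ x p₂) are our own reconstruction of the formulas above; as footnote 6 notes, the trigram function actually used is more complicated):

    def mutual_information(trigram, tri_prob, mid_marginal, pos_prob):
        # MI(p0 p1 p2) = P(p0 p1 p2) / (P(p0 x p2) * P(p1)), x = any tag
        p0, p1, p2 = trigram
        return tri_prob[trigram] / (mid_marginal[(p0, p2)] * pos_prob[p1])

    def trigram_score(trigram, word, lex_prob, tri_prob, mid_marginal, pos_prob):
        # SC(p0 p1 p2): trigram mutual information times the lexical
        # probability that `word` carries the middle tag p1
        p1 = trigram[1]
        mi = mutual_information(trigram, tri_prob, mid_marginal, pos_prob)
        return mi * lex_prob[(word, p1)]

    def raw_score(cond_rule_prob, trig_score):
        # SC_raw(theta) = P(rule | p0 p1 p2, parent rule) * SC(p0 p1 p2);
        # under ambiguity, the maximum such product is taken
        return cond_rule_prob * trig_score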
Theory Length Independence  This scoring function, although heuristic in derivation, provides a method for evaluating the value of a theory, regardless of its length. When a rule is first predicted (Earley-style), its score is just its raw score, which represents how much the context predicts it. However, when the parse process hypothesizes interpretations of the sentence which reinforce this theory, the geometric mean of all of the raw scores of the rule's subtree is used, representing the overall likelihood of the theory given the context of the sentence.
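As a sketch (our own illustration, not the authors' code), the length-independent score is just a geometric mean, conveniently computed in log space so that long derivations do not underflow:

    import math

    def theory_score(raw_scores):
        # geometric mean of the raw scores of every rule in the subtree
        log_sum = sum(math.log(s) for s in raw_scores)
        return math.exp(log_sum / len(raw_scores))

    # A short and a long theory built from equally predicted rules are
    # directly comparable: theory_score([0.5, 0.5]) and
    # theory_score([0.5] * 10) both evaluate to 0.5.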
Low-frequency Events  Although some statistical natural language applications employ backing-off estimation techniques[11][5] to handle low-frequency events, Pearl uses a very simple estimation technique, reluctantly attributed to Church[7]. This technique estimates the probability of an event by adding 0.5 to every frequency count.⁷ Low-scoring theories will be predicted by the Earley-style parser. And, if no other hypothesis is suggested, these theories will be pursued. If a high scoring theory advances a theory with a very low raw score, the resulting theory's score will be the geometric mean of all of the raw scores of theories contained in that theory, and thus will be much higher than the low-scoring theory's score.
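In code, the estimate is one line; the point is only that no event is ever assigned probability zero (a minimal sketch; the function name and arguments are our own):

    def smoothed_prob(count, total, n_outcomes):
        # add 0.5 to every frequency count, so zero-frequency events
        # still receive a small non-zero probability
        return (count + 0.5) / (total + 0.5 * n_outcomes)

    # an unseen rule among 20 alternatives observed 100 times in a
    # context scores 0.5 / 110, never zero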
Example of Scoring Function  As an example of how the conditional-probability-based scoring function handles ambiguity, consider the sentence

Fruit flies like a banana.

in the domain of insect studies. Lexical probabilities should indicate that the word "flies" is more likely to be a plural noun than an active verb. This information is incorporated in the trigram scores. However, when the interpretation

S → . NP VP

is proposed, two possible NPs will be parsed,

NP → noun (fruit)

and

NP → noun noun (fruit flies).

Since this sentence is syntactically ambiguous, if the first hypothesis is tested first, the parser will interpret this sentence incorrectly.

However, this will not happen in this domain. Since "fruit flies" is a common idiom in insect studies, the score of its trigram, noun noun verb, will be much greater than the score of the trigram, noun verb verb. Thus, not only will the lexical probability of the word "flies/verb" be lower than that of "flies/noun," but also the raw score of "NP → noun (fruit)" will be lower than that of "NP → noun noun (fruit flies)," because of the differential between the trigram scores.

So, "NP → noun noun" will be used first to advance the "S → NP VP" rule. Further, even if the parser advances both NP hypotheses, the "S → NP . VP" rule using "NP → noun noun" will have a higher score than the "S → NP . VP" rule using "NP → noun."

⁷We are not deliberately avoiding using all probability estimation techniques, only those backing-off techniques which use independence assumptions that frequently provide misleading information when applied to natural language.
Interleaved Architecture in Pearl 
The interleaved architecture implemented in Pearl provides many advantages over the traditional pipeline architecture, but it also introduces certain risks. Decisions about word and part-of-speech ambiguity can be delayed until syntactic processing can disambiguate them. And, using the appropriate score combination functions, the scoring of ambiguous choices can direct the parser towards the most likely interpretation efficiently.

However, with these delayed decisions comes a vastly enlarged search space. The effectiveness of the parser depends on a majority of the theories having very low scores based on either unlikely syntactic structures or low scoring input (such as low scores from a speech recognizer or low lexical probability). In experiments we have performed, this has been the case.
The Parsing Algorithm 
Pearl is a time-asynchronous bottom-up chart parser with Earley-type top-down prediction. The significant difference between Pearl and non-probabilistic bottom-up parsers is that instead of completely generating all grammatical interpretations of a word string, Pearl pursues the N highest-scoring incomplete theories in the chart at each pass. However, Pearl parses without pruning. Although it is only advancing the N highest-scoring incomplete theories, it retains the lower scoring theories in its agenda. If the higher scoring theories do not generate viable alternatives, the lower scoring theories may be used on subsequent passes.

The parsing algorithm begins with the input word lattice. An n × n chart is allocated, where n is the length of the longest word string in the lattice. Lexical rules for the input word lattice are inserted into the chart. Using Earley-type prediction, a sentence is predicted at the beginning of the sentence, and all of the theories which are predicted by that initial sentence are inserted into the chart. These incomplete theories are scored according to the context-sensitive conditional probabilities and the trigram part-of-speech model. The incomplete theories are tested in order by score, until N theories are advanced.⁸ The resulting advanced theories are scored and predicted for, and the new incomplete predicted theories are scored and added to the chart. This process continues until a complete parse tree is determined, or until the parser decides, heuristically, that it should not continue. The heuristics we used for determining that no parse can be found for an input are based on the highest scoring incomplete theory in the chart, the number of passes the parser has made, and the size of the chart.

⁸We believe that N depends on the perplexity of the grammar used, but for the string grammar used for our experiments we used N=3. For the purposes of training, a higher N should be used in order to generate more parses.
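The control loop can be sketched abstractly as follows (our own reconstruction, assuming `expand` advances a theory and reports completion and `score` is the geometric-mean scoring function; none of these names come from the paper):

    import heapq

    def best_first_passes(initial_theories, expand, score, N=3, max_passes=50):
        # Advance only the N best theories per pass; everything else
        # stays on the agenda, so nothing is ever pruned outright.
        agenda = [(-score(t), i, t) for i, t in enumerate(initial_theories)]
        heapq.heapify(agenda)
        counter = len(agenda)                   # tie-breaker for the heap
        for _ in range(max_passes):
            if not agenda:
                break                           # no theories left: give up
            successors = []
            for _ in range(min(N, len(agenda))):
                _, _, theory = heapq.heappop(agenda)
                done, new_theories = expand(theory)
                if done:
                    return theory               # first complete parse wins
                successors.extend(new_theories)
            for s in successors:                # score and re-insert
                counter += 1
                heapq.heappush(agenda, (-score(s), counter, s))
        return None                             # heuristic failure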
Pearl's Capabilities
Besides using statistical methods to guide the parser through the parsing search space, Pearl also performs other functions which are crucial to robustly processing unrestricted natural language text and speech.
Handling Unknown Words  Pearl uses a very simple probabilistic unknown word model to hypothesize categories for unknown words. When a word is unknown to the system's lexicon, it is assumed to be any one of the open class categories. The lexical probability given a category is the probability of that category occurring in the training corpus.
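A sketch of this model (the open-class inventory shown is an assumption, not the system's actual category set):

    OPEN_CLASSES = ("noun", "verb", "adjective", "adverb")  # assumed set

    def unknown_word_hypotheses(word, category_freq, total_tokens):
        # an out-of-lexicon word may be any open-class category; its
        # lexical probability is simply that category's prior
        # frequency in the training corpus
        return {cat: category_freq[cat] / total_tokens for cat in OPEN_CLASSES}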
Idiom Processing and Lattice Parsing  Since the parsing search space can be simplified by recognizing idioms, Pearl allows the input string to include idioms that span more than one word in the sentence. This is accomplished by viewing the input sentence as a word lattice instead of a word string. Since idioms tend to be unambiguous with respect to part-of-speech, they are generally favored over processing the individual words that make up the idiom, since the scores of rules containing the words will tend to be less than 1, while a syntactically appropriate, unambiguous idiom will have a score of close to 1.
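For illustration, a lattice can be represented as a set of edges over inter-word positions, with an idiom simply appearing as one edge spanning several positions (a sketch built on the earlier example; the tuple layout is our own):

    # (start, end, surface form, part-of-speech) over positions 0..5
    lattice = [
        (0, 1, "fruit", "noun"),
        (1, 2, "flies", "noun"),
        (1, 2, "flies", "verb"),
        (0, 2, "fruit flies", "noun"),  # idiom edge competing with the words
        (2, 3, "like", "verb"),
        (2, 3, "like", "preposition"),
        (3, 4, "a", "determiner"),
        (4, 5, "banana", "noun"),
    ]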
The ability to parse a sentence with multiple word hypotheses and word boundary hypotheses makes Pearl very useful in the domain of spoken language processing. By delaying decisions about word selection but maintaining scoring information from a speech recognizer, the parser can use grammatical information in word selection without slowing the speech recognition process. Because of Pearl's interleaved architecture, one could easily incorporate scoring information from a speech recognizer into the set of scoring functions used in the parser. Pearl could also provide feedback to the speech recognizer about the grammaticality of fragment hypotheses to guide the recognizer's search.
Partial Parses  The main advantage of chart-based parsing over other parsing algorithms is that the parser can also recognize well-formed substrings within the sentence in the course of pursuing a complete parse. Pearl takes full advantage of this characteristic. Once Pearl is given the input sentence, it awaits instructions as to what type of parse should be attempted for this input. A standard parser automatically attempts to produce a sentence (S) spanning the entire input string. However, if this fails, the semantic interpreter might be able to derive some meaning from the sentence if given
non-overlapping noun, verb, and prepositional phrases. If a sentence fails to parse, requests for partial parses of the input string can be made by specifying a range which the parse tree should cover and the category (NP, VP, etc.).
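Since the chart doubles as a well-formed substring table, a partial-parse request reduces to a lookup by span and category, roughly as in this sketch (the chart layout and attribute names are our own assumptions):

    def partial_parses(chart, category, start, end):
        # chart maps a span (start, end) to the theories built over it;
        # return the complete constituents of the requested category
        return [t for t in chart.get((start, end), [])
                if t.complete and t.category == category]

    # e.g. partial_parses(chart, "NP", 0, 3) retrieves the noun phrases
    # covering the first three words even when no S parse was found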
The ability to produce partial parses allows the system to handle multiple sentence inputs. In both speech and text processing, it is difficult to know where the end of a sentence is. For instance, one cannot reliably determine when a speaker terminates a sentence in free speech. And in text processing, abbreviations and quoted expressions produce ambiguity about sentence termination. When this ambiguity exists, Pearl can be queried for partial parse trees for the given input, where the goal category is a sentence. Thus, if the word string is actually two complete sentences, the parser can return this information. However, if the word string is only one sentence, then a complete parse tree is returned at little extra cost.
Trainability  One of the major advantages of probabilistic parsers is trainability. The conditional probabilities used by Pearl are estimated by using frequencies from a large corpus of parsed sentences. The parsed sentences must be parsed using the grammar formalism which Pearl will use.

Assuming the grammar is not recursive in an unconstrained way, the parser can be trained in an unsupervised mode. This is accomplished by running the parser without the scoring functions and generating many parse trees for each sentence. Previous work⁹ has demonstrated that the correct information from these parse trees will be reinforced, while the incorrect substructure will not. Multiple passes of re-training using frequency data from the previous pass should cause the frequency tables to converge to a stable state. This hypothesis has not yet been tested.¹⁰

An alternative to completely unsupervised training is to take a parsed corpus for any domain of the same language using the same grammar, and use the frequency data from that corpus as the initial training material for the new corpus. This approach should serve only to minimize the number of unsupervised passes required for the frequency data to converge.
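The unsupervised regime described above amounts to a re-estimation loop, sketched here under the stated assumptions (the helper callables are hypothetical, and convergence, as footnote 10 warns, is not guaranteed):

    def retrain(corpus, parse_all, count_events, estimate, passes=5):
        # parse each sentence many ways, accumulate rule/trigram
        # frequencies over all returned parses, re-estimate the
        # probability tables, and repeat until the data settles
        tables = None                       # pass 1: unweighted parsing
        for _ in range(passes):
            counts = {}
            for sentence in corpus:
                for tree in parse_all(sentence, tables):
                    count_events(tree, counts)
            tables = estimate(counts)
        return tables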
Preliminary Evaluation 
While we have not yet done extensive testing of all of the capabilities of Pearl, we performed some simple tests to determine if its performance is at least consistent with the premises upon which it is based. The test sentences used for this evaluation are not from the training data on which the parser was trained. Using Pearl's context-free grammar, these test sentences produced an average of 64 parses per sentence, with some sentences producing over 100 parses.

⁹This is an unpublished result, reportedly due to Fujisaki at IBM Japan.

¹⁰In fact, for certain grammars, the frequency tables may not converge at all, or they may converge to zero, with the grammar generating no parses for the entire corpus. This is a worst-case scenario which we do not anticipate happening.
Unknown Word Part-of-speech 
Assignment 
To determine how Pearl handles unknown words, we removed five words from the lexicon, I, know, lee, describe, and station, and tried to parse the 40 sample sentences using the simple unknown word model previously described.

In this test, the pronoun, I, was assigned the correct part-of-speech 9 of the 10 times it occurred in the test sentences. The nouns, lee and station, were correctly tagged 4 of 5 times. And the verbs, know and describe, were correctly tagged 3 of 3 times.
    pronoun    90%
    noun       80%
    verb      100%
    overall    89%

Figure 1: Performance on Unknown Words in Test Sentences
While this accuracy is expected for unknown words in isolation, based on the accuracy of the part-of-speech tagging model, the performance is expected to degrade for sequences of unknown words.
Prepositional Phrase Attachment 
Accurately determining prepositional phrase attachment in general is a difficult and well-documented problem. However, based on experience with several different domains, we have found prepositional phrase attachment to be a domain-specific phenomenon for which training can be very helpful. For instance, in the direction-finding domain, from and to prepositional phrases generally attach to the preceding verb and not to any noun phrase. This tendency is captured in the training process for Pearl and is used to guide the parser to the more likely attachment with respect to the domain. This does not mean that Pearl will get the correct parse when the less likely attachment is correct; in fact, Pearl will invariably get this case wrong. However, based on the premise that this is the less likely attachment, this will produce more correct analyses than incorrect. And, using a more sophisticated statistical model, this performance can easily be improved.

Pearl's performance on prepositional phrase attachment was very high (54/55 or 98.2% correct). The reason the accuracy rate was so high is that the direction-finding domain is very consistent in its use of individual prepositions. The accuracy rate is not expected to be as high in other domains, although it certainly
should be higher than 50%, and we would expect it to be greater than 75%, although we have not performed any rigorous tests on other domains to verify this.
    Preposition      to    from    on    Overall
    Accuracy Rate   92%    100%   100%    98.2%

Figure 2: Accuracy Rate for Prepositional Phrase Attachment, by Preposition
Overall Parsing Accuracy 
The 40 test sentences were parsed by Pearl and the highest scoring parse for each sentence was compared to the correct parse produced by PUNDIT. Of these 40 sentences, Pearl produced parse trees for 38 of them, and 35 of these parse trees were equivalent to the correct parse produced by PUNDIT, for an overall accuracy of 88%.

Many of the test sentences were not difficult to parse for existing parsers, but most had some grammatical ambiguity which would produce multiple parses. In fact, on 2 of the 3 sentences which were incorrectly parsed, Pearl produced the correct parse as well, but the correct parse did not have the highest score.
Future Work 
The "Pearl parser takes advantage of donmin-depen(lent 
information to select the most approi)riate interpreta- 
tion of an inpul,. Ilowew'.r, i,he statistical measure used 
to disalnbiguate these interpretations is sensitive to 
certain attributes of the grammatical formalism used, 
as well as to the part-of-si)eech categories used to la- 
I)el lexical entries. All of the exl)erimcnts performed on 
T'carl titus fa," have been using one gra.linrla.r, one pa.rl.- 
of-speech tag set, and one donlaiu (hecause of avail- 
ability constra.ints). Future experime.nl,s are I)lanned 
to evalua.l,e "Pearl's i)erforma.nce on dii\[cre.nt domaius, 
as well as on a general corpus of English, arid ott dig 
fi~rent grammars, including a granunar derived fi'om a 
nlanually parsed corl)us. 
Conclusion 
The probabilistic parser which we have described provides a platform for exploiting the useful information made available by statistical models in a manner which is consistent with existing grammar formalisms and parser designs. Pearl can be trained to use any context-free grammar, accompanied by the appropriate training material. And, the parsing algorithm is very similar to a standard bottom-up algorithm, with the exception of using theory scores to order the search.

More thorough testing is necessary to measure Pearl's performance in terms of parsing accuracy, part-of-speech assignment, unknown word categorization, idiom processing capabilities, and even word selection in speech processing. With the exception of word selection, preliminary tests show Pearl performs these tasks with a high degree of accuracy.
References 
[1] Ayuso, D., Bobrow, R., et al. 1990. Towards Understanding Text with a Very Large Vocabulary. In Proceedings of the June 1990 DARPA Speech and Natural Language Workshop. Hidden Valley, Pennsylvania.

[2] Brill, E., Magerman, D., Marcus, M., and Santorini, B. 1990. Deducing Linguistic Structure from the Statistics of Large Corpora. In Proceedings of the June 1990 DARPA Speech and Natural Language Workshop. Hidden Valley, Pennsylvania.

[3] Chitrao, M. and Grishman, R. 1990. Statistical Parsing of Messages. In Proceedings of the June 1990 DARPA Speech and Natural Language Workshop. Hidden Valley, Pennsylvania.

[4] Church, K. 1988. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the Second Conference on Applied Natural Language Processing. Austin, Texas.

[5] Church, K. and Gale, W. 1990. Enhanced Good-Turing and Cat-Cal: Two New Methods for Estimating Probabilities of English Bigrams. Computers, Speech and Language.

[6] Fano, R. 1961. Transmission of Information. New York, New York: MIT Press.

[7] Gale, W. A. and Church, K. 1990. Poor Estimates of Context are Worse than None. In Proceedings of the June 1990 DARPA Speech and Natural Language Workshop. Hidden Valley, Pennsylvania.

[8] Hindle, D. 1988. Acquiring a Noun Classification from Predicate-Argument Structures. Bell Laboratories.

[9] Hindle, D. and Rooth, M. 1990. Structural Ambiguity and Lexical Relations. In Proceedings of the June 1990 DARPA Speech and Natural Language Workshop. Hidden Valley, Pennsylvania.

[10] Jelinek, F. 1985. Self-organizing Language Modeling for Speech Recognition. IBM Report.

[11] Katz, S. M. 1987. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-35, No. 3.
[12] Sharman, R. A., Jelinek, F., and Mercer, R. 1990. Generating a Grammar for Statistical Training. In Proceedings of the June 1990 DARPA Speech and Natural Language Workshop. Hidden Valley, Pennsylvania.
