Parsing the Voyager Domain Using Pearl 
David M. Magerman and Mitchell P. Marcus 
CIS Department 
University of Pennsylvania 
Philadelphia, PA 19104 
ABSTRACT 
This paper* describes a natural language parsing algorithm for
unrestricted text which uses a probability-based scoring function to
select the "best" parse of a sentence according to a given grammar.
The parser, Pearl, is a time-asynchronous bottom-up chart parser with
Earley-type top-down prediction which pursues the highest-scoring
theory in the chart, where the score of a theory represents the extent
to which the context of the sentence predicts that interpretation. This
parser differs from previous attempts at stochastic parsers in that it
uses a richer form of conditional probabilities based on context to
predict likelihood. Pearl also provides a framework for incorporating
the results of previous work in part-of-speech assignment, unknown word
models, and other probabilistic models of linguistic features into one
parsing tool, interleaving these techniques instead of using the
traditional pipeline architecture. In tests performed on the Voyager
direction-finding domain, Pearl has been successful at resolving
part-of-speech ambiguity, determining categories for unknown words, and
selecting correct parses first using a very loosely fitting covering
grammar.²
INTRODUCTION 
All natural language grammars are ambiguous. Even tightly 
fitting natural language grammars are ambiguous in some ways. 
Loosely fitting grammars, which are necessary for handling the
variability and complexity of unrestricted text and speech, are 
worse. The standard technique for dealing with this ambiguity, 
pruning grammars by hand, is painful, time-consuming, and usually
arbitrary. The solution which many people have proposed is
to use stochastic models to train statistical grammars automati- 
cally from a large corpus. 
Attempts at applying statistical techniques to natural language
parsing have exhibited varying degrees of success. These
successful and unsuccessful attempts have suggested to us that:
• Stochastic techniques combined with traditional linguistic
theories can (and indeed must) provide a solution to the 
natural language understanding problem. 
*This work was partially supported by DARPA grant No. N0014-85-K0018,
ONR contract No. N00014-89-C-0171 by DARPA and AFOSR jointly
under grant No. AFOSR-90-0066, and by ARO grant No. DAAL 03-89-C0031
PRI. Special thanks to Carl Weir and Lynette Hirschman at Unisys for
their valued input, guidance and support.
²The grammar used for our experiments is the string grammar used in
Unisys' PUNDIT natural language understanding system.
• In order for stochastic techniques to be effective, they must
be applied with restraint (poor estimates of context are
worse than none[6]).
• Interactive, interleaved architectures are preferable to pipeline
architectures in NLU systems, because they use more of the
available information in the decision-making process.
We have constructed a stochastic parser, Pearl, which is based
on these ideas. 
The development of the Pearl parser is an effort to combine
the statistical models developed recently into a single tool which
incorporates all of these models into the decision-making
component of a parser. While we have only attempted to incorporate a
few simple statistical models into this parser, Pearl is structured
in a way which allows any number of syntactic, semantic, and
other knowledge sources to contribute to parsing decisions. The
current implementation of Pearl uses Church's part-of-speech
assignment trigram model, a simple probabilistic unknown word
model, and a conditional probability model for grammar rules
based on part-of-speech trigrams and parent rules.
By combining multiple knowledge sources and using a chart- 
parsing framework, Pearl attempts to handle a number of difficult 
problems. Pearl has the capability to parse word lattices, an 
ability which is useful in recognizing idioms in text processing, as 
well as in speech processing. The parser uses probabilistic training 
from a corpus to disambiguate between grammatically acceptable 
structures, such as determining prepositional phrase attachment 
and conjunction scope. Finally, Pearl maintains a well-formed
substring table within its chart to allow for partial parse retrieval.
Partial parses are useful both for error-message generation and for 
processing ungrammatical or incomplete sentences. 
For preliminary tests of Pearl's capabilities, we are using the
Voyager direction-finding domain, a spoken-language system
developed at MIT.³ We have selected this domain for a number
of reasons. First, it exhibits the attachment regularities which
we are trying to capture with the context-sensitive probability
model. Also, since both MIT and Unisys have developed parsers
and grammars for this domain, there are existing parsers with
which we can compare Pearl. Finally, Pearl's dependence on
a parsed corpus to train its models and to derive its grammar
required that we use a domain for which a parsed corpus
existed. A corpus of 1100 parsed sentences was generated by the
Unisys PUNDIT Language Understanding System. These parse
trees were evaluated to be semantically correct by PUNDIT's
semantics component, although no hand-verification of this corpus
was performed. PUNDIT's parser uses a string grammar with many
complicated, hand-generated restrictions. The goal of the
experiments we performed was to reproduce (or improve upon) the
parsing accuracy of PUNDIT using just the context-free backbone
of the PUNDIT grammar, without the hand-generated restrictions
and, equally important, without the benefit of semantic analysis.
³Special thanks to Victor Zue at MIT for the use of the speech data from
MIT's Voyager system.
In a test on 40 Voyager sentences excluded from the training
material, Pearl has shown promising results in handling part- 
of-speech assignment, prepositional phrase attachment, and un- 
known word categorization. Pearl correctly parsed 35 out of 40 
or 87.5% of these sentences, where a correct parse is defined to
mean one which would produce a correct response from the Voy- 
ager system. We will describe the details of this experiment later. 
In this paper, we will first explain our contribution to the 
stochastic models which are used in Pearl: a context-free
grammar with context-sensitive conditional probabilities. Then, we
will describe the parser's architecture and the parsing algorithm.
Finally, we will give the results of experiments we performed using
Pearl which explore its capabilities. 
USING STATISTICS TO PARSE 
Recent work involving context-free and context-sensitive
probabilistic grammars provides little hope for the success of processing
unrestricted text using probabilistic techniques. Works by
Chitrao and Grishman[3] and by Sharman, Jelinek, and Mercer[11]
exhibit accuracy rates lower than 50% using supervised training.
Supervised training for probabilistic CFGs requires parsed
corpora, which are very costly in time and man-power[2].
In our investigations, we have made two observations which 
attempt to explain the lackluster performance of statistical
parsing techniques:
• Simple probabilistic CFGs provide general information about
how likely a construct is going to appear anywhere in a
sample of a language. This average likelihood is often a poor
estimate of probability. 
• Parsing algorithms which accumulate probabilities of parse 
theories by simply multiplying them over-penalize infre- 
quent constructs. 
Pearl avoids the first pitfall by using a context-sensitive
conditional probability CFG, where context of a theory is determined
by the theories which predicted it and the part-of-speech
sequences in the input sentence. To address the second issue, Pearl
scores each theory by using the geometric mean of the contextual
conditional probabilities of all of the theories which have
contributed to that theory. For ranking purposes, this is equivalent
to using the average of the logs of these probabilities.
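To make the contrast between the two accumulation schemes concrete, the following is a minimal sketch, not Pearl's implementation. The function names and the toy probabilities are assumptions introduced here for illustration, and all probabilities are assumed to be strictly positive.

```python
import math


def product_score(probs):
    """Score a theory by multiplying the probabilities of its parts."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def geometric_mean_score(probs):
    """Score a theory by the geometric mean of the probabilities of its parts.

    Computed as exp of the average log-probability; assumes every p > 0.
    """
    if not probs:
        raise ValueError("need at least one probability")
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


# Hypothetical theories: same per-part probability, different sizes.
short_theory = [0.5, 0.5]
long_theory = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]

print(product_score(short_theory), product_score(long_theory))
print(geometric_mean_score(short_theory), geometric_mean_score(long_theory))
```

Under the product, the longer theory scores lower simply because it has more parts; under the geometric mean, the two theories score the same.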
CFG with context-sensitive conditional probabilities 
In a very large parsed corpus of English text, one finds that 
the most frequently occurring noun phrase structure in the text 
is a noun phrase containing a determiner followed by a noun. 
Simple probabilistic CFGs dictate that, given this information, 
"determiner noun" should be the most likely interpretation of a 
noun phrase. 
Now, consider only those noun phrases which occur as subjects 
of a sentence. In a given corpus, you might find that pronouns
occur just as frequently as "determiner noun"s in the subject
position. This type of information can easily be captured by
conditional probabilities. 
Finally, assume that the sentence begins with a pronoun followed
by a verb. In this case, it is quite clear that, while you
can probably concoct a sentence which fits this description and
does not have a pronoun for a subject, the first theory which you
should pursue is one which makes this hypothesis. 
The context-sensitive conditional probabilities which Pearl
uses take into account the immediate parent of a theory⁴ and the
part-of-speech trigram centered at the beginning of the theory.
For example, consider the sentence:
My first love was named Pearl.
(no subliminal propaganda intended)
A theory which tries to interpret "love" as a verb will be scored
based on the part-of-speech trigram "adjective verb verb" and the
parent theory, probably "S → NP VP." A theory which interprets
"love" as a noun will be scored based on the trigram "adjective
noun verb." Although lexical probabilities favor "love" as a verb,
the conditional probabilities will heavily favor "love" as a noun
in this context.⁵
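As an illustration of how such conditional probabilities might be estimated from a parsed corpus, the sketch below uses plain relative frequencies over (context, rule) observations. The function name, the rule strings, the "<s>" sentence-start tag, and the toy counts are all assumptions made for this example; they are not taken from Pearl or from the training data.

```python
from collections import Counter


def conditional_rule_probs(observations):
    """Estimate P(rule | context) by relative frequency.

    observations: iterable of (context, rule) pairs, where a context is a
    hashable value such as (parent_rule, part_of_speech_trigram).
    """
    pair_counts = Counter()
    context_counts = Counter()
    for context, rule in observations:
        pair_counts[(context, rule)] += 1
        context_counts[context] += 1
    return {
        (context, rule): count / context_counts[context]
        for (context, rule), count in pair_counts.items()
    }


# Hypothetical contexts: (parent rule, trigram centered at the NP's start).
subject_context = ("S -> NP VP", ("<s>", "pronoun", "verb"))
object_context = ("VP -> verb NP", ("verb", "det", "noun"))

observations = (
    [(subject_context, "NP -> pronoun")] * 3
    + [(subject_context, "NP -> det noun")] * 1
    + [(object_context, "NP -> det noun")] * 4
)

probs = conditional_rule_probs(observations)
print(probs[(subject_context, "NP -> pronoun")])
```

In this toy data "NP -> det noun" is the most frequent NP rule overall, yet in the subject context that begins with a pronoun the pronoun rule receives the higher conditional probability.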
Using the Geometric Mean of Theory Scores 
According to probability theory, the likelihood of two
independent events occurring at the same time is the product of their
individual probabilities. Previous statistical parsing techniques
apply this definition to the co-occurrence of two theories in a parse,
and claim that the likelihood of the two theories being correct is 
the product of the probabilities of the two theories. 
This application of probability theory ignores two vital obser- 
vations about the domain of statistical parsing: 
• Two constructs occurring in the same sentence are not
necessarily independent (and frequently are not). If the
independence assumption is violated, then the product of
individual probabilities has no meaning with respect to the
joint probability of two events.
• Since statistical parsing suffers from sparse data, probability 
estimates of low frequency events will usually be inaccurate 
estimates. Extreme underestimates of the likelihood of low 
frequency events will produce misleading joint probability 
estimates. 
⁴The parent of a theory is defined as a theory with a CF rule which contains
the left-hand side of the theory. For instance, if "S → NP VP" and "NP →
det n" are two grammar rules, the first rule can be a parent of the second,
since the left-hand side of the second "NP" occurs in the right-hand side of
the first rule.
⁵In fact, the part-of-speech tagging model which is also used in Pearl will
heavily favor "love" as a noun. We ignore this behavior to demonstrate the
benefits of the trigram conditioning.
From these observations, we have determined that estimating 
joint probabilities of theories using individual probabilities is too 
difficult with the available data. We have found that the geometric
mean of these probability estimates provides an accurate
assessment of a theory's viability. 
The Actual Theory Scoring Function 
In a departure from standard practice, and perhaps against
better judgment, we will include a precise description of the
theory scoring function used by Pearl. This scoring function tries to
solve some of the problems noted in previous attempts at
probabilistic parsing[3][11]:
• Theory scores should not depend on the length of the string 
which the theory spans. 
• Sparse data (zero-frequency events) and even zero-probability
events do occur, and should not result in zero scoring
theories.
• Theory scores should not discriminate against unlikely
constructs when the context predicts them.
In this discussion, a theory is defined to be a partial or
complete syntactic interpretation of a word string, or, simply, a parse
tree. The raw score of a theory, $\theta$, is calculated by taking the
product of the conditional probability of that theory's CFG rule
given the context, where context is a part-of-speech trigram
centered at the beginning of the theory and a parent theory's rule,
and the score of the contextual trigram:
$$Sc_{\mathrm{raw}}(\theta) = P(\mathrm{rule}_{\theta} \mid (p_0 p_1 p_2), \mathrm{rule}_{\mathrm{parent}}) \, Sc(p_0 p_1 p_2)$$
Here, the score of a trigram is the product of the mutual
information of the part-of-speech trigram,⁶ $p_0 p_1 p_2$, and the lexical
probability of the word at the location of $p_1$ being assigned that
part-of-speech $p_1$.⁷ In the case of ambiguity (part-of-speech
ambiguity or multiple parent theories), the maximum value of this
product is used. The score of a partial theory or a complete
theory is the geometric mean of the raw scores of all of the theories
which are contained in that theory.
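The raw score computation, including the maximum taken over ambiguous contexts, can be sketched as follows. This is an illustrative reading of the definition above, not Pearl's code: the table layout, the function name, and the numeric values are hypothetical, and the trigram score is treated as a precomputed lookup because the paper notes that the actual trigram scoring function is more complicated.

```python
def raw_score(rule, contexts, rule_prob, trigram_score):
    """Return the best value of P(rule | trigram, parent) * Sc(trigram).

    contexts: the (trigram, parent_rule) pairs that remain possible when
    part-of-speech tags or parent theories are ambiguous.
    rule_prob: dict mapping (rule, trigram, parent_rule) to a probability.
    trigram_score: dict mapping a trigram to its score.
    Missing table entries are treated as 0.0 in this sketch.
    """
    return max(
        rule_prob.get((rule, trigram, parent), 0.0)
        * trigram_score.get(trigram, 0.0)
        for trigram, parent in contexts
    )


# Hypothetical tables for two readings of the same span.
rule = "NP -> noun noun"
parent = "S -> NP VP"
rule_prob = {
    (rule, ("noun", "noun", "verb"), parent): 0.5,
    (rule, ("noun", "verb", "verb"), parent): 0.2,
}
trigram_score = {
    ("noun", "noun", "verb"): 0.8,
    ("noun", "verb", "verb"): 0.1,
}
contexts = [
    (("noun", "noun", "verb"), parent),
    (("noun", "verb", "verb"), parent),
]

score = raw_score(rule, contexts, rule_prob, trigram_score)
print(score)
```

The score of a larger theory would then be the geometric mean of the raw scores of the theories it contains.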
Theory Length Independence This scoring function, although
heuristic in derivation, provides a method for evaluating the value
of a theory, regardless of its length. When a rule is first predicted
(Earley-style), its score is just its raw score, which represents how
much the context predicts it. However, when the parse process
hypothesizes interpretations of the sentence which reinforce this
theory, the geometric mean of all of the raw scores of the rule's
subtree is used, representing the overall likelihood of the theory
given the context of the sentence.
⁶The mutual information of a part-of-speech trigram, $p_0 p_1 p_2$, is defined
to be $\frac{P(p_0 p_1 p_2)}{P(p_0 x p_2)\,P(p_1)}$, where $x$ is any
part-of-speech. See [4] for further explanation.
⁷The trigram scoring function actually used by the parser is somewhat
more complicated than this.
Low-frequency Events Although some statistical natural
language applications employ backing-off estimation techniques[10][5]
to handle low-frequency events, Pearl uses a very simple
estimation technique, reluctantly attributed to Church[6]. This
technique estimates the probability of an event by adding 0.5 to
every frequency count.⁸ Low-scoring theories will be predicted by
the Earley-style parser. And, if no other hypothesis is suggested,
these theories will be pursued. If a high scoring theory advances
a theory with a very low raw score, the resulting theory's score
will be the geometric mean of all of the raw scores of theories
contained in that theory, and thus will be much higher than the
low-scoring theory's score.
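The add-0.5 estimate can be sketched as below. The paper states only that 0.5 is added to every frequency count; the normalization used here (dividing by the total count plus 0.5 per possible outcome) is an assumption of this sketch, as are the function name and the toy counts.

```python
def add_half_estimate(count, total, num_outcomes):
    """Estimate an event probability after adding 0.5 to every count.

    count: observed frequency of the event.
    total: sum of observed frequencies over all outcomes in this context.
    num_outcomes: number of distinct outcomes possible in this context
    (assumed normalization; not specified in the paper).
    """
    return (count + 0.5) / (total + 0.5 * num_outcomes)


# Hypothetical counts for four competing rules in one context.
counts = [7, 3, 0, 0]
estimates = [add_half_estimate(c, sum(counts), len(counts)) for c in counts]
print(estimates)
```

With this estimate an unseen event receives a small but non-zero probability, so a theory containing it is never scored at exactly zero.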
Example of Scoring Function As an example of how the
conditional-probability-based scoring function handles ambiguity,
consider the sentence
Fruit flies like a banana.
in the domain of insect studies. Lexical probabilities should
indicate that the word "flies" is more likely to be a plural noun than
a tensed verb. This information is incorporated in the trigram
scores. However, when the interpretation
S → . NP VP
is proposed, two possible NPs will be parsed,
NP → noun (fruit)
and
NP → noun noun (fruit flies).
Since this sentence is syntactically ambiguous, if the first
hypothesis is tested first, the parser will interpret this sentence
incorrectly.
However, this will not happen in this domain. Since "fruit
flies" is a common idiom in insect studies, the score of its
trigram, noun noun verb, will be much greater than the score of the
trigram, noun verb verb. Thus, not only will the lexical
probability of the word "flies/verb" be lower than that of "flies/noun,"
but also the raw score of "NP → noun (fruit)" will be lower than
that of "NP → noun noun (fruit flies)," because of the differential
between the trigram scores.
So, "NP → noun noun" will be used first to advance the "S →
. NP VP" rule. Further, even if the parser advances both NP
hypotheses, the "S → NP . VP" rule using "NP → noun noun"
will have a higher score than the "S → NP . VP" rule using "NP
→ noun."
INTERLEAVED ARCHITECTURE IN 
PEARL 
The interleaved architecture implemented in Pearl provides
many advantages over the traditional pipeline architecture, but
it also introduces certain risks. Decisions about word and
part-of-speech ambiguity can be delayed until syntactic processing can
disambiguate them. And, using the appropriate score combination
functions, the scoring of ambiguous choices can direct the
parser towards the most likely interpretation efficiently.
⁸We are not deliberately avoiding using all probability estimation
techniques, only those backing-off techniques which use independence
assumptions that frequently provide misleading information when applied to
natural language.
However, with these delayed decisions comes a vastly enlarged
search space. The effectiveness of the parser depends on a
majority of the theories having very low scores based on either unlikely
syntactic structures or low scoring input (such as low scores from
a speech recognizer or low lexical probability). In experiments we 
have performed, this has been the case. 
The Parsing Algorithm 
Pearl is an agenda-based time-asynchronous bottom-up chart
parser with Earley-type top-down prediction. The significant
difference between Pearl and non-probabilistic bottom-up parsers
is that instead of completely generating all grammatical
interpretations of a word string, Pearl uses an agenda to order the
incomplete theories in its chart to determine which theory to
advance next. The agenda is sorted by the value of the theory
scoring function described above. Instead of expanding all
theories in the chart, Pearl pursues the highest-scoring incomplete
theories in the chart, advancing up to N theories at each pass.
However, Pearl parses without pruning. Although it is only
advancing N incomplete theories at each pass, it retains the lower
scoring theories in its agenda. If the higher scoring theories do
not generate viable alternatives, the lower scoring theories may
be used on subsequent passes.
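The agenda discipline described above, taking the best N theories per pass while keeping everything else for later passes, can be sketched with a priority queue. This is a simplified illustration rather than Pearl's data structure: the `Agenda` class, its method names, and the string stand-ins for theories are assumptions, and the sketch simply removes the N best entries instead of testing whether each one can actually be advanced.

```python
import heapq


class Agenda:
    """Score-ordered agenda that never discards low-scoring theories."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so theories are never compared

    def push(self, score, theory):
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, self._counter, theory))
        self._counter += 1

    def pop_best(self, n):
        """Remove and return up to n highest-scoring (score, theory) pairs."""
        best = []
        while self._heap and len(best) < n:
            neg_score, _, theory = heapq.heappop(self._heap)
            best.append((-neg_score, theory))
        return best

    def __len__(self):
        return len(self._heap)


agenda = Agenda()
for score, theory in [(0.2, "t1"), (0.9, "t2"), (0.05, "t3"),
                      (0.6, "t4"), (0.4, "t5")]:
    agenda.push(score, theory)

first_pass = agenda.pop_best(3)   # N = 3, the value used in the experiments
print(first_pass, len(agenda))
```

The two theories left on the agenda after the first pass remain available, which mirrors parsing without pruning.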
The parsing algorithm begins with an input word lattice, which
describes the input sentence and includes possible idiom hypotheses
and may include alternative word hypotheses.⁹ Lexical rules for
the input word lattice are inserted into the parser's chart. Using
Earley-type prediction, a sentence (S) is predicted at the
beginning of the input, and all of the theories which are predicted by
that initial sentence are inserted into the chart. These incomplete
theories are scored according to the context-sensitive conditional
probabilities and the trigram part-of-speech model. The
incomplete theories are tested in order by score, until N theories are
advanced.¹⁰ The resulting advanced theories are scored and
predicted for, and the new incomplete predicted theories are scored
and added to the chart. This process continues until a complete
parse tree is determined, or until the parser decides, heuristically,
that it should not continue. The heuristics we used for
determining that no parse can be found for an input are based on
the highest scoring incomplete theory in the chart, the number of
passes the parser has made, and the size of the chart.
Pearl's Capabilities 
Besides using statistical methods to guide the parser through
the parsing search space, Pearl also performs other functions
which are crucial to robustly processing unrestricted natural
language text and speech.
⁹Using alternative word hypotheses without incorporating a speech
recognition model would not necessarily produce useful results. Given two
unambiguous nouns at the same position in the sentence, Pearl has no
information with which to disambiguate these words, and will invariably select
the first one entered into the chart. The capability to process alternate word
hypotheses is included to suggest the future implementation of a speech
recognition model in Pearl.
¹⁰We believe that N depends on the perplexity of the grammar used, but for
the string grammar used for our experiments we used N=3. For the purposes
of training, we suggest that a higher N should be used in order to generate
more parses.
Handling Unknown Words Pearl uses a very simple
probabilistic unknown word model to hypothesize categories for
unknown words. When a word is found which is unknown to the
system's lexicon, the word is assumed to be any one of the open
class categories. The lexical probability given a category is the
probability of that category occurring in the training corpus.
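A minimal sketch of this unknown word model follows. The function name, the tag names, and the toy tag sequence are assumptions for illustration; the paper does not list the open class categories it uses.

```python
from collections import Counter


def unknown_word_distribution(training_tags, open_classes):
    """Assign an unknown word every open class category.

    Each category's lexical probability is its relative frequency among
    all part-of-speech tags in the training corpus.
    training_tags: list of tags observed in the training corpus.
    open_classes: the categories an unknown word may take.
    """
    counts = Counter(training_tags)
    total = len(training_tags)
    return {tag: counts[tag] / total for tag in open_classes}


# Hypothetical training tags and open class inventory.
training_tags = ["noun", "verb", "noun", "det",
                 "adjective", "noun", "pronoun", "verb"]
open_classes = ("noun", "verb", "adjective", "adverb")

distribution = unknown_word_distribution(training_tags, open_classes)
print(distribution)
```

Each hypothesized category then competes in the chart like any other lexical entry, with the context-sensitive rule probabilities deciding among them.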
Idiom Processing and Lattice Parsing Since the parsing search
space can be simplified by recognizing idioms, Pearl allows the
input string to include idioms that span more than one word in
the sentence. This is accomplished by viewing the input sentence
as a word lattice instead of a word string. Since idioms tend to
be unambiguous with respect to part-of-speech, they are
generally favored over processing the individual words that make up
the idiom, since the scores of rules containing the words will tend
to be less than 1, while a syntactically appropriate, unambiguous
idiom will have a score of close to 1.
The ability to parse a sentence with multiple word
hypotheses and word boundary hypotheses makes Pearl very useful in
the domain of spoken language processing. By delaying decisions
about word selection but maintaining scoring information from
a speech recognizer, the parser can use grammatical information
in word selection without slowing the speech recognition process.
Because of Pearl's interleaved architecture, one could easily
incorporate scoring information from a speech recognizer into the
set of scoring functions used in the parser. Pearl could also
provide feedback to the speech recognizer about the grammaticality
of fragment hypotheses to guide the recognizer's search.
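One way to picture a word lattice with an idiom hypothesis is as a set of edges between word positions, where an idiom is a single edge spanning several positions. The sketch below is illustrative only: the edge format and the `lattice_paths` helper are assumptions, and a chart parser would insert these edges into its chart rather than enumerate paths explicitly.

```python
def lattice_paths(edges, start, end):
    """Yield every sequence of edges leading from position start to end.

    edges: list of (from_position, to_position, label) tuples in which
    to_position is always greater than from_position.
    """
    if start == end:
        yield []
        return
    for edge in edges:
        if edge[0] == start:
            for rest in lattice_paths(edges, edge[1], end):
                yield [edge] + rest


# Hypothetical lattice for "fruit flies like", with "fruit flies" also
# entered as a single idiom edge spanning positions 0 to 2.
edges = [
    (0, 1, "fruit"),
    (1, 2, "flies"),
    (0, 2, "fruit flies"),
    (2, 3, "like"),
]

paths = [[label for _, _, label in path]
         for path in lattice_paths(edges, 0, 3)]
print(paths)
```

Alternative word hypotheses from a speech recognizer would appear the same way, as competing edges over the same span.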
Partial Parses The main advantage of chart-based parsing
over other parsing algorithms is that a chart-based parser can
recognize well-formed substrings within the input string in the
course of pursuing a complete parse. Pearl takes full advantage
of this characteristic. Once Pearl is given the input sentence, it
awaits instructions as to what type of parse should be attempted
for this input. A standard parser automatically attempts to
produce a sentence (S) spanning the entire input string. However, if
this fails, the semantic interpreter might be able to derive some 
meaning from the sentence if given non-overlapping noun, verb, 
and prepositional phrases. If a sentence fails to parse, requests 
for partial parses of the input string can be made by specifying 
a range which the parse tree should cover and the category (NP, 
VP, etc.). These requests, however, must be initiated by an in- 
telligent semantics processor which can manipulate these partial 
parses. 
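A request for a partial parse, given a category and a range, can be sketched as a lookup over completed chart entries. The entry layout, the function name, and the toy chart below are assumptions made for illustration; they do not describe Pearl's well-formed substring table.

```python
def best_partial_parse(chart, category, start, end):
    """Return the highest-scoring completed entry of the requested
    category covering exactly [start, end), or None if there is none.

    chart: list of dicts with keys "category", "start", "end", "score".
    """
    candidates = [
        entry for entry in chart
        if entry["category"] == category
        and entry["start"] == start
        and entry["end"] == end
    ]
    return max(candidates, key=lambda entry: entry["score"], default=None)


# Hypothetical chart left behind by a failed parse of a 13-word input.
chart = [
    {"category": "S", "start": 0, "end": 11, "score": 0.42},
    {"category": "S", "start": 0, "end": 11, "score": 0.17},
    {"category": "PP", "start": 11, "end": 13, "score": 0.65},
    {"category": "NP", "start": 7, "end": 11, "score": 0.58},
]

sentence_part = best_partial_parse(chart, "S", 0, 11)
tail_part = best_partial_parse(chart, "PP", 11, 13)
print(sentence_part, tail_part)
```

A semantics processor could issue such requests for non-overlapping spans after a full-sentence parse fails.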
Trainability One of the major advantages of the probabilis- 
tic parsers is trainability. The conditional probabilities used by 
Pearl are estimated by using frequencies from a large corpus of 
parsed sentences. The parsed sentences must be parsed using the 
grammar formalism which Pearl will use.
Assuming the grammar is not recursive in an unconstrained
way, the parser can be trained in an unsupervised mode. This
is accomplished by running the parser without the scoring
functions, and generating many parse trees for each sentence.
Previous work¹¹ has demonstrated that the correct information from
these parse trees will be reinforced, while the incorrect
substructure will not. Multiple passes of re-training using frequency data
from the previous pass should cause the frequency tables to
converge to a stable state. This hypothesis has not yet been tested.¹²
¹¹This is an unpublished result, reportedly due to Fujisaki at IBM Japan.
An alternative to completely unsupervised training is to take 
a parsed corpus for any domain of the same language using the 
same grammar, and use the frequency data from that corpus as 
the initial training material for the new corpus. This approach 
should serve only to minimize the number of unsupervised passes
required for the frequency data to converge. 
PARSING THE VOYAGER DOMAIN
In order to test Pearl's capabilities, we performed some simple 
tests to determine if its performance is at least consistent with the 
premises upon which it is based. The test sentences used for this
evaluation are not from the training data on which the parser was
trained. Using Pearl's context-free grammar, which is equivalent
to the context-free backbone of PUNDIT's grammar, these test
sentences produced an average of 64 parses per sentence, with
some sentences producing over 100 parses. 
Overall Parsing Accuracy 
The 40 test sentences were parsed by Pearl and the highest
scoring parse for each sentence was compared to the correct parse
produced by PUNDIT. Of these 40 sentences, Pearl produced parse
trees for 38 of them, and 35 of these parse trees were equivalent
to the correct parse produced by PUNDIT, for an overall
accuracy rate of 88%. Although precise accuracy statistics are not
available for PUNDIT, this result is believed to be comparable to
PUNDIT's performance. However, the result is achieved without
the painfully hand-crafted restriction grammar associated with
PUNDIT's parser.
Many of the test sentences were not difficult to parse for
existing parsers, but most had some grammatical ambiguity which
would produce multiple parses. In fact, on 2 of the 3 sentences
which were incorrectly parsed, Pearl produced the correct parse
as well, but the correct parse did not have the highest score. And
both of these sentences would have been correctly processed if
semantic filtering were used on the top three parses.
Of the two sentences which did not parse, one used passive 
voice, which only occurred in one sentence in the training corpus. 
While the other sentence, 
How can I get from Cafe Sushi to Cambridge
City Hospital by walking
did not produce a parse for the entire word string, it could be
processed using Pearl's partial parsing capability. By accessing the
chart produced by the failed parse attempt, the parser can find
a parsed sentence containing the first eleven words, and a
prepositional phrase containing the final two words. This information
could be used to interpret the sentence properly. 
¹²In fact, for certain grammars, the frequency tables may not converge at
all, or they may converge to zero, with the grammar generating no parses for
the entire corpus. This is a worst-case scenario which we do not anticipate
happening.
Unknown Word Part-of-speech Assignment 
To determine how Pearl handles unknown words, we randomly
selected five words from the test sentences, I, know, tee, describe,
and station, removed their entries from the lexicon, and tried to
parse the 40 sample sentences using the simple unknown word
model previously described.¹³
In this test, the pronoun, I, was assigned the correct
part-of-speech 9 of 10 times it occurred in the test sentences. The nouns,
tee and station, were correctly tagged 4 of 5 times. And the verbs,
know and describe, were correctly tagged 3 of 3 times. While this
accuracy is expected for unknown words in isolation, based on the
accuracy of the part-of-speech tagging model, the performance is
expected to degrade for sequences of unknown words.

Category    Accuracy
pronoun     90%
noun        80%
verb        100%
overall     89%

Figure 1: Performance on Unknown Words in Test Sentences
Prepositional Phrase Attachment 
Accurately determining prepositional phrase attachment in
general is a difficult and well-documented problem. However,
based on experience with several different domains, we have found
prepositional phrase attachment to be a domain-specific phenomenon
for which training can be very helpful. For instance, in the
direction-finding domain, from and to prepositional phrases
generally attach to the preceding verb and not to any noun phrase.
This tendency is captured in the training process for Pearl and
is used to guide the parser to the more likely attachment with
respect to the domain. This does not mean that Pearl will get the
correct parse when the less likely attachment is correct; in fact,
Pearl will invariably get this case wrong. However, based on the
premise that this is the less likely attachment, this will produce
more correct analyses than incorrect. And, using a more
sophisticated statistical model which uses more contextual information,
this performance can likely be improved.
Pearl's performance on prepositional phrase attachment was
very high (54/55 or 98.2% correct). The reason the accuracy rate
is so high is that the direction-finding domain is very consistent
in its use of individual prepositions. The accuracy rate is not
expected to be as high in less consistent domains, although we
expect it to be significantly higher than chance.
Preposition    Accuracy
from           92%
to             100%
on             100%
Overall        98.2%

Figure 2: Accuracy Rate for Prepositional Phrase Attachment, by
Preposition

¹³The unknown word model used in this test was augmented to include
closed class categories as well as open class, since the words removed from
the lexicon may have included (in fact did include) closed class words.
Search Space Reduction
One claim of Pearl, and of probabilistic parsers in general, is
that probabilities can help guide a parser through the immense
search space produced by ambiguous grammars. Since, without
probabilities, the test sentences produced an average of 64 parses
per sentence, Pearl unquestionably has reduced the space of
possibilities by only producing 3 parses per sentence while maintaining
high accuracy. However, it is interesting to see how Pearl's
scoring function performs against previously proposed scoring
functions. The four scoring functions compared include a simple
probabilistic CFG, where each context-free rule is assigned a fixed
likelihood based on training, a CFG using probabilistic conditioning
on the parent rule only, which is similar to the scoring function
used by Chitrao and Grishman[3], and two versions of the CFG
with CSP model, one using the geometric mean of raw theory
scores and the other using the product of these raw scores. Using
a simple probabilistic CFG model, the parser produced a much
lower accuracy rate (35%). The parental conditioning brought
this rate up to 50%, and the trigram conditioning brought this
level up to 88%. The search space for CFG with CSP was 4 to 5
times smaller than that of the simple probabilistic CFG.

Technique                Edges    Accuracy
P-CFG                    929      35%
CFG with Parent Cond.    883      50%
CFG with CSP             210      88%
Prod. of Scores          657      60%

Figure 3: Search Space Reduction and Accuracy for Four Probabilistic
Models
FUTURE WORK 
The Pearl parser takes advantage of domain-dependent
information to select the most appropriate interpretation of an input.
However, the statistical measure used to disambiguate these
interpretations is sensitive to certain attributes of the grammatical
formalism used, as well as to the part-of-speech categories used to
label lexical entries. All of the experiments performed on Pearl
thus far have been using one grammar, one part-of-speech tag
set, and one domain (because of availability constraints). Future
experiments are planned to evaluate Pearl's performance on
different domains, as well as on a general corpus of English, and on
different grammars, including a grammar derived from a manually
parsed corpus.
Specifically, we plan to retrain Pearl on a corpus of
terrorist-related messages from the Message Understanding Conference
(MUC). Using this material, we will attempt two very different
experiments. The first experiment will be similar to the
one performed on the Voyager data. Using a corpus of correctly
parsed MUC sentences from SRI's Tacitus system, we will derive
a context-free grammar and extract training statistics for Pearl's
models. Since the MUC sentences exhibit many more difficulties
than Voyager, including 50-word sentences, punctuation, no
sentence markers, and typographical errors, we expect Pearl to
require significant re-engineering to handle this experiment.
The second experiment on the MUC corpus involves extracting
a grammar and training statistics from a hand-parsed corpus.
When the University of Pennsylvania's Treebank project[2] makes
a hand-parsed version of the MUC training material available to
the DARPA community, we will extract a context-free grammar
from these parse trees, and retrain Pearl on this material. This
experiment is even more interesting because, if successful, it will
show that Pearl provides an alternative to the hand-pruning of
grammars to cover specific domains. If a hand-parsed corpus
can provide a covering grammar which can be used to accurately
parse a particular domain, porting natural language applications
to new domains will be greatly facilitated.
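The paper does not say how the grammar will be extracted from the parse trees. A common approach, assumed here purely for illustration, is to read one context-free rule off every internal node and estimate each rule's probability by relative frequency. The tree encoding (nested tuples), the function names, and the two toy trees are all hypothetical.

```python
from collections import Counter

def rules_of(tree):
    """Yield (lhs, rhs) for each phrasal node of a tree.

    A tree is (label, child, ...); a preterminal is (tag, word) with a str word.
    Lexical (tag -> word) productions are skipped in this sketch.
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return  # preterminal node: no phrasal rule to record
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from rules_of(child)

def estimate_rule_probs(trees):
    """Relative-frequency estimate of P(rhs | lhs) over all extracted rules."""
    counts = Counter(rule for tree in trees for rule in rules_of(tree))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Two toy hand-parsed sentences (invented for the example).
TREES = [
    ("S", ("NP", ("PRO", "I")),
          ("VP", ("V", "saw"), ("NP", ("DET", "the"), ("N", "bank")))),
    ("S", ("NP", ("DET", "the"), ("N", "bank")),
          ("VP", ("V", "closed"))),
]

PROBS = estimate_rule_probs(TREES)
for (lhs, rhs), p in sorted(PROBS.items()):
    print(lhs, "->", " ".join(rhs), round(p, 3))
```

The set of extracted rules serves as the covering grammar, and the counts are the kind of raw material from which training statistics could be derived. Pearl's context-sensitive conditioning would need additional counts that this sketch does not collect.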
CONCLUSION 
The probabilistic parser which we have described provides a
platform for exploiting the useful information made available by
statistical models in a manner which is consistent with existing
grammar formalisms and parser designs. Pearl can be trained to
use any context-free grammar, accompanied by the appropriate
training material. Moreover, the parsing algorithm is very similar to a
standard bottom-up algorithm, with the exception of using theory
scores to order the search.
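The idea of ordering a bottom-up search by score can be illustrated with a small agenda-driven chart parser. The sketch below is not Pearl. It assumes a toy binary grammar, scores an edge by the plain product of its rule and lexical probabilities rather than by Pearl's context-sensitive model, and omits top-down prediction. The names best_first_parse, LEXICON, and RULES, and all probabilities, are hypothetical.

```python
import heapq

def best_first_parse(words, lexicon, rules, goal="S"):
    """Bottom-up chart parsing with a score-ordered agenda.

    lexicon: word -> list of (category, probability)
    rules:   (left_cat, right_cat) -> list of (parent_cat, probability)
    Returns (score of the first goal edge removed from the agenda,
    number of edges expanded), or (None, number expanded) if no parse.
    """
    agenda = []  # heapq is a min-heap, so scores are stored negated
    chart = {}   # (category, start, end) -> score of the expanded edge
    for i, word in enumerate(words):
        for cat, p in lexicon.get(word, []):
            heapq.heappush(agenda, (-p, cat, i, i + 1))
    expanded = 0
    while agenda:
        neg_score, cat, start, end = heapq.heappop(agenda)
        if (cat, start, end) in chart:
            continue  # an equal or higher scoring copy was already expanded
        score = -neg_score
        chart[(cat, start, end)] = score
        expanded += 1
        if (cat, start, end) == (goal, 0, len(words)):
            return score, expanded
        for (other, o_start, o_end), o_score in list(chart.items()):
            if o_start == end:  # other edge lies immediately to the right
                for parent, p in rules.get((cat, other), []):
                    heapq.heappush(agenda, (-(p * score * o_score), parent, start, o_end))
            if o_end == start:  # other edge lies immediately to the left
                for parent, p in rules.get((other, cat), []):
                    heapq.heappush(agenda, (-(p * o_score * score), parent, o_start, end))
    return None, expanded

# Toy lexicon and grammar (invented for the example).
LEXICON = {"I": [("NP", 1.0)], "saw": [("V", 0.9), ("N", 0.1)], "her": [("NP", 1.0)]}
RULES = {("NP", "VP"): [("S", 1.0)], ("V", "NP"): [("VP", 1.0)]}

score, expanded = best_first_parse(["I", "saw", "her"], LEXICON, RULES)
print(score, expanded)
```

In this toy run the low-probability noun reading of "saw" is never taken off the agenda, because the goal edge is reached first. This is the sense in which score ordering can shrink the search space relative to exhaustive bottom-up parsing.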
In experiments on the Voyager direction-finding domain, Pearl,
using only a context-free grammar and statistical models,
performed at least as well as PUNDIT's parser, which includes
hand-generated restrictions. In the future, we hope to demonstrate
similar performance on more difficult domains and using manually
parsed corpora.
REFERENCES 
[1] Ayuso, D., Bobrow, R., et al. 1990. Towards Understanding Text
with a Very Large Vocabulary. In Proceedings of the June 1990
DARPA Speech and Natural Language Workshop. Hidden Valley,
Pennsylvania.
[2] Brill, E., Magerman, D., Marcus, M., and Santorini, B. 1990.
Deducing Linguistic Structure from the Statistics of Large Corpora.
In Proceedings of the June 1990 DARPA Speech and Natural Language
Workshop. Hidden Valley, Pennsylvania.
[3] Chitrao, M. and Grishman, R. 1990. Statistical Parsing of Messages.
In Proceedings of the June 1990 DARPA Speech and Natural
Language Workshop. Hidden Valley, Pennsylvania.
[4] Church, K. 1988. A Stochastic Parts Program and Noun Phrase
Parser for Unrestricted Text. In Proceedings of the Second
Conference on Applied Natural Language Processing. Austin, Texas.
[5] Church, K. and Gale, W. 1990. Enhanced Good-Turing and Cat-Cal:
Two New Methods for Estimating Probabilities of English
Bigrams. Computers, Speech and Language.
[6] Gale, W. A. and Church, K. 1990. Poor Estimates of Context are
Worse than None. In Proceedings of the June 1990 DARPA Speech
and Natural Language Workshop. Hidden Valley, Pennsylvania.
[7] Hindle, D. 1988. Acquiring a Noun Classification from
Predicate-Argument Structures. Bell Laboratories.
[8] Hindle, D. and Rooth, M. 1990. Structural Ambiguity and Lexical
Relations. In Proceedings of the June 1990 DARPA Speech and
Natural Language Workshop. Hidden Valley, Pennsylvania.
[9] Jelinek, F. 1985. Self-organizing Language Modeling for Speech
Recognition. IBM Report.
[10] Katz, S. M. 1987. Estimation of Probabilities from Sparse Data for
the Language Model Component of a Speech Recognizer. IEEE
Transactions on Acoustics, Speech, and Signal Processing, Vol.
ASSP-35, No. 3.
[11] Sharman, R. A., Jelinek, F., and Mercer, R. 1990. In Proceedings
of the June 1990 DARPA Speech and Natural Language Workshop.
Hidden Valley, Pennsylvania.
