<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1044"> <Title>Parsing the Voyager Domain Using Pearl</Title> <Section position="3" start_page="231" end_page="232" type="metho"> <SectionTitle> USING STATISTICS TO PARSE </SectionTitle> <Paragraph position="0"> Recent work involving context-free and context-sensitive probabilistic grammars provides little hope for the success of processing unrestricted text using probabilistic techniques. Works by Chitrao and Grishman[3] and by Sharman, Jelinek, and Mercer[11] exhibit accuracy rates lower than 50% using supervised training.</Paragraph> <Paragraph position="1"> Supervised training for probabilistic CFGs requires parsed corpora, which are very costly in time and manpower[2].</Paragraph> <Paragraph position="2"> In our investigations, we have made two observations which attempt to explain the lackluster performance of statistical parsing techniques: * Simple probabilistic CFGs provide general information about how likely a construct is to appear anywhere in a sample of a language. This average likelihood is often a poor estimate of probability.</Paragraph> <Paragraph position="3"> * Parsing algorithms which accumulate probabilities of parse theories by simply multiplying them over-penalize infrequent constructs.</Paragraph> <Paragraph position="4"> Pearl avoids the first pitfall by using a context-sensitive conditional probability CFG, where the context of a theory is determined by the theories which predicted it and the part-of-speech sequences in the input sentence. To address the second issue, Pearl scores each theory by using the geometric mean of the contextual conditional probabilities of all of the theories which have contributed to that theory. 
This is equivalent to using the average of the logs of these probabilities.</Paragraph> <Paragraph position="5"> CFG with context-sensitive conditional probabilities In a very large parsed corpus of English text, one finds that the most frequently occurring noun phrase structure in the text is a noun phrase containing a determiner followed by a noun.</Paragraph> <Paragraph position="6"> Simple probabilistic CFGs dictate that, given this information, &quot;determiner noun&quot; should be the most likely interpretation of a noun phrase.</Paragraph> <Paragraph position="7"> Now, consider only those noun phrases which occur as subjects of a sentence. In a given corpus, you might find that pronouns occur just as frequently as &quot;determiner noun&quot;s in the subject position. This type of information can easily be captured by conditional probabilities.</Paragraph> <Paragraph position="8"> Finally, assume that the sentence begins with a pronoun followed by a verb. In this case, it is quite clear that, while you can probably concoct a sentence which fits this description and does not have a pronoun for a subject, the first theory which you should pursue is one which makes this hypothesis.</Paragraph> <Paragraph position="9"> The context-sensitive conditional probabilities which Pearl uses take into account the immediate parent of a theory 4 and the part-of-speech trigram centered at the beginning of the theory.</Paragraph> <Paragraph position="10"> For example, consider the sentence: My first love was named Pearl.</Paragraph> <Paragraph position="11"> (no subliminal propaganda intended) A theory which tries to interpret &quot;love&quot; as a verb will be scored based on the part-of-speech trigram &quot;adjective verb verb&quot; and the parent theory, probably &quot;S → NP VP.&quot; A theory which interprets &quot;love&quot; as a noun will be scored based on the trigram &quot;adjective noun verb.&quot; Although lexical probabilities favor &quot;love&quot; as a 
verb, the conditional probabilities will heavily favor &quot;love&quot; as a noun in this context. 5 Using the Geometric Mean of Theory Scores According to probability theory, the likelihood of two independent events occurring at the same time is the product of their individual probabilities. Previous statistical parsing techniques apply this definition to the cooccurrence of two theories in a parse, and claim that the likelihood of the two theories being correct is the product of the probabilities of the two theories.</Paragraph> <Paragraph position="12"> This application of probability theory ignores two vital observations about the domain of statistical parsing: * Two constructs occurring in the same sentence are not necessarily independent (and frequently are not). If the independence assumption is violated, then the product of individual probabilities has no meaning with respect to the joint probability of two events.</Paragraph> <Paragraph position="13"> * Since statistical parsing suffers from sparse data, probability estimates of low frequency events will usually be inaccurate. Extreme underestimates of the likelihood of low frequency events will produce misleading joint probability estimates.</Paragraph> <Paragraph position="14"> 4 The parent of a theory is defined as a theory with a CF rule which contains the left-hand side of the theory. For instance, if &quot;S → NP VP&quot; and &quot;NP → det n&quot; are two grammar rules, the first rule can be a parent of the second, since the left-hand side of the second, &quot;NP&quot;, occurs in the right-hand side of the first rule.</Paragraph> <Paragraph position="15"> 5 In fact, the part-of-speech tagging model which is also used in Pearl will heavily favor &quot;love&quot; as a noun. 
We ignore this behavior to demonstrate the benefits of the trigram conditioning.</Paragraph> <Paragraph position="16"> From these observations, we have determined that estimating joint probabilities of theories using individual probabilities is too difficult with the available data. We have found that the geometric mean of these probability estimates provides an accurate assessment of a theory's viability.</Paragraph> <Paragraph position="17"> The Actual Theory Scoring Function In a departure from standard practice, and perhaps against better judgment, we will include a precise description of the theory scoring function used by Pearl. This scoring function tries to solve some of the problems noted in previous attempts at probabilistic parsing[3][11]: * Theory scores should not depend on the length of the string which the theory spans.</Paragraph> <Paragraph position="18"> * Sparse data (zero-frequency events) and even zero-probability events do occur, and should not result in zero-scoring theories. * Theory scores should not discriminate against unlikely constructs when the context predicts them.</Paragraph> <Paragraph position="19"> In this discussion, a theory is defined to be a partial or complete syntactic interpretation of a word string, or, simply, a parse tree. The raw score of a theory, θ, is calculated by taking the product of the conditional probability of that theory's CFG rule given the context, where context is a part-of-speech trigram centered at the beginning of the theory and a parent theory's rule, and the score of the contextual trigram:</Paragraph> <Paragraph position="21"> Here, the score of a trigram is the product of the mutual information of the part-of-speech trigram, 6 p0 p1 p2, and the lexical probability of the word at the location of p1 being assigned that part-of-speech p1. 7 In the case of ambiguity (part-of-speech ambiguity or multiple parent theories), the maximum value of this product is used. 
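</Paragraph> <Paragraph> As a rough illustration of the scoring described in this section, the following Python sketch computes a raw score and then combines raw scores with a geometric mean. It is not Pearl's implementation: the function names and all numeric values are hypothetical, and every raw score is assumed to be strictly positive.

```python
import math


def raw_score(rule_prob, trigram_score):
    """Raw score of one theory: the conditional probability of its rule
    (given the trigram and parent rule) times the trigram score."""
    return rule_prob * trigram_score


def theory_score(raw_scores):
    """Geometric mean of the raw scores of all theories in a subtree,
    computed as the exponential of the average log score."""
    if not raw_scores:
        raise ValueError("a theory contains at least one raw score")
    log_sum = sum(math.log(s) for s in raw_scores)
    return math.exp(log_sum / len(raw_scores))


def product_score(raw_scores):
    """Plain product of the raw scores, shown only for comparison."""
    return math.prod(raw_scores)


# Hypothetical raw scores: one short theory and one longer theory
# built from equally likely rules.
short_theory = [0.5]
long_theory = [0.5, 0.5, 0.5, 0.5]

print(theory_score(short_theory), theory_score(long_theory))
print(product_score(short_theory), product_score(long_theory))
```

With these made-up numbers, the geometric mean gives both theories the same score regardless of length, whereas the product penalizes the longer theory.</Paragraph> <Paragraph position="21">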
The score of a partial theory or a complete theory is the geometric mean of the raw scores of all of the theories which are contained in that theory.</Paragraph> <Paragraph position="22"> Theory Length Independence This scoring function, although heuristic in derivation, provides a method for evaluating the value of a theory, regardless of its length. When a rule is first predicted (Earley-style), its score is just its raw score, which represents how much the context predicts it. However, when the parse process hypothesizes interpretations of the sentence which reinforce this theory, the geometric mean of all of the raw scores of the rule's subtree is used, representing the overall likelihood of the theory given the context of the sentence.</Paragraph> <Paragraph position="23"> Low-frequency Events Although some statistical natural language applications employ backing-off estimation techniques[10][5] to handle low-frequency events, Pearl uses a very simple estimation technique, reluctantly attributed to Church[6]. This technique estimates the probability of an event by adding 0.5 to every frequency count. 8</Paragraph> <Paragraph position="24"> Low-scoring theories will be predicted by the Earley-style parser. And, if no other hypothesis is suggested, these theories will be pursued. If a high scoring theory advances a theory with a very low raw score, the resulting theory's score will be the geometric mean of all of the raw scores of theories contained in that theory, and thus will be much higher than the low-scoring theory's score.</Paragraph> <Paragraph position="25"> Example of Scoring Function As an example of how the conditional-probability-based scoring function handles ambiguity, consider the sentence Fruit flies like a banana.</Paragraph> <Paragraph position="26"> in the domain of insect studies. 
Lexical probabilities should indicate that the word &quot;flies&quot; is more likely to be a plural noun than a tensed verb. This information is incorporated in the trigram scores. However, when the interpretation S → . NP VP is proposed, two possible NPs will be parsed,</Paragraph> <Paragraph position="28"> Since this sentence is syntactically ambiguous, if the first hypothesis is tested first, the parser will interpret this sentence incorrectly. However, this will not happen in this domain. Since &quot;fruit flies&quot; is a common idiom in insect studies, the score of its trigram, noun noun verb, will be much greater than the score of the trigram, noun verb verb. Thus, not only will the lexical probability of the word &quot;flies/verb&quot; be lower than that of &quot;flies/noun,&quot; but also the raw score of &quot;NP → noun (fruit)&quot; will be lower than that of &quot;NP → noun noun (fruit flies),&quot; because of the differential between the trigram scores.</Paragraph> <Paragraph position="29"> So, &quot;NP → noun noun&quot; will be used first to advance the &quot;S → . NP VP&quot; rule. Further, even if the parser advances both NP hypotheses, the &quot;S → NP . VP&quot; rule using &quot;NP → noun noun&quot; will have a higher score than the &quot;S → NP . VP&quot; rule using &quot;NP → noun.&quot;</Paragraph> </Section> <Section position="4" start_page="232" end_page="234" type="metho"> <SectionTitle> INTERLEAVED ARCHITECTURE IN PEARL </SectionTitle> <Paragraph position="0"> The interleaved architecture implemented in Pearl provides many advantages over the traditional pipeline architecture, but it also introduces certain risks. 
Decisions about word and part-of-speech ambiguity can be delayed until syntactic processing can disambiguate them. 8 We are not deliberately avoiding using all probability estimation techniques, only those backing-off techniques which use independence assumptions that frequently provide misleading information when applied to natural language.</Paragraph> <Paragraph position="1"> And, using the appropriate score combination functions, the scoring of ambiguous choices can direct the parser towards the most likely interpretation efficiently.</Paragraph> <Paragraph position="2"> However, with these delayed decisions comes a vastly enlarged search space. The effectiveness of the parser depends on a majority of the theories having very low scores based on either unlikely syntactic structures or low scoring input (such as low scores from a speech recognizer or low lexical probability). In experiments we have performed, this has been the case.</Paragraph> <Section position="1" start_page="233" end_page="233" type="sub_section"> <SectionTitle> The Parsing Algorithm </SectionTitle> <Paragraph position="0"> Pearl is an agenda-based time-asynchronous bottom-up chart parser with Earley-type top-down prediction. The significant difference between Pearl and non-probabilistic bottom-up parsers is that instead of completely generating all grammatical interpretations of a word string, Pearl uses an agenda to order the incomplete theories in its chart to determine which theory to advance next. The agenda is sorted by the value of the theory scoring function described above. Instead of expanding all theories in the chart, Pearl pursues the highest-scoring incomplete theories in the chart, advancing up to N theories at each pass.</Paragraph> <Paragraph position="1"> However, Pearl parses without pruning. Although it is only advancing N incomplete theories at each pass, it retains the lower scoring theories in its agenda. 
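</Paragraph> <Paragraph> The agenda discipline described here (advance at most N of the best incomplete theories per pass, and keep everything else for later) can be sketched in Python with a priority queue. This is only an illustration under stated assumptions: the use of heapq, the function name advance_pass, and the theory labels and scores are all hypothetical and are not taken from Pearl.

```python
import heapq


def advance_pass(agenda, n):
    """Pop up to n of the highest-scoring theories from the agenda.

    The agenda is a heap of (negated score, label) pairs, so the
    smallest heap entry is the highest-scoring theory. Theories that
    are not popped simply stay on the agenda; nothing is pruned.
    """
    advanced = []
    while agenda and len(advanced) != n:
        neg_score, label = heapq.heappop(agenda)
        advanced.append((label, -neg_score))
    return advanced


# Hypothetical incomplete theories with made-up scores.
agenda = []
for label, score in [
    ("det-noun NP", 0.6),
    ("pronoun NP", 0.3),
    ("noun-noun NP", 0.05),
    ("verb-NP VP", 0.4),
]:
    heapq.heappush(agenda, (-score, label))

first = advance_pass(agenda, 3)   # the three best theories
second = advance_pass(agenda, 3)  # the low scorer is still available
print(first)
print(second)
```

In this sketch the lowest-scoring theory is not advanced on the first pass, but because it was never discarded it is picked up on the second.</Paragraph> <Paragraph position="1">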
If the higher scoring theories do not generate viable alternatives, the lower scoring theories may be used on subsequent passes.</Paragraph> <Paragraph position="2"> The parsing algorithm begins with an input word lattice, which describes the input sentence and includes possible idiom hypotheses and may include alternative word hypotheses. 9 Lexical rules for the input word lattice are inserted into the parser's chart. Using Earley-type prediction, a sentence (S) is predicted at the beginning of the input, and all of the theories which are predicted by that initial sentence are inserted into the chart. These incomplete theories are scored according to the context-sensitive conditional probabilities and the trigram part-of-speech model. The incomplete theories are tested in order by score, until N theories are advanced. 10 The resulting advanced theories are scored and predicted for, and the new incomplete predicted theories are scored and added to the chart. This process continues until a complete parse tree is determined, or until the parser decides, heuristically, that it should not continue. The heuristics we used for determining that no parse can be found for an input are based on the highest scoring incomplete theory in the chart, the number of passes the parser has made, and the size of the chart.</Paragraph> </Section> <Section position="2" start_page="233" end_page="234" type="sub_section"> <SectionTitle> Pearl's Capabilities </SectionTitle> <Paragraph position="0"> Besides using statistical methods to guide the parser through the parsing search space, Pearl also performs other functions which are crucial to robustly processing unrestricted natural language text and speech.</Paragraph> <Paragraph position="1"> 9 Using alternative word hypotheses without incorporating a speech recognition model would not necessarily produce useful results. Given two unambiguous nouns at the same position in the sentence, Pearl has no information with which to disambiguate these words, and will invariably select the first one entered into the chart. 
The capability to process alternate word hypotheses is included to suggest the future implementation of a speech recognition model in Pearl.</Paragraph> <Paragraph position="2"> 10 We believe that N depends on the perplexity of the grammar used, but for the string grammar used for our experiments we used N=3. For the purposes of training, we suggest that a higher N should be used in order to generate more parses.</Paragraph> <Paragraph position="3"> Handling Unknown Words Pearl uses a very simple probabilistic unknown word model to hypothesize categories for unknown words. When a word is found which is unknown to the system's lexicon, the word is assumed to be any one of the open class categories. The lexical probability given a category is the probability of that category occurring in the training corpus.</Paragraph> <Paragraph position="4"> Idiom Processing and Lattice Parsing Since the parsing search space can be simplified by recognizing idioms, Pearl allows the input string to include idioms that span more than one word in the sentence. This is accomplished by viewing the input sentence as a word lattice instead of a word string. Since idioms tend to be unambiguous with respect to part-of-speech, they are generally favored over processing the individual words that make up the idiom, since the scores of rules containing the words will tend to be less than 1, while a syntactically appropriate, unambiguous idiom will have a score of close to 1.</Paragraph> <Paragraph position="5"> The ability to parse a sentence with multiple word hypotheses and word boundary hypotheses makes Pearl very useful in the domain of spoken language processing. 
By delaying decisions about word selection but maintaining scoring information from a speech recognizer, the parser can use grammatical information in word selection without slowing the speech recognition process.</Paragraph> <Paragraph position="6"> Because of Pearl's interleaved architecture, one could easily incorporate scoring information from a speech recognizer into the set of scoring functions used in the parser. Pearl could also provide feedback to the speech recognizer about the grammaticality of fragment hypotheses to guide the recognizer's search.</Paragraph> <Paragraph position="7"> Partial Parses The main advantage of chart-based parsing over other parsing algorithms is that a chart-based parser can recognize well-formed substrings within the input string in the course of pursuing a complete parse. Pearl takes full advantage of this characteristic. Once Pearl is given the input sentence, it awaits instructions as to what type of parse should be attempted for this input. A standard parser automatically attempts to produce a sentence (S) spanning the entire input string. However, if this fails, the semantic interpreter might be able to derive some meaning from the sentence if given non-overlapping noun, verb, and prepositional phrases. If a sentence fails to parse, requests for partial parses of the input string can be made by specifying a range which the parse tree should cover and the category (NP, VP, etc.). These requests, however, must be initiated by an intelligent semantics processor which can manipulate these partial parses.</Paragraph> <Paragraph position="8"> Trainability One of the major advantages of the probabilistic parsers is trainability. The conditional probabilities used by Pearl are estimated by using frequencies from a large corpus of parsed sentences. 
The parsed sentences must be parsed using the grammar formalism which Pearl will use.</Paragraph> <Paragraph position="9"> Assuming the grammar is not recursive in an unconstrained way, the parser can be trained in an unsupervised mode. This is accomplished by running the parser without the scoring functions, and generating many parse trees for each sentence. Previous work 11 has demonstrated that the correct information from these parse trees will be reinforced, while the incorrect substructure will not. Multiple passes of re-training using frequency data from the previous pass should cause the frequency tables to converge to a stable state. This hypothesis has not yet been tested. 12 An alternative to completely unsupervised training is to take a parsed corpus for any domain of the same language using the same grammar, and use the frequency data from that corpus as the initial training material for the new corpus. This approach should serve only to minimize the number of unsupervised passes required for the frequency data to converge. 11 This is an unpublished result, reportedly due to Fujisaki at IBM Japan.</Paragraph> </Section> </Section> <Section position="5" start_page="234" end_page="235" type="metho"> <SectionTitle> PARSING THE VOYAGER DOMAIN </SectionTitle> <Paragraph position="0"> In order to test Pearl's capabilities, we performed some simple tests to determine if its performance is at least consistent with the premises upon which it is based. The test sentences used for this evaluation are not from the training data on which the parser was trained. Using Pearl's context-free grammar, which is equivalent to the context-free backbone of PUNDIT's grammar, these test sentences produced an average of 64 parses per sentence, with some sentences producing over 100 parses.</Paragraph> <Paragraph position="1"> The 40 test sentences were parsed by Pearl and the highest scoring parse for each sentence was compared to the correct parse produced by PUNDIT. 
Of these 40 sentences, Pearl produced parse trees for 38 of them, and 35 of these parse trees were equivalent to the correct parse produced by PUNDIT, for an overall accuracy rate of 88%. Although precise accuracy statistics are not available for PUNDIT, this result is believed to be comparable to PUNDIT's performance. However, the result is achieved without the painfully hand-crafted restriction grammar associated with PUNDIT's parser.</Paragraph> <Paragraph position="2"> Many of the test sentences were not difficult to parse for existing parsers, but most had some grammatical ambiguity which would produce multiple parses. In fact, on 2 of the 3 sentences which were incorrectly parsed, Pearl produced the correct parse as well, but the correct parse did not have the highest score. And both of these sentences would have been correctly processed if semantic filtering were used on the top three parses.</Paragraph> <Paragraph position="3"> Of the two sentences which did not parse, one used passive voice, which only occurred in one sentence in the training corpus. While the other sentence, How can I get from cafe sushi to Cambridge City Hospital by walking did not produce a parse for the entire word string, it could be processed using Pearl's partial parsing capability. By accessing the chart produced by the failed parse attempt, the parser can find a parsed sentence containing the first eleven words, and a prepositional phrase containing the final two words. This information could be used to interpret the sentence properly.</Paragraph> <Paragraph position="4"> 12 In fact, for certain grammars, the frequency tables may not converge at all, or they may converge to zero, with the grammar generating no parses for the entire corpus. 
This is a worst-case scenario which we do not anticipate happening.</Paragraph> <Section position="1" start_page="234" end_page="235" type="sub_section"> <SectionTitle> Unknown Word Part-of-speech Assignment </SectionTitle> <Paragraph position="0"> To determine how Pearl handles unknown words, we randomly selected five words from the test sentences, I, know, tee, describe, and station, removed their entries from the lexicon, and tried to parse the 40 sample sentences using the simple unknown word model previously described. 13 In this test, the pronoun, I, was assigned the correct part-of-speech 9 of 10 times it occurred in the test sentences. The nouns, tee and station, were correctly tagged 4 of 5 times. And the verbs, know and describe, were correctly tagged 3 of 3 times. While this accuracy is expected for unknown words in isolation, based on the accuracy of the part-of-speech tagging model, the performance is expected to degrade for sequences of unknown words.</Paragraph> <Paragraph position="1"> Accurately determining prepositional phrase attachment in general is a difficult and well-documented problem. However, based on experience with several different domains, we have found prepositional phrase attachment to be a domain-specific phenomenon for which training can be very helpful. For instance, in the direction-finding domain, from and to prepositional phrases generally attach to the preceding verb and not to any noun phrase. This tendency is captured in the training process for Pearl and is used to guide the parser to the more likely attachment with respect to the domain. This does not mean that Pearl will get the correct parse when the less likely attachment is correct; in fact, Pearl will invariably get this case wrong. However, based on the premise that this is the less likely attachment, this will produce more correct analyses than incorrect. 
And, using a more sophisticated statistical model which uses more contextual information, this performance can likely be improved.</Paragraph> <Paragraph position="2"> Pearl's performance on prepositional phrase attachment was very high (54/55 or 98.2% correct). The reason the accuracy rate is so high is that the direction-finding domain is very consistent in its use of individual prepositions. The accuracy rate is not expected to be as high in less consistent domains, although we expect it to be significantly higher than chance.</Paragraph> <Paragraph position="3"> Search Space Reduction One claim of Pearl, and of probabilistic parsers in general, is that probabilities can help guide a parser through the immense search space produced by ambiguous grammars. Since, without probabilities, the test sentences produced an average of 64 parses per sentence, Pearl unquestionably has reduced the space of possibilities by only producing 3 parses per sentence while maintaining high accuracy. 13 The unknown word model used in this test was augmented to include closed class categories as well as open class, since the words removed from the lexicon may have included (in fact did include) closed class words.</Paragraph> </Section> </Section> <Section position="6" start_page="235" end_page="235" type="metho"> <SectionTitle> Accuracy Rate for Prepositional Phrase Attachment, by </SectionTitle> <Paragraph position="0"> However, it is interesting to see how Pearl's scoring function performs against previously proposed scoring functions. 
The four scoring functions compared include a simple probabilistic CFG, where each context-free rule is assigned a fixed likelihood based on training, a CFG using probabilistic conditioning on the parent rule only, which is similar to the scoring function used by Chitrao and Grishman[3], and two versions of the CFG with CSP model, one using the geometric mean of raw theory scores and the other using the product of these raw scores.</Paragraph> <Section position="1" start_page="235" end_page="235" type="sub_section"> <SectionTitle> Models </SectionTitle> <Paragraph position="0"> Using a simple probabilistic CFG model, the parser produced a much lower accuracy rate (35%). The parental conditioning brought this rate up to 50%, and the trigram conditioning brought this level up to 88%. The search space for CFG with CSP was 4 to 5 times smaller than that of the simple probabilistic CFG.</Paragraph> </Section> </Section> </Paper>