File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/01/n01-1021_metho.xml
Size: 22,915 bytes
Last Modified: 2025-10-06 14:07:32
<?xml version="1.0" standalone="yes"?> <Paper uid="N01-1021"> <Title>A Probabilistic Earley Parser as a Psycholinguistic Model</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Language models </SectionTitle> <Paragraph position="0"> Stolcke's parsing algorithm was initially applied as a component of an automatic speech recognition system. In speech recognition, one is often interested in the probability that some word will follow, given that a sequence of words has been seen. Given some lexicon of all possible words, a language model assigns a probability to every string of words from the lexicon. This de nes a probabilistic language (Grenander, 1967) (Booth and Thompson, 1973) (Soule, 1974) (Wetherell, 1980).</Paragraph> <Paragraph position="1"> A language model helps a speech recognizer focus its attention on words that are likely continuations of what it has recognized so far. This is typically done using conditional probabilities of the form . Given some nite lexicon, the probability of each possible outcome for W n can be estimated using that outcome's relative frequency in a sample.</Paragraph> <Paragraph position="2"> Traditional language models used for speech are n-gram models, in which n [?] 1 words of history serve as the basis for predicting the nth word. Such models do not have any notion of hierarchical syntactic structure, except as might be visible through an n-word window.</Paragraph> <Paragraph position="3"> Aware that the n-gram obscures many linguistically-signi cant distinctions (Chomsky, 1956, section 2.3), many speech researchers (Jelinek and La erty, 1991) sought to incorporate hierarchical phrase structure into language modeling (see (Stolcke, 1997)) although it was not until the late 1990s that such models were able to signi cantly improve on 3-grams (Chelba and Jelinek, 1998).</Paragraph> <Paragraph position="4"> Stolcke's probabilistic Earley parser is one way to use hierarchical phrase structure in a language model. The grammar it parses is a probabilistic context-free phrase structure grammar (PCFG), e.g.</Paragraph> <Paragraph position="6"> see (Charniak, 1993, chapter 5) Such a grammar de nes a probabilistic language in terms of a stochastic process that rewrites strings of grammar symbols according to the probabilities on the rules. Then each sentence in the language of the grammar has a probability equal to the product of the probabilities of all the rules used to generate it. This multiplication embodies the assumption that rule choices are independent. Sentences with more than one derivation accumulate the probability of all derivations that generate them. Through recursion, in nite languages can be speci ed; an important mathematical question in this context is whether or not such a grammar is consistent { whether it assigns some probability to in nite derivations, or whether all derivations are guaranteed to terminate.</Paragraph> <Paragraph position="7"> Even if a PCFG is consistent, it would appear to have another drawback: it only assigns probabilities to complete sentences of its language. This is as inconvenient for speech recognition as it is for modeling reading times.</Paragraph> <Paragraph position="8"> Stolcke's algorithm solves this problem by computing, at each word of an input string, the pre x probability. This is the sum of the probabilities of all derivations whose yield is compatible with the string seen so far. 
</Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Earley parsing </SectionTitle> <Paragraph position="0"> The computation of prefix probabilities takes advantage of the design of the Earley parser (Earley, 1970), which by itself is not probabilistic. In this section I provide a brief overview of Stolcke's algorithm, but the original paper should be consulted for full details (Stolcke, 1995).</Paragraph> <Paragraph position="1"> Earley parsers work top-down, and propagate predictions confirmed by the input string back up through a set of states representing hypotheses the parser is entertaining about the structure of the sentence. The global state of the parser at any one time is completely defined by this collection of states, a chart, which defines a tree set. A state is a record that specifies: the current input string position processed so far; a grammar rule; a "dot position" in the rule representing how much of the rule has already been recognized; and the leftmost edge of the substring this rule generates. An Earley parser has three main functions, predict, scan and complete, each of which can enter new states into the chart. Starting from a dummy start state in which the dot is just to the left of the grammar's start symbol, predict adds new states for rules which could expand the start symbol. In these new predicted states, the dot is at the far left-hand side of each rule. After prediction, scan checks the input string: if the symbol immediately following the dot matches the current word in the input, then the dot is moved rightward, across the symbol. The parser has "scanned" this word. Finally, complete propagates this change throughout the chart. If, as a result of scanning, any states are now present in which the dot is at the end of a rule, then the left-hand side of that rule has been recognized, and any other states having a dot immediately in front of the newly recognized left-hand side symbol can now have their dots moved as well. This happens over and over until no new states are generated. Parsing finishes when the dot in the dummy start state is moved across the grammar's start symbol.</Paragraph> <Paragraph position="2"> Stolcke's innovation, as regards prefix probabilities, is to add two additional pieces of information to each state: α, the forward or prefix probability, and γ, the "inside" probability. He notes that: path: An (unconstrained) Earley path, or simply path, is a sequence of Earley states linked by prediction, scanning, or completion.</Paragraph> <Paragraph position="3"> constrained: A path is said to be constrained by, or generate, a string x if the terminals immediately to the left of the dot in all scanned states, in sequence, form the string x.</Paragraph> <Paragraph position="4"> ...</Paragraph> <Paragraph position="5"> The significance of Earley paths is that they are in a one-to-one correspondence with left-most derivations. This will allow us to talk about probabilities of derivations, strings and prefixes in terms of the actions performed by Earley's parser. (Stolcke, 1995, page 8)</Paragraph> <Paragraph position="6"> This correspondence between paths of parser operations and derivations enables the computation of the prefix probability, the sum of the probabilities of all derivations compatible with the prefix seen so far. By the correspondence between derivations and Earley paths, one need only compute the sum over all paths that are constrained by the observed prefix. But this can be done in the course of parsing by storing the current prefix probability in each state. Then, when a new state is added by some parser operation, the contribution from each antecedent state (each previous state linked by some parser operation) is summed in the new state. Knowing the prefix probability at each state and then summing for all parser operations that result in the same new state efficiently counts all possible derivations.</Paragraph> <Paragraph position="7"> Predicting a rule corresponds to multiplying by that rule's probability. Scanning does not alter any probabilities. Completion, though, requires knowing γ, the inside probability, which records how probable the inner structure of some recognized phrasal node was. When a state is completed, a bottom-up confirmation is united with a top-down prediction, so the α value of the complete-ee is multiplied by the γ value of the complete-er.</Paragraph>
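<Paragraph> The following sketch renders this bookkeeping in code. It is not Stolcke's implementation: the toy grammar and its probabilities are invented, and the sketch assumes a grammar with no left recursion, no empty productions, and no unit-production cycles, so the closure corrections mentioned in the next paragraph are unnecessary. Each state carries α and γ; prediction multiplies α by the rule probability, scanning copies α and γ unchanged, completion multiplies the complete-ee's α and γ by the complete-er's γ, and the prefix probability after each word is the sum of α over the states created by scanning it.

    from collections import defaultdict

    # Hypothetical toy PCFG, not one of the paper's grammars: (lhs, rhs) -> probability.
    RULES = {
        ("S",  ("NP", "VP")): 1.0,
        ("NP", ("DT", "N")):  1.0,
        ("DT", ("the",)):     1.0,
        ("N",  ("horse",)):   0.6,
        ("N",  ("barn",)):    0.4,
        ("VP", ("V",)):       0.7,
        ("VP", ("V", "PP")):  0.3,
        ("PP", ("P", "NP")):  1.0,
        ("P",  ("past",)):    1.0,
        ("V",  ("raced",)):   1.0,
    }
    NONTERMS = {lhs for (lhs, _) in RULES}

    def prefix_probabilities(words, start="S"):
        """Earley parsing with forward (alpha) and inside (gamma) probabilities,
        in the spirit of Stolcke (1995); returns the prefix probability after
        each word.  Assumes no left recursion, no empty or unit-production cycles."""
        # chart[i] maps a state key (lhs, rhs, dot, origin) to [alpha, gamma]
        chart = [defaultdict(lambda: [0.0, 0.0]) for _ in range(len(words) + 1)]
        worklist = []

        def push(col, key, d_alpha, d_gamma):
            # add increments to a state, summing over the paths that reach it
            cell = chart[col][key]
            cell[0] += d_alpha
            cell[1] += d_gamma
            worklist.append((col, key, d_alpha, d_gamma))

        def predict(col, nonterm, d_alpha):
            # prediction multiplies the forward probability by the rule's probability
            for (lhs, rhs), p in RULES.items():
                if lhs == nonterm:
                    key = (lhs, rhs, 0, col)
                    first_time = key not in chart[col]
                    push(col, key, d_alpha * p, p if first_time else 0.0)

        push(0, ("START0", (start,), 0, 0), 1.0, 1.0)   # dummy start state
        prefix = []
        for i in range(len(words) + 1):
            while worklist:                              # close column i under predict/complete
                col, (lhs, rhs, dot, origin), d_alpha, d_gamma = worklist.pop()
                if dot != len(rhs):                      # dot not at the end of the rule
                    if rhs[dot] in NONTERMS and d_alpha > 0.0:
                        predict(col, rhs[dot], d_alpha)
                elif d_gamma > 0.0:                      # complete: advance antecedents over lhs
                    for (a_lhs, a_rhs, a_dot, a_org), (a_alpha, a_gamma) in list(chart[origin].items()):
                        if a_dot != len(a_rhs) and a_rhs[a_dot] == lhs:
                            # complete-ee's alpha and gamma are multiplied by the complete-er's gamma
                            push(col, (a_lhs, a_rhs, a_dot + 1, a_org),
                                 a_alpha * d_gamma, a_gamma * d_gamma)
            if i == len(words):
                break
            scanned = 0.0                                # scan word i; alpha and gamma are unchanged
            for (lhs, rhs, dot, origin), (alpha, gamma) in list(chart[i].items()):
                if dot != len(rhs) and rhs[dot] == words[i]:
                    push(i + 1, (lhs, rhs, dot + 1, origin), alpha, gamma)
                    scanned += alpha
            prefix.append(scanned)                       # prefix probability = sum of scanned alphas
        return prefix

    print(prefix_probabilities("the horse raced past the barn".split()))
    # approximately [1.0, 0.6, 0.6, 0.18, 0.18, 0.072]

Summing increments into states that already exist is what lets a state reached along several Earley paths accumulate the probability of all of them, which is exactly the counting argument given above.</Paragraph>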
<Paragraph position="8"> Important technical problems involving left-recursive and unit productions are examined and overcome in (Stolcke, 1995). However, these complications do not add any further machinery to the parsing algorithm per se beyond the grammar rules and the dot-moving conventions: in particular, there are no heuristic parsing principles or intermediate structures that are later destroyed. In this respect the algorithm observes strong competence (principle 1). By virtue of being a probabilistic parser it observes principle 2. Finally, in the sense that predict and complete each apply exhaustively at each new input word, the algorithm is eager, satisfying principle 3.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Parallelism </SectionTitle> <Paragraph position="0"> Psycholinguistic theories vary regarding the amount of bandwidth they attribute to the human sentence processing mechanism. Theories of initial parsing preferences (Fodor and Ferreira, 1998) suggest that the human parser is fundamentally serial: a function from a tree and a new word to a new tree. These theories explain processing difficulty by appealing to "garden pathing," in which the current analysis is faced with words that cannot be reconciled with the structures built so far. A middle ground is held by bounded-parallelism theories (Narayanan and Jurafsky, 1998; Roark and Johnson, 1999). In these theories the human parser is modeled as a function from some subset of consistent trees and the new word to a new tree subset. Garden paths arise in these theories when analyses fall out of the set of trees maintained from word to word and have to be reanalyzed, as on strictly serial theories. Finally, there is the possibility of total parallelism, in which the entire set of trees compatible with the input is maintained somehow from word to word. On such a theory, garden-pathing cannot be explained by reanalysis.
</Paragraph> <Paragraph position="1"> The probabilistic Earley parser computes all parses of its input, so as a psycholinguistic theory it is a total parallelism theory. The explanation for garden-pathing will turn on the reduction in the probability of the new tree set compared with the previous tree set; reanalysis plays no role. Before illustrating this kind of explanation with a specific example, it will be important first to clarify the nature of the linking hypothesis between the operation of the probabilistic Earley parser and the measured effects of the human parser.</Paragraph> </Section> <Section position="5" start_page="0" end_page="1" type="metho"> <SectionTitle> 4 Linking hypothesis </SectionTitle> <Paragraph position="0"> The measure of cognitive effort mentioned earlier is defined over prefixes: for some observed prefix, the cognitive effort expended to parse that prefix is proportional to the total probability of all the structural analyses which cannot be compatible with the observed prefix. This is consistent with eagerness since, if the parser were to fail to infer the incompatibility of some incompatible analysis, it would be delaying a computation, and hence not be eager.</Paragraph> <Paragraph position="1"> This prefix-based linking hypothesis can be turned into one that generates predictions about word-by-word reading times by comparing the total effort expended before some word to the total effort after: in particular, take the comparison to be a ratio.</Paragraph> <Paragraph position="2"> Making the further assumption that the probabilities on PCFG rules are statements about how difficult it is to disconfirm each rule, the ratio of the value for the previous word to the value for the current word measures the combined difficulty of disconfirming all disconfirmable structures at a given word, which is the definition of cognitive load. Scaling this number by taking its log gives the surprisal, and defines a word-based measure of cognitive effort in terms of the prefix-based one. (This assumption is inevitable given principles 1 and 2. If there were separate processing costs distinct from the optimization costs postulated in the grammar, then strong competence would be violated. Defining all grammatical structures as equally easy to disconfirm or perceive likewise voids the gradedness of grammaticality of any content.)</Paragraph> <Paragraph position="3"> Of course, if the language model is sensitive to hierarchical structure, then the measure of cognitive effort so defined will be structure-sensitive as well.</Paragraph>
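<Paragraph> In code, the word-based measure is a one-line transformation of the prefix probabilities: the surprisal at word n is the log of the ratio of the prefix probability before the word to the prefix probability after it. The sketch below assumes base-2 logarithms, so surprisal is expressed in bits; the choice of base is not fixed by the definition. The demonstration reuses the toy prefix-probability values from the sketch in section 2.

    import math

    def surprisals(prefix_probs):
        # prefix_probs[k] is the prefix probability after scanning word k;
        # before any word is seen, the whole probability mass (1.0) is still live.
        out, previous = [], 1.0
        for current in prefix_probs:
            out.append(math.log2(previous / current))  # log of the ratio described above
            previous = current
        return out

    print(surprisals([1.0, 0.6, 0.6, 0.18, 0.18, 0.072]))
    # approximately [0.0, 0.74, 0.0, 1.74, 0.0, 1.32] bits

</Paragraph>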
</Section> <Section position="6" start_page="1" end_page="6" type="metho"> <SectionTitle> 5 Plausibility of Probabilistic Context-Free Grammar </SectionTitle> <Paragraph position="0"> The debate over the form grammar takes in the mind is clearly a fundamental one for cognitive science.</Paragraph> <Paragraph position="1"> Much recent psycholinguistic work has generated a wealth of evidence that frequency of exposure to linguistic elements can affect our processing (Mitchell et al., 1995; MacDonald et al., 1994). However, there is no clear consensus as to the size of the elements over which exposure has the clearest effect. Gibson and Pearlmutter identify it as an "outstanding question" whether or not phrase structure statistics are necessary to explain performance effects in sentence comprehension: Are phrase-level contingent frequency constraints necessary to explain comprehension performance, or are the remaining types of constraints sufficient? If phrase-level contingent frequency constraints are necessary, can they subsume the effects of other constraints (e.g. locality)? (Gibson and Pearlmutter, 1998, page 13) Equally, formal work in linguistics has demonstrated the inadequacy of context-free grammars as an appropriate model for natural language in the general case (Shieber, 1985). To address this criticism, the same prefix probabilities could be computed using tree-adjoining grammars (Nederhof et al., 1998). With context-free grammars serving as the implicit backdrop for much work in human sentence processing, as well as in linguistics, simplicity seems as good a guide as any in the selection of a grammar formalism.</Paragraph> <Paragraph position="2"> Some important work in computational psycholinguistics (Ford, 1989) assumes a Lexical-Functional Grammar where the c-structure rules are essentially context-free and have attached to them "strengths" which one might interpret as probabilities.</Paragraph> <Paragraph position="3"> A probabilistic context-free grammar along these lines could account for garden path structural ambiguity. Grammar (1) generates the celebrated garden path sentence "the horse raced past the barn fell" (Bever, 1970). English speakers hearing these words one by one are inclined to take "the horse" as the subject of "raced," expecting the sentence to end at the word "barn." This is the main verb reading in figure 1. The human sentence processing mechanism is metaphorically led up the garden path by the main verb reading, when, upon hearing "fell," it is forced to accept the alternative reduced relative reading shown in figure 2.</Paragraph> <Paragraph position="4"> The confusion between the main verb and the reduced relative readings, which is resolved upon hearing "fell," is the empirical phenomenon at issue. As the parse trees indicate, grammar (1) analyzes reduced relative clauses as a VP adjoined to an NP.</Paragraph> <Paragraph position="5"> In one sample of parsed text, such adjunctions are about 7 times less likely than simple NPs made up of a determiner followed by a noun. The probabilities of the other crucial rules are likewise estimated by their relative frequencies in the sample. (See section 1.24 of the Treebank style guide. The sample starts at sentence 93 of section 16 of the Treebank and runs for 500 sentences, 12,924 words.)</Paragraph> <Paragraph position="6"> This simple grammar exhibits the essential character of the explanation: garden paths happen at points where the parser can disconfirm alternatives that together comprise a great amount of probability. Note the category ambiguity present with "raced," which can show up as both a past-tense verb (VBD) and a past participle (VBN).</Paragraph> <Paragraph position="7"> Figure 3 shows the reading time predictions derived via the linking hypothesis that reading time at word n is proportional to the surprisal, log(α(n-1)/α(n)), computed with the simple grammar. At "fell," the parser garden-paths: up until that point, both the main-verb and reduced-relative structures are consistent with the input. The prefix probability before "fell" is scanned is more than 10 times greater than after, suggesting that the probability mass of the analyses disconfirmed at that point was indeed great. In fact, all of the probability assigned to the main-verb structure is now lost, and only parses that involve the low-probability NP rule survive, a rule introduced 5 words back.</Paragraph>
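<Paragraph> To put the reported drop on the reading-time scale: under the linking hypothesis of section 4, a ratio of prefix probabilities translates directly into surprisal, so the factor of 10 quoted above corresponds to roughly 3.3 bits. The calculation below assumes base-2 logarithms; grammar (1) itself is not reproduced here, so the factor is only the lower bound given in the text.

    import math

    # The text reports that the prefix probability drops by more than a factor
    # of ten when "fell" is scanned; the log of that ratio is the predicted
    # processing difficulty at "fell".
    ratio_at_fell = 10.0
    print(math.log2(ratio_at_fell))   # about 3.32 bits of surprisal at "fell"

</Paragraph>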
<Section position="1" start_page="5" end_page="6" type="sub_section"> <SectionTitle> 6.2 A comparison </SectionTitle> <Paragraph position="0"> If this garden path effect is truly a result of both the main verb and the reduced relative structures being simultaneously available up until the final verb, then the effect should disappear when words intervene that cancel the reduced relative interpretation early on. (Whether the quantitative values of the predicted reading times can be mapped onto a particular experiment involves taking some position on the oft-observed imperfect relationship between corpus frequency and psychological norms (Gibson and Schütze, 1999).)</Paragraph> <Paragraph position="2"> To examine this possibility, consider now a different example sentence, this time from the language of grammar (2).</Paragraph> <Paragraph position="3"> The probabilities in grammar (2) are estimated from the same sample as before. It generates a sentence composed of words actually found in the sample, "the banker told about the buy-back resigned." This sentence exhibits the same reduced relative clause structure as does "the horse raced past the barn fell." (This grammar also generates active and simple passive sentences, rating passive sentences as more probable than the actives. This is presumably a fact about the writing style favored by the Wall Street Journal.)</Paragraph> <Paragraph position="4"> [Graphs 4 and 5 plot predicted reading times for the reduced relative sentence and for its unreduced counterpart, "the banker who was told about the buy-back resigned."] The words "who was" cancel the main verb reading, and should make that condition easier to process. This asymmetry is borne out in graphs 4 and 5. At "resigned" the probabilistic Earley parser predicts less reading time in the subject relative condition than in the reduced relative condition.</Paragraph> <Paragraph position="5"> This comparison verifies that the same sorts of phenomena treated in reanalysis and bounded-parallelism parsing theories fall out as cases of the present, total parallelism theory.</Paragraph> </Section> <Section position="2" start_page="6" end_page="6" type="sub_section"> <SectionTitle> 6.3 An entirely empirical grammar </SectionTitle> <Paragraph position="0"> Although they used frequency estimates provided by corpus data, the previous two grammars were partially hand-built. They used a subset of the rules found in the sample of parsed text. A grammar including all rules observed in the entire sample supports the same sort of reasoning. In this grammar, instead of just 2 NP rules there are 532, along with 120 S rules. Many of these generate analyses compatible with prefixes of the reduced relative clause at various points during parsing, so the expectation is that the parser will be disconfirming many more hypotheses at each word than in the simpler example.</Paragraph>
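<Paragraph> The sketch below illustrates the kind of relative-frequency estimation described here, using the small Penn Treebank fragment that ships with NLTK. This is not the paper's sample: the bundled fragment comes from different sections, and its category labels keep function tags such as NP-SBJ, so the counts will not match the 532 NP rules and 120 S rules reported above. It only shows the estimation procedure.

    from collections import Counter
    from nltk.corpus import treebank   # requires NLTK and: nltk.download('treebank')

    rule_counts, lhs_counts = Counter(), Counter()
    for tree in treebank.parsed_sents()[:500]:     # 500 sentences, but not the paper's 500
        for production in tree.productions():
            rule_counts[production] += 1
            lhs_counts[production.lhs()] += 1

    # Relative-frequency estimate of each rule's probability, as described above.
    probabilities = {rule: count / lhs_counts[rule.lhs()]
                     for rule, count in rule_counts.items()}

    # How many distinct NP expansions does this fragment yield?  (Function tags
    # are still attached, so the number is only loosely comparable.)
    np_rules = [rule for rule in probabilities if str(rule.lhs()).startswith("NP")]
    print(len(np_rules))

</Paragraph>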
<Paragraph position="1"> Figure 6 shows the reading time predictions derived from this much richer grammar.</Paragraph> <Paragraph position="2"> Because the terminal vocabulary of this richer grammar is so much larger, a comparatively large amount of information is conveyed by the nouns "banker" and "buy-back," leading to high surprisal at those words.</Paragraph> <Paragraph position="3"> The garden path effect is still observable at "resigned," where the prefix probability ratio is nearly 10 times greater than at either of the nouns. Amid the lexical effects, the probabilistic Earley parser is affected by the same structural ambiguity that affects English speakers.</Paragraph> </Section> </Section> <Section position="7" start_page="6" end_page="8" type="metho"> <SectionTitle> 7 Subject/Object asymmetry </SectionTitle> <Paragraph position="0"> The same kind of explanation supports an account of the subject-object relative asymmetry (cf. references in (Gibson, 1998)) in the processing of unreduced relative clauses. Since the Earley parser is designed to work with context-free grammars, the following example grammar adopts a GPSG-style analysis of relative clauses (Gazdar et al., 1985, page 155). The estimates of the ratios for the two S[+R] rules are obtained by counting the proportion of subject relatives among all relatives in the Treebank's parsed Brown corpus.</Paragraph> <Paragraph position="2"> In particular, relative clauses in the Treebank are analyzed as an NP expanding to an NP and an SBAR (rule 1), with the SBAR expanding to a WHNP and an S (rule 2), where the S contains a trace *T* coindexed with the WHNP. The total number of structures in which both rule 1 and rule 2 apply is 5489. The total number where the first child of S is null is 4768. This estimate puts the total number of object relatives at 721 (5489 - 4768), the frequency of object relatives at 0.13135362, and the frequency of subject relatives at 0.86864638.</Paragraph> <Paragraph position="3"> Grammar (3) generates both subject and object relative clauses. S[+R] → NP[+R] VP is the rule that generates subject relatives, and S[+R] → NP[+R] S/NP generates object relatives. One might expect there to be a greater processing load for object relatives as soon as enough lexical material is present to determine that the sentence is in fact an object relative.</Paragraph> <Paragraph position="4"> The same probabilistic Earley parser (modified to handle null productions) explains this asymmetry in the same way as it explains the garden path effect.</Paragraph> <Paragraph position="5"> Its predictions, under the same linking hypothesis as in the previous cases, are depicted in graphs 7 and 8. The mean surprisal for the object relative is about 5.0, whereas the mean surprisal for the subject relative is about 2.1.</Paragraph> </Section> </Paper>