<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-2004">
  <Title>Probabilistic Top-Down Parsing and Language Modeling</Title>
  <Section position="5" start_page="253" end_page="256" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> 6 A node A dominates a node B in a tree if and only if either (i) A is the parent of B; or (ii) A is the parent of a node C that dominates B.</Paragraph>
    <Section position="1" start_page="254" end_page="254" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> and those that use a parser to uncover phrasal heads standing in an important relation (c-command) to the current word. The approach that we will subsequently present uses the probabilistic grammar as its language model, but only includes probability mass from those parses that are found, that is, it uses the parser to find a subset of the total set of parses (hopefully most of the high-probability parses) and uses the sum of their probabilities as an estimate of the true probability given the grammar.</Paragraph>
    </Section>
    <Section position="2" start_page="254" end_page="254" type="sub_section">
      <SectionTitle>
3.1 Grammar Models
</SectionTitle>
      <Paragraph position="0"> As mentioned in Section 2.1, a PCFG defines a probability distribution over strings of words. One approach to syntactic language modeling is to use this distribution directly as a language model. There are efficient algorithms in the literature (Jelinek and Lafferty 1991; Stolcke 1995) for calculating exact string prefix probabilities given a PCFG. The algorithms both utilize a left-corner matrix, which can be calculated in closed form through matrix inversion. They are limited, therefore, to grammars where the nonterminal set is small enough to permit inversion. String prefix probabilities can be straightforwardly used to compute conditional word probabilities by definition:</Paragraph>
      <Paragraph position="2"> Stolcke and Segal (1994) and Jurafsky et al. (1995) used these basic ideas to estimate bigram probabilities from hand-written PCFGs, which were then used in language models. Interpolating the observed bigram probabilities with these calculated bigrams led, in both cases, to improvements in word error rate over using the observed bigrams alone, demonstrating that there is some benefit to using these syntactic language models to generalize beyond observed n-grams.</Paragraph>
    </Section>
    <Section position="3" start_page="254" end_page="256" type="sub_section">
      <SectionTitle>
3.2 Finding Phrasal Heads
</SectionTitle>
      <Paragraph position="0"> Another approach that uses syntactic structure for language modeling has been to use a shift-reduce parser to &amp;quot;surface&amp;quot; c-commanding phrasal headwords or part-of-speech (POS) tags from arbitrarily far back in the prefix string, for use in a trigram-like model.</Paragraph>
      <Paragraph position="1"> A shift-reduce parser operates from left to right using a stack and a pointer to the next word in the input string. 9 Each stack entry consists minimally of a nonterminal label. The parser performs two basic operations: (i) shifting, which involves pushing the POS label of the next word onto the stack and moving the pointer to the following word in the input string; and (ii) reducing, which takes the top k stack entries and replaces them with a single new entry, the nonterminal label of which is the left-hand side of a rule in the grammar that has the k top stack entry labels on the right-hand side. For example, if there is a rule NP -~ DT NN, and the top two stack entries are NN and DT, then those two entries can be popped off of the stack and an entry with the label NP pushed onto the stack.</Paragraph>
      <Paragraph position="2"> Goddeau (1992) used a robust deterministic shift-reduce parser to condition word probabilities by extracting a specified number of stack entries from the top of the current state, and conditioning on those entries in a way similar to an n-gram. In empirical trials, Goddeau used the top two stack entries to condition the word probability. He was able to reduce both sentence and word error rates on the ATIS corpus using this method.</Paragraph>
      <Paragraph position="3"> 9 For details, see Hopcroft and Ullman (1979), for example.</Paragraph>
      <Paragraph position="4">  Tree representation of a derivation state.</Paragraph>
      <Paragraph position="5"> The structured language model (SLM) used in Chelba and Jelinek (1998a, 1998b, 1999), Jelinek and Chelba (1999), and Chelba (2000) is similar to that of Goddeau, except that (i) their shift-reduce parser follows a nondeterministic beam search, and (ii) each stack entry contains, in addition to the nonterminal node label, the headword of the constituent. The SLM is like a trigram, except that the conditioning words are taken from the tops of the stacks of candidate parses in the beam, rather than from the linear order of the string.</Paragraph>
      <Paragraph position="6"> Their parser functions in three stages. The first stage assigns a probability to the word given the left context (represented by the stack state). The second stage predicts the POS given the word and the left context. The last stage performs all possible parser operations (reducing stack entries and shifting the new word). When there is no more parser work to be done (or, in their case, when the beam is full), the following word is predicted. And so on until the end of the string.</Paragraph>
      <Paragraph position="7"> Each different POS assignment or parser operation is a step in a derivation. Each distinct derivation path within the beam has a probability and a stack state associated with it. Every stack entry has a nonterminal node label and a designated headword of the constituent. When all of the parser operations have finished at a particular point in the string, the next word is predicted as follows: For each derivation in the beam, the headwords of the two topmost stack entries form a trigram with the conditioned word.</Paragraph>
      <Paragraph position="8"> This interpolated trigram probability is then multiplied by the normalized probability of the derivation, to provide that derivation's contribution to the probability of the word. More precisely, for a beam of derivations Di</Paragraph>
      <Paragraph position="10"> where hod and hld are the lexical heads of the top two entries on the stack of d.</Paragraph>
      <Paragraph position="11"> Figure 3 gives a partial tree representation of a potential derivation state for the string &amp;quot;the dog chased the cat with spots&amp;quot;, at the point when the word &amp;quot;with&amp;quot; is to be predicted. The shift-reduce parser will have, perhaps, built the structure shown, and the stack state will have an NP entry with the head &amp;quot;cat&amp;quot; at the top of the stack, and a VBD entry with the head &amp;quot;chased&amp;quot; second on the stack. In the Chelba and Jelinek model, the probability of &amp;quot;with&amp;quot; is conditioned on these two headwords, for this derivation.</Paragraph>
      <Paragraph position="12"> Since the specific results of the SLM will be compared in detail with our model when the empirical results are presented, at this point we will simply state that they have achieved a reduction in both perplexity and word error rate over a standard trigram using this model.</Paragraph>
      <Paragraph position="13"> The rest of this paper will present our parsing model, its application to language modeling for speech recognition, and empirical results.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="256" end_page="262" type="metho">
    <SectionTitle>
4. Top-Down Parsing and Language Modeling
</SectionTitle>
    <Paragraph position="0"> Statistically based heuristic best-first or beam-search strategies (Caraballo and Charniak 1998; Charniak, Goldwater, and Johnson 1998; Goodman 1997) have yielded an enormous improvement in the quality and speed of parsers, even without any guarantee that the parse returned is, in fact, that with the maximum likelihood for the probability model. The parsers with the highest published broad-coverage parsing accuracy, which include Charniak (1997, 2000), Collins (1997, 1999), and Ratnaparkhi (1997), all utilize simple and straightforward statistically based search heuristics, pruning the search-space quite dramatically. 1deg Such methods are nearly always used in conjunction with some form of dynamic programming (henceforth DP). That is, search efficiency for these parsers is improved by both statistical search heuristics and DP. Here we will present a parser that uses simple search heuristics of this sort without DE Our approach is found to yield very accurate parses efficiently, and, in addition, to lend itself straightforwardly to estimating word probabilities on-line, that is, in a single pass from left to right. This on-line characteristic allows our language model to be interpolated on a word-by-word basis with other models, such as the trigram, yielding further improvements.</Paragraph>
    <Paragraph position="1"> Next we will outline our conditional probability model over rules in the PCFG, followed by a presentation of the top-down parsing algorithm. We will then present empirical results in two domains: one to compare with previous work in the parsing literature, and the other to compare with previous work using parsing for language modeling for speech recognition, in particular with the Chelba and Jelinek results mentioned above.</Paragraph>
    <Section position="1" start_page="256" end_page="258" type="sub_section">
      <SectionTitle>
4.1 Conditional Probability Model
</SectionTitle>
      <Paragraph position="0"> A simple PCFG conditions rule probabilities on the left-hand side of the rule. It has been shown repeatedly--e.g., Briscoe and Carroll (1993), Charniak (1997), Collins (1997), Inui et al. (1997), Johnson (1998)--that conditioning the probabilities of structures on the context within which they appear, for example on the lexical head of a constituent (Charniak 1997; Collins 1997), on the label of its parent nonterrninal (Johnson 1998), or, ideally, on both and many other things besides, leads to a much better parsing model and results in higher parsing accuracies.</Paragraph>
      <Paragraph position="1"> One way of thinking about conditioning the probabilities of productions on contextual information (e.g., the label of the parent of a constituent or the lexical heads of constituents), is as annotating the extra conditioning information onto the labels in the context-free rules. Examples of this are bilexical grammars--such as Eisner and Satta (1999), Charniak (1997), Collins (1997)--where the lexical heads of each constituent are annotated on both the right- and left-hand sides of the context-free rules, under the constraint that every constituent inherits the lexical head from exactly one of its children, and the lexical head of a POS is its terminal item. Thus the rule S -* NP VP becomes, for instance, S\[barks\] ---* NP\[dog\] VP\[barks\]. One way to estimate the probabilities of these rules is to annotate the heads onto the constituent labels in the training corpus and simply count the number of times particular productions occur (relative frequency estimation). This procedure yields conditional probability distributions of 10 Johnson et al. (1999), Henderson and Brill (1999), and Collins (2000) demonstrate methods for choosing the best complete parse tree from among a set of complete parse trees, and the latter two show accuracy improvements over some of the parsers cited above, from which they generated their candidate sets. Here we will be comparing our work with parsing algorithms, i.e., algorithms that build parses for strings of words.  Computational Linguistics Volume 27, Number 2 constituents on the right-hand side with their lexical heads, given the left-hand side constituent and its lexical head. The same procedure works if we annotate parent information onto constituents. This is how Johnson (1998) conditioned the probabilities of productions: the left-hand side is no longer, for example, S, but rather STSBAR, i.e., an S with SBAR as parent. Notice, however, that in this case the annotations on the right-hand side are predictable from the annotation on the left-hand side (unlike, for example, bilexical grammars), so that the relative frequency estimator yields conditional probability distributions of the original rules, given the parent of the left-hand side.</Paragraph>
      <Paragraph position="2"> All of the conditioning information that we will be considering will be of this latter sort: the only novel predictions being made by rule expansions are the node labels of the constituents on the right-hand side. Everything else is already specified by the left context. We use the relative frequency estimator, and smooth our production probabilities by interpolating the relative frequency estimates with those obtained by &amp;quot;annotating&amp;quot; less contextual information.</Paragraph>
      <Paragraph position="3"> This perspective on conditioning production probabilities makes it easy to see that, in essence, by conditioning these probabilities, we are growing the state space. That is, the number of distinct nonterminals grows to include the composite labels; so does the number of distinct productions in the grammar. In a top-down parser, each rule expansion is made for a particular candidate parse, which carries with it the entire rooted derivation to that point; in a sense, the left-hand side of the rule is annotated with the entire left context, and the rule probabilities can be conditioned on any aspect of this derivation.</Paragraph>
      <Paragraph position="4"> We do not use the entire left context to condition the rule probabilities, but rather &amp;quot;pick-and-choose&amp;quot; which events in the left context we would like to condition on.</Paragraph>
      <Paragraph position="5"> One can think of the conditioning events as functions, which take the partial tree structure as an argument and return a value, upon which the rule probability can be conditioned. Each of these functions is an algorithm for walking the provided tree and returning a value. For example, suppose that we want to condition the probability of the rule A --* ~. We might write a function that takes the partial tree, finds the parent of the left-hand side of the rule and returns its node label. If the left-hand side has no parent (i.e., it is at the root of the tree), the function returns the null value (NULL). We might write another function that returns the nonterminal label of the closest sibling to the left of A, and NULL if no such node exists. We can then condition the probability of the production on the values that were returned by the set of functions.</Paragraph>
      <Paragraph position="6"> Recall that we are working with a factored grammar, so some of the nodes in the factored tree have nonterminal labels that were created by the factorization, and may not be precisely what we want for conditioning purposes. In order to avoid any confusions in identifying the nonterminal label of a particular rule production in either its factored or nonfactored version, we introduce the function constLtuent (A) for every nonterminal in the factored grammar GI, which is simply the label of the constituent whose factorization results in A. For example, in Figure 2, constLtuent (NP-DT-NN) is simply NP.</Paragraph>
      <Paragraph position="7"> Note that a function can return different values depending upon the location in the tree of the nonterminal that is being expanded. For example, suppose that we have a function that returns the label of the closest sibling to the left of constituent(A) or NULL if no such node exists. Then a subsequent function could be defined as follows: return the parent of the parent (the grandparent) of constituent (A) only if constituent(A) has no sibling to the left--in other words, if the previous function returns NULL; otherwise return the second closest sibling to the left of constLtuent (A), or, as always, NULL if no such node exists. If the function returns, for example, NP, this could either mean that the grandparent is NP or the second closest sibling is</Paragraph>
    </Section>
    <Section position="2" start_page="258" end_page="260" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> For all rules A -+ ct (r) A ) O the parent, Yp, of constituent (A) in the derivation I (~ the closest sibling, Ys, to the left of constituent(A) in the derivation S, G NULL @ the parent, ~, of Yp in the derivation the closest c-commanding lexical head to A s I the next closest c-commanding the closest sibling, the POS of the closest lexical head to A @ }'ps, to the left of }'~ c-commanding lexical head to .4 If Y is CC the leftmost child 8  (r) of the conjoining category; else NULL the closest c-commanding lexical head to A ! the lexical t!ead of consl:ituent (.4) if already seen; (r) otherwise the lexical head of the closest the next closest c-commanding constituent to the left of A within constituent(A) lexical head to A  Conditional probability model represented as a decision tree, identifying the location in the partial parse tree of the conditioning information. NP; yet there is no ambiguity in the meaning of the function, since the result of the previous function disambiguates between the two possibilities. The functions that were used for the present study to condition the probability of the rule, A ---, a, are presented in Figure 4, in a tree structure. This is a sort of decision tree for a tree-walking algorithm to decide what value to return, for a given partial tree and a given depth. For example, if the algoritt~rn is asked for the value at level 0, it will return A, the left-hand side of the rule being expanded. 1~ Suppose the algorithm is asked for the value at level 4. After level 2 there is a branch in the decision tree. If the left-hand side of the rule is a POS, and there is no sibling to the left of constituent (A) in the derivation, then the algorithm takes the right branch of the decision tree to decide what value to return; otherwise the left branch. Suppose it takes the left branch. Then after level 3, there is another branch in the decision tree. If the left-hand side of the production is a POS, then the algorithm takes the right branch of the decision tree, and returns (at level 4) the POS of the closest c-commanding lexical head to A, which it finds by walking the parse tree; if the left-hand side of the rule is not a POS, then the algorithm returns (at level 4) the closest sibling to the left of the parent of constituent (A).</Paragraph>
      <Paragraph position="1"> The functions that we have chosen for this paper follow from the intuition (and experience) that what helps parsing is different depending on the constituent that is being expanded. POS nodes have lexical items on the right-hand side, and hence can bring into the model some of the head-head dependencies that have been shown to  be so effective. If the POS is leftmost within its constituent, then very often the lexical 11 Recall that A can be a composite nonterminal introduced by grammar factorization. When the function is defined in terms of constJ_tuent (A), the values returned are obtained by moving through the nonfactored tree.</Paragraph>
      <Paragraph position="2">  Everything for non-POS expansions More structural info for leftmost POS expansions All attachment info for leftmost POS expansions Everything item is sensitive to the governing category to which it is attaching. For example, if the POS is a preposition, then its probability of expanding to a particular word is very different if it is attaching to a noun phrase than if it is attaching to a verb phrase, and perhaps quite different depending on the head of the constituent to which it is attaching. Subsequent POSs within a constituent are likely to be open-class words, and less dependent on these sorts of attachment preferences.</Paragraph>
      <Paragraph position="3"> Conditioning on parents and siblings of the left-hand side has proven to be very useful. To understand why this is the case, one need merely to think of VP expansions. If the parent of a VP is another VP (i.e., if an auxiliary or modal verb is used), then the distribution over productions is different than if the parent is an S. Conditioning on head information, both POS of the head and the lexical item itself, has proven useful as well, although given our parser's left-to-right orientation, in many cases the head has not been encountered within the particular constituent. In such a case, the head of the last child within the constituent is used as a proxy for the constituent head. All of our conditioning functions, with one exception, return either parent or sibling node labels at some specific distance from the left-hand side, or head information from c-commanding constituents. The exception is the function at level 5 along the left branch of the tree in Figure 4. Suppose that the node being expanded is being conjoined with another node, which we can tell by the presence or absence of a CC node. In that case, we want to condition the expansion on how the conjoining constituent expanded. In other words, this attempts to capture a certain amount of parallelism between the expansions of conjoined categories.</Paragraph>
      <Paragraph position="4"> In presenting the parsing results, we will systematically vary the amount of conditioning information, so as to get an idea of the behavior of the parser. We will refer to the amount of conditioning by specifying the deepest level from which a value is returned for each branching path in the decision tree, from left to right in Figure 4: the first number is for left contexts where the left branch of the decision tree is always followed (non-POS nonterminals on the left-hand side); the second number is for a left branch followed by a right branch (POS nodes that are leftmost within their constituent); and the third number is for the contexts where the right branch is always followed (POS nodes that are not leftmost within their constituent). For example, (4,3,2) would represent a conditional probability model that (i) returns NULL for all functions below level 4 in all contexts; (ii) returns NULL for all functions below level 3 if the left-hand side is a POS; and (iii) returns NULL for all functions below level 2 for nonleftmost POS expansions.</Paragraph>
      <Paragraph position="5"> Table 1 gives a breakdown of the different levels of conditioning information used in the empirical trials, with a mnemonic label that will be used when presenting results. These different levels were chosen as somewhat natural points at which to observe</Paragraph>
    </Section>
    <Section position="3" start_page="260" end_page="260" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> how much of an effect increasing the conditioning information has. We first include structural information from the context, namely, node labels from constituents in the left context. Then we add lexical information, first for non-POS expansions, then for leftmost POS expansions, then for all expansions.</Paragraph>
      <Paragraph position="1"> All of the conditional probabilities are linearly interpolated. For example, the probability of a rule conditioned on six events is the linear interpolation of two probabilities: (i) the empirically observed relative frequency of the rule when the six events co-occur; and (ii) the probability of the rule conditioned on the first five events (which is in turn interpolated). The interpolation coefficients are a function of the frequency of the set of conditioning events, and are estimated by iteratively adjusting the coefficients so as to maximize the likelihood of a held-out corpus.</Paragraph>
      <Paragraph position="2"> This was an outline of the conditional probability model that we used for the PCFG. The model allows us to assign probabilities to derivations, which can be used by the parsing algorithm to decide heuristically which candidates are promising and should be expanded, and which are less promising and should be pruned. We now outline the top-down parsing algorithm.</Paragraph>
    </Section>
    <Section position="4" start_page="260" end_page="262" type="sub_section">
      <SectionTitle>
4.2 Top-Down Probabilistic Parsing
</SectionTitle>
      <Paragraph position="0"> This parser is essentially a stochastic version of the top-down parser described in Aho, Sethi, and Ullman (1986). It uses a PCFG with a conditional probability model of the sort defined in the previous section. We will first define candidate analysis (i.e., a partial parse), and then a derives relation between candidate analyses. We will then present the algorithm in terms of this relation.</Paragraph>
      <Paragraph position="1"> The parser takes an input string w~, a PCFG G, and a priority queue of candidate analyses. A candidate analysis C = (D, S, Po, F, w n) consists of a derivation D, a stack S, a derivation probability Po, a figure of merit F, and a string w n remaining to be parsed. The first word in the string remaining to be parsed, wi, we will call the look-ahead word. The derivation D consists of a sequence of rules used from G. The stack $ contains a sequence of nonterminal symbols, and an end-of-stack marker $ at the bottom. The probability Po is the product of the probabilities of all rules in the derivation D. F is the product of PD and a look-ahead probability, LAP($,wi), which is a measure of the likelihood of the stack $ rewriting with wi at its left corner.</Paragraph>
      <Paragraph position="2"> We can define a derives relation, denoted 3, between two candidate analyses as follows. (D,S, PD, F,w~/) ~ (D',S',PD,,F',w~) if and only if 12  i. DI = D + A --~ Xo...Xk ii. $ = Ac~$; iii. either S' = Xo... Xko~$ and j = i or k = 0, X0 = wi, j = i+ 1, and $' = c~$; iv. PD, = PoP(A --~ Xo... Xk); and v. F' = PD, LAP(S',w/) The parse begins with a single candidate analysis on the priority queue: ((), St$, 1, 1, w~). Next, the top ranked candidate analysis, C = (D, $, PD, F, w~), is popped from the priority queue. If $ = $ and wi = (/s), then the analysis is complete. Otherwise, all C' such that C ~ C t are pushed onto the priority queue.</Paragraph>
      <Paragraph position="3"> 12 Again, for ease of exposition, we will ignore e-productions. Everything presented here can be straightforwardly extended to include them. The + in (i) denotes concatenation. To avoid confusion between sets and sequences, 0 will not be used for empty strings or sequences, rather the symbol ( ) will be used. Note that the script $ is used to denote stacks, while St is the start symbol.  Computational Linguistics Volume 27, Number 2 We implement this as a beam search. For each word position i, we have a separate priority queue Hi of analyses with look-ahead wi. When there are &amp;quot;enough&amp;quot; analyses by some criteria (which we will discuss below) on priority queue Hi+l, all candidate analyses remaining on Hi are discarded. Since Wn = (/s), all parses that are pushed onto Hn+l are complete. The parse on Hn+l with the highest probability is returned for evaluation. In the case that no complete parse is found, a partial parse is returned and evaluated.</Paragraph>
      <Paragraph position="4"> The LAP is the probability of a particular terminal being the next left-corner of a particular analysis. The terminal may be the left corner of the topmost nonterminal on the stack of the analysis or it might be the left corner of the nth nonterminal, after the top n - 1 nonterminals have rewritten to ~. Of course, we cannot expect to have adequate statistics for each nonterminal/word pair that we encounter, so we smooth to the POS. Since we do not know the POS for the word, we must sum the LAP for all POS labels. 13 For a PCFG G, a stack $ = A0... A,$ (which we will write Ag$) and a look-ahead terminal item wi, we define the look-ahead probability as follows:</Paragraph>
      <Paragraph position="6"> We recursively estimate this with two empirically observed conditional probabilities for every nonterminal Ai: P(Ai -- wio~) and P(Ai * c). The same empirical probability, P(Ai -G Xc~), is collected for every preterminal X as well. The LAP approximation for a given stack state and look-ahead terminal is:</Paragraph>
      <Paragraph position="8"> The lambdas are a function of the frequency of the nonterminal Aj, in the standard way (Jelinek and Mercer 1980).</Paragraph>
      <Paragraph position="9"> The beam threshold at word wi is a function of the probability of the top-ranked candidate analysis on priority queue Hi+1 and the number of candidates on Hi+l. The basic idea is that we want the beam to be very wide if there are few analyses that have been advanced, but relatively narrow if many analyses have been advanced. If ~ is the probability of the highest-ranked analysis on Hi+l, then another analysis is discarded if its probability falls below Pf('7, IH/+ll), where 3' is an initial parameter, which we call the base beam factor. For the current study, 3' was 10 -11 , unless otherwise noted, and f('y, IHi+l I) = 'TIHi+I\]3&amp;quot; ThUS, if 100 analyses have already been pushed onto/-//+1, then a candidate analysis must have a probability above 10-5~ to avoid being pruned.</Paragraph>
      <Paragraph position="10"> After 1,000 candidates, the beam has narrowed to 10-2p. There is also a maximum number of allowed analyses on Hi, in case the parse fails to advance an analysis to Hi+\]. This was typically 10,000.</Paragraph>
      <Paragraph position="11"> As mentioned in Section 2.1, we left-factor the grammar, so that all productions are binary, except those with a single terminal on the right-hand side and epsilon productions. The only e-productions are those introduced by left-factorization. Our factored 13 Equivalently, we can split the analyses at this point, so that there is one POS per analysis.</Paragraph>
    </Section>
    <Section position="5" start_page="262" end_page="262" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> grammar was produced by factoring the trees in the training corpus before grammar induction, which proceeded in the standard way, by counting rule frequencies.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="262" end_page="271" type="metho">
    <SectionTitle>
5. Empirical Results
</SectionTitle>
    <Paragraph position="0"> The empirical results will be presented in three stages: (i) trials to examine the accuracy and efficiency of the parser; (ii) trials to examine its effect on test corpus perplexity and recognition performance; and (iii) trials to examine the effect of beam variation on these performance measures. Before presenting the results, we will introduce the methods of evaluation.</Paragraph>
    <Section position="1" start_page="262" end_page="263" type="sub_section">
      <SectionTitle>
5.1 Evaluation
</SectionTitle>
      <Paragraph position="0"> Perplexity is a standard measure within the speech recognition community for comparing language models. In principle, if two models are tested on the same test corpus, the model that assigns the lower perplexity to the test corpus is the model closest to the true distribution of the language, and thus better as a prior model for speech recognition. Perplexity is the exponential of the cross entropy, which we will define next.</Paragraph>
      <Paragraph position="1"> Given a random variable X with distribution p and a probability model q, the cross entropy, H(p, q) is defined as follows:</Paragraph>
      <Paragraph position="3"> Let p be the true distribution of the language. Then, under certain assumptions, given a large enough sample, the sample mean of the negative log probability of a model will converge to its cross entropy with the true model. 14 That is</Paragraph>
      <Paragraph position="5"> where w~ is a string of the language L. In practice, one takes a large sample of the language, and calculates the negative log probability of the sample, normalized by its size. 15 The lower the cross entropy (i.e., the higher the probability the model assigns to the sample), the better the model. Usually this is reported in terms of perplexity, which we will do as well. 16 Some of the trials discussed below will report results in terms of word and/or sentence error rate, which are obtained when the language model is embedded in a speech recognition system. Word error rate is the number of deletion, insertion, or substitution errors per 100 words. Sentence error rate is the number of sentences with one or more errors per 100 sentences.</Paragraph>
      <Paragraph position="6"> Statistical parsers are typically evaluated for accuracy at the constituent level, rather than simply whether or not the parse that the parser found is completely correct  or not. A constituent for evaluation purposes consists of a label (e.g., NP) and a span (beginning and ending word positions). For example, in Figure l(a), there is a VP that spans the words &amp;quot;chased the ball&amp;quot;. Evaluation is carried out on a hand-parsed test corpus, and the manual parses are treated as correct. We will call the manual parse 14 See Cover and Thomas (1991) for a discussion of the Shannon-McMillan-Breiman theorem, under the assumptions of which this convergence holds.</Paragraph>
      <Paragraph position="7"> 15 It is important to remember to include the end marker in the strings of the sample. 16 When assessing the magnitude of a perplexity improvement, it is often better to look at the reduction in cross entropy, by taking the log of the perplexity. It will be left to the reader to do so.  Computational Linguistics Volume 27, Number 2 GOLD and the parse that the parser returns TEST. Precision is the number of common constituents in GOLD and TEST divided by the number of constituents in TEST. Recall is the number of common constituents in GOLD and TEST divided by the number of constituents in GOLD. Following standard practice, we will be reporting scores only for non-part-of-speech constituents, which are called labeled recall (LR) and labeled precision (LP). Sometimes in figures we will plot their average, and also what can be termed the parse error, which is one minus their average.</Paragraph>
      <Paragraph position="8"> LR and LP are part of the standard set of PARSEVAL measures of parser quality (Black et al. 1991). From this set of measures, we will also include the crossing bracket scores: average crossing brackets (CB), percentage of sentences with no crossing brackets (0 CB), and the percentage of sentences with two crossing brackets or fewer (&lt; 2 CB). In addition, we show the average number of rule expansions considered per word, that is, the number of rule expansions for which a probability was calculated (see Roark and Charniak \[2000\]), and the average number of analyses advanced to the next priority queue per word.</Paragraph>
      <Paragraph position="9"> This is an incremental parser with a pruning strategy and no backtracking. In such a model, it is possible to commit to a set of partial analyses at a particular point that cannot be completed given the rest of the input string (i.e., the parser can &amp;quot;garden path&amp;quot;). In such a case, the parser fails to return a complete parse. In the event that no complete parse is found, the highest initially ranked parse on the last nonempty priority queue is returned. All unattached words are then attached at the highest level in the tree. In such a way we predict no new constituents and all incomplete constituents are closed. This structure is evaluated for precision and recall, which is entirely appropriate for these incomplete as well as complete parses. If we fail to identify nodes later in the parse, recall will suffer, and if our early predictions were bad, both precision and recall will suffer. Of course, the percentage of these failures are reported as well.</Paragraph>
    </Section>
    <Section position="2" start_page="263" end_page="265" type="sub_section">
      <SectionTitle>
5.2 Parser Accuracy and Efficiency
</SectionTitle>
      <Paragraph position="0"> The first set of results looks at the performance of the parser on the standard corpora for statistical parsing trials: Sections 2-21 (989,860 words, 39,832 sentences) of the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993) served as the training data, Section 24 (34,199 words, 1,346 sentences) as the held-out data for parameter estimation, and Section 23 (59,100 words, 2,416 sentences) as the test data. Section 22 (41,817 words, 1,700 sentences) served as the development corpus, on which the parser was tested until stable versions were ready to run on the test data, to avoid developing the parser to fit the specific test data.</Paragraph>
      <Paragraph position="1"> Table 2 shows trials with increasing amounts of conditioning information from the left context. There are a couple of things to notice from these results. First, and least surprising, is that the accuracy of the parses improved as we conditioned on more and more information. Like the nonlexicalized parser in Roark and Johnson (1999), we found that the search efficiency, in terms of number of rule expansions considered or number of analyses advanced, also improved as we increased the amount of conditioning. Unlike the Roark and Johnson parser, however, our coverage did not substantially drop as the amount of conditioning information increased, and in some cases, coverage improved slightly. They did not smooth their conditional probability estimates, and blamed sparse data for their decrease in coverage as they increased the conditioning information. These results appear to support this, since our smoothed model showed no such tendency.</Paragraph>
      <Paragraph position="2"> Figure 5 shows the reduction in parser error, 1 - LR+LP and the reduction in 2 ' rule expansions considered as the conditioning information increased. The bulk of  Reduction in average precision/recall error and in number of rule expansions per word as conditioning increases, for sentences of length &lt; 40.</Paragraph>
      <Paragraph position="3"> the improvement comes from simply conditioning on the labels of the parent and the closest sibling to the node being expanded. Interestingly, conditioning all POS expansions on two c-commanding heads made no difference in accuracy compared to conditioning only leftmost POS expansions on a single c-commanding head; but it did improve the efficiency.</Paragraph>
      <Paragraph position="4"> These results, achieved using very straightforward conditioning events and considering only the left context, are within one to four points of the best published</Paragraph>
      <Paragraph position="6"/>
    </Section>
    <Section position="3" start_page="265" end_page="266" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> Figure 6 Observed running time on Section 23 of the Penn Treebank, with the full conditional probability model and beam of 10 -:1, using one 300 MHz UltraSPARC processor and 256MB of RAM of a Sun Enterprise 450.</Paragraph>
      <Paragraph position="1"> accuracies cited above. :7 Of the 2,416 sentences in the section, 728 had the totally correct parse, 30.1 percent tree accuracy. Also, the parser returns a set of candidate parses, from which we have been choosing the top ranked; if we use an oracle to choose the parse with the highest accuracy from among the candidates (which averaged 70.0 in number per sentence), we find an average labeled precision/recall of 94.1, for sentences of length G 100. The parser, thus, could be used as a front end to some other model, with the hopes of selecting a more accurate parse from among the final candidates. While we have shown that the conditioning information improves the efficiency in terms of rule expansions considered and analyses advanced, what does the efficiency of such a parser look like in practice? Figure 6 shows the observed time at our standard base beam of 10 -11 with the full conditioning regimen, alongside an approximation of the reported observed (linear) time in Ratnaparkhi (1997). Our observed times look polynomial, which is to be expected given our pruning strategy: the denser the competitors within a narrow probability range of the best analysis, the more time will be spent working on these competitors; and the farther along in the sentence, the more chance for ambiguities that can lead to such a situation. While our observed times are not linear, and are clearly slower than his times (even with a faster machine), they are quite respectably fast. The differences between a k-best and a beam-search parser (not to mention the use of dynamic programming) make a running time difference unsur17 Our score of 85.8 average labeled precision and recall for sentences less than or equal to 100 on Section 23 compares to: 86.7 in Charniak (1997), 86.9 in Ratnaparkhi (1997), 88.2 in Collins (1999), 89.6 in Charniak (2000), and 89.75 in Collins (2000).</Paragraph>
    </Section>
    <Section position="4" start_page="266" end_page="266" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> prising. What is perhaps surprising is that the difference is not greater. Furthermore, this is quite a large beam (see discussion below), so that very large improvements in efficiency can be had at the expense of the number of analyses that are retained.</Paragraph>
    </Section>
    <Section position="5" start_page="266" end_page="269" type="sub_section">
      <SectionTitle>
5.3 Perplexity Results
</SectionTitle>
      <Paragraph position="0"> The next set of results will highlight what recommends this approach most: the ease with which one can estimate string probabilities in a single pass from left to right across the string. By definition, a PCFG's estimate of a string's probability is the sum of the probabilities of all trees that produce the string as terminal leaves (see Equation 1).</Paragraph>
      <Paragraph position="1"> In the beam search approach outlined above, we can estimate the string's probability in the same manner, by summing the probabilities of the parses that the algorithm finds. Since this is not an exhaustive search, the parses that are returned will be a subset of the total set of trees that would be used in the exact PCFG estimate; hence the estimate thus arrived at will be bounded above by the probability that would be generated from an exhaustive search. The hope is that a large amount of the probability mass will be accounted for by the parses in the beam. The method cannot overestimate the probability of the string.</Paragraph>
      <Paragraph position="2"> Recall the discussion of the grammar models above, and our definition of the set of partial derivations Dwd with respect to a prefix string w0 j (see Equations 2 and 7).</Paragraph>
      <Paragraph position="4"> Note that the numerator at word wj is the denominator at word wj+l, so that the product of all of the word probabilities is the numerator at the final word, namely, the string prefix probability.</Paragraph>
      <Paragraph position="5"> We can make a consistent estimate of the string probability by similarly summing over all of the trees within our beam. Let H~ nit be the priority queue Hi before any processing has begun with word Wi in the look-ahead. This is a subset of the possible leftmost partial derivations with respect to the prefix string w 0i-1. Since ,/4init~i+l is produced by expanding only analyses on priority queue HI '~it, the set of complete trees consistent with the partial derivations on priority queue ~4i,,~t is a subset of the &amp;quot;i+1 set of complete trees consistent with the partial derivations on priority queue H~ nit, that is, the total probability mass represented by the priority queues is monotonically decreasing. Thus conditional word probabilities defined in a way consistent with Equation 14 will always be between zero and one. Our conditional word probabilities are calculated as follows:</Paragraph>
      <Paragraph position="7"> As mentioned above, the model cannot overestimate the probability of a string, because the string probability is simply the sum over the beam, which is a subset of the possible derivations. By utilizing a figure of merit to identify promising analyses, we are simply focusing our attention on those parses that are likely to have a high probability, and thus we are increasing the amount of probability mass that we do capture, of the total possible. It is not part of the probability model itself.</Paragraph>
      <Paragraph position="8"> Since each word is (almost certainly, because of our pruning strategy) losing some probability mass, the probability model is not &amp;quot;proper&amp;quot;--the sum of the probabilities over the vocabulary is less than one. In order to have a proper probability distribution,  Computational Linguistics Volume 27, Number 2 we would need to renormalize by dividing by some factor. Note, however, that this renormalization factor is necessarily less than one, and thus would uniformly increase each word's probability under the model, that is, any perplexity results reported below will be higher than the &amp;quot;true&amp;quot; perplexity that would be assigned with a properly normalized distribution. In other words, renormalizing would make our perplexity measure lower still. The hope, however, is that the improved parsing model provided by our conditional probability model will cause the distribution over structures to be more peaked, thus enabling us to capture more of the total probability mass, and making this a fairly snug upper bound on the perplexity.</Paragraph>
      <Paragraph position="9"> One final note on assigning probabilities to strings: because this parser does garden path on a small percentage of sentences, this must be interpolated with another estimate, to ensure that every word receives a probability estimate. In our trials, we used the unigram, with a very small mixing coefficient:</Paragraph>
      <Paragraph position="11"> If ~dcH~it P(d) = 0 in our model, then our model provides no distribution over following words since the denominator is zero. Thus,</Paragraph>
      <Paragraph position="13"> Chelba and Jelinek (1998a, 1998b) also used a parser to help assign word probabilities, via the structured language model outlined in Section 3.2. They trained and tested the SLM on a modified, more &amp;quot;speech-like&amp;quot; version of the Penn Treebank. Their modifications included: (i) removing orthographic cues to structure (e.g., punctuation); (ii) replacing all numbers with the single token N; and (iii) closing the vocabulary at 10,000, replacing all other words with the UNK token. They used Sections 00-20 (929,564 words) as the development set, Sections 21-22 (73,760 words) as the check set (for interpolation coefficient estimation), and tested on Sections 23-24 (82,430 words).</Paragraph>
      <Paragraph position="14"> We obtained the training and testing corpora from them (which we will denote C&amp;J corpus), and also created intermediate corpora, upon which only the first two modifications were carried out (which we will denote no punct). Differences in performance will give an indication of the impact on parser performance of the different modifications to the corpora. All trials in this section used Sections 00-20 for counts, held out 21-22, and tested on 23-24.</Paragraph>
      <Paragraph position="15"> Table 3 shows several things. First, it shows relative performance for unmodified, no punct, and C&amp;J corpora with the full set of conditioning information. We can see that removing the punctuation causes (unsurprisingly) a dramatic drop in the accuracy and efficiency of the parser. Interestingly, it also causes coverage to become nearly total, with failure on just two sentences per thousand on average.</Paragraph>
      <Paragraph position="16"> We see the familiar pattern, in the C&amp;J corpus results, of improving performance as the amount of conditioning information grows. In this case we have perplexity results as well, and Figure 7 shows the reduction in parser error, rule expansions, and perplexity as the amount of conditioning information grows. While all three seem to be similarly improved by the addition of structural context (e.g., parents and siblings), the addition of c-commanding heads has only a moderate effect on the parser accuracy, but a very large effect on the perplexity. The fact that the efficiency was improved more than the accuracy in this case (as was also seen in Figure 5), seems to indicate that this additional information is causing the distribution to become more peaked, so that fewer analyses are making it into the beam.</Paragraph>
      <Paragraph position="17">  Table 4 compares the perplexity of our model with Chelba and Jelinek (1998a, 1998b) on the same training and testing corpora. We built an interpolated trigram model to serve as a baseline (as they did), and also interpolated our model's perplexity with the trigram, using the same mixing coefficient as they did in their trials (taking  36 percent of the estimate from the trigram), is The trigram model was also trained on Sections 00-20 of the C&amp;J corpus. Trigrams and bigrams were binned by the total 18 Our optimal mixture level was closer to 40 percent, but the difference was negligible.  count of the conditioning words in the training corpus, and maximum likelihood mixing coefficients were calculated for each bin, to mix the trigram with bigram and unigram estimates. Our trigram model performs at almost exactly the same level as theirs does, which is what we would expect. Our parsing model's perplexity improves upon their first result fairly substantially, but is only slightly better than their second result. 19 However, when we interpolate with the trigram, we see that the additional improvement is greater than the one they experienced. This is not surprising, since our conditioning information is in many ways orthogonal to that of the trigram, insofar as it includes the probability mass of the derivations; in contrast, their model in some instances is very close to the trigram, by conditioning on two words in the prefix string, which may happen to be the two adjacent words.</Paragraph>
      <Paragraph position="18"> These results are particularly remarkable, given that we did not build our model as a language model per se, but rather as a parsing model. The perplexity improvement was achieved by simply taking the existing parsing model and applying it, with no extra training beyond that done for parsing.</Paragraph>
      <Paragraph position="19"> The hope was expressed above that our reported perplexity would be fairly close to the &amp;quot;true&amp;quot; perplexity that we would achieve if the model were properly normalized, i.e., that the amount of probability mass that we lose by pruning is small. One way to test this is the following: at each point in the sentence, calculate the conditional probability of each word in the vocabulary given the previous words, and sum them. 2deg If there is little loss of probability mass, the sum should be close to one. We did this for the first 10 sentences in the test corpus, a total of 213 words (including the end-of-sentence markers). One of the sentences was a failure, so that 12 of the word probabilities (all of the words after the point of the failure) were not estimated by our model. Of the remaining 201 words, the average sum of the probabilities over the 10,000-word vocabulary was 0.9821, with a minimum of 0.7960 and a maximum of 0.9997. Interestingly, at the word where the failure occurred, the sum of the probabilities was 0.9301.</Paragraph>
    </Section>
    <Section position="6" start_page="269" end_page="270" type="sub_section">
      <SectionTitle>
5.4 Word Error Rate
</SectionTitle>
      <Paragraph position="0"> In order to get a sense of whether these perplexity reduction results can translate to improvement in a speech recognition task, we performed a very small preliminary experiment on n-best lists. The DARPA '93 HUB1 test setup consists of 213 utterances read from the Wall Street Journal, a total of 3,446 words. The corpus comes with a baseline trigram model, using a 20,000-word open vocabulary, and trained on approximately 40 million words. We used Ciprian Chelba's A* decoder to find the 50 best hypotheses from each lattice, along with the acoustic and trigram scores. 21 Given  19 Recall that our perplexity measure should, ideally, be even lower still.</Paragraph>
      <Paragraph position="1"> 20 Thanks to Ciprian Chelba for this suggestion. 21 See Chelba (2000) for details on the decoder.</Paragraph>
    </Section>
    <Section position="7" start_page="270" end_page="271" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> the idealized circumstances of the production (text read in a lab), the lattices are relatively sparse, and in many cases 50 distinct string hypotheses were not found in a lattice. We reranked an average of 22.9 hypotheses with our language model per utterance.</Paragraph>
      <Paragraph position="1"> One complicating issue has to do with the tokenization in the Penn Treebank versus that in the HUB1 lattices. In particular, contractions (e.g., he' s) are split in the Penn Treebank (he 's) but not in the HUB1 lattices. Splitting of the contractions is critical for parsing, since the two parts oftentimes (as in the previous example) fall in different constituents. We follow Chelba (2000) in dealing with this problem: for parsing purposes, we use the Penn Treebank tokenization; for interpolation with the provided trigram model, and for evaluation, the lattice tokenization is used. If we are to interpolate our model with the lattice trigram, we must wait until we have our model's estimate for the probability of both parts of the contraction; their product can then be interpolated with the trigram estimate. In fact, interpolation in these trials made no improvement over the better of the uninterpolated models, but simply resulted in performance somewhere between the better and the worse of the two models, so we will not present interpolated trials here.</Paragraph>
      <Paragraph position="2"> Table 5 reports the word and sentence error rates for five different models: (i) the trigram model that comes with the lattices, trained on approximately 40M words, with a vocabulary of 20,000; (ii) the best-performing model from Chelba (2000), which was interpolated with the lattice trigram at ,~ = 0.4; (iii) our parsing model, with the same training and vocabulary as the perplexity trials above; (iv) a trigram model with the same training and vocabulary as the parsing model; and (v) no language model at all.</Paragraph>
      <Paragraph position="3"> This last model shows the performance from the acoustic model alone, without the influence of the language model. The log of the language model score is multiplied by the language model (LM) weight when summing the logs of the language and acoustic scores, as a way of increasing the relative contribution of the language model to the composite score. We followed Chelba (2000) in using an LM weight of 16 for the lattice trigram. For our model and the Treebank trigram model, the LM weight that resulted in the lowest error rates is given.</Paragraph>
      <Paragraph position="4"> The small size of our training data, as well as the fact that we are rescoring n-best lists, rather than working directly on lattices, makes comparison with the other models not particularly informative. What is more informative is the difference between our model and the trigram trained on the same amount of data. We achieved an 8.5 percent relative improvement in word error rate, and an 8.3 percent relative improvement in sentence error rate over the Treebank trigram. Interestingly, as mentioned above, interpolating two models together gave no improvement over the better of the two, whether our model was interpolated with the lattice or the Treebank trigram. This  contrasts with our perplexity results reported above, as well as with the recognition experiments in Chelba (2000), where the best results resulted from interpolated models. The point of this small experiment was to see if our parsing model could provide useful information even in the case that recognition errors occur, as opposed to the (generally) fully grammatical strings upon which the perplexity results were obtained.</Paragraph>
      <Paragraph position="5"> As one reviewer pointed out, given that our model relies so heavily on context, it may have difficulty recovering from even one recognition error, perhaps more difficulty than a more locally oriented trigram. While the improvements over the trigram model in these trials are modest, they do indicate that our model is robust enough to provide good information even in the face of noisy input. Future work will include more substantial word recognition experiments.</Paragraph>
    </Section>
    <Section position="8" start_page="271" end_page="271" type="sub_section">
      <SectionTitle>
5.5 Beam Variation
</SectionTitle>
      <Paragraph position="0"> The last set of results that we will present addresses the question of how wide the beam must be for adequate results. The base beam factor that we have used to this point is 10 -11 , which is quite wide. It was selected with the goal of high parser accuracy; but in this new domain, parser accuracy is a secondary measure of performance. To determine the effect on perplexity, we varied the base beam factor in trials on the Chelba and Jelinek corpora, keeping the level of conditioning information constant, and Table 6 shows the results across a variety of factors.</Paragraph>
      <Paragraph position="1"> The parser error, parser coverage, and the uninterpolated model perplexity ()~ = 1) all suffered substantially from a narrower search, but the interpolated perplexity remained quite good even at the extremes. Figure 8 plots the percentage increase in parser error, model perplexity, interpolated perplexity, and efficiency (i.e., decrease in rule expansions per word) as the base beam factor decreased. Note that the model perplexity and parser accuracy are quite similarly affected, but that the interpolated perplexity remained far below the trigram baseline, even with extremely narrow beams.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>