<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0302"> <Title>Global Thresholding and Multiple-Pass Parsing*</Title> <Section position="3" start_page="11" end_page="12" type="metho"> <SectionTitle> 2 Beam Thresholding </SectionTitle> <Paragraph position="0"> The first, and simplest, technique we will examine is beam thresholding. While this technique is used as part of many search algorithms, beam thresholding with PCFGs is most similar to beam thresholding as used in speech recognition. Beam thresholding is often used in statistical parsers, such as that of Collins (1996).</Paragraph> <Paragraph position="1"> Consider a nonterminal X in a cell covering the span of terminals t_j...t_k. We will refer to this as node N^X_{j,k}, since it corresponds to a potential node in the final parse tree. Recall that in beam thresholding, we compare nodes N^X_{j,k} and N^Y_{j,k} covering the same span. If one node is much more likely than the other, then it is unlikely that the less probable node will be part of the correct parse, and we can remove it from the chart, saving time later.</Paragraph> <Paragraph position="2"> There is a subtlety about what it means for a node N^X_{j,k} to be more likely than some other node. According to folk wisdom, the best way to measure the likelihood of a node N^X_{j,k} is to use the probability that the nonterminal X generates the span t_j...t_k, called the inside probability. Formally, we write this as P(X =>* t_j...t_k), and denote it by β(N^X_{j,k}). However, this does not give information about the probability of the node in the context of the full parse tree. For instance, two nodes, one an NP and the other a FRAG (fragment), may have equal inside probabilities, but since there are far more NPs than there are FRAG clauses, the NP node is more likely overall.</Paragraph> <Paragraph position="3"> Therefore, we must consider more information than just the inside probability.</Paragraph> <Paragraph position="4"> The outside probability of a node N^X_{j,k} is the probability of that node given the surrounding terminals of the sentence, i.e. P(S =>* t_1...t_{j-1} X t_{k+1}...t_n), which we denote by α(N^X_{j,k}). Ideally, we would multiply the inside probability by the outside probability, and normalize. This product would give us the overall probability that the node is part of the correct parse. Unfortunately, there is no good way to quickly compute the outside probability of a node during bottom-up chart parsing (although it can be efficiently computed afterwards). Thus, we instead multiply the inside probability simply by the prior probability of the nonterminal type, P(X), which is an approximation to the outside probability. Our final thresholding measure is P(X) × β(N^X_{j,k}). In Section 7.4, we will show experiments comparing inside-probability beam thresholding to beam thresholding using the inside probability times the prior. Using the prior can lead to a speedup of up to a factor of 10 at the same performance level.</Paragraph> <Paragraph position="5"> To the best of our knowledge, using the prior probability in beam thresholding is new, although not particularly insightful on our part.</Paragraph>
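<Paragraph> To make the thresholding measure concrete, the following short Python sketch prunes one chart cell using the prior times the inside probability; the data structures, names, and beam parameter are our own illustration, not the paper's implementation.
# Minimal sketch of beam thresholding with a prior (illustrative only).
def beam_threshold_cell(cell, prior, beam):
    # cell  : dict mapping nonterminal X to inside probability beta(N^X_{j,k})
    # prior : dict mapping nonterminal X to prior probability P(X)
    # beam  : ratio such as 1e-4; smaller values prune less
    if not cell:
        return cell
    score = {X: prior[X] * inside for X, inside in cell.items()}
    best = max(score.values())
    # Keep only nodes whose P(X) * beta score is within a factor of `beam` of the best.
    return {X: cell[X] for X in cell if score[X] >= beam * best}

# Example: an NP and a FRAG with equal inside probabilities; the prior
# keeps the NP and removes the FRAG.
cell = {"NP": 1e-6, "FRAG": 1e-6}
prior = {"NP": 0.25, "FRAG": 0.001}
print(beam_threshold_cell(cell, prior, beam=0.01))   # {'NP': 1e-06}
</Paragraph>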
<Paragraph position="6"> Collins (personal communication) independently observed the usefulness of this modification, and Caraballo and Charniak (1996) used a related technique in a best-first parser. We think that the main reason this technique was not used sooner is that beam thresholding for PCFGs is derived from beam thresholding in speech recognition using Hidden Markov Models (HMMs). In an HMM, the forward probability of a given state corresponds to the probability of reaching that state from the start state. The probability of eventually reaching the final state from any state is always 1. Thus, the forward probability is all that is needed. The same is true in some top-down probabilistic parsing algorithms, such as stochastic versions of Earley's algorithm (Stolcke, 1993). However, in a bottom-up algorithm, we need the extra factor that indicates the probability of getting from the start symbol to the nonterminal in question, which we approximate by the prior probability. As we noted, this can be very different for different nonterminals.</Paragraph> </Section> <Section position="4" start_page="12" end_page="13" type="metho"> <SectionTitle> 3 Global Thresholding </SectionTitle> <Paragraph position="0"> As mentioned earlier, the problem with beam thresholding is that it can only threshold out the worst nodes of a cell. It cannot threshold out an entire cell, even if there are no good nodes in it. To remedy this problem, we introduce a novel thresholding technique, global thresholding.</Paragraph> <Paragraph position="1"> The key insight of global thresholding is due to Rayner and Carter (1996). Rayner et al. noticed that a particular node cannot be part of the correct parse if there are no nodes in adjacent cells. In fact, it must be part of a sequence of nodes stretching from the start of the string to the end. In a probabilistic framework, where almost every node will have some (possibly very small) probability, we can rephrase this requirement as being that the node must be part of a reasonably probable sequence.</Paragraph> <Paragraph position="2"> Figure 2 shows an example of this insight. Nodes A, B, and C will not be thresholded out, because each is part of a sequence from the beginning to the end of the chart. On the other hand, nodes X, Y, and Z will be thresholded out, because none is part of such a sequence.</Paragraph> <Paragraph position="3"> Rayner et al. used this insight for a hierarchical, non-recursive grammar, and only used their technique to prune after the first level of the grammar. They computed a score for each sequence as the minimum of the scores of each node in the sequence, and computed a score for each node in the sequence as the minimum of three scores: one based on statistics about nodes to the left, one based on nodes to the right, and one based on unigram statistics.</Paragraph> <Paragraph position="4"> We wanted to extend the work of Rayner et al. to general PCFGs, including those that are recursive.</Paragraph> <Paragraph position="5"> Our approach therefore differs from theirs in many ways. Rayner et al. ignore the inside probabilities of nodes; while this may work after processing only the first level of a grammar, when the inside probabilities will be relatively homogeneous, it could cause problems after other levels, when the inside probability of a node will give important information about its usefulness. On the other hand, because long nodes will tend to have low inside probabilities, taking the minimum of all scores strongly favors sequences of short nodes. Furthermore, their algorithm requires time O(n^3) to run just once. This is acceptable if the algorithm is run only after the first level, but running it more often would lead to an overall run time of O(n^4).
Finally, we hoped to find an algorithm that was somewhat less heuristic in nature.</Paragraph> <Paragraph position="7"> Our global thresholding technique thresholds out node N if the ratio between the most probable sequence of nodes including node N and the overall most probable sequence of nodes is less than some threshold, T_G. Formally, denoting sequences of nodes by L, we threshold node N if T_G × max_L P(L) > max_{L | N in L} P(L). Now, the hard part is determining P(L), the probability of a node sequence. Unfortunately, there is no way to do this efficiently as part of the intermediate computation of a bottom-up chart parser. (Footnote 1: Top-down parsers, such as stochastic versions of Earley parsers (Stolcke, 1993), efficiently compute related probabilities, but we won't explore these parsers here. We confess that our real interest is in more complicated grammars, such as those that use head words. Grammars such as these can best be parsed bottom up.) We will approximate P(L) as follows: P(L) ≈ ∏_i P(X_i) × β(L_i), where X_i is the nonterminal of the i-th node L_i in the sequence.</Paragraph> <Paragraph position="10"> That is, we assume independence between the elements of a sequence. The probability of node L_i = N^X_{j,k} is just its prior probability times its inside probability, as before.</Paragraph> <Paragraph position="11"> The most important difference between global thresholding and beam thresholding is that global thresholding is global: any node in the chart can help prune out any other node. In stark contrast, beam thresholding only compares nodes to other nodes covering the same span. Beam thresholding typically allows tighter thresholds since there are fewer approximations, but does not benefit from global information.</Paragraph> <Section position="1" start_page="13" end_page="13" type="sub_section"> <SectionTitle> 3.1 Global Thresholding Algorithm </SectionTitle> <Paragraph position="0"> Global thresholding is performed in a bottom-up chart parser immediately after each length is completed. It thus runs n times during the course of parsing a sentence of length n.</Paragraph> <Paragraph position="1"> We use the simple dynamic programming algorithm in Figure 3. There are O(n^2) nodes in the chart, and each node is examined exactly three times, so the run time of this algorithm is O(n^2).</Paragraph> <Paragraph position="2"> The first section of the algorithm works forwards, computing, for each i, f[i], which contains the score of the best sequence covering terminals t_1...t_{i-1}.</Paragraph> <Paragraph position="3"> Thus f[n+1] contains the score of the best sequence covering the whole sentence, max_L P(L). The algorithm works analogously to the Viterbi algorithm for HMMs. The second section is analogous, but works backwards, computing b[i], which contains the score of the best sequence covering terminals t_i...t_n.</Paragraph> <Paragraph position="4"> Once we have computed the preceding arrays, computing max_{L | N in L} P(L) is straightforward. We simply want the score of the best sequence covering the nodes to the left of N, f[N_start], times the score of the node itself, times the score of the best sequence of nodes from N_start + N_length to the end, which is just b[N_start + N_length]. Using this expression, we can threshold each node quickly.</Paragraph>
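<Paragraph> As an illustration of the dynamic program of Figure 3, here is a minimal Python sketch of the forward and backward passes and the resulting test; it assumes the chart is available as a map from (start, length) spans to the best node score for that span (prior times inside probability), and all names and the data layout are ours rather than the paper's.
# Sketch of the global thresholding dynamic program (illustrative only).
def global_threshold(chart, n, t_g):
    # chart : dict mapping (start, length) to the best node score for that span
    # n     : sentence length (terminals t_1...t_n, 1-based spans)
    # t_g   : global threshold ratio in (0, 1]; larger values prune more
    # Forward pass: f[i] holds the score of the best sequence covering t_1...t_{i-1}.
    f = [0.0] * (n + 2)
    f[1] = 1.0
    for i in range(1, n + 1):
        for length in range(1, n - i + 2):
            score = chart.get((i, length), 0.0)
            if f[i] * score > f[i + length]:
                f[i + length] = f[i] * score
    # Backward pass: b[i] holds the score of the best sequence covering t_i...t_n.
    b = [0.0] * (n + 2)
    b[n + 1] = 1.0
    for i in range(n, 0, -1):
        for length in range(1, n - i + 2):
            score = chart.get((i, length), 0.0)
            if score * b[i + length] > b[i]:
                b[i] = score * b[i + length]
    best_overall = f[n + 1]   # approximates the score of the best overall sequence

    def keep(start, length, node_score):
        # Best sequence through this node versus the best sequence overall.
        return f[start] * node_score * b[start + length] >= t_g * best_overall
    return keep
After each length is completed, a parser would call keep(start, length, score) for every node of that length and discard those for which it returns False.
</Paragraph>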
<Paragraph position="5"> Since this algorithm is run n times during the course of parsing, and requires time O(n^2) each time it runs, the algorithm requires time O(n^3) overall.</Paragraph> <Paragraph position="6"> Experiments will show that the time it saves easily outweighs the time it uses.</Paragraph> </Section> </Section> <Section position="5" start_page="13" end_page="14" type="metho"> <SectionTitle> 4 Multiple-Pass Parsing </SectionTitle> <Paragraph position="0"> In this section, we discuss a novel thresholding technique, multiple-pass parsing. We show that multiple-pass parsing techniques can yield large speedups. Multiple-pass parsing is a variation on a new technique in speech recognition, multiple-pass speech recognition (Zavaliagkos et al., 1994), which we introduce first.</Paragraph> <Section position="1" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 4.1 Multiple-Pass Speech Recognition </SectionTitle> <Paragraph position="0"> In an idealized multiple-pass speech recognizer, we first run a simple pass, computing the forward and backward probabilities. This first pass runs relatively quickly. We can use information from this simple, fast first pass to eliminate most states, and then run a more complicated, slower second pass that does not examine states that were deemed unlikely by the first pass. The extra time of running two passes is more than made up for by the time saved in the second pass.</Paragraph> <Paragraph position="1"> The mathematics of multiple-pass recognition is fairly simple. In the first, simple pass, we record the forward probabilities, α(S^i_t), and backward probabilities, β(S^i_t), of each state i at each time t. Now, α(S^i_t) × β(S^i_t), normalized by the total probability of the acoustics, gives the overall probability of being in state i at time t given the acoustics. Our second pass will use an HMM whose states are analogous to the first pass HMM's states. If a first pass state at some time is unlikely, then the analogous second pass state is probably also unlikely, so we can threshold it out.</Paragraph> <Paragraph position="2"> There are a few complications to multiple-pass recognition. First, storing all the forward and backward probabilities can be expensive. Second, the second pass is more complicated than the first, typically meaning that it has more states. So the mapping between states in the first pass and states in the second pass may be non-trivial. To solve both these problems, only states at word transitions are saved.</Paragraph> <Paragraph position="3"> That is, from pass to pass, only information about where words are likely to start and end is used for thresholding.</Paragraph> </Section> <Section position="2" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 4.2 Multiple-Pass Parsing </SectionTitle> <Paragraph position="0"> We can use an analogous algorithm for multiple-pass parsing. In particular, we can use two grammars, one fast and simple and the other slower, more complicated, and more accurate. Rather than using the forward and backward probabilities of speech recognition, we use the analogous inside and outside probabilities, β(N^X_{j,k}) and α(N^X_{j,k}) respectively. Remember that α(N^X_{j,k}) × β(N^X_{j,k}), normalized by the probability of the sentence, is the probability that N^X_{j,k} is in the correct parse (given, as always, the model and the string). Thus, we run our first pass, computing this expression for each node. We can then eliminate from consideration in our later passes all nodes for which the probability of being in the correct parse was too small in the first pass.</Paragraph>
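<Paragraph> A minimal sketch of this first-pass pruning step, assuming inside and outside probabilities have already been computed for every first-pass node; the representation and names are our own, not the paper's.
# Sketch of first-pass pruning for multiple-pass parsing (illustrative only).
# inside, outside : dicts mapping (nonterminal, start, end) to probabilities
#                   computed by the simple first pass
# sentence_prob   : inside probability of the start symbol over the whole sentence
def surviving_nodes(inside, outside, sentence_prob, threshold):
    keep = set()
    for node, beta in inside.items():
        alpha = outside.get(node, 0.0)
        posterior = alpha * beta / sentence_prob   # P(node is in the correct parse)
        if posterior >= threshold:
            keep.add(node)
    return keep
# The second pass then only builds nodes whose first pass ancestor, under the
# descendants mapping described next, is in this surviving set.
</Paragraph>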
<Paragraph position="1"> Of course, for our second pass to be more accurate, it will probably be more complicated, typically containing an increased number of nonterminals and productions. Thus, we create a mapping function from each first pass nonterminal to a set of second pass nonterminals, and threshold out those second pass nonterminals that map from low-scoring first pass nonterminals. We call this mapping function the descendants function. (Footnote 2: We assume that each second pass nonterminal can descend from at most one first pass nonterminal in each cell. The grammars used here have this property. If this assumption is violated, multiple-pass parsing is still possible, but some of the algorithms need to be changed.) There are many possible examples of first and second pass combinations. For instance, the first pass could use regular nonterminals, such as NP and VP, and the second pass could use nonterminals augmented with head-word information. The descendants function then appends the possible head words to the first pass nonterminals to get the second pass ones.</Paragraph> <Paragraph position="4"> Even though the correspondence between forward/backward and inside/outside probabilities is very close, there are important differences between speech-recognition HMMs and natural-language processing PCFGs. In particular, we have found that it is more important to threshold productions than nonterminals. That is, rather than just noticing that a particular nonterminal VP spanning the words &quot;killed the rabbit&quot; is very likely, we also note that the production VP -> V NP (and the relevant spans) is likely.</Paragraph> <Paragraph position="5"> Both the first and second pass parsing algorithms are simple variations on CKY parsing. In the first pass, we now keep track of each production instance associated with a node, i.e. N^X_{i,j} -> N^Y_{i,k} N^Z_{k+1,j}, computing the inside and outside probabilities of each. The second pass requires more changes. Let us denote the descendants of nonterminal X by X_1, ..., X_{d_X}.</Paragraph> <Paragraph position="6"> For each production instance of the form N^X_{i,j} -> N^Y_{i,k} N^Z_{k+1,j} in the first pass that wasn't thresholded out by multi-pass thresholding, beam thresholding, etc., we consider every descendant production instance, that is, all those of the form N^{X_p}_{i,j} -> N^{Y_q}_{i,k} N^{Z_r}_{k+1,j}, for appropriate values of p, q, and r. This algorithm is given in Figure 4, which uses a current pass matrix, Chart, to keep track of nonterminals in the current pass, and a previous pass matrix, PrevChart, to keep track of nonterminals in the previous pass. We use one additional optimization, keeping track of the descendants of each nonterminal in each cell in PrevChart which are in the corresponding cell of Chart.</Paragraph> <Paragraph> [Figure 4, recovered fragment:
for length := 2 to n
  for start := 1 to n - length + 1
    ...
      for each LeftNodePrev in LeftPrev
        for each production instance Prod from LeftNodePrev of size length
          for each descendant L of Prod_Left
            for each descendant R of Prod_Right
              for each descendant P of Prod_Parent such that P -> L R
                ...]
</Paragraph> <Paragraph position="7"> We tried multiple-pass thresholding in two different ways. In the first technique we tried, production-instance thresholding, we remove from consideration in the second pass the descendants of all production instances whose combined inside-outside probability falls below a threshold. In the second technique, node thresholding, we remove from consideration the descendants of all nodes whose inside-outside probability falls below a threshold. In our pilot experiments, we found that in some cases one technique works slightly better, and in some cases the other does. We therefore ran our experiments using both thresholds together.</Paragraph>
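<Paragraph> To make the second pass bookkeeping concrete, here is a small Python sketch of a descendants function and the corresponding expansion of surviving first pass production instances; the head-word representation and all names are our own illustrative assumptions, not the paper's code.
# Sketch of the descendants mapping and second pass filtering (illustrative only).
def descendants(nonterminal, head_words):
    # Second pass nonterminals augment each first pass nonterminal with a head
    # word, as in the example in the text; the representation is our own.
    return {(nonterminal, head) for head in head_words}

def allowed_second_pass_productions(surviving_instances, head_words):
    # surviving_instances: set of first pass production instances
    # (parent, left, right, i, k, j) that were not thresholded out.
    allowed = set()
    for (parent, left, right, i, k, j) in surviving_instances:
        for parent_d in descendants(parent, head_words):
            for left_d in descendants(left, head_words):
                for right_d in descendants(right, head_words):
                    # A real implementation would also check that
                    # parent_d -> left_d right_d is a production of the
                    # second pass grammar before adding it.
                    allowed.add((parent_d, left_d, right_d, i, k, j))
    return allowed
</Paragraph>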
<Paragraph position="8"> One nice feature of multiple-pass parsing is that under special circumstances, it is an admissible search technique, meaning that we are guaranteed to find the best solution with it. In particular, if we parse using no thresholding, and our grammars have the property that for every non-zero probability parse in the second pass, there is an analogous non-zero probability parse in the first pass, then multiple-pass search is admissible. Under these circumstances, no non-zero probability parse will be thresholded out, but many zero probability parses may be removed from consideration. While we will almost always wish to parse using thresholds, it is nice to know that multiple-pass parsing can be seen as an approximation to an admissible technique, where the degree of approximation is controlled by the thresholding parameter.</Paragraph> </Section> </Section> <Section position="6" start_page="14" end_page="17" type="metho"> <SectionTitle> 5 Multiple Parameter Optimization </SectionTitle> <Paragraph position="0"> The use of any one of these techniques does not exclude the use of the others. There is no reason that we cannot use beam thresholding, global thresholding, and multiple-pass parsing all at the same time. In general, it wouldn't make sense to use a technique such as multiple-pass parsing without other thresholding techniques; our first pass would be overwhelmingly slow without some sort of thresholding. There are, however, some practical considerations. To optimize a single threshold, we could simply sweep our parameters over a one-dimensional range, and pick the best speed versus performance tradeoff. In combining multiple techniques, we need to find optimal combinations of thresholding parameters. Rather than having to examine 10 values in a one-dimensional space, we might have to examine 100 combinations in a two-dimensional space.</Paragraph> <Paragraph position="1"> Later, we show experiments with up to six thresholds. Since we don't have time to parse with one million parameter combinations, we need a better search algorithm.</Paragraph> <Paragraph position="2"> Ideally, we would like to be able to pick a performance level (in terms of either entropy or precision and recall) and find the best set of thresholds for achieving that performance level as quickly as possible. If this is our goal, then a normal gradient descent technique won't work, since we can't use such a technique to optimize one function of a set of variables (time as a function of thresholds) while holding another one constant (performance).[3] We wanted a metric of performance which would be sensitive to changes in threshold values. In particular, our ideal metric would be strictly increasing as our thresholds loosened, so that every loosening of threshold values would produce a measurable increase in performance. The closer we get to this ideal, the fewer sentences we need to test during parameter optimization.</Paragraph> <Paragraph position="3"> We tried an experiment in which we ran beam thresholding with a tight threshold, and then a loose threshold, on all sentences of section 0 of length < 40.
For this experiment only, we discarded those sentences which could not be parsed with the specified setting of the threshold, rather than retrying with looser thresholds. We then computed, for each of six metrics, how often the metric decreased, stayed the same, or increased for each sentence between the two runs. Ideally, as we loosened the threshold, every sentence should improve on every metric, but in practice that wasn't the case. As can be seen, the inside score was by far the most nearly strictly increasing metric. Therefore, we should use the inside probability as our metric of performance; however, inside probabilities can become very close to zero, so instead we measure entropy, the negative logarithm of the inside probability.</Paragraph> <Paragraph position="4"> We implemented a variation on a steepest descent search technique. We denote the entropy of the sentence after thresholding by E_T. Our search engine is given a target value of E_T to search for, and then tries to find the best combination of parameters that works at approximately this level of performance. At each point, it finds the threshold to change that gives the most &quot;bang for the buck.&quot; It then changes this parameter in the correct direction to move towards the goal entropy (and possibly overshoot it). [Footnote 3: We could use gradient descent to minimize a weighted sum of time and performance, but we wouldn't know at the beginning what performance we would have at the end. If our goal is to have the best performance we can while running in real time, or to achieve a minimum acceptable performance level with as little time as necessary, then a simple gradient descent function wouldn't work as well as our algorithm. Also, for this algorithm (although not for most experiments), our measurement of time was the total number of productions searched, rather than cpu time; we wanted the greater accuracy of measuring productions.]</Paragraph> <Paragraph position="6"> There are two cases. In the first case, if we are currently above the goal entropy, then we loosen our thresholds, leading to slower speed and lower entropy. We then wish to get as much entropy reduction as possible per time increase; that is, we want the steepest slope possible. On the other hand, if we are trying to increase our entropy, we want as much time decrease as possible per entropy increase; that is, we want the flattest slope possible. Because of this difference, we need to compute different ratios depending on which side of the goal we are on.</Paragraph> <Paragraph position="7"> There are several subtleties when thresholds are set very tightly. When we fail to parse a sentence because the thresholds are too tight, we retry the parse with lower thresholds. This can lead to conditions that are the opposite of what we expect; for instance, loosening thresholds may lead to faster parsing, because we don't need to parse the sentence, fail, and then retry with looser thresholds. The full algorithm contains additional checks that our thresholding change had the effect we expected (either increased time for decreased entropy or vice versa). If we get either a change in the wrong direction, or a change that makes everything worse, then we retry with the inverse change, hoping that that will have the intended effect. If we get a change that makes both time and entropy better, then we make that change regardless of the ratio.</Paragraph>
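<Paragraph> The following Python sketch shows one way the &quot;bang for the buck&quot; step could be realized; the measure function, the parameter representation, and the multiplicative step are our own illustrative assumptions rather than the paper's actual code.
# One step of the threshold search (illustrative only). measure(thresholds)
# is assumed to parse the tuning sentences and return (time, entropy).
def search_step(thresholds, measure, target_entropy, factor=16.0):
    base_time, base_entropy = measure(thresholds)
    best_change, best_ratio = None, None
    for name in thresholds:
        # Loosen if we are above the goal entropy, tighten otherwise; whether
        # multiplying by `factor` loosens or tightens depends on how each
        # threshold is parameterized, so this is only a sketch.
        scale = factor if base_entropy > target_entropy else 1.0 / factor
        candidate = dict(thresholds)
        candidate[name] = thresholds[name] * scale
        time, entropy = measure(candidate)
        d_time, d_entropy = time - base_time, entropy - base_entropy
        if not (abs(d_time) > 1e-9 and abs(d_entropy) > 1e-9):
            continue   # denominator too small; the estimate is unreliable
        if base_entropy > target_entropy:
            ratio = -d_entropy / d_time    # entropy reduction per unit of extra time
        else:
            ratio = -d_time / d_entropy    # time reduction per unit of extra entropy
        if best_ratio is None or ratio > best_ratio:
            best_change, best_ratio = candidate, ratio
    return best_change if best_change is not None else thresholds
An outer loop would repeat this step, shrinking the factor according to the annealing schedule described below, until no change improves the tradeoff.
</Paragraph>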
<Paragraph position="8"> Also, we need to check that the denominator when computing the ratio isn't too small. If it is very small, then our estimate may be unreliable, and we don't consider changing this parameter. Finally, the actual algorithm we used also contained a simple &quot;annealing schedule&quot;, in which we slowly decreased the factor by which we changed thresholds. That is, we actually run the algorithm multiple times to termination, first changing thresholds by a factor of 16. After a loop is reached at this factor, we lower the factor to 4, then 2, then 1.414, then 1.15.</Paragraph> <Paragraph position="9"> Note that this algorithm is fairly domain independent. It can be used for almost any statistical parsing formalism that uses thresholds, or even for speech recognition.</Paragraph> </Section> <Section position="7" start_page="17" end_page="17" type="metho"> <SectionTitle> 6 Comparison to Previous Work </SectionTitle> <Paragraph position="0"> Beam thresholding is a common approach. While we don't know of other systems that have used exactly our techniques, our techniques are certainly similar to those of others. For instance, Collins (1996) uses a form of beam thresholding that differs from ours only in that it doesn't use the prior probability of nonterminals as a factor, and Caraballo and Charniak (1996) use a version with the prior, but with other factors as well.</Paragraph> <Paragraph position="1"> Much of the previous related work on thresholding is in the similar area of priority functions for agenda-based parsers. These parsers try to do &quot;best first&quot; parsing, with some function akin to a thresholding function determining what is best. The best comparison of these functions is due to Caraballo and Charniak (1996; 1997), who tried various prioritization methods. Several of their techniques are similar to our beam thresholding technique, and one of their techniques, not yet published (Caraballo and Charniak, 1997), would probably work better.</Paragraph> <Paragraph position="2"> The only technique that Caraballo and Charniak (1996) give that took into account the scores of other nodes in the priority function, the &quot;prefix model,&quot; required O(n^5) time to compute, compared to our O(n^3) system. On the other hand, all nodes in the agenda parser were compared to all other nodes, so in some sense all the priority functions were global.</Paragraph> <Paragraph position="3"> Note that agenda-based PCFG parsers in general require more than O(n^3) run time, because, when better derivations are discovered, they may be forced to propagate improvements to productions that they have previously considered. For instance, if an agenda-based system first computes the probability for a production S -> NP VP, and then later computes some better probability for the NP, it must update the probability for the S as well. This could propagate through much of the chart. To remedy this, Caraballo et al. only propagated probabilities that caused a large enough change (Caraballo and Charniak, 1997). Also, the question of when an agenda-based system should stop is a little-discussed issue, and difficult since there is no obvious stopping criterion.
Because of these issues, we chose not to implement an agenda-based system for comparison.</Paragraph> <Paragraph position="4"> As mentioned earlier, Rayner and Carter (1996) describe a system that is the inspiration for global thresholding. Because of the limitation of their system to non-recursive grammars, and the other differences discussed in Section 3, global thresholding represents a significant improvement.</Paragraph> <Paragraph position="5"> Collins (1996) uses two thresholding techniques. The first of these is essentially beam thresholding without a prior. In the second technique, there is a constant probability threshold. Any nodes with a probability below this threshold are pruned. If the parse fails, parsing is restarted with the constant lowered. We attempted to duplicate this technique, but achieved only negligible performance improvements. Collins (personal communication) reports a 38% speedup when this technique is combined with loose beam thresholding, compared to loose beam thresholding alone. Perhaps our lack of success is due to differences between our grammars, which are fairly different formalisms. When Collins began using a formalism somewhat closer to ours, he needed to change his beam thresholding to take into account the prior, so this is not unlikely. Hwa (personal communication), using a model similar to PCFGs, Stochastic Lexicalized Tree Insertion Grammars, was also unable to obtain a speedup using this technique.</Paragraph> <Paragraph> [Figure fragment (labeled &quot;Algorithm One&quot; in the source): two variants of the CKY inner loop.
for each rule P -> L R
  if nonterminal L in left cell
    if nonterminal R in right cell
      add P to parent cell;
for each nonterminal L in left cell
  for each nonterminal R in right cell
    for each rule P -> L R
      add P to parent cell;]
</Paragraph> <Paragraph position="1"> There is previous work in the speech recognition community on automatically optimizing some parameters (Schwartz et al., 1992). However, this previous work differed significantly from ours both in the techniques used, and in the parameters optimized. In particular, previous work focused on optimizing weights for various components, such as the language model component. In contrast, we optimize thresholding parameters. Previous techniques could not be used or easily adapted to thresholding parameters.</Paragraph> </Section> </Section> </Paper>