<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1045"> <Title>Calculating the Probability of a Partial Parse of a Sentence</Title> <Section position="3" start_page="0" end_page="237" type="metho"> <SectionTitle> SHIFT-REDUCE PARSING </SectionTitle> <Paragraph position="0"> A bottom-up parser is one which reconstructs parse trees by first constructing parse subtrees over short disjoint segments of the text, then linking these together into a smaller number of larger trees, and so on recursively until a single parse tree emerges, covering the entire text. In this section we study a particular class of bottom-up parsers, called shift-reduce parsers, which conform to the following rules, leading to the reconstruction of a rightmost-first derivation of the sentence being parsed.</Paragraph> <Paragraph position="1"> The parser receives symbols one at a time from left to right, and at each stage of the process the parser's memory contains a sequence of disjoint parse subtrees which completely cover the current input. Roughly speaking, as each new symbol is accepted (or shifted in), the parser decides how to incorporate it into a subtree and perhaps how to link several existing subtrees together (i.e. reduce). The sequence of subtrees in the parser's memory at a given instant is called a parse hypothesis, or a parser stack.</Paragraph> <Paragraph position="2"> To be more precise, here is how a shift-reduce parser updates the current hypothesis into a new one. Consider a parse hypothesis consisting of n subtrees τ_1, ..., τ_n, having root symbols B_1, ..., B_n respectively.</Paragraph> <Paragraph position="3"> The three possible "moves" for reacting to the next input symbol 'z' are listed below.</Paragraph> <Paragraph position="4"> 1. 'z' can be shifted in and declared to be τ_{n+1}.</Paragraph> <Paragraph position="5"> 2. If there is a rule A → B_n in the grammar, then τ_n can be replaced by a new τ_n having A as its root and the old τ_n as the left child of A. (Note that 'z' has not yet been shifted in.) 3. If there is a rule A → B_{n-1} B_n in the grammar, then τ_{n-1} and τ_n can be removed from the hypothesis and a new subtree τ_{n-1} is added, having A as its root, the old τ_{n-1} as the left child of A, and the old τ_n as the right child of A. Again note that 'z' remains to be shifted in.</Paragraph>
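To make the three moves concrete, here is a minimal Python sketch. The Node and Hypothesis classes and all names are our own illustrative assumptions, not anything taken from the paper's implementation.

    # Minimal sketch of the three shift-reduce moves described above.
    class Node:
        def __init__(self, symbol, children=()):
            self.symbol = symbol            # root symbol B_i of the subtree
            self.children = list(children)

    class Hypothesis:
        def __init__(self):
            self.stack = []                 # disjoint subtrees tau_1 ... tau_n

        def shift(self, z):
            # Move 1: shift in terminal 'z' as a new one-node subtree tau_{n+1}.
            self.stack.append(Node(z))

        def reduce_unary(self, A):
            # Move 2: rule A -> B_n; re-root the last subtree under A.
            child = self.stack.pop()
            self.stack.append(Node(A, [child]))

        def reduce_binary(self, A):
            # Move 3: rule A -> B_{n-1} B_n; merge the last two subtrees under A.
            right = self.stack.pop()
            left = self.stack.pop()
            self.stack.append(Node(A, [left, right]))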
<Paragraph position="6"> The "input cycle" of a shift-reduce parser is typically to shift in a new symbol via move 1, use move 2 to give that symbol a nonterminal root, and then to perform some number of moves of type 3. Choosing which (if any) of the allowable type-2 and type-3 moves should be used next in the parse can be quite difficult, but doing so cleverly makes the difference between efficient and inefficient parsing algorithms. When faced with a choice among possible moves, some parsers make a breadth-first search among the possibilities; others use a depth-first scheme, or even something intermediate between these two extremes. We will not be concerned with such schemes here; we concern ourselves only with a probabilistic score for the plausibility of the available choices. (The best use of that score is a study in its own right.) The important fact about shift-reduce parsers, from our point of view, is that they are quite limited in the kind of superstructure they can build above a given set of subtrees. Since new parent nodes can only be generated over the final few subtrees in the hypothesis, one cannot "go back" and make non-final pairs of subtrees into siblings. (A precise result is proved in [3].) Figure 1 shows the necessary superstructure for an n-subtree hypothesis.</Paragraph> <Paragraph position="7"> In this figure, the ellipses represent sequences of zero or more nodes in which each node is the left child of its parent. The diagram is also meant to admit the possibility that A_i is the same node as C_{i-1}. The right children of the nodes labeled C (labeled X), as well as those in the ellipses, are all to be found in the remaining input.</Paragraph> </Section> <Section position="4" start_page="237" end_page="238" type="metho"> <SectionTitle> THE LEFT-EDGE PROCESS </SectionTitle> <Paragraph position="0"> In order to calculate the sum of the probabilities of all complete parse trees that could result from the parser's further processing of a given hypothesis, we must sum across all the possibilities for the As and the Cs in figure 1 (a finite set), as well as summing over the potentially infinite set of sequences of nodes that could be lurking behind the ellipses. This sounds prohibitive, but we are saved by the fact that the sequence of nodes along the left edge of a tree can be analyzed as the output of a Markov process. This fact is implicit in the work on trees and regular sets by [6], and was discovered independently by [2].</Paragraph> <Paragraph position="1"> Happily, this observation leads to a closed-form solution to the problem of calculating all the necessary probabilities.</Paragraph> <Paragraph position="3"> [Figure 1: the necessary superstructure over an n-subtree hypothesis; its leaves are b, c, ..., d, ..., w, ..., x, y, ..., z.]</Paragraph> <Paragraph position="4"> To illustrate the use of Markov chain theory for the left edges of trees, we compute the probability of the event that the left edge of a randomly generated subtree terminates in a specified terminal symbol a, given that the root is a specified nonterminal symbol A. This event is the disjoint union, over all n ≥ 1, of the events that a is the nth symbol in the left-edge sequence. Correspondingly, we want the sum Σ_{n≥1} P(the nth left-edge symbol is a | the root is A), which is the sum of the (A, a)th entries of M, M^2, M^3, etc., which is in turn the (A, a)th entry of M + M^2 + M^3 + ⋯. As it turns out, this matrix sum converges; it is equal to the (A, a)th entry of M(I − M)^{-1}, where M is the matrix of one-step transition probabilities along the left edge of the tree.</Paragraph> <Paragraph position="6"> As another illustration, we compute the probability that the left edge of a subtree T terminates in some specific subtree τ, again given that the root of T is A. More precisely, we compute the conditional probability that the subtree τ appears as a subtree of T, with its root B somewhere in the left edge of T, given that the root of T is A. This is the disjoint union of the events, as n varies, that B is the nth symbol in the left edge of T and that τ then appears rooted at this B.</Paragraph> <Paragraph position="7"> If (just for a moment) we exclude the possibility that τ is identical to T, then n must be at least 1. For each n ≥ 1, the conditional probability that τ appears rooted at the nth symbol B is P(τ|B) multiplied by the (A, B)th entry of the nth power of M. In this case we can find, much as in the preceding illustration, that the sum of these probabilities from n = 1 to infinity is P(τ|B) × the (A, B)th entry of M(I − M)^{-1}.</Paragraph> <Paragraph position="8"> To include the possibility that τ is identical to T, we must add the term P(τ|B) × P(A = B).</Paragraph> <Paragraph position="9"> Since the second factor is one or zero depending on whether B = A, the sum of probabilities over all n ≥ 0 is P(τ|B) × the (A, B)th entry of [I + M(I − M)^{-1}], which simplifies to P(τ|B) × the (A, B)th entry of (I − M)^{-1}. (1)</Paragraph>
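As a concrete illustration of equation (1), here is a small numpy sketch. The toy grammar and all names are invented for illustration; M is built as the matrix of one-step left-edge transition probabilities described above.

    import numpy as np

    # Toy PCFG (invented for illustration):
    #   S -> NP VP (1.0), NP -> DT NN (0.6), NP -> NP PP (0.4)
    # M[A, B] = total probability of rules A -> B ... ,
    # i.e. one step from A to B along a left edge.
    syms = ["S", "NP", "DT"]          # symbols reachable on left edges
    ix = {s: i for i, s in enumerate(syms)}

    M = np.zeros((3, 3))
    M[ix["S"],  ix["NP"]] = 1.0
    M[ix["NP"], ix["DT"]] = 0.6
    M[ix["NP"], ix["NP"]] = 0.4

    R = np.linalg.inv(np.eye(3) - M)  # (I - M)^{-1} = I + M + M^2 + ...

    # Probability that the left edge below S terminates in DT:
    print(R[ix["S"], ix["DT"]])       # 0.6 * (1 + 0.4 + 0.4^2 + ...) = 1.0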
<Paragraph position="10"> In order to calculate the probability of the set of parse trees which might complete a given parse hypothesis, we will need formulas like these, but with the proviso that we must specify the rule used to generate the root of τ from its parent. So, to calculate all the probabilities that could ever arise from the ellipses, we have the work of inverting a rather large, but rather sparse, matrix. This work is done when the rule probabilities are decided upon, and before any sentences are parsed. The size of the matrix depends on the number of symbols (terminals and nonterminals) in the grammar.</Paragraph> </Section> <Section position="5" start_page="238" end_page="239" type="metho"> <SectionTitle> THE PROBABILITY CALCULATION </SectionTitle> <Paragraph position="0"> The probability calculation must be divided into two cases. In one case we are in the midst of processing input and do not know how many input symbols (if any) remain to be processed. In the second case we know that all input symbols have been processed. This second case is special because it implies that the only unknown events are which rules are to be used to link up the subtrees to the root; the summation down the left edges of subtrees is no longer needed.</Paragraph> <Paragraph position="1"> When in the Midst of the Input. When there may be more input to be processed, the probability of a parser hypothesis with only one subtree is given exactly by equation (1), with the start symbol of the grammar, S, taking the place of the symbol A in the formula.</Paragraph> <Paragraph position="2"> For hypotheses with n > 1 subtrees we need to take the A and C nodes from figure 1 into account. To calculate the probability of a parser hypothesis with n subtrees τ_1, ..., τ_n having root nodes B_1, ..., B_n, we keep track of which rule is used to generate each B_i; this defines the necessary relationships among the various A_i, B_i, and C_i in figure 1. To perform our calculation we need the following matrices: Q, a {symbols} × {rules} matrix whose (A, r)th entry is P(r) if rule r is A → BC for some B, C, and zero otherwise; Z, a {rules} × {symbols} matrix whose (r, C)th entry is 1 if rule r is A → BC for some A, B, and zero otherwise; and M_0 = (I − M)^{-1}, for M as defined above.</Paragraph> <Paragraph position="4"> The probability calculation requires the following four steps: 1. Compute v*_1 = the Sth row of the product M_0 Q. 2. Zero out all entries except those corresponding to rules which have B_1 as a left child, and call the result v̄_1. 3. For i = 2, ..., n, compute the product v*_i = v̄_{i-1} Z M_0 Q; zero out all entries except those corresponding to rules which have B_i as a left child, and call the result v̄_i. 4. Construct a final vector V̄_n by zeroing out all entries of v*_n except those corresponding to rules which have B_n as a right child. The desired probability is the sum of the entries in v̄_n and V̄_n, multiplied by the conditional probability of the subtrees already constructed: (Σ_r v̄_n(r) + Σ_r V̄_n(r)) × Π_{i=1}^{n} P(τ_i | B_i).</Paragraph>
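The four steps can be traced in code. Below is a hedged numpy sketch under a toy grammar; the rule list, hypothesis_prob, and the helper names are illustrative assumptions, with Q, Z, and M_0 built as defined above.

    import numpy as np

    # Toy binary PCFG (invented): (A, B, C, p) means A -> B C with prob p.
    rules = [("S", "NP", "VP", 1.0), ("NP", "DT", "NN", 0.6),
             ("NP", "NP", "PP", 0.4), ("VP", "V", "NP", 1.0),
             ("PP", "P", "NP", 1.0)]
    syms = sorted({s for r in rules for s in r[:3]})
    si = {s: i for i, s in enumerate(syms)}
    ns, nr = len(syms), len(rules)

    M = np.zeros((ns, ns))                  # one-step left-edge transitions
    for A, B, C, p in rules:
        M[si[A], si[B]] += p
    M0 = np.linalg.inv(np.eye(ns) - M)      # M0 = (I - M)^{-1}

    Q = np.zeros((ns, nr))                  # Q[A, r] = P(r) if r is A -> B C
    Z = np.zeros((nr, ns))                  # Z[r, C] = 1 if r is A -> B C
    for j, (A, B, C, p) in enumerate(rules):
        Q[si[A], j] = p
        Z[j, si[C]] = 1.0

    def keep_left(v, B):   # zero all entries except rules with B as left child
        return v * np.array([1.0 if r[1] == B else 0.0 for r in rules])

    def keep_right(v, B):  # zero all entries except rules with B as right child
        return v * np.array([1.0 if r[2] == B else 0.0 for r in rules])

    def hypothesis_prob(roots, subtree_probs):
        """Mid-input probability of a hypothesis with subtree roots
        B_1 ... B_n, following the four steps (n = 1 would instead use
        equation (1) directly)."""
        vstar = (M0 @ Q)[si["S"], :]        # step 1: S-th row of M0 Q
        vbar = keep_left(vstar, roots[0])   # step 2
        for B in roots[1:]:                 # step 3
            vstar = vbar @ Z @ M0 @ Q
            vbar = keep_left(vstar, B)
        Vbar = keep_right(vstar, roots[-1]) # step 4
        return (vbar.sum() + Vbar.sum()) * np.prod(subtree_probs)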
<Paragraph position="7"> When at the End of the Input. If there is no more input and the hypothesis has only one subtree, then either the root of the subtree is the start symbol of the grammar, in which case the hypothesis has yielded a well-formed sentence with probability P(τ|S), or the hypothesis must be abandoned, since it has not yielded a sentence and no further changes to it are possible.</Paragraph> <Paragraph position="8"> Things are more interesting if the hypothesis contains more than one subtree. Consider a parser hypothesis H consisting of n > 1 subtrees τ_1 through τ_n with root symbols B_1 through B_n respectively, with all of B_1 through B_n being nonterminal symbols. Suppose that the leaves of these subtrees exhaust the input, so no further shift operations are possible for the parser. For each nonterminal B, let M_B be the {symbols} × {symbols} matrix whose (A, C)th entry is the probability P(A → BC) if A → BC is a rule of the grammar, and zero otherwise. Also, for each pair of nonterminals B, C, let F_BC be the column vector, indexed by nonterminals, whose Ath entry is P(A → BC) if A → BC is a rule of the grammar, and zero otherwise. Let V_S be a row vector indexed by nonterminals with a 1 in the entry for S and zeros elsewhere.</Paragraph> <Paragraph position="9"> Then, for n > 1, the probability of the hypothesis is equal to V_S M_{B_1} M_{B_2} ⋯ M_{B_{n-2}} F_{B_{n-1} B_n} × Π_{i=1}^{n} P(τ_i | B_i).</Paragraph>
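The end-of-input product translates directly into code. A minimal numpy sketch under the same style of toy grammar; end_of_input_prob and the helpers are illustrative assumptions.

    import numpy as np

    # Toy grammar, invented for illustration.
    rules = [("S", "NP", "VP", 1.0), ("NP", "DT", "NN", 0.6),
             ("NP", "NP", "PP", 0.4)]
    syms = sorted({s for r in rules for s in r[:3]})
    si = {s: i for i, s in enumerate(syms)}
    ns = len(syms)

    def M_of(B):                      # M_B[A, C] = P(A -> B C)
        out = np.zeros((ns, ns))
        for A, L, R, p in rules:
            if L == B:
                out[si[A], si[R]] = p
        return out

    def F_of(B, C):                   # F_BC[A] = P(A -> B C)
        out = np.zeros(ns)
        for A, L, R, p in rules:
            if L == B and R == C:
                out[si[A]] = p
        return out

    def end_of_input_prob(roots, subtree_probs):
        """V_S M_{B_1} ... M_{B_{n-2}} F_{B_{n-1} B_n} x prod P(tau_i | B_i),
        for a hypothesis with n > 1 subtree roots and no input left."""
        V = np.zeros(ns)
        V[si["S"]] = 1.0              # the row vector V_S
        for B in roots[:-2]:
            V = V @ M_of(B)
        return (V @ F_of(roots[-2], roots[-1])) * np.prod(subtree_probs)

    # With roots ["NP", "VP"] the M-chain is empty and F_{NP,VP}[S] = 1.0, so
    # end_of_input_prob(["NP", "VP"], [0.3, 0.5]) == 0.15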
<Section position="1" start_page="238" end_page="239" type="sub_section"> <SectionTitle> Programming Considerations </SectionTitle> <Paragraph position="0"> There are several problems in making a practical parser based on the probabilities calculated above. First, we must invert the rather large matrix I − M, and then, for each parse hypothesis, we must perform two or three matrix operations for each subtree of the hypothesis. This is not actually as bad as it seems.</Paragraph> <Paragraph position="1"> First note that we can absorb the two matrix operations per subtree into one by precomputing M_0 Q. If we use this as our "in-core" matrix, we can reproduce M_0 when needed (for the n = 1 computations) by summing across the relevant rules.</Paragraph> <Paragraph position="2"> Next, note that the vector by which we are premultiplying is very sparse, since the preceding step was to zero out all entries of the vector that have the "wrong" left child. This means that only a few rows of the big M_0 Q matrix concern us.</Paragraph> <Paragraph position="3"> Also note that immediately after we calculate the vector result, we will again zero out entries with the "wrong" left child, so we really only need to calculate those few entries in the result vector that have the desired left child. This reduces the matrix operation to a much lower order, say 5 × 5: the size of the calculation is determined by how many rules have a given nonterminal as a left child, and a grammar will be easy to parse with this method if each nonterminal appears as the left child in only a few rules.</Paragraph> <Paragraph position="4"> Finally, we note that each parse step can only create one new subtree, and that at the end of the hypothesis. So, if we remember the vector associated with each subtree as we make it, we need to do only one of these order-5 × 5 calculations to get the probability of the new hypothesis.</Paragraph> </Section> </Section> <Section position="6" start_page="239" end_page="239" type="metho"> <SectionTitle> BUILDING A PARSER </SectionTitle> <Paragraph position="0"> One might consider implementing the above probability calculation in conjunction with some conventional shift-reduce parser. In this case one would let the LR(0) parser suggest possibilities for updating a given parse hypothesis, and use the above scheme to compute probabilities and discard unpromising hypotheses.</Paragraph> <Paragraph position="2"> It is worth pointing out that all the information needed for LR(0) parsing can in fact be reproduced from the probability vectors we calculate. Hence we do not really need to construct such a parser at all! The point is that, starting from a particular hypothesis, a given proposal for a next move leads to a nonzero probability if and only if there is some completion of the input that would not "crash" the conventional parser. The vectors v̄_i and V̄_i contain all the information we could desire about the next step for the parser.</Paragraph> <Paragraph position="3"> Finally, let us remark that our matrix calculations can be adapted to yield a shift-reduce parser even when no probabilities are initially present. We simply replace the transition matrix M with a suitably scaled incidence matrix M′, in which M′(A, B) = c, for a suitable constant c, if B is the left child of A via some rule, and M′(A, B) = 0 otherwise. A similar replacement is made for the matrix Q. The specific values of the "probabilities" then arising from our calculations do not matter, only whether or not they are zero. Thus, the off-line construction of parser tables could be accomplished via a matrix inversion, rather than via the conventional recursive calculations.</Paragraph> </Section>
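To make the last remark concrete, here is a hedged numpy sketch of the non-probabilistic variant; the toy grammar and the particular scale constant c are our own assumptions, chosen only so that I − M′ is invertible.

    import numpy as np

    # Replace M with a scaled incidence matrix M' and test entries of
    # (I - M')^{-1} for nonzero-ness; only zero vs. nonzero matters.
    rules = [("S", "NP", "VP"), ("NP", "DT", "NN"), ("NP", "NP", "PP")]
    syms = sorted({s for r in rules for s in r})
    si = {s: i for i, s in enumerate(syms)}
    ns = len(syms)

    c = 1.0 / (ns + 1)               # small enough that I - M' is invertible
    Mp = np.zeros((ns, ns))
    for A, B, _ in rules:
        Mp[si[A], si[B]] = c         # B is a left child of A via some rule

    R = np.linalg.inv(np.eye(ns) - Mp)

    # DT can begin a left-corner chain under S iff the entry is nonzero:
    print(R[si["S"], si["DT"]] != 0)  # True

</Paper>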