<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1070">
  <Title>An alternative method of training probabilistic LR parsers</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 LR parsing
</SectionTitle>
    <Paragraph position="0"> As LR parsing has been extensively treated in existing literature, we merely recapitulate the main definitions here. For more explanation, the reader is referred to standard literature such as (Harrison, 1978; Sippu and Soisalon-Soininen, 1990).</Paragraph>
    <Paragraph position="1"> An LR parser is constructed on the basis of a CFG that is augmented with an additional ruleSy!'S, where S is the former start symbol, and the new nonterminal Sy becomes the start symbol of the augmented grammar. The new terminal ' acts as an imaginary start-of-sentence marker. We denote the set of terminals by and the set of nonterminals by N. We assume each rule has a unique label r.</Paragraph>
    <Paragraph position="2"> As explained before, we construct LR parsers as pushdown transducers. The main stack symbols of these automata are sets of dotted rules, which consist of rules from the augmented grammar with a distinguished position in the right-hand side indicated by a dot ' '. The initial stack symbol is pinit =fSy!' Sg.</Paragraph>
    <Paragraph position="3"> We define the closure of a set p of dotted rules as the smallest set closure(p) such that:  1. p closure(p); and 2. for (B ! A ) 2 closure(p) and A !  a rule in the grammar, also (A ! ) 2 closure(p).</Paragraph>
    <Paragraph position="4"> We define the operation goto on a set p of dotted rules and a grammar symbol X2 [N as:</Paragraph>
    <Paragraph position="6"> The set of LR states is the smallest set such that:  1. pinit is an LR state; and 2. if p is an LR state and goto(p;X) = q6=;, for  some X2 [N, then q is an LR state. We will assume that PDTs consist of three types of transitions, of the form P a;b7!P Q (a push transition), of the form P a;b7!Q (a swap transition), and of the formP Q a;b7!R (a pop transition). HereP, Q and R are stack symbols, a is one input terminal or is the empty string &amp;quot;, and b is one output terminal or is the empty string &amp;quot;. In our notation, stacks grow from left to right, so that P a;b7!P Q means that Q is pushed on top of P. We do not have internal states next to stack symbols.</Paragraph>
    <Paragraph position="7"> For the PDT that implements the LR strategy, the stack symbols are the LR states, plus symbols of the form [p;X], wherepis an LR state andX is a grammar symbol, and symbols of the form (p;A;m), where p is an LR state, A is the left-hand side of some rule, and m is the length of some prefix of the right-hand side of that rule. More explanation on these additional stack symbols will be given below. The stack symbols and transitions are simultaneously defined in Figure 1. The final stack symbol is p nal = (pinit;Sy;0). This means that an input a1 an is accepted if and only if it is entirely read by a sequence of transitions that take the stack consisting only of pinit to the stack consisting only of p nal . The computed output consists of the string of terminals b1 bn0 from the output components of the applied transitions. For the PDTs that we will use, this output string will consist of a sequence of rule labels expressing a right-most derivation of the input. On the basis of the original grammar, the corresponding parse tree can be constructed from such an output string.</Paragraph>
    <Paragraph position="8"> There are a few superficial differences with LR parsing as it is commonly found in the literature.</Paragraph>
    <Paragraph position="9"> The most obvious difference is that we divide reductions into 'binary' steps. The main reason is that this allows tabular interpretation with a time complexity cubic in the length of the input. Otherwise, the time complexity would be O(nm+1), where m is the length of the longest right-hand side of a rule in the CFG. This observation was made before by (Kipps, 1991), who proposed a solution similar to ours, albeit formulated differently. See also a related formulation of tabular LR parsing by (Nederhof and Satta, 1996).</Paragraph>
    <Paragraph position="10"> To be more specific, instead of one step of the</Paragraph>
    <Paragraph position="12"> where (A ! X1 Xm ) 2 pm, is a string of stack symbols and goto(p0;A) = q, we have a number of smaller steps leading to a series of stacks:</Paragraph>
    <Paragraph position="14"> There are two additional differences. First, we want to avoid steps of the form:</Paragraph>
    <Paragraph position="16"> by transitions p0 (A;0) &amp;quot;;&amp;quot;7!p0 q, as such transitions complicate the generic definition of 'properness' for PDTs, to be discussed in the following section.</Paragraph>
    <Paragraph position="17"> For this reason, we use stack symbols of the form [p;X] next to p, and split up p0 (A;0) &amp;quot;;&amp;quot;7!p0 q into pop [p0;X0] (A;0) &amp;quot;;&amp;quot;7! [p0;A] and push [p0;A] &amp;quot;;&amp;quot;7! [p0;A] q. This is a harmless modification, which increases the number of steps in any computation by at most a factor 2.</Paragraph>
    <Paragraph position="18"> Secondly, we use stack symbols of the form (p;A;m) instead of (A;m). This concerns the conditions of reverse-properness to be discussed in the For LR state p and a2 such that goto(p;a)6=;:</Paragraph>
    <Paragraph position="20"> following section. By this condition, we consider LR parsing as being performed from right to left, so backwards with regard to the normal processing order. If we were to omit the first components p from stack symbols (p;A;m), we may obtain 'dead ends' in the computation. We know that such dead ends make a (reverse-)proper PDT inconsistent, as probability mass lost in dead ends causes the sum of probabilities of all computations to be strictly smaller than 1. (See also (Nederhof and Satta, 2004).) It is interesting to note that the addition of the componentspto stack symbols (p;A;m) does not increase the number of transitions, and the nature of LR parsing in the normal processing order from left to right is preserved.</Paragraph>
    <Paragraph position="21"> With all these changes together, reductions are implemented by transitions resulting in the following sequence of stacks:  spond to several dotted rules (A! X ) 2p, with different of length m and different . If we were to multiply such transitions for different and , the PDT would become prohibitively large.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Properness and reverse-properness
</SectionTitle>
    <Paragraph position="0"> If a PDT is regarded to process input from left to right, starting with a stack consisting only of pinit , and ending in a stack consisting only of p nal , then it seems reasonable to cast this process into a probabilistic framework in such a way that the sum of probabilities of all choices that are possible at any given moment is 1. This is similar to how the notion of 'properness' is defined for probabilistic context-free grammars (PCFGs); we say a PCFG is proper if for each nonterminalA, the probabilities of all rules with left-hand side A sum to 1.</Paragraph>
    <Paragraph position="1"> Properness for PCFGs does not restrict the space of probability distributions on the set of parse trees.</Paragraph>
    <Paragraph position="2"> In other words, if a probability distribution can be defined by attaching probabilities to rules, then we may reassign the probabilities such that that PCFG becomes proper, while preserving the probability distribution. This even holds if the input grammar is non-tight, meaning that probability mass is lost in 'infinite derivations' (S'anchez and Bened'i, 1997; Chi and Geman, 1998; Chi, 1999; Nederhof and Satta, 2003).</Paragraph>
    <Paragraph position="3"> Although CFGs and PDTs are weakly equivalent, they behave very differently when they are extended with probabilities. In particular, there seems to be no notion similar to PCFG properness that can be imposed on all types of PDTs without losing generality. Below we will discuss two constraints, which we will call properness and reverseproperness. Neither of these is suitable for all types of PDTs, but as we will show, the second is more suitable for probabilistic LR parsing than the first.</Paragraph>
    <Paragraph position="4"> This is surprising, as only properness has been described in existing literature on probabilistic PDTs (PPDTs). In particular, all existing approaches to probabilistic LR parsing have assumed properness rather than anything related to reverse-properness.</Paragraph>
    <Paragraph position="5"> For properness we have to assume that for each stack symbol P, we either have one or more transitions of the form P a;b7! P Q or P a;b7! Q, or one or more transitions of the form QP a;b7! R, but no combination thereof. In the first case, properness demands that the sum of probabilities of all transitions P a;b7!P Q and P a;b7!Q is 1, and in the second case properness demands that the sum of probabilities of all transitions QP a;b7!R is 1 for each Q. Note that our assumption above is without loss of generality, as we may introduce swap transitions P &amp;quot;;&amp;quot;7! P1 and P &amp;quot;;&amp;quot;7! P2, where P1 and P2 are new stack symbols, and replace transitions P a;b7! P Q and P a;b7! Q by P1 a;b7! P1 Q and P1 a;b7! Q, and replace transitions QP a;b7!R by QP2 a;b7!R.</Paragraph>
    <Paragraph position="6"> The notion of properness underlies the normal training process for PDTs, as follows. We assume a corpus of PDT computations. In these computations, we count the number of occurrences for each transition. For each P we sum the total number of all occurrences of transitions P a;b7!P Q or P a;b7!Q.</Paragraph>
    <Paragraph position="7"> The probability of, say, a transition P a;b7! P Q is now estimated by dividing the number of occurrences thereof in the corpus by the above total number of occurrences of transitions with P in the left-hand side. Similarly, for each pair (Q;P) we sum the total number of occurrences of all transitions of the formQP a;b7!R, and thereby estimate the probability of a particular transitionQP a;b7!R by relative frequency estimation. The resulting PPDT is proper.</Paragraph>
    <Paragraph position="8"> It has been shown that imposing properness is without loss of generality in the case of PDTs constructed by a wide range of parsing strategies, among which are top-down parsing and left-corner parsing. This does not hold for PDTs constructed by the LR parsing strategy however, and in fact, properness for such automata may reduce the expressive power in terms of available probability distributions to strictly less than that offered by the original CFG.</Paragraph>
    <Paragraph position="9"> This was formally proven by (Nederhof and Satta, 2004), after (Ng and Tomita, 1991) and (Wright and Wrigley, 1991) had already suggested that creating a probabilistic LR parser that is equivalent to an input PCFG is difficult in general. The same difficulty for ELR parsing was suggested by (Tendeau, 1997).</Paragraph>
    <Paragraph position="10"> For this reason, we investigate a practical alternative, viz. reverse-properness. Now we have to assume that for each stack symbol R, we either have one or more transitions of the form P a;b7! R or QP a;b7! R, or one or more transitions of the form P a;b7!P R, but no combination thereof. In the first case, reverse-properness demands that the sum of probabilities of all transitions P a;b7!R or QP a;b7!R is 1, and in the second case reverse-properness demands that the sum of probabilities of transitions P a;b7!P R is 1 for each P. Again, our assumption above is without loss of generality.</Paragraph>
    <Paragraph position="11"> In order to apply relative frequency estimation, we now sum the total number of occurrences of transitions P a;b7! R or QP a;b7! R for each R, and we sum the total number of occurrences of transitions P a;b7!P R for each pair (P;R).</Paragraph>
    <Paragraph position="12"> We now prove that reverse-properness does not restrict the space of probability distributions, by means of the construction of a 'cover' grammar from an input CFG, as reported in Figure 2. This cover CFG has almost the same structure as the PDT resulting from Figure 1. Rules and transitions almost stand in a one-to-one relation. The only noteworthy difference is between transitions of type (6) and rules of type (12). The right-hand sides of those rules can be &amp;quot; because the corresponding transitions are deterministic if seen from right to left. Now it becomes clear why we needed the components p in stack symbols of the form (p;A;m). Without it, one could obtain an LR state q that does not match the underlying [p;X] in a reversed computation.</Paragraph>
    <Paragraph position="13"> We may assume without loss of generality that rules of type (12) are assigned probability 1, as a probability other than 1 could be moved to corresponding rules of types (10) or (11) where state q was introduced. In the same way, we may assume that transitions of type (6) are assigned probability 1. After making these assumptions, we obtain a bijection between probability functionspAfor the PDT and probability functions pG for the cover CFG. As was shown by e.g. (Chi, 1999) and (Nederhof and Satta, 2003), properness for CFGs does not restrict the space of probability distributions, and thereby the same holds for reverse-properness for PDTs that implement the LR parsing strategy.</Paragraph>
    <Paragraph position="14"> It is now also clear that a reverse-proper LR parser can describe any probability distribution that the original CFG can. The proof is as follows.</Paragraph>
    <Paragraph position="15"> Given a probability function pG for the input CFG, we define a probability function pA for the LR parser, by letting transitions of types (2) and (3) For LR state p and a2 such that goto(p;a)6=;: [p;a]!p (7) For LR state p and (A! )2p, where A!&amp;quot; has label r: [p;A]!pr (8) For LR state p and (A! )2p, wherej j= m&gt; 0 and A! has label r: (p;A;m 1)!pr (9) For LR state p and (A! X )2p, wherej j= m&gt; 0, such that goto(p;X) = q6=;: (p;A;m 1)![p;X] (q;A;m) (10) For LR state p and (A! X )2p, such that goto(p;X) = q6=;:</Paragraph>
    <Paragraph position="17"> (pinit;Sy;0). Terminals are rule labels. Generated language consists of right-most derivations in reverse.</Paragraph>
    <Paragraph position="18"> have probability pG(r), and letting all other transitions have probability 1. This gives us the required probability distribution in terms of a PPDT that is not reverse-proper in general. This PPDT can now be recast into reverse-proper form, as proven by the above.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We have implemented both the traditional training method for LR parsing and the novel one, and have compared their performance, with two concrete objectives: null 1. We show that the number of free parameters is significantly larger with the new training method. (The number of free parameters is the number of probabilities of transitions that can be freely chosen within the constraints of properness or reverse-properness.) 2. The larger number of free parameters does not make the problem of sparse data any worse, and precision and recall are at least comparable to, if not better than, what we would obtain with the established method.</Paragraph>
    <Paragraph position="1"> The experiments were performed on the Wall Street Journal (WSJ) corpus, from the Penn Treebank, version II. Training was done on sections 0221, i.e., first a context-free grammar was derived from the 'stubs' of the combined trees, taking parts of speech as leaves of the trees, omitting all affixes from the nonterminal names, and removing &amp;quot;generating subtrees. Such preprocessing of the WSJ corpus is consistent with earlier attempts to derive CFGs from that corpus, as e.g. by (Johnson, 1998).</Paragraph>
    <Paragraph position="2"> The obtained CFG has 10,035 rules. The dimensions of the LR parser constructed from this grammar are given in Table 1.</Paragraph>
    <Paragraph position="3"> The PDT was then trained on the trees from the same sections 02-21, to determine the number of times that transitions are used. At first sight it is not clear how to determine this on the basis of the treebank, as the structure of LR parsers is very different from the structure of the grammars from which they are constructed. The solution is to construct a second PDT from the PDT to be trained, replacing each transition a;b7! with label r by transition b;r7! . By this second PDT we parse the treebank, encoded as a series of right-most derivations in reverse.1 For each input string, there is exactly one parse, of which the output is the list of used transitions. The same method can be used for other parsing strategies as well, such as left-corner parsing, replacing right-most derivations by a suitable alternative representation of parse trees.</Paragraph>
    <Paragraph position="4"> By the counts of occurrences of transitions, we may then perform maximum likelihood estimation to obtain probabilities for transitions. This can be done under the constraints of properness or of reverse-properness, as explained in the previous section. We have not applied any form of smooth1We have observed an enormous gain in computational efficiency when we also incorporate the 'shifts' next to 'reductions' in these right-most derivations, as this eliminates a considerable amount of nondeterminism.</Paragraph>
    <Paragraph position="5">  properness and reverse-properness.</Paragraph>
    <Paragraph position="6"> ing or back-off, as this could obscure properties inherent in the difference between the two discussed training methods. (Back-off for probabilistic LR parsing has been proposed by (Ruland, 2000).) All transitions that were not seen during training were given probability 0.</Paragraph>
    <Paragraph position="7"> The results are outlined in Table 2. Note that the number of free parameters in the case of reverse-properness is much larger than in the case of normal properness. Despite of this, the number of transitions that actually receive non-zero probabilities is (predictably) identical in both cases, viz. 137,134.</Paragraph>
    <Paragraph position="8"> However, the potential for fine-grained probability estimates and for smoothing and parameter-tying techniques is clearly greater in the case of reverseproperness. null That in both cases the number of non-zero probabilities is lower than the total number of parameters can be explained as follows. First, the treebank contains many rules that occur a small number of times. Secondly, the LR automaton is much larger than the CFG; in general, the size of an LR automaton is bounded by a function that is exponential in the size of the input CFG. Therefore, if we use the same treebank to estimate the probability function, then many transitions are never visited and obtain a zero probability.</Paragraph>
    <Paragraph position="9"> We have applied the two trained LR automata on section 22 of the WSJ corpus, measuring labelled precision and recall, as done by e.g. (Johnson, 1998).2 We observe that in the case of reverseproperness, precision and recall are slightly better. 2We excluded all sentences with more than 30 words however, as some required prohibitive amounts of memory. Only one of the remaining 1441 sentences was not accepted by the parser.</Paragraph>
    <Paragraph position="10"> The most important conclusion that can be drawn from this is that the substantially larger space of obtainable probability distributions offered by the reverse-properness method does not come at the expense of a degradation of accuracy for large grammars such as those derived from the WSJ. For comparison, with a standard PCFG we obtain labelled precision and recall of 0.725 and 0.670, respectively.3 null We would like to stress that our experiments did not have as main objective the improvement of state-of-the-art parsers, which can certainly not be done without much additional fine-tuning and the incorporation of some form of lexicalization. Our main objectives concerned the relation between our newly proposed training method for LR parsers and the traditional one.</Paragraph>
  </Section>
class="xml-element"></Paper>