<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1070">
  <Title>Relating Probabilistic Grammars and Automata</Title>
  <Section position="4" start_page="543" end_page="545" type="metho">
    <SectionTitle>
3 Stochastic Push-Down Automata
</SectionTitle>
    <Paragraph position="0"> We use a somewhat nonstandard definition of pushdown automaton for convenience, but all our results hold for a variety of essentially equivalent definitions. In addition to the terminal alphabet $\Sigma$, we will use sets of stack symbols and states as needed. A weighted push-down automaton (WPDA) consists of a distinguished start state $q_0$, a distinguished start stack symbol $X_0$, and a finite set of transitions of the following form, where $p$ and $q$ are states, $a \in \Sigma \cup \{\varepsilon\}$, $X$ and $Z_1, \ldots, Z_n$ are stack symbols, and $w$ is a nonnegative real weight: $$X, p \xrightarrow{a,\,w} Z_1 \ldots Z_n, q$$ A WPDA is a probabilistic push-down automaton (PPDA) if all weights are in the interval $[0, 1]$ and, for each pair of a stack symbol $X$ and a state $p$, the sum of the weights of all transitions of the form $X, p \xrightarrow{a,\,w} Z_1 \ldots Z_n, q$ equals 1. A machine configuration is a pair $\langle \beta, q \rangle$ of a finite sequence $\beta$ of stack symbols (a stack) and a machine state $q$. A machine configuration is called halting if the stack is empty. If $M$ is a PPDA containing the transition $X, p \xrightarrow{a,\,w} Z_1 \ldots Z_n, q$, then any configuration of the form $\langle \beta X, p \rangle$ has probability $w$ of being transformed into the configuration $\langle \beta Z_1 \ldots Z_n, q \rangle$, where this transformation has the effect of "outputting" $a$ if $a \neq \varepsilon$. A complete execution of $M$ is a sequence of transitions between configurations starting in the initial configuration $\langle X_0, q_0 \rangle$ and ending in a configuration with an empty stack. The probability of a complete execution is the product of the probabilities of the individual transitions between configurations in that execution. For any PPDA $M$ and $y \in \Sigma^*$ we define $P_M(y)$ to be the sum of the probabilities of all complete executions outputting $y$. A PPDA $M$ is called consistent if $\sum_{y \in \Sigma^*} P_M(y) = 1$.</Paragraph>
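    <Paragraph>The definitions of this section can be made concrete with a small sketch. The Python below is illustrative only: the Transition record, the helper names, and the toy automaton are assumptions introduced here, terminals are assumed to be single characters, and executions are explored only up to max_steps transitions, so the returned value is a lower bound on $P_M(y)$ whenever longer executions also output $y$.

```python
from collections import defaultdict
from typing import NamedTuple


class Transition(NamedTuple):
    top: str          # stack symbol X that must be on top of the stack
    state: str        # current state p
    output: str       # terminal a, or "" for an epsilon transition
    weight: float     # weight w
    push: tuple       # Z1 ... Zn replacing X; the last symbol becomes the new top
    next_state: str   # new state q


def is_probabilistic(transitions, tol=1e-9):
    """PPDA condition: weights in [0, 1], summing to 1 per (top, state) pair.

    Only pairs that have at least one transition are checked.
    """
    totals = defaultdict(float)
    for t in transitions:
        if not (1.0 >= t.weight >= 0.0):
            return False
        totals[(t.top, t.state)] += t.weight
    return all(tol >= abs(total - 1.0) for total in totals.values())


def string_probability(transitions, start_symbol, start_state, y, max_steps=50):
    """Sum the probabilities of complete executions that output y.

    Only executions of at most max_steps transitions are explored.
    """
    by_key = defaultdict(list)
    for t in transitions:
        by_key[(t.top, t.state)].append(t)

    def explore(stack, state, pos, steps):
        if not stack:                       # halting configuration: empty stack
            return 1.0 if pos == len(y) else 0.0
        if steps == 0:
            return 0.0
        total = 0.0
        for t in by_key.get((stack[-1], state), []):
            if t.output and y[pos:pos + 1] != t.output:
                continue                    # this output would not match y
            new_pos = pos + (1 if t.output else 0)
            total += t.weight * explore(stack[:-1] + t.push, t.next_state,
                                        new_pos, steps - 1)
        return total

    return explore((start_symbol,), start_state, 0, max_steps)


# Hypothetical toy PPDA: output "a" and pop X0, or output "b" and keep X0.
toy = [
    Transition("X0", "q0", "a", 0.5, (), "q0"),
    Transition("X0", "q0", "b", 0.5, ("X0",), "q0"),
]
print(is_probabilistic(toy))
print(string_probability(toy, "X0", "q0", "ba"))
```

    In the toy automaton the strings a, ba, bba, and so on receive probabilities 1/2, 1/4, 1/8, and so on, which sum to 1, so it is consistent in the sense defined in this section.</Paragraph>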
    <Paragraph position="1"> We first show that the well-known shift-reduce conversion of CFGs into PDAs cannot be made to handle the stochastic case. Given a (non-probabilistic) CFG $G$ in Chomsky normal form we define a (non-probabilistic) shift-reduce PDA $SR(G)$ as follows. The stack symbols of $SR(G)$ are taken to be the nonterminals of $G$ plus the special symbols $\top$ and $\bot$. The states of $SR(G)$ are in one-to-one correspondence with the stack symbols and we will abuse notation by using the same symbols for both states and stack symbols. The initial stack symbol is $\bot$ and the initial state is (the state corresponding to) $\bot$. For each production of the form $X \to a$ in $G$ the PDA $SR(G)$ contains all shift transitions of the following form: $$Y, Z \xrightarrow{a} YZ, X$$ Note that if $G$ consists entirely of productions of the form $S \to a$ these transitions suffice. More generally, for each production of the form $X \to YZ$ in $G$ the PDA $SR(G)$ contains the following reduce transitions.</Paragraph>
    <Paragraph position="2"> $$Y, Z \xrightarrow{\varepsilon} \varepsilon, X$$ All reachable configurations are in one of the following four forms, where the first is the initial configuration, the second is a template for all intermediate configurations with $\alpha \in N^*$, and the last two are terminal configurations.</Paragraph>
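    <Paragraph position="3"> $$\langle \bot, \bot \rangle, \quad \langle \bot\bot\alpha, X \rangle, \quad \langle \bot, \top \rangle, \quad \langle \varepsilon, \top \rangle$$ Furthermore, a configuration of the form $\langle \bot\bot\alpha, X \rangle$ can be reached after outputting $y$ if and only if $\alpha X \Rightarrow^* y$. In particular, the machine can reach configuration $\langle \bot\bot, S \rangle$ outputting $y$ if and only if $S \Rightarrow^* y$. So the machine $SR(G)$ generates the same language as $G$.</Paragraph>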
    <Paragraph position="4"> We now show that the shift-reduce translation of CFGs into PDAs does not generalize to the stochastic case. For any PCFG G we define the underlying CFG to be the result of erasing all weights from the productions of G.</Paragraph>
    <Paragraph position="5"> Theorem 1 There exists a consistent PCFG $G$ in Chomsky normal form with underlying CFG $G'$ such that no consistent weighting $M$ of the PDA $SR(G')$ has the property that $P_M(y) = P_G(y)$ for all $y \in \Sigma^*$.</Paragraph>
    <Paragraph position="7"> To prove the theorem take G to be the following grammar.</Paragraph>
    <Paragraph position="9"> Note that $G$ generates $acca$ and $bccb$ each with probability 1/2. Let $M$ be a consistent PPDA whose transitions consist of some weighting of the transitions of $SR(G')$. We will assume that $P_M(y) = P_G(y)$ for all $y \in \Sigma^*$ and derive a contradiction. Call the nonterminals $A$, $B$, and $C$ preterminals. Note that the only reduce transitions in $SR(G')$ combining two preterminals are $C, A \xrightarrow{\varepsilon} \varepsilon, X_2$ and $C, B \xrightarrow{\varepsilon} \varepsilon, Y_2$. Hence the only machine configuration reachable after outputting the sequence $acc$ is $\langle \bot\bot AC, C \rangle$. If $P_M(acca) = 1/2$ and $P_M(accb) = 0$ then the machine in configuration $\langle \bot\bot AC, C \rangle$ must deterministically move to configuration $\langle \bot\bot ACC, A \rangle$. The transitions available to the machine depend only on the top stack symbol and the state, which are $C$ and $C$ in both cases, so configuration $\langle \bot\bot BC, C \rangle$ also deterministically moves to configuration $\langle \bot\bot BCC, A \rangle$. We therefore have $P_M(bccb) = 0$, which violates the assumptions about $M$. ∎ Although the standard shift-reduce translation of CFGs into PDAs fails to generalize to the stochastic case, the standard top-down conversion easily generalizes. A top-down PPDA is one in which only $\varepsilon$ transitions can cause the stack to grow and transitions which output a word must pop the stack.</Paragraph>
    <Paragraph position="10">  Theorem 2 Any string distribution definable by a consistent PCFG is also definable by a top-down PPDA.</Paragraph>
    <Paragraph position="11"> Here we consider only PCFGs in Chomsky normal form; the generalization to arbitrary PCFGs is straightforward. Any PCFG in Chomsky normal form can be translated to a top-down PPDA by translating each weighted production of the form $X \xrightarrow{w} YZ$ to the set of expansion moves of the form $W, X \xrightarrow{\varepsilon,\,w} WZ, Y$ and each production of the form $X \xrightarrow{w} a$ to the set of pop moves of the form $Z, X \xrightarrow{a,\,w} \varepsilon, Z$. ∎ We also have the following converse of the above theorem.</Paragraph>
    <Paragraph position="12"> Theorem 3 Any string distribution definable by a consistent PPDA is definable by a PCFG.</Paragraph>
    <Paragraph position="13"> The proof, omitted here, uses a weighted version of the standard translation of a PDA into a CFG followed by a renormalization step using lemma 5. We note that it does in general involve an increase in the number of parameters in the derived PCFG.</Paragraph>
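    <Paragraph>The translation used for Theorem 2 can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation: the tuple encoding of transitions, the bottom-of-stack marker "$" used as the start stack symbol, the function names, and the example grammar are all assumptions made here. The state holds the nonterminal currently being expanded and the stack holds the nonterminals still to be expanded.

```python
def top_down_ppda(binary_rules, lexical_rules, bottom="$"):
    """Build top-down PPDA transitions from a CNF PCFG.

    binary_rules:  iterable of (X, Y, Z, w) for X -> Y Z with weight w
    lexical_rules: iterable of (X, a, w)    for X -> a   with weight w
    Each transition is (top, state, output, weight, push, next_state).
    """
    symbols = set()
    for x, y, z, _ in binary_rules:
        symbols.update((x, y, z))
    for x, _, _ in lexical_rules:
        symbols.add(x)
    transitions = []
    for top in sorted(symbols) + [bottom]:
        for x, y, z, w in binary_rules:
            # expansion move  W, X -> W Z, Y  (epsilon output, stack grows)
            transitions.append((top, x, "", w, (top, z), y))
        for x, a, w in lexical_rules:
            # pop move  Z, X -> epsilon, Z  (outputs a, stack shrinks)
            transitions.append((top, x, a, w, (), top))
    return transitions


def string_probability(transitions, start_symbol, start_state, y):
    """Sum the probabilities of all complete executions that output y."""
    def explore(stack, state, pos):
        if not stack:                        # halting configuration
            return 1.0 if pos == len(y) else 0.0
        # In this construction the state and every stack symbol above the
        # bottom marker still have to produce at least one terminal each.
        if len(stack) > len(y) - pos:
            return 0.0
        total = 0.0
        for top, st, out, w, push, nxt in transitions:
            if (top, st) != (stack[-1], state):
                continue
            if out and y[pos:pos + 1] != out:
                continue
            total += w * explore(stack[:-1] + push, nxt, pos + (1 if out else 0))
        return total

    return explore((start_symbol,), start_state, 0)


# Hypothetical CNF PCFG with single-character terminals.
binary = [("S", "A", "B", 0.5)]
lexical = [("S", "a", 0.5), ("A", "a", 1.0), ("B", "b", 0.4), ("B", "a", 0.6)]
ppda = top_down_ppda(binary, lexical)
print(string_probability(ppda, "$", "S", "ab"))  # the grammar assigns 0.5 * 1.0 * 0.4
```

    Because every expansion move copies the weight of its production and every pop move copies the weight of its lexical production, each complete execution has the same probability as the corresponding leftmost derivation.</Paragraph>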
    <Paragraph position="14"> In this paper we are primarily interested in shift-reduce PPDAs which we now define formally. In a shift-reduce PPDA there is a one-to-one correspondence between states and stack symbols and every transition has one of the following two forms.</Paragraph>
    <Paragraph position="16"> Transitions of the first type are called shift transitions and transitions of the second type are called reduce transitions. Shift transitions output a terminal symbol and push a single symbol on the stack. Reduce transitions are $\varepsilon$-transitions that combine two stack symbols.</Paragraph>
    <Paragraph position="17"> The above theorems leave open the question of whether shift-reduce PPDAs can express arbitrary context-free distributions. Our main theorem is that they can. To prove this some additional machinery is needed.</Paragraph>
  </Section>
  <Section position="5" start_page="545" end_page="546" type="metho">
    <SectionTitle>
4 Chomsky Normal Form
</SectionTitle>
    <Paragraph position="0"> A PCFG is in Chomsky normal form (CNF) if all productions are either of the form $X \xrightarrow{w} a$, $a \in \Sigma$, or $X \xrightarrow{w} Y_1 Y_2$, $Y_1, Y_2 \in N$. Our next theorem states, in essence, that any PCFG can be converted to Chomsky normal form.</Paragraph>
    <Paragraph position="1"> Theorem 4 For any consistent PCFG $G$ with $P_G(\varepsilon) &lt; 1$ there exists a consistent PCFG $C(G)$ in Chomsky normal form such that, for all $y \in \Sigma^+$:</Paragraph>
    <Paragraph position="3"> To prove the theorem, note first that, without loss of generality, we can assume that all productions in $G$ are of one of the forms $X \xrightarrow{w} YZ$, $X \xrightarrow{w} Y$, $X \xrightarrow{w} a$, or $X \xrightarrow{w} \varepsilon$. More specifically, any production not in one of these forms must have the form $X \xrightarrow{w} \alpha\beta$ where $\alpha$ and $\beta$ are nonempty strings. Such a production can be replaced by $X \xrightarrow{w} AB$, $A \xrightarrow{1} \alpha$, and $B \xrightarrow{1} \beta$ where $A$ and $B$ are fresh nonterminal symbols.</Paragraph>
    <Paragraph position="4"> By repeatedly applying this binarization transformation we get a grammar in the desired form defining the same distribution on strings.</Paragraph>
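    <Paragraph>The binarization step can be sketched as follows. This is a minimal illustration under stated assumptions: productions are encoded as (lhs, rhs, weight) triples with rhs a tuple of symbols, the split always takes the first symbol as $\alpha$ and the rest as $\beta$, and the fresh nonterminal names N0, N1, ... are a hypothetical naming scheme that is assumed not to clash with any terminal.

```python
from itertools import count


def binarize(productions, nonterminals):
    """Rewrite weighted productions into the forms X->YZ, X->Y, X->a, X->epsilon.

    productions:  list of (lhs, rhs, weight) with rhs a tuple of symbols
    nonterminals: set of nonterminal symbols (every other symbol is a terminal)
    Returns (new_productions, new_nonterminals).
    """
    nonterminals = set(nonterminals)
    counter = count()

    def fresh():
        # Hypothetical naming scheme for fresh nonterminals.
        while True:
            name = f"N{next(counter)}"
            if name not in nonterminals:
                nonterminals.add(name)
                return name

    def in_target_form(rhs):
        if len(rhs) == 2:
            return all(s in nonterminals for s in rhs)
        return 1 >= len(rhs)

    result, agenda = [], list(productions)
    while agenda:
        lhs, rhs, w = agenda.pop()
        if in_target_form(rhs):
            result.append((lhs, rhs, w))
            continue
        a, b = fresh(), fresh()
        result.append((lhs, (a, b), w))       # X -> A B keeps the weight w
        agenda.append((a, rhs[:1], 1.0))      # A -> alpha with weight 1
        agenda.append((b, rhs[1:], 1.0))      # B -> beta  with weight 1
    return result, nonterminals


# Hypothetical grammar: S -> a S b (0.5) | epsilon (0.5)
rules = [("S", ("a", "S", "b"), 0.5), ("S", (), 0.5)]
new_rules, new_nts = binarize(rules, {"S"})
for rule in new_rules:
    print(rule)
```

    Since every fresh nonterminal has exactly one production, of weight 1, each derivation of the new grammar has the same weight as the derivation of the original grammar it encodes.</Paragraph>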
    <Paragraph position="5"> We now assume that all productions of $G$ are in one of the above four forms. This implies that a node in a $G$-derivation has at most two children. A node with two children will be called a branching node. Branching nodes must be labeled with a production of the form $X \xrightarrow{w} YZ$. Because $G$ can contain productions of the form $X \xrightarrow{w} \varepsilon$ there may be arbitrarily large $G$-derivations with empty yield.</Paragraph>
    <Paragraph position="6"> Even $G$-derivations with nonempty yield may contain arbitrarily large subtrees with empty yield. A branching node in the $G$-derivation will be called ephemeral if either of its children has empty yield. Any $G$-derivation $d$ with $|\sigma(d)| \geq 2$ must contain a unique shallowest non-ephemeral branching node, labeled by some production $X \xrightarrow{w} YZ$. In this case, define $\beta(d) = YZ$. Otherwise ($|\sigma(d)| &lt; 2$), let $\beta(d) = \sigma(d)$. We say that a nonterminal $X$ is nontrivial in the grammar $G$ if $P_G(\sigma \neq \varepsilon \mid \rho = X) &gt; 0$.</Paragraph>
    <Paragraph position="7"> We now define the grammar G' to consist of all productions of the following form where X, Y, and Z are nontrivial nonterminals of G and a is a terminal symbol appearing in G.</Paragraph>
    <Paragraph position="9"> We leave it to the reader to verify that $G'$ has the property stated in theorem 4. ∎ The above proof of theorem 4 is nonconstructive in that it does not provide any way of computing the conditional probabilities</Paragraph>
    <Paragraph position="11"> $\ldots\, a \mid \rho = X,\ \sigma \neq \varepsilon)$. However, it is not difficult to compute probabilities of the form $P_G(\Phi \mid \rho = X,\ \tau \leq t + 1)$ from probabilities of the form $P_G(\Phi \mid \rho = X,\ \tau \leq t)$, and $P_G(\Phi \mid \rho = X)$ is the limit as $t$ goes to infinity of $P_G(\Phi \mid \rho = X,\ \tau \leq t)$. We omit the details here.</Paragraph>
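    <Paragraph>One simple instance of such a depth-bounded computation is the probability that a derivation has empty yield. The sketch below is an assumption-laden illustration rather than the omitted construction: it reads the bound as a bound on derivation depth, it tracks the joint probability that a derivation from X has empty yield and depth at most t (not the conditional probability given the depth bound), and the function name and encoding are introduced here. For a consistent grammar the two quantities have the same limit, because the probability of depth at most t tends to 1.

```python
def empty_yield_probability(productions, nonterminals, iterations=100):
    """Approximate P(yield is empty | root = X) for every nonterminal X.

    productions: list of (lhs, rhs, weight) with rhs a tuple of symbols.
    After t rounds, estimate[X] is the probability that a derivation from X
    has empty yield and depth at most t.
    """
    estimate = {x: 0.0 for x in nonterminals}
    for _ in range(iterations):
        updated = {x: 0.0 for x in nonterminals}
        for lhs, rhs, w in productions:
            term = w
            for symbol in rhs:
                # a terminal on the right-hand side rules out an empty yield
                term *= estimate[symbol] if symbol in nonterminals else 0.0
            updated[lhs] += term
        estimate = updated
    return estimate


# Hypothetical PCFG: S -> A A (0.5) | a (0.5),  A -> epsilon (0.5) | a (0.5)
rules = [
    ("S", ("A", "A"), 0.5), ("S", ("a",), 0.5),
    ("A", (), 0.5), ("A", ("a",), 0.5),
]
estimate = empty_yield_probability(rules, {"S", "A"})
print(estimate["S"], estimate["A"])
```

    Each round uses only the values from the previous round, which mirrors the step from depth bound t to depth bound t + 1 described in the text.</Paragraph>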
    <Paragraph position="14"/>
  </Section>
  <Section position="6" start_page="546" end_page="546" type="metho">
    <SectionTitle>
5 Renormalization
</SectionTitle>
    <Paragraph position="0"> A nonterminal $X$ is called reachable in a grammar $G$ if either $X$ is $S$ or there is some (recursively) reachable nonterminal $Y$ such that $G$ contains a production of the form $Y \xrightarrow{u} \alpha$ where $\alpha$ contains $X$. A nonterminal $X$ is nonempty in $G$ if $G$ contains $X \xrightarrow{u} \alpha$ where $u &gt; 0$ and $\alpha$ contains only terminal symbols, or $G$ contains $X \xrightarrow{u} \alpha[Y_1, \ldots, Y_k]$ where $u &gt; 0$ and each $Y_i$ is (recursively) nonempty. A WCFG $G$ is proper if every nonterminal is both reachable and nonempty. It is possible to efficiently compute the set of reachable and nonempty nonterminals in any grammar. Furthermore, the subset of productions involving only nonterminals that are both reachable and nonempty defines the same weight distribution on strings.</Paragraph>
    <Paragraph position="1"> So without loss of generality we need only consider proper WCFGs. A reweighting of G is any WCFG derived from G by changing the weights of the productions of G.</Paragraph>
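    <Paragraph>The reachable and nonempty sets can be computed by straightforward fixed-point iteration. The sketch below follows the two recursive definitions directly; the (lhs, rhs, weight) encoding, the function names, and the example grammar are assumptions made for illustration, and no attempt is made at the most efficient algorithm.

```python
def reachable_nonterminals(productions, nonterminals, start):
    """Nonterminals reachable from the start symbol."""
    reachable, agenda = {start}, [start]
    while agenda:
        y = agenda.pop()
        for lhs, rhs, _ in productions:
            if lhs != y:
                continue
            for x in rhs:
                if x in nonterminals and x not in reachable:
                    reachable.add(x)
                    agenda.append(x)
    return reachable


def nonempty_nonterminals(productions, nonterminals):
    """Nonterminals with a positive-weight production over nonempty symbols."""
    nonempty, changed = set(), True
    while changed:
        changed = False
        for lhs, rhs, u in productions:
            if lhs in nonempty or not u > 0:
                continue
            # terminals impose no condition; nonterminals must be nonempty
            if all(s in nonempty for s in rhs if s in nonterminals):
                nonempty.add(lhs)
                changed = True
    return nonempty


def proper_productions(productions, nonterminals, start):
    """Productions that mention only reachable and nonempty nonterminals."""
    keep = reachable_nonterminals(productions, nonterminals, start).intersection(
        nonempty_nonterminals(productions, nonterminals))
    return [(lhs, rhs, u) for lhs, rhs, u in productions
            if lhs in keep
            and all(s in keep for s in rhs if s in nonterminals)]


# Hypothetical WCFG: B never terminates and C is unreachable from S.
rules = [
    ("S", ("a", "A"), 1.0), ("S", ("B",), 1.0), ("A", ("b",), 2.0),
    ("B", ("B", "c"), 1.0), ("C", ("d",), 1.0),
]
nts = {"S", "A", "B", "C"}
print(proper_productions(rules, nts, "S"))
```

    In the example only the productions for S -> a A and A -> b survive, since B has no finite derivation and C cannot be reached from S.</Paragraph>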
    <Paragraph position="2"> Lemma 5 For any convergent proper WCFG $G$, there exists a reweighting $G'$ of $G$ such that $G'$ is a consistent PCFG and, for all terminal strings $y$, we have $P_{G'}(y) = P_G(y)$.</Paragraph>
    <Paragraph position="3"> Proof: Since $G$ is convergent, and every nonterminal $X$ is reachable, we must have $\|X\|_G &lt; \infty$. We now renormalize all the productions from $X$ as follows. For each production $X \xrightarrow{u} \alpha[Y_1, \ldots, Y_n]$ we replace $u$ by</Paragraph>
    <Paragraph position="5"> To show that $G'$ is a PCFG we must show that the sum of the weights of all productions from $X$ equals 1. For any parse tree $d$ admitted by $G$ let $d'$ be the corresponding tree admitted by $G'$, that is, the result of reweighting the productions in $d$. One can show by induction on the depth of parse trees that if</Paragraph>
    <Paragraph position="7"> In particular, $\|G'\| = \|S\|_{G'} = 1$, that is, $G'$ is consistent. This implies that for any terminal string $y$ we have $P_{G'}(y) = \frac{1}{\|G'\|} W_{G'}(\sigma = y, \rho = S) = W_{G'}(\sigma = y, \rho = S)$. Furthermore, for any tree $d$ with $\rho(d) = S$ we have $W_{G'}(d') = \frac{1}{\|S\|_G} W_G(d)$ and so $W_{G'}(\sigma = y, \rho = S) = \frac{1}{\|S\|_G} W_G(\sigma = y, \rho = S) = P_G(y)$. ∎</Paragraph>
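    <Paragraph>The renormalization step can be sketched numerically. The replacement weight is not legible in this copy of the text, so the sketch assumes the standard choice $u' = u \prod_i \|Y_i\|_G / \|X\|_G$, which agrees with the identity $W_{G'}(d') = W_G(d) / \|S\|_G$ used in the proof. It also assumes that $\|X\|_G$ denotes the total weight of all derivations rooted at $X$ and approximates it by fixed-point iteration; the function names, the encoding, and the example grammar are hypothetical.

```python
def total_weights(productions, nonterminals, iterations=500):
    """Fixed-point estimate of the total derivation weight from each nonterminal."""
    norm = {x: 0.0 for x in nonterminals}
    for _ in range(iterations):
        updated = {x: 0.0 for x in nonterminals}
        for lhs, rhs, u in productions:
            term = u
            for s in rhs:
                if s in nonterminals:
                    term *= norm[s]
            updated[lhs] += term
        norm = updated
    return norm


def renormalize(productions, nonterminals, iterations=500):
    """Reweight a convergent proper WCFG so that it becomes a PCFG.

    Assumed rule: u is replaced by u * prod(norm[Y_i]) / norm[X].
    A proper grammar guarantees norm[X] is positive for every X.
    """
    norm = total_weights(productions, nonterminals, iterations)
    reweighted = []
    for lhs, rhs, u in productions:
        scale = 1.0
        for s in rhs:
            if s in nonterminals:
                scale *= norm[s]
        reweighted.append((lhs, rhs, u * scale / norm[lhs]))
    return reweighted


# Hypothetical WCFG whose weights for S sum to 0.7 rather than 1.
rules = [("S", ("S", "S"), 0.2), ("S", ("a",), 0.5)]
new_rules = renormalize(rules, {"S"})
print(new_rules)
print(sum(w for _, _, w in new_rules))
```

    For this grammar $\|S\|_G$ is the smaller root of $z = 0.2 z^2 + 0.5$, and the two reweighted productions sum to 1 up to rounding, as a PCFG requires.</Paragraph>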
  </Section>
  <Section position="7" start_page="546" end_page="547" type="metho">
    <SectionTitle>
6 Greibach Normal Form
</SectionTitle>
    <Paragraph position="0"> A PCFG is in Greibach normal form (GNF) if every production $X \xrightarrow{w} \alpha$ satisfies $\alpha \in \Sigma N^*$.</Paragraph>
    <Paragraph position="1"> The following holds: Theorem 6 For any consistent PCFG $G$ in CNF there exists a consistent PCFG $G'$ in GNF such that $P_{G'}(y) = P_G(y)$ for $y \in \Sigma^*$.</Paragraph>
    <Paragraph position="2"> Proof: A left corner $G$-derivation from $X$ to $Y$ is a $G$-derivation from $X$ where the leftmost leaf, rather than being labeled with a production, is simply labeled with the nonterminal $Y$. For example, if $G$ contains the productions $X \xrightarrow{w_1} YZ$ and $Z \xrightarrow{w_2} a$ then we can construct a left corner $G$-derivation from $X$ to $Y$ by building a tree with a root labeled by $X \xrightarrow{w_1} YZ$, a left child labeled with $Y$ and a right child labeled with $Z \xrightarrow{w_2} a$. The weight of a left corner $G$-derivation is the product of the weights of the productions on the nodes. A tree consisting of a single node labeled with $X$ is a left corner $G$-derivation from $X$ to $X$.</Paragraph>
    <Paragraph position="3"> For each pair of nonterminals X, Y in G we introduce a new nonterminal symbol X/Y.</Paragraph>
    <Paragraph position="4">  The $H$-derivations from $X/Y$ will be in one-to-one correspondence with the left-corner $G$-derivations from $X$ to $Y$. For each production in $G$ of the form $X \xrightarrow{w} a$ we include the following in $H$, where $S$ is the start symbol of $G$: $$S \xrightarrow{w} a\; S/X$$ We also include in $H$ all productions of the following form, where $X$ is any nonterminal in $G$: $$X/X \xrightarrow{1} \varepsilon$$ If $G$ consists only of productions of the form $S \xrightarrow{w} a$ these productions suffice. More generally, for each nonterminal $X/Y$ of $H$ and each pair of productions $U \xrightarrow{w_1} YZ$, $W \xrightarrow{w_2} a$ we include in $H$ the following: $$X/Y \xrightarrow{w_1 w_2} a\; Z/W\; X/U$$ Because of the productions $X/X \xrightarrow{1} \varepsilon$, $W_H(\rho = X/X) &gt; 1$, and $H$ is not quite in GNF. These two issues will be addressed momentarily.</Paragraph>
    <Paragraph position="5"> Standard arguments can be used to show that the $H$-derivations from $X/Y$ are in one-to-one correspondence with the left corner $G$-derivations from $X$ to $Y$. Furthermore, this one-to-one correspondence preserves weight: if $d$ is the $H$-derivation rooted at $X/Y$ corresponding to the left corner $G$-derivation from $X$ to $Y$ then $W_H(d)$ is the product of the weights of the productions in the $G$-derivation.</Paragraph>
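    <Paragraph>The construction of $H$ can be sketched directly from the three production schemas. The weight labels are hard to read in this copy of the text, so the sketch assumes that $S \to a\,S/X$ inherits the weight of $X \to a$, that $X/X \to \varepsilon$ has weight 1, and that $X/Y \to a\,Z/W\,X/U$ has weight $w_1 w_2$; these choices are the ones required for the weight-preserving correspondence stated above. The encoding, the function name, and the example grammar are hypothetical.

```python
def left_corner_grammar(binary_rules, lexical_rules, start):
    """Build the productions of H from a CNF grammar G.

    binary_rules:  list of (U, Y, Z, w1) for U -> Y Z
    lexical_rules: list of (W, a, w2)    for W -> a
    Returns productions of H as (lhs, rhs, weight); X/Y is written "X/Y".
    """
    nonterminals = {start}
    for u, y, z, _ in binary_rules:
        nonterminals.update((u, y, z))
    for w_nt, _, _ in lexical_rules:
        nonterminals.add(w_nt)

    h = []
    for x, a, w in lexical_rules:
        h.append((start, (a, f"{start}/{x}"), w))           # S -> a S/X
    for x in sorted(nonterminals):
        h.append((f"{x}/{x}", (), 1.0))                     # X/X -> epsilon
    for x in sorted(nonterminals):
        for u, y, z, w1 in binary_rules:
            for w_nt, a, w2 in lexical_rules:
                # X/Y -> a Z/W X/U with weight w1 * w2
                h.append((f"{x}/{y}", (a, f"{z}/{w_nt}", f"{x}/{u}"), w1 * w2))
    return h


# Hypothetical CNF grammar: S -> A B, A -> a, B -> b, all with weight 1.
binary = [("S", "A", "B", 1.0)]
lexical = [("A", "a", 1.0), ("B", "b", 1.0)]
h_rules = left_corner_grammar(binary, lexical, "S")
for rule in h_rules:
    print(rule)
```

    In the example the string ab is derived in $H$ by S -> a S/A, S/A -> b B/B S/S, B/B -> epsilon, and S/S -> epsilon, mirroring the single derivation of ab in $G$. The sketch generates productions for every pair of nonterminals, including unreachable ones, which the later restriction to the proper subset of $H$ would discard.</Paragraph>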
    <Paragraph position="6"> The weight-preserving one-to-one correspondence between left-corner G-derivations from X to Y and H-derivations from X/Y yields the following.</Paragraph>
    <Paragraph position="8"> Lemma 5 implies that we can reweight the proper subset of $H$ (the reachable and nonempty productions of $H$) so as to construct a consistent PCFG $J$ with $P_J(\alpha) = P_G(\alpha)$. To prove theorem 6 it now suffices to show that the productions of the form $X/X \xrightarrow{w} \varepsilon$ can be eliminated from the PCFG $J$. Indeed, we can eliminate the $\varepsilon$ productions from $J$ in a manner similar to that used in the proof of theorem 4. A node in a $J$-derivation is ephemeral if it is labeled $X \xrightarrow{w} \varepsilon$ for some $X$. We now define a function $\gamma$ on $J$-derivations $d$ as follows. If the root of $d$ is labeled with $X \xrightarrow{w} aYZ$ then we have four subcases. If neither child of the root is ephemeral then $\gamma(d)$ is the string $aYZ$. If only the left child is ephemeral then $\gamma(d)$ is $aZ$. If only the right child is ephemeral then $\gamma(d)$ is $aY$, and if both children are ephemeral then $\gamma(d)$ is $a$. Analogously, if the root is labeled with $X \xrightarrow{w} aY$, then $\gamma(d)$ is $aY$ if the child is not ephemeral and $a$ otherwise. If the root is labeled with $X \xrightarrow{w} \varepsilon$ then $\gamma(d)$ is $\varepsilon$.</Paragraph>
    <Paragraph position="9"> A nonterminal $X$ in $J$ will be called trivial if $P_J(\gamma = \varepsilon \mid \rho = X) = 1$. We now define the final grammar $K$ to consist of all productions of the following form where $X$, $Y$, and $Z$ are nontrivial nonterminals appearing in $J$ and $a$ is a terminal symbol appearing in $J$.</Paragraph>
    <Paragraph position="11"> As in section 4, for every nontrivial nonterminal $X$ in $K$ and terminal string $\alpha$ we have $P_K(\sigma = \alpha \mid \rho = X) = P_J(\sigma = \alpha \mid \rho = X,\ \sigma \neq \varepsilon)$. In particular, since $P_J(\varepsilon) = P_G(\varepsilon) = 0$, we have the following:</Paragraph>
    <Paragraph position="13"> The PCFG $K$ is the desired PCFG in Greibach normal form. ∎ The construction in this proof is essentially the standard left-corner transformation (Rosenkrantz and Lewis II, 1970), as extended by Salomaa and Soittola (1978, theorem 2.3) to algebraic formal power series.</Paragraph>
  </Section>
  <Section position="8" start_page="547" end_page="548" type="metho">
    <SectionTitle>
7 The Main Theorem
</SectionTitle>
    <Paragraph position="0"> We can now prove our main theorem.</Paragraph>
    <Paragraph position="1"> Theorem 7 For any consistent PCFG $G$ there exists a shift-reduce PPDA $M$ such that $P_M(y) = P_G(y)$ for all $y \in \Sigma^*$.</Paragraph>
    <Paragraph position="2"> Let $G$ be an arbitrary consistent PCFG. By theorems 4 and 6, we can assume that $G$ consists of productions of the form $S \xrightarrow{w} \varepsilon$ and $S \xrightarrow{1-w} S'$ plus productions in Greibach normal form not mentioning $S$. We can then replace the rule $S \xrightarrow{1-w} S'$ with all rules of the form $S \xrightarrow{(1-w)w'} \alpha$ where $G$ contains $S' \xrightarrow{w'} \alpha$. We now assume without loss of generality that $G$ consists of a single production of the form $S \xrightarrow{w} \varepsilon$ plus productions in Greibach normal form not mentioning $S$ on the right hand side.</Paragraph>
    <Paragraph position="3"> The stack symbols of $M$ are of the form $W_\alpha$ where $\alpha \in N^*$ is a proper suffix of the right hand side of some production in $G$. For example, if $G$ contains the production $X \xrightarrow{w} aYZ$ then the symbols of $M$ include $W_{YZ}$, $W_Z$, and $W_\varepsilon$. The initial state is $W_S$ and the initial stack symbol is $\bot$. We have assumed that $G$ contains a unique production of the form $S \xrightarrow{w} \varepsilon$. We include the following transition in $M$ corresponding to this production.</Paragraph>
    <Paragraph position="4"> $$\bot, W_S \xrightarrow{\varepsilon,\,w} \varepsilon, \top$$ Then, for each rule of the form $X \xrightarrow{w} a\beta$ in $G$ and each symbol of the form $W_{X\alpha}$ we include the following in $M$: $$Z, W_{X\alpha} \xrightarrow{a,\,w} Z W_{X\alpha}, W_\beta$$ We also include all "post-processing" rules of the following form:  Note that all reduction transitions are deterministic with the single exception of the first rule listed above. The nondeterministic shift transitions of $M$ are in one-to-one correspondence with the productions of $G$. This yields the property that $P_M(y) = P_G(y)$. ∎</Paragraph>
  </Section>
</Paper>