<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1011"> <Title>Kullback-Leibler Distance between Probabilistic Context-Free Grammars and Probabilistic Finite Automata</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Expectation of rule frequency </SectionTitle> <Paragraph position="0"> Here we discuss how we can compute the expectation of the frequency of a rule or a non-terminal over all derivations of a probabilistic context-free grammar. These quantities will be used later by our algorithms.</Paragraph> <Paragraph position="1"> 1Our de nition of PFAs amounts to a slight loss of generality with respect to standard de nitions, in that there are no epsilon transitions and no probability function on states being nal. We want to avoid these concepts as they would cause some technical complications later in this article. There is no loss of generality however if we may assume an end-of-sentence marker, which is often the case in practice.</Paragraph> <Paragraph position="2"> Let (A ! ) 2 R be a rule of PCFG Gp, and let d2R be a complete derivation in Gp.</Paragraph> <Paragraph position="3"> We de ne f(A! ;d) as the number of occurrences, or frequency, of A! in d. Similarly, the frequency of nonterminal A in d is de ned as f(A;d) = P f(A! ;d). We consider the following related quantities</Paragraph> <Paragraph position="5"> A method for the computation of these quantities is reported in (Hutchins, 1972), based on the so-called momentum matrix. We propose an alternative method here, based on an idea related to the inside-outside algorithm (Baker, 1979; Lari and Young, 1990; Lari and Young, 1991). We observe that we can factorize a derivation d at each occurrence of rule A! into an 'innermost' part d2 and two 'outermost' parts d1 and d3. We can then write</Paragraph> <Paragraph position="7"> Next we group together all of the innermost and all of the outermost derivations and write</Paragraph> <Paragraph position="9"> Both outGp(A) and inGp( ) can be described in terms of recursive equations, of which the least xed-points are the required values. If Gp is proper and consistent, then inGp( ) = 1 for each 2 ( [N) . Quantities outGp(A) for every A can all be (exactly) calculated by solving a linear system, requiring an amount of time proportional to the cube of the size of Gp; see for instance (Corazza et al., 1991).</Paragraph> <Paragraph position="10"> On the basis of all the above quantities, a number of useful statistical properties of Gp can be easily computed, such as the expected length of derivations, denoted EDL(Gp) and the expected length of sentences, denoted EWL(Gp), discussed before by (Wetherell, 1980). These quantities satisfy the relations</Paragraph> <Paragraph position="12"> where for a string 2 (N[ ) we write j j to denote the number of occurrences of terminal symbols in .</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Entropy of PCFGs </SectionTitle> <Paragraph position="0"> In this section we introduce the notion of derivational entropy of a PCFG, and discuss an algorithm for its computation.</Paragraph> <Paragraph position="1"> Let Gp = (G;pG) be a PCFG. For a nonterminal A of G, let us de ne the entropy of A as the entropy of the distribution pG on all rules of the form A! 
</Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Entropy of PCFGs </SectionTitle>
<Paragraph position="0"> In this section we introduce the notion of derivational entropy of a PCFG, and discuss an algorithm for its computation.</Paragraph>
<Paragraph position="1"> Let $G_p = (G, p_G)$ be a PCFG. For a nonterminal $A$ of $G$, let us define the entropy of $A$ as the entropy of the distribution $p_G$ on all rules of the form $A \to \alpha$, i.e., $$H(A) = \sum_{\alpha} p_G(A \to \alpha) \cdot \log \frac{1}{p_G(A \to \alpha)}.$$</Paragraph>
<Paragraph position="3"> The derivational entropy of $G_p$ is defined as the expectation of the information of the complete derivations generated by $G_p$, i.e., $$H_d(G_p) = \sum_{d} p_G(d) \cdot \log \frac{1}{p_G(d)}.$$</Paragraph>
<Paragraph position="5"> We now characterize derivational entropy using expected rule frequencies as $$H_d(G_p) = \sum_{A \to \alpha} \mathbb{E}_{p_G} f(A \to \alpha) \cdot \log \frac{1}{p_G(A \to \alpha)} = \sum_{A \to \alpha} \mathrm{out}_{G_p}(A) \cdot p_G(A \to \alpha) \cdot \mathrm{in}_{G_p}(\alpha) \cdot \log \frac{1}{p_G(A \to \alpha)}.$$</Paragraph>
<Paragraph position="7"> As already discussed, under the assumption that $G_p$ is proper and consistent we have $\mathrm{in}_{G_p}(\alpha) = 1$ for every $\alpha$. Thus we can write $$H_d(G_p) = \sum_{A} \mathrm{out}_{G_p}(A) \cdot H(A). \qquad (2)$$</Paragraph>
<Paragraph position="9"> The computation of $\mathrm{out}_{G_p}(A)$ was discussed in Section 3, and $H(A)$ can also easily be calculated. Under the restrictive assumption that a PCFG is proper and consistent, the characterization in (2) was already known from (Grenander, 1976, Theorem 10.7, pp. 90-92). The proof reported in that work is different from ours and uses a momentum matrix (Section 3). Our characterization above is more general and uses simpler notation than the one in (Grenander, 1976). The sentential entropy, or entropy for short, of $G_p$ is defined as the expectation of the information of the strings generated by $G_p$, i.e., $$H(G_p) = \sum_{w} p_G(w) \cdot \log \frac{1}{p_G(w)},$$</Paragraph>
<Paragraph position="11"> assuming $0 \cdot \log \frac{1}{0} = 0$ for strings $w$ not generated by $G_p$. It is not difficult to see that $H(G_p) \leq H_d(G_p)$, and that equality holds if and only if $G$ is unambiguous (Soule, 1974, Theorem 2.2).</Paragraph>
<Paragraph position="12"> As ambiguity of CFGs is undecidable, it follows that we cannot hope to obtain a closed-form solution for $H(G_p)$ for which equality to (2) is decidable. We will return to this issue in Section 6.</Paragraph>
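As a companion to the Section 3 sketch, the following hedged example computes $H_d(G_p)$ via equation (2). It reuses out_values() from the previous block; the rule encoding remains our own assumption.

```python
import math
from collections import defaultdict

def rule_entropy(rules):
    """H(A) = sum_alpha p_G(A -> alpha) * log(1 / p_G(A -> alpha)), in bits."""
    by_lhs = defaultdict(list)
    for lhs, _, p in rules:
        by_lhs[lhs].append(p)
    return {A: sum(p * math.log2(1.0 / p) for p in ps if p > 0)
            for A, ps in by_lhs.items()}

def derivational_entropy(nonterminals, rules, start):
    """Equation (2): H_d(G_p) = sum_A out_Gp(A) * H(A)."""
    out = out_values(nonterminals, rules, start)  # from the Section 3 sketch
    H = rule_entropy(rules)
    return sum(out[A] * H.get(A, 0.0) for A in nonterminals)

# Same toy PCFG as before: H(S) = 1 bit and out(S) = 2, so H_d = 2 bits.
rules = [("S", ["a", "S"], 0.5), ("S", ["a"], 0.5)]
print(derivational_entropy(["S"], rules, "S"))  # 2.0
```

Since this toy grammar is unambiguous, the sentential entropy $H(G_p)$ coincides with $H_d(G_p)$ here, in line with (Soule, 1974) as cited above.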
</Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Weighted intersection </SectionTitle>
<Paragraph position="0"> In order to compute the cross-entropy defined in the next section, we need to derive a single probabilistic model that simultaneously accounts for both the computations of an underlying FA and the derivations of an underlying PCFG. We start from a construction originally presented in (Bar-Hillel et al., 1964), which computes the intersection of a context-free language and a regular language. The input consists of a CFG $G = (\Sigma, N, S, R)$ and an FA $M = (\Sigma, Q, q_0, Q_f, T)$; note that we assume, without loss of generality, that $G$ and $M$ share the same set of terminals $\Sigma$.</Paragraph>
<Paragraph position="1"> The output of the construction is CFG $G_\cap = (\Sigma, N_\cap, S_\cap, R_\cap)$, where $N_\cap = Q \times (\Sigma \cup N) \times Q \cup \{S_\cap\}$, and $R_\cap$ consists of the set of rules that is obtained as follows.</Paragraph>
<Paragraph position="2"> For each $s \in Q_f$, let $S_\cap \to (q_0, S, s)$ be a rule of $G_\cap$.</Paragraph>
<Paragraph position="3"> For each rule $A \to X_1 \cdots X_m$ of $G$ and each sequence of states $s_0, \ldots, s_m$ of $M$, with $m \geq 0$, let $(s_0, A, s_m) \to (s_0, X_1, s_1) \cdots (s_{m-1}, X_m, s_m)$ be a rule of $G_\cap$; for $m = 0$, $G_\cap$ has a rule $(s_0, A, s_0) \to \epsilon$ for each state $s_0$.</Paragraph>
<Paragraph position="4"> For each transition $s \stackrel{a}{\mapsto} t$ of $M$, let $(s, a, t) \to a$ be a rule of $G_\cap$.</Paragraph>
<Paragraph position="5"> Note that for each rule $(s_0, A, s_m) \to (s_0, X_1, s_1) \cdots (s_{m-1}, X_m, s_m)$ there is a unique rule $A \to X_1 \cdots X_m$ from which it has been constructed by the above. Similarly, each rule $(s, a, t) \to a$ uniquely identifies a transition $s \stackrel{a}{\mapsto} t$. This means that if we take a complete derivation $d_\cap$ in $G_\cap$, we can extract a sequence $h_1(d_\cap)$ of rules from $G$ and a sequence $h_2(d_\cap)$ of transitions from $M$, where $h_1$ and $h_2$ are string homomorphisms that we define point-wise as $$h_1((s_0, A, s_m) \to (s_0, X_1, s_1) \cdots (s_{m-1}, X_m, s_m)) = A \to X_1 \cdots X_m, \qquad h_1((s, a, t) \to a) = \varepsilon,$$ $$h_2((s_0, A, s_m) \to (s_0, X_1, s_1) \cdots (s_{m-1}, X_m, s_m)) = \varepsilon, \qquad h_2((s, a, t) \to a) = s \stackrel{a}{\mapsto} t.$$</Paragraph>
<Paragraph position="7"> We define $h(d_\cap) = (h_1(d_\cap), h_2(d_\cap))$. It can easily be shown that if $S_\cap \stackrel{d_\cap}{\Rightarrow} w$ and $h(d_\cap) = (d, c)$, then for the same $w$ we have $S \stackrel{d}{\Rightarrow} w$ and $(q_0, w) \vdash^c (s, \epsilon)$ for some $s \in Q_f$. Conversely, if for some $w$, $d$ and $c$ we have $S \stackrel{d}{\Rightarrow} w$ and $(q_0, w) \vdash^c (s, \epsilon)$ for some $s \in Q_f$, then there is precisely one derivation $d_\cap$ such that $h(d_\cap) = (d, c)$ and $S_\cap \stackrel{d_\cap}{\Rightarrow} w$.</Paragraph>
<Paragraph position="8"> As noted before by (Nederhof and Satta, 2003), this construction can be extended to apply to a PCFG $G_p = (G, p_G)$ and an FA $M$. The output is a PCFG $G_{\cap,p} = (G_\cap, p_{G_\cap})$, where $G_\cap$ is defined as above and $p_{G_\cap}$ is defined by: $$p_{G_\cap}(S_\cap \to (q_0, S, s)) = 1, \qquad p_{G_\cap}((s_0, A, s_m) \to (s_0, X_1, s_1) \cdots (s_{m-1}, X_m, s_m)) = p_G(A \to X_1 \cdots X_m), \qquad p_{G_\cap}((s, a, t) \to a) = 1.$$ Note that $G_{\cap,p}$ is non-proper. More specifically, the probabilities of rules with left-hand side $S_\cap$ or $(s_0, A, s_m)$ might not sum to one. This is not a problem for the algorithms presented in this paper, as we have never assumed properness for our PCFGs. What is most important here is the following property of $G_{\cap,p}$: if $d_\cap$, $d$ and $c$ are such that $h(d_\cap) = (d, c)$, then $p_{G_\cap}(d_\cap) = p_G(d)$.</Paragraph>
<Paragraph position="11"> Let us now assume that $M$ is deterministic. (In fact, the weaker condition of $M$ being unambiguous is sufficient for our purposes, but unambiguity is not a very practical condition.) Given a string $w$ and a transition $s \stackrel{a}{\mapsto} t$ of $M$, we define $f(s \stackrel{a}{\mapsto} t, w)$ as the frequency (number of occurrences) of $s \stackrel{a}{\mapsto} t$ in the unique computation of $M$, if it exists, that accepts $w$; this frequency is 0 if $w$ is not accepted by $M$. On the basis of the above construction of $G_{\cap,p}$ and of Section 3, we have $$\sum_w p_G(w) \cdot f(s \stackrel{a}{\mapsto} t, w) = \mathbb{E}_{p_{G_\cap}} f((s, a, t) \to a) = \mathrm{out}_{G_{\cap,p}}((s, a, t)). \qquad (4)$$</Paragraph>
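The construction above translates directly into code. The sketch below enumerates the rules of $G_\cap$ for a PCFG and a DFA; the data structures (rules as tuples, the DFA as a transition dictionary) are our own illustrative choices, and the usual pruning of useless nonterminals is omitted for brevity.

```python
from itertools import product

def intersect(pcfg_rules, start, states, q0, finals, delta):
    """Weighted Bar-Hillel construction: PCFG x DFA -> PCFG G_cap.
    pcfg_rules: list of (lhs, rhs_list, prob); delta: {(state, symbol): state}.
    Nonterminals of G_cap are triples (s, X, t) plus the start symbol 'S_cap'."""
    out = []
    for s in finals:
        # S_cap -> (q0, S, s), one rule per final state, probability 1
        out.append(("S_cap", [(q0, start, s)], 1.0))
    for lhs, rhs, p in pcfg_rules:
        m = len(rhs)
        # (s0, A, sm) -> (s0, X1, s1) ... (s_{m-1}, Xm, sm), inheriting
        # p_G(A -> X1 ... Xm); for m = 0 this yields (s0, A, s0) -> epsilon.
        for seq in product(states, repeat=m + 1):
            body = [(seq[i], rhs[i], seq[i + 1]) for i in range(m)]
            out.append(((seq[0], lhs, seq[m]), body, p))
    for (s, a), t in delta.items():
        # (s, a, t) -> a for each DFA transition, probability 1
        out.append(((s, a, t), [a], 1.0))
    return out

# Toy: one-state DFA accepting a*, combined with the earlier PCFG.
g_cap = intersect([("S", ["a", "S"], 0.5), ("S", ["a"], 0.5)],
                  "S", ["q"], "q", ["q"], {("q", "a"): "q"})
for rule in g_cap:
    print(rule)
```

Because the rule probabilities are carried over unchanged and the new rules have probability 1, each derivation $d_\cap$ keeps the probability of the derivation $d$ it encodes, matching the property $p_{G_\cap}(d_\cap) = p_G(d)$ stated above.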
</Section>
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Kullback-Leibler distance </SectionTitle>
<Paragraph position="0"> In this section we consider the Kullback-Leibler distance between a PCFG and a PFA, and present a method for its optimization under certain assumptions. Let $G_p = (G, p_G)$ be a consistent PCFG and let $M_p = (M, p_M)$ be a consistent PFA. We demand that $M$ be deterministic (or, more generally, unambiguous). Let us first assume that $L(G) \subseteq L(M)$; we will later drop this constraint.</Paragraph>
<Paragraph position="1"> The cross-entropy of $G_p$ and $M_p$ is defined as usual for probabilistic models, viz. as the expectation under distribution $p_G$ of the information of the strings generated by $M$, i.e., $$H(G_p \| M_p) = \sum_w p_G(w) \cdot \log \frac{1}{p_M(w)}.$$</Paragraph>
<Paragraph position="3"> The Kullback-Leibler distance of $G_p$ and $M_p$ is defined as $$D(G_p \| M_p) = \sum_w p_G(w) \cdot \log \frac{p_G(w)}{p_M(w)}.$$</Paragraph>
<Paragraph position="5"> The quantity $D(G_p \| M_p)$ can also be expressed as the difference between the cross-entropy of $G_p$ and $M_p$ and the entropy of $G_p$, i.e., $$D(G_p \| M_p) = H(G_p \| M_p) - H(G_p).$$</Paragraph>
<Paragraph position="7"> Let $G_{\cap,p}$ be the PCFG obtained by intersecting $G_p$ with the non-probabilistic FA $M$ underlying $M_p$, as in Section 5. Using (4), the cross-entropy of $G_p$ and $M_p$ can be expressed as $$H(G_p \| M_p) = \sum_{s \stackrel{a}{\mapsto} t} \mathrm{out}_{G_{\cap,p}}((s, a, t)) \cdot \log \frac{1}{p_M(s \stackrel{a}{\mapsto} t)}.$$ The values of $\mathrm{out}_{G_{\cap,p}}$ can be calculated easily, as discussed in Section 3. Computation of $H(G_p)$ in closed form is problematic, as already pointed out in Section 4. However, for many purposes the computation of $H(G_p)$ is not needed. For example, assume that the non-probabilistic FA $M$ underlying $M_p$ is given, and our goal is to measure the distance between $G_p$ and $M_p$ for different choices of $p_M$. Then the choice that minimizes $H(G_p \| M_p)$ also determines the choice that minimizes $D(G_p \| M_p)$, irrespective of $H(G_p)$. Formally, we can use the above characterization to compute $$\mathop{\mathrm{argmin}}_{p_M} H(G_p \| M_p).$$</Paragraph>
<Paragraph position="9"> When $L(G) - L(M)$ is non-empty, both $D(G_p \| M_p)$ and $H(G_p \| M_p)$ are undefined, as their definitions imply a division by $p_M(w) = 0$ for $w \in L(G) - L(M)$. In cases where the non-probabilistic FA $M$ is given, and our goal is to compare the relative distances between $G_p$ and $M_p$ for different choices of $p_M$, it makes sense to ignore strings in $L(G) - L(M)$, and to define $D(G_p \| M_p)$, $H(G_p \| M_p)$ and $H(G_p)$ on the domain $L(G) \cap L(M)$. Our equations above then still hold. Note that strings in $L(M) - L(G)$ can be ignored, since they do not contribute non-zero values to $D(G_p \| M_p)$ and $H(G_p \| M_p)$.</Paragraph>
</Section> </Paper>
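To close, a hedged sketch of how the characterization above can be used in practice. Given the out values of $G_{\cap,p}$ on the transition nonterminals (computable with the Section 3 sketch applied to $G_\cap$), the cross-entropy is a weighted sum over transitions; the per-state relative-frequency choice in minimizing_p_M is the standard minimizer of such a linear-in-log objective (by Gibbs' inequality) and is presented here as an illustration under our own encoding, not as the paper's derivation.

```python
import math
from collections import defaultdict

def cross_entropy(out_cap, p_M):
    """H(G_p || M_p) = sum over transitions (s, a, t) of
    out_{G_cap,p}((s, a, t)) * log(1 / p_M(s -a-> t)), in bits.
    out_cap, p_M: {(s, a, t): value} over the DFA's transitions."""
    return sum(v * math.log2(1.0 / p_M[trans])
               for trans, v in out_cap.items() if v > 0)

def minimizing_p_M(out_cap):
    """For a fixed DFA M, the p_M minimizing the cross-entropy normalizes
    the expected transition frequencies per source state."""
    total = defaultdict(float)
    for (s, _, _), v in out_cap.items():
        total[s] += v
    return {(s, a, t): v / total[s]
            for (s, a, t), v in out_cap.items() if total[s] > 0}

# Toy: for the earlier PCFG and the one-state DFA over {a}, the expected
# frequency of the single transition q -a-> q is the expected word length, 2.
out_cap = {("q", "a", "q"): 2.0}
p_M = minimizing_p_M(out_cap)       # {('q', 'a', 'q'): 1.0}
print(cross_entropy(out_cap, p_M))  # 0.0 bits: M_p puts all mass on a*
```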