<?xml version="1.0" standalone="yes"?>
<Paper uid="J99-1004">
  <Title>Statistical Properties of Probabilistic Context-Free Grammars</Title>
  <Section position="3" start_page="136" end_page="145" type="metho">
    <SectionTitle>
3. Relative Weighted Frequency
</SectionTitle>
    <Paragraph position="0"> The relative weighted frequency method is motivated by the maximum-likelihood (ML) estimation of production probabilities. We shall first give a brief review of ML estimation.</Paragraph>
    <Paragraph position="1"> We consider two cases of ML estimation. In the first case, we assume the data are fully observed, which means that all the samples are fully observed finite parse trees. Let τ_1, τ_2, ..., τ_n be the samples. Then the ML estimate of p(A → α) is the ratio between the total number of occurrences of the production A → α in the samples and the total number of occurrences of the symbol A in the samples,

p̂(A → α) = Σ_{i=1}^n f(A → α; τ_i) / Σ_{i=1}^n f(A; τ_i).   (5)</Paragraph>
    <Paragraph position="3"> Because of the form of the estimator in (5), ML estimation in the full observation case is also called relative frequency estimation in computational linguistics. This simple estimator, as shown by Chi and Geman (1998), assigns proper production probabilities for PCFGs.</Paragraph>
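    <Paragraph> As a concrete illustration of (5), the following minimal Python sketch computes relative frequency estimates from fully observed parse trees. The tree encoding (a nonterminal paired with a list of children, terminals as plain strings) and all function names are illustrative choices made here, not notation from the paper.

from collections import defaultdict

def rule_counts(tree, counts):
    """Accumulate f(A -> alpha; tau) for every production used in `tree`.
    A tree is (nonterminal, [children]); a terminal leaf is a plain string."""
    if isinstance(tree, str):          # terminal symbol: no production applied
        return
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        rule_counts(c, counts)

def relative_frequency(trees):
    """ML (relative frequency) estimate: count(A -> alpha) / count(A), as in (5)."""
    counts = defaultdict(int)
    for t in trees:
        rule_counts(t, counts)
    lhs_totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        lhs_totals[lhs] += c
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}

# Two sample parses of the grammar S -> S S | a:
t1 = ("S", ["a"])
t2 = ("S", [("S", ["a"]), ("S", ["a"])])
print(relative_frequency([t1, t2]))    # S -> a gets 3/4, S -> S S gets 1/4
    </Paragraph>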
    <Paragraph position="4"> In the second case, the parse trees are unobserved. Instead, the yields Y_1 = Y(τ_1), ..., Y_n = Y(τ_n), which are the left-to-right sequences of terminals of the unknown parses τ_1, ..., τ_n, form the data. It can be proved that the ML estimate p̂ is given by

p̂(A → α) = Σ_{i=1}^n E_p̂[f(A → α; τ) | τ ∈ Ω_{Y_i}] / Σ_{i=1}^n E_p̂[f(A; τ) | τ ∈ Ω_{Y_i}],   (6)</Paragraph>
    <Paragraph position="6"> where Ω_Y is the set of all parses with yield Y, i.e., Ω_Y = {τ ∈ Ω : Y(τ) = Y}.</Paragraph>
    <Paragraph position="7"> Equation (6) cannot be solved in closed form. Usually, the solution is computed by the EM algorithm with the following iteration (Baum 1972; Baker 1979; Dempster, Laird, and Rubin 1977):

p_{k+1}(A → α) = Σ_{i=1}^n E_{p_k}[f(A → α; τ) | τ ∈ Ω_{Y_i}] / Σ_{i=1}^n E_{p_k}[f(A; τ) | τ ∈ Ω_{Y_i}].   (7)</Paragraph>
    <Paragraph position="9"> Like p̂ in (5), the p_k for k &gt; 0 impose proper probability distributions on Ω (Chi and Geman 1998).</Paragraph>
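    <Paragraph> The EM iteration (7) can be sketched in the same style whenever the parses of each yield can be enumerated explicitly, which is feasible only for toy grammars. The helper parses_of is assumed to be supplied by the reader, and rule_counts is the counting function from the sketch after (5); this is an illustrative reconstruction of the update, not code from the paper.

from collections import defaultdict

def tree_prob(tree, p):
    """Probability of a parse under production probabilities p (dict: rule -> prob)."""
    if isinstance(tree, str):
        return 1.0
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = p[(label, rhs)]
    for c in children:
        prob *= tree_prob(c, p)
    return prob

def em_step(yields, p, parses_of):
    """One iteration of (7): expected rule counts under p, conditioned on each
    observed yield, then renormalized per left-hand side."""
    num = defaultdict(float)   # expected count of A -> alpha
    den = defaultdict(float)   # expected count of A
    for y in yields:
        parses = parses_of(y)                       # all parses with yield y
        weights = [tree_prob(t, p) for t in parses]
        total = sum(weights)                        # proportional to p(Omega_y)
        for t, w in zip(parses, weights):
            counts = defaultdict(int)
            rule_counts(t, counts)                  # from the earlier sketch
            for (lhs, rhs), c in counts.items():
                num[(lhs, rhs)] += (w / total) * c
                den[lhs] += (w / total) * c
    return {rule: v / den[rule[0]] for rule, v in num.items()}
    </Paragraph>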
    <Paragraph position="10"> To unify (6) and (7), expand E_p[f(A → α; τ) | τ ∈ Ω_{Y_i}], by the definition of expectation, into

Σ_{τ ∈ Ω_{Y_i}} f(A → α; τ) p(τ) / p(Ω_{Y_i}).</Paragraph>
    <Paragraph position="12"> Let A be the set of parses whose yields belong to the data, i.e., A = {τ : Y(τ) ∈ {Y_1, ..., Y_n}}. For each τ ∈ A, let y = Y(τ) and

W(τ) = Σ_{i: Y_i = y} p(τ) / p(Ω_y).

Then we observe that, for any production rule A → α,

p̂(A → α) = Σ_{τ∈A} W(τ) f(A → α; τ) / Σ_{τ∈A} W(τ) f(A; τ).</Paragraph>
    <Paragraph position="14"> The ML estimator in (5) can also be written in the above form, as can be readily checked by letting A be the set {τ_1, ..., τ_n} and W(τ), for each τ ∈ A, be the number of occurrences of τ in the data. In addition, in both the full observation case and the partial observation case, we can divide the weights W(τ) by a constant so that their sum is 1.</Paragraph>
    <Paragraph position="15"> The above discussion leads us to define a procedure to assign production probabilities as follows. First, pick an arbitrary finite subset A of Ω, with every production rule appearing in the trees in A. Second, assign to each τ ∈ A a positive weight W(τ) such that Σ_{τ∈A} W(τ) = 1. Finally, define a system of production probabilities p by

p(A → α) = Σ_{τ∈A} W(τ) f(A → α; τ) / Σ_{τ∈A} W(τ) f(A; τ).   (8)</Paragraph>
    <Paragraph position="17"> Because of the similarity between (5) and (8), we call the procedure to assign production probabilities by (8) the "relative weighted frequency" method.</Paragraph>
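    <Paragraph> Stated as code, the relative weighted frequency method (8) is a one-for-one transcription of the two weighted sums F(A → α) and F(A) defined below. The sketch reuses rule_counts and the example trees t1, t2 from the sketch after (5); the encoding is again illustrative.

from collections import defaultdict

def relative_weighted_frequency(weighted_trees):
    """Assign production probabilities by (8).
    `weighted_trees` is a list of (tree, W) pairs whose weights W sum to 1."""
    F_rule = defaultdict(float)   # F(A -> alpha): weighted occurrences of the rule
    F_lhs = defaultdict(float)    # F(A): weighted occurrences of the nonterminal A
    for tree, w in weighted_trees:
        counts = defaultdict(int)
        rule_counts(tree, counts)             # from the earlier sketch
        for (lhs, rhs), c in counts.items():
            F_rule[(lhs, rhs)] += w * c
            F_lhs[lhs] += w * c
    return {rule: v / F_lhs[rule[0]] for rule, v in F_rule.items()}

# With W uniform over the observed trees this reduces to relative frequency (5).
print(relative_weighted_frequency([(t1, 0.5), (t2, 0.5)]))   # 3/4 and 1/4 again
    </Paragraph>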
    <Paragraph position="18"> Proposition 1 Suppose all the symbols of N occur in the parses of A, and all the parses have positive weight. Then the production probabilities given by (8) impose proper distributions on parses.</Paragraph>
    <Paragraph position="19"> Proof The proof is almost identical to the one given by Chi and Geman (1998). Let q_A = p(derivation tree rooted in A fails to terminate). We will show that q_S = 0 (i.e., derivation trees rooted in S always terminate). For each A ∈ V, let f(A; τ) be the number of non-root instances of A in τ. Given α ∈ (V ∪ T)*, let α_i be the ith symbol of the sentential form α. For any A ∈ V</Paragraph>
    <Paragraph position="21"> reachable from S under p. Using the notation given in Definition 2, we have

q_S ≥ p({A ∈ τ and τ_A fails to terminate}) = p({τ_A fails to terminate} | A ∈ τ) p(A ∈ τ),

so p({τ_A fails to terminate} | A ∈ τ) = 0, since q_S = 0 and p(A ∈ τ) &gt; 0. By the nature of PCFGs, the form of τ_A is distributed according to p_A, independent of its location in τ or of the choice of subtrees elsewhere in τ. Therefore the conditional probability of τ_A failing to terminate, given that A occurs in τ, equals q_A, proving that q_A = 0. □

4. Entropy and Moments of Parse Tree Sizes

In this section, we will first show that if production probabilities are assigned by the relative weighted frequency method, then they impose PCFG distributions under which parse tree sizes have finite moments of any order. Based on this result, we will then demonstrate that such PCFG distributions have finite entropy and give the explicit form of the entropy.</Paragraph>
    <Paragraph position="22"> The mth moment of sizes of parses is given by

E_p|τ|^m = Σ_{τ∈Ω} |τ|^m p(τ),</Paragraph>
    <Section position="1" start_page="140" end_page="142" type="sub_section">
      <Paragraph position="0"> and the entropy of a PCFG distribution p is given by

H(p) = -Σ_{τ∈Ω} p(τ) log p(τ).</Paragraph>
      <Paragraph position="2"> To make the proofs more readable, we define, for any given A = {τ_1, ..., τ_n}, for any (A → α) ∈ R,

F(A → α) = Σ_{τ∈A} W(τ) f(A → α; τ),

and</Paragraph>
      <Paragraph position="4"> F(A) = Σ_{α s.t. (A→α)∈R} F(A → α)

for any A ∈ N; that is, F(A → α) is the weighted sum of the number of occurrences of the production rule A → α in A, and F(A) is the weighted sum of the number of occurrences of A in A.</Paragraph>
      <Paragraph position="5"> The relative weighted frequency method given by (8) can be written as

p(A → α) = F(A → α) / F(A).   (9)</Paragraph>
      <Paragraph position="7"> We have the following simple lemma: Suppose all the symbols in N occur in the parses of A, and all parses have positive weights. If the production probabilities p are assigned by the relative weighted frequency method in (8), then for each m ∈ N ∪ {0}, E_p|τ|^m &lt; ∞. Proof We shall show that, for any A ∈ N, E_{p_A}|τ|^m &lt; ∞. When m = 0, this is clearly true. Now suppose the claim is true for 0, ..., m - 1. For each A ∈ N and k ∈ N,</Paragraph>
      <Paragraph position="9"> where for ease of typing, we write L for |α|. For fixed α, write</Paragraph>
      <Paragraph position="11"/>
    </Section>
    <Section position="2" start_page="142" end_page="144" type="sub_section">
      <Paragraph position="0"> There are fewer than L^m = |α|^m terms in P(|τ_1|, ..., |τ_L|). Hence</Paragraph>
      <Paragraph position="2"> Because the set of production rules is finite, the length of a sentential form that occurs on the right-hand side of a production rule is upper bounded, i.e., sup{|α| : (A → α) ∈ R for some A ∈ N} &lt; ∞.</Paragraph>
      <Paragraph position="3"> Therefore we can bound (|α| + 1)^m c^{|α|} by a constant, say, K. Then we get</Paragraph>
      <Paragraph position="5"/>
      <Paragraph position="7"> linear relations between the E(C). To this end, for each τ ∈ Ω_C, let C → γ be the production rule applied at τ's root. Suppose γ is composed of m symbols, γ_1, ..., γ_m, and τ_1, ..., τ_m are the daughter subtrees of τ rooted in γ_1, ..., γ_m, respectively. Then</Paragraph>
      <Paragraph position="9"> Multiply both sides by p(τ) and sum over all τ ∈ Ω_C which have C → γ as the production rule applied at the root. By the definition of PCFG, p(τ) = p(C →</Paragraph>
    </Section>
    <Section position="3" start_page="144" end_page="145" type="sub_section">
      <Paragraph position="0"> γ) p(τ_1) ... p(τ_m), and τ_k can be any parse in Ω_{γ_k}. Therefore, by factorization, we get</Paragraph>
      <Paragraph position="2"> where Ω_{C→γ} stands for the set of trees in which C → γ is the rule applied at the root.</Paragraph>
      <Paragraph position="3"> Similarly, for each k,</Paragraph>
      <Paragraph position="5"> Replace p(C → γ) by F(C → γ)/F(C), according to (9). Then multiply both sides by F(C) and sum both sides over all C ∈ N. We get</Paragraph>
      <Paragraph position="7"> completing the proof of (17). □ Now we can calculate the entropy of p in terms of production probabilities. The calculation goes as follows,</Paragraph>
      <Paragraph position="9"/>
    </Section>
  </Section>
  <Section position="4" start_page="145" end_page="151" type="metho">
    <SectionTitle>
5. Gibbs Distributions on Parses and Renormalization of Improper PCFGs
</SectionTitle>
    <Paragraph position="0"> A Gibbs distribution on parses has the form

P_λ(τ) = (1/Z_λ) e^{λ·U(τ)},</Paragraph>
    <Paragraph position="2"> where Z_λ = Σ_{τ∈Ω} e^{λ·U(τ)}, and λ = {λ_i} and U(τ) = {U_i(τ)} are constants and functions on Ω, respectively, both indexed by elements in a finite set I. The inner product λ·U = Σ_i λ_i U_i is called the potential function of the Gibbs distribution and Z_λ is called the partition number for the exponential e^{λ·U}.</Paragraph>
    <Paragraph position="3"> The functions U_i are usually considered features of parses and the constants λ_i are weights of these features. The index set I and the functions U_i(τ) can take various forms. Among the simplest choices for I is R, the set of production rules, and

U_{A→α}(τ) = f(A → α; τ),   (A → α) ∈ R.</Paragraph>
    <Paragraph position="5"> A proper PCFG distribution is a Gibbs distribution of the form in (19). To see this, let λ_{A→α} = log p(A → α) for each (A → α) ∈ R. Then

p(τ) = Π_{(A→α)∈R} p(A → α)^{f(A→α; τ)} = exp( Σ_{(A→α)∈R} λ_{A→α} f(A → α; τ) ),</Paragraph>
    <Paragraph position="7"> which is a Gibbs form.</Paragraph>
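    <Paragraph> The identity above can be checked numerically: with λ_{A→α} = log p(A → α) and rule frequencies as features, the exponential (Gibbs) form reproduces the PCFG probability of a parse, with partition number 1 because p is proper. The sketch reuses tree_prob and rule_counts from the earlier sketches; the grammar and numbers are made up for illustration.

import math
from collections import defaultdict

def gibbs_prob(tree, lam):
    """exp( sum over rules r of lam[r] * f(r; tree) ): the Gibbs form whose
    features are frequencies of production rules."""
    counts = defaultdict(int)
    rule_counts(tree, counts)                     # from the earlier sketch
    return math.exp(sum(lam[r] * c for r, c in counts.items()))

p = {("S", ("S", "S")): 0.4, ("S", ("a",)): 0.6}
lam = {r: math.log(q) for r, q in p.items()}      # lambda_{A->alpha} = log p(A->alpha)
tree = ("S", [("S", ["a"]), ("S", ["a"])])
print(tree_prob(tree, p), gibbs_prob(tree, lam))  # both equal 0.4 * 0.6 * 0.6 = 0.144
    </Paragraph>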
    <Paragraph position="8"> A Gibbs distribution usually is not a PCFG distribution, because its potential function in general includes features other than frequencies of production rules. What if its potential function only has frequencies as features? More specifically, is the Gibbs distribution in (19) a PCFG distribution? The next proposition gives a positive answer to this question.</Paragraph>
    <Paragraph position="9"> Proposition 4 The Gibbs distribution P_λ given by (19) is a PCFG distribution. That is, there are production probabilities p such that, for every τ ∈ Ω,

P_λ(τ) = p(τ).</Paragraph>
    <Paragraph position="11"> The Gibbs distributions we have seen so far are only defined for parses rooted in S. By an obvious generalization, we can define for each nonterminal symbol A the partition number

Z_λ(A) = Σ_{τ∈Ω_A} e^{λ·f(τ)}</Paragraph>
    <Paragraph position="13"> and the Gibbs distribution P_λ^A(τ) on parses rooted in A. For simplicity, also define Z_λ(t) = 1 and P_λ(t) = 1 for each t ∈ T.</Paragraph>
    <Paragraph position="14"> We first show Z_λ(A) &lt; ∞ for all A. Suppose (S → α) ∈ R with |α| = n. The sum of e^{λ·f(τ)} over all τ ∈ Ω_S with S → α being applied at the root is equal to e^{λ_{S→α}} Z_λ(α_1) ... Z_λ(α_n), while no greater than the sum of e^{λ·f(τ)} over all τ ∈ Ω_S, which is Z_λ(S). Therefore, Z_λ(S) ≥ e^{λ_{S→α}} Z_λ(α_1) ... Z_λ(α_n).</Paragraph>
    <Paragraph position="15"> Since Z_λ &lt; ∞ and Z_λ(A) &gt; 0 for all A, it follows that each Z_λ(α_i) is finite. For any variable A, there are variables A_0 = S, A_1, ..., A_n = A ∈ N and sentential forms α^{(0)}, ..., α^{(n-1)} ∈ ... We shall prove, by induction on h(τ), that</Paragraph>
    <Paragraph position="17"> equation is true for all τ ∈ Ω_A with h(τ) &lt; h, and all A ∈ N. For any τ ∈ Ω_A with</Paragraph>
    <Paragraph position="19"> proving that P_λ is imposed by p. □</Paragraph>
    <Section position="1" start_page="148" end_page="151" type="sub_section">
      <Paragraph position="0"> Proposition 4 has a useful application to the renormalization of improper PCFGs.</Paragraph>
      <Paragraph position="1"> Suppose a PCFG distribution p on Ω = Ω_S is improper. We define a new, proper distribution p̂ on Ω by

p̂(τ) = p(τ) / p(Ω),   τ ∈ Ω.</Paragraph>
      <Paragraph position="3"> We call p̂ the renormalized distribution of p on Ω. We can also define the renormalized distribution of p_A on Ω_A, for each A ∈ N, by

p̂_A(τ) = p_A(τ) / p(Ω_A),   τ ∈ Ω_A.   (21)

Comparing p̂ with (19), we see that p̂ is a Gibbs distribution with frequencies of production rules as features. Therefore, by Proposition 4, p̂ is a PCFG distribution, and from the proof of Proposition 4, we get Corollary 3.</Paragraph>
      <Paragraph position="4"> Corollary 3 Suppose the production probabilities of the improper distribution p are positive for all the production rules. Then the renormalized distributions p̂ are induced by the production probabilities

p̂(A → α) = p(A → α) p(Ω_{α_1}) ... p(Ω_{α_n}) / p(Ω_A),   α = α_1 ... α_n.   (22)</Paragraph>
      <Paragraph position="6"> The only thing we have not mentioned is that λ_{A→α} = log p(A → α) are all bounded, since the p(A → α) are all positive. □ We have seen that PCFG distributions can be expressed in the form of Gibbs distributions. However, from the statistical point of view, this is not enough for regarding PCFG distributions as special cases of Gibbs distributions. An important statistical issue about a distribution is the estimation of its parameters. To equate PCFG distributions with special cases of Gibbs distributions, we need to show that estimators for production probabilities of PCFGs and parameters of Gibbs distributions produce the same results.</Paragraph>
      <Paragraph position="7"> Among many estimation procedures, the maximum-likelihood (ML) estimation procedure is commonly used. In the full observation case, if the data are composed of τ_1, ..., τ_n, then the estimator for the system of production probabilities is

p̂ = argmax_p Σ_{i=1}^n log p(τ_i),   subject to Σ_{α s.t. (A→α)∈R} p(A → α) = 1   (23)</Paragraph>
      <Paragraph position="9"> for any A ∈ N, and the estimator for parameters λ of Gibbs distributions of the form in (19) is

λ̂ = argmax_λ Σ_{i=1}^n log P_λ(τ_i).   (24)</Paragraph>
      <Paragraph position="11"> In addition, the ML estimate p̂ in (23) can be solved analytically, and the solution is given by Equation (5).</Paragraph>
      <Paragraph position="12"> In the partial observation case, if Y_1, ..., Y_n are the observed yields, then the estimators for the two distributions are

p̂ = argmax_p Σ_{i=1}^n log p(Ω_{Y_i}),   subject to Σ_{α s.t. (A→α)∈R} p(A → α) = 1 for any A ∈ N,   (25)

and

λ̂ = argmax_λ Σ_{i=1}^n log P_λ(Ω_{Y_i}),   (26)</Paragraph>
      <Paragraph position="14"> respectively.</Paragraph>
      <Paragraph position="15"> We want to compare the ML estimators for the two distributions and see if they produce the same results in some sense. Since the parameters p serve as base numbers in PCFG distributions, whereas λ are exponents in Gibbs distributions, to make the comparison sensible, we take the logarithms of p̂ and ask whether or not log p̂ and λ̂ are the same. Since the ML estimation procedure for PCFGs involves constrained optimization, whereas the estimation procedure for Gibbs distributions only involves unconstrained optimization, it is reasonable to suspect log p̂ ≠ λ̂. Indeed, log p̂ and λ̂ are numerically different. For example, the estimator (23) gives only one estimate of the system of production probabilities, whereas the estimator (24) may yield infinitely many solutions. Such uniqueness and nonuniqueness of estimates is related to the identifiability of parameters. We will discuss this in more detail in Section 7.</Paragraph>
      <Paragraph position="16"> Despite their numerical differences, the ML estimators for PCFG distributions and for Gibbs distributions of the form (19) are equivalent, in the sense that the estimates produced by the estimators impose the same distributions on parses. Because of this, in the context of ML estimation of parameters, we can regard PCFG distributions as special cases of Gibbs distributions.</Paragraph>
      <Paragraph position="17"> Corollary 4 If p̂ is the solution of (23), then log p̂ is a solution of ML estimation (24). Similarly, if p̂ is a solution of (25), then log p̂ is a solution of ML estimation (26). Hence, the estimates of production probabilities of PCFG distributions and of parameters of Gibbs distributions of the form (19) impose the same distributions on parses.</Paragraph>
      <Paragraph position="18"> by a system of production probabilities p̂. Then p̂ is the solution of (23). Let λ = log p̂, i.e., λ(A → α) = log p̂(A → α). Then λ impose the same distribution on parses as λ̂. Therefore λ are also a solution to (24). This proves the first half of the result. The second half is similarly proved. □

6. Branching Rates of PCFGs

In this section, we study PCFGs from the perspective of stochastic branching processes. Adopting the set-up given by Miller and O'Sullivan (1992), we define the mean matrix M of p as an |N| x |N| square matrix, with its (A, B)th entry being the expected number of variables B resulting from rewriting A:

M(A, B) = Σ_{α s.t. (A→α)∈R} p(A → α) · (number of instances of B in α).</Paragraph>
      <Paragraph position="20"> Clearly, M is a nonnegative matrix. We say B ∈ N can be reached from A ∈ N if, for some n &gt; 0, M^(n)(A, B) &gt; 0, where M^(n)(A, B) is the (A, B)th element of M^n. M is irreducible if for any pair A, B ∈ N, B can be reached from A. The corresponding branching process is called connected if M is irreducible (Walters 1982). It is easy to check that these definitions are equivalent to Definition 3.</Paragraph>
      <Paragraph position="21"> We need the result below for the study of branching processes.</Paragraph>
      <Paragraph position="22"> Theorem 1
Let M = [m_{ij}] be a nonnegative k x k matrix.
1. There is a nonnegative eigenvalue ρ such that no eigenvalue of M has absolute value greater than ρ.
2. Corresponding to the eigenvalue ρ there is a nonnegative left (row) eigenvector ν = (ν_1, ..., ν_k) and a nonnegative right (column) eigenvector u = (u_1, ..., u_k)^T.
3. If M is irreducible then ρ is a simple eigenvalue (i.e., the multiplicity of ρ is 1), and the corresponding eigenvectors are strictly positive (i.e., u_i &gt; 0, ν_i &gt; 0 for all i).</Paragraph>
      <Paragraph position="27"> The eigenvalue ρ is called the branching rate of the process. A branching process is called subcritical (critical, supercritical) if ρ &lt; 1 (ρ = 1, ρ &gt; 1). We also say a PCFG is subcritical (critical, supercritical) if its corresponding branching process is. When a PCFG is subcritical, it is proper. When a PCFG is supercritical, it is improper. The next result demonstrates that production probabilities assigned by the relative weighted frequency method impose subcritical PCFG distributions.</Paragraph>
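      <Paragraph> Computationally, the mean matrix and the branching rate are easy to obtain: build M from the production probabilities and take the largest eigenvalue modulus. The sketch below uses numpy and an illustrative grammar encoding (rules as a dict from (left-hand side, right-hand side tuple) to probability); it is a sketch of the definitions above, not code from the paper.

import numpy as np

def mean_matrix(nonterminals, rules):
    """M(A, B) = sum over rules A -> alpha of p(A -> alpha) times the number of B's in alpha."""
    idx = {A: i for i, A in enumerate(nonterminals)}
    M = np.zeros((len(nonterminals), len(nonterminals)))
    for (A, alpha), prob in rules.items():
        for sym in alpha:
            if sym in idx:                  # count nonterminal occurrences only
                M[idx[A], idx[sym]] += prob
    return M

def branching_rate(M):
    """rho: the largest absolute value among the eigenvalues of M."""
    return max(abs(ev) for ev in np.linalg.eigvals(M))

# S -> S S with probability 0.4, S -> a with probability 0.6:
# rho = 2 * 0.4 = 0.8, so this PCFG is subcritical (hence proper).
rules = {("S", ("S", "S")): 0.4, ("S", ("a",)): 0.6}
print(branching_rate(mean_matrix(["S"], rules)))    # 0.8
      </Paragraph>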
    </Section>
  </Section>
  <Section position="5" start_page="151" end_page="159" type="metho">
    <Paragraph position="0"> We need to show that Ip(S) &gt; 0. Assume Ip(S) = 0. Then for any n &gt; 0, since M^n Ip = ρ^n Ip, we have</Paragraph>
    <Paragraph position="2"> For each A ∈ N, M^(n)(S, A) Ip(A) = 0. Because each A ∈ N is reachable from S under p, there is n &gt; 0 such that M^(n)(S, A) &gt; 0. So we get Ip(A) = 0. Hence Ip = 0. This contradicts the fact that Ip is a nonnegative eigenvector of M. Therefore Ip(S) &gt; 0. By ...

We will apply the above result to give another proof of Proposition 2. Before doing this, we need to introduce a spectral theorem, which is well known in matrix analysis.

Theorem 2
Suppose M is an n x n real matrix. Let ρ(M) be the largest absolute value of M's eigenvalues. Then</Paragraph>
    <Paragraph position="4"> where 1 is defined as (1, ..., 1), and for two column vectors μ and ν, μ ≤ ν means that each component of μ is ≤ the corresponding component of ν. Since the components of K1, M and M̄_k are positive, the above relation implies M M̄_{k+1} ≤ KM1 + M^2 M̄_k.</Paragraph>
    <Paragraph position="5"> Hence, we get</Paragraph>
    <Paragraph position="7"> By Theorem 2, for any ρ &lt; ρ' &lt; 1, ||M^n|| = o(ρ'^n). Then (31) implies that |M̄_k| is bounded. Since the M̄_k are positive and increasing, it follows that the M̄_k converge. □ Next we investigate how branching rates of improper PCFGs change after renormalization. First, let us look at a simple example. Consider the CFG given by (1).</Paragraph>
    <Paragraph position="8"> Assign probability p to the first production (S → SS), and 1 - p to the second one (S → a). It was proved that the total probability of parses is min(1, 1/p - 1). If p &gt; 1/2, then min(1, 1/p - 1) = 1/p - 1 &lt; 1, implying the PCFG is improper. To get the renormalized distribution, take a parse τ with yield a^m. Since f(S → SS; τ) = m - 1 and f(S → a; τ) = m, p(τ) = p^{m-1}(1 - p)^m. Then the renormalized probability of τ equals

p̂(τ) = p(τ) / (1/p - 1) = p^{m-1}(1 - p)^m / (1/p - 1) = p^m (1 - p)^{m-1}.

Therefore, p̂ is assigned by a system of production probabilities p̂ with p̂(S → SS) = 1 - p &lt; 1/2 and p̂(S → a) = p. So the renormalized PCFG is subcritical.</Paragraph>
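    <Paragraph> The example can also be checked numerically. The termination probability p(Ω_S) is the smallest nonnegative solution of q = pq^2 + (1 - p), obtainable by iterating the generating function from 0; multiplying p(S → SS) = p by p(Ω_S)^2/p(Ω_S) = p(Ω_S), as in the renormalization formula of Corollary 3 reconstructed above, then gives 1 - p. A short sketch, with p = 0.7 chosen arbitrarily:

def termination_prob(p, iters=10000):
    """Smallest nonnegative solution of q = p*q**2 + (1 - p): the total probability
    p(Omega_S) of finite parses for S -> S S | a with P(S -> S S) = p."""
    q = 0.0
    for _ in range(iters):
        q = p * q * q + (1.0 - p)
    return q

p = 0.7                       # improper: the total probability is 1/p - 1, about 0.4286
q = termination_prob(p)
print(q, 1 / p - 1)           # both about 0.4286
# Renormalized probability of S -> S S: p * p(Omega_S)**2 / p(Omega_S) = p * q
print(p * q, 1 - p)           # both about 0.3, so the renormalized PCFG is subcritical
    </Paragraph>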
    <Paragraph position="9"> More generally, we have the following result, which says a connected, improper PCFG, after being renormalized, becomes a subcritical PCFG.</Paragraph>
    <Section position="1" start_page="153" end_page="156" type="sub_section">
      <SectionTitle>
Proposition 7
</SectionTitle>
      <Paragraph position="0"> If p is a connected, improper PCFG distribution on parses, then its renormalized version p̂ is subcritical.</Paragraph>
      <Paragraph position="1"> Proof We have 0 &lt; p(Ω_S) &lt; 1, and we shall first show, based on the fact that the PCFG is connected, that 0 &lt; p(Ω_A) &lt; 1 for all A. Recall the proof of Corollary 1. There we got the relation q_S ≥ q_A p_S(A ∈ τ), where q_A is the probability that trees rooted in A fail to terminate. Because the PCFG is connected, S is reachable from A, too. By the same argument, we also have q_A ≥ q_S p_A(S ∈ τ). Since both q_S and p_A(S ∈ τ) &gt; 0, q_A &gt; 0, and then p(Ω_A) = 1 - q_A &lt; 1. Similarly, we can prove p(Ω_A) ≥ p(Ω_S) p_A(S ∈ τ) &gt; 0. For each A, define generating functions {g_A} as in Harris (1963, Section 2.2),</Paragraph>
      <Paragraph position="3"> It is easy to see that g_A(0) is the total probability of parses with root A and height 1. By induction, g_A^{(n)}(0) is the total probability of parses with root A and height ≤ n.</Paragraph>
      <Paragraph position="5"> Therefore, r is a nonnegative solution of g(s) = s. It is also the smallest among such solutions. That is, if there is another nonnegative solution r' ≠ r, then r &lt; r'. This is because 0 ≤ r' implies g^{(n)}(0) ≤ g^{(n)}(r') = r' for all n &gt; 0, and by letting n → ∞, r ≤ r'.</Paragraph>
      <Paragraph position="6"> Clearly, 1 is also a solution of g(s) = s.</Paragraph>
      <Paragraph position="7"> We now renormalize p to get p̂ by (22). Define generating functions f = {f_A} of p̂ and f^{(n)} in the same way as (32) and (33). Then

f_A(s_1, ..., s_{|N|}) = g_A(r_1 s_1, ..., r_{|N|} s_{|N|}) / r_A.

Because r is the smallest nonnegative solution of g(s) = s, by the above equation, 1 is the only solution of f(s) = s in the unit cube. Since g(s) = s also has a solution 1, f(s) = s has a solution 1/r, which is strictly larger than 1.</Paragraph>
      <Paragraph position="8"> We want to know how f changes on the line segment connecting 1 and 1/r. Let u = 1/r - 1 and

h(t) = f(1 + tu) - (1 + tu).</Paragraph>
      <Paragraph position="10"> Differentiate h at t = 0. Then h'(0) = Mu - u, where M is the mean matrix corresponding to p̂. Every h_A(t) is a convex function. Then, because h_A(0) = h_A(1) = 0, h'_A(0) ≤ 0, which leads to Mu ≤ u.</Paragraph>
      <Paragraph position="11"> We now show that for at least one A, (Mu)_A &lt; u_A. First of all, note that h'_A(0) = 0 only if h_A(t) is linear. Assume Mu = u, which leads to h'(0) = 0 and the linearity of h(t). Together with h(0) = 0, this implies h(t) ≡ 0. Choose t &lt; 0 such that 1 + tu_A &gt; 0 for all A. Then f(1 + tu) - 1 - tu = h(t) = 0. Therefore 1 + tu is a nonnegative solution of f(s) = s and is strictly less than 1. This contradicts the fact that 1 is the smallest nonnegative solution of f(s) = s.</Paragraph>
      <Paragraph position="12"> Now we have Mu ≤ u and, for at least one A, (Mu)_A &lt; u_A. Because p is connected, M is irreducible. By item (3) of Theorem 1, u is strictly positive, and there is a strictly positive left eigenvector ν such that νM = ρν. Therefore νMu &lt; νu, or ρνu &lt; νu. Hence ρ &lt; 1. This completes the proof. □

7. Identifiability and Approximation of Production Probabilities of PCFGs

Identifiability of parameters is related to the consistency of estimates, both being important statistical issues. Proving the consistency of the ML estimate of a system of production probabilities given in (5) is relatively straightforward. Consistency in this case means that, if p imposes a proper distribution, then as the size of the data, composed of independent and identically distributed (i.i.d.) samples, goes to infinity, the estimate p̂ converges to p with probability one. To see this, think of the sample parses as taken independently from a branching process governed by p. By the context-free nature of the branching process, for A ∈ N, each instance of A selects a production A → α with probability p(A → α) independently of the other instances of A. As the size of the data set goes to infinity, the number of occurrences of A goes to infinity. Therefore, by the law of large numbers, the ratio between the number of occurrences of A → α and the number of occurrences of A, which is p̂(A → α), converges to p(A → α) with probability one.</Paragraph>
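      <Paragraph> The consistency argument is easy to simulate: draw parses from a subcritical PCFG by running the branching process, then compare relative frequencies of rule uses with the true probabilities. The encoding of the grammar S → SS | a with P(S → SS) = 0.4 below is illustrative; the sketch only demonstrates the law-of-large-numbers behavior described above.

import random
from collections import defaultdict

RULES = {"S": [(("S", "S"), 0.4), (("a",), 0.6)]}   # S -> S S | a, subcritical

def sample_counts(symbol, counts):
    """Expand `symbol` by the branching process and record how often each rule fires."""
    if symbol not in RULES:                 # terminal symbol
        return
    r = random.random()
    acc = 0.0
    for rhs, prob in RULES[symbol]:
        acc += prob
        if acc >= r:
            counts[(symbol, rhs)] += 1
            for child in rhs:
                sample_counts(child, counts)
            return

random.seed(0)
counts = defaultdict(int)
for _ in range(20000):                      # 20,000 i.i.d. sample parses
    sample_counts("S", counts)
total = sum(counts.values())                # number of rewritten occurrences of S
for rule, c in counts.items():
    print(rule, round(c / total, 3))        # close to 0.4 and 0.6
      </Paragraph>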
      <Paragraph position="13"> By the consistency of the ML estimate of a system of production probabilities, we can prove that production probabilities are identifiable parameters of PCFGs. In other words, different systems of production probabilities impose different PCFG distributions. Proposition 8 If p_1, p_2 impose distributions P_1, P_2, respectively, and p_1 ≠ p_2, then P_1 ≠ P_2. Proof Assume P_1 = P_2. Then draw n i.i.d. samples from P_1. Because the ML estimator p̂ is consistent, as n → ∞, p̂ → p_1 with probability 1. Because the n i.i.d. samples can also be regarded as drawn from P_2, by the same argument, p̂ → p_2 with probability 1. Hence p_1 = p_2, a contradiction. □ We mentioned in Section 5 that the ML estimators (24) and (26) may produce infinitely many estimates if the Gibbs distributions on parses have the form (19). This phenomenon of multiple solutions results from the nonidentifiability of parameters of the Gibbs distributions (19), which means that different parameters may yield the same distributions.</Paragraph>
      <Paragraph position="14"> To see why parameters of Gibbs distribution (19) are nonidentifiable, we note that the frequencies of production rules are linearly dependent,</Paragraph>
      <Paragraph position="16"> Therefore, there exists λ_0 ≠ 0 such that for any τ, λ_0 · f(τ) = 0. If λ̂ is a solution of (24), then for any number t,</Paragraph>
      <Paragraph position="18"> Thus for any t, λ̂ + tλ_0 is also a solution of (24). This shows that the parameters of the Gibbs distribution (19) are nonidentifiable.</Paragraph>
      <Paragraph position="19"> Finally, we consider how to approximate production probabilities by mean frequencies of productions. Given i.i.d. samples of parses τ_1, ..., τ_n from the distribution imposed by p, by the consistency of the ML estimate of p given by (5),

Σ_{i=1}^n f(A → α; τ_i) / Σ_{i=1}^n f(A; τ_i) → p(A → α)</Paragraph>
    </Section>
    <Section position="2" start_page="156" end_page="158" type="sub_section">
      <Paragraph position="0"> with probability 1, as n → ∞. If the entropy of the distribution p is finite, then for every production rule (A → α) ∈ R,

p(A → α) = E_p f(A → α; τ) / E_p f(A; τ).</Paragraph>
      <Paragraph position="2"> If the entropy is infinite, the above argument does not work, because both the numerator and the denominator of the fraction are infinite. Can we change the fraction a little bit so that it still makes sense, and at the same time yields a good approximation to p(A → α)? One way to do this is to pick a large finite subset Ω' of Ω and replace the fraction ... bottom of the fraction is positive. Therefore the conditional expectation of f is finite. The conditional expectation E_p(f(A; τ) | τ ∈ Ω') is similarly defined. The following result shows that as Ω' expands, the approximation gets better. Proposition 9 Suppose a system of production probabilities p imposes a proper distribution. Then for any increasing sequence of finite subsets Ω_n of Ω with Ω_n ↑ Ω, i.e., Ω_1 ⊂ Ω_2 ⊂ ... ⊂ Ω, Ω_n finite and ∪_n Ω_n = Ω,

E_p(f(A → α; τ) | τ ∈ Ω_n) / E_p(f(A; τ) | τ ∈ Ω_n) → p(A → α).</Paragraph>
      <Paragraph position="4"> To prove the proposition, we introduce the Kullback-Leibler divergence. For any two probability distributions p and q on Ω, the Kullback-Leibler divergence between p and q is defined as

D(p||q) = Σ_{τ∈Ω} p(τ) log [p(τ) / q(τ)],

where 0 log(0/q(τ)) is defined as 0 for any q(τ) ≥ 0. D(p||q) is nonnegative and equal to 0 if and only if p = q. One thing to note is that q need not be proper in order to make D(p||q) nonnegative. Even when Σ_τ q(τ) &lt; 1, it is still true that D(p||q) &gt; 0. For more about the Kullback-Leibler divergence, we refer the readers to Cover and Thomas (1991).</Paragraph>
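      <Paragraph> For finite distributions represented as dictionaries, the definition translates directly into a few lines of Python; the convention that 0 log 0 = 0 is handled by skipping zero-probability outcomes. This is an illustrative helper, not code from the paper.

import math

def kl_divergence(p, q):
    """D(p || q) = sum over tau of p(tau) * log(p(tau) / q(tau)), with 0 log 0 = 0.
    p and q map outcomes to probabilities; q need not sum to 1."""
    d = 0.0
    for tau, p_tau in p.items():
        if p_tau != 0.0:
            d += p_tau * math.log(p_tau / q[tau])
    return d

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.4, "b": 0.4}          # improper: sums to 0.8
print(kl_divergence(p, q))        # positive (about 0.223), as noted in the text
      </Paragraph>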
      <Paragraph position="5"> The Kullback-Leibler divergence has the simple property described below, which will be used in the proof of Proposition 9.</Paragraph>
      <Paragraph position="6"> ...mization, we consider the function</Paragraph>
    </Section>
    <Section position="3" start_page="158" end_page="159" type="sub_section">
      <Paragraph position="0"> where the unknown coefficients {λ_A}_{A∈N} are called Lagrange multipliers.</Paragraph>
      <Paragraph position="1"> The q that minimizes K_n(q) subject to (37) satisfies ... To see that there is a minimizer of K_n(q) subject to (37), consider the boundary points of the region

{q ≥ 0, Σ_{α s.t. (A→α)∈R} q(A → α) = 1}.

Any boundary point of the region has a component equal to zero, hence for some τ ∈ Ω_n, q(τ) = 0, implying K_n(q) = ∞. Because K_n(q) is a continuous function, K_n must attain its minimum inside the above region, and this minimizer, as has been shown, is p̂_n. We need to show p̂_n → p. Let Ω' = Ω_n and apply Lemma 2 to p(τ|Ω_n) and p̂_n(τ). Since p(Ω_n|Ω_n) = 1, we get 0 ≤ -log p̂_n(Ω_n) ≤ K_n(p̂_n). On the other hand, because p̂_n is the minimizer of K_n, K_n(p̂_n) ≤ K_n(p) = -log p(Ω_n).</Paragraph>
      <Paragraph position="2"> Because Ω_n ↑ Ω and p is proper, p(Ω_n) → 1. Therefore 0 ≤ -log p̂_n(Ω_n) ≤ -log p(Ω_n) → 0. Hence p̂_n(Ω_n) → 1.</Paragraph>
      <Paragraph position="3"> Choose an arbitrary τ ∈ Ω. For all n large enough, τ ∈ Ω_n. Apply Lemma 2 to {τ} and get</Paragraph>
      <Paragraph position="5"> This nearly completes the proof. By the identifiability of production probabilities, p̂_n should converge to p. To make the argument more rigorous, note that by compactness every subsequence of {p̂_n} has a limit point. Let p' be a limit point of a subsequence p̂_{n_i}. For any τ, since p̂_{n_i}(τ) → p(τ), p'(τ) = p(τ). By the identifiability of production probabilities, p' = p. Therefore p is the only limit point of {p̂_n}. This proves p̂_n → p.</Paragraph>
    </Section>
  </Section>
</Paper>