<?xml version="1.0" standalone="yes"?> <Paper uid="J99-1004"> <Title>Statistical Properties of Probabilistic Context-Free Grammars</Title> <Section position="2" start_page="0" end_page="136" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> This article proves a number of useful properties of probabilistic context-free grammars (PCFGs). In this section, we give an introduction to the results and related topics.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Assignment of Proper PCFG Distributions </SectionTitle> <Paragraph position="0"> Finite parse trees, or parses, generated by a context-free grammar (CFG) can be equipped with a variety of probability distributions. The simplest way to do this is by production probabilities. First, for each nonterminal symbol in the CFG, a probability distribution is placed on the set of all productions from that symbol. Then each finite parse tree is allocated a probability equal to the product of the probabilities of all productions in the tree. More specifically, denote a finite parse tree by τ. For any production rule A → α of the CFG, let f(A → α; τ) be the number of times it occurs in τ. Let R be the set of all production rules. Then

$p(\tau) = \prod_{(A \to \alpha) \in R} p(A \to \alpha)^{f(A \to \alpha;\, \tau)}. \qquad (1)$

A CFG with a probability distribution on its parses assigned in this way is called a probabilistic context-free grammar (PCFG) (Booth and Thompson 1973; Grenander 1976), and the probability distribution is called a PCFG distribution. A PCFG may be improper, i.e., the total probability of parses may be less than one. For instance, consider the CFG in Chomsky normal form

$S \to SS, \qquad S \to a,$

where S is the only nonterminal symbol and a is the only terminal symbol. If p(S → SS) = p, then p(S → a) = 1 - p. Let $x_h$ be the total probability of all parses with height no larger than h. Clearly, $x_h$ is increasing. It is not hard to see that $x_{h+1} = 1 - p + p x_h^2$. Therefore, the limit of $x_h$, which is the total probability of all parses, is a solution of the equation $x = 1 - p + p x^2$. The equation has two solutions: 1 and 1/p - 1. It can be shown that x is the smaller of the two: x = min(1, 1/p - 1). Therefore, if p > 1/2, then x < 1, an improper probability.</Paragraph> <Paragraph position="1"> How to assign proper production probabilities is quite a subtle problem. A sufficient condition for proper assignment is established by Chi and Geman (1998), who prove that production probabilities estimated by the maximum-likelihood (ML) estimation procedure (or the relative frequency estimation procedure, as it is called in computational linguistics) always impose proper PCFG distributions. Without much difficulty, this result can be generalized to a simple procedure, called the "relative weighted frequency" method, which assigns proper production probabilities to PCFGs. We will give more details of the generalization in Section 3 and summarize the method in Proposition 1.</Paragraph>
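The recursion above is easy to check numerically. The following minimal sketch (an editorial illustration, not part of the original analysis) iterates $x_{h+1} = 1 - p + p x_h^2$ for the toy grammar S → SS | a and compares the limit with min(1, 1/p - 1); the function name, the iteration count, and the sample values of p are arbitrary choices.

```python
# Iterate x_{h+1} = 1 - p + p * x_h^2 for the grammar S -> SS | a and
# compare the limit with the closed form min(1, 1/p - 1) from the text.

def total_parse_probability(p, iterations=20000):
    """Total probability of all finite parses, obtained as the limit of x_h."""
    x = 0.0
    for _ in range(iterations):
        x = 1.0 - p + p * x * x
    return x

for p in (0.3, 0.5, 0.7):
    print(p, total_parse_probability(p), min(1.0, 1.0 / p - 1.0))
# For p > 1/2 the total probability is below 1, so the PCFG is improper.
# (At the critical value p = 1/2 the iteration approaches 1 only slowly.)
```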
</Section> <Section position="2" start_page="0" end_page="132" type="sub_section"> <SectionTitle> 1.2 Entropy and Moments of Parse Tree Sizes </SectionTitle> <Paragraph position="0"> As a probabilistic model for languages, the PCFG model has several important statistical properties, among which is the entropy of the PCFG distribution on parses. Entropy is a measure of the uncertainty of a probability distribution: the larger its entropy, the less one can learn about parses randomly sampled from the distribution. As an example, suppose we have a set S of N parses (or any objects) τ_1, ..., τ_N, where N is very large. We may ask how much one can learn from the sentence "τ is a random sample from S." At one extreme, let the distribution on S be p(τ_1) = 1 and p(τ_i) = 0 for i ≠ 1. Then, because τ = τ_1 with probability one, there is no uncertainty about the sample. In other words, we can get full information from the above sentence. At the other extreme, suppose the distribution is p(τ_1) = ... = p(τ_N) = 1/N. In this case, all the elements of S are statistically equivalent. No specific information is given about τ that would make it possible to single it out from S. Greater effort is required (for example, enumerating all the elements of S) to find what τ is. Since S is big, the uncertainty about the sample is then much greater. Correspondingly, for the two cases, the entropy is 0 and log N >> 0, respectively.</Paragraph> <Paragraph position="1"> Entropy plays a central role in the theory of information. For an excellent exposition of this theory, we refer the reader to Cover and Thomas (1991). The theory has been applied in probabilistic language modeling (Mark, Miller, and Grenander 1996; Mark et al. 1996; Johnson 1998), natural language processing (Berger, Della Pietra, and Della Pietra 1996; Della Pietra, Della Pietra, and Lafferty 1997), as well as computational vision (Zhu, Wu, and Mumford 1997). In addition, all the models proposed in these articles are based on an important principle called the maximum entropy principle.</Paragraph> <Paragraph position="2"> Briefly, the maximum entropy principle says that among all the distributions that satisfy the same given conditions, the one that achieves the largest entropy should be the model of choice. For a distribution p on parses, its entropy is

$H(p) = -\sum_{\tau} p(\tau) \log p(\tau).$

In order for the maximum entropy principle to make sense, all the candidate distributions should have finite entropies, and this is usually implicitly assumed.</Paragraph> <Paragraph position="3"> Take Mark, Miller, and Grenander's (1996) model, for example. First, a PCFG distribution p is selected to serve as a "reference" distribution on parses. Then, by invoking the minimum relative entropy principle, which is a variant of the maximum entropy principle, the distribution q that minimizes

$D(q \,\|\, p) = \sum_{\tau} q(\tau) \log \frac{q(\tau)}{p(\tau)},$

subject to a set of constraints incorporating context-sensitive features, is chosen to be the distribution of the model. It is then easy to see that the assumption that H(q) is finite is necessary.</Paragraph>
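As a quick check of the two extreme cases discussed at the beginning of this subsection, here is a minimal editorial sketch that evaluates H(p) directly; the set size N is an arbitrary choice made for the example.

```python
# H(p) = -sum_tau p(tau) log p(tau): 0 for a point mass, log N for the
# uniform distribution on N parses.
import math

def entropy(probs):
    return -sum(q * math.log(q) for q in probs if q > 0.0)

N = 100000
point_mass = [1.0] + [0.0] * (N - 1)   # all mass on tau_1
uniform = [1.0 / N] * N                # every parse equally likely

print(entropy(point_mass))             # zero uncertainty (prints -0.0)
print(entropy(uniform), math.log(N))   # both equal log N >> 0
```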
<Paragraph position="4"> Conceptually, having finite entropy is a basic requirement for a "good" probabilistic model, because a probability distribution with infinite entropy has too much uncertainty to be informative.</Paragraph> <Paragraph position="5"> Problems regarding the entropies of PCFGs are relatively easy to tackle because they can be studied analytically. Several authors have reported results on this subject, including Miller and O'Sullivan (1992), who gave analytical results on the rates of entropies of improper PCFGs. It is worthwhile to add a few more results on entropies of proper PCFGs. In this paper, we show that the entropies of PCFG distributions imposed by production probabilities assigned by the relative weighted frequency method are finite (Section 4, Corollary 2).</Paragraph> <Paragraph position="6"> In addition to entropy, we will also study the moments of the sizes of parses. The moments are of statistical interest because they give information on how the sizes of parses are distributed. For PCFG distributions, the first moment of parse size, i.e., the mean size of parses, is directly linked with the entropy: the mean size of parses is finite if and only if the entropy is. The second moment of sizes is another familiar quantity. The difference between the second moment and the mean squared is the variance of sizes, which tells us how "scattered" the sizes of parses are around the mean. Proposition 2 shows that, under distributions imposed by production probabilities assigned by the relative weighted frequency method, the sizes of parses have finite moments of every order.</Paragraph> </Section> <Section position="3" start_page="132" end_page="133" type="sub_section"> <SectionTitle> 1.3 Gibbs Distributions on Parses and Renormalization of Improper PCFGs </SectionTitle> <Paragraph position="0"> Besides PCFG distributions, a CFG can be equipped with many other types of probability distributions. Among the most widely studied is the Gibbs distribution (Mark, Miller, and Grenander 1996; Mark et al. 1996; Mark 1997; Abney 1997). Gibbs distributions arise naturally by invoking the maximum entropy principle. They are considered to be more powerful than PCFG distributions because they incorporate more features of natural languages, especially context-sensitive features, whereas PCFG distributions only consider the frequencies of production rules. On the other hand, Gibbs distributions are not always superior to PCFG distributions. A Gibbs distribution with only the frequencies of production rules in a parse as its features turns into a PCFG. More specifically, we will show in Proposition 4 in Section 5 that a CFG equipped with a Gibbs distribution of the form

$q(\tau) = \frac{1}{Z_\lambda} \exp\Big( \sum_{(A \to \alpha) \in R} \lambda_{A \to \alpha}\, f(A \to \alpha; \tau) \Big) \qquad (2)$

is actually a PCFG, and we can get the production probabilities of the PCFG explicitly from the Gibbs form.</Paragraph> <Paragraph position="1"> The fact that a Gibbs distribution of the form in (2) is imposed by production probabilities has a useful consequence. Suppose p is an improper PCFG distribution. If we write the sum of p over all parses as Z, and assign to each parse tree a new probability equal to p(τ)/Z, then we renormalize p to a Gibbs distribution $\tilde{p}$ on parses. What (2) implies is that $\tilde{p}$ is also a PCFG distribution (Corollary 3). Moreover, in Section 6 we will show that, under certain conditions, $\tilde{p}$ is subcritical.</Paragraph>
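To see concretely why a Gibbs distribution whose only features are production frequencies collapses into a PCFG-style product, here is a small editorial sketch; the grammar, the λ values, and the frequency profile are made-up examples, and the notation follows the reconstructed form (2) above.

```python
# exp(sum_r lambda_r * f(r; tau)) == prod_r (e**lambda_r) ** f(r; tau):
# each production contributes a fixed multiplicative weight, which is the
# PCFG product form up to the global normalization constant.
import math

lam = {"S->SS": -0.9, "S->a": -0.2}      # hypothetical feature weights
freq = {"S->SS": 2, "S->a": 3}           # f(r; tau) for one illustrative parse

gibbs_weight = math.exp(sum(lam[r] * freq[r] for r in freq))
pcfg_style = math.prod(math.exp(lam[r]) ** freq[r] for r in freq)

print(gibbs_weight, pcfg_style)          # identical up to rounding
```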
<Paragraph position="2"> There is another issue concerning the relation between PCFG distributions and Gibbs distributions of the form in (2), from a statistical point of view. Although PCFG distributions are special cases of Gibbs distributions, in the sense that the former can be written in the form of the latter, PCFG distributions cannot be put in the framework of Gibbs distributions if the two have different parameter estimation procedures. We will compare the maximum-likelihood (ML) estimation procedures for these two distributions. As will be seen in Section 5, numerically the two estimation procedures are different. However, Corollary 4 shows that they are equivalent in the sense that the estimates obtained by the two procedures impose the same distributions on parses. For this reason, a Gibbs distribution may be considered a generalization of a PCFG, not only in form, but also in a certain statistical sense.</Paragraph> </Section> <Section position="4" start_page="133" end_page="133" type="sub_section"> <SectionTitle> 1.4 Branching Rates of PCFGs </SectionTitle> <Paragraph position="0"> Because of their context-free nature, PCFG distributions can also be studied from the perspective of stochastic processes. A PCFG can be described by a random branching process (Harris 1963), and its asymptotic behavior can be characterized by its branching rate. A branching process, or its corresponding PCFG, is called subcritical (critical, supercritical) if its branching rate is < 1 (= 1, > 1). A subcritical PCFG is always proper, whereas a supercritical PCFG is always improper. Many asymptotic properties of supercritical branching processes were established by Miller and O'Sullivan (1992).</Paragraph> <Paragraph position="1"> Chi and Geman (1998) proved the properness of PCFG distributions imposed by estimated production probabilities, and around the same time Sánchez and Benedí (1997) established the subcriticality of the corresponding branching processes, and hence their properness. In this paper we explore properties of the branching rate further. First, in Proposition 5, we show that if a PCFG distribution is imposed by production probabilities assigned by the relative weighted frequency method, then the PCFG is subcritical. This result generalizes that of Sánchez and Benedí (1997), and has a less involved proof. Then, in Proposition 7, we demonstrate that a connected and improper PCFG, after being renormalized, becomes a subcritical PCFG.</Paragraph>
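The renormalization claim can be previewed numerically on the toy grammar S → SS | a from Section 1.1. The closed-form renormalized probabilities used below, q(S → SS) = pZ and q(S → a) = (1 - p)/Z with Z = 1/p - 1, are an editorial assumption for this particular grammar; the sketch checks them by comparing q(τ) with p(τ)/Z on small parses, and then reads off the branching rate of the renormalized grammar as 2·q(S → SS).

```python
# Renormalize the improper toy PCFG S -> SS (prob p), S -> a (prob 1 - p)
# for p > 1/2 and check that the result is a proper, subcritical PCFG.
p = 0.7
Z = 1.0 / p - 1.0                       # total probability of finite parses

q_SS = p * Z                            # assumed renormalized probabilities
q_a = (1.0 - p) / Z
assert abs(q_SS + q_a - 1.0) < 1e-12    # proper assignment for the nonterminal S

for n in range(6):                      # a parse with n uses of S -> SS ...
    p_tau = p**n * (1.0 - p)**(n + 1)   # ... has n + 1 uses of S -> a
    q_tau = q_SS**n * q_a**(n + 1)
    assert abs(q_tau - p_tau / Z) < 1e-12   # q(tau) == p(tau) / Z

print(2.0 * q_SS)                       # branching rate 2 - 2p < 1: subcritical
```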
</Section> <Section position="5" start_page="133" end_page="134" type="sub_section"> <SectionTitle> 1.5 Identifiability and Approximation of Production Probabilities </SectionTitle> <Paragraph position="0"> Returning to the statistical aspects of PCFGs, we will discuss the identifiability of the production probabilities of PCFGs as well as of the parameters of Gibbs distributions. Briefly speaking, the production probabilities of PCFGs are identifiable, which means that different production probabilities always impose different distributions on parses (Proposition 8). In contrast, for the Gibbs distribution given by (2), the λ parameters are not identifiable; in fact, there are infinitely many different λ that impose the same Gibbs distribution.</Paragraph> <Paragraph position="1"> Finally, in Proposition 9, we propose a method to approximate production probabilities. Perhaps the most interesting part of this result lies in its proof, which is largely information theoretic. We apply the Kullback-Leibler divergence, which is the information distance between two probability distributions, to prove the convergence of the approximation. In the information theory literature, the Kullback-Leibler divergence is also called the relative entropy. We also use Lagrange multipliers to solve the constrained minimization problem involved. Both the Kullback-Leibler divergence and the method of Lagrange multipliers are becoming increasingly useful in statistical modeling, e.g., modeling based on the maximum entropy principle.</Paragraph>
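For readers less familiar with the quantity, the following editorial sketch evaluates the Kullback-Leibler divergence D(q || p) on two made-up distributions over a finite set of parses; it is nonnegative and equals zero exactly when the two distributions coincide, which is what makes it usable as an information distance in convergence arguments.

```python
# D(q || p) = sum_tau q(tau) * log(q(tau) / p(tau)); also called the
# relative entropy between q and p.
import math

def kl_divergence(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

q = [0.5, 0.3, 0.2]          # hypothetical distribution over three parses
p = [0.4, 0.4, 0.2]          # hypothetical reference distribution

print(kl_divergence(q, p))   # > 0
print(kl_divergence(q, q))   # 0.0
```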
</Section> <Section position="6" start_page="134" end_page="136" type="sub_section"> <SectionTitle> 1.6 Summary </SectionTitle> <Paragraph position="0"> As a simple probabilistic model, the PCFG model is applied to problems in linguistics and pattern recognition that do not involve much context sensitivity. To design sensible PCFG distributions for such problems, it is necessary to understand some of the statistical properties of the distributions. On the other hand, the PCFG model serves as a basis for more expressive linguistic models. For example, many Gibbs distributions are built upon PCFG distributions by defining

$P(\tau) = \frac{1}{Z}\, p(\tau)\, \exp\Big( \sum_i \lambda_i\, f_i(\tau) \Big),$

where p is a PCFG distribution. Therefore, in order for the Gibbs distribution P to have certain desired statistical properties, it is necessary for p to have those properties first. This paper concerns some of the fundamental properties of PCFGs. However, the methods used in the proofs are also useful for the study of statistical issues in other probabilistic models.</Paragraph> <Paragraph position="1"> This paper proceeds as follows: In Section 2, we gather the notations for PCFGs that will be used in the remainder of the paper. Section 3 establishes the relative weighted frequency method. Section 4 proves the finiteness of the entropies of PCFG distributions when production probabilities are assigned using the relative weighted frequency method; in addition, the finiteness of the moments of the sizes of parses is proved. Section 5 discusses the connections between PCFG distributions and Gibbs distributions on parses. Renormalization of improper PCFGs is also discussed there. In Section 6, PCFGs are studied from the random branching process point of view. Finally, in Section 7, the identifiability of production probabilities and their approximation are addressed.</Paragraph> <Paragraph position="2"> 2. Notations and Definitions. In this section, we collect the notations and definitions we will use for the remaining part of the paper.</Paragraph> <Paragraph position="3"> Definition 1. A context-free grammar (CFG) G is a quadruple (N, T, R, S), where N is the set of variables, T the set of terminals, R the set of production rules, and S ∈ N is the start symbol. Elements of N are also called nonterminal symbols. (Some of our discussion requires that each sentential form have only finitely many parses; for this reason, we shall assume that G contains no null or unit productions.) N, T, and R are always assumed to be finite. Let Ω denote the set of finite parse trees of G, an element of which is always denoted by τ. For each τ ∈ Ω and each production rule (A → α) ∈ R, define f(A → α; τ) to be the number of occurrences, or frequency, of the rule in τ, and f(A; τ) to be the number of occurrences of A in τ. The two counts are related by

$f(A; \tau) = \sum_{\alpha \ \text{s.t.}\ (A \to \alpha) \in R} f(A \to \alpha; \tau).$

Define h(τ) as the height of τ, which is the number of nonterminal nodes on the longest route from τ's root to its terminal nodes. Define |τ| as the size of τ, which is the total number of nonterminal nodes in τ. For any A ∈ N and any sentential form γ ∈ (N ∪ T)*, define n(A; γ) as the number of instances of A in γ. Define |γ| as the length of the sentential form.</Paragraph> <Paragraph position="4"> Definition 2. Let A ∈ τ denote that the symbol A occurs in the parse τ. If A ∈ τ, let τ_A be the left-most maximum subtree of τ rooted in A, which is the subtree of τ rooted in A satisfying the condition that if τ' ≠ τ_A is also a subtree of τ rooted in A, then τ' is either a subtree of τ_A, or a right sibling of τ_A, or a subtree of a right sibling of τ_A. Let A_τ be the root of τ_A, which is the left-most "shallowest" instance of A in τ.</Paragraph> <Paragraph position="5"> Definition 3. For any two symbols A ∈ N and B ∈ N ∪ T, not necessarily different, B is said to be reachable from A in G if there is a sequence of symbols A_0, A_1, ..., A_n with A_0 = A and A_n = B, and a sequence of sentential forms α_0, ..., α_{n-1}, such that each A_i → α_i is a production in R and each α_i contains the next symbol A_{i+1}. G is called connected if all elements of N ∪ T can be reached from all nonterminal symbols.</Paragraph> <Paragraph position="6"> We now define the probabilistic version of reachability in a CFG. Suppose p is a distribution on Ω. For any two symbols A ∈ N and B ∈ N ∪ T, not necessarily different, B is said to be reachable from A in G under p if there is a τ ∈ Ω with p(τ) > 0 and a subtree τ' of τ such that τ' is rooted in A and B ∈ τ'. G is called connected under p if all symbols in N ∪ T can be reached from all nonterminal symbols under p.</Paragraph> <Paragraph position="7"> Definition 4. A system of production probabilities of G is a function p : R → [0, 1] such that for any A ∈ N,

$\sum_{\alpha \ \text{s.t.}\ (A \to \alpha) \in R} p(A \to \alpha) = 1.$

We will also use p to represent the PCFG probability distribution on parses imposed by p, via the formula

$p(\tau) = \prod_{(A \to \alpha) \in R} p(A \to \alpha)^{f(A \to \alpha;\, \tau)}. \qquad (4)$

Similarly, for any estimated system of production probabilities $\hat{p}$, we will also use $\hat{p}$ to represent the probability distribution on parses imposed by $\hat{p}$. We will write p(Ω) for the total probability of all finite parse trees in Ω.</Paragraph> <Paragraph position="8"> Definition 5. We now introduce a piece of statistical notation. Let p be an arbitrary distribution on Ω and g(τ) a function of τ ∈ Ω. The expected value of g under the distribution p, denoted E_p g(τ), is defined as

$E_p\, g(\tau) = \sum_{\tau \in \Omega} p(\tau)\, g(\tau).$</Paragraph>
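As an editorial illustration of Definitions 4 and 5, the sketch below evaluates the product formula (4) and a truncated expectation E_p|τ| for the toy grammar S → SS | a; the production probabilities, the use of Catalan numbers to count the distinct parses with a given frequency profile, and the truncation point are all choices made for the example.

```python
# p(tau) = prod_r p(r) ** f(r; tau)   (Definition 4), and a truncated
# E_p g(tau) with g(tau) = |tau|      (Definition 5), for S -> SS | a.
import math

prod_probs = {"S->SS": 0.4, "S->a": 0.6}   # sums to 1 for the nonterminal S

def parse_probability(freqs):
    """freqs maps each production rule to its count f(rule; tau)."""
    return math.prod(prod_probs[r] ** n for r, n in freqs.items())

def catalan(n):
    """Number of distinct parses that use S -> SS exactly n times."""
    return math.comb(2 * n, n) // (n + 1)

def truncated_mean_size(max_n=200):
    """E_p |tau| over parses with at most max_n binary nodes; |tau| = 2n + 1."""
    return sum(catalan(n)
               * parse_probability({"S->SS": n, "S->a": n + 1})
               * (2 * n + 1)
               for n in range(max_n + 1))

print(parse_probability({"S->SS": 1, "S->a": 2}))   # 0.4 * 0.6 ** 2 = 0.144
print(truncated_mean_size())                        # ~ 5 = 1 / (1 - 2 * 0.4)
```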
<Paragraph position="9"> Definition 6. All the parse trees we have seen so far are rooted in S. It is often useful to investigate subtrees of parses; therefore it is necessary to consider trees rooted in symbols other than S. We call a tree rooted in A ∈ N a parse (tree) rooted in A if it is generated from A by the production rules in R. Let Ω_A be the set of all finite parse trees with root A. Define p_A as the probability distribution on Ω_A imposed by a system of production probabilities p, via (4). Also extend the notions of height and size of parses to trees in Ω_A. When we write p_A(τ), we always assume that τ is a parse tree rooted in A. We write

$p_A(\Omega_A) = \sum_{\tau \in \Omega_A} p_A(\tau)$

for the total probability of finite parse trees in Ω_A. With no subscripts, Ω and p are assumed to be Ω_S and p_S, respectively.</Paragraph> <Paragraph position="10"> For convenience, we also extend the notion of trees to terminals. For each terminal t ∈ T, define Ω_t as the set consisting of the single "tree" {t}. Define p_t(t) = 1, |t| = 0, and h(t) = 0. For this paper we make the following assumptions:
1. For each symbol A ≠ S, there is at least one parse τ with root S such that A ∈ τ. This guarantees that each A ≠ S can be reached from S.
2. When a system of production probabilities p is not explicitly assigned, each production rule (A → α) ∈ R is assumed to have positive probability, i.e., p(A → α) > 0. This guarantees that there are no useless productions in the PCFG.</Paragraph> </Section> </Section> </Paper>