File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/98/j98-2005_abstr.xml

Size: 4,918 bytes

Last Modified: 2025-10-06 13:49:16

<?xml version="1.0" standalone="yes"?>
<Paper uid="J98-2005">
  <Title>Squibs and Discussions: Estimation of Probabilistic Context-Free Grammars</Title>
  <Section position="2" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Context-free grammars (CFG's) are useful because of their relatively broad coverage and because of the availability of efficient parsing algorithms. Furthermore, CFG's are readily fit with a probability distribution (to make probabilistic CFG's--or PCFG's), rendering them suitable for ambiguous languages through the maximum a posteriori rule of choosing the most probable parse.</Paragraph>
    <Paragraph position="1"> For each nonterminal symbol, a (normalized) probability is placed on the set of all productions from that symbol. Unfortunately, this simple procedure runs into an unexpected complication: the language generated by the grammar may have probability less than one. The reason is that the derivation tree may have probability greater than zero of never terminating--some mass can be lost to infinity. This phenomenon is well known and well understood, and there are tests for &amp;quot;tightness&amp;quot; (by which we mean total probability mass equal to one) involving a matrix derived from the expected growth in numbers of symbols generated by the probabilistic rules (see for example Booth and Thompson \[1973\], Grenander \[1976\], and Harris \[1963\]).</Paragraph>
    <Paragraph position="2"> What if the production probabilities are estimated from data? Suppose, for example, that we have a parsed corpus that we treat as a collection of (independent) samples from a grammar. It is reasonable to hope that if the trees in the sample are finite, then an estimate of production probabilities based upon the sample will produce a system that assigns probability zero to the set of infinite trees. For example, there is a simple maximum-likelihood prescription for estimating the production probabilities from a corpus of trees (see Section 2), resulting in a PCFG. Is it tight? If the corpus is unparsed then there is an iterative approach to maximum-likelihood estimation (the EM or Baum-Welsh algorithm--again, see Section 2) and the same question arises: do we get actual probabilities or do the estimated PCFG's assign some mass to infinite trees? We will show that in both cases the estimated probability is tight. 2  see Section 2.</Paragraph>
    <Paragraph position="3"> Computational Linguistics Volume 24, Number 2 Wetherell (1980) has asked a similar question: a scheme (different from maximum likelihood) is introduced for estimating production probabilities from an unparsed corpus, and it is conjectured that the resulting system is tight. (Wetherell and others use the designation &amp;quot;consistent&amp;quot; instead of &amp;quot;tight,&amp;quot; but in statistics, consistency refers to the asymptotic correctness of an estimator.) A trivial example is the CFG with one nonterminal and one terminal symbol, in Chomsky normal form: A ~ AA a ~ a where a is the only terminal symbol. Assign probability p to the first production (A ~ AA) and q = 1 -p to the second (A ~ a). Let Sh be the total probability of all trees with depth less than or equal to h. For example, $2 = q corresponding to A ~ a, and $3 = q + pq2 corresponding to {A ~ a} tO {A ~ AA, A --~ a,A --~ a}. In general, Sh+l = q + pSi. (Condition on the first production: with probability q the tree terminates and with probability p it produces two nonterminal symbols, each of which must now terminate with depth less than or equal to h.) It is not hard to show that Sh is nondecreasing and converges to min(1, I), meaning that a proper probability is a obtained if and only if p &lt; ~.</Paragraph>
    <Paragraph position="4"> What if p is estimated from data? Given a set of finite parse trees wl, w2 ..... w,, the maximum-likelihood estimator for p (see Section 2) is, sensibly enough, the &amp;quot;relative frequency&amp;quot; estimator</Paragraph>
    <Paragraph position="6"> where f(.;w) is the number of occurrences of the production &amp;quot;.&amp;quot; in the tree w. The sentence a m, although ambiguous (there are multiple parses when m &gt; 2), always involves m - 1 of the A ~ AA productions and m of the A ~ a productions. Hence</Paragraph>
    <Paragraph position="8"> for each wi, and ~ &lt; 1/2. The maximum-likelihood probability is tight.</Paragraph>
    <Paragraph position="9"> If only the yields (left-to-right sequence of terminals) Y(o;1), Y(w2) ..... Y(wn) are available, the EM algorithm can be used to iteratively &amp;quot;climb&amp;quot; the likelihood surface (see Section 2). In the simple example here, the estimator converges in one step and is the same ~ as if we had observed the entire parse tree for each wi. Thus, ~ is again less than 1/2 and the distribution is again tight.</Paragraph>
  </Section>
</Paper>