<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1040">
  <Title>Probabilistic Context-Free Grammar Induction Based on Structural Zeros</Title>
  <Section position="3" start_page="312" end_page="314" type="metho">
    <SectionTitle>
2 Grammar induction
</SectionTitle>
    <Paragraph position="0"> A context-free grammar G = (V,T,S+,P), or CFG in short, consists of a set of non-terminal symbols V , a set of terminal symbols T, a start symbol S+ [?] V , and a set of production P of the form: A - a, where A [?] V and a [?] (V [?] T)[?]. A PCFG is a CFG with a probability assigned to each production.</Paragraph>
    <Paragraph position="1"> Thus, the probabilities of the productions expanding a given non-terminal sum to one.</Paragraph>
    <Section position="1" start_page="312" end_page="313" type="sub_section">
      <SectionTitle>
2.1 Smoothing and factorization
</SectionTitle>
      <Paragraph position="0"> PCFGs induced from the Penn Treebank have many productions with long sequences of non-terminals on the RHS. Probability estimates of the RHS given the LHS are often smoothed by making a Markov assumption regarding the conditional independence of a category on those more than k categories away</Paragraph>
      <Paragraph position="2"> Making such a Markov assumption is closely related to grammar transformations required for certain efficient parsing algorithms. For example, the CYK parsing algorithm takes as input a Chomsky Normal Form PCFG, i.e., a grammar where all productions are of the form X - YZ or X - a, where X, Y , and Z are non-terminals and a a terminal symbol.1. Binarized PCFGs are induced from a treebank whose trees have been factored so that n-ary productions with n&gt;2 become sequences of n[?]1 binary productions. Full right-factorization involves concatenating the final n[?]1 categories from the RHS of an n-ary production to form a new composite non-terminal. For example, the original production NP - DT JJ NN NNS shown in Figure 1(a) is factored into three binary rules, as shown in Figure 1(b). Note that a PCFG induced from such rightfactored trees is weakly equivalent to a PCFG induced from the original treebank, i.e., it describes the same language.</Paragraph>
      <Paragraph position="3"> From such a factorization, one can make a Markov assumption for estimating the production probabilities by simply recording only the labels of the first k children dominated by the composite factored label. Figure 1 (c), (d), and (e) show rightfactored trees of Markov orders 2, 1 and 0 respec- null stated Markov order for all dependencies in the productions, because we are restricting factorization to only produce binary productions. For example, in Figure 1(e), the probability of the  from sections 2-21 of the Penn WSJ Treebank and tested on all sentences of section 24 (no length limit), given weighted k-best POS-tagger output. The second and third columns report the total parsing time in seconds and the number of words parsed per second. The number of non-terminals, |V|, is indicated in the next column. The last three columns show the labeled recall (LR), labeled precision (LP), and F-measure (F).</Paragraph>
      <Paragraph position="4"> as mentioned above, these factorizations reduce the size of the non-terminal set, which in turn improves CYK efficiency. The efficiency benefit of making a Markov assumption in factorization can be substantial, given the reduction of both non-terminals and productions, which improves the grammar constant.</Paragraph>
      <Paragraph position="5"> With standard right-factorization, as in Figure 1(b), the non-terminal set for the PCFG induced from sections 2-21 of the Penn WSJ Treebank grows from its original size of 72 to 10105, with 23220 productions. With a Markov factorization of orders 2, 1 and 0 we get non-terminal sets of size 2492, 564, and 99, and rule production sets of 11659, 6354, and 3803, respectively.</Paragraph>
      <Paragraph position="6"> These reductions in the size of the non-terminal set from the original factored grammar result in an order of magnitude reduction in complexity of the CYK algorithm. One common strategy in statistical parsing is what can be termed an approximate coarse-to-fine approach: a simple PCFG is used to prune the search space to which richer and more complex models are applied subsequently (Charniak, 2000; Charniak and Johnson, 2005). Producing a &amp;quot;coarse&amp;quot; chart as efficiently as possible is thus crucial (Charniak et al., 1998; Blaheta and Charniak, 1999), making these factorizations particularly useful. null</Paragraph>
    </Section>
    <Section position="2" start_page="313" end_page="314" type="sub_section">
      <SectionTitle>
2.2 CYK parser and baselines
</SectionTitle>
      <Paragraph position="0"> To illustrate the importance of this reduction in non-terminals for efficient parsing, we will present base-line parsing results for a development set. For these baseline trials, we trained a PCFG on sections 2-21 of the Penn WSJ Treebank (40k sentences, 936k words), and evaluated on section 24 (1346 sentences, 32k words). The parser takes as input the weighted k-best POS-tag sequences of a final NNS depends on the preceding NN, despite the Markov order-0 factorization. Because of our focus on efficient CYK, we accept these higher order dependencies rather than producing unary productions. Only n-ary rules n&gt;2 are factored. perceptron-trained tagger, using the tagger documented in Hollingshead et al. (2005). The number of tagger candidates k for all trials reported in this paper was 0.2n, where n is the length of the string.</Paragraph>
      <Paragraph position="1"> From the weighted k-best list, we derive a conditional probability of each tag at position i by taking the sum of the exponential of the weights of all candidates with that tag at position i (softmax).</Paragraph>
      <Paragraph position="2"> The parser is an exhaustive CYK parser that takes advantage of the fact that, with the grammar factorization method described, factored non-terminals can only occur as the second child of a binary production. Since the bulk of the non-terminals result from factorization, this greatly reduces the number of possible combinations given any two cells. When parsing with a parent-annotated grammar, we use a version of the parser that also takes advantage of the partitioning of the non-terminal set, i.e., the fact that any given non-terminal has already its parent indicated in its label, precluding combination with any non-terminal that does not have the same parent annotated. null Table 1 shows baseline results for standard right-factorization and factorization with Markov orders 0-2. Training consists of applying a particular grammar factorization to the treebank prior to inducing a PCFG using maximum likelihood (relative frequency) estimation. Testing consists of exhaustive CYK parsing of all sentences in the development set (no length limit) with the induced grammar, then detransforming the maximum likelihood parse back to the original format for evaluation against the reference parse. Evaluation includes the standard PAR-SEVAL measures labeled precision (LP) and labeled recall (LR), plus the harmonic mean (F-measure) of these two scores. We also present a result using parent annotation (Johnson, 1998) with a 2nd-order Markov assumption. Parent annotation occurs prior to treebank factorization. This condition is roughly equivalent to the h = 1,v = 2 in Klein and Manning  (2003b)3.</Paragraph>
      <Paragraph position="3"> From these results, we can see the large efficiency benefit of the Markov assumption, as the size of the non-terminal and production sets shrink. However, the efficiency gains come at a cost, with the Markov order-0 factored grammar resulting in a loss of a full 8 percentage points of F-measure accuracy. Parent annotation provides a significant accuracy improvement over the other baselines, but at a substantial efficiency cost.</Paragraph>
      <Paragraph position="4"> Note that the efficiency impact is not a strict function of either the number of non-terminals or productions. Rather, it has to do with the number of competing non-terminals in cells of the chart. Some grammars may be very large, but less ambiguous in a way that reduces the number of cell entries, so that only a very small fraction of the productions need to be applied for any pair of cells. Parent annotation does just the opposite - it increases the number of cell entries for the same span, by creating entries for the same constituent with different parents. Some non-terminal annotations, e.g., splitting POS-tags by annotating their lexical items, result in a large grammar, but one where the number of productions that will apply for any pair of cells is greatly reduced.</Paragraph>
      <Paragraph position="5"> Ideally, one would obtain the efficiency benefit of the small non-terminal set demonstrated with the Markov order-0 results, while encoding key grammatical constraints whose absence results in an accuracy loss. The method we present attempts to achieve this by using a statistical test to determine structural zeros and modifying the factorization to remove the probability mass assigned to them.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="314" end_page="315" type="metho">
    <SectionTitle>
3 Detecting Structural Zeros
</SectionTitle>
    <Paragraph position="0"> The main idea behind our method for detecting structural zeros is to search for events that are individually very frequent but that do not co-occur.</Paragraph>
    <Paragraph position="1"> For example, consider the Markov order-0 binary rule production in Figure 2. The production NP-NP NP: may be very frequent, as is the NP:-CC NN production, but they never co-occur together, because NP does not conjoin with NN in the Penn Treebank. If the counts of two such events a and b, e.g., NP-NP NP: and NP:-CC NN are very large, but the count of their co-occurrence  ear order of the children, but rather includes the head-child plus one other, whereas our factorization does not involve identification of the head child.</Paragraph>
    <Paragraph position="2">  is zero, then the co-occurrence of a and b can be viewed as a candidate for the list of events that are structurally inadmissible. The probability mass for the co-occurrence of a and b can be removed by replacing the factored non-terminal NP: with NP:CC:NN whenever there is a CC and an NN combining to form a factored NP non-terminal.</Paragraph>
    <Paragraph position="3"> The expansion of the factored non-terminals is not the only event that we might consider. For example, a frequent left-most child of the first child of the production, or a common left-corner POS or lexical item, might never occur with certain productions. For example, 'SBAR-IN S' and 'IN-of' are both common productions, but they never co-occur. We focus on left-most children and left-corners because of the factorization that we have selected, but the same idea could be applied to other possible state splits.</Paragraph>
    <Paragraph position="4"> Different statistical criteria can be used to compare the counts of two events with that of their cooccurrence. This section examines several possible criteria that are presented, for ease of exposition, with general sequences of events. For our specific purpose, these sequences of events would be two rule productions.</Paragraph>
    <Section position="1" start_page="314" end_page="315" type="sub_section">
      <SectionTitle>
3.1 Notation
</SectionTitle>
      <Paragraph position="0"> This section describes several statistical criteria to determine if a sequence of two events should be viewed as a structural zero. These tests can be generalized to longer and more complex sequences, and to various types of events, e.g., word, word class, or rule production sequences.</Paragraph>
      <Paragraph position="1"> Given a corpus C, and a vocabulary S, we denote by ca the number of occurrences of a in C. Let n be the total number of observations in C. We will denote by -a the set {b [?] S : b negationslash= a}. Hence c-a = n[?]ca. Let P(a) = can , and for b [?] S, let P(a|b) = cab cb . Note that c-ab = cb [?]cab.</Paragraph>
    </Section>
    <Section position="2" start_page="315" end_page="315" type="sub_section">
      <SectionTitle>
3.2 Mutual information
</SectionTitle>
      <Paragraph position="0"> The mutual information between two random variables X and Y is defined as</Paragraph>
      <Paragraph position="2"> For a particular event sequence of length two ab, this suggests the following statistic:</Paragraph>
      <Paragraph position="4"> = logcab [?]logca [?]logcb + logn Unfortunately, for cab = 0, I(ab) is not finite. If we assume, however, that all unobserved sequences are given some epsilon1 count, then when cab = 0,</Paragraph>
      <Paragraph position="6"> where K is a constant. Since we need these statistics only for ranking purposes, we can ignore the constant factor.</Paragraph>
    </Section>
    <Section position="3" start_page="315" end_page="315" type="sub_section">
      <SectionTitle>
3.3 Log odds ratio
</SectionTitle>
      <Paragraph position="0"> Another statistic that, like mutual information, is ill-defined with zeros, is the log odds ratio:</Paragraph>
      <Paragraph position="2"> Here again, if cab = 0, log(^th) is not finite. But, if we assign to all unobserved pairs a small count epsilon1, when</Paragraph>
      <Paragraph position="4"/>
    </Section>
    <Section position="4" start_page="315" end_page="315" type="sub_section">
      <SectionTitle>
3.4 Pearson chi-squared
</SectionTitle>
      <Paragraph position="0"> For any i,j [?] S, define ^uij = cicjn . The Pearson chi-squared test of independence is then defined as follows:</Paragraph>
      <Paragraph position="2"> In the case of interest for us, cab = 0 and the statistic simplifies to:</Paragraph>
      <Paragraph position="4"/>
    </Section>
    <Section position="5" start_page="315" end_page="315" type="sub_section">
      <SectionTitle>
3.5 Log likelihood ratio
</SectionTitle>
      <Paragraph position="0"> Pearson's chi-squared statistic assumes a normal or approximately normal distribution, but that assumption typically does not hold for the occurrences of rare events (Dunning, 1994). It is then preferable to use the likelihood ratio statistic which allows us to compare the null hypothesis, that P(b) = P(b|a) = P(b|-a) = cbn , with the hypothesis that P(b|a) = cabca and P(b|-a) = c-abc-a . In words, the null hypothesis is that the context of event a does not change the probability of seeing b. These discrete conditional probabilities follow a binomial distribution, hence the likelihood ratio is</Paragraph>
      <Paragraph position="2"> where B[p,x,y] = px(1 [?] p)y[?]x( yx ). In the special case where cab = 0, P(b|-a) = P(b), and this expression can be simplified as follows:</Paragraph>
      <Paragraph position="4"> The log-likelihood ratio, denoted by G2, is known to be asymptotically X2-distributed. In this case,</Paragraph>
      <Paragraph position="6"> and with the binomial distribution, it has has one degree of freedom, thus the distribution will have asymptotically a mean of one and a standard deviation of [?]2.</Paragraph>
      <Paragraph position="7"> We experimented with all of these statistics.</Paragraph>
      <Paragraph position="8"> While they measure different ratios, empirically they seem to produce very similar rankings. For the experiments reported in the next section, we used the log-likelihood ratio because this statistic is well-defined with zeros and is preferable to the Pearson chi-squared when dealing with rare events.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="315" end_page="318" type="metho">
    <SectionTitle>
4 Experimental results
</SectionTitle>
    <Paragraph position="0"> We used the log-likelihood ratio statistic G2 to rank unobserved events ab, where a [?] P and b [?] V . Let Vo be the original, unfactored non-terminal set, and let a [?] (Vo :)[?] be a sequence of zero or more nonterminal/colon symbol pairs. Suppose we have a frequent factored non-terminal X:aB for X,B [?] Vo.</Paragraph>
    <Paragraph position="1"> Then, if the set of productions X - YX:aA with  A [?] Vo is also frequent, but X - YX:aB is unobserved, this is a candidate structural zero. Similar splits can be considered with non-factored nonterminals. null There are two state split scenarios we consider in this paper. Scenario 1 is for factored non-terminals, which are always the second child of a binary production. For use in Equation 7,</Paragraph>
    <Paragraph position="3"> Scenario 2 is for non-factored non-terminals, which we will split using the leftmost child, the left-corner POS-tag, and the left-corner lexical item, which are easily incorporated into our grammar factorization approach. In this scenario, the non-terminal to be split can be either the left or right child in the binary production. Here we show the counts for the left child case for use in Equation 7:</Paragraph>
    <Paragraph position="5"> A c(Y [aA]) In this case, the possible splits are more complicated than just non-terminals as used in factoring. Here, the first possible split is the left child category, along with an indication of whether it is a unary production. One can further split by including the left-corner tag, and even further by including the left-corner word. For example, a unary S category might be split as follows: first to S[1:VP] if the single child of the S is a VP; next to S[1:VP:VBD] if the left-corner POS-tag is VBD; finally to S[1:VP:VBD:went] if the VBD verb was 'went'.</Paragraph>
    <Paragraph position="6"> Note that, once non-terminals are split by annotating such information, the base non-terminals, e.g., S, implicitly encode contexts other than the ones that were split.</Paragraph>
    <Paragraph position="7"> Table 2 shows the unobserved rules with the largest G2 score, along with the ten non-terminals  productions leading to their addition to the non-terminal set. that these productions suggest for inclusion in our non-terminal set. The highest scoring unobserved production is PP - IN[that] NP. It receives such a high score because the base production</Paragraph>
    <Paragraph position="9"> but they jointly never occur, since 'IN-that' is a complementizer. This split non-terminal also shows up in the second-highest ranked zero, an SBAR with 'that' complementizer and an S child that consists of a unary VP. The unary S-VP production is very common, but never with a 'that' complementizer in an SBAR.</Paragraph>
    <Paragraph position="10"> Note that the fourth-ranked production uses two split non-terminals. The fifth ranked rule presumably does not add much information to aid parsing disambiguation, since the AUX MD tag sequence is unlikely4. The eighth ranked production is the first with a factored category, ruling out coordination between NN and NP.</Paragraph>
    <Paragraph position="11"> Before presenting experimental results, we will mention some practical issues related to the approach described. First, we independently parameterized the number of factored categories to select and the number of non-factored categories to select. This was done to allow for finer control of the amount of splitting of non-terminals of each type.</Paragraph>
    <Paragraph position="12"> To choose 100 of each, every non-terminal was assigned the score of the highest scoring unobserved production within which it occurred. Then the 100 highest scoring non-terminals of each type were added to the base non-terminal list, which originally consisted of the atomic treebank non-terminals and Markov order-0 factored non-terminals.</Paragraph>
    <Paragraph position="13"> Once the desired non-terminals are selected, the training corpus is factored, and non-terminals are split if they were among the selected set. Note, how4In fact, we do not consider splits when both siblings are POS-tags, because these are unlikely to carry any syntactic disambiguation. null  number of non-factored splits for the given run. Points represent different numbers of factored splits.</Paragraph>
    <Paragraph position="14"> ever, that some of the information in a selected non-terminal may not be fully available, requiring some number of additional splits. Any non-terminal that is required by a selected non-terminal will be selected itself. For example, suppose that NP:CC:NP was chosen as a factored non-terminal. Then the second child of any local tree with that non-terminal on the LHS must either be an NP or a factored non-terminal with at least the first child identified as an NP, i.e., NP:NP. If that factored non-terminal was not selected to be in the set, it must be added.</Paragraph>
    <Paragraph position="15"> The same situation occurs with left-corner tags and words, which may be arbitrarily far below the category. null After factoring and selective splitting of nonterminals, the resulting treebank corpus is used to train a PCFG. Recall that we use the k-best output of a POS-tagger to parse. For each POS-tag and lexical item pair from the output of the tagger, we reduce the word to lower case and check to see if the combination is in the set of split POS-tags, in which case we split the tag, e.g., IN[that].</Paragraph>
    <Paragraph position="16"> Figure 3 shows the F-measure accuracy for our trials on the development set versus the number of non-factored splits parameterized for the trial. From this plot, we can see that 500 non-factored splits provides the best F-measure accuracy on the dev set. Presumably, as more than 500 splits are made, sparse data becomes more problematic. Figure 4 shows the development set F-measure accuracy versus the number of words-per-second it takes to parse the development set, for non-factored splits of 0 and 500, at a range of factored split parameterizations.</Paragraph>
    <Paragraph position="17"> With 0 non-factored splits, efficiency is substantially impacted by increasing the factored splits, whereas it can be seen that with 500 non-factored splits, that impact is much less, so that the best performance  lected); (2) 500 non-factored splits, which was the best performing; and (3) four baseline results.</Paragraph>
    <Paragraph position="18"> is reached with both relatively few factored non-terminal splits, and a relatively small efficiency impact. The non-factored splits provide substantial accuracy improvements at relatively small efficiency cost.</Paragraph>
    <Paragraph position="19"> Table 3 shows the 1-best and reranked 50-best results for the baseline Markov order-2 model, and the best-performing model using factored and non-factored non-terminal splits. We present the efficiency of the model in terms of words-per-second over the entire dev set, including the longer strings (maximum length 116 words)5. We used the k-best decoding algorithm of Huang and Chiang (2005) with our CYK parser, using on-demand k-best back-pointer calculation. We then trained a MaxEnt reranker on sections 2-21, using the approach outlined in Charniak and Johnson (2005), via the publicly available reranking code from that paper.6 We used the default features that come with that package. The processing time in the table includes the time to parse and rerank. As can be seen from the trials, there is some overhead to these processes, but the time is still dominated by the base parsing.</Paragraph>
    <Paragraph position="20"> We present the k-best results to demonstrate the benefits of using a better model, such as the one we have presented, for producing candidates for downstream processing. Even with severe pruning to only the top 50 candidate parses per string, which results in low oracle and reranked accuracy for the Markov order-2 model, the best-performing model based on structural zeros achieves a relatively high oracle accuracy, and reaches 88.0 and 87.5 percent F-measure accuracy on the dev (f24) and eval (f23) sets respectively. Note that the well-known Char- null best-performing structural zero model, with 200 factored and 500 non-factored non-terminal splits. 1-best results, plus reranking using a trained version of an existing reranker with 50 candidates. niak parser (Charniak, 2000; Charniak and Johnson, 2005) uses a Markov order-3 baseline PCFG in the initial pass, with a best-first algorithm that is run past the first parse to populate the chart for use by the richer model. While we have demonstrated exhaustive parsing efficiency, our model could be used with any of the efficient search best-first approaches documented in the literature, from those used in the Charniak parser (Charniak et al., 1998; Blaheta and Charniak, 1999) to A[?] parsing (Klein and Manning, 2003a). By using a richer grammar of the sort we present, far fewer edges would be required in the chart to include sufficient quality candidates for the richer model, leading to further downstream savings of processing time.</Paragraph>
  </Section>
class="xml-element"></Paper>