<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1055">
  <Title>Learning Accurate, Compact, and Interpretable Tree Annotation</Title>
  <Section position="4" start_page="433" end_page="436" type="intro">
    <SectionTitle>
2 Learning
</SectionTitle>
    <Paragraph position="0"> To obtain a grammar from the training trees, we want to learn a set of rule probabilities b on latent annotations that maximize the likelihood of the training trees, despite the fact that the original trees lack the latent annotations. The Expectation-Maximization (EM) algorithm allows us to do exactly that.2 Given a sentence w and its unannotated tree T, consider a non-terminal A spanning (r,t) and its children B and C spanning (r,s) and (s,t). Let Ax be a subsymbol of A, By of B, and Cz of C. Then the inside and outside probabilities PIN(r,t,Ax) def= P(wr:t|Ax) and POUT(r,t,Ax) def= P(w1:rAxwt:n) can be computed reencourages sparsity) suggest a large reduction.</Paragraph>
    <Paragraph position="1"> 2Other techniques are also possible; Henderson (2004) uses neural networks to induce latent left-corner parser states. cursively:</Paragraph>
    <Paragraph position="3"> Although we show only the binary component here, of course there are both binary and unary productions that are included. In the Expectation step, one computes the posterior probability of each annotated rule and position in each training set tree T:</Paragraph>
    <Paragraph position="5"> In the Maximization step, one uses the above probabilities as weighted observations to update the rule probabilities: null</Paragraph>
    <Paragraph position="7"> Note that, because there is no uncertainty about the location of the brackets, this formulation of the inside-outside algorithm is linear in the length of the sentence rather than cubic (Pereira and Schabes, 1992).</Paragraph>
    <Paragraph position="8"> For our lexicon, we used a simple yet robust method for dealing with unknown and rare words by extracting a small number of features from the word and then computing appproximate tagging probabilities.3</Paragraph>
    <Section position="1" start_page="433" end_page="434" type="sub_section">
      <SectionTitle>
2.1 Initialization
</SectionTitle>
      <Paragraph position="0"> EM is only guaranteed to find a local maximum of the likelihood, and, indeed, in practice it often gets stuck in a suboptimal configuration. If the search space is very large, even restarting may not be sufficient to alleviate this problem. One workaround is to manually specify some of the annotations. For instance, Matsuzaki et al.</Paragraph>
      <Paragraph position="1"> (2005) start by annotating their grammar with the identity of the parent and sibling, which are observed (i.e.</Paragraph>
      <Paragraph position="2"> not latent), before adding latent annotations.4 If these manual annotations are good, they reduce the search space for EM by constraining it to a smaller region. On the other hand, this pre-splitting defeats some of the purpose of automatically learning latent annotations, 3A word is classified into one of 50 unknown word categories based on the presence of features such as capital letters, digits, and certain suffixes and its tagging probability is given by: P'(word|tag) = k ^P(class|tag) where k is a constant representing P(word|class) and can simply be dropped.</Paragraph>
      <Paragraph position="3"> Rare words are modeled using a combination of their known and unknown distributions.</Paragraph>
      <Paragraph position="5"> each subcategory and their respective probability.</Paragraph>
      <Paragraph position="6"> leaving to the user the task of guessing what a good starting annotation might be.</Paragraph>
      <Paragraph position="7"> We take a different, fully automated approach. We start with a completely unannotated X-bar style grammar as described in Section 1.1. Since we will evaluate our grammar on its ability to recover the Penn Treebank nonterminals, we must include them in our grammar.</Paragraph>
      <Paragraph position="8"> Therefore, this initialization is the absolute minimum starting grammar that includes the evaluation nonterminals (and maintains separate grammar symbols for each of them).5 It is a very compact grammar: 98 symbols,6 236 unary rules, and 3840 binary rules. However, it also has a very low parsing performance: 65.8/59.8 LP/LR on the development set.</Paragraph>
    </Section>
    <Section position="2" start_page="434" end_page="434" type="sub_section">
      <SectionTitle>
2.2 Splitting
</SectionTitle>
      <Paragraph position="0"> Beginning with this baseline grammar, we repeatedly split and re-train the grammar. In each iteration we initialize EM with the results of the smaller grammar, splitting every previous annotation symbol in two and adding a small amount of randomness (1%) to break the symmetry. The results are shown in Figure 3. Hierarchical splitting leads to better parameter estimates over directly estimating a grammar with 2k subsymbols per symbol. While the two procedures are identical for only two subsymbols (F1: 76.1%), the hierarchical training performs better for four subsymbols (83.7% vs. 83.2%). This advantage grows as the number of subsymbols increases (88.4% vs.</Paragraph>
      <Paragraph position="1"> 87.3% for 16 subsymbols). This trend is to be expected, as the possible interactions between the subsymbols grows as their number grows. As an example of how staged training proceeds, Figure 2 shows the evolution of the subsymbols of the determiner (DT) tag, which first splits demonstratives from determiners, then splits quantificational elements from demonstratives along one branch and definites from indefinites along the other.</Paragraph>
      <Paragraph position="2"> 5If our purpose was only to model language, as measured for instance by perplexity on new text, it could make sense to erase even the labels of the Penn Treebank to let EM find better labels by itself, giving an experiment similar to that of Pereira and Schabes (1992).</Paragraph>
      <Paragraph position="3"> 645 part of speech tags, 27 phrasal categories and the 26 intermediate symbols which were added during binarization Because EM is a local search method, it is likely to converge to different local maxima for different runs.</Paragraph>
      <Paragraph position="4"> In our case, the variance is higher for models with few subcategories; because not all dependencies can be expressed with the limited number of subcategories, the results vary depending on which one EM selects first.</Paragraph>
      <Paragraph position="5"> As the grammar size increases, the important dependencies can be modeled, so the variance decreases.</Paragraph>
    </Section>
    <Section position="3" start_page="434" end_page="435" type="sub_section">
      <SectionTitle>
2.3 Merging
</SectionTitle>
      <Paragraph position="0"> It is clear from all previous work that creating more latent annotations can increase accuracy. On the other hand, oversplitting the grammar can be a serious problem, as detailed in Klein and Manning (2003). Adding subsymbols divides grammar statistics into many bins, resulting in a tighter fit to the training data. At the same time, each bin gives a less robust estimate of the grammar probabilities, leading to overfitting. Therefore, it would be to our advantage to split the latent annotations only where needed, rather than splitting them all as in Matsuzaki et al. (2005). In addition, if all symbols are split equally often, one quickly (4 split cycles) reaches the limits of what is computationally feasible in terms of training time and memory usage.</Paragraph>
      <Paragraph position="1"> Consider the comma POS tag. We would like to see only one sort of this tag because, despite its frequency, it always produces the terminal comma (barring a few annotation errors in the treebank). On the other hand, we would expect to find an advantage in distinguishing between various verbal categories and NP types. Additionally, splitting symbols like the comma is not only unnecessary, but potentially harmful, since it needlessly fragments observations of other symbols' behavior. null It should be noted that simple frequency statistics are not sufficient for determining how often to split each symbol. Consider the closed part-of-speech classes (e.g. DT, CC, IN) or the nonterminal ADJP. These symbols are very common, and certainly do contain subcategories, but there is little to be gained from exhaustively splitting them before even beginning to model the rarer symbols that describe the complex inner correlations inside verb phrases. Our solution is to use a split-and-merge approach broadly reminiscent of ISODATA, a classic clustering procedure (Ball and  Hall, 1967).</Paragraph>
      <Paragraph position="2"> To prevent oversplitting, we could measure the utility of splitting each latent annotation individually and then split the best ones first. However, not only is this impractical, requiring an entire training phase for each new split, but it assumes the contributions of multiple splits are independent. In fact, extra subsymbols may need to be added to several nonterminals before they can cooperate to pass information along the parse tree.</Paragraph>
      <Paragraph position="3"> Therefore, we go in the opposite direction; that is, we split every symbol in two, train, and then measure for each annotation the loss in likelihood incurred when removing it. If this loss is small, the new annotation does not carry enough useful information and can be removed. What is more, contrary to the gain in likelihood for splitting, the loss in likelihood for merging can be efficiently approximated.7 Let T be a training tree generating a sentence w.</Paragraph>
      <Paragraph position="4"> Consider a node n of T spanning (r,t) with the label A; that is, the subtree rooted at n generates wr:t and has the label A. In the latent model, its label A is split up into several latent labels, Ax. The likelihood of the data can be recovered from the inside and outside probabilities at n:</Paragraph>
      <Paragraph position="6"> Consider merging, at n only, two annotations A1 and A2. Since A now combines the statistics of A1 and A2, its production probabilities are the sum of those of A1 and A2, weighted by their relative frequency p1 and p2 in the training data. Therefore the inside score of A is:</Paragraph>
      <Paragraph position="8"> Since A can be produced as A1 or A2 by its parents, its outside score is:</Paragraph>
      <Paragraph position="10"> Replacing these quantities in (2) gives us the likelihood Pn(w,T) where these two annotations and their corresponding rules have been merged, around only node n.</Paragraph>
      <Paragraph position="11"> We approximate the overall loss in data likelihood due to merging A1 and A2 everywhere in all sentences wi by the product of this loss for each local change:</Paragraph>
      <Paragraph position="13"> This expression is an approximation because it neglects interactions between instances of a symbol at multiple places in the same tree. These instances, however, are 7The idea of merging complex hypotheses to encourage generalization is also examined in Stolcke and Omohundro (1994), who used a chunking approach to propose new productions in fully unsupervised grammar induction. They also found it necessary to make local choices to guide their likelihood search.</Paragraph>
      <Paragraph position="14"> often far apart and are likely to interact only weakly, and this simplification avoids the prohibitive cost of running an inference algorithm for each tree and annotation. We refer to the operation of splitting annotations and re-merging some them based on likelihood loss as a split-merge (SM) cycle. SM cycles allow us to progressively increase the complexity of our grammar, giving priority to the most useful extensions.</Paragraph>
      <Paragraph position="15"> In our experiments, merging was quite valuable. Depending on how many splits were reversed, we could reduce the grammar size at the cost of little or no loss of performance, or even a gain. We found that merging 50% of the newly split symbols dramatically reduced the grammar size after each splitting round, so that after 6 SM cycles, the grammar was only 17% of the size it would otherwise have been (1043 vs. 6273 subcategories), while at the same time there was no loss in accuracy (Figure 3). Actually, the accuracy even increases, by 1.1% at 5 SM cycles. The numbers of splits learned turned out to not be a direct function of symbol frequency; the numbers of symbols for both lexical and nonlexical tags after 4 SM cycles are given in Table 2.</Paragraph>
      <Paragraph position="16"> Furthermore, merging makes large amounts of splitting possible. It allows us to go from 4 splits, equivalent to the 24 = 16 substates of Matsuzaki et al. (2005), to 6 SM iterations, which take a few days to run on the Penn Treebank.</Paragraph>
    </Section>
    <Section position="4" start_page="435" end_page="436" type="sub_section">
      <SectionTitle>
2.4 Smoothing
</SectionTitle>
      <Paragraph position="0"> Splitting nonterminals leads to a better fit to the data by allowing each annotation to specialize in representing only a fraction of the data. The smaller this fraction, the higher the risk of overfitting. Merging, by allowing only the most beneficial annotations, helps mitigate this risk, but it is not the only way. We can further minimize overfitting by forcing the production probabilities from annotations of the same nonterminal to be similar. For example, a noun phrase in subject position certainly has a distinct distribution, but it may benefit from being smoothed with counts from all other noun phrases. Smoothing the productions of each subsymbol by shrinking them towards their common base symbol gives us a more reliable estimate, allowing them to share statistical strength.</Paragraph>
      <Paragraph position="1"> We perform smoothing in a linear way. The estimated probability of a production px = P(Ax By Cz) is interpolated with the average over all subsymbols of A.</Paragraph>
      <Paragraph position="3"> Here, a is a small constant: we found 0.01 to be a good value, but the actual quantity was surprisingly unimportant. Because smoothing is most necessary when production statistics are least reliable, we expect smoothing to help more with larger numbers of subsymbols.</Paragraph>
      <Paragraph position="4"> This is exactly what we observe in Figure 3, where smoothing initially hurts (subsymbols are quite distinct  and do not need their estimates pooled) but eventually helps (as symbols have finer distinctions in behavior and smaller data support).</Paragraph>
    </Section>
    <Section position="5" start_page="436" end_page="436" type="sub_section">
      <SectionTitle>
2.5 Parsing
</SectionTitle>
      <Paragraph position="0"> When parsing new sentences with an annotated grammar, returning the most likely (unannotated) tree is intractable: to obtain the probability of an unannotated tree, one must sum over combinatorially many annotation trees (derivations) for each tree (Sima'an, 1992).</Paragraph>
      <Paragraph position="1"> Matsuzaki et al. (2005) discuss two approximations.</Paragraph>
      <Paragraph position="2"> The first is settling for the most probable derivation rather than most probable parse, i.e. returning the single most likely (Viterbi) annotated tree (derivation). This approximation is justified if the sum is dominated by one particular annotated tree. The second approximation that Matsuzaki et al. (2005) present is the Viterbi parse under a new sentence-specific PCFG, whose rule probabilities are given as the solution of a variational approximation of the original grammar. However, their rule probabilities turn out to be the posterior probability, given the sentence, of each rule being used at each position in the tree. Their algorithm is therefore the labelled recall algorithm of Goodman (1996) but applied to rules. That is, it returns the tree whose expected number of correct rules is maximal. Thus, assuming one is interested in a per-position score like F1 (which is its own debate), this method of parsing is actually more appropriate than finding the most likely parse, not simply a cheap approximation of it, and it need not be derived by a variational argument. We refer to this method of parsing as the max-rule parser. Since this method is not a contribution of this paper, we refer the reader to the fuller presentations in Goodman (1996) and Matsuzaki et al. (2005). Note that contrary to the original labelled recall algorithm, which maximizes the number of correct symbols, this tree only contains rules allowed by the grammar. As a result, the percentage of complete matches with the max-rule parser is typically higher than with the Viterbi parser. (37.5% vs. 35.8% for our best grammar).</Paragraph>
      <Paragraph position="3"> These posterior rule probabilities are still given by (1), but, since the structure of the tree is no longer known, we must sum over it when computing the inside and outside probabilities:</Paragraph>
      <Paragraph position="5"> For efficiency reasons, we use a coarse-to-fine pruning scheme like that of Caraballo and Charniak (1998).</Paragraph>
      <Paragraph position="6"> For a given sentence, we first run the inside-outside algorithm using the baseline (unannotated) grammar,  ter estimates. Merging reduces the grammar size significantly, while preserving the accuracy and enabling us to do more SM cycles. Parameter smoothing leads to even better accuracy for grammars with high complexity. null producing a packed forest representation of the posterior symbol probabilities for each span. For example, one span might have a posterior probability of 0.8 of the symbol NP, but e[?]10 for PP. Then, we parse with the larger annotated grammar, but, at each span, we prune away any symbols whose posterior probability under the baseline grammar falls below a certain threshold (e[?]8 in our experiments). Even though our baseline grammar has a very low accuracy, we found that this pruning barely impacts the performance of our better grammars, while significantly reducing the computational cost. For a grammar with 479 subcategories (4 SM cycles), lowering the threshold to e[?]15 led to an F1 improvement of 0.13% (89.03 vs. 89.16) on the development set but increased the parsing time by a factor of 16.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>