<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1126">
  <Title>Recovering latent information in treebanks</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Rule-based augmentation
</SectionTitle>
    <Paragraph position="0"> In the interest of reducing the effort required to construct augmentation heuristics, we would like a notation for specifying rules for selecting nodes in bracketed data that is both flexible enough to encode the kinds of rule sets used by existing parsers, and intuitive enough that a rule set for a new language can be written easily without knowledge of computer programming. Such a notation would simplify the task of writing new rule sets, and facilitate experimentation with different rules. Moreover, rules written in this notation would be interchangeable between different models, so that, ideally, adaptation of a model to a new corpus would be trivial.</Paragraph>
    <Paragraph position="1"> We define our notation in two parts: a structure pattern language, whose basic patterns are specifications of single nodes written in a label pattern language.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Structure patterns
</SectionTitle>
      <Paragraph position="0"> Most existing head-finding rules and argument-finding rules work by specifying parent-child relations (e.g., NN is the head of NP, or NP is an argument of VP). A generalization of this scheme that is familiar to linguists and computer scientists alike would be a context-free grammar with rules of the form</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
A → A1 ··· (Ai)^l ··· An,
</SectionTitle>
    <Paragraph position="0"> where the superscript l specifies that if this rule gets used, the ith child of A should be marked with the label l.</Paragraph>
    <Paragraph position="1"> However, there are two problems with such an approach. First, writing down such a grammar would be tedious to say the least, and impossible if we want to handle trees with arbitrary branching factors. So we use an extended CFG (Thatcher, 1967), a CFG whose right-hand sides are regular expressions. Thus we introduce a union operator (∪) and a Kleene star (*) into the syntax for right-hand sides.</Paragraph>
    <Paragraph position="2"> The second problem is that our grammar may be ambiguous. For example, the grammar X → Y^h Y ∪ Y Y^h could mark with an h either the first or the second symbol of YY. So we impose an ordering on the rules of the grammar: if two rules match, the first one wins. In addition, we make the ∪ operator noncommutative: α ∪ β tries to match α first, and tries β only if α does not match, as in Perl. (Thus the above grammar would mark the first Y.) Similarly, α* tries to match as many times as possible, also as in Perl.</Paragraph>
    <Paragraph position="3"> But this creates a third and final problem: in the grammar X → (Y Y^h ∪ Y^h)(Y Y ∪ Y), it is not defined which symbol of YYY should be marked, that is, which union operator takes priority over the other. Perl circumvents this problem by always giving priority to the left. In algebraic terms, concatenation left-distributes over union but does not in general right-distribute over union.</Paragraph>
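    The ordered choice and greedy repetition described here are the same semantics implemented by Python's re module (which follows Perl), so they can be checked directly; this is only an illustrative sketch in which Y is a literal character, not one of our label patterns:

```python
import re

# Ordered alternation: the first alternative that matches wins,
# even if a later alternative could match a longer string.
m = re.match(r"Y|YY", "YY")
assert m.group(0) == "Y"       # the single-Y alternative wins

# Reversing the order lets the two-Y alternative win.
m = re.match(r"YY|Y", "YY")
assert m.group(0) == "YY"

# The star is greedy: it matches as many times as possible,
# leaving nothing for the optional Y that follows.
m = re.match(r"(Y*)(Y?)", "YYY")
assert m.group(1) == "YYY" and m.group(2) == ""
```

    The same ordered-choice behavior is what makes a first-rule-wins ordering on grammar rules well defined.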
    <Paragraph position="4"> However, our solution is to provide a pair of concatenation operators: ⊳, which gives priority to the left operand's preferences, and ⊲, which gives priority to the right operand's:</Paragraph>
    <Paragraph position="6"> (3) VP → ?* ⊳ VB^h ⊳ ?*    (4) VP → ?* ⊲ VB^h ⊲ ?*    where ? is a wildcard pattern which matches any single label (see below). Rule (3) marks with an h the rightmost VB child of a VP, whereas rule (4) marks the leftmost VB. This is because the Kleene star always prefers to match as many times as possible, but in rule (3) the first Kleene star's preference takes priority over the last's, whereas in rule (4) the last Kleene star's preference takes priority over the first's.</Paragraph>
    <Paragraph position="7"> Consider the slightly more complicated examples:</Paragraph>
    <Paragraph position="9"> (5) VP → ?* ⊲ (VB^h ∪ MD^h) ⊲ ?*    (6) VP → (?* ⊲ VB^h ⊲ ?*) ∪ (?* ⊲ MD^h ⊲ ?*)    Rule (5) marks the leftmost child which is either a VB or an MD, whereas rule (6) marks the leftmost VB if there is one, or else the leftmost MD. To see why this is so, consider the string MD VB X. Rule (5) would mark the MD as h, whereas rule (6) would mark the VB. In both rules VB is preferred over MD, and symbols to the left over symbols to the right, but in rule (5) the leftmost preference (that is, the preference of the last Kleene star to match as many times as possible) takes priority, whereas in rule (6) the preference for VB takes priority.</Paragraph>
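    The two head-finding strategies just contrasted can be sketched in plain Python over a list of child labels; the function names and representation here are ours, purely illustrative of the semantics, not an implementation of the pattern language:

```python
def mark_leftmost_of_set(children, categories):
    """Rule-(5) style: mark the leftmost child whose label is in the set."""
    for i, label in enumerate(children):
        if label in categories:
            return i
    return None

def mark_by_priority(children, categories):
    """Rule-(6) style: try each category in priority order, and mark the
    leftmost child of the first category that occurs at all."""
    for cat in categories:
        for i, label in enumerate(children):
            if label == cat:
                return i
    return None

children = ["MD", "VB", "X"]
# The set-based search marks the MD (index 0) ...
assert mark_leftmost_of_set(children, ["VB", "MD"]) == 0
# ... while the priority list marks the VB (index 1).
assert mark_by_priority(children, ["VB", "MD"]) == 1
```

    The priority-list behavior is the one found in most published head-rule tables.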
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Label patterns
</SectionTitle>
      <Paragraph position="0"> Since nearly all treebanks have complex nonterminal alphabets, we need a way of concisely specifying classes of labels. Unfortunately, this will necessarily vary somewhat across treebanks: all we can define that is truly treebank-independent is the ? pattern, which matches any label. For Penn Treebank II style annotation (Marcus et al., 1993), in which a nonterminal symbol is a category together with zero or more functional tags, we adopt the following scheme: the atomic pattern a matches any label with category a or functional tag a; moreover, we define the Boolean operators ∧, ∨, and ¬. Thus NP ∧ ¬ADV matches NP-SBJ but not NP-ADV.</Paragraph>
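      As a sketch, these label-pattern semantics can be evaluated over Penn Treebank style labels in a few lines of Python; the tuple-based pattern representation is an assumption of ours, not part of the notation:

```python
def matches(pattern, label):
    """Evaluate a label pattern against a Penn Treebank style label,
    treated as CATEGORY-TAG-TAG...  Patterns are either the wildcard "?",
    an atom (a category or functional tag name), or a nested tuple
    ("and", p, q), ("or", p, q), or ("not", p)."""
    parts = label.split("-")          # category plus functional tags
    if isinstance(pattern, str):
        if pattern == "?":            # wildcard: matches any label
            return True
        return pattern in parts       # atom: category or functional tag
    op = pattern[0]
    if op == "and":
        return matches(pattern[1], label) and matches(pattern[2], label)
    if op == "or":
        return matches(pattern[1], label) or matches(pattern[2], label)
    if op == "not":
        return not matches(pattern[1], label)
    raise ValueError("unknown operator: %r" % op)

# NP ∧ ¬ADV matches NP-SBJ but not NP-ADV, as in the text.
p = ("and", "NP", ("not", "ADV"))
assert matches(p, "NP-SBJ")
assert not matches(p, "NP-ADV")
```

    A full implementation would also have to cope with coindexation suffixes and other treebank-specific label decorations, which this sketch ignores.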
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Summary
</SectionTitle>
      <Paragraph position="0"> Using the structure pattern language and the label pattern language together, one can fully encode the head/argument rules used by Xia (which resemble (5) above), and the family of rule sets used by Black, Magerman, Collins, Ratnaparkhi, and others (which resemble (6) above). In Collins' version of the head rules, NP and PP require special treatment, but these can be encoded in our notation as well.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="73" type="metho">
    <SectionTitle>
4 Unsupervised learning of augmentations
</SectionTitle>
    <Paragraph position="0"> In the type of approach we have been discussing so far, hand-written rules are used to augment the training data, and this augmented training data is then used to train a statistical model. However, if we train the model by maximum-likelihood estimation, the estimate we get will indeed maximize the likelihood of the training data as augmented by the hand-written rules, but not necessarily that of the training data itself. In this section we explore the possibility of training a model directly on unaugmented data.</Paragraph>
    <Paragraph position="1"> A generative model that estimates P(S, T, T+) (where T+ is an augmented tree) is normally used for parsing, by computing the most likely (T, T+) for a given S. But we may also use it for augmenting trees, by computing the most likely T+ for a given sentence-tree pair (S, T). From the latter perspective, because its trees are unaugmented, a treebank is a corpus of incomplete data, warranting the use of unsupervised learning methods to reestimate a model that includes hidden parameters. The approach we take below is to seed a parsing model using hand-written rules, and then use the Inside-Outside algorithm to reestimate its parameters. The resulting model, which locally maximizes the likelihood of the unaugmented training data, can then be used in two ways: one might hope that as a parser, it would parse more accurately than a model which only maximizes the likelihood of training data augmented by hand-written rules; and that as a tree augmenter, it would augment trees in a more data-sensitive way than hand-written rules.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Background: tree adjoining grammar
</SectionTitle>
      <Paragraph position="0"> The parsing model we use is based on the stochastic tree-insertion grammar (TIG) model described by Chiang (2000). (Note that unlike the noncommutative union operator ∪ of Section 3, the disjunction operator ∨ has no preference for its first argument.) TIG (Schabes and Waters, 1995) is a weakly context-free restriction of tree adjoining grammar (Joshi and Schabes, 1997), in which tree fragments called elementary trees are combined by two composition operations, substitution and adjunction (see Figure 3). In TIG there are certain restrictions on the adjunction operation.</Paragraph>
      <Paragraph position="1"> Chiang's model adds a third composition operation called sister-adjunction (see Figure 3), borrowed from D-tree substitution grammar (Rambow et al.). There is an important distinction between derived trees and derivation trees (see Figure 3). A derivation tree records the operations that are used to combine elementary trees into a derived tree. Thus there is a many-to-one relationship between derivation trees and derived trees: every derivation tree specifies a derived tree, but a derived tree can be the result of several different derivations.</Paragraph>
      <Paragraph position="2"> The model can be trained directly on TIG derivations if they are available, but corpora like the Penn Treebank have only derived trees. Just as Collins uses rules to identify heads and arguments and thereby lexicalize trees, Chiang uses nearly the same rules to reconstruct derivations: each training example is broken into elementary trees, with each head child remaining attached to its parent, each argument broken into a substitution node and an initial root, and each adjunct broken off as a modifier auxiliary tree.</Paragraph>
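      A minimal sketch of this rule-based decomposition, assuming head/argument/adjunct roles have already been assigned to each child; the tree representation and function names are illustrative assumptions of ours, not Chiang's implementation:

```python
def decompose(tree, roles):
    """Split a derived tree into elementary-tree fragments.
    Trees are (label, [children]) pairs; leaves are strings.
    roles maps id(child) -> 'head' | 'arg' | 'adj'."""
    fragments = []

    def walk(node):
        label, children = node
        kept = []
        for child in children:
            if isinstance(child, str):        # lexical leaf stays in place
                kept.append(child)
                continue
            sub = walk(child)
            role = roles[id(child)]
            if role == "head":
                kept.append(sub)              # head child stays attached
            elif role == "arg":
                kept.append((child[0], []))   # leave a substitution node
                fragments.append(sub)         # argument becomes its own tree
            else:                             # adjunct: broken off as a
                fragments.append(sub)         # modifier (auxiliary) tree
        return (label, kept)

    fragments.append(walk(tree))
    return fragments

vb, np, pp = ("VB", ["saw"]), ("NP", ["her"]), ("PP", ["soon"])
frags = decompose(("VP", [vb, np, pp]),
                  {id(vb): "head", id(np): "arg", id(pp): "adj"})
assert ("VP", [("VB", ["saw"]), ("NP", [])]) in frags
```

    The inverse problem, recovering which decomposition produced a given derived tree, is exactly what the Inside-Outside reestimation below addresses.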
      <Paragraph position="3"> However, in this experiment we view the derived trees in the Treebank as incomplete data, and try to reconstruct the derivations (the complete data) using the Inside-Outside algorithm.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Implementation
</SectionTitle>
      <Paragraph position="0"> The expectation step (E-step) of the Inside-Outside algorithm is performed by a parser that computes all possible derivations for each parse tree in the training data. It then computes inside and outside probabilities as in Hwa's experiment (1998), and uses these to compute the expected number of times each event occurred.</Paragraph>
      <Paragraph position="1"> For the maximization step (M-step), we obtain a maximum-likelihood estimate of the parameters of the model using relative-frequency estimation, just as in the original experiment, as if the expected values for the complete data were the training data. (The parameters for sister-adjunction in the present model differ slightly from the original. In the original model, all the modifier auxiliary trees that sister-adjoined at a particular position were generated independently, except that each sister-adjunction was conditioned on whether it was the first at that position. In the present model, each sister-adjunction is conditioned on the root label of the previous modifier tree.)</Paragraph>
      <Paragraph position="2"> Smoothing presents a special problem. There are several backoff levels for each parameter class, which are combined by deleted interpolation. Let φ1, φ2, and φ3 be functions from full history contexts Y to less specific contexts at levels 1, 2, and 3, respectively, for some parameter class with three backoff levels (with level 1 using the most specific contexts). Smoothed estimates for parameters in this class are computed as follows: e = λ1 e1 + (1 − λ1)(λ2 e2 + (1 − λ2) e3), where ei is the estimate of p(X | φi(Y)) for some future context X, and the λi are computed by the formula found in (Bikel et al., 1997), modified to use the multiplicative constant 5 found in the similar formula of (Collins, 1999):</Paragraph>
      <Paragraph position="4"> λi = (1 − di−1/di) · 1/(1 + 5 ui/di), (7) where di is the number of occurrences in training of the context φi(Y) (and d0 = 0), and ui is the number of unique outcomes for that context seen in training.</Paragraph>
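      The smoothing computation can be sketched directly; the weight formula used here is the Bikel et al. (1997) form with Collins' multiplicative constant 5, as described above, and the example counts are invented for illustration:

```python
def smoothing_weight(d_prev, d, u):
    """lambda_i = (1 - d_{i-1}/d_i) / (1 + 5 * u_i/d_i), with d_0 = 0."""
    if d == 0:
        return 0.0                      # unseen context: back off entirely
    return (1.0 - d_prev / d) / (1.0 + 5.0 * u / d)

def smooth(estimates, d, u):
    """Deleted interpolation over three backoff levels.
    estimates[i] is the relative-frequency estimate e_{i+1};
    d[i] and u[i] are the context count and unique-outcome count
    for level i+1 (level 1 = most specific)."""
    lam1 = smoothing_weight(0, d[0], u[0])
    lam2 = smoothing_weight(d[0], d[1], u[1])
    # e = lam1*e1 + (1 - lam1)*(lam2*e2 + (1 - lam2)*e3)
    return lam1 * estimates[0] + (1 - lam1) * (
        lam2 * estimates[1] + (1 - lam2) * estimates[2])

# e.g., with counts d = (10, 100, 1000) and u = (2, 20, 50):
e = smooth([0.5, 0.3, 0.1], d=[10, 100, 1000], u=[2, 20, 50])
assert abs(e - 0.345) < 1e-9
```

    Note that the last backoff level is used unsmoothed, exactly as in the interpolation formula above.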
      <Paragraph position="5"> There are several ways one might incorporate this smoothing into the reestimation process; we chose to depart as little as possible from the original smoothing method: in the E-step, we use the smoothed model, and after the M-step, we use the original formula (7) to recompute the smoothing weights based on the new counts computed from the E-step. While simple, this approach has two important consequences. First, since the formula for the smoothing weights intentionally does not maximize the likelihood of the training data, each iteration of reestimation is not guaranteed to increase the likelihood of the training data. Second, reestimation tends to increase the size of the model in memory, since smoothing gives nonzero expected counts to many events which were unseen in training. Therefore, since the resulting model is quite large, if an event at a particular point in the derivation forest has an expected count below 10^-15, we throw it out.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="73" type="sub_section">
      <SectionTitle>
4.3 Experiment
</SectionTitle>
      <Paragraph position="0"> We first trained the initial model on sections 02-21 of the WSJ corpus using the original head rules, and then ran the Inside-Outside algorithm on the same data. We tested each successive model on some held-out data (section 00), using a beam width of 10^-4, to determine at which iteration to stop. The F-measure (harmonic mean of labeled precision and recall) for sentences of length ≤ 100 for each iteration is shown in Figure 4. We then selected the ninth reestimated model and compared it with the initial model on section 23 (see Figure 7). This model did only marginally better than the initial model on section 00, but it actually performs worse than the initial model on section 23. One explanation is that the head rules, since they have been extensively fine-tuned, do not leave much room for improvement.</Paragraph>
      <Paragraph position="1"> To test this, we ran two more experiments.</Paragraph>
      <Paragraph position="2"> The second experiment started with a simplified rule set, which simply chooses either the leftmost or rightmost child of each node as the head, depending on the label of the parent: e.g., for VP, the leftmost child is chosen; for NP, the rightmost child is chosen. The argument rules, however, were not changed. This rule set is meant to represent the kind of rule set that someone with basic familiarity with English syntax might write down in a few minutes. The reestimated models seemed to improve on this simplified rule set when parsing section 00 (see Figure 5); however, when we compared the 30th reestimated model with the initial model on section 23 (see Figure 7), there was no improvement.</Paragraph>
      <Paragraph position="3"> The third experiment was on the Chinese Treebank, starting with the same head rules used in (Bikel and Chiang, 2000). These rules were originally written by Xia for grammar development, and although we have modified them for parsing, they have not received as much fine-tuning as the English rules have. We trained the model on sections 001-270 of the Penn Chinese Treebank, and reestimated it on the same data, testing it at each iteration on sections 301-325 (Figure 6). We selected the 38th reestimated model for comparison with the initial model on sections 271-300 (Figure 7). Here we did observe a small improvement: an error reduction of 3.4% in the F-measure for sentences of length ≤ 40.</Paragraph>
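      For concreteness, the evaluation measures used throughout these experiments can be computed as follows; the numbers in the example are illustrative, not the paper's scores:

```python
def f_measure(precision, recall):
    """Harmonic mean of labeled precision and recall (in percent)."""
    return 2 * precision * recall / (precision + recall)

def error_reduction(f_old, f_new):
    """Relative reduction in the F-measure error (100 - F)."""
    return (f_new - f_old) / (100.0 - f_old)

# e.g., labeled precision 80.0 and recall 90.0 give F ≈ 84.7 ...
assert abs(f_measure(80.0, 90.0) - 84.70588235294117) < 1e-9

# ... and moving from F = 75.0 to F = 75.85 is a 3.4% error reduction.
assert abs(error_reduction(75.0, 75.85) - 0.034) < 1e-9
```

    Reporting error reduction rather than the raw F difference emphasizes how much of the remaining error the new model removes.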
    </Section>
    <Section position="4" start_page="73" end_page="73" type="sub_section">
      <SectionTitle>
4.4 Discussion
</SectionTitle>
      <Paragraph position="0"> Our hypothesis that reestimation does not improve on the original rule set for English because that rule set is already fine-tuned was partially borne out by the second and third experiments. The model trained with a simplified rule set for English showed improvement on held-out data during reestimation, but showed no improvement in the final evaluation; however, the model trained on Chinese did show a small improvement in both. We are uncertain as to why the gains observed during the second experiment were not reflected in the final evaluation, but based on the graph of Figure 5 and the results on Chinese, we believe that reestimation by EM can be used to facilitate adaptation of parsing models to new languages or corpora.</Paragraph>
      <Paragraph position="1"> It is possible that our method for choosing smoothing weights at each iteration (see Section 4.2) is causing some interference. For future work, more careful methods should be explored. We would also like to experiment with the parsing model of Collins (1999), which, because it can recombine smaller structures and reorder subcategorization frames, might open up the search space for better reestimation.</Paragraph>
    </Section>
  </Section>
</Paper>