<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3201">
  <Title>Max-Margin Parsing</Title>
  <Section position="8" start_page="2" end_page="2" type="evalu">
    <SectionTitle>
6 Results
</SectionTitle>
    <Paragraph position="0"> We used the Penn English Treebank for all of our experiments. We report results here for each model and setting trained and tested on only the sentences of length [?] 15 words. Aside from the length restriction, we used the standard splits: sections 2-21 for training (9753 sentences), 22 fordevelopment (603 sentences), and 23 for final testing (421 sentences).</Paragraph>
    <Paragraph position="1"> As a baseline, we trained a CNF transformation of the unlexicalized model of Klein and Manning (2003) on this data. The resulting grammar had 3975 non-terminal symbols and contained two kindsof productions: binary non-terminal rewrites and tag-word rewrites.5 The scores for the binary rewrites were estimated using unsmoothed relative frequency estimators.</Paragraph>
    <Paragraph position="2"> The tagging rewrites were estimated with a smoothed model of P(w|t), also using the model from Klein and Manning (2003). Figure 3 shows the performance of this model (generative): 87.99 F1 on the test set.</Paragraph>
    <Paragraph position="3"> For the basic max-margin model, we used exactly the same set of allowed rewrites (and therefore the same set of candidate parses) as in the generative case, but estimated their weights according to the discriminative method of section 4. Tag-word production weights were fixed to be the log of the generative P(w|t) model.</Paragraph>
    <Paragraph position="4"> That is, the only change between generative and basic is the use of the discriminative maximum-margin criterion in place of the generative maximum likelihood one. This change alone results in a small improvement (88.20 vs.</Paragraph>
    <Paragraph position="5"> 87.99 F1).</Paragraph>
    <Paragraph position="6"> On top of the basic model, we first added lexical features of each span; this gave a lexical model. For a span &lt;s,e&gt; of a sentence x, the base lexical features were:  These base features were conjoined with the span length for spans of length 3 and below, since short spans have highly distinct behaviors (see the examples below). The features are lexical in the sense than they allow specific words 5Unary rewrites were compiled into a single compound symbol, so for example a subject-gapped sentence would have label like s+vp. These symbols were expanded back into their source unary chain before parses were evaluated.</Paragraph>
    <Paragraph position="7"> and word pairs to influence the parse scores, but are distinct from traditional lexical features in several ways. First, there is no notion of head-word here, nor is there any modeling of word-to-word attachment. Rather, these features pick up on lexical trends in constituent boundaries, for example the trend that in the sentence The screen was a sea of red., the (length 2) span between the word was and the word of is unlikely to be a constituent. These non-head lexical features capture a potentially very different source of constraint on tree structures than head-argument pairs, one having to do more with linear syntactic preferences than lexical selection. Regardless of the relative merit of the two kinds of information, one clear advantage of the present approach is that inference in the resulting model remains cubic, since the dynamic program need not track items with distinguished headwords. With the addition of these features, the accuracy jumped past the generative baseline, to 88.44.</Paragraph>
    <Paragraph position="8"> As a concrete (and particularly clean) example of how these features can sway a decision, consider the sentence The Egyptian president said he would visit Libya today to resume the talks. The generative model incorrectly considers Libya today to be a base np. However, this analysis is counter to the trend of today being a one-word constituent. Two features relevant to this trend are: (constituent [?] first-word = today [?] length = 1) and (constituent [?] lastword = today [?] length = 1). These features represent the preference of the word today for being the first and and last word in constituent spans of length 1.6 In the lexical model, however, these features have quite large positive weights: 0.62 each. As a result, this model makes this parse decision correctly.</Paragraph>
    <Paragraph position="9"> Another kind of feature that can usefully be incorporated into the classification process is the output of other, auxiliary classifiers. For this kind of feature, one must take care that its reliability on the training not be vastly greater than its reliability on the test set. Otherwise, its weight will be artificially (and detrimentally) high. To ensure that such features are as noisy on the training data as the test data, we split the training into two folds. We then trained the auxiliary classifiers in jacknife fashion on each 6In this length 1 case, these are the same feature. Note also that the features are conjoined with only one generic label class &amp;quot;constituent&amp;quot; rather than specific constituent types.</Paragraph>
    <Paragraph position="10"> fold, and using their predictions as features on the other fold. The auxiliary classifiers were then retrained on the entire training set, and their predictions used as features on the development and test sets.</Paragraph>
    <Paragraph position="11"> We used two such auxiliary classifiers, giving a prediction feature for each span (these classifiers predicted only the presence or absence of a bracket over that span, not bracket labels). The first feature was the prediction of the generative baseline; this feature added little information, but made the learning phase faster. The second feature was the output of a flat classifier which was trained to predict whether single spans, in isolation, were constituents or not, based on a bundle of features including the list above, but also the following: the preceding, first, last, and following tag in the span, pairs of tags such as preceding-first, last-following, preceding-following, first-last, and the entire tag sequence.</Paragraph>
    <Paragraph position="12"> Tag features on the test sets were taken from a pretagging of the sentence by the tagger described in Toutanova et al. (2003). While the flat classifier alone was quite poor (P 78.77 / R 63.94 / F1 70.58), the resulting max-margin model (lexical+aux) scored 89.12 F1. To situate these numbers with respect to other models, the parser in Collins (1999), which is generative, lexicalized, andintricately smoothedscores 88.69 over the same train/test configuration.</Paragraph>
    <Paragraph position="13"> It is worth considering the cost of this kind of method. At training time, discriminative methods are inherently expensive, since they all involve iteratively checking current model performance on the training set, which means parsing the training set (usually many times). In our experiments, 10-20 iterations were generally required for convergence (except the basic model, which took about 100 iterations.) There are several nice aspects of the approach described here. First, it is driven by the repeated extraction, over the training examples, of incorrect parses which the model currently prefers over the true parses. The procedure that provides these parses need not sum over all parses, nor even necessarily find the Viterbi parses, to function. This allows a range of optimizations not possible for CRF-like approaches which must extract feature expectations from the entire set of parses.7 Nonetheless, generative approaches 7One tradeoff is that this approach is more inherently sequential and harder to parallelize.</Paragraph>
    <Paragraph position="14"> are vastly cheaper to train, since they must only collect counts from the training set.</Paragraph>
    <Paragraph position="15"> On the other hand, the max-margin approach does have the potential to incorporate many new kinds of features over the input, and the current feature set allows limited lexicalization in cubic time, unlike other lexicalized models (including the Collins model which it outperforms in the present limited experiments).</Paragraph>
  </Section>
class="xml-element"></Paper>