<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-2006">
  <Title>Automatic Article Restoration</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Approach
</SectionTitle>
    <Paragraph position="0"> The article generation task constitutes one component of the article correction task. The other component is a natural language parser that maps an input sentence to a parse tree, from which context features of NPs are extracted. In addition, the article correction task needs to address two issues: a0 Ideally, the parse tree of an input sentence with inappropriate articles should be identical (except, of course, the leaves for the articles) to that of the equivalent correct sentence. However, a natural language parser, trained on grammatical sentences, does not perform as well on sentences with inappropriate articles. It might not be able to identify all NPs accurately. We evaluate this problem in a1 4.4.</Paragraph>
    <Paragraph position="1"> Further, the context features of the NPs might be distorted. The performance of the article generator is likely to suffer. We measure this effect in a1 4.5.</Paragraph>
    <Paragraph position="2"> a0 The input sentence may already contain some articles. If the sentence is of high 'quality', one should be conservative in making changes to its articles. We characterize 'quality' using a 3 a2 3 confusion matrix. The articles on the rows are the correct ones; those on the columns are the ones actually used in the sentence. For example, if a sentence has the matrix null a null the a a3a5a4a7a6 a3a8a4 a9 a3 null a3 a6 a3 the a3 a3a5a4a11a10 a3a8a4 a12 then the article the is correctly used in the sentence with a 40% chance, but is mistakenly dropped (i.e., substituted with null) with a 60% chance. If one could accurately estimate the underlying confusion matrix of a sentence, then one could judiciously use the existing articles as a factor when generating articles. null For the article restoration task, we assume that articles may be dropped, but no unnecessary articles are inserted, and the articles the and a are not confused with each other. In other words, the four zero entries in the matrix above are fixed. We report experiments on article restoration in a1 4.6.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Features
</SectionTitle>
      <Paragraph position="0"> Our context features are drawn from two sources: the output of Model 3 of Collins' statistical natural language parser (Collins, 1999), and WordNet Version 2.0. For each base NP in the parse tree, we extract 15 categories of syntactic and semantic features. As an example, the sentence Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29 is parsed as: ... the/DT board/NN)</Paragraph>
      <Paragraph position="2"> From this parse tree the following features are extracted for the base NP a nonexecutive director: Article* The correct article, which may be the, null, or a (covering both a and an).</Paragraph>
      <Paragraph position="3"> Article (a) The article in the original sentence.</Paragraph>
      <Paragraph position="4"> Head (director) The root form of the head of the NP. A number is rewritten as a13 numbera14 . The head is determined using the rules in (Collins, 1999), except for possessive NPs. The head of a possessive NP is 's, which is not indicative of its article preference.</Paragraph>
      <Paragraph position="5"> Instead, we use the second best candidate for NP head.</Paragraph>
      <Paragraph position="6"> Number (singular) If the POS tag of the NP head is NN or NNP, the number of the head is singular; if the tag is NNS or NNPS, it is plural; for all other tags, it is n/a.</Paragraph>
      <Paragraph position="7"> Head POS (NN) The POS tag of the NP head. Any information about the head's number is hidden; NNS is re-written as NN, and NNPS as NNP.</Paragraph>
      <Paragraph position="8"> Parent (PP) The category of the parent node of the NP. Non-article determiner (null) A determiner other than a or the in the NP.</Paragraph>
      <Paragraph position="9"> Words before head (nonexecutive) Words inside the NP that precede the head, excluding determiners.</Paragraph>
      <Paragraph position="10"> Words after head (null) Words inside the NP that follow the head, excluding determiners.</Paragraph>
      <Paragraph position="11"> POS of words before head (JJ) The POS tags of words inside the NP that precede the head, excluding determiners. null POS of words after head (null) The POS tags of words inside the NP that follow the head, excluding determiners. null Words before NP (board, as) The two words preceding the base NP. This feature may be null.</Paragraph>
      <Paragraph position="12"> Words after NP (Nov, a13 numbera14 ) The two words following the base NP. This feature may be null.</Paragraph>
      <Paragraph position="13"> Hypernyms (a15 entitya16 , a15 object, physical objecta16 , ..., a15 head, chief, top doga16 , a15 administrator, decision makera16 ) Each synset in the hierarchy of hypernyms for the head in WordNet is considered a feature. We do not attempt any sense disambiguation, but always use the hypernyms for the first sense.</Paragraph>
      <Paragraph position="14"> Referent (no) If the same NP head appears in one of the 5 previous sentences, then yes; otherwise, no.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Log-linear Model
</SectionTitle>
      <Paragraph position="0"> We use the log-linear model (Ratnaparkhi, 1998), which has the maximum entropy property, to estimate the conditional probabilities of each value of the Article* feature, given any combination of features. This model is able to incorporate all these features, despite their interdependence, in a straightforward manner. Furthermore, unlike in decision trees, there is no need to partition the training data, thereby alleviating the data sparseness problem.</Paragraph>
      <Paragraph position="1"> In this model, the Article* feature is paired up with each of the other features to form contextual predicates (also called &amp;quot;features&amp;quot; in (Ratnaparkhi, 1998)). Thus, our example sentence has the following predicates:</Paragraph>
      <Paragraph position="3"/>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Training Sets for Article Restoration
</SectionTitle>
      <Paragraph position="0"> We ran Ratnaparkhi's MXPOST part-of-speech tagger and Model 3 of Collins' parser on the text in sections 00 to 21 of the Penn Treebank-3. We then extracted all base NPs and their features from the parser's output.1 There are about 260000 base NPs. The distribution of the articles in this set is roughly 70.5% null, 20% the and 9.5% a.</Paragraph>
      <Paragraph position="1"> The articles in the original sentences were initially assigned to both the Article* and Article features. This would imply a very high quality for the input sentences, in the sense that their articles were extremely likely to be correct. As a result, the model would be overly conservative about inserting new articles. To simulate varying qualities of input sentences, we perturbed the Article feature with two different confusion matrices, resulting in the following training sets: a0 TRAINDROP70: The Article feature is perturbed according to the confusion matrix</Paragraph>
      <Paragraph position="3"> That is, 70% of the feature (Article = the), and 70% of the feature (Article = a), are replaced with the feature (Article = null). The rest are unchanged.</Paragraph>
      <Paragraph position="4"> This set trains the model to aim to insert enough articles such that the initial number of articles in a sentence would constitute about 30% of the final number of articles.</Paragraph>
      <Paragraph position="5">  same treebank, the accuracy of our context features is higher than what we would expect from other texts. Our motivation for using the text of the Penn Treebank is to facilitate comparison between our article generation results and those reported in (Knight and Chander, 1994) and (Minnen et al., 2000), both of which read context features directly from the Penn Treebank. a0 TRAINDROP30: The Article feature is perturbed according to the confusion matrix</Paragraph>
      <Paragraph position="7"> seeing a null in an input sentence, all else being equal, TRAINDROP30 should be less predisposed than TRAINDROP70 to change it to the or a. In other words, the weight of (Article* = the) &amp; (Article = null) and (Article* = a) &amp; (Article = null) should be heavier in TRAINDROP70 than TRAINDROP30.</Paragraph>
      <Paragraph position="8"> Contextual predicates that were true in less than 5 base NPs in the training sets were deemed unreliable and rejected. The weight for each predicate was initialized to zero, and then trained by iterative scaling.</Paragraph>
      <Paragraph position="9"> After training on TRAINDROP30 for 1500 rounds, the ten heaviest weights were:</Paragraph>
      <Paragraph position="11"> Notice that two features, Head and Word before head, dominated the top 10 weights.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Training Sets for Article Generation
</SectionTitle>
      <Paragraph position="0"> We created three additional training sets which omit the Article feature. In other words, the articles in input sentences would be ignored. These sets were used in the article generation experiments.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
• TRAINGEN
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Test Sets
</SectionTitle>
      <Paragraph position="0"> We generated four test sets from the text in section 23 of the Penn Treebank-3 by dropping 70%, 30% and 0% of the articles. We call these sets DROP70, DROP30 and DROP0. There are about 1300 a's and 2800 the's in the section.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Identifying Noun Phrases
</SectionTitle>
      <Paragraph position="0"> We would like to measure the degree to which the missing articles corrupted the parser output. We analyzed the following for each sentence: whether the correct NP heads were extracted; and, if so, whether the boundaries of the NPs were correct. DROP30 and DROP70 were POS-tagged and parsed, and then compared against DROP0.</Paragraph>
      <Paragraph position="1"> 97.6% of the sentences in DROP30 had all their NP heads correctly extracted. Among these sentences, 98.7% of the NPs had correct boundaries.</Paragraph>
      <Paragraph position="2"> The accuracy rate for NP heads decreased to 94.7% for DROP70. Among the sentences in DROP70 with correct heads, 97.5% of the NPs had correctly boundaries.</Paragraph>
      <Paragraph position="3"> We now turn our attention to how these errors affected performance in article generation.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.5 Article Generation
</SectionTitle>
      <Paragraph position="0"> We trained the log-linear model with TRAINGEN, TRAINGENa34a36a35a38a37a39a37 a32 a37 and TRAINGENa27a29a28a31a30a33a32 , then performed the article generation task on all test sets. Table 1 shows the accuracy rates.</Paragraph>
      <Paragraph position="1"> Our baseline accuracy rate on DROP0, 80.1%, is close to the corresponding rate (80.8% for the &amp;quot;head+its partof-speech&amp;quot;feature) reported in (Minnen et al., 2000). Our best result, 87.7%, is an improvement over both (Minnen et al., 2000) and (Knight and Chander, 1994).</Paragraph>
      <Paragraph position="2"> We added 8 more features (see a1 3.1) to TRAINGENa34a36a35a38a37a39a37 a32 a37 to make up TRAINGEN. After adding the features Words before/after head and POS of words before/after head, the accuracy increased by more than 4%. In fact, these features dominated the 10 heaviest weights in our training; they were not used in (Minnen et al., 2000).</Paragraph>
      <Paragraph position="3"> Article null generated the generated a generated  The Words before/after NP features gave another 0.8% boost to the accuracy. These features were also used in (Knight and Chander, 1994) but not in (Minnen et al., 2000). The Hypernyms feature, which placed NP heads under the WordNet semantic hierarchy, was intended to give a smoothing effect. It further raised the accuracy by 0.3%.</Paragraph>
      <Paragraph position="4"> At this point, the biggest source of error was generating null instead of the correct the. We introduced the Referent feature to attack this problem. It had, however, only a modest effect. Among weights that involved this feature, the one with the largest magnitude was (Article* = a) &amp; (Referent = yes), at a meagre -0.71. The others were within a42 0.3. Table 2 is the final contingency table for TRAINGEN on DROP0.</Paragraph>
      <Paragraph position="5"> The confusion between null and the remained the biggest challenge. The 656 misclassifications seemed rather heterogeneous. There was an almost even split between singular and plural NP heads; more than three quarters of these heads appeared in the list three times or less. The most frequent ones were a13 numbera14 (22 times), bond, year, security, court (8 times), fifth and show (7 times).</Paragraph>
      <Paragraph position="6"> As expected, the performance of TRAINGEN degraded on DROP30 and DROP70.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.6 Article Restoration
</SectionTitle>
      <Paragraph position="0"> So far, our experiments have not made use of the Article feature; articles in the original sentences are simply ignored. In the article restoration task, it is possible to take advantage of this feature.</Paragraph>
      <Paragraph position="1"> We trained the log-linear model with TRAINDROP30, TRAINDROP70 and TRAINGEN. Our baseline was keeping the original sentences intact. The test sets were processed as follows: If an NP contained an article, the new article (that is, the output of the article generator) would replace it; otherwise, the new article would be inserted at the beginning of the NP. The final sentences were evaluated against the original sentences for three kinds of errors: null</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Deletions The number of articles deleted.
Substitutions The number of a's replaced by the's, and vice versa.
Insertions The number of articles inserted.
</SectionTitle>
      <Paragraph position="1"> The article error rate is the total number of errors divided by the number of articles in the original sentences. The results in Table 3 reflect the intuition that, for a test set where x% of the articles have been dropped, the optimal model is the one that has been trained on sentences with x% of the articles missing. More generally, one could expect that the optimal training set is the one whose underlying confusion matrix is the most similar to that of the test set.</Paragraph>
      <Paragraph position="2"> Whereas TRAINGEN ignores the original articles, both TRAINDROP30 and TRAINDROP70 led the model to become extremely conservative in deleting articles, and in changing the to a, or vice versa. Thus, the only major distinguishing characteristic between them was their aggressiveness in inserting articles: TRAINDROP70 was more aggressive than TRAINDROP30. Tables 4 to 6 illustrate the breakdown of the kinds of error contributing to the article error rate:  The trends in the deletion error rate (Table 4) were quite straightforward: the rate was lower when the model inserted more articles, and when fewer articles were dropped in the original sentences.</Paragraph>
      <Paragraph position="3">  Most of the substitution errors (Table 5) were caused by the following: an article (e.g., a) was replaced by null in the test set; then, the wrong article (e.g., the) was generated to replace the null. In general, the substitution rate was higher when the model inserted more articles, and when more articles were dropped in the original sentences. null  The more aggressive the model was in inserting articles, the more likely it &amp;quot;over-inserted&amp;quot;, pushing up the insertion error rate (Table 6). With the aggressiveness kept constant, it might not be obvious why the rate should rise as more articles were dropped in the test set. It turned out that, in many cases, inaccurate parsing (see a1 4.4) led to incorrect NP boundaries, and hence incorrect insertion points for articles.</Paragraph>
      <Paragraph position="4"> As the wide range of error rates suggest, it is important to choose the optimal training set with respect to the input sentences. As one becomes more aggressive in inserting articles, the decreasing deletion rate is counter-balanced by the increasing substitution and insertion rates. How could one determine the optimal point? Table 7 shows the changes in the number of articles, as a percentage of the number of articles in the final sentences. When running TRAINGEN on DROP30 and DROP70, there was an increase of 23.8% and 65.9% in the number of articles. These rates of increase were close to those obtained (24.4% and 66.0%) when running their respective optimal sets, TRAINDROP30 and TRAINDROP70. It appeared that TRAINGEN was able to provide a reasonable estimate of the number of articles that &amp;quot;should&amp;quot; be restored. When given new input sentences, one could use TRAINGEN to estimate the percentage of missing articles, then choose the most appropriate training set accordingly.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Future Work
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Article Generation
</SectionTitle>
      <Paragraph position="0"> We would like to improve the performance of the article generator. Our largest source of error is the confusion between null and the. In this work, we used predominantly intra-sentential features to disambiguate the articles. Article generation, however, clearly depends on previous sentences. Our only inter-sentential feature, Refer- null ent, rather na&amp;quot;ively assumed that the referent was explicitly mentioned using the same noun within 5 preceding sentences. Techniques in anaphora resolution could help refine this feature.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Parser Robustness
</SectionTitle>
      <Paragraph position="0"> The performance of the article generator degraded by more than 5% on when 30% of the articles in a sentence were dropped, and by more than 11% when 70% were dropped (see a1 4.5). This degradation was due to errors in the extraction of context features, and in identifying the NPs (see a1 4.4).</Paragraph>
      <Paragraph position="1"> These errors could be reduced by retraining the POS tagger and the natural language parser on sentences with missing articles. New training sets for the tagger and parser could be readily created by dropping the article leaves from the Penn Treebank.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Weight Estimation
</SectionTitle>
      <Paragraph position="0"> We used different confusion matrices to create training sets that simulated discrete percentages of dropped articles. Given some input sentences, the best one could do is to estimate their underlying confusion matrix, and choose the training set whose underlying matrix is the most similar. null Suppose a sentence is estimated to have half of its articles missing, but we do not have weights for a TRAIN-DROP50 set. Rather than retraining such a set from scratch, could we interpolate optimal weights for this sentence from existing weights?</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Other Types of Grammatical Mistakes and
Texts
</SectionTitle>
      <Paragraph position="0"> We would like to lift our restrictions on the confusion matrix; in other words, to expand our task from restoring articles to correcting articles.</Paragraph>
      <Paragraph position="1"> We have also identified a few other common categories of grammatical mistakes, such as the number of the NP head (singular vs. plural), and the verb tenses (present vs. past vs. continuous). For native speakers of languages that do not inflect nouns and verbs, it is a common mistake to use the root forms of nouns and verbs instead of the inflected form.</Paragraph>
      <Paragraph position="2"> Finally, we would like to investigate how well the rules learned by our model generalize to other genres of texts. After all, most non-native speakers of English do not write in the style of the Wall Street Journal! We plan to train and test our model on other corpora and, if possible, on writing samples of non-native speakers.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>