File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/94/h94-1049_metho.xml
Size: 17,254 bytes
Last Modified: 2025-10-06 14:13:48
<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1049"> <Title>A Report of Recent Progress in Transformation-Based Error-Driven Learning*</Title> <Section position="4" start_page="256" end_page="256" type="metho"> <SectionTitle> [Figure labels: UNANNOTATED TEXT, STATE, ANNOTATED, TRUTH, RULES] </SectionTitle> <Paragraph position="0"> based learning, one must specify the following: (1) the start state annotator, (2) the space of transformations the learner is allowed to examine, and (3) the scoring function for comparing the corpus to the truth and choosing a transformation.</Paragraph> <Paragraph position="1"> Once an ordered list of transformations is learned, new text can be annotated by first applying the initial state annotator to it and then applying each of the learned transformations, in order.</Paragraph> </Section> <Section position="5" start_page="256" end_page="256" type="metho"> <SectionTitle> 3. AN EARLIER ATTEMPT </SectionTitle> <Paragraph position="0"> The original transformation-based tagger [Brill 92] works as follows. The start state annotator assigns each word its most likely tag as indicated in the training corpus.</Paragraph> <Paragraph position="1"> The most likely tag for unknown words is guessed based on a number of features, such as whether the word is capitalized and what the last three letters of the word are. The allowable transformation templates are: Change tag a to tag b when: 1. The preceding (following) word is tagged z.</Paragraph> <Paragraph position="2"> 2. The word two before (after) is tagged z.</Paragraph> <Paragraph position="3"> 3. One of the two preceding (following) words is tagged z.</Paragraph> <Paragraph position="4"> 4. One of the three preceding (following) words is tagged z.</Paragraph> <Paragraph position="5"> 5. The preceding word is tagged z and the following word is tagged w.</Paragraph> <Paragraph position="6"> 6. 
The preceding (following) word is tagged z and the word two before (after) is tagged w.</Paragraph> <Paragraph position="7"> where a, b, z and w are variables over the set of parts of speech. To learn a transformation, the learner in essence applies every possible transformation (footnote 4), counts the number of tagging errors after that transformation is applied, and chooses the transformation resulting in the greatest error reduction (footnote 5). Learning stops when no transformations can be found whose application reduces errors beyond some prespecified threshold. An example of a transformation that was learned is: change the tagging of a word from noun to verb if the previous word is tagged as a modal. Once the system is trained, a new sentence is tagged by applying the start state annotator and then applying each transformation, in turn, to the sentence.</Paragraph> </Section> <Section position="6" start_page="256" end_page="257" type="metho"> <SectionTitle> 4. LEXICALIZING THE TAGGER </SectionTitle> <Paragraph position="0"> No relationships between words are directly captured in stochastic taggers. In the Markov model, state transition probabilities (P(Tag_i | Tag_{i-1} ... Tag_{i-n})) express the likelihood of a tag immediately following n other tags, and emit probabilities (P(Word_j | Tag_i)) express the likelihood of a word given a tag. Many useful relationships, such as that between a word and the previous word, or between a tag and the following word, are not directly captured by Markov-model based taggers. The same is true of the earlier transformation-based tagger, where transformation templates did not make reference to words.</Paragraph> <Paragraph position="1"> To remedy this problem, the transformation-based tagger was extended by adding contextual transformations that could make reference to words as well as part of speech tags. The transformation templates that were added are: Change tag a to tag b when: 1. 
The preceding (following) word is w.</Paragraph> <Paragraph position="2"> 2. The word two before (after) is w.</Paragraph> <Paragraph position="3"> 3. One of the two preceding (following) words is w. 4. The current word is w and the preceding (following) word is x.</Paragraph> <Paragraph position="4"> Footnote 4: All possible instantiations of transformation templates. Footnote 5: The search is data-driven, so only a very small percentage of possible transformations need be examined.</Paragraph> <Paragraph position="5"> 5. The current word is w and the preceding (following) word is tagged z.</Paragraph> <Paragraph position="6"> where w and x are variables over all words in the training corpus, and z is a variable over all parts of speech. Below we list two lexicalized transformations that were learned (footnote 6). Change the tag: (12) From preposition to adverb if the word two positions to the right is as.</Paragraph> <Paragraph position="7"> (16) From non-3rd person singular present verb to base form verb if one of the previous two words is n't (footnote 7). The Penn Treebank tagging style manual specifies that in the collocation as ... as, the first as is tagged as an adverb and the second is tagged as a preposition. Since as is most frequently tagged as a preposition in the training corpus, the start state tagger will mistag the phrase as tall as as: as/preposition tall/adjective as/preposition. The first lexicalized transformation corrects this mistagging. Note that a stochastic tagger trained on our training set would not correctly tag the first occurrence of as. Although adverbs are more likely than prepositions to follow some verb form tags, the fact that P(as | preposition) is much greater than P(as | adverb), and P(adjective | preposition) is much greater than P(adjective | adverb), leads to as being incorrectly tagged as a preposition by a stochastic tagger. 
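The correction that the first lexicalized transformation makes to the phrase as tall as can be sketched in Python. This is only a minimal illustration under stated assumptions, not the tagger's actual implementation: the function name `apply_rule_12` and the spelled-out tag names are hypothetical and were chosen to mirror the example in the text.

```python
# Hypothetical sketch of learned rule (12): change the tag from preposition
# to adverb if the word two positions to the right is "as".
# The function name and the tag names are illustrative assumptions.

def apply_rule_12(tagged):
    """Return a new list of (word, tag) pairs with rule (12) applied."""
    result = list(tagged)
    for i, (word, tag) in enumerate(tagged):
        two_right = i + 2
        # Trigger only when a word exists two positions to the right
        # and that word is "as".
        if tag == "preposition" and len(tagged) > two_right and tagged[two_right][0] == "as":
            result[i] = (word, "adverb")
    return result

# Output of the start state annotator for "as tall as", as given in the text.
start_state = [("as", "preposition"), ("tall", "adjective"), ("as", "preposition")]
corrected = apply_rule_12(start_state)
print(corrected)
```

Only the first as is retagged, because it is the only preposition with as two positions to its right; the second as has no word in that position and keeps its preposition tag.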
A trigram tagger will correctly tag this collocation in some instances, due to the fact that P(preposition | adverb adjective) is greater than P(preposition | preposition adjective), but the outcome will be highly dependent upon the context in which this collocation appears.</Paragraph> <Paragraph position="8"> The second transformation arises from the fact that when a verb appears in a context such as We do n't ___ or We did n't usually ___, the verb is in base form. A stochastic trigram tagger would have to capture this linguistic information indirectly from frequency counts of all trigrams of the form (footnote 8):</Paragraph> </Section> <Section position="7" start_page="257" end_page="258" type="metho"> <SectionTitle> * ADVERB PRESENT_VERB * ADVERB BASE_VERB </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> and from the fact that P(n't | ADVERB) is fairly high.</Paragraph> <Paragraph position="3"> In [Weischedel et al. 93], results are given when training and testing a Markov-model based tagger on the Penn Treebank Tagged Wall Street Journal Corpus.</Paragraph> <Paragraph position="4"> They cite results making the closed vocabulary assumption that all possible tags for all words in the test set are known. When training contextual probabilities on 1 million words, an accuracy of 96.7% was achieved. Accuracy dropped to 96.3% when contextual probabilities were trained on 64,000 words. We trained the transformation-based tagger on 600,000 words from the same corpus, making the same closed vocabulary assumption (footnote 9), and achieved an accuracy of 97.2% on a separate 150,000 word test set. The transformation-based learner achieved better performance, despite the fact that contextual information was captured in only 267 simple nonstochastic rules, as opposed to 10,000 contextual probabilities that were learned by the stochastic tagger. 
To see whether lexicalized transformations were contributing to the accuracy rate, we ran the exact same test with the tagger trained using the earlier transformation template set, which contained no transformations making reference to words. Accuracy of that tagger was 96.9%. Disallowing lexicalized transformations thus resulted in an 11% relative increase in the error rate (from 2.8% to 3.1%). These results are summarized in Table 1.</Paragraph> <Paragraph position="5"> Footnote 9: In both [Weischedel et al. 93] and here, the test set was incorporated into the lexicon, but was not used in learning contextual information. Testing with no unknown words might seem like an unrealistic test. We have done so for three reasons (we show results when unknown words are included later in the paper): (1) to allow for a comparison with previously quoted results, (2) to isolate known word accuracy from unknown word accuracy, and (3) in some systems, such as a closed vocabulary speech recognition system, the assumption that all words are known is valid.</Paragraph> <Paragraph position="6"> When transformations are allowed to make reference to words and word pairs, some relevant information is probably missed due to sparse data. We are currently exploring the possibility of incorporating word classes into the rule-based learner in hopes of overcoming this problem.</Paragraph> <Paragraph position="7"> The idea is quite simple. Given a source of word class information, such as WordNet [Miller 90], the learner is extended such that a rule is allowed to make reference to parts of speech, words, and word classes, allowing for rules such as Change the tag from X to Y if the following word belongs to word class Z. This approach has already been successfully applied to a system for prepositional phrase disambiguation [Brill 93a].</Paragraph> </Section> <Section position="8" start_page="258" end_page="259" type="metho"> <SectionTitle> 5. 
UNKNOWN WORDS </SectionTitle> <Paragraph position="0"> In addition to not being lexicalized, another problem with the original transformation-based tagger was its relatively low accuracy at tagging unknown words (footnote 10). In the start state annotator for tagging, words are assigned their most likely tag, estimated from a training corpus.</Paragraph> <Paragraph position="1"> In the original formulation of the rule-based tagger, a rather ad hoc algorithm was used to guess the most likely tag for words not appearing in the training corpus. To try to improve upon unknown word tagging accuracy, we built a transformation-based learner to learn rules for more accurately guessing the most likely tag for words not seen in the training corpus. If the most likely tag for unknown words can be assigned with high accuracy, then the contextual rules can be used to improve accuracy, as described above.</Paragraph> <Paragraph position="2"> In the transformation-based unknown-word tagger, the start state annotator naively labels the most likely tag for unknown words as proper noun if capitalized and common noun otherwise (footnote 11).</Paragraph> <Paragraph position="4"> 5. Adding the character string x as a suffix results in a word (|x| ≤ 4).</Paragraph> <Paragraph position="5"> 6. Adding the character string x as a prefix results in a word (|x| ≤ 4).</Paragraph> <Paragraph position="6"> 7. Word W ever appears immediately to the left (right) of the word.</Paragraph> <Paragraph position="7"> 8. Character Z appears in the word.</Paragraph> <Paragraph position="8"> An unannotated text can be used to check the conditions in all of the above transformation templates. Annotated text is necessary in training to measure the effect of transformations on tagging accuracy. Below are the first 10 transformations learned for tagging unknown words in the Wall Street Journal corpus: Change tag: 1. From common noun to plural common noun if the word has suffix -s (footnote 12). 2. 
From common noun to number if the word has character .</Paragraph> <Paragraph position="9"> 3. From common noun to adjective if the word has character 4. From common noun to past participle verb if the word has suffix -ed 5. From common noun to gerund or present participle verb if the word has suffix -ing 6. To adjective if adding the suffix -ly results in a word 7. To adverb if the word has suffix -ly Below we list the set of allowable transformations: Change the guess of the most-likely tag of a word (from X) to Y if: 1. Deleting the prefix x, |x| ≤ 4, results in a word (x is any string of length 1 to 4).</Paragraph> <Paragraph position="10"> 2. The first (1,2,3,4) characters of the word are x. 3. Deleting the suffix x, |x| ≤ 4, results in a word. 4. The last (1,2,3,4) characters of the word are x. Footnote 10: This section describes work done in part while the author was at the University of Pennsylvania.</Paragraph> <Paragraph position="11"> Footnote 11: If we change the tagger to tag all unknown words as common nouns, then a number of rules are learned of the form: change tag to proper noun if the prefix is &quot;E&quot;, since the learner is not provided with the concept of upper case in its set of transformation templates.</Paragraph> <Paragraph position="12"> 8. From common noun to number if the word $ ever appears immediately to the left 9. From common noun to adjective if the word has suffix -al 10. From noun to base form verb if the word would ever appears immediately to the left.</Paragraph> <Paragraph position="13"> Keep in mind that no specific affixes are prespecified. A transformation can make reference to any string of characters up to a bounded length. So while the first rule specifies the English suffix &quot;s&quot;, the rule learner also Footnote 12: Note that this transformation will result in the mistagging of mistress. The 17th learned rule fixes this problem. 
This rule states: change a tag from plural common noun to singular common noun if the word has suffix ss.</Paragraph> <Paragraph position="14"> considered such nonsensical rules as: change a tag to adjective if the word has suffix &quot;xhqr&quot;. Also, absolutely no English-specific information need be prespecified in the learner (footnote 13). We then ran the following experiment using 1.1 million words of the Penn Treebank Tagged Wall Street Journal Corpus. The first 950,000 words were used for training and the next 150,000 words were used for testing. Annotations of the test corpus were not used in any way to train the system. From the 950,000 word training corpus, 350,000 words were used to learn rules for tagging unknown words, and 600,000 words were used to learn contextual rules. 148 rules were learned for tagging unknown words, and 267 contextual tagging rules were learned. Unknown word accuracy on the test corpus was 85.0%, and overall tagging accuracy on the test corpus was 96.5%. To our knowledge, this is the highest overall tagging accuracy ever quoted on the Penn Treebank Corpus when making the open vocabulary assumption. In [Weischedel et al. 93], a statistical approach to tagging unknown words is shown. In this approach, a number of suffixes and important features are prespecified. Then, for unknown words:</Paragraph> <Paragraph position="16"> Using this equation for unknown word emit probabilities within the stochastic tagger, an accuracy of 85% was obtained on the Wall Street Journal corpus. This portion of the stochastic model has over 1,000 parameters, with 10^8 possible unique emit probabilities, as opposed to only 148 simple rules that are learned and used in the rule-based approach. 
We have obtained comparable performance on unknown words, while capturing the information in a much more concise and perspicuous manner, and without prespecifying any language-specific or corpus-specific information.</Paragraph> </Section> <Section position="9" start_page="259" end_page="259" type="metho"> <SectionTitle> 6. K-BEST TAGS </SectionTitle> <Paragraph position="0"> There are certain circumstances where one is willing to relax the one tag per word requirement in order to increase the probability that the correct tag will be assigned to each word. In [DeMarcken 90, Weischedel et al. 93], k-best tags are assigned within a stochastic tagger by returning all tags within some threshold of probability of being correct for a particular word.</Paragraph> <Paragraph position="1"> We can modify the transformation-based tagger to return multiple tags for a word by making a simple modification to the contextual transformations described above. Footnote 13: This learner has also been applied to tagging Old English. See [Srin 93a].</Paragraph> <Paragraph position="2"> [Table column headings: # of Rules, Accuracy, Avg. # of tags per word] The initial-state annotator is the tagging output of the transformation-based tagger described above. The allowable transformation templates are the same as the contextual transformation templates listed above, but with the action change tag X to tag Y modified to add tag X to tag Y or add tag X to word W. Instead of changing the tagging of a word, transformations now add alternative taggings to a word.</Paragraph> <Paragraph position="3"> When allowing more than one tag per word, there is a trade-off between accuracy and the average number of tags for each word. Ideally, we would like to achieve as large an increase in accuracy with as few extra tags as possible. 
Therefore, in training we find transformations that maximize precisely this function.</Paragraph> <Paragraph position="4"> In Table 2 we present results from first using the one-tag-per-word transformation-based tagger described in the previous section and then applying the k-best tag transformations. These transformations were learned from a separate 240,000 word corpus (footnote 14).</Paragraph> </Section> </Paper>
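The trade-off between accuracy and the average number of tags per word, discussed in the k-best section above, can be illustrated with a small Python sketch. It computes only those two quantities for a toy example and does not reproduce the paper's scoring function. The function name `kbest_metrics`, the toy data, and the convention that a word counts as correct when the true tag is among its assigned tags are all assumptions made for illustration.

```python
# Hypothetical sketch: evaluate a k-best tagging by its accuracy and by the
# average number of tags per word. Names and data are illustrative only.

def kbest_metrics(predicted_tag_sets, true_tags):
    """predicted_tag_sets: list of sets of tags, one set per word.
    true_tags: list of correct tags, aligned with predicted_tag_sets.
    Returns (accuracy, average number of tags per word)."""
    if len(predicted_tag_sets) != len(true_tags) or not true_tags:
        raise ValueError("inputs must be non-empty and of equal length")
    # Assumption: a word is correct if the true tag is among its tags.
    correct = sum(1 for tags, truth in zip(predicted_tag_sets, true_tags) if truth in tags)
    total_tags = sum(len(tags) for tags in predicted_tag_sets)
    n = len(true_tags)
    return correct / n, total_tags / n

# One tag per word, as produced by the one-tag-per-word tagger.
one_tag = [{"preposition"}, {"adjective"}, {"preposition"}, {"noun"}]
# The same words after an "add tag" transformation has added adverb
# as an alternative tagging for the first word.
k_best = [{"preposition", "adverb"}, {"adjective"}, {"preposition"}, {"noun"}]
truth = ["adverb", "adjective", "preposition", "noun"]

print(kbest_metrics(one_tag, truth))
print(kbest_metrics(k_best, truth))
```

In this toy case the added tag raises accuracy from 0.75 to 1.0 while the average number of tags per word rises from 1.0 to 1.25, which is the kind of exchange a k-best learner has to weigh.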