<?xml version="1.0" standalone="yes"?>
<Paper uid="W95-0101">
  <Title>Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging</Title>
  <Section position="4" start_page="2" end_page="10" type="intro">
    <SectionTitle>
[Figure: unannotated text, initial state annotator, learner, rules]
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="8" type="sub_section">
      <SectionTitle>
Transformation-Based Part of Speech Tagging
</SectionTitle>
      <Paragraph position="0"> In transformation-based part of speech tagging, 3 all words are initially tagged with their most likely tag, as indicated in the training corpus. Below are some of the transformation templates used by the learner. 4 Change tag a to tag b when:  1. The preceding (following) word is tagged z.</Paragraph>
      <Paragraph position="1"> 2. The preceding (following) word is w.</Paragraph>
      <Paragraph position="2"> 3. The word two before (after) is w.</Paragraph>
      <Paragraph position="3"> 4. One of the two preceding (following) words is tagged z. 5. The current word is w and the preceding (following) word is x. 6. The current word is w and the preceding (following) word is tagged z.  The evaluation measure is simply tagging accuracy. In each learning iteration, the system learns that transformation whose application results in the greatest reduction of error. 5 Because the learning algorithm is data-driven, it only needs to consider a small  the former attempts to maximize the probability of a string, whereas the latter attempts to minimize the number of errors.</Paragraph>
      <Paragraph position="4">  percentage of all possible transformations when searching for the best one. An example of a learned transformation is: Change the tag of a word from VERB to NOUN if the previous word is a DETERMINER. null If the word race occurs more frequently as a verb than as a noun in the training corpus, the initial state annotator will mistag this word as a verb in the sentence: The race was very exciting. The above transformation will correct this tagging error. It was shown in \[Brill, 1994\] that the transformation-based tagger achieves a high rate of tagging accuracy. The transformation-based tagger captures its learned information in a set of simple rules, compared to the many thousands of opaque probabilities learned by Markov-model based taggers. 6 Supervised training is feasible when one has access to a large manually tagged training corpus from the same domain as that to which the trained tagger will be applied. We next explore unsupervised and weakly supervised training as a practical alternative when the necessary resources are not available for supervised training. Unsupervised Learning of Transformations In supervised training, the corpus is used for scoring the outcome of applying transformations, in order to find the best transformation in each iteration of learning. In order to derive an unsupervised version of the learner, an objective function must be found for training that does not need a manually tagged corpus.</Paragraph>
      <Paragraph position="5"> We begin our exploration providing the training algorithm with a minimal amount of initial knowledge, namely knowing the allowable tags for each word, and nothing else. 7 The relative likelihoods of tags for words is not known, nor is any information about which tags are likely to appear in which contexts. This would correspond to the knowledge that could be extracted from an on-line dictionary or through morphological and distributional analysis.</Paragraph>
      <Paragraph position="6"> The unsupervised rule learning algorithm is based on the following simple idea. Given the sentence: The can will be crushed.</Paragraph>
      <Paragraph position="7"> with no information beyond the dictionary entry for the word can, the best we can do is randomly guess between the possible tags for can in this context. However, using an unannotated corpus and a dictionary, it could be discovered that of the words that appear after The in the corpus that have only one possible tag listed in the dictionary, nouns are much more common than verbs or modals. From this the following rule could be learned: Change the tag of a word from (modal OR noun OR verb) to noun if the previous word is The.</Paragraph>
      <Paragraph position="8"> SThe transformation-based tagger is available through anonymous ftp to ftp.cs.jhu.edu in /pub/brill/Programs.</Paragraph>
      <Paragraph position="9"> Tin this paper we ignore the problem of unknown words: words appearing in the test set which did not appear in the training set. We plan to explore ways of processing unknown words in future work, either by initially assigning them all open-class tags, or devising an unsupervised version of the rule-based unknown. word tagger described in \[Brill, 1994\].</Paragraph>
      <Paragraph position="10">  To fully define the learner, we must specify the three components of the learner: the initial state annotator, the set of transformation templates, and the scoring criterion. Initial State Annotator The unsupervised learner begins with an unannotated text corpus, and a dictionary listing words and the allowable part of speech tags for each word. The tags are not listed in any particular order. The initial state annotator tags each word in the corpus with a list of all allowable tags. Below is an example of the initial-state tagging of a sentence from the Penn Treebank \[Marcus et al., 1993\], where an underscore is to be read as or. 8</Paragraph>
      <Paragraph position="12"> Transformation Templates The learner currently has four transformation templates.</Paragraph>
      <Paragraph position="13"> They are: Change the tag of a word from X to Y if: 1. The previous tag is T.</Paragraph>
      <Paragraph position="14"> 2. The previous word is W.</Paragraph>
      <Paragraph position="15"> 3. The next tag is T.</Paragraph>
      <Paragraph position="16"> 4. The next word is W.</Paragraph>
      <Paragraph position="17">  Transformations are used differently in the unsupervised learner than in the supervised learner. Here, a transformation will reduce the uncertainty as to the correct tag of a word in a particular context, instead of changing one tag to another. So all learned transformations will have the form: Change the tag of a word from X to Y in context C where X is a set of two or more part of speech tags, and Y is a single part of speech tag, such that Y E X. Below we list some transformations that were actually learned by the system.</Paragraph>
      <Paragraph position="18"> Change the tag: From NN_VB_VBP to VBP if the previous tag is NNS From NN_VB to VB if the previous tag is MD From JJ_NNP to JJ if the following tag is NNS</Paragraph>
      <Paragraph position="20"> Scoring Criterion When using supervised transformation-based learning to train a part of speech tagger, the scoring function is just the tagging accuracy that results from applying a transformation. With unsupervised learning, the learner does not have a gold standard training corpus with which accuracy can be measured. Instead, we can try to use information from the distribution of unambiguous words to find reliable disambiguating contexts.</Paragraph>
      <Paragraph position="21"> In each learning iteration, the score of a transformation is computed based on the current tagging of the training set. Recall that this is completely unsupervised. Initially, each word in the training set is tagged with all tags allowed for that word, as indicated in the dictionary. In later learning iterations, the training set is transformed as a result of applying previously learned transformations. To score the transformation: Change the tag of a word from X to Y in context C, where Y E X, we do the following. For each tag Z E X,</Paragraph>
      <Paragraph position="23"> where freq(Y) is the number of occurrences of words unambiguously tagged with tag Y in the corpus, freq(Z) is the number of occurrences of words unambiguously tagged with tag Z in the corpus, and incontext(Z,C) is the number of times a word unambiguously tagged with tag Z occurs in context C in the training corpus. 9</Paragraph>
      <Paragraph position="25"> Then the score for the transformation Change the tag of a word from X to Y in context Cis: incontext(Y, C) - freq(Y)/ freq( R) * incontext( R, C) A good transformation for removing the part of speech ambiguity of a word is one for which one of the possible tags appears much more frequently as measured by unambiguously tagged words than all others in the context, after adjusting for the differences in relative frequency between the different tags. The objective function for this transformation measures this by computing the difference between the number of unambiguous instances of tag Y in context C and the number of unambiguous instances of the most likely tag R in context C, where R E X, R ~ Y, adjusting for relative frequency. In each learning iteration, the learner searches for the transformation which maximizes this function. Learning stops when no positive scoring transformations can be found.</Paragraph>
      <Paragraph position="26">  To test the effectiveness of the above unsupervised learning algorithm, we ran a number of experiments using two different corpora and part of speech tag sets: the Penn Treebank Wall Street Journal Corpus \[Marcus et al., 1993\] and the original Brown Corpus \[Francis and Kucera, 1982\]. First, a dictionary was created listing all possible tags for each word in the corpus. This means that the test set contains no unknown words. We have set up the experiments in this way to facilitate comparisons with results given in other papers, where the same was done.</Paragraph>
      <Paragraph position="27">  In this experiment, a training set of 120,000 words and a separate test set of 200,000 words were used. We measure the accuracy of the tagger by comparing text tagged by the trained tagger to the gold standard manually annotated corpus. In the case where the tag of a word is not fully disambiguated by the tagger, a single tag is randomly chosen from the possible tags, and this tag is then compared to the gold standard. Initial state tagging accuracy on the training set is 90.7%. After learning 1,151 transformations, training set accuracy increases to 95.0%. Initial state tagging accuracy on the test set is also 90.7%. Accuracy increases to 95.1% after applying the learned transformations.</Paragraph>
      <Paragraph position="28"> Figure 2 shows test set tagging accuracy as a function of transformation number. In figure 3, we plot the difference between training and test set accuracies after the apphcation of each transformation, including a smoothed curve. 1deg Notice that there is no overtraining: the difference in accuracies on training and test set remain within a very narrow range throughout, with test set accuracy exceeding training set accuracy by a small margin. Overtraining did not occur when using the original Brown Corpus either. When training a stochastic tagger using the Baum-Welch algorithm, overtraining often does occur \[Meriaido, 1995; Elworthy, 1994\], requiring an additional held-out training corpus for determining an</Paragraph>
      <Paragraph position="30"/>
    </Section>
    <Section position="2" start_page="8" end_page="8" type="sub_section">
      <SectionTitle>
Brown Corpus Results
</SectionTitle>
      <Paragraph position="0"> In this experiment, we also used a training set of 120,000 words and a separate test set of 200,000 words. Initial state tagging accuracy on the training set is 89.8%. After learning 1,729 transformations and applying them to the training set, accuracy increases to 95.6%.</Paragraph>
      <Paragraph position="1"> Initial state tagging accuracy on the test set is 89.9%, with accuracy increasing to 95.6% after applying the learned transformations. Expanding the training set to 350,000 words and testing on the same test set, accuracy increases to 96.0%. All unsupervised learning results are summarized in table 1.</Paragraph>
    </Section>
    <Section position="3" start_page="8" end_page="8" type="sub_section">
      <SectionTitle>
Comparison With Other Results
</SectionTitle>
      <Paragraph position="0"> In \[Merialdo, 1995\], tagging experiments are described training a tagger using the Baum-Welch algorithm with a dictionary constructed as described above and an untagged corpus.</Paragraph>
      <Paragraph position="1"> Experiments were run on Associated Press articles which were manually tagged at the University of Lancaster. When training on one million words of text, test set accuracy  peaks at 86.6%. In \[Elworthy, 1994\], similar experiments were run. There, a peak accuracy of 92.0% was attained using the LOB corpus, n Using the Penn Treebank corpus, a peak accuracy of 83.6% resulted. These results are significantly lower than the results achieved using unsupervised transformation-based learning.</Paragraph>
      <Paragraph position="2"> In \[Kupiec, 1992\] a novel twist to the Baum-Welch algorithm is presented, where instead of having contextual probabilities for a tag following one or more previous tags, words are pooled into equivalence classes, where all words in an equivalence class have the same set of allowable part of speech assignments. Using these equivalence classes greatly reduces the number of parameters that need to be estimated. Kupiec ran experiments using the original Brown Corpus. When training on 440,000 words, test set accuracy was 95.7%, excluding punctuation. As shown above, test set accuracy using the transformation-based algorithm described in this paper gives an accuracy of 96.0% when trained on 350,000 words. Excluding punctuation, this accuracy is 95.6%. Note that since the Baum-Welch algorithm frequently overtrains, a tagged text would be necessary to figure out what training iteration gives peak performance.</Paragraph>
    </Section>
    <Section position="4" start_page="8" end_page="10" type="sub_section">
      <SectionTitle>
Weakly Supervised Rule Learning
</SectionTitle>
      <Paragraph position="0"> We have explored a method of training a transformation-based tagger when no information is known other than a list of possible tags for each word. Next we explore weakly supervised learning, where a small amount of human intervention is permitted. With Markov-model based taggers, there have been two different methods proposed for adding knowledge to a tagger trained using the Baum-Welch algorithm. One method is to manually alter the tagging model, based on human error analysis. This method is employed in \[Kupiec, 1992; Cutting et al., 1992\]. Another approach is to obtain the initial probabilities for the model directly from a manually tagged corpus instead of using random or evenly distributed initial probabilities, and then adjust these probabilities using the Baum-Welch algorithm and an untagged corpus. This approach is described in \[Merialdo, 1995; Elworthy, 1994\].</Paragraph>
      <Paragraph position="1"> A tagged corpus can also be used to improve the accuracy of unsupervised transformation-based learning. A transformation-based system is a processor and not a classifier. Being a processor, it can be applied to the output of any initial state annotator. As mentioned above, in the supervised transformation-based tagger described in \[Brill, 1994\], each word is initially tagged with its most likely tag. Here, we use the trained unsupervised part of speech tagger as the initial state annotator for a supervised learner. Transformations will then be learned to fix errors made by the unsupervised learner. As shown in figure 4, unannotated text is first passed through the unsupervised initial-state annotator, where each word is assigned a list of all allowable tags. The output of this tagger is then passed to the unsupervised learner, which learns an ordered list of transformations. The initial-state annotator and learned unsupervised transformations are then applied to unannotated text, which is then input to the supervised learner, along with the corresponding manually tagged corpus. The supervised learner learns a second ordered list of transformations.</Paragraph>
      <Paragraph position="2"> Once the system is trained, fresh text is tagged by first passing it through the unsupervised initial state annotator, then applying each of the unsupervised transformations, in order, and then applying each of the supervised transformations, in order.</Paragraph>
      <Paragraph position="3"> The advantage of combining unsupervised and supervised learning over using supervised n\[Elworthy, 1994\] quotes accuracy on ambiguous words, which we have converted to overall accuracy.  learning alone is that the combined approach allows us to utifize both tagged and untagged text in training. Since manually tagged text is costly and time-consuming to generate, it is often the case that when there is a corpus of manually tagged text available there will also be a much larger amount of untagged text available, a resource not utilized by purely supervised training algorithms.</Paragraph>
      <Paragraph position="4"> One significant difference between this approach and that taken in using the Baum-Welch algorithm is that here the supervision influences the learner after unsupervised training, whereas when using tagged text to bias the initial probabilities for Baum-Welch training, supervision influences the learner prior to unsupervised training. The latter approach has the potential weakness of unsupervised training erasing what was learned from the manually annotated corpus. For example, in \[Merialdo, 1995\], extracting probability estimates from a 50,000 word manually tagged corpus gave a test set accuracy of 95.4%.</Paragraph>
      <Paragraph position="5"> After applying ten iterations of the Baum-Welch algorithm, accuracy dropped to 94.4%.</Paragraph>
      <Paragraph position="6"> Using the transformations learned in the above unsupervised training experiment run on the Penn Treebank, we apply these transformations to a separate training corpus. New supervised transformations are then learned by comparing the tagged corpus that results from applying these transformations with the correct tagging, as indicated in the manually annotated training corpus.</Paragraph>
      <Paragraph position="7"> In table 2, we show tagging accuracy on a separate test set using different sizes of manually annotated corpora. In each case, a 120,000 word untagged corpus was used for initial unsupervised training. This table also gives results from supervised training using the annotated corpus, without any prior unsupervised training. 12 In all cases, the combined training outperformed the purely supervised training at no added cost in terms of annotated  In this paper, we have presented a new algorithm for unsupervised training of a rule-based part of speech tagger. The rule-based tagger trained using this algorithm significantly outperforms the traditional method of applying the Baum-Welch algorithm for unsupervised training of a stochastic tagger, and achieves comparable performance to a class-based Baum-Welch training algorithm. In addition, we have shown that by combining unsupervised and supervised learning, we can obtain a tagger that significantly outperforms a tagger trained using purely supervised learning. We are encouraged by these results, and expect an improvement in performance when the number of transformation templates provided to the unsupervised learner increases beyond the four currently used. We have also demonstrated that overtraining, a problem in Baum-Welch training, is not a problem in transformation-based learning.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>