<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2186">
<Title>Part of Speech Tagging Using a Network of Linear Separators</Title>
<Section position="6" start_page="1139" end_page="1141" type="evalu">
<SectionTitle> 5 Experimental Results </SectionTitle>
<Paragraph position="0"> The data for all the experiments was extracted from the Penn Treebank WSJ corpus. The training and test corpora consist of 600,000 and 150,000 words, respectively. The first set of experiments uses only the SNOW system and evaluates its performance under various conditions. In the second set, SNOW is compared with a naive Bayes algorithm and with Brill's TBL, all trained and tested on the same data. We also compare with Baseline, which simply assigns each word in the test corpus its most common POS in the lexicon. Baseline performance on our test corpus is 94.1%.</Paragraph>
<Paragraph position="1"> A lexicon is computed from both the training and the test corpus. The lexicon has 81,227 distinct words, with an average of 2.2 possible POS tags per word.</Paragraph>
<Section position="1" start_page="1139" end_page="1140" type="sub_section">
<SectionTitle> 5.1 Investigating SNOW </SectionTitle>
<Paragraph position="0"> We first explore the ability of the network to adapt to new data. While online algorithms are at a disadvantage - each example is processed only once before being discarded - they have the advantage of (in principle) being able to quickly adapt to new data. This is done within SNOW by allowing it to update its weights in test mode.</Paragraph>
<Paragraph position="1"> That is, after prediction, the network receives a label for a word, and then uses the label to update its weights.</Paragraph>
<Paragraph position="2"> In test mode, however, the true tag is not available to the system. Instead, we used as the feedback label the corresponding baseline tag taken from the lexicon. In this way, the algorithm never uses more information than is available to batch algorithms tested on the same data. The intuition is that, since the baseline itself for this task is fairly high, this information will allow the tagger to better tolerate new trends in the data and steer the predictors in the right direction. This is the default system that we call SNOW in the discussion that follows.</Paragraph>
<Paragraph position="3"> Another policy with on-line algorithms is to supply them with the true feedback when they make a mistake in testing. This policy (termed adp-SNOW) is especially useful when the test data comes from a different source than the training data, and allows the algorithm to adapt to the new context. For example, a language acquisition system with a tagger trained on a general corpus can quickly adapt to a specific domain, if allowed to use this policy, at least occasionally. What we found surprising is that in this case supplying the true feedback did not improve the performance of SNOW significantly. Both on-line methods, though, perform significantly better than when on-line updates are disallowed, as in noadp-SNOW. The results, presented in table 1, exhibit the advantage of using an on-line algorithm.</Paragraph>
<Paragraph position="4"> Table 1: Performance of the tagger network with no adaptation (noadp-SNOW), baseline adaptation (SNOW), and true adaptation (adp-SNOW).
One difficulty in applying the SNOW approach to the POS problem is the problem of attribute noise alluded to before. Namely, the classifiers receive a noisy set of features as input, due to the dependence of the attributes on the (unknown) tags of neighboring words. We address this by studying the quality of the classifier when it is guaranteed to receive (almost) correct input.</Paragraph>
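To make the update policies above concrete, the sketch below shows a test-time tagging loop for a network of linear separators. It is an illustrative reconstruction under stated assumptions, not the authors' implementation: the Winnow-style promotion/demotion constants, the extract_features function, and the lexicon's most_common_tag interface are all hypothetical. Note that the baseline tags enter twice: as the surrogate feedback label (the default SNOW policy) and as the source of the neighbor-tag features, which is exactly the attribute noise examined next.

    from collections import defaultdict

    ALPHA, BETA = 1.5, 0.5  # promotion/demotion factors (assumed values)

    class SeparatorNetwork:
        """One sparse linear separator (weight vector) per POS tag."""
        def __init__(self, tags):
            self.w = {t: defaultdict(lambda: 1.0) for t in tags}

        def score(self, tag, feats):
            return sum(self.w[tag][f] for f in feats)

        def predict(self, feats):
            return max(self.w, key=lambda t: self.score(t, feats))

        def update(self, feats, label):
            # Winnow-style multiplicative update on a mistake: promote
            # the separator of the feedback label, demote the winner.
            pred = self.predict(feats)
            if pred != label:
                for f in feats:
                    self.w[label][f] *= ALPHA
                    self.w[pred][f] *= BETA

    def tag_sentence(net, words, lexicon, extract_features,
                     policy="baseline", gold=None):
        """policy: 'none' (noadp-SNOW), 'baseline' (default SNOW),
        or 'true' (adp-SNOW; requires gold tags)."""
        baseline = [lexicon.most_common_tag(w) for w in words]
        tags = []
        for i in range(len(words)):
            # neighbor-tag features are computed from the baseline
            # tags, the source of the attribute noise studied below
            feats = extract_features(words, baseline, i)
            tags.append(net.predict(feats))
            if policy == "baseline":
                net.update(feats, baseline[i])
            elif policy == "true":
                net.update(feats, gold[i])
        return tags

Running this loop with policy set to "none", "baseline", and "true" corresponds to the three conditions compared in table 1.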
<Paragraph position="5"> Table 2 summarizes the effects of this noise on the performance. Under SNOW we give the results under normal conditions, when the features of each example are determined based on the baseline tags. Under SNOW+cr we determine the features based on the correct tags, as read from the tagged corpus. One can see that this results in a significant improvement, indicating that the classifier learned by SNOW is almost perfect. In normal conditions, though, it is affected by the attribute noise.</Paragraph>
<Paragraph position="6"> Table 2: Performance when the tagger was tested with correct initial tags (SNOW+cr) and, as usual, with baseline-based initial tags.</Paragraph>
<Paragraph position="7"> Next, we experimented with the sensitivity of SNOW to several options for labeling the training data. Usually, both the features and the labels of the training examples are computed in terms of the correct parts of speech of words in the training corpus. We call the labeling Semi-supervised when we only require the features of the training examples to be computed in terms of the most probable POS for words in the training corpus, while the labels still correspond to the correct parts of speech. The labeling is Unsupervised when both features and labels of the training examples are computed in terms of the most probable POS of words in the training corpus.</Paragraph>
<Paragraph position="8"> Table 3: Performance of SNOW with unsupervised (SNOW+uns), semi-supervised (SNOW+ss), and normal supervised training.</Paragraph>
<Paragraph position="9"> It is not surprising that the performance of the tagger learned in a semi-supervised fashion is the same as that of the one trained from the correct corpus. Intuitively, since in the test stage the input to the classifier uses the baseline classifier, in this case there is a better fit between the data supplied in training (with correct feedback!) and the data used in testing.</Paragraph>
</Section>
<Section position="2" start_page="1140" end_page="1140" type="sub_section">
<SectionTitle> 5.2 Comparative Study </SectionTitle>
<Paragraph position="0"> We compared the performance of the SNOW tagger with one of the best POS taggers, based on Brill's TBL, and with a naive Bayes based tagger (e.g., (Duda and Hart, 1973)). We used the same training and test sets. The results are summarized in table 4.</Paragraph>
<Paragraph position="1"> Table 4: Comparative performance.
It can be seen that the TBL tagger and SNOW perform essentially the same. However, given that SNOW is an on-line algorithm, we have also tested it in its (true feedback) adaptive mode, where it is shown to outperform them. It is interesting to note that a simple-minded NB method also performs quite well. Another important point of comparison is that the NB tagger and the SNOW taggers are trained with the features described in section 4. TBL, on the other hand, uses a much larger set of features. Moreover, the learning and tagging mechanism in TBL relies on the inter-dependence between the produced labels and the features. However, (Ramshaw and Marcus, 1996) demonstrate that this inter-dependence impacts only 12% of the predictions. Since the classifier used in TBL without inter-dependence can be represented as a linear separator (Roth, 1998), it is perhaps not surprising that SNOW performs as well as TBL. Also, the success of the adaptive SNOW taggers shows that we can alleviate the lack of inter-dependence by adapting to the testing corpus. It also highlights the importance of the relationship between a tagger and a corpus.</Paragraph>
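As a point of reference for the comparison above, here is a minimal sketch of a naive Bayes tagger of the kind used: choose the tag maximizing P(tag) times the product of P(feature | tag) over the same feature set SNOW uses. The Laplace smoothing and log-space scoring are standard choices assumed here, not details taken from the paper, and the class name is hypothetical.

    import math
    from collections import Counter, defaultdict

    class NaiveBayesTagger:
        def __init__(self):
            self.tag_counts = Counter()
            self.feat_counts = defaultdict(Counter)  # tag -> feature counts
            self.vocab = set()

        def train(self, examples):
            # examples: iterable of (features, gold_tag) pairs
            for feats, tag in examples:
                self.tag_counts[tag] += 1
                for f in feats:
                    self.feat_counts[tag][f] += 1
                    self.vocab.add(f)

        def predict(self, feats):
            # argmax_t log P(t) + sum_f log P(f | t), Laplace-smoothed
            total = sum(self.tag_counts.values())
            v = len(self.vocab)
            best_tag, best_lp = None, float("-inf")
            for tag, n in self.tag_counts.items():
                denom = sum(self.feat_counts[tag].values()) + v
                lp = math.log(n / total)
                for f in feats:
                    lp += math.log((self.feat_counts[tag][f] + 1) / denom)
                if lp > best_lp:
                    best_tag, best_lp = tag, lp
            return best_tag

Because each tag's score is a sum of per-feature log weights, this model is itself a linear decision over the features, which is consistent with the observation above that such classifiers come close to TBL's performance.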
</Section>
<Section position="3" start_page="1140" end_page="1141" type="sub_section">
<SectionTitle> 5.3 Alternative Performance Metrics </SectionTitle>
<Paragraph position="0"> Of the 150,000 words in the test corpus, about 65,000 were non-ambiguous; that is, they can assume only one POS. Incorporating these in the performance measure is somewhat misleading, since it does not provide a good measure of the classifier's performance.</Paragraph>
<Paragraph position="1"> Table 5: Performance for ambiguous words.</Paragraph>
<Paragraph position="2"> Sometimes we may be interested in determining the POS classes of words rather than the exact parts of speech. For example, some natural language applications may only require identifying that a word is a noun, without specifying the exact noun tag for the word (singular, plural, proper, etc.). In this case, we want to measure performance with respect to POS classes. That is, if the predicted part of speech for a word is in the same class as the correct tag for the word, then the prediction is termed correct.</Paragraph>
<Paragraph position="3"> Out of the 50 POS tags we created 12 POS classes: punctuation marks, determiners, prepositions and conjunctions, existential &quot;there&quot;, foreign words, cardinal numbers and list markers, adjectives, modals, verbs, adverbs, particles, pronouns, nouns, possessive endings, and interjections. The performance results for the classes are shown in table 6.</Paragraph>
<Paragraph position="4"> In analyzing the results, one can see that many of the mistakes of the tagger are &quot;within&quot; classes. We are currently exploring a few issues that may allow us to use class information, within SNOW, to improve tagging accuracy. In particular, we can incorporate POS classes into our SNOW tagger network by creating another level of output nodes. Each of these nodes will correspond to a POS class and will be connected to the output nodes of the POS tags in the class. The update mechanism of the network will then be made dependent on both the class and the tag predicted for a word.</Paragraph>
</Section>
</Section>
</Paper>