<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1031">
  <Title>TnT -- A Statistical Part-of-Speech Tagger</Title>
  <Section position="4" start_page="226" end_page="229" type="evalu">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"> We evaluate the tagger's performance under several aspects. First of all, we determine the tagging accuracy averaged over ten iterations. The overall accuracy, as well as separate accuracies for known and unknown words are measured.</Paragraph>
    <Paragraph position="1"> Second, learning curves are presented, that indicate the performance when using training corpora of different sizes, starting with as few as 1,000 tokens and ranging to the size of the entire corpus (minus the test set).</Paragraph>
    <Paragraph position="2"> An important characteristic of statistical taggers is that they not only assign tags to words but also probabilities in order to rank different assignments. We distinguish reliable from unreliable assignments by the quotient of the best and second best assignments 1. All assignments for which this quotient is larger than some threshold are regarded as reliable, the others as unreliable. As we will see below, accuracies for reliable assignments are much higher.</Paragraph>
    <Paragraph position="3"> The tests are performed on partitions of the corpora that use 90% as training set and 10% as test set, so that the test data is guaranteed to be unseen during training. Each result is obtained by repeating the experiment 10 times with different partitions and averaging the single outcomes.</Paragraph>
    <Paragraph position="4"> In all experiments, contiguous test sets are used.</Paragraph>
    <Paragraph position="5"> The alternative is a round-robin procedure that puts every 10th sentence into the test set. We argue that contiguous test sets yield more realistic results because completely unseen articles are tagged. Using the round-robin procedure, parts of an article are already seen, which significantly reduces the percentage of unknown words. Therefore, we expect even 1 By definition, this quotient is co if there is only one possible tag for a given word.</Paragraph>
    <Paragraph position="6"> higher results when testing on every 10th sentence instead of a contiguous set of 10%.</Paragraph>
    <Paragraph position="7"> In the following, accuracy denotes the number of correctly assigned tags divided by the number of tokens in the corpus processed. The tagger is allowed to assign exactly one tag to each token.</Paragraph>
    <Paragraph position="8"> We distinguish the overall accuracy, taking into account all tokens in the test corpus, and separate accuracies for known and unknown tokens. The latter are interesting, since usually unknown tokens are much more difficult to process than known tokens, for which a list of valid tags can be found in the lexicon.</Paragraph>
    <Section position="1" start_page="226" end_page="229" type="sub_section">
      <SectionTitle>
3.1 Tagging the NEGRA corpus
</SectionTitle>
      <Paragraph position="0"> The German NEGRA corpus consists of 20,000 sentences (355,000 tokens) of newspaper texts (Frankfurter Rundschau) that are annotated with parts-of-speech and predicate-argument structures (Skut et al., 1997). It was developed at the Saarland University in Saarbrficken 2. Part of it was tagged at the IMS Stuttgart. This evaluation only uses the part-of-speech annotation and ignores structural annotations. null Tagging accuracies for the NEGRA corpus are shown in table 2.</Paragraph>
      <Paragraph position="1"> Figure 3 shows the learning curve of the tagger, i.e., the accuracy depending on the amount of training data. Training length is the nmnber of tokens used for training. Each training length was tested ten times, training and test sets were randomly chosen and disjoint, results were averaged. The training length is given on a logarithmic scale.</Paragraph>
      <Paragraph position="2"> It is remarkable that tagging accuracy for known words is very high even for very small training cotpora. This means that we have a good chance of getting the right tag if a word is seen at least once during training. Average percentages of unknown tokens are shown in the bottom line of each diagram.</Paragraph>
      <Paragraph position="3"> We exploit the fact that the tagger not only determines tags, but also assigns probabilities. If there is an alternative that has a probability &amp;quot;close to&amp;quot; that of the best assignment, this alternative can be viewed as almost equally well suited. The notion of &amp;quot;close to&amp;quot; is expressed by the distance of probabilities, and this in turn is expressed by the quotient of probabilities. So, the distance of the probabilities of a best tag tbest and an alternative tag tart is expressed by P(tbest)/p(tau), which is some value greater or equal to 1 since the best tag assignment has the highest probability.</Paragraph>
      <Paragraph position="4"> Figure 4 shows the accuracy when separating assignments with quotients larger and smaller than the threshold (hence reliable and unreliable assignments). As expected, we find that accuracies for  reliable assignments are much higher than for unreliable assignments. This distinction is, e.g., useful for annotation projects during the cleaning process, or during pre-processing, so the tagger can emit multiple tags if the best tag is classified as unreliable.</Paragraph>
    </Section>
    <Section position="2" start_page="229" end_page="229" type="sub_section">
      <SectionTitle>
3.2 Tagging the Penn Treebank
</SectionTitle>
      <Paragraph position="0"> We use the Wall Street Journal as contained in the Penn Treebank for our experiments. The annotation consists of four parts: 1) a context-free structure augmented with traces to mark movement and discontinuous constituents, 2) phrasal categories that are annotated as node labels, 3) a small set of grammatical functions that are annotated as extensions to the node labels, and 4) part-of-speech tags (Marcus et al., 1993). This evaluation only uses the part-of-speech annotation.</Paragraph>
      <Paragraph position="1"> The Wall Street Journal part of the Penn Tree-bank consists of approx. 50,000 sentences (1.2 million tokens).</Paragraph>
      <Paragraph position="2"> Tagging accuracies for the Penn Treebank are shown in table 5. Figure 6 shows the learning curve of the tagger, i.e., the accuracy depending on the amount of training data. Training length is the number of tokens used for training. Each training length was tested ten times. Training and test sets were disjoint, results are averaged. The training length is given on a logarithmic scale. As for the NEGRA corpus, tagging accuracy is very high for known tokens even with small amounts of training data.</Paragraph>
      <Paragraph position="3"> We exploit the fact that the tagger not only determines tags, but also assigns probabilities. Figure 7 shows the accuracy when separating assignments with quotients larger and smaller than the threshold (hence reliable and unreliable assignments). Again, we find that accuracies for reliable assignments are much higher than for unreliable assignments.</Paragraph>
    </Section>
    <Section position="3" start_page="229" end_page="229" type="sub_section">
      <SectionTitle>
3.3 Summary of Part-of-Speech Tagging
Results
</SectionTitle>
      <Paragraph position="0"> Average part-of-speech tagging accuracy is between 96% and 97%, depending on language and tagset, which is at least on a par with state-of-the-art results found in the literature, possibly better. For the Penn Treebank, (Ratnaparkhi, 1996) reports an accuracy of 96.6% using the Maximum Entropy approach, our much simpler and therefore faster HMM approach delivers 96.7%. This comparison needs to be re-examined, since we use a ten-fold crossvalidation and averaging of results while Ratnaparkhi only makes one test run.</Paragraph>
      <Paragraph position="1"> The accuracy for known tokens is significantly higher than for unknown tokens. For the German newspaper data, results are 8.7% better when the word was seen before and therefore is in the lexicon, than when it was not seen before (97.7% vs. 89.0%).</Paragraph>
      <Paragraph position="2"> Accuracy for known tokens is high even with very small amounts of training data. As few as 1000 tokens are sufficient to achieve 95%-96% accuracy for them. It is important for the tagger to have seen a word at least once during training.</Paragraph>
      <Paragraph position="3"> Stochastic taggers assign probabilities to tags. We exploit the probabilities to determine reliability of assignments. For a subset that is determined during processing by the tagger we achieve accuracy rates of over 99%. The accuracy of the complement set is much lower. This information can, e.g., be exploited in an annotation project to give an additional treatment to the unreliable assignments, or to pass selected ambiguities to a subsequent processing step.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>