<?xml version="1.0" standalone="yes"?>
<Paper uid="A92-1018">
  <Title>A Practical Part-of-Speech Tagger</Title>
  <Section position="6" start_page="137" end_page="137" type="evalu">
    <SectionTitle>
5 Performance
</SectionTitle>
    <Paragraph position="0"> In this section, we detail how our tagger meets the desiderata that we outlined in section 1.</Paragraph>
    <Section position="1" start_page="137" end_page="137" type="sub_section">
      <SectionTitle>
5.1 Efficient
</SectionTitle>
      <Paragraph position="0"> The system is implemented in Common Lisp [Steele, 1990].</Paragraph>
      <Paragraph position="1"> All timings reported are for a Sun SPARCStation2. The English lexicon used contains 38 tags (M = 38) and 174 ambiguity classes (N = 174).</Paragraph>
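      <Paragraph> The relationship between M and N can be illustrated with a sketch (not the paper's code): an ambiguity class is the set of tags a word can bear, and N counts the distinct such sets over the lexicon. The mini-lexicon below is invented purely for illustration.

```python
# Illustrative sketch: an ambiguity class is the set of tags a word can
# take; N is the number of distinct such sets.  This tiny lexicon is
# invented for illustration, not taken from the paper.
lexicon = {
    "the":  {"DT"},
    "a":    {"DT"},
    "run":  {"NN", "VB"},
    "walk": {"NN", "VB"},
    "her":  {"PRP", "PRP$"},
}

# Distinct ambiguity classes (frozenset makes the tag sets hashable).
ambiguity_classes = {frozenset(tags) for tags in lexicon.values()}
# All tags occurring anywhere in the lexicon.
tags = set().union(*lexicon.values())

print(len(tags))               # M: number of distinct tags -> 5
print(len(ambiguity_classes))  # N: number of ambiguity classes -> 3
```

Because many words share the same tag set, N grows far more slowly than the vocabulary, which is what keeps the table sizes above manageable.</Paragraph>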
      <Paragraph position="2"> Training was performed on 25,000 words in articles selected randomly from Grolier's Encyclopedia. Five iterations of training were performed in a total time of 115 CPU seconds. The time breakdown by component was as follows:
Training: average µseconds per token
  tokenizer       640
  lexicon         400
  1 iteration     680
  5 iterations   3400
  total          4600
Tagging was performed on 115,822 words in a collection of articles by the journalist Dave Barry. This required a total of 143 CPU seconds. The time breakdown was as follows:
Tagging: average µseconds per token
  tokenizer       604
  lexicon         388
  Viterbi         233
  total          1235
It can be seen from these figures that training on a new corpus may be accomplished in a matter of minutes, and that tens of megabytes of text may then be tagged per hour.</Paragraph>
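      <Paragraph> The per-token figures can be cross-checked against the reported wall-clock totals, and the throughput claim can be estimated with simple arithmetic. The bytes-per-token figure below is an assumption (roughly six bytes per English token including the following space), not a number from the paper.

```python
# Cross-check of the reported timings.  Per-token figures are in
# microseconds; totals are the CPU seconds reported in the text.
train_tokens, train_total_s = 25_000, 115
tag_tokens, tag_total_s = 115_822, 143

# Per-token cost implied by the wall-clock totals:
train_us = train_total_s / train_tokens * 1e6   # matches the 4600 in the table
tag_us = tag_total_s / tag_tokens * 1e6         # matches the 1235 in the table

# Throughput estimate: tokens per hour times an ASSUMED ~6 bytes/token.
tokens_per_hour = 3600 / (tag_us / 1e6)
mb_per_hour = tokens_per_hour * 6 / 1e6
print(round(mb_per_hour))  # on the order of tens of MB per hour
```

The estimate lands in the tens of megabytes per hour, consistent with the claim above.</Paragraph>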
    </Section>
    <Section position="2" start_page="137" end_page="137" type="sub_section">
      <SectionTitle>
5.2 Accurate and Robust
</SectionTitle>
      <Paragraph position="0"> When using a lexicon and tagset built from the tagged text of the Brown corpus [Francis and Kučera, 1982], training on one half of the corpus (about 500,000 words) and tagging the other, 96% of word instances were assigned the correct tag. Eight iterations of training were used. This level of accuracy is comparable to the best achieved by other taggers [Church, 1988; Merialdo, 1991].</Paragraph>
      <Paragraph position="1"> The Brown Corpus contains fragments and ungrammaticalities, thus providing a good demonstration of robustness.</Paragraph>
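      <Paragraph> The accuracy measure used above is token-level: the fraction of word instances whose assigned tag matches the reference tag. A minimal sketch, with invented tags for illustration:

```python
# Token-level tagging accuracy: fraction of word tokens whose predicted
# tag equals the reference (gold) tag.  These sequences are invented.
gold = ["DT", "NN", "VBD", "DT", "JJ", "NN"]
pred = ["DT", "NN", "VBD", "DT", "NN", "NN"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)  # 5 of 6 tokens correct -> ~0.833
```

At the scale reported above, a 96% score means roughly 20,000 errors over the 500,000-word test half.</Paragraph>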
    </Section>
    <Section position="3" start_page="137" end_page="137" type="sub_section">
      <SectionTitle>
5.3 Tunable and Reusable
</SectionTitle>
      <Paragraph position="0"> A tagger should be tunable, so that systematic tagging errors and anomalies can be addressed. Similarly, it is important that it be fast and easy to target the tagger to new genres and languages, and to experiment with different tagsets reflecting different insights into the linguistic phenomena found in text. In section 3.5, we describe how the HMM implementation itself supports tuning. In addition, our implementation supports a number of explicit parameters to facilitate tuning and reuse, including specification of lexicon and training corpus. There is also support for a flexible tagset. For example, if we want to collapse distinctions in the lexicon, such as those between positive, comparative, and superlative adjectives, we only have to make a small change in the mapping from lexicon to tagset.</Paragraph>
      <Paragraph position="1"> Similarly, if we wish to make finer grain distinctions than those available in the lexicon, such as case marking on pronouns, there is a simple way to note such exceptions.</Paragraph>
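      <Paragraph> The lexicon-to-tagset mapping described above can be sketched as a simple lookup table; collapsing a distinction then requires editing only that table. The tag names below are Brown-style and the function is a hypothetical illustration, not the paper's Common Lisp implementation.

```python
# Hypothetical sketch of a lexicon-to-tagset mapping.  Collapsing
# positive, comparative, and superlative adjectives, as in the example
# above, requires only a change to this table.
TAG_MAP = {
    "JJ":  "ADJ",  # positive adjective (Brown-style tag)
    "JJR": "ADJ",  # comparative adjective
    "JJT": "ADJ",  # superlative adjective
}

def map_tag(lexicon_tag):
    # Tags without an entry pass through to the tagset unchanged.
    return TAG_MAP.get(lexicon_tag, lexicon_tag)

print(map_tag("JJR"))  # ADJ
print(map_tag("NN"))   # NN
```

The reverse direction, noting finer-grained exceptions such as case on pronouns, can be handled the same way with per-word entries that override the table.</Paragraph>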
    </Section>
  </Section>
</Paper>