<?xml version="1.0" standalone="yes"?>
<Paper uid="A92-1018">
  <Title>A Practical Part-of-Speech Tagger</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Desiderata
</SectionTitle>
    <Paragraph position="0"> Many words are ambiguous in their part of speech. For example, &quot;tag&quot; can be a noun or a verb. However, when a word appears in the context of other words, the ambiguity is often reduced: in &quot;a tag is a part-of-speech label,&quot; the word &quot;tag&quot; can only be a noun. A part-of-speech tagger is a system that uses context to assign parts of speech to words.</Paragraph>
    <Paragraph position="1"> Automatic text tagging is an important first step in discovering the linguistic structure of large text corpora. Part-of-speech information facilitates higher-level analysis, such as recognizing noun phrases and other patterns in text.</Paragraph>
    <Paragraph position="2"> For a tagger to function as a practical component in a language processing system, we believe that it must be robust, efficient, accurate, tunable, and reusable. Robust: Text corpora contain ungrammatical constructions, isolated phrases (such as titles), and non-linguistic data (such as tables). Corpora are also likely to contain words that are unknown to the tagger. It is desirable that a tagger deal gracefully with these situations.</Paragraph>
    <Paragraph position="3"> Efficient: If a tagger is to be used to analyze arbitrarily large corpora, it must be efficient, performing in time linear in the number of words tagged. Any training required should also be fast, enabling rapid turnaround with new corpora and new text genres.</Paragraph>
    <Paragraph position="4"> Accurate: A tagger should attempt to assign the correct part-of-speech tag to every word encountered.</Paragraph>
    <Paragraph position="5"> Tunable: A tagger should be able to take advantage of linguistic insights. One should be able to correct systematic errors by supplying appropriate a priori &quot;hints.&quot; It should be possible to give different hints for different corpora.</Paragraph>
    <Paragraph position="6"> Reusable: The effort required to retarget a tagger to new corpora, new tagsets, and new languages should be minimal.</Paragraph>
  </Section>
</Paper>