<?xml version="1.0" standalone="yes"?> <Paper uid="A92-1018"> <Title>A Practical Part-of-Speech Tagger</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Desiderata </SectionTitle> <Paragraph position="0"> Many words are ambiguous in their part of speech. For example, &quot;tag&quot; can be a noun or a verb. However, when a word appears in the context of other words, the ambiguity is often reduced: in &quot;a tag is a part-of-speech label,&quot; the word &quot;tag&quot; can only be a noun. A part-of-speech tagger is a system that uses context to assign parts of speech to words.</Paragraph> <Paragraph position="1"> Automatic text tagging is an important first step in discovering the linguistic structure of large text corpora. Part-of-speech information facilitates higher-level analysis, such as recognizing noun phrases and other patterns in text.</Paragraph> <Paragraph position="2"> For a tagger to function as a practical component in a language processing system, we believe that a tagger must be: Robust Text corpora contain ungrammatical constructions, isolated phrases (such as titles), and non-linguistic data (such as tables). Corpora are also likely to contain words that are unknown to the tagger. It is desirable that a tagger deal gracefully with these situations.</Paragraph> <Paragraph position="3"> Efficient If a tagger is to be used to analyze arbitrarily large corpora, it must be efficient--performing in time linear in the number of words tagged. Any training required should also be fast, enabling rapid turnaround with new corpora and new text genres.</Paragraph> <Paragraph position="4"> Accurate A tagger should attempt to assign the correct part-of-speech tag to every word encountered.</Paragraph> <Paragraph position="5"> Tunable A tagger should be able to take advantage of linguistic insights. 
One should be able to correct systematic errors by supplying appropriate a priori &quot;hints.&quot; It should be possible to give different hints for different corpora.</Paragraph> <Paragraph position="6"> Reusable The effort required to retarget a tagger to new corpora, new tagsets, and new languages should be minimal.</Paragraph> </Section> </Paper>