<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3321">
  <Title>Rapid Adaptation of POS Tagging for Domain Specific Uses</Title>
  <Section position="1" start_page="2" end_page="2" type="abstr">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Part-of-speech (POS) tagging is a fundamental component for performing natural language tasks such as parsing, information extraction, and question answering. When POS taggers are trained in one domain and applied in significantly different domains, their performance can degrade dramatically. We present a methodology for rapid adaptation of POS taggers to new domains. Our technique is unsupervised in that a manually annotated corpus for the new domain is not necessary.</Paragraph>
    <Paragraph position="1"> We use suffix information gathered from large amounts of raw text as well as orthographic information to increase the lexical coverage. We present an experiment in the biological domain where our POS tagger achieves results comparable to POS taggers specifically trained for this domain.</Paragraph>
    <Paragraph position="2"> Many machine-learning and statistical techniques employed for POS tagging train a model on an annotated corpus, such as the Penn Treebank (Marcus et al., 1993). Most state-of-the-art POS taggers use two main sources of information: 1) information about neighboring tags, and 2) information about the word itself. Methods that use both sources of information for tagging include Hidden Markov Modeling, Maximum Entropy modeling, and Transformation Based Learning (Brill, 1995).</Paragraph>
    <Paragraph position="3"> In moving to a new domain, performance can degrade dramatically because of the increase in the unknown word rate as well as domain-specific word use. We improve tagging performance by attacking these problems. Since our goal is to employ minimal manual effort or domain-specific knowledge, we consider only orthographic, inflectional, and derivational information in deriving POS. We bypass the approach of annotating a corpus for a new domain, which is intensive in time, cost, resources, and content-expert effort.</Paragraph>
  </Section>
</Paper>