<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2027">
  <Title>Bayesian Nets in Syntactic Categorization of Novel Words</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Uncovering the syntactic structure of texts is a necessary step towards extracting their meaning. In order to obtain an accurate parse for an unseen text, we need to assign Part-of-Speech (PoS) tags to a string of words. This paper covers one aspect of our work on PoS tagging with Dynamic Bayesian Networks (DBNs): their success at tagging unknown, out-of-vocabulary (OoV) words. Please refer to the companion paper [Peshkin, 2003] for a substantial discussion of our method and other details. Although currently existing algorithms exhibit high word-level accuracy, PoS tagging is not a solved problem. First, even a small percentage of errors may derail subsequent processing steps.</Paragraph>
    <Paragraph position="1"> Second, the results of tagging are not robust if a large proportion of words are unknown, or if the testing corpus differs in style from the training corpus. At the same time, diverse training corpora are lacking, and most taggers are trained on a large annotated corpus extracted from the Wall Street Journal (WSJ). These factors significantly hamper the use of PoS tagging to extract information from non-standard corpora, such as email messages and websites. Our work on Information Extraction from an email corpus left us searching for a PoS tagger that would perform well on Internet texts and integrate easily into a large probabilistic reasoning system by producing a distribution over tags rather than a deterministic answer. Internet sources exhibit a set of idiosyncratic characteristics not present in the training corpora available to taggers to date. They are often written in telegraphic style, omitting closed-class words, which leads to a higher percentage of ambiguous items. Most importantly, as a consequence of the rapidly evolving Netlingo, Internet texts are full of new words, misspelled words and one-time expressions. These characteristics are bound to lower the accuracy of existing taggers. A look at the literature confirms that error rates for unknown words are quite high.</Paragraph>
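To make "a distribution over tags rather than a deterministic answer" concrete, here is a minimal sketch. It is not the paper's DBN: the counts, the function name, and the uniform fallback for OoV words are all hypothetical stand-ins for a per-word tag posterior.

```python
from collections import Counter

def tag_distribution(word, counts):
    """Return a hypothetical P(tag | word) from raw counts.

    For an unknown (OoV) word, back off to a uniform
    distribution over all tags seen in training.
    """
    c = counts.get(word)
    if c is None:
        # OoV word: uniform prior over the known tagset
        tags = {t for word_counts in counts.values() for t in word_counts}
        return {t: 1.0 / len(tags) for t in tags}
    total = sum(c.values())
    return {t: n / total for t, n in c.items()}

# Toy training counts: "walk" seen 30 times as a verb, 10 as a noun
counts = {"walk": Counter({"VB": 30, "NN": 10})}
dist = tag_distribution("walk", counts)
# dist["VB"] == 0.75, dist["NN"] == 0.25
```

A downstream probabilistic system can consume such a distribution directly, whereas a deterministic tagger would commit to "VB" and discard the 25% alternative.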
    <Paragraph position="2"> According to several recent publications [Toutanova, 2002; Lafferty et al., 2002], OoV tagging presents a serious challenge to the field. The transformation-based Brill tagger achieves 96.5% accuracy on the WSJ, but a mere 85% on unknown words. Existing probabilistic taggers also do not fare well on unknown words; reported results on OoV words rarely exceed Brill's performance by more than a tiny fraction. They are mostly based on (Hidden) Markov Models [Brants, 2000; Kupiec, 1992]. A model based on Conditional Random Fields [Lafferty et al.] outperforms the HMM tagger on unknown words, yielding a 24% error rate. The best result known to us is achieved by Toutanova [2002] by enriching the feature representation of the MaxEnt approach [Ratnaparkhi, 1996].</Paragraph>
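The figures above combine as a simple mixture: overall word-level accuracy is a weighted average of accuracy on known and on unknown words, weighted by the OoV rate. A quick sketch of this arithmetic (the 2% and 20% OoV rates are illustrative assumptions, not figures from the text):

```python
def overall_accuracy(acc_known, acc_oov, oov_rate):
    """Word-level accuracy as a mixture of known-word and OoV accuracy."""
    return (1 - oov_rate) * acc_known + oov_rate * acc_oov

# Brill-style numbers: 96.5% on known words, 85% on OoV words.
# With a WSJ-like 2% OoV rate, overall accuracy barely suffers:
round(overall_accuracy(0.965, 0.85, 0.02), 4)  # 0.9627
# With an Internet-corpus-like 20% OoV rate, it drops noticeably:
round(overall_accuracy(0.965, 0.85, 0.20), 4)  # 0.942
```

This is why a high OoV error rate, nearly invisible on WSJ-style test sets, becomes the dominant error source on non-standard corpora.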
  </Section>
</Paper>