File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/99/e99-1018_intro.xml

Size: 3,387 bytes

Last Modified: 2025-10-06 14:06:51

<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1018">
  <Title>POS Disambiguation and Unknown Word Guessing with Decision Trees</Title>
  <Section position="2" start_page="0" end_page="134" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Part-of-speech (POS) taggers are software devices that aim to assign unambiguous morphosyntactic tags to words of electronic texts. Although the hardest part of the tagging process is performed by a computational lexicon, a POS tagger cannot solely consist of a lexicon due to: (i) morphosyntactic ambiguity (e.g., 'love' as verb or noun) and (ii) the existence of unknown words (e.g., proper nouns, place names, compounds, etc.). When the lexicon can assure high coverage, unknown word guessing can be viewed as a decision taken upon the POS of open-class words (i.e., Noun, Verb, Adjective, Adverb or Participle).</Paragraph>
    <Paragraph position="1"> Towards the disambiguation of POS tags, two main approaches have been followed. On one hand, according to the linguistic approach, experts encode handcrafted rules or constraints based on abstractions derived from language paradigms (usually with the aid of corpora) (Green and Rubin, 1971; Voutilainen 1995). On the other hand, according to the data-driven approach, a frequency-based language model is acquired from corpora and has the forms of n-grams (Church, 1988; Cutting et al., 1992), rules (Hindle, 1989; Brill, 1995), decision trees (Cardie, 1994; Daelemans et al., 1996) or neural networks (Schmid, 1994).</Paragraph>
    <Paragraph position="2"> In order to increase their robusmess, most POS taggers include a guesser, which tries to extract the POS of words not present in the lexicon. As a common strategy, POS guessers examine the endings of unknown words (Cutting et al. 1992) along with their capitalization, or consider the distribution of unknown words over specific parts-of-speech (Weischedel et aL, 1993). More sophisticated guessers further examine the prefixes of unknown words (Mikheev, 1996) and the categories of contextual tokens (Brill, 1995; Daelemans et aL, 1996).</Paragraph>
    <Paragraph position="3"> This paper presents a POS tagger for Modem Greek (M. Greek), a highly inflectional language, and focuses on a data-driven approach for the induction of decision trees used as disambiguation/guessing devices. Based on a high-coverage 1 lexicon, we prepared a tagged corpus capable of showing off the behavior of all POS ambiguity schemes present in M. Greek (e.g., Pronoun-Clitic-Article, Pronoun-Clitic, Adjective-Adverb, Verb-Noun, etc.), as well as the characteristics of unknown words.</Paragraph>
    <Paragraph position="4"> Consequently, we used the corpus for the induction of decision trees, which, along with  the lexicon, are integrated into a robust POS tagger for M. Greek texts.</Paragraph>
    <Paragraph position="5"> The disambiguating methodology followed is highly influenced by the Memory-Based Tagger (MBT) presented in (Daelemans et aL, 1996).</Paragraph>
    <Paragraph position="6"> Our main contribution is the successful application of the decision-tree methodology to M. Greek with three improvements/customizations: (i) injection of linguistic bias to the learning procedure, (ii) formation of tag-set independent training patterns, and (iii) handling of set-valued features.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML