File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/97/j97-3003_abstr.xml

Size: 8,011 bytes

Last Modified: 2025-10-06 13:48:56

<?xml version="1.0" standalone="yes"?>
<Paper uid="J97-3003">
  <Title>Automatic Rule Induction for Unknown-Word Guessing</Title>
  <Section position="2" start_page="0" end_page="406" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Words unknown to the lexicon present a substantial problem to NLP modules (for instance, part-of-speech (POS) taggers) that rely on information about words, such as their part of speech, number, gender, or case. Taggers assign a single POS-tag to a word-token, provided that it is known what POS-tags this word can take on in general and the context in which this word was used. A POS-tag stands for a unique set of morpho-syntactic features, as exemplified in Table 1, and a word can take several POS-tags, which constitute an ambiguity class or POS-class for this word. Words with their POS-classes are usually kept in a lexicon. For every input word-token, the tagger accesses the lexicon, determines the possible POS-tags this word can take on, and then chooses the most appropriate one. However, some domain-specific words or infrequently used morphological variants of general-purpose words can be missing from the lexicon, and thus their POS-classes must be guessed by the system and only then sent to the disambiguation module.</Paragraph>
    <Paragraph position="1"> [Table 1 (fragment): VBZ: verb, present, 3rd person (takes); VBP: verb, present, non-3rd person (take).] The simplest approach to POS-class guessing is either to assign all possible tags to an unknown word or to assign the most probable one, which is proper singular noun for capitalized words and common singular noun otherwise. The appealing feature of these approaches is their extreme simplicity. Not surprisingly, their performance is quite poor: if a word is assigned all possible tags, the search space for choosing a single POS-tag grows, which makes disambiguation fragile; if every unknown word is classified as a noun, disambiguation presents no difficulty, but accuracy suffers because such a guess is not reliable enough. Assigning capitalized unknown words the category proper noun seems a good heuristic but may not always work, as argued in Church (1988), who proposes a more elaborate heuristic. Dermatas and Kokkinakis (1995) proposed a simple probabilistic approach to unknown-word guessing: the probability that an unknown word has a particular POS-tag is estimated from the probability distribution of hapax words (words that occur only once) in the previously seen texts (footnote 1). Whereas such a guesser is more accurate than the naive assignments and is easily trainable, the tagging performance on unknown words is reported to be only about 66% correct for English (footnote 2). More advanced word-guessing methods use word features such as leading and trailing word segments to determine possible tags for unknown words. Such methods can achieve better performance, reaching tagging accuracy of up to 85% on unknown words for English (Brill 1994; Weischedel et al. 1993). The Xerox tagger (Cutting et al. 1992) comes with a set of rules that assign an unknown word a set of possible POS-tags (i.e., a POS-class) on the basis of its ending segment. We call such rules ending-guessing rules because they rely only on ending segments in their predictions. 
For example, an ending-guessing rule can predict that a word is a gerund or an adjective if it ends in -ing. The ending-guessing approach was elaborated in Weischedel et al. (1993), where the POS-tag of an unknown word was guessed from the probability of the word having a particular POS-tag, given its capitalization feature and its ending. Brill (1994, 1995) describes a system of rules that uses both ending-guessing and more morphologically motivated rules. A morphological rule, unlike an ending-guessing rule, uses information about morphologically related words already known to the lexicon in its prediction. For instance, a morphologically motivated guessing rule can say that a word is an adjective if adding the suffix -ly to it yields a word already known to the lexicon. Clearly, ending-guessing rules have wider coverage than morphologically oriented ones, but their predictions can be less accurate.</Paragraph>
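The three kinds of guess discussed above can be illustrated with a minimal sketch. It is not the implementation of any of the cited systems: the Penn Treebank-style tag names (NNP, NN, VBG, JJ), the toy lexicon, and all function names are assumptions introduced only for illustration.

```python
# Illustrative sketch only. The tag names, the toy lexicon, and the
# function names are assumptions, not taken from any cited system.
TOY_LEXICON = {"quickly", "sadly", "walk"}


def naive_guess(word):
    """Baseline: proper singular noun if capitalized, common singular noun otherwise."""
    return {"NNP"} if word[:1].isupper() else {"NN"}


def ending_guess(word):
    """Ending-guessing rule: a word ending in 'ing' is a gerund or an adjective."""
    if word.endswith("ing"):
        return {"VBG", "JJ"}
    return None  # the rule does not apply


def morphological_guess(word, lexicon):
    """Morphological rule: adjective if word + 'ly' is already in the lexicon."""
    if word + "ly" in lexicon:
        return {"JJ"}
    return None  # the rule does not apply
```

The ending rule fires on any string that ends in the given segment, whereas the morphological rule fires only when a related form is already in the lexicon. This is one way to see why ending-guessing rules have the wider coverage and morphological rules tend to be the more accurate.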
    <Paragraph position="2"> The major topic in the development of word-POS guessers is the strategy used for the acquisition of the guessing rules. A rule-based tagger described in Voutilainen (1995) was equipped with a set of guessing rules that had been hand-crafted using knowledge of English morphology and intuitions. A more appealing approach is automatic acquisition of such rules from available lexical resources, since it is usually less labor-intensive and less error-prone. Zhang and Kim (1990) developed a system for automated learning of morphological word formation rules. This system divides a string into three regions and infers from training examples their correspondence to underlying morphological features. Kupiec (1992) describes a guessing component that uses a prespecified list of suffixes (or rather endings) and then statistically learns the predictive properties of those endings from an untagged corpus. In Brill (1994, 1995) a transformation-based learner that learns guessing rules from a pretagged training corpus is outlined: first, the unknown words are labeled as common nouns and a list of generic transformations is defined; then the learner tries to instantiate the generic transformations with word features observed in the text. A statistical suffix learner is presented in Schmid (1994). From a training corpus, it constructs a suffix tree in which every suffix is associated with an information measure for emitting a particular POS-tag. 
Although the learning process in these systems is fully automated and the accuracy of the obtained guessing rules reaches current state-of-the-art levels, estimating their parameters requires significant amounts of specially prepared training data: a large training corpus (usually pretagged), training examples, and so on. Footnote 1: A similar idea for estimating lexical prior probabilities for unknown words was suggested in Baayen and Sproat (1995). Footnote 2: The best result was obtained for German ([?]2% accuracy) and the worst for Italian (50% accuracy).</Paragraph>
    <Paragraph position="3"> In this paper, we describe a novel, fully automatic technique for the induction of POS-class-guessing rules for unknown words. This technique has been partially outlined in Mikheev (1996a, 1996b). Along with a level of accuracy for the induced rules that is higher than any previously quoted, it has an advantage in the quantity of training data and the simplicity of annotation it requires. Unlike many other approaches, which implicitly or explicitly assume that the surface manifestations of morpho-syntactic features of unknown words are different from those of general language, we argue that within the same language unknown words obey general morphological regularities. In our approach, we do not require large amounts of annotated text but employ fully automatic statistical learning using a pre-existing general-purpose lexicon mapped to a particular tag set and a word-frequency distribution collected from a raw corpus. The proposed technique is targeted at the acquisition of both morphological and ending-guessing rules, which can then be applied in a cascade, with the most accurate guessing rules first. The rule induction process is guided by a thorough guessing-rule evaluation methodology that employs precision, recall, and coverage as evaluation metrics.</Paragraph>
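The cascading application mentioned above can be sketched as follows. This is a hypothetical illustration, not the procedure developed later in the paper: the rule contents, the accuracy scores attached to them, the fallback to common noun, and all names are assumptions made for the example.

```python
# Hypothetical sketch: the rules, their accuracy scores, and the default
# POS-class are made up for illustration and are not induced from data.
def make_ending_rule(ending, pos_class):
    """Build an ending-guessing rule that fires on a given ending segment."""
    def rule(word, lexicon):
        return pos_class if word.endswith(ending) else None
    return rule


def make_ly_rule():
    """Build a morphological rule: adjective if word + 'ly' is in the lexicon."""
    def rule(word, lexicon):
        return frozenset({"JJ"}) if word + "ly" in lexicon else None
    return rule


def guess_cascade(word, scored_rules, lexicon, default=frozenset({"NN"})):
    """Apply rules in descending order of assumed accuracy; the first match wins."""
    ordered = sorted(scored_rules, key=lambda pair: pair[0], reverse=True)
    for _score, rule in ordered:
        guess = rule(word, lexicon)
        if guess is not None:
            return guess
    return default  # no rule applied


LEXICON = {"quickly", "sadly"}
RULES = [
    (0.80, make_ending_rule("ing", frozenset({"VBG", "JJ"}))),
    (0.95, make_ly_rule()),
]
```

With these toy rules, `guess_cascade("quick", RULES, LEXICON)` is decided by the higher-scored morphological rule, `guess_cascade("blorping", RULES, LEXICON)` falls through to the ending rule, and a word that matches neither receives the default class.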
    <Paragraph position="4"> In the rest of the paper, we first introduce the kinds of guessing rules to be induced and then present a semi-unsupervised (footnote 3) statistical rule induction technique using data derived from the CELEX lexical database (Burnage 1990). Finally, we evaluate the induced guessing rules by removing all the hapax words from the lexicon and tagging the Brown Corpus (Francis and Kucera 1982) with a stochastic tagger and a rule-based tagger.</Paragraph>
  </Section>
</Paper>