<?xml version="1.0" standalone="yes"?>
<Paper uid="J97-3003">
  <Title>Automatic Rule Induction for Unknown-Word Guessing</Title>
  <Section position="6" start_page="419" end_page="421" type="concl">
    <SectionTitle>
5. Conclusion
</SectionTitle>
    <Paragraph position="0"> We have presented a technique for fully automated statistical acquisition of rules that guess possible POS tags for words unknown to the lexicon. This technique does not require specially prepared training data: it trains on a pre-existing general-purpose lexicon and word frequencies collected from a raw corpus. Using such training data, three types of guessing rules are induced: prefix morphological rules, suffix morphological rules, and ending-guessing rules.</Paragraph>
    <Paragraph position="1"> Evaluation of tagging accuracy on unknown words, using texts and words unseen at the training phase, showed that tagging with the automatically induced cascading guesser was consistently more accurate than previously quoted results known to the author (85%). Tagging accuracy on unknown words using the cascading guesser was 87.7-88.7%. The cascading guesser outperformed the guesser supplied with the Xerox tagger and the guesser supplied with Brill's tagger both on unknown proper nouns (a relatively easy-to-guess category of words) and on the rest of the unknown words, where it had an advantage of 6.5-8.5%. When the unknown words were made known to the lexicon, the accuracy of tagging was 93.6-94.3%, so the accuracy drop attributable to the cascading guesser is less than 6% in general.</Paragraph>
    <Paragraph position="2"> Another important conclusion from the evaluation experiments is that the morphological guessing rules do improve guessing performance. Since they are more accurate than ending-guessing rules, they were applied first and improved the precision of the guesses by about 8%. This resulted in about 2% higher accuracy in the tagging of unknown words. The ending-guessing rules constitute the backbone of the guesser and cope with unknown words without clear morphological structure. For instance, discussing the problem of unknown words for robust parsing, Bod (1995, 84) writes: &quot;Notice that richer, morphological annotation would not be of any help here; the words &quot;return&quot;, &quot;stop&quot; and &quot;cost&quot; do not have a morphological structure on the basis of which their possible lexical categories can be predicted.&quot; When we applied the ending-guessing rules to these words, the words return and stop were correctly classified as noun/verbs (NN VB VBP) and only the word cost failed to be guessed by the rules.</Paragraph>
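The cascading application described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy lexicon, the single prefix and suffix rules, the two ending rules, and the function names are hypothetical and are not the rules induced in the paper. The order of the prefix and suffix rules relative to each other is also an assumption; the text states only that morphological rules are applied before ending-guessing rules.

```python
from typing import Callable, Optional

TagSet = frozenset[str]
Lexicon = dict[str, TagSet]
Rule = Callable[[str, Lexicon], Optional[TagSet]]

NOUN_VERB: TagSet = frozenset({"NN", "VB", "VBP"})

# Hypothetical toy lexicon, for illustration only.
LEXICON: Lexicon = {
    "book": NOUN_VERB,
    "do": frozenset({"VB", "VBP"}),
}

# Hypothetical ending rules (ending -> tag set); not the induced rule set.
ENDING_RULES: dict[str, TagSet] = {"urn": NOUN_VERB, "top": NOUN_VERB}
MAX_ENDING = 3


def prefix_rule(word: str, lexicon: Lexicon) -> Optional[TagSet]:
    """Toy prefix rule: 'un' + known word inherits the tags of that word."""
    if word.startswith("un"):
        return lexicon.get(word[2:])
    return None


def suffix_rule(word: str, lexicon: Lexicon) -> Optional[TagSet]:
    """Toy suffix rule: known verb + 'ed' is guessed as JJ, VBD or VBN."""
    if word.endswith("ed"):
        stem_tags = lexicon.get(word[:-2])
        if stem_tags and "VB" in stem_tags:
            return frozenset({"JJ", "VBD", "VBN"})
    return None


def ending_rule(word: str, lexicon: Lexicon) -> Optional[TagSet]:
    """Look up the longest matching ending; the lexicon is not consulted."""
    for n in range(min(MAX_ENDING, len(word) - 1), 0, -1):
        tags = ENDING_RULES.get(word[-n:])
        if tags:
            return tags
    return None


# Morphological rules come first, ending-guessing rules are the fallback.
CASCADE: list[Rule] = [prefix_rule, suffix_rule, ending_rule]


def guess_tags(word: str, lexicon: Lexicon = LEXICON) -> Optional[TagSet]:
    """Return the tag set from the first rule that fires, or None."""
    for rule in CASCADE:
        tags = rule(word, lexicon)
        if tags:
            return tags
    return None


for w in ("undo", "booked", "return", "stop", "cost"):
    print(w, sorted(guess_tags(w) or []))
```

With these toy rules, "return" and "stop" fall through the morphological rules and are caught by the ending rules, while "cost" matches nothing and is left unguessed, mirroring the behaviour reported above. Each rule returns the full set of candidate tags, leaving the choice among them to the disambiguation phase.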
    <Paragraph position="3"> The acquired guessing rules employed in our cascading guesser are, in fact, of a standard nature, which, in some form or other, is present in other word-POS guessers.</Paragraph>
    <Paragraph position="4"> For instance, our ending-guessing rules are akin to those of Xerox, and the morphological rules resemble some of Brill's rules, but ours use more constraints and provide the set of all possible tags for a word rather than a single best tag. The two additional types of features used by Brill's guesser are implicitly represented in our approach as well. One of Brill's schemata checks the context of an unknown word. In our approach we guess the words using their features only and provide several possibilities for a word; then at the disambiguation phase the context is used to choose the right tag. As for Brill's schema that checks for the presence of a particular character in an unknown word, we capture a similar feature by collecting the ending-guessing rules for proper nouns and hyphenated words separately. We believe that the technique for the induction of the ending-guessing rules is quite similar to that of Xerox (see footnote 10) or Schmid (1994) but differs in the scoring and pruning methods. The major advantage of the proposed technique lies in the cascading application of the different sets of guessing rules and in the use of far superior training data. We use for training a pre-existing general-purpose (as opposed to corpus-tuned) lexicon. This has three advantages:
* the size of the training lexicon is large and does not depend on the size or even the existence of an annotated corpus. This allows for the induction of more rules than from a lexicon derived from an annotated corpus. For instance, the ending guesser of Xerox includes 536 rules, whereas our ending guesser includes 2,196 guessing rules;
* the information listed in a general-purpose lexicon can be considered to be of better quality than that derived from an annotated corpus, since it lists all possible readings for a word rather than only those that happen to occur in the corpus. We also believe that general-purpose lexicons contain less erroneous information than those derived from annotated corpora;
* the amount of work required to prepare the training lexicon is minimal and does not require any additional manual annotation.
Footnote 10: Xerox's technique is not documented and can be determined only by inspection of the source code.
Computational Linguistics Volume 23, Number 3</Paragraph>
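As a rough illustration of how ending-guessing rules might be induced from a lexicon and raw-corpus word frequencies, consider the sketch below. It is not the scoring and pruning method of the paper, which this section does not spell out; a plain relative-frequency score with two thresholds is swapped in instead. The function name, the parameters `max_len`, `min_count` and `min_score`, the frequency weighting, and the toy lexicon are all assumptions.

```python
from collections import Counter, defaultdict

TagClass = frozenset[str]


def induce_ending_rules(
    lexicon: dict[str, TagClass],
    word_freq: dict[str, int],
    max_len: int = 3,
    min_count: int = 2,
    min_score: float = 0.8,
) -> dict[str, tuple[TagClass, float]]:
    """Map each sufficiently reliable ending to its dominant tag class.

    Assumed scoring: share of frequency-weighted lexicon entries with this
    ending that carry the dominant tag class (a stand-in for the paper's
    own scoring, which is not given in this section).
    """
    counts: dict[str, Counter] = defaultdict(Counter)
    for word, tag_class in lexicon.items():
        weight = word_freq.get(word, 1)  # assumption: unseen words count once
        for n in range(1, max_len + 1):
            if len(word) > n:  # keep at least one character before the ending
                counts[word[-n:]][tag_class] += weight

    rules: dict[str, tuple[TagClass, float]] = {}
    for ending, dist in counts.items():
        total = sum(dist.values())
        tag_class, hits = dist.most_common(1)[0]
        score = hits / total
        # Pruning: drop rare endings and endings with no clear majority class.
        if total >= min_count and score >= min_score:
            rules[ending] = (tag_class, score)
    return rules


# Hypothetical toy lexicon; each entry lists all readings of the word.
NN = frozenset({"NN"})
NOUN_VERB = frozenset({"NN", "VB", "VBP"})
TOY_LEXICON = {
    "nation": NN, "station": NN, "motion": NN, "barn": NN,
    "return": NOUN_VERB, "turn": NOUN_VERB, "burn": NOUN_VERB,
}

rules = induce_ending_rules(TOY_LEXICON, word_freq={})
for ending in sorted(rules):
    tag_class, score = rules[ending]
    print(ending, sorted(tag_class), round(score, 2))
```

Because each lexicon entry lists every reading of a word, an induced rule predicts a whole tag class rather than a single tag. In the toy data the ending "rn" is pruned because "barn" dilutes the noun/verb class below the threshold, while the longer ending "urn" survives.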
    <Paragraph position="5"> Our experiments with the lexicon derived from the CELEX lexical database and word frequencies derived from the Brown Corpus resulted in guessing rule sets that proved to be domain- and corpus-independent (but tag-set-dependent), producing similar results on texts of different origins. An interesting by-product of the proposed rule-induction technique is the automatic discovery of the template morphological rules advocated in Mikheev and Liubushkina (1995). The induced morphological guessing rules turned out to consist mostly of the expected prefixes and suffixes of English and closely resemble the rules employed by the ispell UNIX spell-checker. The rule acquisition and evaluation methods described here are implemented as a modular set of C++ and AWK tools, and the guesser is easily extendible to sublanguage-specific regularities and retrainable to new tag sets and other languages, provided that these languages have affixational morphology.</Paragraph>
  </Section>
</Paper>