File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/w06-3604_intro.xml

Size: 4,545 bytes

Last Modified: 2025-10-06 14:04:15

<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3604">
  <Title>All-word prediction as the ultimate confusable disambiguation</Title>
  <Section position="3" start_page="0" end_page="25" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Word prediction is an intriguing language engineering semi-product. Arguably it is the &amp;quot;archetypical prediction problem in natural language processing&amp;quot; (Even-Zohar and Roth, 2000). It is usually not an engineering end in itself to predict the next word in a sequence, or fill in a blanked-out word in a sequence.</Paragraph>
    <Paragraph position="1"> Yet, it could be an asset in higher-level proofing or authoring tools, e.g. to be able to automatically discern among confusables and thereby to detect confusable errors (Golding and Roth, 1999; Even-Zohar and Roth, 2000; Banko and Brill, 2001; Huang and Powers, 2001). It could alleviate problems with low-frequency and unknown words in natural language processing and information retrieval, by replacing them with likely and higher-frequency alternatives that carry similar information. And also, since the task of word prediction is a direct interpretation of language modeling, a word prediction system could provide useful information for to be used in speech recognition systems.</Paragraph>
    <Paragraph position="2"> A unique aspect of the word prediction task, as compared to most other tasks in natural language processing, is that real-world examples abound in large amounts. Any digitized text can be used as training material for a word prediction system capable of learning from examples, and nowadays gigascale and terascale document collections are available for research purposes.</Paragraph>
    <Paragraph position="3"> A specific type of word prediction is confusable prediction, i.e., learn to predict among limited sets of confusable words such as to/two/too and there/their/they're (Golding and Roth, 1999; Banko and Brill, 2001). Having trained a confusable predictor on occurrences of words within a confusable set, it can be applied to any new occurrence of a word from the set; if its prediction based on the context deviates from the word actually present, then  this word might be a confusable error, and the classifier's prediction might be its correction. Confusable prediction and correction is a strong asset in proofing tools.</Paragraph>
    <Paragraph position="4"> In this paper we generalize the word prediction task to predicting any word in context. This is basically the task of a generic language model. An explicit choice for the particular study on &amp;quot;all-words&amp;quot; prediction is to encode context only by words, and not by any higher-level linguistic non-terminals which have been investigated in related work on word prediction (Wu et al., 1999; Even-Zohar and Roth, 2000). This choice leaves open the question how the same tasks can be learned from examples when non-terminal symbols are taken into account as well.</Paragraph>
    <Paragraph position="5"> The choice for our algorithm, a decision-tree approximation of k-nearest-neigbor (k-NN) based or memory-based learning, is motivated by the fact that, as we describe later in this paper, this particular algorithm can scale up to predicting tens of thousands of words, while simultaneously being able to scale up to tens of millions of examples as training material, predicting words at useful rates of hundreds to thousands of words per second. Another motivation for our choice is that our decision-tree approximation of k-nearest neighbor classification is functionally equivalent to back-off smoothing (Zavrel and Daelemans, 1997); not only does it share its performance capacities with n-gram models with back-off smoothing, it also shares its scaling abilities with these models, while being able to handle large values of n.</Paragraph>
    <Paragraph position="6"> The article is structured as follows. In Section 2 we describe what data we selected for our experiments, and we provide an overview of the experimental methodology used throughout the experiments, including a description of the IGTREE algorithm central to our study. In Section 3 the results of the word prediction experiments are presented, and the subsequent Section 4 contains the experimental results of the experiments on confusables. We briefly relate our work to earlier work that inspired the current study in Section 5. The results are discussed, and conclusions are drawn in Section 6.</Paragraph>
  </Section>
class="xml-element"></Paper>