<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2017">
<Title>A Classification Approach to Word Prediction*</Title>
<Section position="2" start_page="0" end_page="124" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> The task of predicting the most likely word based on properties of its surrounding context is the archetypical prediction problem in natural language processing (NLP). In many NLP tasks it is necessary to determine the most likely word, part-of-speech (POS) tag or any other token, given its history or context.</Paragraph>
<Paragraph position="1"> Examples include part-of-speech tagging, word-sense disambiguation, speech recognition, accent restoration, word choice selection in machine translation, context-sensitive spelling correction and identifying discourse markers. Most approaches to these problems are based on n-gram-like modeling. Namely, the learning methods make use of features which are conjunctions of typically (up to) three consecutive words or POS tags in order to derive the predictor.</Paragraph>
<Paragraph position="2"> In this paper we show that incorporating additional information into the learning process is very beneficial. (* This research is supported by NSF grants IIS-9801638 and SBR-987345.)</Paragraph>
<Paragraph position="3"> In particular, we provide the learner with a rich set of features that combine the information available in the local context along with shallow parsing information. At the same time, we study a learning approach that is specifically tailored for problems in which the potential number of features is very large but only a fairly small number of them actually participate in the decision. Word prediction experiments that we perform show significant improvements in error rate relative to the use of the traditional, restricted, set of features.</Paragraph>
<Paragraph position="4"> Background
The most influential problem motivating the application of statistical learning to NLP tasks is that of word selection in speech recognition (Jelinek, 1998).</Paragraph>
<Paragraph position="5"> There, word classifiers are derived from a probabilistic language model which estimates the probability of a sentence s using Bayes rule as the product of conditional probabilities,</Paragraph>
<Paragraph position="6"> Pr(s) = \prod_{i=1}^{n} Pr(w_i | h_i), </Paragraph>
<Paragraph position="7"> where h_i is the relevant history when predicting w_i.</Paragraph>
<Paragraph position="8"> Thus, in order to predict the most likely word in a given context, a global estimation of the sentence probability is derived which, in turn, is computed by estimating the probability of each word given its local context or history. Estimating terms of the form Pr(w | h) is done by assuming some generative probabilistic model, typically using Markov or other independence assumptions, which gives rise to estimating conditional probabilities of n-gram type features (in the word or POS space). Machine learning based classifiers and maximum entropy models, which in principle are not restricted to features of these forms, have nevertheless used them, perhaps under the influence of probabilistic methods (Brill, 1995; Yarowsky, 1994; Ratnaparkhi et al., 1994).</Paragraph>
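As a concrete illustration of this decomposition, the following minimal Python sketch (for exposition only, not the system studied in this paper) estimates Pr(w | h) from trigram counts under a Markov assumption, using an assumed add-alpha smoothing constant and an invented toy corpus, and scores a sentence as the product of the per-word conditional probabilities:

    from collections import defaultdict

    # Toy corpus; purely illustrative.
    corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "dog", "sat", "on", "the", "rug"]]

    trigram_counts = defaultdict(int)   # counts of (w_{i-2}, w_{i-1}, w_i)
    history_counts = defaultdict(int)   # counts of (w_{i-2}, w_{i-1})
    vocab = set()

    for sent in corpus:
        padded = ["<s>", "<s>"] + sent + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            h = (padded[i - 2], padded[i - 1])
            trigram_counts[h + (padded[i],)] += 1
            history_counts[h] += 1

    def prob(word, h, alpha=1.0):
        # Add-alpha smoothed estimate of Pr(word | h); alpha is an assumed constant.
        return (trigram_counts[h + (word,)] + alpha) / (history_counts[h] + alpha * len(vocab))

    def sentence_prob(sent):
        # Pr(s) = prod_i Pr(w_i | h_i), with h_i reduced to the two preceding tokens.
        padded = ["<s>", "<s>"] + sent + ["</s>"]
        p = 1.0
        for i in range(2, len(padded)):
            p *= prob(padded[i], (padded[i - 2], padded[i - 1]))
        return p

    print(sentence_prob(["the", "cat", "sat", "on", "the", "rug"]))

Under this Markov assumption the history h_i collapses to the few preceding tokens, which is precisely the n-gram-style restriction that the richer feature set proposed in this paper is meant to relax.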
<Paragraph position="9"> It has been argued that the information available in the local context of each word should be augmented by global sentence information and even information external to the sentence in order to learn better classifiers and language models. Efforts in this direction consist of (1) directly adding syntactic information, as in (Chelba and Jelinek, 1998; Rosenfeld, 1996), and (2) indirectly adding syntactic and semantic information, via similarity models; in this case n-gram type features are used whenever possible, and when they cannot be used (due to data sparsity), additional information compiled into a similarity measure is used (Dagan et al., 1999). Nevertheless, the efforts in this direction have so far shown only insignificant improvements, if any (Chelba and Jelinek, 1998; Rosenfeld, 1996).</Paragraph>
<Paragraph position="10"> We believe that the main reason is that incorporating additional information sources in NLP needs to be coupled with a learning approach that is suitable for it.</Paragraph>
<Paragraph position="11"> Studies have shown that both machine learning and probabilistic learning methods used in NLP make decisions using a linear decision surface over the feature space (Roth, 1998; Roth, 1999). In this view, the feature space consists of simple functions (e.g., n-grams) over the original data so as to allow for expressive enough representations using a simple functional form (e.g., a linear function). This implies that the number of potential features that the learning stage needs to consider may be very large, and may grow rapidly when increasing the expressivity of the features. Therefore, a computationally feasible approach needs to be feature-efficient. It needs to tolerate a large number of potential features in the sense that the number of examples required for it to converge should depend mostly on the number of features relevant to the decision, rather than on the number of potential features.</Paragraph>
<Paragraph position="12"> This paper addresses the two issues mentioned above. It presents a rich set of features that is constructed using information readily available in the sentence along with shallow parsing and dependency information. It then presents a learning approach that can use this expressive (and potentially large) intermediate representation and shows that it yields a significant improvement in word error rate for the task of word prediction.</Paragraph>
<Paragraph position="13"> The rest of the paper is organized as follows. In section 2 we formalize the problem, discuss the information sources available to the learning system and how we use those to construct features. In section 3 we present the learning approach, based on the SNoW learning architecture. Section 4 presents our experimental study and results. In section 4.4 we discuss the issue of deciding on a set of candidate words for each decision. Section 5 concludes and discusses future work.</Paragraph>
</Section>
</Paper>