<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3238">
  <Title>Spelling correction as an iterative process that exploits the collective knowledge of web users</Title>
  <Section position="4" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Problem Formulation. Prior Work
</SectionTitle>
    <Paragraph position="0"> Comprehensive reviews of the spelling correction literature were provided by Peterson (1980), Kukich (1992), and Jurafsky and Martin (2000). In this section, we survey a few lexicon-based spelling correction approaches by using a series of formal definitions of the task and presenting concrete examples showing the strengths and the limits corresponding to each situation. We iteratively redefine the problem, starting from an approach purely based on a trusted lexicon and ending up with an approach in which the role of the trusted lexicon is greatly diminished. While doing so, we also make concrete forward steps in our attempt to provide a definition of valid web queries.</Paragraph>
    <Paragraph position="1"> Let $\Sigma$ be the alphabet of a language and $L \subset \Sigma^*$ a broad-coverage lexicon of the language. The simplest and historically the first definition of lexicon-based spelling correction (Damerau, 1964) is: Given an unknown word $w \in \Sigma^* \setminus L$, find $w' \in L$ such that $\mathrm{dist}(w, w') = \min_{v \in L} \mathrm{dist}(w, v)$,</Paragraph>
    <Paragraph position="3"> i.e. for any out-of-lexicon word in a text, find the closest word form(s) in the available lexicon and hypothesize it as the correct spelling alternative.</Paragraph>
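As a minimal sketch of this first formulation (not the implementation used in the paper), the Python code below returns the lexicon words closest to an unknown word. The distance is the restricted Damerau-Levenshtein (optimal string alignment) distance with unit costs, which is an assumption made here for illustration; the names `osa_distance` and `closest_in_lexicon` and the toy `LEXICON` are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i leading characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j leading characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[len(a)][len(b)]


def closest_in_lexicon(w: str, lexicon: set[str]) -> list[str]:
    """Return, in sorted order, every lexicon word at minimum distance from w."""
    best = min(osa_distance(w, v) for v in lexicon)
    return sorted(v for v in lexicon if osa_distance(w, v) == best)


# Hypothetical toy lexicon.
LEXICON = {"card", "cord", "power", "video", "crowd"}
print(closest_in_lexicon("crd", LEXICON))  # ['card', 'cord']
```

Both card and cord are at distance 1 from crd, so this formulation alone cannot choose between them.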
    <Paragraph position="4"> dist can be any string-based function; for example, it can be the ratio between the number of letters two words do not have in common and the number of letters they share.[1] The two most used classes of distances in spelling correction are edit distances, as proposed by Damerau (1964) and Levenshtein (1965), and correlation matrix distances (Cherkassky et al., 1974). In our study, we use a modified version of the Damerau-Levenshtein edit distance, as presented in Section 3. One flaw of the preceding formulation is that it does not take into account the frequency of words in a language. A simple solution to this problem is to compute the probability of words in the target language as maximum likelihood estimates (MLE) over a large corpus and reformulate the general spelling-correction problem as follows: Given $w \in \Sigma^* \setminus L$, find $w' \in L$ such that $\mathrm{dist}(w, w') \leq \delta$ and $P(w') = \max_{v \in L:\ \mathrm{dist}(w, v) \leq \delta} P(v)$.</Paragraph>
    <Paragraph position="6"> In this formulation, all in-lexicon words that are within some &quot;reasonable&quot; distance $\delta$ of the unknown word are considered good candidates, the correction being chosen based on its prior probability in the language. While there is an implicit conditioning on the original spelling because of the domain on which the best correction is searched, this objective function only uses the prior probability of words in the language and not the actual distances between each candidate and the input word. One solution that allows using a probabilistic edit distance is to condition the probability of a correction on the original spelling, $P(v \mid w)$: Given $w \in \Sigma^* \setminus L$, find $w' \in L$ such that $\mathrm{dist}(w, w') \leq \delta$ and $P(w' \mid w) = \max_{v \in L:\ \mathrm{dist}(w, v) \leq \delta} P(v \mid w)$.</Paragraph>
    <Paragraph position="8"> In a noisy channel model framework, as employed for spelling correction by Kernighan et al.</Paragraph>
    <Paragraph position="9"> (1990), the objective function can be written by using Bayesian inversion as the product between the prior probability of words in a language, $P(v)$ (the language model), and the likelihood of misspelling a word $v$ as $w$, $P(w \mid v)$ (which models the noisy channel and will be called the error model).</Paragraph>
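A minimal sketch of this noisy channel ranking is given below, in Python. It is an illustration under stated assumptions rather than the model used in the paper: the counts in `COUNTS` are invented, the threshold `DELTA` and the constant `ALPHA` are arbitrary, and the error model `exp(-ALPHA * distance)` is an unnormalized stand-in for a trained $P(w \mid v)$. A prior-only ranker is included for comparison.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


# Hypothetical corpus counts; the MLE language model is count / total.
COUNTS = {"card": 600, "cord": 300, "cold": 100}
TOTAL = sum(COUNTS.values())
ALPHA = 2.0  # assumed penalty per edit in the toy error model
DELTA = 2    # assumed distance threshold


def prior(v: str) -> float:
    """MLE estimate of P(v)."""
    return COUNTS[v] / TOTAL


def error_likelihood(w: str, v: str) -> float:
    """Toy, unnormalized stand-in for P(w | v): decays with edit distance."""
    return math.exp(-ALPHA * levenshtein(w, v))


def candidates(w: str) -> list[str]:
    """Lexicon words within distance DELTA of w."""
    return [v for v in COUNTS if DELTA >= levenshtein(w, v)]


def correct_by_prior(w: str) -> str:
    """Prior-only formulation: most frequent candidate within DELTA."""
    return max(candidates(w), key=prior)


def correct_noisy_channel(w: str) -> str:
    """Noisy channel formulation: maximize P(v) * P(w | v)."""
    return max(candidates(w), key=lambda v: prior(v) * error_likelihood(w, v))


print(correct_by_prior("cld"))       # card
print(correct_noisy_channel("cld"))  # cold
```

With these toy numbers, the prior-only ranker picks the frequent but more distant card, while the noisy channel ranker prefers cold, which is only one edit away from cld.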
    <Paragraph position="10"> In the above formulations, unknown words are corrected in isolation. This is a major flaw because context is extremely important for spelling correction, as illustrated in the following example: power crd → power cord; video crd → video card. [1] Note that the function does not have to be symmetric; thus, the notation dist(w,w') is used in a loose sense.</Paragraph>
    <Paragraph position="11"> The misspelled word crd should be corrected to two different words depending on its context.[2] A formulation of the spelling correction problem that takes context into account is the following: Given a string $s \in \Sigma^*$, $s = c_l w c_r$, with $w \in \Sigma^* \setminus L$ and $c_l, c_r \in L^*$, find $w' \in L$ such that $\mathrm{dist}(w, w') \leq \delta$ and $P(w' \mid c_l w c_r) = \max_{v \in L:\ \mathrm{dist}(w, v) \leq \delta} P(v \mid c_l w c_r)$.</Paragraph>
    <Paragraph position="13"> Spaces and other word delimiters are ignored in this formulation and the subsequent formulations for simplicity, although text tokenization represents an important part of the spelling-correction process, as discussed in Sections 5 and 6.</Paragraph>
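The context-sensitive choice can be sketched as follows, in Python. This is a simplified illustration, not the paper's method: the candidate set is assumed to have been computed already (all words within $\delta$ of the unknown word), the phrase counts in `PHRASE_COUNTS` are invented, and the score is a raw phrase count, which ranks candidates in the same order as an MLE estimate of the candidate's probability in that context. The function name `correct_in_context` is hypothetical.

```python
from collections import Counter

# Hypothetical phrase counts standing in for a corpus or query log.
PHRASE_COUNTS = Counter({
    ("power", "cord"): 50,
    ("power", "card"): 3,
    ("video", "card"): 80,
    ("video", "cord"): 2,
})


def correct_in_context(left: tuple[str, ...], w: str, right: tuple[str, ...],
                       candidates: list[str]) -> str:
    """Pick the candidate v whose phrase left + v + right is most frequent.

    If no candidate phrase has been observed, w is returned unchanged.
    """
    def count(v: str) -> int:
        return PHRASE_COUNTS[left + (v,) + right]

    best = max(candidates, key=count)
    return best if count(best) > 0 else w


# Candidates assumed to lie within the distance threshold of "crd".
print(correct_in_context(("power",), "crd", (), ["card", "cord"]))  # cord
print(correct_in_context(("video",), "crd", (), ["card", "cord"]))  # card
```

The same unknown word receives different corrections in the two contexts, which the isolated-word formulations cannot achieve.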
    <Paragraph position="14"> The task definitions enumerated up to this point (on which most traditional spelling correction systems are based) ignore word substitution errors. In the case of web searches, it is extremely important to provide correction suggestions for valid words when they are more meaningful as a search query than the original query, for example: golf war → gulf war; sap opera → soap opera. This problem is partially addressed by the task of context-sensitive spelling correction (CSSC), which can be formalized as follows: Given a set of confusable valid word forms in a language $W = \{w_1, w_2, \ldots, w_n\}$ and a string $s = c_l w_i c_r$, choose $w_j \in W$ such that $P(w_j \mid c_l w_i c_r) = \max_{k = 1..n} P(w_k \mid c_l w_i c_r)$.</Paragraph>
    <Paragraph position="16"> In the CSSC literature, the sets of confusables are presumed known, but they could also be built for each in-lexicon word $w$ as all words $w'$ with $\mathrm{dist}(w, w') \leq \delta$, similarly to the approach investigated by Mays et al. (1991), in which they chose $\delta = 1$ and employed an edit distance with all point changes having the same cost 1.</Paragraph>
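    <Paragraph position="17"> The generalized problem of phrasal spelling correction can then be formulated as follows: Given $s \in \Sigma^*$, find $s' \in L^*$ such that $\mathrm{dist}(s, s') \leq \delta$ and $P(s' \mid s) = \max_{t \in L^*:\ \mathrm{dist}(s, t) \leq \delta} P(t \mid s)$.</Paragraph>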
    <Paragraph position="19"> Typically, a correction is desirable when $s \notin L^*$ (i.e. at least one of the component words is unknown) but, as shown above, there are frequent cases (e.g. golf war) when sequences of valid words should be changed to other word sequences.</Paragraph>
    <Paragraph position="20"> Note that word boundaries are hidden in this latter formulation, making it more general and allowing it to cover two other important spelling error classes, concatenation and splitting, e.g.: power point slides → powerpoint slides; chat inspanich → chat in spanish.</Paragraph>
    <Paragraph position="21"> Yet, it still does not account for another important class of cases in web query correction, which is represented by out-of-lexicon words that are valid in certain contexts (therefore, $s' \notin L^*$), for example: amd processors → amd processors (no change). The above phrase represents a legitimate query, despite the fact that it may contain unknown words when employing a traditional English lexicon. [2] To simplify the exposition, we only consider two highly probable corrections, but other valid alternatives exist, e.g. video cd.</Paragraph>
    <Paragraph position="22"> Some even more interesting cases not handled by traditional spellers and also not covered by the latter formulation are those in which in-lexicon words should be changed to out-of-lexicon words, as in the following examples, where two valid words must be concatenated into an out-of-lexicon word: gun dam planet → gundam planet; limp biz kit → limp bizkit. These observations lead to an even more general formulation of the spelling-correction problem: Given $s \in \Sigma^*$, find $s' \in \Sigma^*$ such that $\mathrm{dist}(s, s') \leq \delta$ and $P(s' \mid s) = \max_{t \in \Sigma^*:\ \mathrm{dist}(s, t) \leq \delta} P(t \mid s)$.</Paragraph>
    <Paragraph position="24"> For the first time, the formulation no longer makes explicit use of a lexicon of the language.[3] In some sense, the actual language in which the web queries are expressed becomes less important than the query-log data from which the string probabilities are estimated. This probability model can be seen as a substitute for a measure of the meaningfulness of strings as web queries. For example, an implausible random noun phrase in any of the traditional corpora such as sad tomatoes is meaningful in the context of web search (being the name of a somewhat popular music band).</Paragraph>
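The lexicon-free formulation can be illustrated with a toy query log, sketched below in Python. All names and frequencies are hypothetical. The distance is a unit-cost Levenshtein distance computed after removing spaces (following the convention above that word delimiters are ignored), and raw query frequency is used as a crude stand-in for the conditional string probability, so this is an assumption-laden sketch rather than the model developed in the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


# Hypothetical query-log frequencies; no lexicon is consulted anywhere.
QUERY_LOG = {
    "gundam planet": 900,
    "gun dam planet": 4,
    "limp bizkit": 1200,
    "amd processors": 700,
    "chat in spanish": 300,
}
DELTA = 1  # assumed distance threshold


def squash(s: str) -> str:
    """Drop spaces so that word boundaries stay hidden from the distance."""
    return s.replace(" ", "")


def correct_query(s: str) -> str:
    """Return the most frequent logged query within DELTA of s, else s itself."""
    near = [t for t in QUERY_LOG
            if DELTA >= levenshtein(squash(s), squash(t))]
    if not near:
        return s
    return max(near, key=QUERY_LOG.get)


print(correct_query("gun dam planet"))  # gundam planet
print(correct_query("chat inspanich"))  # chat in spanish
print(correct_query("amd processors"))  # amd processors
```

Because candidates come from the query log rather than from a lexicon, the sketch handles concatenation, splitting, and valid out-of-lexicon queries with the same rule.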
  </Section>
</Paper>