<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1129"> <Title>Exploring Distributional Similarity Based Models for Query Spelling Correction</Title> <Section position="5" start_page="1025" end_page="1028" type="metho"> <SectionTitle> 3 Distributional Similarity-Based Models for Query Spelling Correction </SectionTitle>
<Section position="1" start_page="1025" end_page="1026" type="sub_section"> <SectionTitle> 3.1 Motivation </SectionTitle>
<Paragraph position="0"> Most previous work on spelling correction concentrates on designing better error models based on properties of character strings. This line of work has evolved from the simple Damerau-Levenshtein distance (Damerau, 1964; Levenshtein, 1966) to probabilistic models that estimate string edit probabilities from corpora (Church and Gale, 1991; Mayes et al., 1991; Ristad and Yianilos, 1997; Brill and Moore, 2000; Ahmad and Kondrak, 2005). In these methods, however, the similarity between two strings is modeled as an average over many misspelling-correction pairs, which may cause many idiosyncratic spelling errors to be ignored. Some of these are typical word-level cognitive errors. For instance, given the query term adventura, a character string-based error model usually assigns similar similarity scores to its two most probable corrections, adventure and aventura. Since adventure occurs with much higher frequency, it would most likely be generated as the suggestion. However, our inspection of the query logs reveals that adventura is in most cases a common misspelling of aventura. Two annotators were asked to judge 36 randomly sampled queries that contain more than one term, and they agreed that 35 of them should be corrected to aventura.</Paragraph>
<Paragraph position="1"> To solve this problem, we consider alternative methods that make use of information beyond a term's character string. Distributional similarity provides such a dimension, measuring the possibility that one word can be replaced by another based on the statistics of the words that co-occur with them. Distributional similarity has been proposed for tasks such as language model smoothing and word clustering, but to the best of our knowledge, it has not been explored for estimating similarities between misspellings and their corrections. In this section, we use only the cosine metric, for illustration purposes.</Paragraph>
<Paragraph position="2"> Query logs serve as an excellent corpus for distributional similarity estimation, because they are not only an up-to-date term base but also a comprehensive spelling error repository (Cucerzan and Brill, 2004; Ahmad and Kondrak, 2005). Given a large enough query log, some misspellings, such as adventura, occur so frequently that we can obtain reliable statistics of their typical usage. Essential to our method is the observation that distributional similarity is high between frequently occurring spelling errors and their corrections, but low between irrelevant terms. For example, we observe that adventura occurred more than 3,300 times in a set of logged queries spanning three months, and its context was similar to that of aventura. Both usually appeared after words like peurto and lyrics, and were followed by mall, palace and resort.
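The comparison can be made concrete with a small sketch of the context-vector computation in Python; the co-occurrence counts below are made up for illustration (the real statistics come from the query logs), and this is not the procedure used in the paper.

    from collections import Counter
    from math import sqrt

    # Hypothetical term-frequency vectors of surrounding words, standing in for
    # counts collected from query-log contexts of each term.
    context = {
        "adventura": Counter({"peurto": 950, "lyrics": 400, "mall": 820, "palace": 300, "resort": 610}),
        "aventura": Counter({"peurto": 1200, "lyrics": 520, "mall": 900, "palace": 410, "resort": 700}),
        "adventure": Counter({"games": 1500, "movie": 900, "travel": 650, "books": 400}),
    }

    def cosine(u, v):
        # Cosine similarity between two sparse tf vectors.
        dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
        norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    print(cosine(context["adventura"], context["aventura"]))   # high: contexts overlap heavily
    print(cosine(context["adventura"], context["adventure"]))  # near zero: contexts are disjoint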
Further computation shows that, in the tf (term frequency) vector space built from surrounding words, the cosine value between them is approximately 0.8, which indicates that these two terms are used in a very similar way by the users trying to search for aventura. The cosine between adventura and adventure is less than 0.03, so we can basically conclude that they are two unrelated terms, although their spellings are similar.</Paragraph>
<Paragraph position="4"> Distributional similarity is also helpful in addressing another challenge for query spelling correction: differentiating valid OOV terms from frequently occurring misspellings.</Paragraph>
<Paragraph position="5"> Consider pairs of words in which the two words have similar spelling, lexicon and frequency properties. The distributional similarity between the words of each pair provides the information needed to decide that vacuum is a spelling error while seraphin is a valid OOV term.</Paragraph>
</Section> <Section position="2" start_page="1026" end_page="1026" type="sub_section"> <SectionTitle> 3.2 Problem Formulation </SectionTitle>
<Paragraph position="0"> In this work, we view the query spelling correction task as a statistical sequence inference problem. Under the probabilistic model framework, it can be conceptually formulated as follows.</Paragraph>
<Paragraph position="1"> Given a correction candidate set C for a query string q, C = \{ c \mid EditDist(q, c) \le \delta \}, in which each correction candidate c satisfies the constraint that the edit distance between c and q does not exceed a given threshold \delta, the model is to find the candidate that maximizes the posterior probability: c^* = \arg\max_{c \in C} P(c \mid q) (1)</Paragraph>
<Paragraph position="3"> In practice, the correction candidate set C is not generated from the entire query string directly. Correction candidates are first generated for each term of a query, and then C is constructed by composing the candidates of the individual terms. The edit distance threshold \delta is set for each term in proportion to the length of the term.</Paragraph>
</Section> <Section position="3" start_page="1026" end_page="1027" type="sub_section"> <SectionTitle> 3.3 Source Channel Model </SectionTitle>
<Paragraph position="0"> The source channel model has been widely used for spelling correction (Kernighan et al., 1990; Mayes et al., 1991; Brill and Moore, 2000; Ahmad and Kondrak, 2005). Instead of directly optimizing (1), the source channel model solves an equivalent problem by applying Bayes' rule and dropping the constant denominator: c^* = \arg\max_{c \in C} P(q \mid c) P(c) (2)</Paragraph>
<Paragraph position="2"> In this approach, two component generative models are involved: a source model P(c) that generates the user's intended query c, and an error model P(q \mid c) that generates the actually issued query q given c. These two component models can be estimated independently.</Paragraph>
<Paragraph position="3"> In practice, for a multi-term query, the source model can be approximated with an n-gram statistical language model estimated from tokenized query logs. Taking a bigram model as an example, for a correction candidate c containing n terms, c = c_1 c_2 \ldots c_n, P(c) can be written as the product of consecutive bigram probabilities: P(c) = \prod_i P(c_i \mid c_{i-1}). Similarly, the error model probability of a query is decomposed into the generation probabilities of individual terms, which are assumed to be generated independently: P(q \mid c) = \prod_i P(q_i \mid c_i). Previously proposed methods for error model estimation are all based on the similarity between the character strings of q_i and c_i, as described in Section 3.1.
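Concretely, the scoring implied by this decomposition can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bigram and per-term error probabilities are assumed to be supplied by smoothed, non-zero models estimated from query logs, candidate generation is omitted, and the "<s>" start symbol is our own convention.

    import math

    def score_candidate(query_terms, cand_terms, bigram_prob, error_prob):
        # Source channel score of equation (2), with P(c) decomposed into bigram
        # probabilities and P(q|c) into per-term error probabilities.
        log_p = 0.0
        prev = "<s>"  # assumed sentence-start symbol
        for c in cand_terms:
            log_p += math.log(bigram_prob(c, prev))  # log P(c_i | c_{i-1})
            prev = c
        for q, c in zip(query_terms, cand_terms):
            log_p += math.log(error_prob(q, c))      # log P(q_i | c_i)
        return log_p

    def best_correction(query_terms, candidates, bigram_prob, error_prob):
        # Pick the candidate c* maximizing P(q|c)P(c) over the candidate set C.
        return max(candidates,
                   key=lambda c: score_candidate(query_terms, c, bigram_prob, error_prob))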
Here we describe a distributional similarity-based method for this problem. There are different ways to estimate the distributional similarity between two words (Dagan et al., 1997); the one we propose to use is confusion probability (Essen and Steinbiss, 1992). Formally, the confusion probability P_c estimates the probability that one word w_1 can be replaced by another word w_2: P_c(w_2 \mid w_1) = \sum_{w} P(w_2 \mid w) P(w \mid w_1) (3)</Paragraph>
<Paragraph position="5"> where w ranges over the set of words that co-occur with both w_1 and w_2.</Paragraph>
<Paragraph position="6"> From the spelling correction point of view, given that w_1 is a valid word and w_2 is one of its spelling errors, P_c(w_2 \mid w_1) estimates the chance that w_1 is misspelled as w_2 in the query logs. Compared to other similarity measures such as the cosine or the Euclidean distance, confusion probability is of interest because it defines a probability distribution rather than a generic measure. This property makes it theoretically sounder to use as the error model probability in the Bayesian framework of the source channel model.</Paragraph>
<Paragraph position="7"> Thus it can be applied and evaluated independently. However, before using confusion probability as our error model, we have to solve two problems: probability renormalization and smoothing.</Paragraph>
<Paragraph position="8"> Unlike string edit-based error models, which concentrate most of the probability mass on terms with similar spellings, confusion probability distributes probability over the entire vocabulary of the training data. This property may lead to unfair comparisons between different correction candidates if we use (3) directly as the error model probability, because the distributionally similar words of different candidates may account for different portions of the probability mass. This problem can be solved by re-normalizing the probabilities over only a term's possible correction candidates and the term itself. To obtain a better estimate, we also require that the frequency of a correction candidate be higher than that of the query term, based on the observation that correct spellings generally occur more often in query logs. Formally, given a word w and its correction candidate set C, the confusion probability of a word w' conditioned on w can be re-estimated as P_c'(w' \mid w) = P_c(w' \mid w) / \sum_{w'' \in C \cup \{w\}} P_c(w'' \mid w) (4)</Paragraph>
<Paragraph position="10"> where P_c(w'' \mid w) is the original definition of confusion probability in (3).</Paragraph>
<Paragraph position="11"> In addition, we may also face the zero-probability problem when the query term has not appeared in the query logs or has few context words there. In such cases, no distributional similarity information to any known term is available. To solve this problem, we define the final error model probability as a linear combination of the confusion probability and a string edit-based error model probability P_ed(q \mid c): P(q \mid c) = \lambda P_c(q \mid c) + (1 - \lambda) P_ed(q \mid c) (5) where \lambda is an interpolation parameter between 0 and 1 that can be experimentally optimized on a development data set.</Paragraph>
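A minimal sketch of how equations (3)-(5) could be computed is given below. The co-occurrence counts, candidate sets, function names and the edit-based model p_ed are hypothetical placeholders of our own, not the statistics or models used in the paper.

    from collections import Counter

    # Hypothetical counts: context word -> counts of terms observed next to it.
    cooc = {
        "peurto": Counter({"aventura": 1200, "adventura": 900}),
        "mall": Counter({"aventura": 800, "adventura": 600, "adventure": 5}),
        "games": Counter({"adventure": 1500}),
    }
    term_total = Counter()
    for ctx in cooc.values():
        term_total.update(ctx)

    def p_conf(w2, w1):
        # Equation (3): P_c(w2 | w1) = sum over shared context words w of P(w2 | w) * P(w | w1).
        if term_total[w1] == 0:
            return 0.0
        score = 0.0
        for ctx in cooc.values():
            if ctx[w1] and ctx[w2]:
                score += (ctx[w2] / sum(ctx.values())) * (ctx[w1] / term_total[w1])
        return score

    def p_conf_renorm(w_prime, w, candidates):
        # Equation (4): renormalize P_c over the term w and its candidates only.
        support = set(candidates) | {w}
        z = sum(p_conf(x, w) for x in support)
        return p_conf(w_prime, w) / z if z else 0.0

    def p_error(q_term, correction, candidates, p_ed, lam=0.7):
        # Equation (5): interpolate with a string edit-based error model p_ed.
        return (lam * p_conf_renorm(q_term, correction, candidates)
                + (1 - lam) * p_ed(q_term, correction))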
</Section> <Section position="4" start_page="1027" end_page="1028" type="sub_section"> <SectionTitle> 3.4 Maximum Entropy Model </SectionTitle>
<Paragraph position="0"> Theoretically, we are more interested in building a unified probabilistic spelling correction model that is able to leverage all available features, which could include (but are not limited to) traditional character string-based typographical similarity, phonetic similarity, and the distributional similarity proposed in this work. The maximum entropy model (Berger et al., 1996) provides a well-founded framework for this purpose and has been extensively used in natural language processing tasks ranging from part-of-speech tagging to machine translation.</Paragraph>
<Paragraph position="1"> For our task, the maximum entropy model defines a posterior probability distribution P(c \mid q) over a set of feature functions f_i(q, c) defined on an input query q and its correction candidate c: P(c \mid q) = \exp(\sum_i \lambda_i f_i(q, c)) / \sum_{c'} \exp(\sum_i \lambda_i f_i(q, c')) (6)</Paragraph>
<Paragraph position="3"> where the \lambda_i are feature weights, which can be optimized by maximizing the posterior probability on the training set: \lambda^* = \arg\max_{\lambda} \prod_{(q, c) \in TD} P(c \mid q)</Paragraph>
<Paragraph position="5"> where TD denotes the set of training samples in the form of query-truth pairs presented to the training algorithm.</Paragraph>
<Paragraph position="6"> We use the Generalized Iterative Scaling (GIS) algorithm (Darroch and Ratcliff, 1972) to learn the model parameters \lambda_i of the maximum entropy model. GIS training requires normalization over all possible prediction classes, as shown in the denominator of equation (6). Since the number of potential correction candidates may be huge for multi-term queries, it is not practical to perform the normalization over the entire search space. Instead, we approximate the sum over an n-best list (a list of the most probable correction candidates). This is similar to what Och and Ney (2002) used for their maximum entropy-based statistical machine translation training.</Paragraph>
<Paragraph position="7"> Features used in our maximum entropy model are classified into two categories: I) baseline features and II) features supported by distributional similarity evidence. Below we list the feature templates.</Paragraph>
<Paragraph position="8"> Category I: 1. Language model probability feature. This is the only real-valued feature; its value is set to the logarithm of the source model probability: f_prob(q, c) = \log P(c). 2. Edit distance-based features, which are generated by checking whether the weighted Levenshtein edit distance between a query term and its correction falls within a certain range. All the following features, including this one, are binary features and have feature functions of the form f_i(q, c) = 1 if the constraints described in the template are satisfied, and f_i(q, c) = 0 otherwise.</Paragraph>
<Paragraph position="11"> 3. Frequency-based features, which are generated by checking whether the frequencies of a query term and its correction candidate are above certain thresholds; 4. Lexicon-based features, which are generated by checking whether a query term and its correction candidate are in a conventional spelling lexicon; 5. Phonetic similarity-based features, which are generated by checking whether the edit distance between the metaphones (Philips, 1990) of a query term and its correction candidate is below certain thresholds.</Paragraph>
<Paragraph position="12"> Category II: 6. Distributional similarity-based term features, which are generated by checking whether a query term's frequency is higher than certain thresholds while there is no candidate for it with both a higher frequency and a high enough distributional similarity. This is usually an indicator that the query term is valid but not covered by the spelling lexicon. The frequency thresholds are enumerated from 10,000 to 50,000 with an interval of 5,000.</Paragraph>
<Paragraph position="13"> 7. Distributional similarity-based correction candidate features, which are generated by checking whether a correction candidate's frequency is higher than that of the query term or the candidate is in the lexicon, and at the same time the distributional similarity is higher than certain thresholds. This generally gives evidence that the query term may be a common misspelling of the current candidate. The distributional similarity thresholds are enumerated from 0.6 to 1 with an interval of 0.1.</Paragraph>
</Section> </Section> <Section position="6" start_page="1028" end_page="1029" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="1028" end_page="1029" type="sub_section"> <SectionTitle> 4.1 Dataset </SectionTitle>
<Paragraph position="0"> We randomly sampled 7,000 queries from the daily query logs of MSN Search, and they were manually labeled by two annotators. For each query identified as containing spelling errors, corrections were given by the annotators independently.</Paragraph>
<Paragraph position="1"> From the annotation results, 3,061 queries that both annotators agreed upon were extracted and further divided into a test set containing 1,031 queries and a training set containing 2,030 queries. The test set contains 171 queries identified as containing spelling errors, an error rate of 16.6%; the corresponding numbers for the training set are 312 and 15.3%. The average query length is 2.8 terms on the training set and 2.6 terms on the test set.</Paragraph>
<Paragraph position="2"> In our experiments, a term bigram model is used as the source model. The bigram model is trained on MSN Search query log data from October 2004 to June 2005. Correction candidates are generated from a term base extracted from the same set of query logs.</Paragraph>
<Paragraph position="3"> For each experiment, performance is evaluated by the following metrics. Accuracy: the number of correct outputs generated by the system divided by the total number of queries in the test set. Recall: the number of correct suggestions for misspelled queries generated by the system divided by the total number of misspelled queries in the test set. Precision: the number of correct suggestions for misspelled queries generated by the system divided by the total number of suggestions made by the system.</Paragraph>
</Section> </Section> </Paper>
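For concreteness, the three metrics defined above can be computed as in the following sketch. The input structures (a query-to-suggestion mapping and a query-to-truth mapping) are hypothetical and are not part of the paper's evaluation setup.

    def evaluate(system_output, gold):
        # system_output: query -> suggested correction, or None if no suggestion is made.
        # gold: query -> intended query (equal to the query itself if correctly spelled).
        total = len(gold)
        misspelled = {q for q, t in gold.items() if t != q}
        suggestions = {q: s for q, s in system_output.items() if s is not None and s != q}

        # A correct output is either the right suggestion for a misspelled query
        # or no suggestion for a correctly spelled query.
        correct_outputs = sum(1 for q, t in gold.items() if suggestions.get(q, q) == t)
        correct_suggestions = sum(1 for q, s in suggestions.items()
                                  if q in misspelled and s == gold[q])

        accuracy = correct_outputs / total
        recall = correct_suggestions / len(misspelled) if misspelled else 0.0
        precision = correct_suggestions / len(suggestions) if suggestions else 0.0
        return accuracy, recall, precision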