<?xml version="1.0" standalone="yes"?>
<Paper uid="P00-1037">
  <Title>An Improved Error Model for Noisy Channel Spelling Correction</Title>
  <Section position="4" start_page="1" end_page="3" type="metho">
    <SectionTitle>
2 Error Model
</SectionTitle>
    <Paragraph position="0"> Let Σ be an alphabet. Our model allows all edit operations of the form α → β, where α, β ∈ Σ*. P(α → β) is the probability that when users intend to type the string α they type β instead. Note that the edit operations allowed in Church and Gale (1991), Mayes, Damerau et al. (1991) and Ristad and Yianilos (1997) are properly subsumed by our generic string-to-string substitutions.</Paragraph>
    <Paragraph position="1"> In addition, we condition on the position in the string that the edit occurs in, 3 &amp;quot; _ 361 ZKHUH 361 ^VWDUW RI word, middle of word, end of word}.</Paragraph>
    <Paragraph position="2">  The position is determined by the location of VXEVWULQJ LQ WKH VRXUFH GLFWLRQDU\ ZRUG Positional information is a powerful conditioning feature for rich edit operations. For instance, P(e  |a) does not vary greatly between the three positions mentioned above. However, P(ent  |ant) is highly dependent upon position. People rarely mistype antler as entler, but often mistype reluctant as reluctent.</Paragraph>
    <Paragraph position="3"> Within the noisy channel framework, we can informally think of our error model as follows. First, a person picks a word to generate. Then she picks a partition of the characters of that word. Then she types each partition, possibly erroneously. For example, a person might choose to generate the word physical. She would then pick a partition from the set of all possible partitions, say: ph y s i c al. Then she would generate each partition, possibly with errors. After choosing this particular word and partition, the probability of generating the string fisikle with the partition f i s i k le would be P(f  |ph) *P(i  |y) * P(s  |s) *P(i  |i)</Paragraph>
    <Paragraph position="5"> The above example points to advantages of our model compared to previous models based on weighted Damerau-Levenshtein distance. Note that neither P(f  |ph) nor P(le  |al) are modeled directly in the previous approaches to error modeling. A number of studies have pointed out that a high percentage of misspelled words are wrong due to a single letter insertion, substitution, or deletion, or from a letter pair transposition (Damerau 1964; Peterson 1986). However, even if this is the case, it does not imply that nothing is  Another good PSN feature would be morpheme boundary.</Paragraph>
    <Paragraph position="6">  We will leave off the positional conditioning information for simplicity.</Paragraph>
    <Paragraph position="7"> to be gained by modeling more powerful edit operations. If somebody types the string confidant, we do not really want to model this error as P(a  |e), but rather P(ant | ent). And anticedent can more accurately be modeled by P(anti  |ante), rather than P(i  |e). By taking a more generic approach to error modeling, we can more accurately model the errors people make.</Paragraph>
    <Paragraph position="8"> A formal presentation of our model follows. Let Part(w) be the set of all possible ways of partitioning string w into adjacent (possibly null) substrings. For a particular partition R[?]Part(w), where |R|=j (R consists of j contiguous segments), let R</Paragraph>
    <Paragraph position="10"> One particular pair of alignments for s and w induces a set of edits that derive s from w. By only considering the best partitioning of s and w, we can simplify this to:</Paragraph>
    <Paragraph position="12"> We do not yet have a good way to derive P(R  |w), and in running experiments we determined that poorly modeling this distribution gave slightly worse performance than not modeling it at all, so in practice we drop this term.</Paragraph>
  </Section>
  <Section position="5" start_page="3" end_page="3" type="metho">
    <SectionTitle>
3 Training the Model
</SectionTitle>
    <Paragraph position="0"> To train the model, we need a training set consisting of {s</Paragraph>
    <Paragraph position="2"> the correct spelling of the word w</Paragraph>
    <Paragraph position="4"> based on minimizing the edit distance</Paragraph>
    <Paragraph position="6"> , based on single character insertions, deletions and substitutions. For instance, given the training pair &lt;akgsual, actual&gt;, this could be aligned as: a c t u a l a k g s u a l This corresponds to the sequence of edit operations: a&amp;quot;a c&amp;quot;N &amp;quot;g t&amp;quot;s u&amp;quot;u a&amp;quot;a l&amp;quot;l To allow for richer contextual information, we expand each nonmatch substitution to incorporate up to N additional adjacent edits. For example, for the first nonmatch edit in the example above, with N=2, we would generate the following substitutions:</Paragraph>
    <Paragraph position="8"> We would do similarly for the other nonmatch edits, and give each of these substitutions a fractional count.</Paragraph>
    <Paragraph position="9"> We can then calculate the probability</Paragraph>
  </Section>
  <Section position="6" start_page="3" end_page="3" type="metho">
    <SectionTitle>
of each substitution α → β as count(α → β) / count(α). count(α → β) is simply the sum
</SectionTitle>
    <Paragraph position="0"> of the counts derived from our training data as explained above. Estimating FRXQW LV D bit tricky. If we took a text corpus, then extracted all the spelling errors found in the corpus and then used those errors for training, FRXQW ZRXOG VLPSO\ EH WKH number of times VXEVWULQJ RFFXUV LQ WKH text corpus. But if we are training from a set</Paragraph>
    <Paragraph position="2"> } tuples and not given an associated corpus, we can do the following:  (a) From a large collection of representative WH[W FRXQW WKH QXPEHU RI RFFXUUHQFHV RI (b) Adjust the count based on an estimate of  the rate with which people make typing errors.</Paragraph>
    <Paragraph position="3"> Since the rate of errors varies widely and is difficult to measure, we can only crudely approximate it. Fortunately, we have found empirically that the results are not very sensitive to the value chosen. Essentially, we are doing one iteration of the Expectation-Maximization algorithm (Dempster, Laird et al. 1977). The idea is that contexts that are useful will accumulate fractional counts across multiple instances, whereas contexts that are noise will not accumulate significant counts.</Paragraph>
  </Section>
  <Section position="7" start_page="3" end_page="3" type="metho">
    <SectionTitle>
4 Applying the Model
</SectionTitle>
    <Paragraph position="0"> Given a string s, where Ds [?] , we want to return )|()|(argmax w contextwPswP . Our approach will be to return an n-best list of candidates according to the error model, and then rescore these candidates by taking into account the source probabilities.</Paragraph>
    <Paragraph position="1"> We are given a dictionary D and a set of parameters P, where each parameter is</Paragraph>
  </Section>
  <Section position="8" start_page="3" end_page="3" type="metho">
    <SectionTitle>
3 &amp;quot; IRU VRPH *S[?], , meaning the
SUREDELOLW\ WKDW LI D VWULQJ LV LQWHQGHG WKH
QRLV\ FKDQQHO ZLOO SURGXFH LQVWHDG )LUVW
</SectionTitle>
    <Paragraph position="0"> note that for a particular pair of strings {s, w} we can use the standard dynamic programming algorithm for finding edit distance by filling a |s|*|w |weight matrix (Wagner and Fisher 1974; Hall and Dowling 1980), with only minor changes. For computing the Damerau-Levenshtein distance between two strings, this can be done in O(|s|*|w|) time. When we allow generic edit operations, the complexity  ). In filling in a cell (i,j) in the matrix for computing Damerau-Levenshtein distance we need only examine cells (i,j-1), (i-1,j) and (i-1,j-1). With generic edits, we have to examine all cells (a,b) where a [?] i and b [?] j.</Paragraph>
    <Paragraph position="1"> We first precompile the dictionary into a trie, with each node in the trie corresponding to a vector of weights. If we think of the x-axis of the standard weight matrix for computing edit distance as corresponding to w (a word in the dictionary), then the vector at each node in the trie corresponds to a column in the weight matrix associated with computing the distance between s and the string prefix ending at that trie node.</Paragraph>
    <Paragraph position="2"> :H VWRUH WKH &amp;quot; SDUDPHWHUV DV D trie of tries. We have one trie corresponding to DOO VWULQJV WKDW DSSHDU RQ WKH OHIW KDQG VLGH of some substitution in our parameter set. At every node in this trie, corresponding to a</Paragraph>
  </Section>
  <Section position="9" start_page="3" end_page="3" type="metho">
    <SectionTitle>
string α, we point to a trie consisting of all strings β that appear on the right hand side of a substitution α → β in our parameter set with α
</SectionTitle>
    <Paragraph position="0"> on the left hand side. We store the substitution probabilities at the terminal</Paragraph>
  </Section>
  <Section position="10" start_page="3" end_page="93" type="metho">
    <SectionTitle>
nodes of the β tries. By storing both the α and β strings in
</SectionTitle>
    <Paragraph position="0"> reverse order, we can efficiently compute edit distance over the entire dictionary. We process the dictionary trie from the root downwards, filling in the weight vector at each node. To find the substitution parameters that are applicable, given a particular node in the trie and a particular position in the input string s (this corresponds to filling in one cell in one vector of a dictionary trie node) we trace up from the node to the root, while tracing GRZQ WKH trie from the root. As we trace GRZQ WKH trie, if we encounter a terminal node, we follow the pointer to the FRUUHVSRQGLQJ trie, and then trace backwards from the position in s while WUDFLQJ GRZQ WKH trie.</Paragraph>
    <Paragraph position="1"> Note that searching through a static dictionary D is not a requirement of our error model. It is possible that with a different search technique, we could apply our model to languages such as Turkish for which a static dictionary is inappropriate (Oflazer 1994).</Paragraph>
    <Paragraph position="2"> Given a 200,000-word dictionary, and using our best error model, we are able to spell check strings not in the dictionary in approximately 50 milliseconds on average, running on a Dell 610 500mhz Pentium III workstation.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
5 Results
5.1 Error Model in Isolation
</SectionTitle>
      <Paragraph position="0"> We ran experiments using a 10,000word corpus of common English spelling errors, paired with their correct spelling.</Paragraph>
      <Paragraph position="1"> We used 80% of this corpus for training and 20% for evaluation. Our dictionary contained approximately 200,000 entries, including all words in the test set. The results in this section are obtained with a language model that assigns uniform probability to all words in the dictionary. In Table 1 we show K-best results for different maximum context window sizes, without using positional information. For instance, the 2-best accuracy is the percentage of time the correct answer is one of the top two answers returned by the system. Note that a maximum window of zero corresponds to the set of single character insertion, deletion and substitution edits, weighted with their probabilities. We see that, up to a point, additional context provides us with more accurate spelling correction and beyond that, additional context neither helps nor hurts.</Paragraph>
      <Paragraph position="2">  In Table 1, the row labelled CG shows the results when we allow the equivalent set of edit operations to those used in (Church and Gale 1991). This is a proper superset of the set of edits where the maximum window is zero and a proper subset of the edits where the maximum window is one. The CG model is essentially equivalent to the Church and Gale error model, except (a) the models above can posit an arbitrary number of edits and (b) we did not do parameter reestimation (see below).</Paragraph>
      <Paragraph position="3"> Next, we measured how much we gain by conditioning on the position of the edit relative to the source word. These results are shown in Table 2. As we expected, positional information helps more when using a richer edit set than when using only single character edits. For a maximum window size of 0, using positional information gives a 13% relative improvement in 1-best accuracy, whereas for a maximum window size of 4, the gain is 22%. Our full strength model gives a 52% relative error reduction on 1-best accuracy compared to the CG model (95.0% compared to 89.5%).</Paragraph>
      <Paragraph position="4">  We experimented with iteratively reestimating parameters, as was done in the original formulation in (Church and Gale 1991). Doing so resulted in a slight degradation in performance. The data we are using is much cleaner than that used in (Church and Gale 1991) which probably explains why reestimation benefited them in their experiments and did not give any benefit to the error models in our experiments.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="93" type="sub_section">
      <SectionTitle>
5.2 Adding a Language Model
</SectionTitle>
      <Paragraph position="0"> Next, we explore what happens to our results as we add a language model. In order to get errors in context, we took the Brown Corpus and found all occurrences of all words in our test set. Then we mapped these words to the incorrect spellings they were paired with in the test set, and ran our spell checker to correct the misspellings.</Paragraph>
      <Paragraph position="1"> We used two language models. The first assumed all words are equally likely, i.e. the null language model used above. The second used a trigram language model derived from a large collection of on-line text (not including the Brown Corpus).</Paragraph>
      <Paragraph position="2"> Because a spell checker is typically applied right after a word is typed, the language model only used left context.</Paragraph>
      <Paragraph position="3"> We show the results in Figure 1, where we used the error model with positional information and with a maximum context window of four, and used the language model to rescore the 5 best word candidates returned by the error model.</Paragraph>
      <Paragraph position="4"> Note that for the case of no language model, the results are lower than the results quoted above (e.g. a 1-best score above of 95.0%, compared to 93.9% in the graph). This is because the results on the Brown Corpus are computed per token, whereas above we were computing results per type.</Paragraph>
      <Paragraph position="5"> One question we wanted to ask is whether using a good language model would obviate the need for a good error model. In Figure 2, we applied the trigram model to resort the 5-best results of the CG model. We see that while a language model improves results, using the better error model (Figure 1) still gives significantly better results. Using a language model with our best error model gives a 73.6% error reduction compared to using a language model with the CG error model. Rescoring the 20-best output of the CG model instead of the 5-best only improves the 1-best accuracy from 90.9% to 91.0%.</Paragraph>
      <Paragraph position="6">  We have presented a new error model for noisy channel spelling correction based on generic string to string edits, and have demonstrated that it results in a significant improvement in performance compared to previous approaches. Without a language model, our error model gives a 52% reduction in spelling correction error rate compared to the weighted Damerau-Levenshtein distance technique of Church and Gale. With a language model, our model gives a 74% reduction in error.</Paragraph>
      <Paragraph position="7"> One exciting future line of research is to explore error models that adapt to an individual or subpopulation. With a rich set of edits, we hope highly accurate individualized spell checking can soon become a reality.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>