<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1019">
  <Title>Pronunciation Modeling for Improved Spelling Correction</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Brill and Moore Noisy Channel Spelling Correction Model
</SectionTitle>
    <Paragraph position="0"> Many statistical spelling correction methods can be viewed as instances of the noisy channel model. The misspelling of a word is viewed as the result of corruption of the intended word as it passes through a noisy communications channel.</Paragraph>
    <Paragraph position="1"> The task of spelling correction is to find, for a misspelling $w$, a correct word $r \in D$, where $D$ is a given dictionary and $r$ is the most probable word to have been garbled into $w$. Equivalently, the problem is to find a word $r$ for which $$P(r \mid w) = \frac{P(r)\,P(w \mid r)}{P(w)}$$</Paragraph>
    <Paragraph position="3"> is maximized. Since the denominator $P(w)$ does not depend on $r$, this is the same as maximizing $P(r)\,P(w \mid r)$. In the terminology of noisy channel modeling, the distribution $P(r)$ is referred to as the source model, and the distribution $P(w \mid r)$ is the error or channel model.</Paragraph>
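As a minimal sketch of the decision rule above, the following Python fragment picks the dictionary word that maximizes the product of the source and channel probabilities. The function name best_correction, the two lookup tables, and all numbers are invented for illustration and are not taken from the paper.

```python
def best_correction(w, dictionary, source_prob, channel_table):
    """Return the word r in dictionary that maximizes P(r) * P(w | r)."""
    # P(w) is identical for every candidate r, so it is left out of the score.
    def score(r):
        return source_prob.get(r, 0.0) * channel_table.get((w, r), 0.0)
    return max(dictionary, key=score)


# Toy probabilities, invented for illustration only.
source_prob = {"their": 0.6, "there": 0.4}
channel_table = {("thier", "their"): 0.3, ("thier", "there"): 0.01}

print(best_correction("thier", ["their", "there"], source_prob, channel_table))
```

In a real system the source model would come from corpus counts and the channel model from an error model such as the one described in the rest of this section.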
    <Paragraph position="4"> Typically, spelling correction models are not used for identifying misspelled words, only for proposing corrections for words that are not found in a dictionary. Notice, however, that the noisy channel model offers the possibility of correcting misspellings without a dictionary, as long as sufficient data is available to estimate the source model factors. For example, if $r$ = Osama bin Laden and $w$ = Ossama bin Laden, the model will predict that the correct spelling $r$ is more likely than the incorrect spelling $w$, provided that $P(w)/P(r) &lt; P(w \mid r)/P(w \mid w)$, where $P(w \mid r)/P(w \mid w)$ would be approximately the odds of doubling the s in Osama. We do not pursue this here, however.</Paragraph>
    <Paragraph position="5"> Brill and Moore (2000) present an improved error model for noisy channel spelling correction that goes beyond single insertions, deletions, substitutions, and transpositions. The model has a set of parameters $P(\alpha \rightarrow \beta)$ for letter sequences of lengths up to 5. An extension they presented has refined parameters $P(\alpha \rightarrow \beta \mid PSN)$, which also depend on the position of the substitution in the source word.</Paragraph>
    <Paragraph position="6"> According to this model, the misspelling is generated from the correct word as follows: first, a person picks a partition of the correct word and then types each part of the partition independently, possibly making some errors. The probability of generating the misspelling is then the product of the substitution probabilities for each of the parts in the partition.</Paragraph>
    <Paragraph position="7"> For example, if a person chooses to type the word bouncy and picks the partition boun cy, the probability that she mistypes this word as boun cie will be $P(\text{boun} \rightarrow \text{boun})\,P(\text{cie} \rightarrow \text{cy})$. The probability $P(w \mid r)$ is estimated as the maximum over all partitions of $r$ of the probability that $w$ is generated from $r$ given that partition.</Paragraph>
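The maximization over partitions can be sketched with a small dynamic program. This is an illustrative reading of the description above, not the authors' implementation: the function name channel_prob, the table layout (keys are pairs of a part of the correct word and the corresponding part of the misspelling), the choice to allow an empty typed part but not an empty correct part, and all numbers are assumptions.

```python
from functools import lru_cache

MAX_LEN = 5  # longest substring considered, following the "up to 5" in the text


def channel_prob(correct, typed, sub_prob, max_len=MAX_LEN):
    """Best product of substitution probabilities over aligned partitions."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best probability of producing typed[:j] from correct[:i].
        if i == 0:
            return 1.0 if j == 0 else 0.0
        result = 0.0
        for a in range(1, min(max_len, i) + 1):      # last part of the correct word
            correct_part = correct[i - a:i]
            for b in range(0, min(max_len, j) + 1):  # matching typed part, may be empty
                typed_part = typed[j - b:j]
                p = sub_prob.get((correct_part, typed_part), 0.0)
                if p:
                    result = max(result, best(i - a, j - b) * p)
        return result

    return best(len(correct), len(typed))


# Toy substitution table, invented for illustration only.
toy = {
    ("boun", "boun"): 0.9,
    ("cy", "cie"): 0.1,
    ("c", "c"): 0.95,
    ("y", "ie"): 0.02,
}

print(channel_prob("bouncy", "bouncie", toy))
```

With this toy table, the partition boun cy scores higher than boun c y, so the first one determines the returned value. Pairs missing from the table are treated as having probability zero, which a trained model would replace with smoothed estimates.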
    <Paragraph position="8"> We use this method to build an error model for letter strings and a separate error model for phone sequences. Two models are learned: one model, LTR (standing for "letter"), has a set of substitution probabilities $P(\alpha \rightarrow \beta)$ where $\alpha$ and $\beta$ are character strings, and another model, PH (for "phone"), has a set of substitution probabilities $P(\alpha \rightarrow \beta)$ where $\alpha$ and $\beta$ are phone sequences.</Paragraph>
    <Paragraph position="9"> We learn these two models on the same data set of misspellings and correct words. For LTR, we use the training data as is and run the Brill and Moore training algorithm over it to learn the parameters of LTR. For PH, we convert the misspelling/correct-word pairs into pairs of pronunciations of the misspelling and the correct word, and run the Brill and Moore training algorithm over that.</Paragraph>
    <Paragraph position="10"> For PH, we need word pronunciations for the correct words and the misspellings. As the misspellings are certainly not in the dictionary, we need a letter-to-phone converter that generates possible pronunciations for them. The next section describes our letter-to-phone model.</Paragraph>
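The data preparation for PH can be sketched as follows, assuming a pronunciation lexicon for correct words and some letter-to-phone function for misspellings. The names to_phone_pairs, lexicon, and naive_letter_to_phone are hypothetical stand-ins, the phone symbols are invented, and the sketch keeps a single pronunciation per word even though a converter may propose several.

```python
def to_phone_pairs(pairs, lexicon, letter_to_phone):
    """Map (misspelling, correct word) pairs to pairs of phone sequences."""
    phone_pairs = []
    for misspelling, correct in pairs:
        correct_pron = lexicon[correct]                 # correct words are in the dictionary
        misspelled_pron = letter_to_phone(misspelling)  # misspellings are not
        phone_pairs.append((misspelled_pron, correct_pron))
    return phone_pairs


def naive_letter_to_phone(word):
    # Placeholder converter: treats every letter as its own "phone".
    return tuple(word)


# Hypothetical one-word lexicon with made-up phone symbols.
lexicon = {"bouncy": ("b", "aw", "n", "s", "iy")}

print(to_phone_pairs([("bouncie", "bouncy")], lexicon, naive_letter_to_phone))
```

The resulting pairs of phone sequences play the same role for PH that the original letter-string pairs play for LTR.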
  </Section>
</Paper>