<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1019"> <Title>Pronunciation Modeling for Improved Spelling Correction</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Letter-to-Phone Model </SectionTitle> <Paragraph position="0"> There has been a lot of research on machine learning methods for letter-to-phone conversion. High accuracy is achieved, for example, by using neural networks (Sejnowski and Rosenberg, 1987), decision trees (Jiang et al., 1997), and N-grams (Fisher, 1999). We use a modified version of the method proposed by Fisher, incorporating several extensions that result in substantial gains in performance. In this section we first describe how we do alignment at the phone level, then describe Fisher's model, and finally present our extensions and the resulting letter-to-phone conversion accuracy.</Paragraph> <Paragraph position="1"> The machine learning algorithms for converting text to phones usually start off with training data in the form of a set of examples, consisting of letters in context and their corresponding phones (classifications). Pronunciation dictionaries are the major source of training data for these algorithms, but they do not contain information for correspondences between letters and phones directly; they have correspondences between sequences of letters and sequences of phones.</Paragraph> <Paragraph position="2"> A first step before running a machine learning algorithm on a dictionary is, therefore, alignment between individual letters and phones. The alignment algorithm is dependent on the phone set used.</Paragraph> <Paragraph position="3"> We experimented with two dictionaries, the NETtalk dataset and the Microsoft Speech dictionary. Statistics about them and how we split them into training and test sets are shown in Table 1. The NETtalk dataset contains information for phone level alignment and we used it to test our algorithm for automatic alignment.
The Microsoft Speech dictionary is not aligned at the phone level, but it is much bigger and is the dictionary we used for learning our final letter-to-phone model.</Paragraph> <Paragraph position="4"> The NETtalk dictionary has been designed so that each letter corresponds to at most one phone, so a word is always longer than, or of the same length as, its pronunciation. The alignment algorithm has to decide which of the letters correspond to phones and which ones correspond to nothing (i.e., are silent).</Paragraph> <Paragraph position="5"> For example, the entry in NETtalk (when we remove the empties, which contain information for phone level alignment) for the word able is ABLE ebL.</Paragraph> <Paragraph position="6"> The correct alignment is A/e B/b L/L E/-, where - denotes the empty phone. In the Microsoft Speech dictionary, on the other hand, each letter can naturally correspond to 0, 1, or 2 phones. For example, the entry in that dictionary for able is ABLE ey b ax l. The correct alignment is A/ey B/b L/ax&l E/-. If we also allowed two letters as a group to correspond to two phones as a group, the correct alignment might be A/ey B/b LE/ax&l, but that would make it harder for the machine learning algorithm.</Paragraph> <Paragraph position="7"> Our alignment algorithm is an implementation of hard EM (Viterbi training) that starts off with heuristically estimated initial parameters for P(phones|letter) and, at each iteration, finds the most likely alignment for each word given the parameters and then re-estimates the parameters collecting counts from the obtained alignments. Here phones ranges over sequences of 0 (empty), 1, or 2 phones for the Microsoft Speech dictionary and 0 or 1 phones for NETtalk. The parameters P(phones|letter) were initialized by a method similar to the one proposed in (Daelemans and van den Bosch, 1996).
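The hard-EM (Viterbi training) loop just described can be sketched as follows. This is a minimal illustration under assumed data structures (a probability table keyed by (letter, phone-tuple) pairs, with a small 1e-6 floor for unseen pairs), not the implementation used in the paper:

```python
from collections import defaultdict
import math

def viterbi_align(word, phones, prob):
    # DP over (letter index, phone index); each letter emits 0, 1, or 2 phones.
    # prob[(letter, phone_tuple)] approximates P(phones | letter).
    n, m = len(word), len(phones)
    NEG = float("-inf")
    best = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(m + 1):
            for k in (0, 1, 2):          # number of phones emitted by letter i-1
                if j - k < 0 or best[i - 1][j - k] == NEG:
                    continue
                emit = tuple(phones[j - k:j])
                score = best[i - 1][j - k] + math.log(prob.get((word[i - 1], emit), 1e-6))
                if score > best[i][j]:
                    best[i][j] = score
                    back[i][j] = k
    # Recover the most likely alignment by walking the backpointers.
    align, j = [], m
    for i in range(n, 0, -1):
        k = back[i][j]
        align.append((word[i - 1], tuple(phones[j - k:j])))
        j -= k
    return list(reversed(align))

def hard_em(pairs, iters=3):
    # Alternate Viterbi alignment with count-based re-estimation (hard EM).
    prob = defaultdict(lambda: 1e-6)
    for _ in range(iters):
        counts, totals = defaultdict(float), defaultdict(float)
        for word, phones in pairs:
            for letter, emit in viterbi_align(word, phones, prob):
                counts[(letter, emit)] += 1.0
                totals[letter] += 1.0
        prob = {key: c / totals[key[0]] for key, c in counts.items()}
    return prob
```

Allowing each letter to emit 0, 1, or 2 phones matches the Microsoft Speech dictionary setting; restricting `k` to `(0, 1)` gives the NETtalk setting.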
Word frequencies were not taken into consideration here, as the dictionary contains no frequency information.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Initial Letter-to-Phone Model </SectionTitle> <Paragraph position="0"> The method we started with was the N-gram model of Fisher (1999). From training data, it learns rules that predict the pronunciation of a letter based on m letters of left and n letters of right context. The rules are of the following form: L_m.T.R_n → phone probabilities. Here L_m stands for a sequence of m letters to the left of the target letter T and R_n is a sequence of n letters to the right. The number of letters in the context to the left and right varies. We used from 0 to 4 letters on each side. For example, two rules learned for the letter B were: [AB.B.OT → - 1.0] and [B → b .96, - .04], meaning that in the first context the letter B is silent with probability 1.0, and in the second it is pronounced as b with probability .96 and is silent with probability .04.</Paragraph> <Paragraph position="1"> Training this model consists of collecting counts for the contexts that appear in the data with the selected window size to the left and right. We collected counts for all configurations L_m.T.R_n for m ∈ {0,1,2,3,4} and n ∈ {0,1,2,3,4} that occurred in the data. The model is applied by choosing for each letter T the most probable translation as predicted by the most specific rule for the context of occurrence of the letter.
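A sketch of how such context rules could be collected and applied, choosing the most specific matching rule (widest total context, ties broken in favor of right context). The '#' padding symbol and the dictionary-of-dictionaries representation are our assumptions, not details from the paper:

```python
from collections import defaultdict

def train_rules(aligned_words, max_ctx=4):
    # counts[(left, letter, right)][phone] over all context widths 0..max_ctx.
    counts = defaultdict(lambda: defaultdict(int))
    for letters, phones in aligned_words:       # parallel sequences, '-' = silent
        padded = "#" * max_ctx + letters + "#" * max_ctx
        for i, phone in enumerate(phones):
            pos = i + max_ctx
            for m in range(max_ctx + 1):
                for n in range(max_ctx + 1):
                    ctx = (padded[pos - m:pos], padded[pos], padded[pos + 1:pos + 1 + n])
                    counts[ctx][phone] += 1
    # Normalize counts into per-context phone distributions.
    rules = {}
    for ctx, dist in counts.items():
        total = sum(dist.values())
        rules[ctx] = {ph: c / total for ph, c in dist.items()}
    return rules

def predict(rules, word, i, max_ctx=4):
    # Most specific matching rule: widest total context wins; ties prefer
    # right context (one possible heuristic for rule specificity).
    padded = "#" * max_ctx + word + "#" * max_ctx
    pos = i + max_ctx
    best = None
    for m in range(max_ctx + 1):
        for n in range(max_ctx + 1):
            ctx = (padded[pos - m:pos], padded[pos], padded[pos + 1:pos + 1 + n])
            if ctx in rules:
                key = (m + n, n)            # specificity, then right context
                if best is None or key > best[0]:
                    best = (key, rules[ctx])
    return max(best[1], key=best[1].get) if best else None
```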
For example, if we want to find how to pronounce the second b in abbot, we would choose the empty phone because the first rule mentioned above is more specific than the second.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Extensions </SectionTitle> <Paragraph position="0"> We implemented five extensions to the initial model which together decreased the error rate of the letter-to-phone model by around 20%. These are: • Combination of the predictions of several applicable rules by linear interpolation • Rescoring of N-best proposed pronunciations for a word using a trigram phone sequence language model • Explicit distinction between middle of word versus start or end • Rescoring of N-best proposed pronunciations for a word using a fourgram vowel sequence language model The performance figures reported by Fisher (1999) are significantly higher than our figures using the basic model, which is probably due to the cleaner data used in their experiments and the differences in phone set size.</Paragraph> <Paragraph position="1"> The extensions we implemented are inspired largely by the work on letter-to-phone conversion using decision trees (Jiang et al., 1997). The last extension, rescoring based on vowel fourgrams, has not been proposed previously. We tested the algorithms on the NETtalk and Microsoft Speech dictionaries, splitting them into training and test sets in proportion 80%/20% training-set to test-set size. We trained the letter-to-phone models using the training splits and tested on the test splits. We report accuracy figures only on the NETtalk dataset, since this dataset has been used extensively in building letter-to-phone models, and because phone accuracy is hard to determine for the non-phonetically-aligned Microsoft Speech dictionary.
For our spelling correction algorithm we use a letter-to-phone model learned from the Microsoft Speech dictionary, however.</Paragraph> <Paragraph position="2"> The results for phone accuracy and word accuracy of the initial model and extensions are shown in Table 2. The phone accuracy is the percentage correct of all phones proposed (excluding the empties) and the word accuracy is the percentage of words for which pronunciations were guessed without any error.</Paragraph> <Paragraph position="3"> For our data we noticed that the most specific rule that matches is often not a sufficiently good predictor. By linearly interpolating the probabilities given by the five most specific matching rules we decreased the word error rate by 14.3%. The weights for the individual rules in the top five were set to be equal. It seems reasonable to combine the predictions from several rules, especially because the choice of which of two rules is more specific is arbitrary when neither context is a substring of the other. For example, of the two rules with contexts A.B. and .B.B, where the first has zero right context and the second has zero left letter context, one heuristic is to choose the latter as more specific since right context seems more valuable than left (Fisher, 1999). However, this choice may not always be the best, and it proves useful to combine predictions from several rules. In Table 2 the row labeled &quot;Interpolation of contexts&quot; refers to this extension of the basic model. Adding a symbol for the interior of a word produced a gain in accuracy. Prior to adding this feature, we had features for the beginning and end of a word. Explicitly modeling the interior proved helpful and further decreased our error rate by 4.3%. The results after this improvement are shown in the third row of Table 2.</Paragraph> <Paragraph position="4"> After linearly combining the predictions from the top matching rules we have a probability distribution over phones for each letter.
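The equal-weight interpolation of the five most specific matching rules can be sketched as follows (rules are assumed stored as a map from (left, letter, right) contexts to phone distributions, a straightforward representation rather than the authors' code):

```python
def interpolated_distribution(rules, word, i, max_ctx=4, top_k=5):
    # Average the phone distributions of the top_k most specific matching
    # rules with equal weights; specificity is total context width, with
    # ties broken in favor of right context.
    padded = "#" * max_ctx + word + "#" * max_ctx
    pos = i + max_ctx
    matches = []
    for m in range(max_ctx + 1):
        for n in range(max_ctx + 1):
            ctx = (padded[pos - m:pos], padded[pos], padded[pos + 1:pos + 1 + n])
            if ctx in rules:
                matches.append((m + n, n, rules[ctx]))
    if not matches:
        return {}
    matches.sort(key=lambda t: (t[0], t[1]), reverse=True)
    top = [dist for _, _, dist in matches[:top_k]]
    combined = {}
    for dist in top:
        for phone, p in dist.items():
            combined[phone] = combined.get(phone, 0.0) + p / len(top)
    return combined
```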
It has been shown that modeling the probability of sequences of phones can greatly reduce the error (Jiang et al., 1997). We learned a trigram phone sequence model and used it to re-score the N-best predictions from the basic model. We computed the score for a sequence of phones p_1...p_k given a sequence of letters as follows: Score(p_1...p_k | letters) = Σ_i log P(p_i | letters) + α log P_tri(p_1...p_k) (1), where the probabilities P(p_i | letters) come from the distributions over phones that we obtain for each letter from the combination of the matching rules, and P_tri is the trigram phone sequence model. The weight α for the phone sequence model was estimated from a held-out set by a linear search. This model further improved our performance and the results it achieves are in the fourth row of Table 2. The final improvement is adding a term from a vowel fourgram language model to equation 1 with a weight β. The term is the log probability of the sequence of vowels in the word according to a four-gram model over vowel sequences learned from the data. The final accuracy we achieve is shown in the fifth row of the same table. As a comparison, the best accuracy achieved by Jiang et al. (1997) on NETtalk using a similar proportion of training and test set sizes was 65.8%. Their system uses more sources of information, such as phones in the left context as features in the decision tree. They also achieve a large performance gain by combining multiple decision trees trained on separate portions of the training data. The accuracy of our letter-to-phone model is comparable to state-of-the-art systems.
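The N-best rescoring described above can be sketched as follows; the language models are passed in as callables returning log probabilities, and the vowel inventory shown is a placeholder rather than the paper's phone set:

```python
# A small set of vowel phones (an assumed inventory, for illustration only).
VOWELS = {"aa", "ae", "ah", "ao", "ax", "eh", "ey", "ih", "iy", "ow", "uw"}

def rescore_nbest(candidates, phone_lm, vowel_lm, alpha, beta):
    # candidates: list of (phone_sequence, base_log_prob) pairs, where the
    # base score sums per-letter log probabilities from the rule model.
    # phone_lm and vowel_lm return log probabilities of a sequence; the
    # weights alpha and beta would be tuned on held-out data by line search.
    def score(phones, base_log_prob):
        vowel_seq = [p for p in phones if p in VOWELS]
        return (base_log_prob
                + alpha * phone_lm(phones)
                + beta * vowel_lm(vowel_seq))
    return max(candidates, key=lambda c: score(*c))[0]
```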
Further improvements in this component may lead to higher spelling correction accuracy.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Combining Pronunciation and </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Letter-Based Models </SectionTitle> <Paragraph position="0"> Our combined error model gives the probability P_CMB(w|r),</Paragraph> <Paragraph position="2"> where w is the misspelling and r is a word in the dictionary. The spelling correction algorithm selects for a misspelling w the word r in the dictionary for which the product P(r)P_CMB(w|r)</Paragraph> <Paragraph position="4"> is maximized. In our experiments we used a uniform source language model over the words in the dictionary. Therefore our spelling correction algorithm selects the word r that maximizes P_CMB(w|r). Brill and Moore (2000) showed that adding a source language model increases the accuracy significantly. They also showed that the addition of a language model does not obviate the need for a good error model and that improvements in the error model lead to significant improvements in the full noisy channel model.</Paragraph> <Paragraph position="5"> We build two separate error models, LTR and PH (standing for &quot;letter&quot; model and &quot;phone&quot; model). The letter-based model estimates a probability P_LTR(w|r), and the phone-based model estimates a probability P_PH(pron_w|pron_r), in a way to be made precise shortly. We combine the two models to estimate scores as follows: after several simplifying assumptions, the combined score reduces to the expression shown in Figure 1. The probabilities P(pron_r|r) are taken to be equal for all possible pronunciations of r in the dictionary. Next we assume independence of the misspelling from the right word given the pronunciation of the right word, i.e. P(w|r, pron_r) = P(w|pron_r).
By inversion of the conditional probability this is equal to P(pron_r|w) multiplied by P(w)/P(pron_r). Since we do not model these marginal probabilities, we drop the latter factor. Next, the probability P(pron_r|w) is expressed as</Paragraph> <Paragraph position="7"> a sum over pronunciations of the misspelling, P(pron_r|w) = Σ_pron_w P(pron_r, pron_w|w), which is approximated by the maximum term in the sum. After the following decomposition: P(pron_r, pron_w|w) = P(pron_w|w)P(pron_r|pron_w, w) ≈ P(pron_w|w)P(pron_r|pron_w), where the second part represents a final independence assumption, we get the expression in Figure 1. The probabilities P(pron_w|w) are given by the letter-to-phone model. In the following subsections, we first describe how we train and apply the individual error models, and then we show performance results for the combined model compared to the letter-based error model.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Training Individual Error Models </SectionTitle> <Paragraph position="0"> The error model LTR was trained exactly as described originally by Brill and Moore (2000). Given a training set of pairs {w_i, r_i} of misspellings and correct words, as for the LTR model, we convert this set to a set of pronunciations of misspellings and pronunciations of correct words in the following way: for each training pair we generate m_i training samples of corresponding pronunciations, where m_i is the number of pronunciations of the correct word r_i in our dictionary. Each of those m_i samples is the most probable pronunciation of w_i according to our letter-to-phone model, paired with one of the possible pronunciations of r_i. Using this training set, we run the algorithm of Brill and Moore to estimate a set of substitution probabilities α → β for sequences of phones to sequences of phones. The probability P_PH(pron_w|pron_r) is then computed as a product of the substitution probabilities in the most probable alignment, as Brill and Moore did.</Paragraph> </Section> </Section> </Paper>