File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/p96-1010_metho.xml
Size: 21,536 bytes
Last Modified: 2025-10-06 14:14:19
<?xml version="1.0" standalone="yes"?> <Paper uid="P96-1010"> <Title>Combining Trigram-based and Feature-based Methods for Context-Sensitive Spelling Correction</Title> <Section position="4" start_page="72" end_page="72" type="metho"> <SectionTitle> 3 Baseline </SectionTitle> <Paragraph position="0"> As an indicator of the difficulty of the task, we compared each of the methods to the method which ignores the context in which the word occurred, and just guesses based on the priors.</Paragraph> <Paragraph position="1"> Table 1 shows the performance of the baseline method for the 18 confusion sets.</Paragraph> </Section> <Section position="5" start_page="72" end_page="72" type="metho"> <SectionTitle> 4 Trigrams </SectionTitle> <Paragraph position="0"> Mays, Damerau, and Mercer (1991) proposed a word-trigram method for context-sensitive spelling correction based on the noisy channel model. Since this method is based on word trigrams, it requires an enormous training corpus to fit all of these parameters accurately; in addition, at run time it requires extensive system resources to store and manipulate the resulting huge word-trigram table.</Paragraph> <Paragraph position="1"> In contrast, the method proposed here uses part-of-speech trigrams. Given a target occurrence of a word to correct, it substitutes in turn each word in the confusion set into the sentence. Por each substitution, it calculates the probability of the resulting sentence. It selects as its answer the word that gives the highest probability.</Paragraph> <Paragraph position="2"> More precisely, assume that the word wh occurs in a sentence W = wl...Wk...wn, and that w~ is a word we are considering substituting for it, yielding sentence W I. Word w~ is then preferred over wk iff P(W') > P(W), where P(W) and P(W') are the probabilities of sentences W and W f respectively. 1 We calculate P(W) using the tag sequence of W as an intermediate quantity, and summing, over all possible tag sequences, the probability of the sentence with that tagging; that is:</Paragraph> <Paragraph position="4"> where T is a tag sequence for sentence W.</Paragraph> <Paragraph position="5"> The above probabilities are estimated as is traditionally done in trigram-based part-of-speech tagging (Church, 1988; DeRose, 1988):</Paragraph> <Paragraph position="7"> where T = tl ...tn, and P(ti\]tl-2ti-1) is the prob ability of seeing a part-of-speech tag tl given the two preceding part-of-speech tags ti-2 and ti-1. Equations 1 and 2 will also be used to tag sentences W and W ~ with their most likely part-of-speech sequences. This will allow us to determine the tag that we actually compare the per-word geometric mean of the sentence probabilities. Otherwise, the shorter sequence will usually be preferred, as shorter sequences tend to have higher probabilities than longer ones.</Paragraph> <Paragraph position="8"> would be assigned to each word in the confusion set when substituted into the target sentence.</Paragraph> <Paragraph position="9"> Table 2 gives the results of the trigram method (as well as the Bayesian method of the next section) for the 18 confusion sets. 2 The results are broken down into two cases: &quot;Different tags&quot; and &quot;Same tags&quot;. A target occurrence is put in the latter iff all words in the confusion set would have the same tag when substituted into the target sentence. 
<Paragraph position="10"> Table 2 gives the results of the trigram method (as well as the Bayesian method of the next section) for the 18 confusion sets. 2 The results are broken down into two cases: &quot;Different tags&quot; and &quot;Same tags&quot;. A target occurrence is put in the latter iff all words in the confusion set would have the same tag when substituted into the target sentence. In the &quot;Different tags&quot; condition, Trigrams generally does well, outscoring Bayes for all but 3 confusion sets -- and in each of these cases, making no more than 3 errors more than Bayes.</Paragraph>
<Paragraph position="11"> In the &quot;Same tags&quot; condition, however, Trigrams performs only as well as Baseline. This follows from Equations 1 and 2: when comparing P(W) and P(W'), the dominant term corresponds to the most likely tagging; and in this term, if the target word w_k and its substitute w'_k have the same tag t, then the comparison amounts to comparing P(w_k | t) and P(w'_k | t). In other words, the decision reduces to which of the two words, w_k and w'_k, is the more common representative of part-of-speech class t. 3</Paragraph>
</Section>
<Section position="6" start_page="72" end_page="74" type="metho"> <SectionTitle> 5 Bayes </SectionTitle>
<Paragraph position="0"> The previous section showed that the part-of-speech trigram method works well when the words in the confusion set have different parts of speech, but essentially cannot distinguish among the words if they have the same part of speech. In this case, a more effective approach is to learn features that characterize the different contexts in which each word tends to occur. A number of feature-based methods have been proposed, including Bayesian classifiers (Gale, Church, and Yarowsky, 1993), decision lists (Yarowsky, 1994), Bayesian hybrids (Golding, 1995), and, more recently, a method based on the Winnow multiplicative weight-updating algorithm (Golding and Roth, 1996). We adopt the Bayesian hybrid method, which we will call Bayes, having experimented with each of the methods and found Bayes to be among the best-performing for the task at hand.</Paragraph>
<Paragraph position="1"> This method has been described elsewhere (Golding, 1995) and so will only be briefly reviewed here; however, the version used here uses an improved smoothing technique, which is mentioned briefly below.</Paragraph>
<Paragraph position="2"> 2 In the experiments reported here, the trigram method was run using the tag inventory derived from the Brown corpus, except that a handful of common function words were tagged as themselves, namely: except, than, then, to, too, and whether.</Paragraph>
<Paragraph position="3"> 3 In a few cases, however, Trigrams does not get exactly the same score as Baseline. This can happen when the words in the confusion set have more than one tag in common; e.g., for {affect, effect}, the words can both be nouns or verbs. Trigrams may then choose differently when the words are tagged as nouns versus verbs, whereas Baseline makes the same choice in all cases.</Paragraph>
<Paragraph position="4"> Table 2: System scores are given as percentages of correct predictions. The results are broken down by whether or not all words in the confusion set would have the same tagging when substituted into the target sentence. The &quot;Breakdown&quot; columns show the percentage of examples that fall under each condition.</Paragraph>
<Paragraph position="5"> Bayes uses two types of features: context words and collocations. Context-word features test for the presence of a particular word within ±k words of the target word; collocations test for a pattern of up to ℓ contiguous words and/or part-of-speech tags around the target word. Examples for the confusion set {dairy, diary} include:
(2) milk within ±10 words
(3) in POSS-DET
where (2) is a context-word feature that tends to imply dairy, while (3) is a collocation implying diary. Feature (3) includes the tag POSS-DET for possessive determiners (his, her, etc.), and matches, for example, the sequence in his 4 in:
(4) He made an entry in his diary.</Paragraph>
<Paragraph position="6"> 4 A tag is taken to match a word in the sentence iff the tag is a member of the word's set of possible part-of-speech tags. Tag sets are used, rather than actual tags, because it is in general impossible to tag the sentence uniquely at spelling-correction time, as the identity of the target word has not yet been established.</Paragraph>
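<Paragraph> A minimal sketch (not the authors' code) of how these two kinds of features can be matched against a target occurrence follows. For brevity it only handles collocations over the words immediately preceding the target; the hypothetical possible_tags function stands in for a lookup of a word's set of possible part-of-speech tags, as described in footnote 4.

    def context_word_matches(feature_word, sentence, target_index, k=10):
        # Context-word feature: does feature_word occur within +-k words of the target?
        window = (sentence[max(0, target_index - k):target_index]
                  + sentence[target_index + 1:target_index + 1 + k])
        return feature_word in window

    def collocation_matches(pattern, sentence, target_index, possible_tags):
        # Collocation feature: do the words just before the target match a pattern of
        # literal words and/or tags?  A tag element matches a word iff it belongs to the
        # word's tag set (the sentence is not disambiguated at correction time).
        start = target_index - len(pattern)
        if start < 0:
            return False
        for element, word in zip(pattern, sentence[start:target_index]):
            if element.isupper():                      # convention: upper-case elements are tags
                if element not in possible_tags(word):
                    return False
            elif element != word.lower():
                return False
        return True

    # Feature (3) above, matched against example (4):
    sent = "he made an entry in his diary".split()
    poss_det = lambda w: {"POSS-DET"} if w in {"his", "her", "its", "my", "your", "their", "our"} else set()
    assert collocation_matches(["in", "POSS-DET"], sent, sent.index("diary"), poss_det)
</Paragraph>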
<Paragraph position="7"> Bayes learns these features from a training corpus of correct text. Each time a word in the confusion set occurs in the corpus, Bayes proposes every feature that matches the context -- one context-word feature for every distinct word within ±k words of the target word, and one collocation for every way of expressing a pattern of up to ℓ contiguous elements. After working through the whole training corpus, Bayes collects and returns the set of features proposed. Pruning criteria may be applied at this point to eliminate features that are based on insufficient data, or that are ineffective at discriminating among the words in the confusion set.</Paragraph>
<Paragraph position="8"> At run time, Bayes uses the features learned during training to correct the spelling of target words. Let F be the set of features that match a particular target occurrence. Suppose for a moment that we were applying a naive Bayesian approach. We would then calculate the probability that each word w_i in the confusion set is the correct identity of the target word, given that we have observed features F, using Bayes' rule with the independence assumption:</Paragraph>
<Paragraph position="9"> P(w_i | F) = \prod_{f \in F} P(f | w_i) P(w_i) / P(F)</Paragraph>
<Paragraph position="10"> where each probability on the right-hand side is calculated by a maximum-likelihood estimate (MLE) over the training set. We would then pick as our answer the w_i with the highest P(w_i | F). The method presented here differs from the naive approach in two respects: first, it does not assume independence among features, but rather has heuristics for detecting strong dependencies, and resolving them by deleting features until it is left with a reduced set F' of (relatively) independent features, which are then used in place of F in the formula above. Second, to estimate the P(f | w_i) terms, rather than using a simple MLE, it performs smoothing by interpolating between the MLE of P(f | w_i) and the MLE of the unigram probability, P(f). These enhancements greatly improve the performance of Bayes over the naive Bayesian approach.</Paragraph>
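<Paragraph> As a compact sketch of the run-time scoring step (with the dependency-detection heuristics omitted, an unspecified interpolation weight lam standing in for the smoothing just described, and prior, cond_prob, and unigram_prob as hypothetical MLE lookups built during training):

    from math import prod  # Python 3.8+

    def bayes_score(word, matched_features, prior, cond_prob, unigram_prob, lam=0.5):
        # Unnormalized P(w | F): prior times smoothed P(f | w) for each matched feature,
        # where each P(f | w) is interpolated with the unigram probability P(f).
        smoothed = (lam * cond_prob(f, word) + (1 - lam) * unigram_prob(f)
                    for f in matched_features)
        return prior(word) * prod(smoothed)

    def bayes_correct(confusion_set, matched_features, prior, cond_prob, unigram_prob):
        # The shared denominator P(F) is dropped, since it does not affect the argmax.
        return max(confusion_set,
                   key=lambda w: bayes_score(w, matched_features, prior, cond_prob, unigram_prob))
</Paragraph>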
<Paragraph position="11"> The results of Bayes are shown in Table 2. 5 Generally speaking, Bayes does worse than Trigrams when the words in the confusion set have different parts of speech. The reason is that, in such cases, the predominant distinction to be made among the words is syntactic; and the trigram method, which brings to bear part-of-speech knowledge for the whole sentence, is better equipped to make this distinction than Bayes, which only tests up to two syntactic elements in its collocations. Moreover, Bayes' use of context-word features is arguably misguided here, as context words pick up differences in topic and tense, which are irrelevant here, and in fact tend to degrade performance by detecting spurious differences. In a few cases, such as {begin, being}, this effect is enough to drive Bayes slightly below Baseline. 6</Paragraph>
<Paragraph position="12"> For the condition where the words have the same part of speech, Table 2 shows that Bayes almost always does better than Trigrams. This is because, as discussed above, Trigrams is essentially acting like Baseline in this condition. Bayes, on the other hand, learns features that allow it to discriminate among the particular words at issue, regardless of their part of speech. The one exception is {country, county}, for which Bayes scores somewhat below Baseline. This is another case in which context words actually hurt Bayes, as running it without context words again improved its performance to the Baseline level.</Paragraph>
<Paragraph position="13"> 5 For the experiments reported here, Bayes was configured as follows: k (the half-width of the window of context words) was set to 10; ℓ (the maximum length of a collocation) was set to 2; feature strength was measured using the reliability metric; pruning of collocations at training time was enabled; and pruning of context words was minimal -- context words were pruned only if they had fewer than 2 occurrences or non-occurrences.</Paragraph>
<Paragraph position="14"> 6 We confirmed this by running Bayes without context words (i.e., with collocations only). Its performance was then always at or above Baseline.</Paragraph>
</Section>
<Section position="7" start_page="74" end_page="75" type="metho"> <SectionTitle> 6 Tribayes </SectionTitle>
<Paragraph position="0"> The previous sections demonstrated the complementarity between Trigrams and Bayes: Trigrams works best when the words in the confusion set do not all have the same part of speech, while Bayes works best when they do. This complementarity leads directly to a hybrid method, Tribayes, that gets the best of each. It applies Trigrams first; in the process, it ascertains whether all the words in the confusion set would have the same tag when substituted into the target sentence. If they do not, it accepts the answer provided by Trigrams; if they do, it applies Bayes.</Paragraph>
<Paragraph position="1"> Two points about the application of Bayes in the hybrid method: first, Bayes is now being asked to distinguish among words only when they have the same part of speech. It should be trained accordingly -- that is, only on examples where the words have the same part of speech. The Bayes component of the hybrid will therefore be trained on a subset of the examples that would be used for training the stand-alone version of Bayes.</Paragraph>
<Paragraph position="2"> The second point about Bayes is that, like Trigrams, it sometimes makes uninformed decisions -- decisions based only on the priors. For Bayes, this happens when none of its features matches the target occurrence. Since, for now, we do not have a good &quot;third-string&quot; algorithm to call when both Trigrams and Bayes fall by the wayside, we content ourselves with the guess made by Bayes in such situations.</Paragraph>
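<Paragraph> A minimal sketch of the hybrid decision (assuming a trigram_tagger that returns the most likely tag sequence under Equations 1 and 2, plus the trigram_correct and bayes_correct routines sketched earlier, with their extra parameters elided):

    def tribayes_correct(sentence, k, confusion_set, trigram_tagger, trigram_correct, bayes_correct):
        # Tag each candidate sentence and collect the tag assigned at the target position.
        tags = {trigram_tagger(sentence[:k] + [w] + sentence[k + 1:])[k] for w in confusion_set}
        if len(tags) > 1:            # candidates differ in part of speech: trust Trigrams
            return trigram_correct(sentence, k, confusion_set)
        return bayes_correct(sentence, k, confusion_set)
</Paragraph>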
<Paragraph position="3"> Table 3 shows the performance of Tribayes compared to its components. In the &quot;Different tags&quot; condition, Tribayes invokes Trigrams, and thus scores identically. In the &quot;Same tags&quot; condition, Tribayes invokes Bayes. It does not necessarily score the same, however, because, as mentioned above, it is trained on a subset of the examples that stand-alone Bayes is trained on. This can lead to higher or lower performance -- higher because the training examples are more homogeneous (representing only cases where the words have the same part of speech); lower because there may not be enough training examples to learn from. Both effects show up in Table 3.</Paragraph>
<Paragraph> Table 3: System scores of Trigrams (T), Bayes (B), and Tribayes (TB), given as percentages of correct predictions. The results are broken down by whether or not all words in the confusion set would have the same tagging when substituted into the target sentence. The &quot;Breakdown&quot; columns give the percentage of examples under each condition.

    Confusion set          |  Different tags           |  Same tags
                           |  Breakdown    T     TB    |  Breakdown    B     TB
    their, there, they're  |  100         97.6  97.6   |    0
    than, then             |  100         94.9  94.9   |    0
    its, it's              |  100         98.1  98.1   |    0
    your, you're           |  100         98.9  98.9   |    0
    begin, being           |  100         97.3  97.3   |    0
    passed, past           |  100         95.9  95.9   |    0
    quiet, quite           |  100         95.5  95.5   |    0
    weather, whether       |  100         93.4  93.4   |    0
    accept, except         |  100         82.0  82.0   |    0
    lead, led              |  100         83.7  83.7   |    0
    cite, sight, site      |  100         70.6  70.6   |    0
    principal, principle   |   29        100.0 100.0   |   71         91.7  83.3
    raise, rise            |    8        100.0 100.0   |   92         72.2  75.0
    affect, effect         |    6        100.0 100.0   |   94         97.8  95.7
    peace, piece           |    2        100.0 100.0   |   98         89.8  89.8
    country, county        |    0                      |  100         85.5  85.5
    amount, number         |    0                      |  100         82.9  82.9
    among, between         |    0                      |  100         75.3  75.3
</Paragraph>
<Paragraph position="4"> Table 4 summarizes the overall performance of all methods discussed. It can be seen that Trigrams and Bayes each have their strong points. Tribayes, however, achieves the maximum of their scores, by and large, the exceptions being due to cases where one method or the other had an unexpectedly low score (discussed in Sections 4 and 5). The confusion set {raise, rise} demonstrates (albeit modestly) the ability of the hybrid to outscore both of its components, by putting together the performance of the better component for both conditions.</Paragraph>
</Section>
<Section position="8" start_page="75" end_page="77" type="metho"> <SectionTitle> 7 Comparison with Microsoft Word </SectionTitle>
<Paragraph position="0"> The previous section evaluated the performance of Tribayes with respect to its components, and showed that it got the best of both. In this section, we calibrate this overall performance by comparing Tribayes with Microsoft Word (version 7.0), a widely used word-processing system whose grammar checker represents the state of the art in commercial context-sensitive spelling correction.</Paragraph>
<Paragraph position="1"> Unfortunately we cannot evaluate Word using &quot;prediction accuracy&quot; (as we did above), as we do not always have access to the system's predictions -- sometimes it suppresses its predictions in an effort to filter out the bad ones. Instead, in this section we will use two parameters to evaluate system performance: system accuracy when tested on correct usages of words, and system accuracy on incorrect usages. Together, these two parameters give a complete picture of system performance: the score on correct usages measures the system's rate of false negative errors (changing a right word to a wrong one), while the score on incorrect usages measures false positives (failing to change a wrong word to a right one). We will not attempt to combine these two parameters into a single measure of system &quot;goodness&quot;, as the appropriate combination varies for different users, depending on the user's typing accuracy and tolerance of false negatives and positives.</Paragraph>
<Paragraph position="2"> The test sets for the correct condition are the same ones used earlier, based on 20% of the Brown corpus. The test sets for the incorrect condition were generated by corrupting the correct test sets; in particular, each correct occurrence of a word in the confusion set was replaced, in turn, with each other word in the confusion set, yielding n - 1 incorrect occurrences for each correct occurrence (where n is the size of the confusion set). We will also refer to the incorrect condition as the corrupted condition.</Paragraph>
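<Paragraph> As a small sketch of this corruption step (assuming each test example is a tokenized sentence plus the index of its confusion-set word):

    def corrupt_test_set(correct_examples, confusion_set):
        # For every correct occurrence, produce n - 1 incorrect occurrences by
        # substituting each other member of the confusion set.
        corrupted = []
        for sentence, k in correct_examples:
            for wrong in confusion_set:
                if wrong != sentence[k]:
                    corrupted.append((sentence[:k] + [wrong] + sentence[k + 1:], k))
        return corrupted
</Paragraph>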
<Paragraph position="3"> To run Microsoft Word on a particular test set, we started by disabling error checking for all error types except those needed for the confusion set at issue. This was done to avoid confounding effects.</Paragraph>
<Paragraph position="4"> For {their, there, they're}, for instance, we enabled &quot;word usage&quot; errors (which include substitutions of their for there, etc.), but we disabled &quot;contractions&quot; (which include replacing they're with they are). We then invoked the grammar checker, accepting every suggestion offered. Sometimes errors were pointed out but no correction given; in such cases, we skipped over the error. Sometimes the suggestions led to an infinite loop, as with the sentence:
(5) Be sure it's out when you leave.
where the system alternately suggested replacing it's with its and vice versa. In such cases, we accepted the first suggestion, and then moved on.</Paragraph>
<Paragraph position="5"> Unlike Word, Tribayes, as presented above, is purely a predictive system, and never suppresses its suggestions. This is somewhat of a handicap in the comparison, as Word can achieve higher scores in the correct condition by suppressing its weaker suggestions (albeit at the cost of lowering its scores in the corrupted condition). To put Tribayes on an equal footing, we added a postprocessing step in which it uses thresholds to decide whether to suppress its suggestions. A suggestion is allowed to go through iff the ratio of the probability of the word being suggested to the probability of the word that appeared originally in the sentence is above a threshold. The probability associated with each word is the per-word sentence probability in the case of Trigrams, or the conditional probability P(w_i | F) in the case of Bayes. The thresholds are set in a preprocessing phase based on the training set (80% of Brown, in our case). A single tunable parameter controls how steeply the thresholds are set; for the study here, this parameter was set to the middle of its useful range, providing a fairly neutral balance between reducing false negatives and increasing false positives.</Paragraph>
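<Paragraph> A minimal sketch of this suppression step (the threshold-setting procedure itself is not reproduced here; prob_of stands in for the per-word sentence probability or P(w_i | F), depending on which component produced the suggestion):

    def maybe_suggest(original_word, suggested_word, prob_of, threshold):
        # Let a suggestion through only if the suggested word is sufficiently more
        # probable than the word that originally appeared in the sentence.
        if suggested_word == original_word:
            return None
        if prob_of(suggested_word) / prob_of(original_word) > threshold:
            return suggested_word
        return None  # suppress the weaker suggestion
</Paragraph>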
<Paragraph position="6"> The results of Word and Tribayes for the 18 confusion sets appear in Table 5. Six of the confusion sets (marked with asterisks in the table) are not handled by Word; Word's scores in these cases are 100% for the correct condition and 0% for the corrupted condition, which are the scores one gets by never making a suggestion. The opposite behavior -- always suggesting a different word -- would result in scores of 0% and 100% (for a confusion set of size 2). Although this behavior is never observed in its extreme form, it is a good approximation of Word's behavior in a few cases, such as {principal, principle}, where it scores 12% and 94%. In general, Word achieves a high score in either the correct or the corrupted condition, but not both at once.</Paragraph>
<Paragraph position="7"> Tribayes compares quite favorably with Word in this experiment. In both the correct and corrupted conditions, Tribayes' scores are mostly higher (often by a wide margin) or the same as Word's; in the cases where they are lower in one condition, they are almost always considerably higher in the other.</Paragraph>
<Paragraph position="8"> The one exception is {raise, rise}, where Tribayes and Word score about the same in both conditions.</Paragraph>
</Section> </Paper>