<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1003">
  <Title>Towards a single proposal in spelling correction</Title>
  <Section position="2" start_page="22" end_page="23" type="metho">
    <SectionTitle>
3 A recent version of ENGCG, known as EngCG-2, can be tested at http://www.conexor.fi/analysers.html
</SectionTitle>
    <Paragraph position="0"> can be tested at http://www.conexor.fi/analysers.html The discrimination of the correct category is unable to distinguish among readings belonging to the same category, so we also applied a word-sense disambiguator based on Wordnet, that had already been tried for nouns on free-running text.</Paragraph>
    <Paragraph position="1"> In our case it would choose the correction proposal semantically closer to the surrounding context. It has to be noticed that Conceptual Density can only be applied when all the proposals are categorised as nouns, due to the structure of Wordnet.</Paragraph>
    <Paragraph position="2">  Frequency data was calculated as word-form frequencies obtained from the document where the error was obtained (Document frequency, DF) or from the rest of the documents in the whole Brown Corpus (Brown frequency, BF). The experiments proved that word-forms were better suited for the task, compared to frequencies on lemmas.</Paragraph>
    <Section position="1" start_page="22" end_page="23" type="sub_section">
      <SectionTitle>
1.4 Other interesting heuristics (H1, H2)
</SectionTitle>
      <Paragraph position="0"> We eliminated proposals beginning with an uppercase character when the erroneous word did not begin with uppercase and there were alternative proposals beginning with lowercase. In example 1, the fourth reading for the misspelling &amp;quot;bos&amp;quot; was eliminated, as &amp;quot;Bose&amp;quot; would be at an editing distance of two from the misspelling (heuristic HI). This heuristic proved very reliable, and it was used in all experiments. After obtaining the first results, we also noticed that words with less than 4 characters like &amp;quot;si&amp;quot;, &amp;quot;teh&amp;quot;, ... (misspellings for &amp;quot;is&amp;quot; and &amp;quot;the&amp;quot;) produced too many proposals, difficult to disambiguate. As they were one of the main error sources for our method, we also evaluated the results excluding them  (heuristic H2).</Paragraph>
    </Section>
    <Section position="2" start_page="23" end_page="23" type="sub_section">
      <SectionTitle>
1.5 Combination of the basic techniques using votes
</SectionTitle>
      <Paragraph position="0"> using votes We considered all the possible combinations among the different techniques, e.g. CG+BF, BF+DF, and CG+DF. The weight of the vote can be varied for each technique, e.g. CG could have a weight of 2 and BF a weight of 1 (we will represent this combination as CG2+BF1). This would mean that the BF candidate(s) will only be chosen if CG does not select another option or if CG selects more than one proposal. Several combinations of weights were tried. This simple method to combine the techniques can be improved using optimization algorithms to choose the best weights among fractional values. Nevertheless, we did some trials weighting each technique with its expected precision, and no improvement was observed. As the best combination of techniques and weights for a given set of texts can vary, we separated the error corpora in two, trying all the possibilities on the first half, and testing the best ones on the second half (c.f. section 2.1).</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="23" end_page="25" type="metho">
    <SectionTitle>
2 The experiments
</SectionTitle>
    <Paragraph position="0"> Based on each kind of knowledge, we built simple guessers and combined them in different ways. In the first phase, we evaluated all the possibilities and selected the best ones on part of the corpus with artificially generated errors.</Paragraph>
    <Paragraph position="1"> Finally, the best combinations were tested against the texts with genuine spelling errors.</Paragraph>
    <Section position="1" start_page="23" end_page="23" type="sub_section">
      <SectionTitle>
2.1 The error corpora
</SectionTitle>
      <Paragraph position="0"> We chose two different corpora for the experiment. The first one was obtained by systematically generating misspellings from a sample of the Brown Corpus, and the second one was a raw text with genuine errors. While the first one was ideal for experimenting, allowing for automatic verification, the second one offered a realistic setting. As we said before, we are testing language models, so that both kinds of data are appropriate. The corpora with artificial errors, artificial corpora for short, have the following features: a sample was extracted from SemCor (a subset of the Brown Corpus) selecting 150 paragraphs at random. This yielded a seed corpus of 505 sentences and 12659 tokens. To simulate spelling errors, a program named antispell, which applies Damerau's rules at random, was run, giving an average of one spelling error for each 20 words (non-words were left untouched). Antispell was run 8 times on the seed corpus, creating 8 different corpora with the same text but different errors. Nothing was done to prevent two errors in the same sentence, and some paragraphs did not have any error.</Paragraph>
      <Paragraph position="1"> The corpus of genuine spelling errors, which we also call the &amp;quot;real&amp;quot; corpus for short, was magazine text from the Bank of English Corpus, which probably was not previously spell-checked (it contained many misspellings), so it was a good source of errors. Added to the difficulty of obtaining texts with real misspellings, there is the problem of marking the text and selecting the correct proposal for automatic evaluation.</Paragraph>
      <Paragraph position="2"> As mentioned above, the artificial-error corpora were divided in two subsets. The first one was used for training purposes 4. Both the second half and the &amp;quot;real&amp;quot; texts were used for testing.</Paragraph>
    </Section>
    <Section position="2" start_page="23" end_page="24" type="sub_section">
      <SectionTitle>
2.2 Data for each corpus
</SectionTitle>
      <Paragraph position="0"> The two corpora were passed trough ispell, and for each unknown word, all its correction proposals were inserted. Table 1 shows how, if the misspellings are generated at random, 23.5% of them are real words, and fall out of the scope of this work. Although we did not make a similar counting in the real texts, we observed that a similar percentage can be expected.</Paragraph>
      <Paragraph position="1"> words ~rrors aon real-word errors ispell proposals ~vords with multiple proposals Long word errors (H2) proposals for long words (H2)  For the texts with genuine errors, the method used in the selection of the misspellings was the following: after applying ispell, no correction was found for 150 words (mainly proper nouns and foreign words), and there were about 300 which  4 In fact, there is no training in the statistical sense. It just involves choosing the best alternatives for voting. 5 As we focused on non-word words, there is not a count of real-word errors.</Paragraph>
      <Paragraph position="2">  were formed by joining two consecutive words or by special affixation rules (ispell recognised them correctly). This left 369 erroneous word-forms. After examining them we found that the correct word-form was among ispell's proposals, with very few exceptions. Regarding the selection among the different alternatives for an erroneous word-form, we can see that around half of them has a single proposal. This gives a measure of the work to be done. For example, in the real error corpora, there were 158 word-forms with 1046 different proposals. This means an average of  proposals (2 &amp;quot;d half) than 4 are not taken into account, there are 807 proposals, that is, 4.84 alternatives per word.</Paragraph>
    </Section>
    <Section position="3" start_page="24" end_page="25" type="sub_section">
      <SectionTitle>
2.3 Results
</SectionTitle>
      <Paragraph position="0"> We mainly considered three measures: * coverage: the number of errors for which the technique yields an answer.</Paragraph>
      <Paragraph position="1"> * precision: the number of errors with the correct proposal among the selected ones * remaining proposals: the average number of selected proposals.</Paragraph>
      <Paragraph position="2"> 2.3.1 Search for the best combinations Table 2 shows the results on the training corpora. We omit many combinations that we tried, for the sake of brevity. As a baseline, we show the results when the selection is done at random. Heuristic H1 is applied in all the cases, while tests are performed with and without heuristic H2. If we focus on the errors for which ispell generates more  a better estimate of the contribution of each guesser. There were 8.26 proposals per word in the general case, and 3.96 when H2 is applied. The results for all the techniques are well above the random baseline. The single best techniques are DF and CG. CG shows good results on precision, but fails to choose a single proposal. H2 raises the precision of all techniques at the cost of losing coverage. CD is the weakest of all techniques, and we did not test it with the other corpora. Regarding the combinations, CGI+DF2+H2 gets the best precision overall, but it only gets 52% coverage, with 1.43 remaining proposals. Nearly 100% coverage is attained by the H2 combinations, with highest precision for CGI+DF2 (83% precision, 1.28 proposals).</Paragraph>
      <Paragraph position="3"> 2.3.2 Validation of the best combinations In the second phase, we evaluated the best combinations on another corpus with artificial errors. Tables 4 and 5 show the results, which agree with those obtained in 2.3.1. They show slightly lower percentages but always in parallel.</Paragraph>
    </Section>
    <Section position="4" start_page="25" end_page="25" type="sub_section">
      <SectionTitle>
2.3.3 Corpus of genuine errors
</SectionTitle>
      <Paragraph position="0"> As a final step we evaluated the best combinations on the corpus with genuine typing errors. Table 6 shows the overall results obtained, and table 7 the results for errors with multiple proposals. For the latter there were 6.62 proposals per word in the general case (2 less than in the artificial corpus), and 4.84 when heuristic H2 is applied (one more that in the artificial corpus). These tables are further commented in the following section.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="25" end_page="26" type="metho">
    <SectionTitle>
3 Evaluation of results
</SectionTitle>
    <Paragraph position="0"> This section reviews the results obtained. The results for the &amp;quot;real&amp;quot; corpus are evaluated first, and the comparison with the other corpora comes later.</Paragraph>
    <Paragraph position="1"> Concerning the application of each of the simple techniques separately6: * Any of the guessers performs much better than random.</Paragraph>
    <Paragraph position="2"> * DF has a high precision (75%) at the cost of a low coverage (12%). The difference in coverage compared to the artificial error corpora (84%) is mainly due to the smaller size of the documents in the real error corpus (around 50 words per document). For medium-sized documents we expect a coverage similar to that of the artificial error corpora.</Paragraph>
    <Paragraph position="3"> * BF offers lower precision (54%) with the gains of a broad coverage (96%).</Paragraph>
    <Paragraph position="4"> * CG presents 62% precision with nearly 100% coverage, but at the cost of leaving many proposals (2.45) * The use of CD works only with a small fraction of the errors giving modest results. The fact that it was only applied a few times prevents us from making further conclusions.</Paragraph>
    <Paragraph position="5"> Combining the techniques, the results improve: * The CGI+DF2 combination offers the best results in coverage (100%) and precision (70%) for all tests. As can be seen, CG raises the 6 If not explicitly noted, the figures and comments refer to the &amp;quot;real&amp;quot; corpus, table 7.</Paragraph>
    <Paragraph position="6">  coverage of the DF method, at the cost of also increasing the number of proposals (1.9) per erroneous word. Had the coverage of DF increased, so would also the number of proposals decrease for this combination, for instance, close to that of the artificial error corpora (1.28).</Paragraph>
    <Paragraph position="7"> * The CGI+DFI+BF1 combination provides the same coverage with nearly one interpretation per word, but decreasing precision to a 55%.</Paragraph>
    <Paragraph position="8"> * If full coverage is not necessary, the use of the H2 heuristic raises the precision at least 4% for all combinations.</Paragraph>
    <Paragraph position="9"> When comparing these results with those of the artificial errors, the precisions in tables 2, 4 and 6 can be misleading. The reason is that the coverage of some techniques varies and the precision varies accordingly. For instance, coverage of DF is around 70% for real errors and 90% for artificial errors, while precisions are 93% and 89% respectively (cf. tables 6 and 2). This increase in precision is not due to the better performance of DF 7, but can be explained because the lower the coverage, the higher the proportion of errors with a single proposal, and therefore the higher the precision.</Paragraph>
    <Paragraph position="10"> The comparison between tables 3 and 7 is more clarifying. The performance of all techniques drops in table 7. Precision of CG and BF drops 15 and 20 points. DF goes down 20 points in precision and 50 points in coverage. This latter degradation is not surprising, as the length of the documents in this corpus is only of 50 words on average. Had we had access to medium sized documents, we would expect a coverage similar to that of the artificial error corpora.</Paragraph>
    <Paragraph position="11"> The best combinations hold for the &amp;quot;real&amp;quot; texts, as before. The highest precision is for CGI+DF2 (with and without H2). The number of proposals left is higher in the &amp;quot;real&amp;quot; texts than in the artificial ones (1.99 to 1.28). It can be explained because DF does not manage to cover all errors, and that leaves many CG proposals untouched.</Paragraph>
    <Paragraph position="12"> We think that the drop in performance for the &amp;quot;real&amp;quot; texts was caused by different factors. First of all, we already mentioned that the size of the documents strongly affected DF. Secondly, the nature of the errors changes: the algorithm to 7 In fact the contrary is deduced from tables 3 and 7. produce spelling errors was biased in favour of frequent words, mostly short ones. We will have to analyse this question further, specially regarding the origin of the natural errors. Lastly, BF was trained on the Brown corpus on American English, while the &amp;quot;real&amp;quot; texts come from the Bank of English. Presumably, this could have also affected negatively the performance of these algorithms.</Paragraph>
    <Paragraph position="13"> Back to table 6, the figures reveal which would be the output of the correction system. Either we get a single proposal 98% of the times (1.02 proposals left on average) with 80% precision for all non-word errors in the text (CGI+DFI+BF1) or we can get a higher precision of 90% with 89% coverage and an average of 1.43 proposals (CGI+DF2+H2).</Paragraph>
  </Section>
  <Section position="5" start_page="26" end_page="27" type="metho">
    <SectionTitle>
4 Comparison with other context-sensitive correction systems
</SectionTitle>
    <Paragraph position="0"> sensitive correction systems There is not much literature about automatic spelling correction with a single proposal. Menezo et al. (1996) present a spelling/grammar checker that adjusts its strategy dynamically taking into account different lexical agents (dictionaries .... ), the user and the kind of text. Although no quantitative results are given, this is in accord with using document and general frequencies.</Paragraph>
    <Paragraph position="1"> Mays et al. (1991) present the initial success of applying word trigram conditional probabilities to the problem of context based detection and correction of real-word errors.</Paragraph>
    <Paragraph position="2"> Yarowsky (1994) experiments with the use of decision lists for lexical ambiguity resolution, using context features like local syntactic patterns and collocational information, so that multiple types of evidence are considered in the context of an ambiguous word. In addition to word-forms, the patterns involve POS tags and lemmas. The algorithm is evaluated in missing accent restoration task for Spanish and French text, against a predefined set of a few words giving an accuracy over 99%.</Paragraph>
    <Paragraph position="3"> Golding and Schabes (1996) propose a hybrid method that combines part-of-speech trigrams and context features in order to detect and correct real-word errors. They present an experiment where their system has substantially higher performance than the grammar checker in MS Word, but its coverage is limited to eighteen particular confusion sets composed by two or three similar words (e.g.: weather, whether).</Paragraph>
    <Paragraph position="4">  The last three systems rely on a previously collected set of confusion sets (sets of similar words or accentuation ambiguities). On the contrary, our system has to choose a single proposal for any possible spelling error, and it is therefore impossible to collect the confusion sets (i.e. sets of proposals for each spelling error) beforehand. We also need to correct as many errors as possible, even if the amount of data for a particular case is scarce.</Paragraph>
    <Paragraph position="5"> Conclusion This work presents a study of different methods that build on the correction proposals of ispell, aiming at giving a single correction proposal for misspellings. One of the difficult aspects of the problem is that of testing the results. For that reason, we used both a corpus with artificially generated errors for training and testing, and a corpus with genuine errors for testing.</Paragraph>
    <Paragraph position="6"> Examining the results, we observe that the results improve as more context is taken into account.</Paragraph>
    <Paragraph position="7"> The word-form frequencies serve as a crude but helpful criterion for choosing the correct proposal. The precision increases as closer contexts, like document frequencies and Constraint Grammar are incorporated. From the results on the corpus of genuine errors we can conclude the following. Firstly, the correct word is among ispell's proposals 100% of the times, which means that all errors can be recovered.</Paragraph>
    <Paragraph position="8"> Secondly, the expected output from our present system is that it will correct automatically the spelling errors with either 80% precision with full coverage or 90% precision with 89% coverage and leaving an average of 1.43 proposals.</Paragraph>
    <Paragraph position="9"> Two of the techniques proposed, Brown Frequencies and Conceptual Density, did not yield useful results. CD only works for a very small fraction of the errors, which prevents us from making further conclusions.</Paragraph>
    <Paragraph position="10"> There are reasons to expect better results in the future. First of all, the corpus with genuine errors contained very short documents, which caused the performance of DF to degrade substantially.</Paragraph>
    <Paragraph position="11"> Further tests with longer documents should yield better results. Secondly, we collected frequencies from an American English corpus to correct British English texts. Once this language mismatch is solved, better performance should be obtained. Lastly, there is room for improvement in the techniques themselves. We knowingly did not use any model of common misspellings.</Paragraph>
    <Paragraph position="12"> Although we expect limited improvement, stronger methods to combine the techniques can also be tried.</Paragraph>
    <Paragraph position="13"> Continuing with our goal of attaining a single proposal as reliably as possible, we will focus on short words and we plan to also include more syntactic and semantic context in the process by means of collocational information. This step opens different questions about the size of the corpora needed for accessing the data and the space needed to store the information.</Paragraph>
  </Section>
class="xml-element"></Paper>