<?xml version="1.0" standalone="yes"?> <Paper uid="C69-0701"> <Title>Automatic error-correction in natural languages</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> Automatic error-correction in natural language processing is based on the principle of 'elastic matching'. Text words are segmented into 'lines' with letters arranged according to a pre-determined sequence, and then matched line-by-line, shifts being applied if the numbers of lines are unequal.</Paragraph> <Paragraph position="1"> In order to resolve the possible multiple choices produced, the method may be supplemented by another one, based on the observed repetition of words in natural texts, and also by syntactic analysis. This paper describes the above methods and gives an account of an experiment now in progress at the National Physical Laboratory.</Paragraph> <Paragraph position="2"> I. Elastic matchin~ ~ith increased application of computers in the processing of natural languages comes the need for correcting errors introduced by human operators at the input stage.</Paragraph> <Paragraph position="3"> A statistic investigation \[1\] revealed that roughly 80 per cent of all misspelled words contain only one error, belonging to one of the following cases: a letter missing, an extra letter, a wrong letter and finally two adjacent letters interchanged. As such an error can occur in any position, a check by trying all possible alternatives in turn is clearly impracticable.</Paragraph> <Paragraph position="4"> A method which can obtain the same result but in a less tedious and time-constming way has been worked out and experimented upon at the National Physical Laboratory, Teddington, England. This method, named'elastic matching' was first proposed at the 1968 I.F.I.P.</Paragraph> <Paragraph position="5"> Congress in Edinburgh, Scotland \[2\].</Paragraph> <Paragraph position="6"> The elastic matching of words consists basically of coding all the characters (letters) as bits in a computer word, allotting to each letter a specific position. The whole English alphabet will therefore be represented by a sequence of 26 bits, although their order, as will be shown below, may, and indeed should, differ from the usual order of letters in the alphabet.</Paragraph> <Paragraph position="7"> -2-All words belonging to a complete set, which may be a list of words or a whole dictionary, are 'linearized', that is converted into segments, called 'lines', in which the letters are arranged in the agreed orderdeg if the current letter has a position prior to the last stored, a new line must be started. Thus, if the sequence in question were the alphabet itself, the word 'interest' (for example) would be linearized as follows: 'int-er-est'. The actual sequence, by the way, has to be chosen in such a way that it would produce the longest possible lines or, in other words, the minimum number of lines for a given sample of text.</Paragraph> <Paragraph position="8"> The matching is carried out not between words but between lines.</Paragraph> <Paragraph position="9"> ~ll errors will then stand out immediately as one or more disagreeing bits ~. 
<Paragraph position="10"> In the case of two bits a simple check will reveal whether this is the result of an accepted type of error (one wrong letter, or two adjacent letters interchanged), or the result of two separate errors, and is therefore to be rejected under the accepted limit (one error per word).</Paragraph> <Paragraph position="11"> In the examples shown below the alphabet has been assumed to be the linearizing sequence; this is done for the sake of clarity only.</Paragraph> <Paragraph position="12"> The result (d) is unacceptable because the two disagreeing bits are formed as a result of two errors (extra S and missing E). In the computer check this is shown by the two outstanding bits (letters) being separated by another bit (letter).</Paragraph> <Paragraph position="14"> If the numbers of lines in the two versions (misspelled and correct) are unequal, the procedure is as follows. The next line of the longer version is shifted back and matched against the result (that is, the disagreeing bits) of the previous match. Thus, for example,</Paragraph> <Paragraph position="16"> In the case of two disagreeing bits some simple checks have again to be made to eliminate the two-error cases, and also to prevent spurious matches resulting from the self-cancellation of characters between the two successive lines of the same version.</Paragraph> <Paragraph position="17"> More particulars of the operation of this method can be found in a special paper [5].</Paragraph> <Paragraph position="18"> 2. The dictionary organization

Elastic matching, as was mentioned above, is applicable against any set of (correct) words, which may be, for example, a list of proper names, or any other words, even of artificial (e.g. programming) languages.</Paragraph> <Paragraph position="20"> It is, however, the application to natural languages, in particular English, which is the subject of this paper. There are two problems which have to be overcome or, at least, reduced to manageable proportions before this method can be applied using a complete English dictionary.</Paragraph> <Paragraph position="21"> The first problem is access to the dictionary, which may contain tens or even hundreds of thousands of entries. This number, however, includes all grammatical forms of English words (fortunately, they are not so numerous as in highly inflected languages such as Russian). The dictionary look-up takes different forms depending on the way in which the dictionary is organized. The latter could have either a tree-like structure (preferably built of 'lines'), which is likely to be quicker in operation, or a list structure, in which words may be grouped by their line numbers, then by numbers of letters and finally, if the lists are still too long, by part-alphabetization (according to the accepted sequence). The list structure is easier to prepare.</Paragraph> <Paragraph position="22"> The words to be checked against the dictionary of the list structure will be linearized, and during this process the numbers of their lines and letters will be determined. The sections of the dictionary to be used in the matching process will be those with equal numbers of lines (and letters) and those immediately below and above these numbers (depending on the error threshold accepted).</Paragraph>
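As an illustration of the list structure just described, the following hedged Python sketch (not the NPL program; all names are ours) groups a dictionary by numbers of lines and letters and retrieves the sections within one unit of the input word's own counts:

    from collections import defaultdict

    SEQ = "abcdefghijklmnopqrstuvwxyz"   # again the plain alphabet, for illustration
    POS = {c: i for i, c in enumerate(SEQ)}

    def line_count(word):
        """Number of 'lines' the word linearizes into under SEQ."""
        count, last = 1, -1
        for c in word.lower():
            if POS[c] <= last:
                count += 1
            last = POS[c]
        return count

    def build_index(correct_words):
        """List-structure dictionary: entries grouped by line count,
        then by letter count."""
        index = defaultdict(list)
        for w in correct_words:
            index[(line_count(w), len(w))].append(w)
        return index

    def sections_to_search(word, index, threshold=1):
        """Collect the dictionary sections with the same numbers of
        lines and letters as the input word, and those immediately
        below and above (the spread depending on the accepted error
        threshold)."""
        lines, letters = line_count(word), len(word)
        sections = []
        for dl in range(-threshold, threshold + 1):
            for dc in range(-threshold, threshold + 1):
                sections.extend(index.get((lines + dl, letters + dc), []))
        return sections

Each retrieved entry would then be matched line by line as in section 1; part-alphabetization within a section is omitted here.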
<Paragraph position="23"> The other problem is connected with the number of multiple matches likely to occur, especially for short words. Two ways of alleviating this problem are described in the next section.</Paragraph> <Paragraph position="24"> 3. The supplementary procedures</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 The general-content check </SectionTitle> <Paragraph position="0"> One possibility of choosing between the multiple equivalents produced by dictionary look-up is to select those which are repeated throughout the article or speech in question. For this purpose a procedure called the 'general-content check' has been devised. As the text is processed, each different word satisfying certain conditions is stored. Then all multiple results from dictionary look-up are compared with the contents of this store (which may also be organized into sections) and words found there are given preference over others; a sketch of this procedure follows at the end of this subsection.</Paragraph> <Paragraph position="1"> The idea behind this is, of course, that words tend to be repeated by one writer or speaker.</Paragraph> <Paragraph position="2"> The size of the sample processed for the general-content check must be neither too small nor too large. The optimum size should be determined experimentally, but one may risk the guess that perhaps one to two thousand (current) text words are a practical amount.</Paragraph> <Paragraph position="3"> Further, there is no need to store all the different words. Ideally, these should be the so-called 'content' words, such as nouns, verbs, adjectives and adverbs, whereas the remaining 'function' words (prepositions, conjunctions, etc.) should be left aside, as not being content-typical. The selection can easily be done in the storing process if dictionary entries are suitably marked.</Paragraph> <Paragraph position="5"> Also, if one grammatical form of a word is stored, there is no need for storing others, so that the general-content vocabulary may assume the character of a stem-word list. This, again, can conveniently be arranged both in storing and in matching.</Paragraph> </Section>
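The following is a minimal sketch of the general-content check, under assumptions of our own: the class name, the sample size of two thousand running words (the paper's guess of one to two thousand), and the is_content_word flag standing in for the dictionary marking are all illustrative.

    class GeneralContentStore:
        """Stores 'content' words seen in the current sample of text and
        prefers dictionary candidates already found in the store."""

        def __init__(self, sample_size=2000):
            self.seen = set()
            self.words_processed = 0
            self.sample_size = sample_size

        def record(self, word, is_content_word):
            # Only marked 'content' words (nouns, verbs, adjectives,
            # adverbs) are stored; 'function' words are left aside.
            if self.words_processed < self.sample_size:
                self.words_processed += 1
                if is_content_word:
                    self.seen.add(word.lower())

        def prefer(self, candidates):
            # Given multiple dictionary matches, give preference to words
            # repeated in the sample; otherwise keep all candidates.
            repeated = [w for w in candidates if w.lower() in self.seen]
            return repeated or candidates

Reduction to stem words, so that one stored grammatical form covers the others, is omitted for brevity.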
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Syntactic analysis </SectionTitle> <Paragraph position="0"> Another possibility of making a choice between multiple equivalents is syntactic analysis. This is especially promising because, if one considers a typical lexical set of common words, one must notice that long words (which give, as a rule, better results in elastic matching) usually belong to the 'content' words, whereas the 'function' words, which are especially amenable to syntactic analysis, are normally short and would therefore either produce more multiple choices or, if of less than four letters, would escape the elastic matching altogether*. In this way the two methods are largely complementary.</Paragraph> <Paragraph position="1"> More will be said below about syntactic analysis in error-correction.</Paragraph> <Paragraph position="2"> Neither of the two supplementary methods mentioned above is applicable where elastic matching is used for non-textual material (lists of names, etc.).</Paragraph> <Paragraph position="3"> 4. An experiment in automatic error-correction

An experiment has been carried out at the NPL on the lines described above.</Paragraph> <Paragraph position="4"> First of all, an optimum linearizing sequence had to be established for English texts. Several methods were used for this purpose, both statistical and purely linguistic, and the results were submitted to computer tests. Sequences giving a lower yield were gradually eliminated and changes were made in those remaining, in order to determine the optimum sequence by the well-known 'hill-climbing' technique. This investigation has been fully described elsewhere [3] and it has produced the following sequence:</Paragraph> </Section> </Section> </Paper>