<?xml version="1.0" standalone="yes"?>
<Paper uid="C88-2146">
  <Title>Morphosyntactic correction in natural language interfaces</Title>
  <Section position="1" start_page="0" end_page="712" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Morphosyntax cannot simply be ignored in natural-language man-machine dialogue, since it constitutes an important part of the meaning. Nevertheless, troublesome side effects can arise when morphosyntactic errors are combined with other types of errors. We describe here an efficient means of handling quite complex combinations of typographical, phonographic and agreement errors in French, which are typical of C.A.I. users: a sentence as erroneous as les cotté adgassan à l'ippeauttainuz son perpndiqulère (!) will be perfectly recognized and translated into les côtés adjacents à l'hypoténuse sont perpendiculaires (the legs adjacent to the hypotenuse are perpendicular).</Paragraph>
    <Paragraph position="1"> I. Introduction This study was carried out within the context of a C.A.I.</Paragraph>
    <Paragraph position="2"> system for teaching plane geometry at high school level, which is being developed at G.R.T.C. (Chouraqui, Inghilterra and Véronis, 1988). In this system, natural-language interfaces occur in various places: experts are enabled to transfer knowledge (theorems, problems), and students to make demonstrations, using natural language. Error correction is particularly important in C.A.I. systems, since students are generally poor spellers and poor grammarians, and they make many conceptual errors in the subject they are learning.</Paragraph>
    <Paragraph position="3"> We introduce a distinction between competence and performance errors. Performance errors are simply due to mechanical or neuro-motor problems (typographical errors, 'slips of the pen'), whereas competence errors reflect ignorance about language rules or misconceptions about the domain. Phonographic errors (in French: ippeauttainuz for hypoténuse) or agreement errors (les côté opposé for les côtés opposés) are typical competence errors. In man-machine communication, the correction of competence errors is far more important than the correction of performance ones (see Véronis, 1988c). In fact, when faced with an error message, the user can correct typographical errors, for example, but he will generally be unable to correct phonographic or agreement errors. He can only try various spellings at random, which is a rather frustrating way of interacting with a system. We have tried elsewhere (Véronis, 1987b, c) to demonstrate how some semantic and conceptual errors can be handled (especially wrong presuppositions) using a special many-sorted logic. The present paper focuses on morphosyntactic errors, inflexion and agreements. We would stress that morphosyntax cannot simply be ignored in natural-language man-machine dialogue. Let us take, for example, the following wrong sentence, concerning a right triangle (we translate word for word; in French, determiners and adjectives agree in gender and number with nouns): Trace les côté opposé a l'angle droit.</Paragraph>
    <Paragraph position="4"> (Draw the(pl.) side(sg.) opposite(sg.) [to] the right angle). Two corrections of this sentence can be performed: Trace le côté opposé à l'angle droit (singular).</Paragraph>
    <Paragraph position="5"> Trace les côtés opposés à l'angle droit (plural).</Paragraph>
    <Paragraph position="6"> In the first case, there is no conceptual error, whereas, in the second, the user could (for example) have confused côté opposé (opposite side: there is exactly one such side, the hypotenuse) and côtés adjacents (legs adjacent to the right angle: there are two of them). This second interpretation should trigger an error message such as: &gt; Warning: in a right triangle, there is exactly one side opposite the right angle, the hypotenuse. Do you want to see the figure (y/n)? We must therefore correct morphosyntactic errors (gender and number, but also person, tense and mood) with great care, and apply appropriate rules to find the right interpretations.</Paragraph>
    <Paragraph position="7"> * The author's paper entitled &quot;Une extension à la distance entre chaines&quot; was accepted at the COLING'86 conference in Bonn, and actually presented in the session Morphology. Due to some technical error the paper was not included in the final program and was omitted from the Proceedings.</Paragraph>
    <Paragraph position="8"> The problem becomes rather more complicated when several types of errors (typographical, phonographic and morphosyntactic) are combined in a single word.</Paragraph>
    <Paragraph position="9"> Troublesome side-effects can then arise when a morphological program attempts to reduce such words to their root form. For example, the wrong form hippothénuses will be reduced to a hypothetical root form hippothénuse, which is not to be found in the dictionary. In addition, the inflexion itself may be misspelt (e.g. démontron instead of démontrons). In such a case, the wrong ending may no longer be a possible inflexion, so that the standard morphological program will fail in trying to construct a hypothetical root form. We therefore need a two-stage process, in order to first find out the root and inflexion of inflected words despite typographical or phonographic errors, and then to apply appropriate rules to obtain the right agreement interpretations. These rules will involve some weighting of the possible agreement errors, which makes certain interpretations more likely than others.</Paragraph>
    <Paragraph position="10"> II. Root and inflexion retrieval The most common strategy in spelling correction consists of applying reverse morphological transformations to words to produce a hypothetical root form, and then looking it up in the dictionary. If there is no matching entry, a spelling correction program is triggered. Nevertheless, if the inflexion is misspelt, the problem is really troublesome since, as mentioned above, the morphological program will be unable to produce a hypothetical root. Avoiding any morphological analysis by storing all inflected forms in the dictionary is a very inefficient solution, since spelling correction algorithms all involve scanning a sometimes quite large portion of the dictionary. The time spent on spelling correction will then naturally be even greater with a dictionary of inflected forms (remember, for example, that French verbs have about forty different inflected forms). Moreover, much research has been devoted to spelling correction since the very beginning of computer science (for a review, see Peterson, 1980, and Pollock, 1982), but it has generally focused on noise errors (due to hardware problems, such as errors caused by the input devices, or transmission) or typographical errors, due to keyboard typing slips, such as those listed in Damerau's (1964) often-quoted study, which shows that 80% of errors in words belong to one of the following categories: - substitution of one letter for another, - addition of a letter, - deletion of a letter, - transposition of two adjacent letters. The first three errors can result from either noise or typographical causes, and the fourth is specifically a typographical one. We agree with Damerau (1964) that when writing computer programs or indexing documents by means of keywords, these errors are almost the only ones which occur. The same words are constantly repeated, and the operator (a specialist) knows exactly how to spell them. The mistakes made are therefore nearly all performance errors. But when the general public (especially in C.A.I.) uses computer services, very different problems can arise.</Paragraph>
    <Paragraph position="11"> Performance errors are still present, of course, but they are coupled with a very large number of competence errors such as phonographic ones, which, as we said previously, must be dealt with first and foremost.</Paragraph>
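Damerau's four single-error categories can be illustrated with a minimal Python sketch. The function name and the test words are assumptions made for illustration; this is not code from the paper, and it only recognizes words that differ by exactly one such error.

```python
from typing import Optional


def single_error_type(typed: str, intended: str) -> Optional[str]:
    """Name the single Damerau-style error turning `intended` into `typed`, if any."""
    if typed == intended:
        return None
    # Length of the common prefix of the two words.
    i = 0
    while min(len(typed), len(intended)) > i and typed[i] == intended[i]:
        i += 1
    if len(typed) == len(intended):
        # Same length: one letter replaced, or two adjacent letters swapped.
        if typed[i + 1:] == intended[i + 1:]:
            return "substitution"
        if (typed[i:i + 2] == intended[i:i + 2][::-1]
                and typed[i + 2:] == intended[i + 2:]):
            return "transposition"
    elif len(typed) == len(intended) + 1 and typed[i + 1:] == intended[i:]:
        return "addition"      # one extra letter was typed
    elif len(typed) + 1 == len(intended) and typed[i:] == intended[i + 1:]:
        return "deletion"      # one letter is missing
    return None


print(single_error_type("démontron", "démontrons"))      # deletion
print(single_error_type("ippeauttainuz", "hypoténuse"))  # None
```

The second call shows why this model alone is not enough for phonographic errors: the misspelling differs from the correct word by far more than one edit operation.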
    <Paragraph position="12"> The mathematical framework developed for noise and typographical errors is very badly suited to phonographic errors. For example, taking Wagner and Fischer's (1974) and Lowrance and Wagner's (1975) distance between strings (based on edit operations which model Damerau's four kinds of errors), the wrong spelling ippeauttainnuz is very far from the right one, hypoténuse, though it is obvious to any French speaker that the pronunciation is exactly the same. In addition, methods based on a transcription of words into some phonetic form cannot work when phonographic errors are combined with typographical ones.</Paragraph>
    <Paragraph position="13"> We have therefore extended the notion of proximity between strings to take phonetic similarity into account. In the case of phonographic errors, a whole grapheme, which can be more than one letter long, can be replaced by another grapheme having the same phonetic value. This defines a similarity relation between graphemes, as shown in Figure 1. The basic idea is to extend the edit operations to similar-substring substitution, and to associate high costs with edit operations altering pronunciation (most noise and typographical errors) and low costs with edit operations preserving pronunciation (phonographic errors) (Véronis, 1988a, b).</Paragraph>
    <Paragraph position="14"> In addition, we established a precise quantitative inventory of sound-to-spelling correspondences, which, although absolutely necessary in any attempt to build efficient phonographic correctors, was sorely lacking for French. This collection of data has subsequently proved to [...] This led us to the building of an efficient algorithm for retrieving from a dictionary words which may be riddled with both phonographic and typographical errors. This algorithm is an extension to phonographic errors of the algorithm proposed by Damerau (1964), Morgan (1970), and Durham et al. (1983). There are two essential differences between the latter and the algorithm that we propose. First, we try to match the entire unknown word against a dictionary of root forms, as we shall describe later. Secondly, we scan the strings x and y from left to right, no longer by simply checking at each point (i, j) that the symbols x[i] and y[j] are the same, but rather by testing whether these symbols constitute the beginning of any similarly-pronounced substrings.</Paragraph>
    <Paragraph position="15"> The problem is to find as quickly as possible the longest similar substrings at each point (i, j) of the analysis. We have no room here to go into technical details, but this is possible using rather sophisticated methods which consist of pre-computing tables from the similarly-pronounced relation between graphemes, and storing the dictionary in a coded form where each character is replaced by a code which stands for the longest substring which begins with this character and can be involved in some similarly-pronounced relation (Véronis, 1988b).</Paragraph>
    <Paragraph position="16"> The restriction stipulated by Morgan (1970), and Durham et al. (1983) is that the unknown word must contain no more than one typographical mistake, since this will cover the large majority of cases : two typographical errors rarely occur in the same word (Pollock and Zamora, 1983).</Paragraph>
    <Paragraph position="17"> We soften this restriction by allowing one typographical error in the root, and another at the ending of the word, in the inflexion, while within a word we accept an unlimited number of phonographic errors. Words as incorrectly spelt as ippeauttainnuz, hipptainuz, hyothénnuse (for hypoténuse) are perfectly recognized. This algorithm is quite fast enough for natural-language interfaces using dictionaries stored in R.A.M., since the access time to the correct entry in a 300-word French dictionary generating &amp;quot;700 inflected forms is about 25 ms with a Pascal program on a Macintosh II computer. The time taken hardly depends at all on the length of the word or on the number of phonographic errors it contains. Better results could be obtained by a more sophisticated organization of the dictionary (in tree form, for example).</Paragraph>
    <Paragraph position="18"> 1) As long as x[i] and y[j] are the beginning of similar substrings, the indexes i and j are incremented by the lengths of the respective similarly-pronounced substrings, and this step is repeated (Fig. 2.a).</Paragraph>
    <Paragraph position="19">  2) When two symbols are found which do not fulfill this requirement (Fig. 2.b), the following four hypotheses are tested (they correspond to typographical errors): - the next two adjacent letters have been transposed, - the next letter is missing (as in the example), - the next letter has been inserted, - the next letter has been replaced by another.</Paragraph>
    <Paragraph position="20"> In each case, an attempt is made to match the tail substrings according to 1), while skipping the appropriate letters (Fig. 2.c).</Paragraph>
    <Paragraph position="21"> 3) Once the root form has been completely scanned, if some substring remains in the unknown word, it is matched against a list of inflexions, using the same procedure (Fig. 2.d).</Paragraph>
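The scanning steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' Pascal implementation: the table SIMILAR_CLASSES is a hypothetical handful of grapheme classes chosen to cover the examples in the text, the matcher backtracks over every similar pair instead of using pre-computed longest-substring tables, and it compares the unknown word with a whole dictionary entry rather than handling root and inflexion separately.

```python
from functools import lru_cache

# Hypothetical classes of graphemes assumed to share one pronunciation.
SIMILAR_CLASSES = [
    ("hy", "hi", "i", "y"),
    ("p", "pp"),
    ("o", "au", "eau"),
    ("t", "tt", "th"),
    ("é", "è", "ai"),
    ("n", "nn"),
    ("se", "z"),
]


def matches(unknown: str, entry: str, max_typos: int = 1) -> bool:
    """True if `unknown` can be read as `entry` with any number of
    phonographic substitutions and at most `max_typos` typographical errors."""

    @lru_cache(maxsize=None)
    def walk(i: int, j: int, typos: int) -> bool:
        if i == len(unknown) and j == len(entry):
            return True
        # Step 1: consume a pair of similarly pronounced substrings.
        for cls in SIMILAR_CLASSES:
            for gx in cls:
                if not unknown.startswith(gx, i):
                    continue
                for gy in cls:
                    if entry.startswith(gy, j) and walk(i + len(gx), j + len(gy), typos):
                        return True
        rest_x = len(unknown) - i
        rest_y = len(entry) - j
        # Identical symbols are trivially similar.
        if rest_x >= 1 and rest_y >= 1 and unknown[i] == entry[j]:
            if walk(i + 1, j + 1, typos):
                return True
        if typos == 0:
            return False
        # Step 2: the four typographical hypotheses.
        if (rest_x >= 2 and rest_y >= 2
                and unknown[i] == entry[j + 1] and unknown[i + 1] == entry[j]):
            if walk(i + 2, j + 2, typos - 1):      # transposed letters
                return True
        if rest_y >= 1 and walk(i, j + 1, typos - 1):      # letter missing
            return True
        if rest_x >= 1 and walk(i + 1, j, typos - 1):      # letter inserted
            return True
        if rest_x >= 1 and rest_y >= 1 and walk(i + 1, j + 1, typos - 1):
            return True                                    # letter replaced
        return False

    return walk(0, 0, max_typos)


print(matches("ippeauttainuz", "hypoténuse"))  # True
print(matches("hipptainuz", "hypoténuse"))     # True (one missing letter)
```

With these assumed classes, ippeauttainuz is matched with no typographical error at all, while hipptainuz needs the single allowed one (the missing o).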
    <Paragraph position="22"> Figure 2: phonographic correction of root and inflexion. III. Agreement correction Once the right root and inflexion have been found in the dictionary during the lexical analysis, the morphological information (gender, number, etc.) associated with the word is passed on to the parser, which deals with any wrong agreements. In such a case, the parser builds various interpretations: le triangles (the(sg.) triangles(pl.)) can be corrected into les triangles (plural) or into le triangle (singular). The problem is how to classify these interpretations depending on their plausibility. The few methods proposed so far (as in Richard and Lapalme, 1986) are not satisfactory. There are in fact two classical approaches. The first consists of favouring the interpretation which minimizes the total number of errors. For example, correcting le triangles rectangles (the(sg.) right(pl.) triangles(pl.)) into le triangle rectangle (singular) implies two errors, whereas the correction into les triangles rectangles (plural) implies a single error. The second approach consists of always favouring the morphological features of fixed syntactic categories. For example, Richard and Lapalme (1986) propose favouring the determiner in French over the noun. This leads, in the previous example, to a correction into le triangle rectangle (singular). The two approaches are in many cases, as here, contradictory. One can use a combination of the two methods, for example by applying the second when the first fails (same number of errors under each hypothesis), but this will not solve all problems. In fact, we needed to carefully investigate the agreement phenomena, in order to establish a weighting of errors.</Paragraph>
    <Paragraph position="23"> Our first finding concerned the non-symmetry of errors.</Paragraph>
    <Paragraph position="24"> People very often forget unpronounced morphological markers but very rarely add them for no reason. Adding a marker costs more than removing it. Therefore, the group triangles rectangle (right(sg.) triangles(pl.)) should preferably be corrected into triangles rectangles (plural).</Paragraph>
    <Paragraph position="25"> One should also note the very important role of pronunciation. For example, it is very unlikely that a user would write équilatéraux (equilateral(pl.)) for équilatéral (equilateral(sg.)), since the two forms do not have the same pronunciation. Consequently, triangle équilatéraux (equilateral(pl.) triangle(sg.)) should preferably be corrected into triangles équilatéraux (plural). In addition, one can assume that native speakers of French are unlikely to produce errors involving the knowledge of morphological features of words such as gender, number and person. Everybody knows that chien (dog) is masculine and chienne (female dog) is feminine. The difficulty lies in the transcription of agreement markers in an orthographical system. Therefore, errors such as chienne dressé (trained(masc.) dog(fem.)) should be corrected into chienne dressée (feminine) and not into chien dressé (masculine). The situation would be different with non-native speakers of French, for example in a C.A.I. system for learning French, where gender errors would be very frequent. In this case, the weighting of errors would have to be different.</Paragraph>
    <Paragraph position="26"> We postulate three classes of errors with increasing costs.</Paragraph>
    <Paragraph position="27"> I. The least costly type of error consists of deleting a marker involving no change in the pronunciation (e.g.</Paragraph>
    <Paragraph position="28"> French triangles → triangle).</Paragraph>
    <Paragraph position="29"> II. The second class consists of adding a marker which entails no pronunciation change (e.g. French triangle → triangles).</Paragraph>
    <Paragraph position="30"> III. The third and most costly class consists of errors altering the pronunciation (e.g. le → la).</Paragraph>
    <Paragraph position="31"> Some intermediate cases are distributed among these three classes. For example, errors involving a final so-called 'mute' e (which indicates the feminine, and has an unstable pronunciation) will belong to class II in the case of a deletion (e.g., petite → petit = small), and class III in the case of an addition (e.g., petit → petite).</Paragraph>
    <Paragraph position="32"> The main point is that we cannot simply attribute an increasing weight to each class, and add the weights when combining phrases. It should be noted that an arbitrary number of errors in a given class remains less costly than a single error in the next class. For example, les triangle rectangle et isocèle (the(pl.) right(sg.) and isosceles(sg.) triangle(sg.)) should be corrected into les triangles rectangles et isocèles (plural) with three class I errors, whereas the correction into le triangle rectangle et isocèle (singular) would involve a single error, but of class II.</Paragraph>
    <Paragraph position="33"> This can be modelled by ordinal numbers: $0, 1, 2, 3, \ldots, \omega, \omega + 1, \ldots, \omega^{2}$, etc. (recall that $\omega^{i} \cdot k &lt; \omega^{i+1}$ for all $k$). Class I has costs of the form $k$, class II of the form $\omega \cdot k$, and class III of the form $\omega^{2} \cdot k$. In practice, ordinals can be coded by integers, by choosing a sufficiently large integer $B$ (for example 10), and mapping $\omega^{n} \cdot k_{n} + \ldots + \omega \cdot k_{1} + k_{0}$ to $k_{n} B^{n} + \ldots + k_{1} B + k_{0}$. For example, $\omega^{2} \cdot 2 + \omega \cdot 3 + 1$ will be coded by 231. This coding is adopted in the Figures.</Paragraph>
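The integer coding of ordinal costs can be sketched in a few lines of Python. The function name encode_cost is an assumption made for illustration, and the guard on the coefficients reflects a limitation of the coding itself: with B = 10, no more than nine errors of one class can be counted before the code spills into the next class.

```python
def encode_cost(coefficients, base=10):
    """Code the ordinal k_n*w^n + ... + k_1*w + k_0 as an integer in base `base`.

    `coefficients` lists k_n, ..., k_1, k_0 (highest power of omega first).
    """
    code = 0
    for k in coefficients:
        if not (base > k >= 0):
            raise ValueError("each coefficient must lie between 0 and base - 1")
        code = code * base + k
    return code


print(encode_cost([2, 3, 1]))  # 231, the example given in the text

# Three class I errors remain cheaper than a single class II error.
three_class_one_errors = encode_cost([0, 0, 3])
one_class_two_error = encode_cost([0, 1, 0])
print(one_class_two_error > three_class_one_errors)  # True
```

Comparing the coded integers then gives the same ordering as comparing the ordinals, as long as every coefficient stays below the chosen base.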
    <Paragraph position="34"> The parser conducts the various possible morphological analyses in parallel, in order to avoid the costly backtracking needed to repeat the analysis as soon as an error occurs, and also to avoid the need for any special error recovery procedure. This is achieved by associating with each node of the syntactic tree a vector of the costs of each possible morphological hypothesis. The lexicon provides these values for each word (Figure 3). For example, the word petits (small(pl.)) will be associated with the vector $[\omega, \omega \cdot 2, 0, \omega]$, which means that it can be a mistake for: - petit (masc. sing.) with a cost $\omega$ (adding s), - petite (fem. sing.) with a cost $\omega \cdot 2$ (deleting mute e + adding s), - petits (masc. plur.) with a cost 0 (no error), - petites (fem. plur.) with a cost $\omega$ (deleting mute e). In addition, each word is associated with a domain, which consists of the only possible corrections, since many words have restricted morphological features. This is the case with most nouns: homme (man) can only be masculine, femme (woman) only feminine, gens (people) only masculine plural, but also with some adjectives: enceinte (pregnant) can only be feminine. We represent the domains by hatching the forbidden part, which is coded by a special value in the vector.</Paragraph>
    <Paragraph position="35"> [Figure 3: cost vectors for words, showing the interpretation of cost vectors (masc. sing., fem. sing., masc. plur., fem. plur.) for homme and hommes.] When phrases are combined during the parsing, domains are intersected and costs are added separately in each vector column in the following way:</Paragraph>
    <Paragraph position="37"> Under the above-mentioned assumption for coding ordinals, the addition of costs can be reduced, in practice, to the ordinary addition of integers in base B. Therefore, the parallel computation of the various morphological hypotheses is not much more expensive than the usual exact, non-parallel, computation.</Paragraph>
    <Paragraph position="38"> The same process can be applied to the other morphological features: person, tense and mood. In the final stage of parsing, the least costly hypothesis is chosen (Fig. 4, 5). If semantic constraints prove this interpretation to be impossible, the next hypothesis is chosen, and so on.</Paragraph>
    <Paragraph position="39"> This part is implemented in Prolog and calls on the Pascal module described in section II.</Paragraph>
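The combination of cost vectors and the final choice of the least costly hypothesis can be sketched in Python, as an illustration only and not as the Prolog implementation mentioned above. It assumes the base-10 integer coding of ordinals, uses None as the special value for a forbidden hypothesis, and the names combine and best_hypothesis are hypothetical. The vector for petits follows the costs itemized in the text; the vector for homme is derived here from the class definitions (a missing plural s is a class I error) and is an assumption.

```python
HYPOTHESES = ("masc. sing.", "fem. sing.", "masc. plur.", "fem. plur.")


def combine(left, right):
    """Intersect domains and add costs column by column.

    A column is forbidden (None) as soon as it is forbidden in either vector.
    """
    return [None if a is None or b is None else a + b
            for a, b in zip(left, right)]


def best_hypothesis(vector):
    """Return the name of the least costly allowed hypothesis, or None."""
    allowed = [(cost, name) for cost, name in zip(vector, HYPOTHESES)
               if cost is not None]
    if not allowed:
        return None
    return min(allowed)[1]


# petits: costs taken from the text, coded with B = 10 (omega is coded 10).
petits = [10, 20, 0, 10]
# homme: masculine only; reading it as hommes means one omitted s (class I).
homme = [0, None, 1, None]

phrase = combine(petits, homme)
print(phrase)                   # [10, None, 1, None]
print(best_hypothesis(phrase))  # masc. plur.
```

For the phrase petits homme, the plural reading costs a single class I error, so it is preferred over the singular reading, which would cost one class II error.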
    <Section position="1" start_page="711" end_page="712" type="sub_section">
      <SectionTitle>
IV. Conclusion
</SectionTitle>
      <Paragraph position="0"> An efficient means of handling quite complex combinations of typographical, phonographic and agreement errors, which are frequent with C.A.I. users, is described: a sentence as erroneous as les cotté adgassan à l'ippeauttainuz son perpndiqulère (!) will be perfectly recognized and translated into les côtés adjacents à l'hypoténuse sont perpendiculaires (the legs adjacent to the hypotenuse are perpendicular). This feature can make interaction with systems more pleasant for non-specialists. [Figure: les homme parle = the (plur.) man (sing.) talk (sing.)]</Paragraph>
    </Section>
  </Section>
</Paper>