Morphosyntactic correction in natural language interfaces 
Jean VERONIS * 
Groupe Representation et Traitement des Connaissances 
Centre National de la Recherche Scientifique 
31, ch. Joseph Aiguier 
13402 MARSEILLE CEDEX 9 - FRANCE 
Abstract 
Morphosyntax cannot be simply ignored in natural-language
man-machine dialogue since it constitutes an
important part of the meaning. Nevertheless, troublesome
side effects can arise when morphosyntactic errors are
combined with other types of errors. We describe here an
efficient means of handling quite complex combinations of
typographical, phonographic and agreement errors in
French, which are typical of C.A.I. users : a sentence as
erroneous as les cotté adgassan à l'ippeauttainuz son
perpndiqulère (!) will be perfectly recognized and translated
into les côtés adjacents à l'hypoténuse sont perpendiculaires
(the legs adjacent to the hypotenuse are perpendicular).
I. Introduction
This study was carried out within the context of a C.A.I.
system for teaching plane geometry at high school level,
which is being developed at G.R.T.C. (Chouraqui, Inghilterra
and Véronis, 1988). In this system, natural-language
interfaces occur in various places : experts are enabled to
transfer knowledge (theorems, problems), and students to
make demonstrations, using natural language. Error
correction is particularly important in C.A.I. systems, since
students are generally poor spellers and poor grammarians,
and they make many conceptual errors in the subject they
are learning.
We introduce a distinction between competence and
performance errors. Performance errors are simply due to
mechanical or neuro-motor problems (typographical errors,
'slips of the pen'), whereas competence errors reflect
ignorance about language rules or misconceptions about
the domain. Phonographic errors (in French : ippeauttainuz
for hypoténuse) or agreement errors (les côté opposé for
les côtés opposés) are typical competence errors. In
man-machine communication, the correction of competence
errors is far more important than the correction of
performance ones (see Véronis, 1988c). In fact, when faced
with an error message, the user can correct typographical
errors, for example, but he will generally be unable to
correct phonographic or agreement errors. He can only try
various spellings at random, which is a rather frustrating way
of interacting with a system. We have tried elsewhere
(Véronis, 1987b, c) to demonstrate how some semantic and
conceptual errors can be handled (especially wrong
presuppositions) using a special many-sorted logic. The
present paper focuses on morphosyntactic errors, inflexion
and agreements.
We would stress that morphosyntax cannot be simply
ignored in natural-language man-machine dialogue. Let us
take, for example, the following wrong sentence, concerning
a right triangle (we translate word for word; in French,
determiners and adjectives agree in gender and number
with nouns) :
Trace les côté opposé à l'angle droit.
(Draw the_pl. side_sg. opposite_sg. [to] the right angle).
Two corrections of this sentence can be performed :
Trace le côté opposé à l'angle droit (singular).
Trace les côtés opposés à l'angle droit (plural).
In the first case, there is no conceptual error, whereas, in the
second, the user could (for example) have confused côté
opposé (opposite side : there is exactly one such side, the
hypotenuse) and côtés adjacents (legs adjacent to the right
angle : there are two of them). This second interpretation
should trigger an error message such as :
> Warning : in a right triangle, there is exactly one side
opposite the right angle, the hypotenuse. Do you want
to see the figure (y/n)?
We must therefore correct morphosyntactic errors (gender
and number, but also person, tense and moods) with great
care, and apply appropriate rules to find out the right
interpretations.
The problem becomes rather more complicated when
several types of errors (typographical, phonographic and
morphosyntactic) are combined in a single word.
Troublesome side-effects can then arise when a
morphological program attempts to reduce such words to
their root form. For example, the wrong form hippothénuses
will be reduced to a hypothetical root form hippothénuse,
which is not to be found in the dictionary. In addition, the
inflexion itself may be misspelt (e.g. démontron, instead of
démontrons). In such a case, the wrong ending may in
addition no longer be a possible inflexion, so that the
standard morphological program will fail in trying to
construct a hypothetical root form. We therefore need a
two-stage process, in order to first find out the root and inflexion
of inflected words despite typographical or phonographic
errors, and then to apply appropriate rules to obtain the right
agreement interpretations. These rules will involve some
weighting of the possible agreement errors, which makes
certain interpretations more likely than others.
* The author's paper entitled "Une extension à la distance entre chaînes" was accepted at the COLING'86 conference in Bonn, and actually presented in the
session Morphology. Due to some technical error the paper was not included in the final program and was omitted from the Proceedings.
II. Root and inflexion retrieval
The most common strategy in spelling correction
consists of applying reverse morphological transformations
on words to produce a hypothetical root form, and then
looking it up in the dictionary. If there is no matching entry, a
spelling correction program is triggered. Nevertheless, if the
inflexion is misspelt, the problem is really troublesome
since, as mentioned above, the morphological program will
be unable to produce a hypothetical root. The solution
consisting of avoiding any morphological analysis by storing
all inflected forms in the dictionary is a very inefficient one,
since spelling correction algorithms all involve scanning a
sometimes quite large portion of the dictionary. The time
spent on spelling correction will then naturally be even
greater in an inflected dictionary (remember, for example,
that French verbs have about forty different inflected forms).
Moreover, much research has been devoted to spelling
correction since the very beginning of computer science (for
a review, see Peterson, 1980, and Pollock, 1982), but has
generally focused on noise errors (due to hardware
problems, such as errors caused by the input devices, or
transmission) or typographical errors, due to keyboard
typing slips, such as those listed in Damerau's (1964)
often-quoted study, which shows that 80% of errors in words
belong to one of the following categories :
- substitution of a letter for another,
- addition of a letter,
- deletion of a letter,
- transposition of two adjacent letters.
The first three errors can result from either noise or 
typographical causes, and the fourth is specifically a 
typographical one. We agree with Damerau (1964) that 
when writing computer programs or indexing documents by 
means of keywords, these errors are almost the only ones 
which occur. The same words are constantly repeated, and 
the operator (a specialist) knows exactly how to spell them. 
The mistakes made are therefore nearly all performance 
errors. But when the general public (especially in C.A.I.) 
uses computer services, very different problems can arise. 
Performance errors are still present, of course, but they are 
coupled with a very large number of competence errors 
such as phonographic ones, which, as we said previously, 
must be dealt with first and foremost. 
The mathematical framework developed for noise and 
typographical errors is very badly suited to phonographic 
errors. For example, taking Wagner and Fischer's (1974) and
Lowrance and Wagner's (1975) distance between strings
(based on edit operations which model Damerau's four
kinds of errors), the wrong spelling ippeauttainnuz is very
far from the right one hypotenuse, though it is obvious to 
any French speaker that the pronunciation is exactly the 
same. In addition, methods based on a transcription of 
words into some phonetic form cannot work when 
phonographic errors are combined with typographical ones. 
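As a point of comparison, the unit-cost distance built on Damerau's four edit operations can be sketched as follows. This is only a minimal Python illustration: it implements the restricted "optimal string alignment" variant, not the exact formulation of Lowrance and Wagner, and the name osa_distance is an assumption introduced here.

```python
def osa_distance(a: str, b: str) -> int:
    """Unit-cost distance with substitution, insertion, deletion and
    transposition of two adjacent letters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete the first i letters of a
    for j in range(cols):
        d[0][j] = j          # insert the first j letters of b
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                 # deletion
                d[i][j - 1] + 1,                 # insertion
                d[i - 1][j - 1] + substitution,  # match or substitution
            )
            # transposition of two adjacent letters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    print(osa_distance("hypotenues", "hypotenuse"))      # one transposition
    print(osa_distance("ippeauttainnuz", "hypotenuse"))  # many edits
```

A single typing slip stays at distance 1, whereas the phonographic misspelling needs many unit-cost edits although it is pronounced like the target.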
We have therefore extended the notion of proximity 
between strings to take phonetic similarity into account. In 
the case of phonographic errors, a whole grapheme, which 
can be more than one letter long, can be replaced by 
another grapheme having the same phonetic value. This 
defines a similarity relation between graphemes, as shown 
in Figure 1. The basic idea is to extend the edit operations to 
similar-substring substitution, and to associate high costs 
with edit operations altering pronunciation (most noise and 
typographical errors) and low costs to edit operations 
preserving pronunciation (phonographic errors) (Véronis,
1988a, b).
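The idea of extending the edit operations to similar-substring substitution can be illustrated by a small dynamic-programming sketch. It is not the retrieval algorithm described in this paper (which scans a dictionary from left to right); the similarity table SIMILAR, the cost values LOW and HIGH and the function name are all assumptions chosen for the example, and transposition is omitted for brevity.

```python
# Hypothetical fragment of a similarity relation between graphemes.
SIMILAR = {("i", "hy"), ("pp", "p"), ("eau", "o"), ("tt", "t"),
           ("ai", "e"), ("nn", "n"), ("z", "se")}
LOW, HIGH = 1, 10   # assumed costs: pronunciation preserved / altered


def phonographic_distance(x: str, y: str) -> int:
    """Edit distance where similar substrings are substituted at low cost."""
    pairs = SIMILAR | {(h, g) for g, h in SIMILAR}   # make the relation symmetric
    inf = float("inf")
    n, m = len(x), len(y)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            c = cost[i][j]
            if i < n and j < m:      # identical letters, or costly substitution
                step = 0 if x[i] == y[j] else HIGH
                cost[i + 1][j + 1] = min(cost[i + 1][j + 1], c + step)
            if i < n:                # extra letter in x
                cost[i + 1][j] = min(cost[i + 1][j], c + HIGH)
            if j < m:                # letter missing from x
                cost[i][j + 1] = min(cost[i][j + 1], c + HIGH)
            for g, h in pairs:       # similar-substring substitution
                if x.startswith(g, i) and y.startswith(h, j):
                    ni, nj = i + len(g), j + len(h)
                    cost[ni][nj] = min(cost[ni][nj], c + LOW)
    return cost[n][m]


if __name__ == "__main__":
    print(phonographic_distance("ippeauttainnuz", "hypotenuse"))
```

With this toy table the misspelling is reached through seven low-cost substitutions, so it stays cheaper than a single pronunciation-altering edit.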
In addition, we established a precise quantitative 
inventory of sound-to-spelling correspondences, which, 
although absolutely necessary in any attempt to build 
efficient phonographic correctors, was sorely lacking for 
French. This collection of data has subsequently proved to
be useful to both psycholinguists and teachers (Véronis,
1986, 1988d).
Figure 1 : part of the similarity relation between substrings with French
This led us to the building of an efficient algorithm for
retrieving from a dictionary words which can be riddled with
both phonographic and typographical errors. This algorithm
is an extension to phonographic errors of the algorithm
proposed by Damerau (1964), Morgan (1970), and Durham
et al. (1983). There are two essential differences between
the latter and the algorithm that we propose. First, we try to
match the entire unknown word against a dictionary of root
forms, as we shall describe later. Secondly, we scan the
strings x and y from left to right, no longer by simply
checking at each point (i, j) that the symbols x[i] and
y[j] are the same, but rather by testing whether these
symbols constitute the beginning of any similarly-pronounced
substrings.
The problem is to find as quickly as possible the longest 
similar substrings at each point (i, j) of the analysis. We
have no room here to go into technical details, but this is 
possible using rather sophisticated methods which consist 
of pre-computing tables from the similarly-pronounced 
relation between graphemes, and storing the dictionary in a 
coded form where each character is replaced by a code 
which stands for the longest substring which begins with this 
character and can be involved in some similarly- 
pronounced relation (Véronis, 1988b).
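The coding step can be pictured with the following minimal sketch, which replaces each character of a word by the longest grapheme that begins at that position. The grapheme inventory and the function name are hypothetical, and the actual tables of Véronis (1988b) are not reproduced here.

```python
def longest_graphemes(word: str, graphemes: set) -> list:
    """For each position, return the longest known grapheme starting there
    (or the single letter when no longer grapheme applies)."""
    longest = max((len(g) for g in graphemes), default=1)
    codes = []
    for i in range(len(word)):
        best = word[i]
        # try the longest candidates first, down to length 2
        for length in range(longest, 1, -1):
            candidate = word[i:i + length]
            if candidate in graphemes:
                best = candidate
                break
        codes.append(best)
    return codes


if __name__ == "__main__":
    # hypothetical inventory of multi-letter graphemes
    inventory = {"eau", "au", "ai", "tt", "pp", "nn"}
    print(longest_graphemes("ippeauttainnuz", inventory))
```

Pre-computing such codes once, when the dictionary is stored, avoids searching for similar substrings again at every comparison.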
The restriction stipulated by Morgan (1970), and 
Durham et al. (1983) is that the unknown word must contain 
no more than one typographical mistake, since this will 
cover the large majority of cases : two typographical errors 
rarely occur in the same word (Pollock and Zamora, 1983). 
We soften this restriction by allowing one typographical error 
in the root, and another at the ending of the word, in the 
inflexion, while within a word we accept an unlimited
number of phonographic errors. Words as incorrectly spelt
as ippeauttainnuz, hipptainuz, hyothénnuse (for
hypotenuse) are perfectly recognized. This algorithm is
quite fast enough for natural-language interfaces using 
dictionaries stored in R.A.M., since the access time to the 
correct entry in a 300-word French dictionary generating
"700 inflected forms is about 25 ms with a Pascal program on 
a Macintosh II computer. The time taken hardly depends at 
all on the length of the word or on the number of 
phonographic errors it contains. Better results could be 
obtained by a more sophisticated organization of the 
dictionary (in tree form, for example). 
1) As long as x[i] and y[j] are the beginning of
similar substrings, the indexes i and j are incremented by
the lengths of the respective similarly-pronounced
substrings, and this step is repeated (Fig. 2.a).
2) When two symbols are found which do not fulfill this
requirement (Fig. 2.b), the following four hypotheses are
tested (they correspond to typographical errors) :
- the next two adjacent letters have been transposed,
- the next letter is missing (as in the example),
- the next letter has been inserted,
- the next letter has been replaced by another.
In each case, an attempt is made to match the tail substrings
according to 1), while skipping the appropriate letters (Fig.
2.c).
3) When the hypothetical root form has been
completely scanned, if some substring remains in the
unknown word, it is matched against a list of inflexions,
using the same procedure (Fig. 2.d).
Figure 2 : phonographic correction of root and inflexion
(panels a to d : the unknown word daimnttront is matched against the
root form démontr, then against the inflexion list -er, -e, -es, -ons,
-ez, -ent, -ée, -és, -ées)
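Steps 1) to 3) can be sketched as a small backtracking matcher. This is a simplified illustration, not the Pascal implementation mentioned in this paper: the tables SIMILAR, ROOTS and INFLEXIONS are hypothetical, every similar substring is tried rather than only the longest one, and candidates are ranked simply by the number of typographical errors assumed.

```python
# Hypothetical data: (misspelt grapheme, dictionary grapheme) pairs.
SIMILAR = {("ai", "é"), ("tt", "t"), ("ont", "ons")}
ROOTS = ["démontr", "trac"]
INFLEXIONS = ["er", "e", "es", "ons", "ez", "ent", "ée", "és", "ées"]


def alignments(word, target, i=0, j=0, typos=1):
    """Yield (end position in word, typos left) for each way of
    consuming the whole target."""
    if j == len(target):
        yield i, typos
        return
    # Step 1: identical letters or similarly-pronounced substrings.
    steps = []
    if i < len(word) and word[i] == target[j]:
        steps.append((1, 1))
    for g, h in SIMILAR:
        if word.startswith(g, i) and target.startswith(h, j):
            steps.append((len(g), len(h)))
    if steps:
        for di, dj in steps:
            yield from alignments(word, target, i + di, j + dj, typos)
        return
    if typos == 0:
        return
    # Step 2: the four typographical hypotheses.
    if (i + 1 < len(word) and j + 1 < len(target)
            and word[i] == target[j + 1] and word[i + 1] == target[j]):
        yield from alignments(word, target, i + 2, j + 2, typos - 1)  # transposed
    yield from alignments(word, target, i, j + 1, typos - 1)          # missing
    if i < len(word):
        yield from alignments(word, target, i + 1, j, typos - 1)      # inserted
        yield from alignments(word, target, i + 1, j + 1, typos - 1)  # replaced


def analyse(word):
    """Return the (root, inflexion) pair needing the fewest typos, or None."""
    candidates = []
    for root in ROOTS:
        for end, left_root in alignments(word, root):
            tail = word[end:]
            # Step 3: match what remains against the list of inflexions.
            for inflexion in INFLEXIONS:
                for tail_end, left_infl in alignments(tail, inflexion):
                    if tail_end == len(tail):
                        typos_used = 2 - left_root - left_infl
                        candidates.append((typos_used, root, inflexion))
    if not candidates:
        return None
    _, root, inflexion = min(candidates)
    return root, inflexion


if __name__ == "__main__":
    print(analyse("daimnttront"))
```

One typographical error is allowed in the root and one in the inflexion, while similar-substring substitutions are unlimited, which mirrors the restriction stated earlier in this section.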
III. Agreement correction
Once the right root and inflexion have been found in the
dictionary during the lexical analysis, the morphological
information (gender, number, etc.) associated with the word
is passed on to the parser, which deals with any wrong
agreements. In such a case, the parser builds various
interpretations : le triangles (the_sg. triangles_pl.) can be
corrected into les triangles (plural) or into le triangle
(singular). The problem is how to classify these
interpretations depending on their plausibility. The few
methods proposed so far (as in Richard and Lapalme, 1986)
are not satisfactory. There are in fact two classical
approaches. The first consists of favouring the interpretation
which minimizes the total number of errors. For example,
correcting le triangles rectangles (the_sg. right_pl. triangles_pl.)
into le triangle rectangle (singular) implies two errors,
whereas the correction into les triangles rectangles (plural)
implies a single error. The second approach consists of
always favouring the morphological features of fixed
syntactic categories. For example, Richard and Lapalme (1986)
propose favouring the determiner in French over the noun.
This leads, in the previous example, to a correction into le
triangle rectangle (singular). The two approaches are in
many cases, as here, contradictory. One can use a
combination of the two methods, for example by applying
the second when the first fails (same number of errors upon
each hypothesis), but this will not solve all problems. In fact,
we needed to carefully investigate the agreement
phenomena, in order to establish a weighting of errors.
Our first finding concerned the non-symmetry of errors.
People very often forget unpronounced morphological
markers but very rarely add them with no reason. Adding a
marker costs more than removing it. Therefore, the group
triangles rectangle (right_sg. triangles_pl.) should be
preferably corrected into triangles rectangles (plural).
One should also note the very important role of
pronunciation. For example, it is very unlikely that a user
might write équilatéraux (equilateral_pl.) for équilatéral
(equilateral_sg.), since the two forms do not have the same
pronunciation. Consequently, triangle équilatéraux
(equilateral_pl. triangle_sg.) should be preferably corrected into
triangles équilatéraux (plural). In addition, one can assume
that native speakers of French are unlikely to produce errors
involving the knowledge of morphological features of words
such as gender, number, person. Everybody knows that
chien (dog) is masculine and chienne (female dog) is
feminine. The difficulty is due to the transcription of
agreement markers in an orthographical system. Therefore,
errors such as chienne dressé (trained_masc. dog_fem.) should
be corrected into chienne dressée (feminine) and not into
chien dressé (masculine). The situation would be different
with non-native speakers of French, for example in a C.A.I.
system for learning French, where gender errors would be
very frequent. In this case, the weighting of errors would
have to be different.
We postulate three classes of errors with increasing 
costs. 
I. The least costly type of error consists of deleting a
marker involving no change in the pronunciation (e.g.
French triangles → triangle).
II. The second class consists of adding a marker
which entails no pronunciation change (e.g. French triangle
→ triangles).
III. The third and most costly class consists of errors
altering the pronunciation (e.g. le → la).
Some intermediate cases are distributed among these
three classes. For example, errors involving a final so-called
'mute' e (which indicates the feminine, and has an unstable
pronunciation) will belong to class II in the case of a deletion
(e.g., petite → petit = small), and class III in the case of an
addition (e.g., petit → petite).
The main point is that we cannot simply attribute an
increasing weight to each class, and add the weights when
combining phrases. It should be noted that an arbitrary
number of errors in a given class remains less costly than a
single error in the next class. For example,
les triangle rectangle et isocèle
(the_pl. right_sg. and isosceles_sg. triangle_sg.)
should be corrected into
les triangles rectangles et isocèles (plural)
with three class I errors, whereas the correction into
le triangle rectangle et isocèle (singular)
would involve a single error, but of class II.
This can be modelled by ordinal numbers : $0, 1, 2, 3, \ldots,
\omega, \omega+1, \ldots, \omega^2$, etc. (let us remember that
$\omega^i \cdot k < \omega^{i+1}, \forall k$).
Class I has costs of the form $k$, class II of the form $\omega \cdot k$, and
class III of the form $\omega^2 \cdot k$. In practice, ordinals can be coded
by integers, by choosing a sufficiently large integer $B$ (for
example 10), and mapping $\omega^n \cdot k_n + \ldots + \omega \cdot k_1 + k_0$ to
$k_n B^n + \ldots + k_1 B + k_0$. For example, $\omega^2 \cdot 2 + \omega \cdot 3 + 1$ will be
coded by 231. This coding is adopted in the Figures.
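The integer coding can be checked with a few lines of Python. The function name encode and its argument order are assumptions made for this sketch; note that the coding only preserves the ordering of costs as long as every count stays below B, which the sketch enforces.

```python
B = 10  # base chosen in the text


def encode(k2: int = 0, k1: int = 0, k0: int = 0, base: int = B) -> int:
    """Code the ordinal w^2.k2 + w.k1 + k0 as the integer k2.B^2 + k1.B + k0."""
    if not all(0 <= k < base for k in (k2, k1, k0)):
        raise ValueError("each count must lie between 0 and B - 1")
    return k2 * base ** 2 + k1 * base + k0


if __name__ == "__main__":
    print(encode(k2=2, k1=3, k0=1))            # the example of the text
    print(encode(k0=3) < encode(k1=1))         # three class I errors vs one class II
```

The second comparison corresponds to the example above : three class I errors remain cheaper than a single class II error.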
The parser conducts the various possible
morphological analyses in parallel, in order to avoid the
costly backtracking needed to repeat the analysis as soon
as an error occurs, and also to avoid the need for any
special error recovery procedure. This is achieved by
associating a vector of the costs upon each possible
morphological hypothesis with each node of the syntactic
tree. The lexicon provides these values for each word
(figure 3). For example, the word petits (small_pl.) will be
associated with the vector $[\omega, \omega \cdot 2, 0, \omega]$, which means that it
can be a mistake for :
- petit (masc. sing.) with a cost $\omega$ (adding s)
- petite (fem. sing.) with a cost $\omega \cdot 2$ (deleting mute e +
adding s)
- petits (masc. plur.) with a cost 0 (no error)
- petites (fem. plur.) with a cost $\omega$ (deleting mute e).
In addition, each word is associated with a domain, 
which consists of the only possible corrections, since many 
words have restricted morphological features. This is the 
case with most nouns: homme (man) can be only 
masculine, femme (woman) only feminine, gens (people) 
only masculine plural, but also some adjectives : enceinte 
(pregnant) can be only feminine. We represent the domains 
by hatching the forbidden part, which is coded by a special 
value in the vector.
Figure 3 : cost vectors for words (interpretation of cost vectors : the
four entries stand for masc. sing., fem. sing., masc. plur. and fem.
plur. ; examples shown : homme, hommes)
When phrases are combined during the parsing, 
domains are intersected and costs are added separately in 
each vector column in the following way : 
$$\alpha = \omega^2 \cdot k_2 + \omega \cdot k_1 + k_0$$
$$\beta = \omega^2 \cdot k'_2 + \omega \cdot k'_1 + k'_0$$
$$\alpha \oplus \beta = \omega^2 \cdot (k_2 + k'_2) + \omega \cdot (k_1 + k'_1) + (k_0 + k'_0)$$
Under the above-mentioned assumption for coding
ordinals, the addition $\oplus$ can be reduced, in practice, to the
ordinary addition of integers in base B. Therefore, the
parallel computation of the various morphological 
hypotheses is not much more expensive than the usual 
exact, non-parallel, computation. 
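The combination of cost vectors can be sketched as follows, with costs coded as integers in base B = 10 and forbidden parts of a domain coded as None. Only the vector for petits comes from the text; the vectors for les, chatte and miaule are assumptions derived from the three error classes, and the function names are hypothetical.

```python
from functools import reduce

FORMS = ("masc. sing.", "fem. sing.", "masc. plur.", "fem. plur.")

# Cost vectors in base 10; None marks a hypothesis outside the word's domain.
LEXICON = {
    "les":    [100, 100, 0, 0],     # assumed : les for le / la alters pronunciation
    "petits": [10, 20, 0, 10],      # vector given in the text
    "chatte": [None, 0, None, 1],   # assumed : feminine only, final s may be missing
    "miaule": [0, 0, 1, 1],         # assumed : plural ending may be missing
}


def combine(u, v):
    """Intersect the domains and add the costs column by column."""
    return [None if a is None or b is None else a + b for a, b in zip(u, v)]


def best_hypothesis(words):
    """Return (cost, form) for the least costly hypothesis, or None."""
    total = reduce(combine, (LEXICON[w] for w in words))
    candidates = [(cost, form) for cost, form in zip(total, FORMS) if cost is not None]
    return min(candidates) if candidates else None


if __name__ == "__main__":
    print(best_hypothesis(["les", "petits", "chatte", "miaule"]))
```

Under these assumed values the feminine plural reading is retained, since the masculine hypotheses fall outside the domain of chatte and the feminine singular reading would require a class III error on the determiner.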
The same process can be applied to the other
morphological features, persons, tenses and moods. In the 
final stage of parsing, the least costly hypothesis is chosen 
(Fig. 4, 5). If semantic constraints prove this interpretation to 
be impossible, the next hypothesis is chosen, and so on. 
This part is implemented in Prolog and calls on the Pascal 
module described in section II. 
Figure 4 : agreement correction on a (very !) erroneous sentence :
les (the, plur.) petits (small, masc. plur.) chatte (cat, fem. sing.)
miaule (mews, sing.)
IV. Conclusion
An efficient means of handling quite complex
combinations of typographical, phonographic and
agreement errors, which are frequent with C.A.I. users, is
described : a sentence as erroneous as les cotté adgassan
à l'ippeauttainuz son perpndiqulère (!) will be perfectly
recognized and translated into les côtés adjacents à
l'hypoténuse sont perpendiculaires (the legs adjacent to the
hypotenuse are perpendicular). This feature can make
interaction with systems more pleasant for non-specialists.
Figure 5 : intersection of domains (example : les (the, plur.) grands
(great, masc. plur.) homme (man, sing.) parle (talk, sing.))

REFERENCES 

CHOMSKY, N. (1965). Aspects of the Theory of Syntax.
Cambridge, Mass. : The MIT Press.

CHOURAQUI, E., INGHILTERRA, C., VÉRONIS, J. (1988).
ARCHIMEDE : un système expert d'enseignement de la
géométrie. 8th International Workshop Expert Systems
and their Applications. Avignon, France.

DAMERAU, F. J. (1964). A technique for computer detection
and correction of spelling errors, Comm. A.C.M., 7, 3,
171-176.

DURHAM, I., LAMB, D. A., SAXE, J. B. (1983). Spelling
correction in user interfaces, Comm. A.C.M., 26, 10,
764-773.

LOWRANCE, R., WAGNER, R. A. (1975). An extension to the 
string-to-string correction problem, J.A.C.M., 22, 2, 177- 
183. 

MORGAN, H. L. (1970). Spelling correction in system
programs, Comm. A.C.M., 13, 2, 90-94. 

PETERSON, J. L. (1980). Computer programs for detecting 
and correcting spelling errors, Comm. A.C.M., 23, 12, 
676-687. 

POLLOCK, J. J. (1982). Spelling error detection and
correction by a computer : some notes and a
bibliography, J. Doc., 38, 4, 282-291. 

POLLOCK, J. J., ZAMORA, A. (1983). Collection and
characterization of spelling errors in scientific and
scholarly texts, J. Am. Soc. Inf. Sc., 34, 1, 51-58.

RICHARD, D., LAPALME, G. (1986). Un système de correction
automatique des accords des participes passés.
Technique et Science Informatique, 5, 4, 307-320.

VERONIS, J. (1986). Etude quantitative sur le système
graphique et phonographique du français. European
Bulletin of Cognitive Psychology, 6, 5, 501-531.

VERONIS, J. (1987 b). Discourse consistency verification 
and many-sorted logic. Proceedings of the 10th
International Joint Conference on Artificial Intelligence. 
Milan, 633-635. 

VERONIS, J. (1987 c). Vérification de cohérence dans le
dialogue homme-machine en langage naturel. Actes du
Colloque Reconnaissance des Formes et Intelligence
Artificielle, A.F.C.E.T., Antibes, 143-158.

VERONIS, J. (1988 a). Computerized correction of 
phonographic errors. Computers and the Humanities, 
22, 1, 43-56.

VERONIS, J. (1988 b). Correction of phonographic errors in 
natural language interfaces. 11th International 
Conference on Research and Development in Information 
Retrieval. Grenoble, France. 

VERONIS, J. (1988 c). L'erreur dans le dialogue en langage 
naturel avec des systèmes experts. 8th International
Workshop Expert Systems and their Applications. 
Avignon, France. 

VERONIS, J. (1988 d). Sound-to-spelling transcription : a 
computer simulation. European Bulletin of Cognitive 
Psychology, 8, 3, [June 1988 : in press].

WAGNER, R. A., FISCHER, M. J. (1974). The string-to-string
correction problem, J.A.C.M., 21, 1, 168-173.
