Spelling Correction in Agglutinative Languages 
Kemal Oflazer and Cemaleddin G/izey 
Department of Computer Engineering and Information Science 
Bilkent University 
Ankara, 06533, Turkey 
ko@cs, bilkent, edu. tr 
1 Introduction 
Spelling correction is an important component of 
any system for processing text. Agglutinative lan- 
guages such as Turkish or Finnish, differ from lan- 
guages like English in the way lexical forms are gen- 
erated. Typical nominal or a verbal root may gener- 
ate thousands (or even millions) of valid forms which 
never appear in the dictionary. For instance, we 
can give the following (rather exaggerated) exam- 
ple from Turkish: 
uygarla~tzramayabileceklerimizdenmi~sinizcesine 1 
whose morpheme breakdown is: 
uygar -lag -tzr -area civilized +BECOME CAUS +NEG 
-yabil -ecek -let -irniz +POT +FUT +3PL +POSS-1SG 
-den -mi~ -siniz -cesine +ABL +NARR +2PL +AS-IF 
Methods developed for spelling correction for lan- 
guages like English (see the review by Kukich (Ku- 
kich, 1992)) are not readily applicable to agglutina- 
tive languages. This poster presents an approach to 
spelling correction in agglutinative languages that 
is based on two-level morphology and a dynamic- 
programming based search algorithm. After an 
overview of our approach, we present results from 
experiments with spelling correction in Turkish. 
2 Spelling correction algorithm 
Our approach comprises two-steps: (1) determining 
all the roots from the dictionary that can be the root 
of the misspelled word, and (2) generating (system- 
atically) all the possible words that "resemble" the 
given character string, from these roots. The first 
step of the problem is relatively easy because of the 
static structure of the root dictionary. Techniques 
developed for spelling correction, say, in English can 
usually be applied here. The second step involves 
producing all the possible words from the selected 
roots and requires a generate and test search proce- 
dure. 
1This is an adverb meaning roughly "(behaving) as 
if you were one of those whom we might not be able civilize." 
We denote the set of the surface forms of the roots 
in the language by R. We denote by X and Y 
strings formed using the alphabet of the language. 
X will denote the surface form of the incorrect or 
misspelled string, and Y will typically denote the 
surface string that is a (possibly partial) candidate 
word. Yle~ will denote the lexical form of this candi- 
date string? We assume the existence of a function, 
surface() to generate surface strings from lexical st- 
rings, i.e., surface(Yze~) = Y. The edit distance 
metric ed(Z,Y) (Du and Chang, 1992) measures 
how many unit operations (insertion, deletion, re- 
placement of single character and transposition of 
two adjacent characters) are necessary to convert 
one string X into another Y. 
We would like to abstract the behavior of a mor- 
phological generator and analyzer for the given lan- 
guage by two finite state automata. A finite state 
generator Mg = (P, 6, V, S, F) where P is a set of 
states, $ is the state transition function, V is the 
output alphabet, S is the starting state, and F is 
a set of final states, generates, all correctly formed 
words of the language. The transitions of Mg are 
of the form /5(pi) = pj (Pi and pj 6 P), with an 
output v~ 6 V which denotes the lexical form of 
a morpheme in the language and also labels the 
transition. It should be noted that it is possible to 
go from one state Pi to another pj by more than one 
transition, outputting a different morpheme. We say 
a string Ytez is generated by Mg, if Y)e, is formed 
by concatenating, in order, the outputs of the ma- 
chine as we traverse starting from S to one of the 
states in F. We denote by L(Mg) as the set of all 
lexical strings generated by Mg. We also assume a 
finite state recognizer Mr which recognizes whether 
a given surface string is in the language or not. 
The spelling correction problem can now be for- 
mulated as follows: Given an incorrect word X 
(rejected by Mr), a!ad an edit distance threshold 
t, find the solution set of possible correct words 
S(X,t) = {Yled(X,Y) < t and Y = surface(Yt~,) 
and Y~e, 6 L(Mg)} in viable time and space. 
2Lexical and surface are used in the two-level mor- phology sense. 
194 
The set of all the possible roots for the incorrect 
word X is defined as PR(X,t) = {r \[ ed(X\[i\],r) < 
t and 1 < i < m and r E R}. 3 We assume that 
PR(X,t) can be computed by a standard q-gram 
technique. Using a small number (3 - 5) of 2-grams, 
gives satisfactory solutions in our case. 
Assuming that we have a set of root words found 
as described above, we now have to generate words in 
the language having these roots, that do not deviate 
from the given misspelled string by more than the 
threshold. The solution requires a generate and test 
probing of the finite-state automaton Mg, starting 
with the start state S. We now have to find all the 
paths from this state to one of the final states using 
the roots in PR(X, t), so that when the morphemes 
along this path are concatenated and surface string 
is generated, it is within an edit distance t of X. 
When the search starts morphemes are concate- 
nated to root and the length of the candidate lexical 
string Yle~ increases. After one step of the search, 
the partial surface string Y is compared with a suit- 
able prefix of X. In most of the cases the candidate 
Y will deviate from these prefixes of X by more than 
the threshold without reaching a final state, so that 
we can no longer get to a viable solution. In such 
cases we do not consider any further transitions from 
that state. Special attention has to paid when a suf- 
fixation changes the surface realization of morpheme 
immediately to the left. 
3 Results from experiments with 
spelling correction in Turkish 
We first present a spelling correction example from 
our implementation where we used bigrams (q = 2), 
and we chose the number of bigrams to use for deter- 
mining the root, k, as 3 (see Figure 1). We tested 
EXAMPLE 
Misspelled t:word: qal§rnalamyla Threshold : 2 
Solutions: yam§malamyla yatl§malanyla on left edge yapl§malarlyla yakl§malarlyla 
i¢~,§malamyla qlk1§rnalarlyla 
Candidate lq.oots:46a~ qakl 6al qah qam qan 6ap gar qat qatl 
qag qak qakl§ qal qah§ ~ap qat qatl§ qav ~av qay 
Solutions: 5 Lexical Surface 
Edit distance 1 <jat+H§+mA+lArH+ylA qatl§malanyla ~ap+ H§-b rnA+lArH-bylA qapl§malamyla 
qah§-l-mA+lArH+ylA ~ah§rnalarlyla (corr.) Edit Distance 2 qav+mA+lArH+ylA qavmalamyla 
q'at +mA+IArH+ylA qatmalamyla 
Figure 1: Spelling correction example 
3X\[i\] denotes the string of the first i characters of X. 
4The duplicate entries in the list of candidate roots in fact have different part-of-speech categories andhence 
different morphota~=tics. 
5A small subset of the whole solution set is given, due to space limitations. 
Table 1: Average number of operations per mis- 
spelled word. i IR cl°en L 
Ed. is. 30.9 311.2 
2 108.4 4462.0 
Ed6peDrm" I S°ln" \] %Accuracy 
2498.4 3.6 95.1 20680.4 52.0 95.1 
our algorithm on a set of 141 randomly selected in- 
correct words from Turkish text with edit distances 
1 (86%) and 2 (14%) to their correct form. The 
morphological analyzer and generator that we used 
was our two-level specification for Turkish (Oflaser, 
1993), developed using the PC-KIMMO system. It 
is, however, rather slow and can analyze only about 
2 forms per second and can generate about 50 forms 
a second on Sun Sparcstations. So, instead of using 
timings, we counted the number of times certain ex- 
pensive operations we called. The statistics shown in 
Table 1 show the average number of morphological 
recognitions and generations, and the edit distance 
operations required, and the number of correct solu- 
tions offered per misspelled input word. The last col- 
umn indicates the percentage of cases the intended 
correct form was found. We also have developed a 
algorithm for ranking the solutions which offers the 
intended correct form as the first in 74% of the cases 
when t = 1. 
4 Conclusions 
This poster has presented a spelling correction algo- 
rithm for agglutinative languages that is based on a 
two-level morphological generator and analyzer, and 
a intelligent generate and test search procedure. 
5 Acknowledgement 
The research was supported in part by a NATO Sci- 
ence for Stability Grant, TU-LANGUAGE. 

References 

M .W. Du and S. C. Chang. 1992 A model and a fast 
algorithm for multiple errors spelling correction. 
Acta Informatica, 29:281-302. 

K. Kukich. 1992 Techniques for automatically cor- 
recting words in text. ACM Computing Surveys, 
24:377-439. 

K. Oflazer. 1993 Two-level description of Turkish 
morphology. In Proceedings of the Sixth Conference of the European Chapter of the Association 
for Computational Linguistics, April. Vol.9 No.2, 1994. 
