A Statistical Approach to Thai Morphological Analyzer* 
Kawtrakul Asanee , Thumkanon Chalathip 
Natural Language Processing and Intelligent Information System Technology Research 
Laboratory, 
Dept. of Computer Engineering, 
Kasetsart University, THAILAND 
email : ak@nontri.ku.ac.th 
Abstract 
Three nontrivial problems of Thai morphological processing are word boundary ambiguity,
tagging ambiguity and implicit spelling errors. These problems cause considerable difficulty for the
parser because they produce alternative or erroneous chains of words. This work attempts to provide a
computational solution, called Word Filtering, to these linguistic phenomena. The filtering process
calculates the probabilities of all possible chains of tagged words using a Markov Model. The most
likely sequence of tagged words is the one that maximizes the chain probability. However, it may still
be an erroneous chain that contains an implicit spelling error. Therefore, Word Filtering also includes
a scanning process that detects and corrects these errors. Both the filtering and scanning processes use
statistical information collected from a hand-tagged corpus.
The experiment has shown that word filtering can eliminate most of the alternative word
sequences. Moreover, this technique is fairly good at implicit error correction.
1. Introduction 
One of the major problems in many languages, such as Japanese, Chinese, Korean and Thai, is
word boundary ambiguity, because these languages do not have any delimiters between words. The
second problem is tagging ambiguity, which occurs when there is more than one tag for one word.
Another problem is the implicit spelling error, which occurs because some incorrect words can still be
found in a dictionary. This problem is very hard to solve with a simple approach, such as a dictionary
approach. Thai morphological analysis must face these three problems, which cause many possible
alternative and/or erroneous chains of words. These problems generate a lot of unnecessary work for
the parser. In order to simplify the parser and speed it up, three important points to bear in mind when
considering morphological processing are neat segmentation of characters into words, part-of-speech
tag selection, and implicit spelling error detection. This work attempts to provide a computational
solution, called Word Filtering, to handle those three points prior to parsing.
The proposed model of Thai morphological analysis consists of three steps: sentence segmenting,
spell checking and word filtering. Using word formation rules and a dictionary look-up algorithm, the
first step gives all possible word groups with all possible tags. If there is any explicit error, the
second step, spell checking, suggests a set of most likely words. However, an implicit spelling error
may still exist and will affect the parser. That is, the parser must search a large set of tagged word
combinations in order to choose the right one. Thus, the main goal of word filtering is to reduce the
number of useless tagged word combinations and to identify implicit spelling errors.
*A Statistical Approach to Thai Morphological Analyzer is a part of WPA (Writing Production 
Assistant) Project supported by the National Research Council of Thailand. 
The proposed Word Filtering method consists of two steps: a filtering process and a scanning 
process. The first process will try to filter out any incorrect word boundary and any unsuitable tag. The 
second process detects and corrects the implicit spelling error by generating the new words for the 
detected error. 
The basic idea of the filtering process is to calculate the probabilities of all possible chains of
tagged words by using a trigram Markov Model. The most likely sequences of tagged words are
the ones that maximize the chain probabilities. Nevertheless, they may be erroneous chains that contain
implicit spelling errors. Thus, Word Filtering also includes a scanning process to detect and correct
the error. At this step, a set of words is generated by a generating function and each is substituted for
the detected word. The most likely sequences of correct words are the ones that maximize the chain
probabilities. Both the filtering and scanning processes use statistical information collected from the
hand-tagged corpus. In experiments on a small corpus (about 10,000 sentences), the word filter
eliminated alternative word sequences and corrected implicit errors quite well.
In the following section, key problems in Thai morphological analysis are described. Then, we
present an overview of computational morphological processing in section 3. In section 4, we explain
how statistical information is used to handle word boundary ambiguity, tagging ambiguity
and implicit spelling errors. Finally, we present the conclusion and the results of the experiment.
2. Key Problems in Thai Morphological Analysis 
There are three nontrivial problems of Thai morphological processing: word boundary
ambiguity, tagging ambiguity and implicit spelling errors.
2.1 Word Boundary Ambiguity 
Thai sentences are similar to Japanese and Chinese sentences in that no blank space marks the
boundaries between words within a sentence. Additionally, most Thai words are multisyllabic.
Some of them contain more than one monosyllabic word as components. This causes word
boundary ambiguity.
Let C be a sequence of characters: C = c1 c2 c3 ... cn
Let W be a sequence of words: W = w1 w2 ... wm, where each wi is a substring cj ... ck of C
Given a stream of characters, the possible word segmentations are as follows:
[Diagram: the character stream c1 c2 c3 c4 c5 branching into alternative pairs of words w1 w2, with the branches labelled (1.1), (1.2), (2.1) and (2.2).]
As shown above, a word in the "c1c2c3c4c5" pattern has two ambiguous forms. One is "c1c2"
and "c3c4c5". The other is "c1c2c3" and "c4c5". In our corpus, more than 50% of sentences include
word boundary ambiguity.
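The enumeration of alternative segmentations can be sketched as a recursive dictionary look-up. This is only an illustration under stated assumptions: the function name `segmentations` and the toy lexicon are hypothetical, the digits "1" to "5" stand in for the abstract characters c1 to c5, and the paper's own segmenter also applies word formation rules that are not modelled here.

```python
from typing import List, Set


def segmentations(chars: str, lexicon: Set[str]) -> List[List[str]]:
    """Return every way to split `chars` into words that are all in `lexicon`."""
    if not chars:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(chars) + 1):
        head = chars[:end]
        if head in lexicon:
            # Keep this prefix as a word and segment the remainder recursively.
            for tail in segmentations(chars[end:], lexicon):
                results.append([head] + tail)
    return results


# Hypothetical lexicon: "12345" stands for the character stream c1 c2 c3 c4 c5.
lexicon = {"12", "345", "123", "45"}
for words in segmentations("12345", lexicon):
    print(" | ".join(words))
```

With this toy lexicon the two readings described above, "c1c2" + "c3c4c5" and "c1c2c3" + "c4c5", are both returned, which is why a later disambiguation step is needed.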
The assignment of a part of speech to each segmented word is also affected by word boundary
ambiguity. This causes ambiguous patterns in a sentence, as shown in the following example:
Example 1
[Thai sentence, illegible in the source]
The ambiguous patterns of the above sentence are:
a) N (the boat)  N (ox)  V (go down)  conj (because)  N (ox)  V (go down)  N (the boat)
b) N (the boat)  N (ox)  V (go down)  conj (because)  V (shake)  N (the boat)
c) N (the boat)  V (shake)  conj (because)  N (ox)  V (go down)  N (the boat)
d) N (the boat)  V (shake)  conj (because)  V (shake)  N (the boat)
In the above example, only c) and d) are the meaningful sentences. 
2.2 Tag Ambiguity 
A Thai word can have more than one part of speech. This tag ambiguity can produce a large set of
tagged word combinations. Consider the following example:
Example 2
[Diagram: a Thai sentence in which each word carries either two or four candidate tags, for example relpron (which), prep (at), cn (place), v (eat) and cl (dish). The Thai text is illegible in the source.]
The above multiple-tagged words give 1024 combinations of word chains. However, only one
word chain is correct. Figure 1 shows the tag ambiguity in our corpus. As we can see, about 95%
of the words are ambiguous with regard to the tags they take.
Number of tags   Number of words   Percentage
1 tag            130               4.24
2 tags           1750              57.0
3 tags           998               32.5
4 tags           192               6.26
Figure 1. Tag ambiguity found in a 3070-word corpus.
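The arithmetic behind these figures can be checked with a short sketch. The word counts are those of Figure 1; the per-word tag counts in `example` describe a hypothetical five-word sentence, not Example 2, and the helper name `chain_count` is an assumption made for illustration.

```python
from math import prod

# Number of words having 1, 2, 3 or 4 candidate tags (Figure 1).
words_by_tag_count = {1: 130, 2: 1750, 3: 998, 4: 192}
total = sum(words_by_tag_count.values())

# A word is ambiguous when it can take more than one tag.
ambiguous = total - words_by_tag_count[1]
ambiguous_share = 100 * ambiguous / total
print(f"{total} words, {ambiguous_share:.1f}% ambiguous")


def chain_count(tags_per_word):
    """Number of tagged word chains: the product of the per-word tag counts."""
    return prod(tags_per_word)


# Hypothetical sentence whose five words take 4, 4, 2, 2 and 4 tags.
example = [4, 4, 2, 2, 4]
print(chain_count(example), "tagged chains")
```

The product grows multiplicatively with sentence length, which is the reason the parser benefits from having most chains filtered out beforehand.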
Both word boundary and tag ambiguity increase the complexity of syntax analysis. They also
increase the amount of time used for parsing the sentences. Besides these two ambiguities, a class of
spelling errors in Thai, called implicit spelling errors, causes even more work for the parser.
2.3 Implicit Spelling Error 
Implicit spelling errors, a kind of ill-formedness commonly encountered in documents, are caused by
either carelessness or lack of knowledge. This type of error cannot be detected by simply using a
dictionary approach. There are three kinds of typing errors caused by carelessness: Missing,
Keyboard Mistyping, and Swapping, as in the following examples:
Type        Cause: carelessness   Cause: lack of knowledge
Missing     (t)his -> his         free -> fee
Mistyping   fa(t) -> far          both -> boat
Swapping    (n)o -> on            form -> from
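The three error types in the table can be sketched as simple string operations. The function names and the `neighbours` key map below are hypothetical, and the English words are the table's own illustrations; a Thai implementation would need a key map built from the actual keyboard layout.

```python
def deletions(word):
    """Missing: every string obtained by dropping one character."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def swaps(word):
    """Swapping: every string obtained by exchanging two adjacent characters."""
    return {
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
    }


def mistypings(word, neighbours):
    """Keyboard mistyping: replace one character by a neighbouring key."""
    out = set()
    for i, ch in enumerate(word):
        for alt in neighbours.get(ch, ""):
            out.add(word[:i] + alt + word[i + 1:])
    return out


# Hypothetical key map: "r" and "y" sit next to "t" on a QWERTY keyboard.
neighbours = {"t": "ry"}

print("his" in deletions("this"))              # missing character
print("far" in mistypings("fat", neighbours))  # neighbouring key
print("on" in swaps("no"))                     # swapped characters
```

Each result is itself a valid dictionary word, which is what makes the error implicit: a dictionary look-up alone accepts it.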
In the case of lack of knowledge, errors arise when unclear speech confuses the typist.
They also arise from confusion in writing, since one sound can have many written forms. See
the following example.
Example 3
[Thai sentence, illegible in the source]
I push raft down water until leg twist
(I pushed the raft into the river and twisted my leg.)
1. The ambiguous patterns caused by word boundary ambiguity are:
a) pron V N prep N conj. N N prep
b) pron V N prep N conj. N V
c) pron V V N conj. N N prep
d) pron V V N conj. N V
Example 3 (continued):
2. The erroneous chain of words caused by an implicit spelling error:
[Thai words, illegible in the source]
pron V N prep N conj. N mod
I push raft down water until leg expensive
Implicit spelling errors occur much more easily in Thai than in English and Japanese (in
Hiragana) because the errors always involve using a word that has a similar pronunciation. About
20-30% of Thai words can cause this kind of confusion for the typist. Additionally, there are two
characters on each key (see Figure 2). Thus, keyboard mistyping adds further ways of producing
implicit misspellings, which cannot be detected easily using the dictionary-based approach.
[Figure: Thai keyboard layout with two characters per key.]
Figure 2. Two-level key pads for Thai characters.
In this work, we attempt to provide a computational solution to these three
nontrivial problems in order to make the job of the parser much easier. The next section presents an
overview of the system.
3. An Overview of a Computational Morphological Processing for Thai 
A computational model consisting of word segmenting, spelling checking and word filtering
processes is proposed to handle the morphological problems mentioned earlier (see Figure 3).
[Diagram: Input Sentence -> Word Segmenting -> Spelling Checking -> Word Filtering -> The most likely sequence of tagged words, shown alongside the Morphological Knowledge base (Lexicon base, Statistical base, Word formation rules).]
Figure 3. An overview of Thai Morphological Analysis.
The input sentence is a stream of characters without explicit delimiters. Using word formation rules
and a Lexicon base look-up algorithm [KAW95(a)], the word segmenting process provides all
possible word groupings with all possible tags. If there is any explicit error, then the second step, the
spell checking process, is called to suggest a set of most likely words [KAW95(a)].
However, an implicit spelling error may still exist. In order to choose the right tagged word
combination, the word filtering process uses the statistical association among words, collected as a
statistical base, to eliminate the alternative and/or erroneous chains of words caused by word
boundary ambiguity, tagging ambiguity and implicit spelling errors. This paper concentrates only on
the word filtering process. The details of the process are discussed next.
4. Word Filtering 
Word boundary ambiguity, part-of-speech tag ambiguity and implicit spelling errors can all be
resolved by using a trigram model [CHAR 93] to calculate the probabilities of word clusters. Of the
sentences shown in Example 1, 1c) and 1d) are meaningful. In other words, the association among the
words in their chains is stronger than in 1a) and 1b). The association between words in "the boat
shake" is stronger than in "the boat ox". In Example 2, we can also find the most likely sequence of
parts of speech by considering the previous parts of speech. Since an implicit spelling error affects both
meaning and tag (for example, a verb meaning "fly" becoming a preposition meaning "on"), a special
process is needed. Consequently, word filtering consists of two processes: a filtering process used to
eliminate useless tagged word combinations, and a scanning process used to detect and correct an
implicit spelling error by generating a new set of words according to the cause of the error and
selecting the one that maximizes the probability of the word cluster. Both processes look up
statistical information collected from the hand-tagged corpus.
4.1 The Training Corpus 
The training corpus is a set of sentences divided into two groups. Each sentence in the first
group is prepared to give a context for a word that may become an implicit
spelling error, and a context for a sequence of words that has word boundary ambiguity. The
sentences in the second group are prepared to give a context for a multiple-tagged word. All of these
sentences have already been segmented and tagged. Statistical information is collected from them as a
statistical base to support both the filtering process and the scanning process. Thus, the collected
statistics emphasize not only the frequency of individual words but also the clusters of words.
4.2 Markov Model as a Statistical Model of Filtering Process 
4.2.1 Trigram Model 
A Trigram Model [CHAR 93] is utilized to calculate the probabilities of word clusters, i.e.
how the previous two words affect the probability of the next word. This is expressed in equation
(1).
$$P(w_{1,n}) = \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1}) \tag{1}$$
In order to estimate the probability $P(w_i \mid w_{i-2}, w_{i-1})$ in (1), the following equation is used:
$$P_e(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})} \tag{2}$$
where $P_e(X)$ is the estimated probability for $X$ based on some count $C$.
So to estimate the probability that $w_i$ appears after "$w_{i-2}, w_{i-1}$", we count how many times the pair
"$w_{i-2}, w_{i-1}$" appears in our corpus and how many times "$w_{i-2}, w_{i-1}, w_i$" appears, and divide the latter by the former.
Because of the sparse-data problem in the trigram model, rather than equation (1), we instead
use:
$$P(w_{1,n}) = \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1}) = \prod_{i=1}^{n} \left[ \lambda_1 P_e(w_i) + \lambda_2 P_e(w_i \mid w_{i-1}) + \lambda_3 P_e(w_i \mid w_{i-2}, w_{i-1}) \right] \tag{3}$$
Thus, we can compute better probabilities even when the relevant trigram or bigram data
are missing. The experimental results show that assigning the values 0.1, 0.3 and 0.6 to $\lambda_1$, $\lambda_2$ and $\lambda_3$
[CHAR 93], respectively, gives a satisfactory solution for Thai word sequence probability. Using
equation (3), the strength of association of words in a chain can be calculated.
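A minimal sketch of the interpolated trigram estimate is given below, assuming that the chain probability is a product of terms of the form λ1·Pe(wi) + λ2·Pe(wi | wi-1) + λ3·Pe(wi | wi-2, wi-1) with the weights 0.1, 0.3 and 0.6, and that each Pe is a ratio of corpus counts. The class name `TrigramModel`, the padding symbol `BOS` and the toy corpus of English glosses are hypothetical and are not taken from the paper's implementation.

```python
from collections import Counter

BOS = "<s>"  # hypothetical padding symbol for the sentence start


class TrigramModel:
    def __init__(self, sentences, lambdas=(0.1, 0.3, 0.6)):
        self.l1, self.l2, self.l3 = lambdas
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        self.ctx1, self.ctx2 = Counter(), Counter()  # counts of the contexts
        self.total = 0
        for words in sentences:
            padded = [BOS, BOS] + list(words)
            for i in range(2, len(padded)):
                w2, w1, w = padded[i - 2], padded[i - 1], padded[i]
                self.uni[w] += 1
                self.total += 1
                self.bi[(w1, w)] += 1
                self.ctx1[w1] += 1
                self.tri[(w2, w1, w)] += 1
                self.ctx2[(w2, w1)] += 1

    @staticmethod
    def _ratio(num, den):
        # Count-based estimate; 0.0 when the context was never seen.
        return num / den if den else 0.0

    def prob(self, w2, w1, w):
        """Interpolated probability of w given the two previous words."""
        p1 = self._ratio(self.uni[w], self.total)
        p2 = self._ratio(self.bi[(w1, w)], self.ctx1[w1])
        p3 = self._ratio(self.tri[(w2, w1, w)], self.ctx2[(w2, w1)])
        return self.l1 * p1 + self.l2 * p2 + self.l3 * p3

    def chain_prob(self, words):
        """Strength of a word chain: product of the interpolated terms."""
        padded = [BOS, BOS] + list(words)
        p = 1.0
        for i in range(2, len(padded)):
            p *= self.prob(padded[i - 2], padded[i - 1], padded[i])
        return p


# Toy corpus of English glosses standing in for segmented Thai sentences.
corpus = [["boat", "shake", "because", "shake", "boat"], ["ox", "go_down"]]
model = TrigramModel(corpus)
print(model.chain_prob(["boat", "shake"]) > model.chain_prob(["boat", "ox"]))
```

Because the unigram term is never zero for a word seen in training, a chain whose trigram and bigram were not observed still receives a small non-zero score instead of being ruled out entirely.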
In order to handle the tagging ambiguity problem, a trigram part-of-speech model is also
used [DeRose 88]:
$$P(t_{1,n} \mid w_{1,n}) = \prod_{i=1}^{n} P(t_i \mid w_i)\, P(t_i \mid t_{i-2}, t_{i-1}) \tag{4}$$
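A possible reading of the part-of-speech model is sketched below, assuming that each word contributes a lexical factor P(ti | wi) and a contextual factor P(ti | ti-2, ti-1), both estimated as plain count ratios without smoothing. The function names, the `BOS` padding tag and the toy tagged corpus are hypothetical.

```python
from collections import Counter

BOS = "<s>"  # hypothetical padding tag for the sentence start


def train_tag_model(tagged_sentences):
    """Collect the counts needed for P(t | w) and P(t | two previous tags)."""
    word_tag, word, tag_tri, tag_ctx = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = [BOS, BOS] + [t for _, t in sent]
        for i, (w, t) in enumerate(sent, start=2):
            word_tag[(w, t)] += 1
            word[w] += 1
            tag_tri[(tags[i - 2], tags[i - 1], t)] += 1
            tag_ctx[(tags[i - 2], tags[i - 1])] += 1
    return word_tag, word, tag_tri, tag_ctx


def tag_chain_prob(sent, model):
    """Product over the sentence of lexical and contextual tag probabilities."""
    word_tag, word, tag_tri, tag_ctx = model
    tags = [BOS, BOS] + [t for _, t in sent]
    p = 1.0
    for i, (w, t) in enumerate(sent, start=2):
        ctx = (tags[i - 2], tags[i - 1])
        lexical = word_tag[(w, t)] / word[w] if word[w] else 0.0
        contextual = tag_tri[ctx + (t,)] / tag_ctx[ctx] if tag_ctx[ctx] else 0.0
        p *= lexical * contextual
    return p


# Toy corpus: "dish" is tagged cn (common noun) twice and cl (classifier) once.
tagged_corpus = [
    [("I", "pron"), ("eat", "v"), ("dish", "cn")],
    [("one", "num"), ("dish", "cl")],
    [("I", "pron"), ("eat", "v"), ("dish", "cn")],
]
tag_model = train_tag_model(tagged_corpus)
as_noun = [("I", "pron"), ("eat", "v"), ("dish", "cn")]
as_classifier = [("I", "pron"), ("eat", "v"), ("dish", "cl")]
print(tag_chain_prob(as_noun, tag_model) > tag_chain_prob(as_classifier, tag_model))
```

In this toy setting the reading of "dish" as a common noun after a verb wins because the tag sequence pron, v, cl was never observed.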
Since the proposed model is intended to disambiguate both word boundaries and tags, we
use the average of the probabilities calculated by equations (3) and (4) as the strength of a chain of
tagged words and select the highest one as the most likely sequence of correct words with their tags.
For example, the strength of the word chain in 1c) (see Example 1) is higher than in 1a), while the
probabilities of the sequences of parts of speech of 1a) and 1c) are equal. Based on the average of the
strength of the word chain and the probability of the sequence of parts of speech, 1c) will be selected
as the solution of word segmentation and tagging.
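The selection step can be sketched as follows, assuming the word-chain probability and the tag-sequence probability of each candidate have already been computed. The function names and the numeric scores are hypothetical; they only mirror the situation described above, in which 1a) and 1c) have equal tag probabilities but different word-chain strengths.

```python
def chain_strength(word_prob, tag_prob):
    """Strength of a tagged word chain: average of the two probabilities."""
    return (word_prob + tag_prob) / 2.0


def filter_chains(candidates):
    """Return the labels of the candidate chains with the highest strength.

    `candidates` is a list of (label, word_prob, tag_prob) tuples.
    """
    scored = [(chain_strength(wp, tp), label) for label, wp, tp in candidates]
    best = max(score for score, _ in scored)
    return [label for score, label in scored if score == best]


# Hypothetical scores: equal tag probabilities, different word-chain strengths.
candidates = [
    ("1a", 1e-6, 1e-3),
    ("1c", 4e-5, 1e-3),
]
print(filter_chains(candidates))
```

Returning a list rather than a single label keeps ties, which matches the paper's wording "the most likely sequence(s)".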
4.2.2 Two parts of Word Filtering 
There are two parts in word filtering (see Figure 4).
[Diagram: A set of tagged word combinations -> Filtering Process -> The most likely sequence(s) of tagged words -> Scanning Process -> The implicit spelling error correction.]
Figure 4. Two Parts of Word Filtering
The first part of word filtering, i.e., the filtering process, calculates the strength of each
tagged word combination. The combination(s) that give the highest value are the most likely
sequence(s) of tagged words. In the second part, the scanning process, an implicit spelling error is
detected and corrected [KAW95(b)]. That is, the word cluster with the weakest strength is assumed to
contain an implicit spelling error. Then a new set of words, generated according to the causes of
error, is substituted for the detected word one by one. The substituted word that gives the highest
value of the strength of the word chain is taken as the correction of the implicit spelling error.
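The scanning process can be sketched as follows. Several assumptions are made for illustration: the suspect word is taken to be the last word of the weakest trigram cluster, `prob` is any function returning the probability of a word given the two previous words, and `generate` is any function proposing replacement words; the toy probability table and the English glosses from Example 3 are hypothetical.

```python
BOS = "<s>"  # hypothetical padding symbol for the sentence start


def cluster_probs(words, prob):
    """Probability of each word given its two predecessors."""
    padded = [BOS, BOS] + list(words)
    return [prob(padded[i - 2], padded[i - 1], padded[i])
            for i in range(2, len(padded))]


def chain_prob(words, prob):
    """Strength of the whole chain: product of the cluster probabilities."""
    p = 1.0
    for q in cluster_probs(words, prob):
        p *= q
    return p


def scan(words, prob, generate):
    """Replace the word in the weakest cluster by its best-scoring candidate."""
    probs = cluster_probs(words, prob)
    weakest = min(range(len(words)), key=probs.__getitem__)
    best, best_p = list(words), chain_prob(words, prob)
    for candidate in generate(words[weakest]):
        trial = list(words)
        trial[weakest] = candidate
        p = chain_prob(trial, prob)
        if p > best_p:  # keep the original unless a candidate scores higher
            best, best_p = trial, p
    return weakest, best


# Toy probabilities: unseen clusters fall back to a small constant.
table = {(BOS, BOS, "leg"): 0.5, (BOS, "leg", "twist"): 0.4}


def toy_prob(w2, w1, w):
    return table.get((w2, w1, w), 0.01)


def toy_generate(word):
    # Hypothetical generating function for similar-sounding words.
    return {"expensive": ["twist"]}.get(word, [])


print(scan(["leg", "expensive"], toy_prob, toy_generate))
```

The original chain is kept whenever no generated candidate improves the chain strength, so a correctly spelled sentence is not altered merely because one of its clusters is rare.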
5. Conclusion 
From the results of the experiment shown below, the word filter can eliminate many of the alternative
word sequences and correct implicit errors. This makes the job of the parser much easier and
speeds it up.
                                Word boundary   Tagging     Implicit spelling   Speed (for one
                                ambiguity       ambiguity   error               sentence)
Corpus-based (word filtering)   85.2%           76.6%       61.9%               4 msecs - minutes
Acknowledgment 
The work reported in this paper was supported by the National Research Council of Thailand.
Thanks are also due to Patcharee Varasai, Supapas Kurntanode, Thitipom Tharapoome and Mukda
Suktarajam for their help in preparing the training corpus, and to Puchong Uthayopat and Amarin
Deemagam for their help in completing this paper.

References 
Araki Tetsuo, Ikehara Satoru, Tsukahara Nobuyuki, Komatsu Yasunori,
"An Evaluation to Detect and Correct Erroneous Characters Wrongly
Substituted, Deleted and Inserted in Japanese and English Sentences Using
Markov Models", COLING94, 1994, pp. 187-193.
Charniak Eugene, "Statistical Language Learning", MIT Press, 1993.
DeRose, S.J., "Grammatical Category Disambiguation by Statistical
Optimization", Computational Linguistics 14, 1988, pp. 31-39.
Kawtrakul Asanee, Muangymman Parinee, Maneekanjanajing Nopparat,
"A Morphological Analyzer for Writing Production Assistant System", A
Progress Report to the National Research Council of Thailand, 1995.
Kawtrakul Asanee, "A Statistical Approach to Ambiguity Filtering in WPA
System", A Progress Report to the National Research Council of Thailand,
1995.
Shiho Nobesawa, "Segmenting a Sentence into Morphemes Using Statistic
Information between Words", COLING94, 1994, pp. 227-233.
