Automatic Corpus-Based Thai Word Extraction 
with the C4.5 Learning Algorithm 
VIRACH SORNLERTLAMVANICH, TANAPONG POTIPITI AND THATSANEE 
CHAROENPORN 
National Electronics and Computer Technology Center,
National Science and Technology Development Agency,
Ministry of Science, Technology and Environment,
22nd Floor, Gypsum Metropolitan Tower, 539/2 Sriayudhya Rd., Rajthevi, Bangkok 10400, Thailand
Email: virach@nectec.or.th, tanapong@nectec.or.th, thatsanee@nectec.or.th 
Abstract
"Word" is difficult to define in languages that
do not exhibit explicit word boundaries, such as
Thai. Traditional methods of defining words for
this kind of language depend on human
judgement based on unclear criteria or
procedures, and have several limitations. This
paper proposes an algorithm for word extraction
from Thai texts that does not rely on word
segmentation. We employ the C4.5 learning
algorithm for this task. Several attributes, such as
string length, frequency, mutual information and
entropy, are chosen for word/non-word
determination. Our experiment yields a high
precision of about 85% on both the training and
test corpora.
1 Introduction 
In the Thai language, there is no explicit word
boundary; this causes many problems in Thai
language processing, including word
segmentation, information retrieval, machine
translation, and so on. Unless there is regularity in
defining word entries, Thai language processing
will never be done effectively. Existing Thai
language processing tasks mostly rely on
hand-coded dictionaries to acquire information
about words. These manually created dictionaries
have many drawbacks. First, they cannot deal with
words that are not registered in the dictionaries.
Second, because these dictionaries are manually
created, they will never cover all words that occur
in real corpora. This paper, therefore, proposes an
automatic word-extraction algorithm, which we
hope can overcome this barrier to Thai language
processing.
An essential and non-trivial task for
languages that lack explicit word boundaries,
such as Thai, Japanese, and many other Asian
languages, is identifying word boundaries.
"Word" generally means a unit of expression that
native speakers recognize intuitively.
Linguistically, a word can be considered the most
stable unit, one which has little potential for
rearrangement and is uninterrupted as well. The
"uninterrupted" property is of particular interest
for our lexical knowledge bases.
There are many uninterrupted sequences of
words that function as a single constituent of a
sentence. These uninterrupted strings are, of
course, not lexical entries in a dictionary, but each
occurs with very high frequency. Whether they are
words or not cannot be decided consistently even
by native speakers; it depends on individual
judgement. For example, one Thai speaker may
consider 'ออกกำลังกาย' (exercise) a whole
word, while another may consider it a
compound: 'ออก' (take) + 'กำลัง' (power) + 'กาย' (body).
Computationally, it is also difficult to decide
where to separate a string into words, even
though the accuracy of recent word segmentation
using a dictionary and some heuristic methods is
reported to be high. Currently,
lexicographers can make use of large corpora and
show convincing results from experiments
over corpora. We therefore introduce a new,
efficient method for consistently extracting and
identifying a list of acceptable Thai words.
2 Previous Works 
Reviewing the previous work on Thai word
extraction, we found only the work of
Sornlertlamvanich and Tanaka (1996). They
employed the frequency of sorted character
n-grams to extract Thai open compounds: the strings
that experience a significant change in
occurrences when their lengths are extended. This
algorithm reports about 90% accuracy in Thai
open compound extraction. However, the
algorithm focuses on open compound
extraction and has to limit the range of n-grams to
4-20 grams for computational reasons. This
limits the size of the corpora and the
efficiency of the extraction.
Other work can be found in the
research on the Japanese language. Nagao et al.
(1994) provided an effective method to
construct a sorted file that facilitates the
calculation of n-gram data. But their algorithm did
not yield satisfactory accuracy; many invalid
substrings were extracted. The following work
(Ikehara et al., 1995) improved the sorted file to
avoid counting strings repeatedly. The extraction
result was better, but the determination of the
longest strings is always made consecutively from
left to right. If an erroneous string is extracted, its
errors will propagate through the rest of the input
strings.
3 Our Approach
3.1 The C4.5 Learning Algorithm 
Decision tree induction algorithms have been
successfully applied to NLP problems such as
sentence boundary disambiguation (Palmer and
Hearst 1997), parsing (Magerman 1995) and word
segmentation (Meknavin et al. 1997). We employ
the C4.5 (Quinlan 1993) decision tree induction
program as the learning algorithm for word
extraction.
The induction algorithm proceeds by
evaluating the content of a series of attributes and
iteratively building a tree from the attribute values,
with the leaves of the decision tree being the values
of the goal attribute. At each step of the learning
procedure, the evolving tree is branched on the
attribute that partitions the data items with the
highest information gain. Branches are added
until all items in the training set are classified. To
reduce the effect of overfitting, C4.5 prunes the
entire decision tree constructed. It recursively
examines each subtree to determine whether
replacing it with a leaf or branch would reduce the
expected error rate. This pruning makes the
decision tree better at dealing with data that
differ from the training data.
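The branching criterion described above can be illustrated with a minimal sketch. It assumes a binary split of a numeric attribute at a given threshold and uses hypothetical toy values; it is not Quinlan's implementation, which also supports a gain-ratio criterion and performs pruning.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(values, labels, threshold):
    """Gain from splitting the items on value <= threshold vs. value > threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    total = len(labels)
    remainder = (len(left) / total) * entropy(left) \
        + (len(right) / total) * entropy(right)
    return entropy(labels) - remainder

# Hypothetical toy data: right-entropy values and manual word/non-word tags.
re_values = [0.2, 0.5, 1.9, 2.4]
tags = ["non-word", "non-word", "word", "word"]
gain = information_gain(re_values, tags, threshold=1.78)
```

In this toy case the threshold separates the two classes perfectly, so the gain equals the full one bit of entropy in the tags.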
3.2 Attributes 
We treat the word extraction problem as a
problem of word/non-word string disambiguation.
The next step is to identify the attributes that are
able to distinguish word strings from non-word
strings. The attributes used for the learning
algorithm are as follows.
3.2.1 Left Mutual Information and Right Mutual
Information
Mutual information (Church et al. 1991) of
random variables a and b is the ratio of the
probability that a and b co-occur to the probability
that a and b would co-occur if they were
independent. High mutual
information indicates that a and b co-occur more
than expected by chance. Our algorithm employs
left and right mutual information as attributes in
the word extraction procedure. The left mutual
information (Lm) and right mutual information
(Rm) of string xyz are defined as:
$$Lm(xyz) = \frac{p(xyz)}{p(x)\,p(yz)}$$
$$Rm(xyz) = \frac{p(xyz)}{p(xy)\,p(z)}$$
where
x is the leftmost character of xyz
y is the middle substring of xyz
z is the rightmost character of xyz
p( ) is the probability function.
If xyz is a word, both Lm(xyz) and Rm(xyz) should
be high. On the contrary, if xyz is a non-word
string but consists of words and characters, either
its left or right mutual information, or both, must
be low. For example, 'กปรากฏ' ('ก', a Thai letter,
followed by 'ปรากฏ', a word meaning "appear") must
have low left mutual information.
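As a rough illustration of the two ratios, the sketch below estimates p( ) as the substring count divided by the text length. That estimator, the helper names and the Latin toy text are assumptions made for illustration, not details given in the paper.

```python
from collections import Counter

def substring_counts(text, max_len):
    """Count every substring of length 1..max_len (naive enumeration)."""
    counts = Counter()
    for i in range(len(text)):
        for n in range(1, max_len + 1):
            if i + n <= len(text):
                counts[text[i:i + n]] += 1
    return counts

def prob(counts, s, text_len):
    """Assumed estimator: occurrences of s divided by the text length."""
    return counts[s] / text_len

def left_right_mi(counts, s, text_len):
    """Return (Lm, Rm) for a string s = xyz that occurs in the text, len(s) >= 2."""
    x, yz = s[0], s[1:]      # leftmost character and the remainder
    xy, z = s[:-1], s[-1]    # everything but the last character, and the last
    p_s = prob(counts, s, text_len)
    lm = p_s / (prob(counts, x, text_len) * prob(counts, yz, text_len))
    rm = p_s / (prob(counts, xy, text_len) * prob(counts, z, text_len))
    return lm, rm

# Hypothetical toy text standing in for a corpus.
text = "abcabcabx"
counts = substring_counts(text, max_len=3)
lm, rm = left_right_mi(counts, "abc", len(text))
```

Both values come out well above 1 for "abc" here, meaning its parts co-occur more often than independence would predict.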
3.2.2 Left Entropy and Right Entropy 
Entropy (Shannon 1948) is an information
measure of the disorder of variables. The left and right
entropy are exploited as two further attributes in
our word extraction. The left entropy (Le) and right
entropy (Re) of string y are defined as:
$$Le(y) = -\sum_{x \in A} p(xy \mid y)\,\log_2 p(xy \mid y)$$
$$Re(y) = -\sum_{z \in A} p(yz \mid y)\,\log_2 p(yz \mid y)$$
where
y is the considered string,
A is the set of all letters of the alphabet,
x, z are any letters in A.
If y is a word, the letters that come before and
after y should be varied, that is, have high entropy. If y
is not a complete word, either its left or right
entropy, or both, must be low. For example, 'ปราก'
is not a word but a substring of the word 'ปรากฏ'
(appear). Thus the choices of the right adjacent
letter to 'ปราก' must be few, and the right
entropy of 'ปราก', when the right adjacent letter
is 'ฏ', must be low.
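A small sketch of the left and right entropy follows. It assumes that p(xy | y) is estimated from the neighbouring characters actually observed next to y, ignoring occurrences at the very start or end of the text; the function names and the toy text are illustrative only.

```python
import math
from collections import Counter

def _entropy(counter):
    """Shannon entropy (bits) of the distribution given by raw counts."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())

def left_right_entropy(text, y):
    """Return (Le, Re): entropy of the characters adjacent to y on each side."""
    left, right = Counter(), Counter()
    start = text.find(y)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1      # character just before y
        end = start + len(y)
        if end < len(text):
            right[text[end]] += 1           # character just after y
        start = text.find(y, start + 1)     # allow overlapping occurrences
    return _entropy(left), _entropy(right)

# Hypothetical toy text: "ab" is preceded by c, e, f and followed by d, d, g.
le, re_ = left_right_entropy("cabdeabdfabg", "ab")
```

Here the left context is more varied than the right one, so the left entropy is the larger of the two.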
3.2.3 Frequency 
Words can be expected to recur more often than
non-word strings, so string frequency is also
useful information for our task. Because the string
frequency depends on the size of the corpus, we
normalize the count of occurrences by dividing by
the size of the corpus and multiplying by the average
Thai word length:
$$F(s) = \frac{N(s)}{Sc} \cdot Avl$$
where
s is the considered string
N(s) is the number of occurrences
of s in the corpus
Sc is the size of the corpus
Avl is the average Thai word length.
We employed the frequency value as another 
attribute for the c4.5 learning algorithm. 
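The normalization can be read as follows: Sc divided by Avl roughly approximates the number of words in the corpus, so F(s) approximates occurrences per word. That reading is an interpretation, and the numbers below are hypothetical, not taken from the paper.

```python
def normalized_frequency(occurrences, corpus_size, avg_word_length):
    """F(s) = N(s) / Sc * Avl, with corpus size and word length in characters."""
    return occurrences / corpus_size * avg_word_length

# Hypothetical values: a string seen 50 times in a corpus of 1,000,000
# characters, with an assumed average word length of 4 characters.
f = normalized_frequency(50, 1_000_000, 4.0)
```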
3.2.4 Length 
Short strings are more likely to occur by chance
than long strings, so short and long strings
should be treated differently in the disambiguation
process. Therefore, string length is also used as an
attribute for this task.
3.2.5 Functional Words 
Functional words such as 'จะ' (will) and 'ก็' (then)
are frequently used in Thai texts. These functional
words are used often enough to distort the
occurrence statistics of string patterns. To filter out
these noisy patterns from the word extraction process,
we apply the discrete attribute Func(s):
$$Func(s) = \begin{cases} 1 & \text{if string } s \text{ contains functional words,} \\ 0 & \text{otherwise.} \end{cases}$$
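A minimal sketch of this attribute, assuming that "contains" means a plain substring match and using a two-word illustrative list (the paper does not give its full list of functional words):

```python
# Illustrative list only; the actual list of functional words is not given.
FUNCTION_WORDS = {"จะ", "ก็"}

def func(s, function_words=FUNCTION_WORDS):
    """Return 1 if s contains any functional word as a substring, else 0."""
    return 1 if any(w in s for w in function_words) else 0
```

For instance, `func("จะไป")` gives 1 because the string contains 'จะ', while `func("ปรากฏ")` gives 0.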
3.2.6 First Two and Last Two Characters 
A very useful step in our disambiguation is to
check whether the considered string complies with
Thai spelling rules. We use the words
in the Thai Royal Institute dictionary as spelling
examples for the first two and last two characters.
We then define the attributes Fc(s) and Lc(s) for
this task as follows.
$$Fc(s) = \frac{N(s_1 s_2 *)}{ND}$$
$$Lc(s) = \frac{N(* s_{n-1} s_n)}{ND}$$
where s is the considered string and
$s = s_1 s_2 \ldots s_{n-1} s_n$
$N(s_1 s_2 *)$ is the number of words in
the dictionary that begin with $s_1 s_2$
$N(* s_{n-1} s_n)$ is the number of
words in the dictionary that
end with $s_{n-1} s_n$
ND is the number of words in
the dictionary.
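The two ratios can be sketched with a toy word list standing in for the dictionary. The English entries and the function name are hypothetical, and a "character" here is a Python code point, which is an assumption for Thai text where vowels and tone marks are separate code points.

```python
def first_last_char_scores(s, dictionary):
    """Return (Fc, Lc): shares of dictionary words that begin with the first
    two characters of s and that end with the last two characters of s."""
    nd = len(dictionary)
    prefix, suffix = s[:2], s[-2:]
    fc = sum(1 for w in dictionary if w.startswith(prefix)) / nd
    lc = sum(1 for w in dictionary if w.endswith(suffix)) / nd
    return fc, lc

# Hypothetical toy dictionary (the paper uses the Thai Royal Institute dictionary).
toy_dictionary = ["table", "taste", "tiger", "cable"]
fc, lc = first_last_char_scores("tale", toy_dictionary)
```

A string whose opening or closing character pair never appears in the dictionary gets a score of 0 on that side, which signals a likely violation of the spelling patterns.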
3.3 Applying C4.5 to Thai Word Extraction 
The process of applying C4.5 to our word
extraction problem is shown in Figure 1. First,
we construct a training set for the C4.5 learning
algorithm. We apply the algorithm of Yamamoto et
al. (1998) to extract all strings from a plain,
unlabelled 1-MB corpus that consists of 75
articles from various fields. For practical
reasons, we select only the 2-to-30-character
strings that occur more than 2 times,
have positive right and left entropy, and conform
to simple Thai spelling rules. At this step, we obtain
about 30,000 strings. These strings are manually
tagged as words or non-word strings. The string
statistics explained above are calculated for each
string. The strings' attributes and tags are then
used as the training examples for the learning
algorithm, and the decision tree is constructed
from the training data.

Figure 1: Overview of the Process (flowchart: extracting strings from the training corpus, computing the attribute values, tagging the strings; extracting strings from the test corpus, computing the attribute values, word extraction)

Figure 2: Example of the Decision Tree (tree with tests on Re > 1.78, Lm and Func = 0, leading to the leaves "is a word" and "is not a word")
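The length and occurrence part of the candidate selection can be sketched as below. This naive enumeration is only illustrative: the paper relies on the suffix-array program of Yamamoto et al. (1998), and the entropy and spelling filters are omitted here. The function name and the toy text are assumptions.

```python
from collections import Counter

def candidate_strings(text, min_len=2, max_len=30, min_count=3):
    """Keep substrings of min_len..max_len characters that occur at least
    min_count times (min_count=3 corresponds to "more than 2 times")."""
    counts = Counter(
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_len, max_len + 1)
        if i + n <= len(text)
    )
    return {s: c for s, c in counts.items() if c >= min_count}

# Hypothetical toy text standing in for the 1-MB corpus.
cands = candidate_strings("abcabcabcab")
```

Each surviving string would then receive its attribute values and a manual word/non-word tag before being passed to the learner.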
In order to test the decision tree, another
plain 1-MB corpus (the test corpus), which
consists of 72 articles from various fields, is
employed. All strings in the test corpus are
extracted and filtered by the same process as
used for the training set. After the filtering process,
we obtain about 30,000 strings to be tested. These
30,000 strings are manually tagged so that
the precision and recall of the decision tree can be
evaluated. The experimental results are
discussed in the next section.
4 Experimental Results 
4.1 The Results 
To measure the accuracy of the algorithm, we
consider two statistical values: precision and
recall. The precision of our algorithm is 87.3% for
the training set and 84.1% for the test set. The
recall of extraction is 56% in both training and
test sets. We compare the recall of our word
extraction with the recall obtained from using the Thai
Royal Institute dictionary (RID). The recall from
our approach and from using RID are comparable,
and our approach should outperform the existing
dictionary for larger corpora. Both precision and
recall from the training and test sets are quite close.
This indicates that the created decision tree is
robust for unseen data. Table 3 also shows that
more than 30% of the extracted words are not
found in RID. These would be new entries for
the dictionary.
Table 1: The precision of word extraction

                No. of strings        No. of words      No. of non-word
                extracted by the      extracted         strings extracted
                decision tree
Training Set    1882 (100%)           1643 (87.3%)      239 (12.7%)
Test Set        1815 (100%)           1526 (84.1%)      289 (15.9%)
Table 2: The recall of word extraction

                No. of words in       No. of words      No. of words in
                the 30,000 strings    extracted by the  corpus that are
                extracted             decision tree     found in RID
Training Set    2933 (100%)           1643 (56.0%)      1833 (62.5%)
Test Set        2720 (100%)           1526 (56.1%)      1580 (58.1%)
Table 3: Words extracted by the decision tree and RID

                No. of words          No. of words      No. of words
                extracted by the      extracted by the  extracted by the
                decision tree         decision tree     decision tree
                                      which are in RID  which are not in RID
Training Set    1643 (100.0%)         1082 (65.9%)      561 (34.1%)
Test Set        1526 (100.0%)         1046 (68.5%)      480 (31.5%)
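The reported percentages follow directly from the counts in Tables 1 and 2. The short check below recomputes them; the dictionary layout and key names are illustrative.

```python
def percent(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Counts taken from Tables 1 and 2.
results = {
    "training": {"extracted": 1882, "correct": 1643, "words_in_candidates": 2933},
    "test": {"extracted": 1815, "correct": 1526, "words_in_candidates": 2720},
}

for name, r in results.items():
    precision = percent(r["correct"], r["extracted"])          # Table 1
    recall = percent(r["correct"], r["words_in_candidates"])   # Table 2
    print(name, precision, recall)
# prints:
# training 87.3 56.0
# test 84.1 56.1
```

Precision divides the correctly extracted words by all strings the tree accepted, whereas recall divides them by all words present among the roughly 30,000 candidate strings.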
4.2 The Relationship of Accuracy, Occurrence 
and Length 
In this section, we consider the relationship of the
extraction accuracy to string length and
occurrence. Figures 3 and 4 show that both
precision and recall tend to increase as
the number of string occurrences gets higher. This
implies that the accuracy should be higher for larger
corpora. Similarly, in Figures 5 and 6, the accuracy
tends to be higher for longer strings. Newly
created words and loan words tend to be
long. Our extraction therefore gives high accuracy
on these newly created words and is very useful
for extracting them.
Figure 3: Precision-Occurrence Relationship (training and test curves; x-axis: Occurrence (x100))

Figure 4: Recall-Occurrence Relationship (training and test curves; x-axis: Occurrence (x100))

Figure 5: Precision-Length Relationship (training and test curves; x-axis: Length (No. of characters))

Figure 6: Precision-Length Relationship (training and test curves; x-axis: Length (No. of characters))
5 Conclusion 
In this paper, we have applied the C4.5 learning
algorithm to the task of Thai word extraction.
C4.5 can construct a good decision tree for
word/non-word disambiguation. The learned
attributes, which are mutual information, entropy,
word frequency, word length, functional words,
and the first two and last two characters, can capture
useful information for word extraction. Our
approach yields about 85% precision and 56%
recall, which is
comparable to employing an existing dictionary.
The accuracy should be higher for larger corpora.
Our future work is to apply this algorithm to
larger corpora to build a corpus-based Thai
dictionary. We also hope that our approach will be
successful for other languages without explicit
word boundaries.
Acknowledgement 
Special thanks to Assistant Professor Mikio 
Yamamoto for providing the useful program to 
extract all substrings from the corpora in linear 
time. 

References 

Church, K.W., Robert L. and Mark L.Y.
(1991) A Status Report on ACL/DCL.
Proceedings of 7th Annual Conference of
the UW Centre New OED and Text
Research: Using Corpora, pp. 84-91

Ikehara, S., Shirai, S. and Kawaoka, T. (1995)
Automatic Extraction of Uninterrupted
Collocations by n-gram Statistics. Proceeding of
The first Annual Meeting of the Association for
Natural Language Processing, pp. 313-316 (in
Japanese)

Magerman, D.M. (1995) Statistical decision-tree
models for parsing. Proceeding of 33rd
Annual Meeting of Association for Computational
Linguistics

Meknavin, S., Charoenpornsawat, P. and Kijsirikul, B. 
(1997) Feature-based Thai Word Segmentation. 
Proceeding of the Natural Language Processing 
Pacific Rim Symposium 1997, pp. 35-46 

Nagao, M. and Mori, S. (1994) A New Method of
N-gram Statistics for Large Number of n and
Automatic Extraction of Words and Phrases from
Large Text Data of Japanese. Proceeding of
COLING 94, Vol. 1, pp. 611-15

Palmer, D.D. and Hearst, M.A. (1997) Adaptive
Multilingual Sentence Boundary Disambiguation.
Computational Linguistics Vol. 27, pp. 241-267

Quinlan, J.R. (1993) C4.5 Programs for Machine
Learning. Morgan Publishers, San Mateo,
California, 302 p.

Shannon, C.E. (1948) A Mathematical Theory of
Communication. Bell System Technical Journal
27, pp. 379-423

Sornlertlamvanich, V. and Tanaka, H. (1996) The
Automatic Extraction of Open Compounds from
Text. Proceeding of COLING 96 Vol. 2, pp. 1143-
1146

Yamamoto, M. and Church, K.W. (1998) Using Suffix
Arrays to Compare Term Frequency and
Document Frequency for All Substrings in Corpus.
Proceeding of Sixth Workshop on Very Large
Corpora pp. 27-37
