The Automatic Extraction of Open Compounds from Text 
Corpora 
Virach Sornlertlamvanich and Hozumi Tanaka 
Department of Computer Science, Tokyo Institute of Technology 
2-12-1, ()okayama, Meguro-ku, ~t~okyo, .Japan 152 
{virach, t anaka}0cs, titech, ac. jp 
Abstract 
This paper describes a new method 
for extracting open compounds (unin- 
terrupted sequences of words) from text 
corpora of languages, such as Thai, 
Japanese and Korea that exhibit unex- 
plicit word segmentation. Without ap- 
plying word segmentation techniques to 
the inputted plain text, we generate n- 
gram data from it. We then count the oc- 
currence of each string and sort them in 
alphabetical order. It is significant that 
the frequency of occurrence of strings 
de, creases when the window size of ob- 
servation is extended. From the statis- 
tical point of view, a word is a string 
with a fixed pattern that is used repeat- 
edly, meaning that it; shouht occur with 
a higher frequency than a string that is 
not a word. We observe the variation 
of frequency of the sorted n-gram data 
and extract the strings that experience a 
significant (:hange in frequency of oc(:ur- 
rence when their length is extended. We 
apply this occurrence test to both the 
right and left hand sides of all strings 
to ensure the accurate detection of both 
boundaries of the string. The method 
returned satisfying results regardless of 
the size of the input file. 
1 Introduction 
This paper discusses a method automatic extrac- 
tion of candidates for open compound registra- 
tion. An open compound refers to an uninter- 
rupted sequence of words, generally functioning 
as a single constituent (Smadja and McKcown , 
1990). We propose a new method of extraction 
for languages which haw~ no specific use of punc- 
tuation to signify word boundaries. Our method 
is applied to n-gram text data using statistical ob- 
servation of the change of frequency of occurrence 
when the window size of string observation is ex- 
tended (character) cluster-wise. We generate both 
rightward and the leftward sorted n-gram data, 
then determine the left and right boundaries of a 
string using the methods of competitive ,selection 
and unified selection. In this paper, we examine 
the result of applying our medlod to Thai tex~ cor- 
pora and also introduce conventional Thai spelling 
rules to avoid e, xtracting invalid strings. 
Previous work (Nagao et al., 1994:) has shown 
all effective way of constructing a sorted file for 
tile efficient calculation of n-gram data. However a 
surprisely large numbe, r of invMid strings were also 
extracted. Subsequent work, (Ikehara et al., 1995) 
has extended the sorted file to avoid repeating the 
counting of substrings contained in strings already 
counted. This meant the extraction of only the 
longest strings in the order of frequency of occur- 
rence. The result of extraction was improved as 
a result, but the deterinination of longest strings 
is always made consecutively from left to right. If 
an erroneous string is extracte, d, its error directly 
propagates through the, rest of input. It is possible 
that a string with an invalid starting pattern will 
be extracted because a string too long in character 
length has been extracted previously. 
In the following sections, we firstly describe the 
necessity for making this statistical ol)servad(m 
for extracting open comtlounds from Thai text 
corpora. Then, the methodology of data preparw 
tion and open compound extraction is explained, 
Finally, we discuss the result of an experiment on 
both large and small test corpora to investigate 
the effectiveness of our method. 
2 Problem Description 
It is a non-trivial task to identify a word ill the 
text of a language which has no specific punctua- 
tion to mark word boundaries. Up to the present, 
lexicographers' efforts have been inhibited by in- 
sufficient corpora and limited computational fa- 
cilities. Almost all lexicon knowledge bases have 
been created with reliance oll human intuition. \]\]1 
recent years, a large amount of text corpora haw', 
become available, and it is now becoming possible 
to conduct more rigorous experiments on text cor- 
pora. We address the following problems in such 
1143 
a way that they are able to be solved by the way 
of statistical' methods. 
1. There is no good evidence to support the 
itemization of a word in a dictionary. In 
traditional dictionary making, lexicographers 
have had to rely on citations collected by hu- 
man readers from limited text corpora. More 
rare words rather than common words are 
found even in standard dictionaries (Church 
and Hanks , 1990). This is the problem in 
making a lexical entry list in dictionary con- 
struction. 
2. It is hard to decide where to segment a string 
into its component words. It is also hard 
to enumerate words from a text, though it 
is reported that the accuracy of recent word 
segmentation methods using a dictionary and 
heuristic methods is higher than 95% in case 
of Thai (Virach , 1993). The accuracy de- 
pends mostly on word entries in the dictio- 
nary, and the priority for selecting between 
candidate words when there is more than one 
solution for word segmentation. This is the 
problem in assigning priority information for 
selection. 
3 Word Extraction from Text 
Corpora 
We used a window size of 4 to 32 for n-gram 
data accumulation. The value is arbitrary but 
this range has proven sufficient to avoid collect- 
ing illegible strings. 
3.1 Algorithm 
Define that, 
\]a I is the number of clusters ~ in the string 'a', 
n(a) is the number of occurrences of the string 
~&', &ud 
n(a+l) is the number of occurrences of the 
string 'a' with one additional cluster added. 
As the length of a string increases the number 
of occurrences of that string will decrease. There- 
fore, 
+ 1) < (1) 
For the string 'a', n(a+l) decreases significantly 
from n(a) when 'a' is a frequently used string in 
contrast to 'a+l'. From this, it can be seen that 
'a' is a rigid expression of an open compound when 
it satisfies the condition 
'n(a + 1) << n(a). (2) 
In such a case, 'a' is considered a rigid expression 
that is used frequently in the text, and 'a+l' is 
just a string that occurs in limited contexts. 
1The smallest stand-alone character unit as by the 
spelling rules. 
Since we count the occurrence of strings gener- 
ated from an arbitrary position in tile text, with 
only the above observation, only the right end po- 
sition of a string can be assumed to determined a 
rigid expression. To identify the correct starting 
position of a string, we apply the same observation 
to the leftward extension of a string. Therefore, 
we have to include the direction to the string ob- 
servation. 
Further define that, 
+a is the right observation of the string 'a', 
and 
-a is the left observation of the string 'a'. 
Then, 
n(-t-a+l) is the number of occurrences of the 
string 'a' with one cluster added to its right, 
and 
n(-a+l) is the number of occurrences of the 
string 'a' with one cluster added to its left. 
Following the same reasoning as above, we will 
obtain, 
n(+a + 1) < n(a), and (3) 
+ 1) < (4) 
A string 'a' is a rigid expression if it satisfies the 
following conditions, 
n(+a + 1) << n(a), and (5) 
n(-a + 1) << n(a). (6) 
3.2 Data preparation 
Following are the steps for creating n-gram text 
data according to the fundamental features of 
Thai text corpora. The results are shown in Ta- 
ble 1 and Table 2. In each table, "n" is the number 
of occurrences and "d" is the difference in occur- 
rence with the next string. 
1. Tokenize the text at locations of spaces, tabs 
and newline characters. 
2. Produce n-gram strings following Thai 
spelling rules. Only strings that have possi- 
ble boundaries are generated, and their oc- 
currence counted. For example, shifting a 
string from 'a+6' to 'a+7' in the Table 1, 
the string at 'a+7' is '~z~t~ff.~' and 
not 'fl~g~|~'l.lfll~l\], despite the first char- 
acter after 'a+6' being '~'. According to 
o/ the Thai spelling rules, the character ' ' can 
never stand by itself. It needs both of an ini- 
tial consonant and a final consonant. We call 
this three character unit a cluster. 
3. Create both rightward (Table 1) and leftward 
(Table 2) sorted strings. The frequency of 
each string is the same but the strings are 
lexically reversed and ordered based on this 
reversed state. 
1144 
4. Calculate the diiference between the occur- 
renee of adjoining strings ill the sorted lists. 
Let {t(a) be the difference wdue of the string 
'a', then 
d(~) = ~(~) - n(~ + n. (7) 
The difference w~lue (d) is generated sepa- 
rately for the rightward and legward sorted 
string ta.bles. 
The occurrences (n) ill both Tal}le 1 an{l Table 2 
apparently SUl}port the conditions (3) all{\[ (4). 
Strin----~g- Rightward sorted string \[--n-~d-- 
a u~. \[ 5Ta- 68 
a+l tl~UTl~ \] 445 (} 
I a+2 i~a~l~\] I 445 0 
a+3 i~z~aa I 445 42 
a+4 II i g TI ~i l,~il I 1303 0 
a+5 ~ig~liflll } 303 22 
a-k6 ill g~ll l,~tll~a I 281 0 
a+7 ~l~a~l~aa~Igaffa I 281 274 
a+8 ,~uu,aa,~affa6~ I 7 0 
Sorted String Table 1: Exami)le of a Rightward 
'Pable 
-b 
-b+l I 
-b+2 I 
-b+3 I 
-1}+4 I 
-b+5 I 
-b+6 I 
-b+7 I 
Leftward sorted string -7 
\].r2 --ol 
172 0 
172 421 
130 
121 
I \].21. 7 
114 107t 
"7 01 
Table 2: Example of a Leftward Sorted String Ta- 
ble 
3.3 Extraction 
3.3.1 Competitive selection 
According to condition (5) the string %' ( a~ un ) 
in Table 3 is considered an open compound be- 
cause the difference of betweml n(a) and n(a+l) 
is as high as 450. However, 'a~u~l' is an illegible 
string and cannot be used on as indivi{lual basis 
in general text. Observing tile same string :a' in 
Table 1, the difference between n(a) and n(a+l) 
is only 68. It is not comparably high enough to 
be selected. Therefore, we have to determine the 
minimum wflue of the difference when there is 
more than one branch extended from a string. 
........ String_ Rightward sorted string 1 _ {t 
a a~gw 51~ 45{\] 
/ 
Table 3: A Further Example of the Count of a 
\]lightwm'd Sorted String Tal)le 
3.3.2 Unified selection 
in Figure 1, we ob- 
tain the string '~o~ ~ ~1~ a,lnl~a~' 1)y observing 
the significant change in d just before the next 
string '~l~u~l~:l.l~lÂ~i~a{fi' The string could 
be wrongly selected if we do not observe its be- 
haviour ill the leftward sorted string table, to 
determine tit(', correct left boundary. Thus, we 
(}bserve tile count of string '~itlSg~l~\],ltll~li~,~' 
when it is extended leftward, as shown ill Figure 2. 
0 20 40 60 80 100 120 140 160 180 
Figure 1: Rightward Sorted Strings Starting from 
an Arbitrary String 
0 20 40 60 
~ | 4\] 1114~ II 4 
88 1013 120 140 160 180 
Figm'e 2: Leftward Sorted Strings Starting fl'om 
an Arbitrary String 
By unifying the results of both methods of 
the observation, we iinally obtain tile word 
1145 
4 Experimental Results 
We have applied our method to an actual Thai 
text corpora without word segmentation pre- 
processing. 
4.1 Natural language data 
We selected 'Thai Revenue Code (1995)', as large 
as 705,513 bytes, and 'Convention for Avoidance 
of Double Taxation between Thailand and Japan', 
which has a smaller size of 40,401 bytes. The pur- 
pose is to show that our method is effective re- 
gardless of the size of the data file. 
4.2 Results of extraction 
g 
o~ 
g 
0~ 
o 
13_ 
I~ Word E~ Fixed • String| Expression Illegible 
100% 
8O% 
60% 
48% 
20% 
0% 
100 90 80 70 60 50 40 30 20 10 
Theshold level of the value of Difference (O) 
Figure 3: Result of Extraction of 'Thai Revenue 
Code (705,513 bytes)' 
100% 
,~ (\]0% 
L 
60% - 
4o~- 
20% 
~L 
0% 
3O 
I l a Word \[\] Find Expressian •lllegble Strinq 
20 10 8 B 4 
Threshold level of the value of Difference (D) 
Figure 4: Result of Extraction of 'Convention for 
Thailand-Japan (40,401 bytes)' 
The results of extraction examined in both large 
and small file sizes are very satisfactory. Very few 
illegible strings are extracted though the thresh- 
old of the difference value is set to be as low as 
10 in Figure 3, and 4 in Figure 4. The suitable 
value of the threshold of difference varies with the 
size of text corpus file. To obtain more mean- 
ingful strings fl'om a large file, we have to set a 
relatively high threshold of extraction. One of the 
advantages of our method is that there is an inher- 
ent trade~off between the quantity and the quality 
of the extracted strings. In the case of Figure 3, 
to limit the amount of illegible strings to not ex- 
ceed 15% of the total extracted strings, we set 
the threshold to 30. As a result, we obtained 154 
words, 114 fixed expressions and only 46 illegible 
strings. Furthermore, we found that of the 154 
words appearing in the text, there were 84 words 
that were not found in a standard Thai dictionary. 
5 Conclusion 
This paper has shown an algorithm for data prepa- 
ration and open compound extraction. The corn- 
petitive selection and unified selection of rightward 
and leftward sorted strings play an important role 
in improving accuracy of the extraction. In the ex- 
periment, we applied Thai spelling rules to restrict 
the search path for string counts. Some types of 
spelling irregularities can be excluded by this pro- 
cess. By adjusting the value of threshold, we can 
extract suitable entries for open compound regis- 
tration regardless of the size of the input file. Fur- 
thermore, our method has ensured the extraction 
of new words from the text file of the language 
that has no explicit word boundary, such as Thai. 
References 
Church, K. W. and Hanks, P. 1990. Word Asso- 
ciation Norms, MutuM Information, and Lexicog- 
raphy. Computational Linguistics, Vol.16, No.l, 
pages 22-29. 
Ikehara, S., Shirai S. and Kawaoka, T. 1995. Auto- 
matic Extraction of Uninterrupted Collocations by 
n-gram Statistics. Proceedings of The first Annual 
Meeting of The Association for Natural Language 
Processing, pages 313-316 (in Japanese). 
Nagao, M. and Mori, S. 1994. A New Method of 
N-gram Statistics for Large Number of n and Auto- 
matic Extraction of Words and Phrases h'om Large 
Text Data of Japanese. Proceedings of COLING 9It, 
Vol.1, pages 611-615. 
Smadja~ F. A. and McKeown, K. R. 1990. Auto- 
maritally Extracting and Representing Collocations 
for Language Generation. Proceedings of ACL-90, 
pages 252-259. 
Sornlertlamvanich, Virach. 1993. Word Segmenta- 
tion for Thai in Machine Translation System. Ma- 
chine Translation, National Electronics and Com- 
puter Technology Center, (in Thai). 
1146 
