TEXT PROCESSING OF THAI LANGUAGE 
=THE THREE SEALS LAW= 
Shigeharu Sugita 
National Museum of Ethnology 
Expo Park, Senri, Suita 
OSAKA 565, Japan 
Abstract 
Computer software for processing the Thai language has been
developed at the National Museum of Ethnology, Osaka, Japan. We use
a popular intelligent terminal, the TEKTRONIX 4051, for inputting and
editing, an IBM 370 model 138 for KWIC making and sorting, and
CANON's laser beam printer for final output.
Using these systems, "Kotmai Tra Sam Duang" (the Three Seals Law),
which contains many kinds of laws and ordinances proclaimed in Thai
between 1350 and 1805 A.D., has been computerized. The text has 1700
pages and about 1400000 letters. The KWIC index runs to about 200000
lines.
Some statistical data for this text have been obtained: the
occurrence frequencies of single letters, compound vowels, and letter
combinations (digrams).
Acknowledgements
This report is a result of a joint project at the National Museum of
Ethnology. The members are Y.Ishii, I.Akagi, S.Tanabe, Y.Sakamoto,
S.Uemura, A.Ishizawa, M.Sawamura, K.Sasaki, Y.Kurita, and S.Sugita.
Their research fields are ethnology, linguistics, computer science,
sociology, etc.
We thank Mr. Sophon Chitthasatcha, Miss Sumalee Maungpaisaln and
Miss Hiroe Matsumoto for their help in segmentation, inputting and
correction.
We also thank Prof. K.Nakayama and A.Oikawa of Tsukuba University
for their support in making the Thai letter patterns and the output
software for the laser beam printer.
Introduction 
In the field of ethnology or cultural anthropology, ethnographies
are very important information sources for the comparative study of
many different societies. Not only bibliographic data but also the
contents of the texts are necessary.
HRAF (Human Relations Area Files), which was developed by Dr.
Murdock and is now managed by HRAF Inc. at Yale University, is a
unique retrieval system. It uses about 800 category codes by which
analysts classify the contents of each page of each book.
Though the HRAF system is an elaborate work, it is not easy to
search for the necessary data by user terms, that is, natural words.
If the whole text is fed into a computer, it is very easy to retrieve
any part of the text by the same natural words used in the text.
An on-line retrieval system is smart and effective. But sometimes a
researcher wants a printed index such as KWIC, which is usable at any
time and place. Combining a KWIC index with a thesaurus dictionary
gives us a very powerful tool for searching for special expressions
hidden in the text.
Until quite recently, at least in Japan, most computer processing
of natural language was devoted to Indo-European languages or
Japanese. In ethnological studies we must treat many areas of the
world. We need computer software which processes languages and
scripts unfamiliar to us, such as Arabic, Korean, Sumerian,
Mongolian, Devanagari, Thai, etc.
The National Museum of Ethnology at Osaka has introduced several
computer systems to encourage studies in the humanities, and is now
developing application software which is usable by researchers who
do not know computer programming or how to use a computer.
This report describes one such application, which treats Thai
letters. The points of our work are as follows:
1) A popular computer terminal is used for Thai letter inputting and
editing. It is easy to use because dead key operation is not
necessary.
2) The KWIC making and sorting software is implemented in FORTRAN,
so it can be transferred to other computer systems. The algorithm is
not very complex; such software had not been implemented before only
because these languages are not widely processed by computer.
3) Statistical data of the text are obtained: the occurrence
frequencies of single letters, compound vowels, and letter
combinations. These data will serve as contextual data for OCR.
Segmentation
There are no segmentation problems in the case of Indo-European
languages, because they have clear separators for word units, such
as the space or comma. There are, however, many languages in Asia
which have no clear separator, among them Korean (Hangul), Chinese,
Japanese, and Thai. The examples in Fig.1 show that several different
segmentations can exist for the same string. Segmentation affects
both the meaning of a sentence and retrieval efficiency.
Fig.1 Examples of different segmentations
Cutting into long units saves effort, but it is then difficult to
search for a string contained inside a unit. Cutting into short
units is effective for searching, but too many keywords appear.
The text, the Three Seals Law, has no word separators, as shown in
Fig.3. So it is necessary to segment it into appropriate units
before making the KWIC index. But this is a difficult problem,
because segmentation needs a good understanding of the meaning,
which in turn needs the KWIC index.
We adopted a practical method: at first the text is cut into long
units, and then it is cut again after looking at the KWIC index.
Inputting 
Terminal 
We use a popular intelligent graphic terminal, the TEKTRONIX 4051,
which has the usual alphabetic keyboard. We stuck Thai letter labels
on the side of each key so that the keyboard looks like that of a
Thai typewriter. A code table of Thai letters and the corresponding
English letters is shown in Table 1.
The characteristics of this terminal are:
1) It generates Thai letter patterns by a BASIC program in graphic
mode, so the user can confirm the letter he typed.
2) It has a local cassette memory, so that the user can input and
edit data at any time, even when the host computer is not working.
3) Stored data can be transmitted over a communication line to the
host computer for time-consuming work.
4) It is easy to implement a flexible Thai language editor, which
accepts alphabetic commands and displays Thai letters.
5) A copy of the screen can be taken by the hard copy unit attached
to it.
Rules for text inputting 
The text has many irregular expressions, so the following
expedients are adopted.
1) Words or phrases quoted from the Pali language are skipped; a
special symbol is inserted to indicate that words have been skipped.
Fig.2 Flow diagram of KWIC making (stages legible in the diagram:
original text, 1700 pages; segmentation, 200000 units; inputting,
1400000 strokes; correction with the Thai editor; re-segmentation;
statistical data; sorting; printer)
2) Tables are skipped.
3) Special expressions for money, dates, and fractional numbers are
transformed into sentence form.
4) Vertical expressions, as shown in Fig.3, are marked with a
special symbol before and after the word.
5) Parallel expressions in the middle of a line, and tree-like
expressions, are transformed into a linear form from which the
original form can be reconstructed as far as possible.
Order of input 
The order in which Thai letters are input to the computer is the
same as the order in which one would strike the keys of a
typewriter.
Fig.3 Examples of text and inputted form (panels (A) to (C) show
the original text and (a) to (c) the inputted form; the example text
is from Vol.4, page 259)
Correction 
Thai editor 
A line editor for Thai text is implemented on the TEKTRONIX 4051
terminal. Commands are English-like terms, and the Thai text is
displayed in Thai letters.
The editor assumes that the text is addressed by volume number,
page number, and line number.
ENTER THE VOLUME NUMBER = v
           ;specify volume number
*PAGE,N    ;specify page number. Until the
            next PAGE command, this page
            is held in memory.
*LADD XX   ;XX is added to this page as
            the last line
*LINS,M XX ;XX is inserted as a new line
            after line number M
*LDEL,M    ;line number M is deleted
            from this page
*SHOW,M    ;the string of line number M is
            displayed in Thai letters
*LGET,M    ;line number M becomes the object
            to be edited by the following
            sub-commands
*ADB XX    ;XX is added to the beginning
            of line M
*ADE XX    ;XX is added to the end
            of line M
*DEL XX    ;string XX is deleted from
            line M. If there are several
            XX's in line M, their position
            numbers are displayed. Enter
            the corresponding number after
            the prompt "which?"
*INS XX BEFORE YY
           ;string XX is inserted before
            string YY. If there are several
            YY's, type the corresponding
            number after the prompt "which?"
*REP XX BY YY
           ;string XX is replaced by YY.
            If there are several XX's,
            type the corresponding number
            after the prompt "which?"
*SEE       ;the three letters before and
            after the changed part are
            displayed
*PART,0    ;the first five letters of
            line M are displayed
*PART,100  ;the last five letters of line M
            are displayed
*PART,K    ;five letters from the K-th
            position are displayed
*END       ;the editing session is completed
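The behaviour of the sub-commands can be illustrated with a small sketch. It is not the BASIC program that runs on the terminal; the class name LineEditor, the method names, and the use of an exception to stand in for the "which?" prompt are all assumptions made for illustration. Positions are 1-based, as the displayed position numbers suggest.

```python
class LineEditor:
    """Toy model of the sub-commands applied to the line chosen by LGET."""

    def __init__(self, line):
        self.line = line

    def adb(self, text):
        # *ADB XX: add XX to the beginning of the line
        self.line = text + self.line

    def ade(self, text):
        # *ADE XX: add XX to the end of the line
        self.line = self.line + text

    def positions(self, text):
        """Return the 1-based start position of every occurrence of text."""
        found = []
        start = self.line.find(text)
        while start != -1:
            found.append(start + 1)
            start = self.line.find(text, start + 1)
        return found

    def _pick(self, text, which):
        # Resolve one occurrence; an ambiguous request plays the role
        # of the "which?" prompt by reporting the candidate positions.
        hits = self.positions(text)
        if not hits:
            raise ValueError(f"{text!r} not found")
        if which is None:
            if len(hits) > 1:
                raise LookupError(f"which? {hits}")
            which = hits[0]
        if which not in hits:
            raise ValueError(f"{text!r} does not start at position {which}")
        return which - 1  # 0-based index into the line

    def delete(self, text, which=None):
        # *DEL XX
        i = self._pick(text, which)
        self.line = self.line[:i] + self.line[i + len(text):]

    def ins(self, text, before, which=None):
        # *INS XX BEFORE YY
        i = self._pick(before, which)
        self.line = self.line[:i] + text + self.line[i:]

    def rep(self, old, new, which=None):
        # *REP XX BY YY
        i = self._pick(old, which)
        self.line = self.line[:i] + new + self.line[i + len(old):]
```

In this sketch a caller would first try a command without a position, and on LookupError repeat it with one of the reported positions, which mirrors answering the prompt.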
Fig.4 Examples of editing (a session on volume 5 using the PAGE,
LGET, REP, INS, ADE, SHOW, LINS, PART and SEE commands)
KWIC making 
The most obvious complication is the fact that in Thai writing as
many as three separate characters can appear at the same horizontal
position, in four different vertical positions. Therefore the number
of letters to take as the preceding or following context must be
counted carefully.
As an index to every unit, the volume number, page number and line
number are attached at the left side.
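The counting problem can be sketched as follows. This is only an illustration under modern assumptions, not the FORTRAN program: it assumes the text is held as Unicode Thai, where the upper and lower signs are non-spacing marks (category "Mn"), whereas the system described here uses its own keyboard codes. The function names cells and kwic are hypothetical.

```python
import unicodedata


def cells(text):
    """Group each base letter with the upper/lower marks stacked on it,
    so that one list element corresponds to one horizontal position."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) == "Mn":
            out[-1] += ch      # mark sits above or below the previous letter
        else:
            out.append(ch)     # new horizontal position
    return out


def kwic(units, width=5):
    """Return (left context, keyword unit, right context) for every unit,
    with the context length measured in horizontal positions."""
    rows = []
    for i, unit in enumerate(units):
        left = cells("".join(units[:i]))[-width:]
        right = cells("".join(units[i + 1:]))[:width]
        rows.append(("".join(left), unit, "".join(right)))
    return rows
```

Measuring the context in raw letters instead would cut a stacked group in the middle, for example keeping a tone mark but losing the consonant that carries it.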
Sorting 
The sorting algorithm for Thai words is not as simple as that for
English.
Computer algorithm 
1) Every occurrence of a pre-positioned vowel is moved to the
position immediately following the consonant it precedes.
2) Diacritic symbols are moved to the end of the word, with an
indication of their position counted from the end of the word.
3) Each letter is replaced by the code given in Table 1.
4) Then two words are compared as if they were numerals.
Two example words (the Thai spellings are illegible here) receive
the numeric keys 08567146000103 and 15571500020300.
We ignored step 2), because our segmentation units are not
necessarily words, so it would not work effectively.
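Steps 1), 3) and 4), with step 2) left out as described, can be sketched as below. Two assumptions should be noted: the text is taken to be Unicode Thai, with the pre-positioned vowels at U+0E40 to U+0E44, and because the Table 1 codes are not reproduced here, the two-digit code of a letter is taken from its offset in the Unicode Thai block as a stand-in. The function names reorder and sort_key are hypothetical.

```python
# Vowels written before the consonant they belong to (assumed set).
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")


def reorder(word):
    """Step 1: move each pre-positioned vowel behind the next letter."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2          # skip the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


def sort_key(word, digits=40):
    """Steps 3 and 4: replace each letter by a two-digit code and pad
    with zeros so that keys compare like numerals of equal length.
    The code used here is a stand-in, not the Table 1 code."""
    codes = "".join(f"{ord(c) - 0x0E00:02d}" for c in reorder(word))
    return codes.ljust(digits, "0")
```

Sorting a list of units then reduces to sorted(units, key=sort_key). A unit that begins with a pre-positioned vowel files under its consonant, which a plain comparison of the typed letter order would not do.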
Table 1 Code table of Thai letters
(Each Thai letter is listed with the keyboard character used to type
it and a two-digit code used for sorting; the individual entries are
illegible here.)
Statistical data 
The total number of letters in the machine-readable text is
1362602, which includes special symbols such as the separator, skip
symbol, comma, etc. The total number of lines is 29582. Table 2
shows the occurrence frequency of each letter. Table 3 shows the
occurrence frequency of compound vowels. The frequencies of
combinations of two letters are listed in Table 4, in order from the
highest frequency. The combination is taken as shown below.
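As a rough illustration of how such counts can be gathered (this is not the authors' program), the sketch below tallies single letters and two-letter combinations. It assumes that a combination is simply two letters adjacent in input order within one line, which may differ from the pairing rule used for Table 4. The function names are hypothetical.

```python
from collections import Counter


def letter_counts(lines):
    """Occurrence frequency of every single letter or symbol."""
    return Counter(ch for line in lines for ch in line)


def digram_counts(lines):
    """Occurrence frequency of every pair of adjacent letters.
    Pairs are not formed across line boundaries."""
    counts = Counter()
    for line in lines:
        counts.update(a + b for a, b in zip(line, line[1:]))
    return counts
```

Calling most_common() on either result lists the entries from the highest frequency downward, the order used in Tables 2 and 4.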
Fig.5 shows the distribution of the ratio of upper and lower
letters to the total number of letters in a line. The average ratio
is 19%. A simple calculation gives a ratio of 23% for the number of
upper and lower letters relative to the number of horizontal
positions. This means that in a line of Thai text, the upper and
lower letters amount to about 23% of the normal horizontal
positions.
T = total number of letters in one line
S = total number of upper and lower letters in the line
M = T - S = number of horizontal positions in the line
Q1 = (S/T) x 100
Q2 = (S/M) x 100
mean value of Q1 = 19%
mean value of Q2 = 23%
Fig.5 Distribution of the ratio of upper and lower letters to the
total number of letters in a line
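The relation between the two ratios can be checked with a few lines of arithmetic. The sketch assumes that the "simple calculation" is the substitution M = T - S, which gives Q2 = 100 x Q1 / (100 - Q1); applying this to the mean of Q1 only approximates the mean of Q2 taken line by line. The function names are hypothetical.

```python
def ratios(total_letters, stacked_letters):
    """Return (Q1, Q2) in percent for one line, where stacked_letters
    is the number of upper and lower letters in the line."""
    horizontal = total_letters - stacked_letters   # M = T - S
    q1 = 100 * stacked_letters / total_letters     # share of all letters
    q2 = 100 * stacked_letters / horizontal        # share of positions
    return q1, q2


def q2_from_q1(q1):
    """Convert Q1 to Q2 using M = T - S."""
    return 100 * q1 / (100 - q1)
```

With Q1 = 19 this gives 1900/81, about 23.5, which agrees with the reported mean of 23%.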
Table 2 Occurrence frequency of single letters
(Each letter is listed with its count, in order from the highest
frequency; the individual entries are illegible here.)
Table 3 Occurrence frequency of compound vowels
- : consonant position
(Each compound vowel is listed with its count, from 9947 for the
most frequent down to 0 for forms that do not occur; the individual
entries are illegible here.)
Table 4 Occurrence frequency of connected letters
/ : segmentation symbol, SP : space
(Each two-letter combination is listed with its count, in order from
the highest frequency; the individual entries are illegible here.)
Printing 
Laser beam printer
The CANON LBP-3500 is a laser beam printer which can print any kind
of figure and character. In character mode, a character must be
defined as a dot matrix of 8X8, 16X16, 24X24, 32X32, etc.
We use a 16X16 matrix as the minimum module of a Thai letter
pattern. Thai characters are classified into fifteen types according
to the size of their dot matrix. The largest pattern is a 48X32
matrix, which uses 6 modules.
One text line is printed as five horizontal zones. Each zone is 16
dots high. The horizontal width of a letter can be changed character
by character, but within one zone the vertical size cannot be
changed.
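The module arithmetic above can be written out as a short sketch. The names MODULE, ZONES and modules_needed are hypothetical, and the sketch does not assume which of the two figures in "48X32" is the width, since the module count is the same either way.

```python
MODULE = 16   # side of the minimum 16X16 module, in dots
ZONES = 5     # horizontal zones that make up one text line


def modules_needed(width, height):
    """Number of 16X16 modules covering a width x height dot pattern."""
    cols = -(-width // MODULE)    # ceiling division
    rows = -(-height // MODULE)
    return cols * rows


# Total height of one printed text line, in dots.
LINE_HEIGHT = ZONES * MODULE
```

For the largest pattern, modules_needed(48, 32) gives 3 x 2 = 6 modules, the figure stated above, and a text line of five zones is 80 dots high.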
Control of different letter width 
The complex part of the output program is controlling the widths so
that the heading parts of the KWIC index lines are aligned
vertically.
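One way to picture this width control is sketched below. It is an assumption about the approach, not the actual output program: each letter is given a width in dots, the width of the left context is summed, and the line is started far enough to the right that every keyword begins at the same dot column. The width table, the default width, and the function names are hypothetical, and upper and lower letters that occupy no horizontal space are not modelled.

```python
def context_width(letters, widths, default=16):
    """Total width in dots of a run of letters."""
    return sum(widths.get(ch, default) for ch in letters)


def left_margin(left_context, keyword_x, widths):
    """Dots of blank space to leave before the left context so that the
    keyword that follows it starts exactly at column keyword_x."""
    margin = keyword_x - context_width(left_context, widths)
    if margin < 0:
        raise ValueError("left context is wider than the keyword column")
    return margin
```

Because the margin absorbs the difference in context width, two lines whose contexts are 48 and 96 dots wide both place their keyword at the same column.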
An example of the KWIC index is shown in Fig.6. We have printed
about 200000 lines.
Fig.6 Example of KWIC index of the Three Seals Law (each entry is
labelled with its volume, page and line)

References

1) Ishii, Yoneo
1969 "Introductory remarks on the Law of Three Seals", East Asian
Study, Vol.6, No.4, Kyoto University.

2) Murdock, George P.
1971 "Outline of cultural materials", Human Relations Area Files,
Inc.

3) Oikawa, Akifumi & Nakayama, Kazuhiko & Sugita, Shigeharu
1979 "Printing of Thai letters by laser beam printer", the 20th
annual meeting of the Information Processing Society of Japan.

4) Sugita, Shigeharu
1979 "Computer use in ethnological studies", Bulletin of the
National Museum of Ethnology, Vol.4, No.1.

5) Udom Warotamasikkhadit & David Londe
"Computerized Alphabetization of Thai"
