CLASSIFIER ASSIGNMENT BY CORPUS-BASED APPROACH 
Virach Sornlertlamvanich Wantanee Pantachat Surapant Meknavin 
Linguistics and Knowledge Science Laboratory 
National Electronics and Computer Technology Center 
National Science and Technology Development Agency 
Ministry of Science Technology and Environment 
22nd Gypsum Metropolitan Tower, 
539/2 Sriayudhya Rd., Bangkok 10400, Thailand 
{ virach,wantanee,surapan } @nwg.nectec.or.th 
Abstract 
This paper presents an algorithm for selecting an 
appropriate classifier word for a noun. In Thai, the 
choice of classifier for a given concrete noun frequently 
fluctuates, both across the whole speech community 
and for individual speakers. Basically, there is no exact 
rule for classifier selection. The most a rule-based 
approach can do is give a default rule that picks a 
corresponding classifier for each noun. Registration of 
classifiers for each noun is limited to the unit classifier 
type, because the other types are open, depending on 
the meaning to be represented. We propose a 
corpus-based method (Biber, 1993; Nagao, 1993; 
Smadja, 1993) which generates Noun Classifier 
Associations (NCA) to overcome the problems in 
classifier assignment and semantic construction of 
noun phrases. The NCA is created statistically from a 
large corpus and recomposed under concept hierarchy 
constraints and frequency of occurrences. 
Keywords: Thai language, classifier, corpus-based 
method, Noun Classifier Associations (NCA) 
1. Introduction 
Classifiers play a significant role in Thai: they combine 
with a noun or verb to express quantity, 
determination, pronouns, etc. By far the most common 
use of classifiers, however, is in enumerations, where 
the classifiers follow numerals and precede 
demonstratives (Noss, 1964). Not all types of classifier 
have a relationship with a noun or verb as the unit 
classifier does. 
A unit classifier is any classifier which has a 
special relationship with one or more concrete nouns. 
For example, to enumerate members of the class of 
/rya/ 'boats', the unit classifier /lam/ is selected as in 
the phrase below: 
/rya nung lam/ 
boat one <boat> 
'one boat'. 
Other than the unit classifier, there are the collective 
classifier, metric classifier, frequency classifier and 
verbal classifier. 
A collective classifier is any classifier which 
shows a general group or set of mass nouns, as in 
/nok soong fuung/ 'two flocks of birds'. A metric 
classifier is any classifier which occurs in 
enumerations that modify predicates as well as nouns, 
as in /nam saam kaew/ 'three glasses of water'. 
A frequency classifier is any classifier which is used to 
express the frequency with which an event occurs, as in 
/bin sii roob/ 'fly four rounds'. A verbal classifier is 
any classifier which is derived from a verb and usually 
used in construction with mass nouns, as in 
/kradaad haa muan/ 'five rolls of paper'. 
The unit classifier has a special relationship 
with concrete nouns. The set of unit classifiers 
available for each noun is closed. Most of the unit 
classifiers are used with a great many concrete nouns 
of very different meaning, but a few are restricted to a 
single noun. Except for the unit classifier, the set 
of classifiers for a noun or predicate is open. 
For the metric classifier in particular, the number of 
classifiers for numeral expressions of distance, size, 
weight, container and value is large. 
The use of classifiers in Thai is not limited to 
the numeral expression but extends to other 
expressions such as ordinal, determination, relative 
pronoun, pronoun, etc. The details of each classifier 
phrase are described in the next section. 
In many existing natural language processing 
systems, the list of available classifiers for each noun 
is attached to a lexicon base. Rules for classifier 
selection from the list can provide a 
default value but do not guarantee its 
appropriateness. However, the problems of classifier 
phrase construction still remain unsolved. 
To overcome the problems of using 
classifiers, we propose a method of extracting 
classifier phrases from a large corpus. As a result, 
Noun-Classifier Associations (described in Section 3) are 
statistically created to define the relationship between 
a noun and a classifier in a classifier phrase. With the 
frequency of occurrence of a classifier in a 
classifier phrase, we can propose the most appropriate 
use of a classifier. Furthermore, we introduce a 
hierarchy of semantic classes for the induction of a 
classifier class when classifiers are employed in 
construction with nouns which belong to the same class 
of meaning. Section 3 and Section 4 describe the 
generation and the implementation of the NCA, respectively. 
2. The roles of classifier in Thai language 
In Thai, we use classifiers in various 
situations. The classifier plays an important role in 
construction with a noun to express ordinals and 
pronouns, for instance. The classifier phrase is 
syntactically generated according to a specific pattern. 
Fig. 2.1 shows the position of a classifier in each 
pattern, where N stands for noun, NCNM stands for 
cardinal number, CL stands for classifier, DET stands 
for determiner, VATT stands for attributive verb, REL_M 
stands for relative marker, ITR_M stands for 
interrogative marker, DONM stands for ordinal number, 
and DDAC stands for definite demonstrative. 
From a study of the use of classifiers in each 
expression mentioned above, we can conclude that the 
types of classifier are not restricted to any kind of 
expression. Considering the semantic representation of 
each expression, the unit classifier is 
not regarded as a conceptual unit in any expression 
except pattern 6, but the other types are (see 
examples a and b). 
a) /prachachon 2 khon/ 
(Unit-CL) 
people 2 <people> 
'2 people' 
b) /prachachon 2 klum/ 
(Collective-CL) 
people 2 <group> 
'2 groups of people' 
We encountered difficulty in generating the appropriate 
classifier for a noun or verb in a semantic representation. 
Classifier assignment for a non-conceptual 
representation, and classifier selection for a one-to-many 
conceptual representation, are beyond what 
the rule-based approach can handle. Our proposal of 
classifier assignment using the corpus-based method is 
another approach. Based on the collocation of noun and 
classifier in each pattern shown in Fig. 2.1, we decided 
to construct the Noun Classifier Association table (see 
Section 3). A stochastic method combined with the 
concept hierarchy is proposed as a strategy for building 
the NCA table. The table consists of information 
about noun-classifier collocation, statistical occurrences 
and the representative classifier for each semantic class 
in the concept hierarchy. 
3. Extraction of Noun-Classifier Collocation 
In this section, we describe the algorithm used for 
extraction of Noun Classifier Associations (NCA) from 
a large corpus. We used a 40 megabyte Thai corpus 
collected from various areas to create the table. The 
algorithm is as follows: 
Step 1: Word segmentation. 
Input: A corpus. 
Output: The word-segmented corpus. 
In text processing, we often need word boundary 
information for several purposes. Because Thai has no 
explicit marker to separate words from one another, we 
have to preprocess the corpus with a word segmentation 
program. We used the program developed by 
Sornlertlamvanich (1993), with post-editing to correct 
faulty segmentation. The program employs heuristic 
rules of longest matching and least word count, 
incorporated with character combining rules for Thai 
words. Though the accuracy of the word segmentation 
does not reach 100%, it is high enough (more than 
95%) to reduce the post-editing time. 
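As a rough illustration of the longest-matching heuristic mentioned above, the following Python sketch greedily takes the longest lexicon entry at each position. It is not the program of Sornlertlamvanich (1993): the function name, the toy romanized lexicon and the single-character fallback are assumptions, and the least-word-count and character-combining rules are left out.

```python
def segment_longest_match(text, lexicon):
    """Greedy left-to-right longest matching over an unsegmented string."""
    max_len = max(len(w) for w in lexicon)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # assumed fallback: emit one unknown character
        words.append(text[i:j])
        i = j
    return words


# Hypothetical romanized lexicon; real input would be Thai script.
toy_lexicon = {"nak", "nakrian", "rian", "saam", "khon"}
print(segment_longest_match("nakriansaamkhon", toy_lexicon))
```

Greedy longest matching can choose a wrong boundary when a shorter word would allow a better overall split, which is presumably why the original program combines it with a least-word-count rule and post-editing.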
Step 2: Tagging. 
Input: Output of step 1. 
Output: The corpus in which each word is tagged with 
a part of speech and a semantic class. 
The word-segmented corpus is then processed with a 
stochastic part-of-speech tagger. Each word w together 
with its part of speech is then used to retrieve the 
semantic class of the word from a dictionary. The 
result yields a data structure of (w,p,s), where p 
denotes the part of speech of w and s denotes the 
semantic class of w. For example, the data structure of 
the word /nakrian/ 'student' is (/nakrian/, NCMN, 
person), where NCMN stands for common noun and 
person places /nakrian/ in the class of person. 
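The (w,p,s) structure can be pictured with a short Python sketch. The class name, the dictionary contents and the use of romanized forms are assumptions made for illustration; the paper does not describe how its dictionary is stored.

```python
from typing import NamedTuple, Optional


class TaggedWord(NamedTuple):
    w: str            # surface word
    p: str            # part of speech
    s: Optional[str]  # semantic class, None if not in the dictionary


# Hypothetical dictionary keyed by (word, part of speech).
SEMANTIC_DICT = {("nakrian", "NCMN"): "person"}


def attach_semantic_class(tagged, dictionary=SEMANTIC_DICT):
    """Turn (word, pos) pairs from the tagger into (w, p, s) triples."""
    return [TaggedWord(w, p, dictionary.get((w, p))) for w, p in tagged]


triples = attach_semantic_class([("nakrian", "NCMN"), ("saam", "NCNM")])
print(triples[0])
```

Keying the lookup on both the word and its part of speech mirrors the statement that the word together with its part of speech is used to retrieve the semantic class.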
Step 3: Producing concordances. 
Input: Output of step 2, a given classifier cl. 
Output: All the fragments containing cl. 
1. Enumeration 
   Pattern: N/V-NCNM-CL 
   Sample:  /nakrian 3 khon/ 
            (N) (NCNM) (CL) 
            student 3 <student> 
            'three students' 
2. Ordinal 
   Pattern: N-CL-/tii/-NCNM 
   Sample:  /kaew bai thii 4/ 
            (N) (CL) (NCNM) 
            glass <glass> 4th 
            'the fourth glass' 
3. Determination 
   - Definite demonstration 
     Pattern: a) N-CL-DET 
     Sample:  a) /raw chop kruangkhidlek kruang nii/ 
                 (N) (CL) (DET) 
                 we like calculator <calculator> this 
                 'we like this calculator' 
   - Indefinite demonstration 
     Pattern: a) N-CL-DET 
     Sample:  a) /phukhawfung khon nung sadaeng 
                 khwamhen nai thiiprachum/ 
                 (N) (CL) (DET) 
                 participant <participant> one express 
                 opinion in conference 
                 'A participant expressed his opinion in 
                 the conference.' 
     Pattern: b) N-DET-CL 
     Sample:  b) /sunak bang tua/ 
                 (N) (DET) (CL) 
                 dog some <dog> 
                 'some dogs' 
   - Referential 
     Pattern: a) N-CL-DET 
     Sample:  a) /kamakan kana nii thukkhon 
                 chuua waa ja thamngan samret/ 
                 (N) (CL) (DET) 
                 committee <group> this everyone 
                 believe that will work success 
                 'It is this committee that everyone 
                 believes will succeed in its mission.' 
4. Attributive 
   Pattern: N-CL-VATT 
   Sample:  /dinsoo theng san/ 
            (N) (CL) (VATT) 
            pencil <shape> short 
            'short pencil' 
5. Noun modifier 
   Pattern: CL-N 
   Sample:  /kana naktongtiew/ 
            (CL) (N) 
            group tourist 
            'group of tourists' 
6. Pronoun 
   - Relative pronoun 
     Pattern: a) CL-REL_M 
     Sample:  a) /nakbanchii khon thii thamngan 
                 thii borisat nii/ 
                 (N) (CL) (REL_M) (V) 
                 accountant <accountant> who work 
                 at company this 
                 'the accountant who works at this company' 
   - Interrogative pronoun 
     Pattern: b) CL-ITR_M 
     Sample:  b) /sing nai/ 
                 (CL) (ITR_M) 
                 <thing> which 
                 'which one' 
   - Ordinal pronoun 
     Pattern: c) CL-DONM 
     Sample:  c) /tua raek/ 
                 (CL) (DONM) 
                 one first 
                 'the first one' 
   - Pronoun 
     Pattern: d) CL-DDAC 
     Sample:  d) /khon nii chop bia mak/ 
                 (CL) (DDAC) 
                 the one like beer very 
                 'The one likes beer very much' 
Fig. 2.1 Classification of classifier expressions table 
([noun]_111, [cl]_2, 11) 
([noun]_111, [cl]_2, 5) 
([noun]_111, [cl]_1, 6) 
([noun]_13111, [cl]_1, 9) 
([noun]_13111, [cl]_2, 4) 
([noun]_13111, [cl]_1, 10) 
([noun]_13111, [cl]_2, 3) 
([noun]_13111, [cl]_1, 7) 
([noun]_111, [cl]_1, 67) 
([noun]_111, [cl]_2, 1) 
([noun]_111, [cl]_1, 17) 
([noun]_111, [cl]_1, 9) 
([noun]_111, [cl]_2, 1) 
([noun]_111, [cl]_1, 6) 
([noun]_13114, [cl]_1, 12) 
([noun]_13114, [cl]_1, 3) 
([noun]_13114, [cl]_1, 8) 
([noun]_13114, [cl]_1, 9) 
([noun]_13111, [cl]_1, 7) 
([noun]_13111, [cl]_1, 13) 
([noun]_13111, [cl]_1, 5) 
([noun]_13111, [cl]_1, 3) 
Fig. 3.1 Table of Noun Classifier Associations (NCA) 
Concrete (1) 
  Subject (11) 
    Person (111) 
    Organization (112) 
    ... 
  Concrete place (12) 
  Concrete thing (13) 
    Nature thing (131) 
      Living thing (1311) 
        Animal (13111) 
        ... 
        Plant (13113) 
        Fruit (13114) 
      ... 
    ... 
Fig. 3.2 Concept hierarchy 
Instead of picking up the data sentence by sentence, 
we extracted a fragment of data around the cl, because 
there is no explicit marker to indicate sentence 
boundaries. We used a range of -10 to +2 words 
around the cl in our experiments, which appeared to 
cover most co-occurrence patterns. 
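The windowing can be sketched in Python as follows. This is a minimal illustration rather than the original implementation: the function name is hypothetical, and plain word strings stand in for the (w,p,s) triples of step 2.

```python
def concordance_fragments(tokens, cl, before=10, after=2):
    """Return the -before..+after window around every occurrence of cl."""
    fragments = []
    for i, tok in enumerate(tokens):
        if tok == cl:
            start = max(0, i - before)  # clip at the start of the corpus
            fragments.append(tokens[start:i + after + 1])
    return fragments


# Placeholder tokens w0..w14 with the classifier at position 12.
tokens = [f"w{k}" for k in range(15)]
tokens[12] = "CL"
print(concordance_fragments(tokens, "CL"))
```

The asymmetric window reflects the patterns in Fig. 2.1, where the associated noun usually precedes the classifier and only a short context follows it.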
Step 4: Pattern matching 
Input: Output of step 3. 
Output: A list of noun-classifier pairs with frequency 
information of co-occurrences. 
In this step, the tagged corpus is matched with each 
pattern of classifier occurrences shown below: 
N--NCNM-CL (Enumeration) 
N--CL-/tii/-NCNM (Ordinal expression) 
N--CL-DET (Referential expression) 
N--DET-CL (Indefinite demonstration expression) 
N--CL-VATT (Attribute noun phrase) 
CL-N (Noun modifier) 
N--CL-{/tii/, /sung/, /nai/, ...} 
(Relative/Interrogative pronoun) 
where N denotes noun, CL denotes classifier, NCNM 
denotes cardinal number, DET denotes determiner, 
VATT denotes attributive verb, /tii/, /sung/ and 
/nai/ are specific Thai words, A-B denotes a 
consecutive pair of A and B, and A--B denotes a 
possibly separated pair. In principle, A--B can be 
separated by several arbitrary words, but in our 
experiments we considered only possible separations 
by a relative pronoun phrase having no more than 5 
words. This limits the search space of general 
cases to a manageable size, with some loss of 
generality. 
The pattern matching process was carried out 
one by one with each pattern. For each pattern of 
A--B-C, the matching of the B-C pair was simple and was 
performed first. Next, the matching of a pair A--B 
was done by: 
1. searching for the nearest A from B. If 
found, mark A1. 
2. from B within a span of five, searching for 
the nearest relative pronoun. If found, mark p1 then go 
to 3. Otherwise, match A1. 
3. further searching for the nearest A from p1. 
If found, mark A2. If A2 is farther from B than A1, 
match A2. Otherwise, match A1. 
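The three steps above can be sketched in Python over a list of part-of-speech tags. The function name, the use of REL_M as the tag for a relative pronoun, and the assumption that the search runs leftward from B (since A precedes B in every A--B pattern) are ours, not details given in the text.

```python
def match_head_noun(tags, b, target="N", rel="REL_M", span=5):
    """Return the index of the A matched with the B at index b, or None."""
    def nearest_left(tag, start, stop):
        # Scan leftward from start down to stop (inclusive).
        return next((i for i in range(start, stop - 1, -1)
                     if tags[i] == tag), None)

    # Step 1: nearest A to the left of B.
    a1 = nearest_left(target, b - 1, 0)
    # Step 2: nearest relative pronoun within `span` words of B.
    p1 = nearest_left(rel, b - 1, max(0, b - span))
    if p1 is None:
        return a1
    # Step 3: nearest A to the left of the relative pronoun.
    a2 = nearest_left(target, p1 - 1, 0)
    if a2 is not None and (a1 is None or a2 < a1):
        return a2  # A2 is farther from B than A1
    return a1


# N REL_M V N NCNM CL: the head noun before the relative clause wins.
print(match_head_noun(["N", "REL_M", "V", "N", "NCNM", "CL"], 4))
```

When A1 lies inside the relative clause, A2 is the noun before the relative pronoun and is therefore preferred; when A1 already precedes the relative pronoun, A2 coincides with A1 and A1 is matched.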
At the end of these steps, we obtained a list of 
nouns along with their frequencies in the corpus 
for each matching pattern (see Fig. 3.1 for sample 
outputs). Each entry is of the form (W_N1, CL_N2, 
Freq), where W denotes a noun, N1 denotes a number 
representing the semantic class of W, CL denotes the 
associated classifier, N2 is a number indicating 
whether CL is a unit or collective classifier (1 for unit, 
2 for collective) and Freq denotes the frequency of 
co-occurrence between W and CL. The semantic classes are 
shown in Fig. 3.2. 
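A minimal Python sketch of how matched pairs might be counted into entries of this form is given below. The function name and the sample matches are illustrative assumptions, not data from the corpus.

```python
from collections import Counter


def build_nca(matches):
    """Count (noun, class, classifier, type) matches into NCA entries."""
    counts = Counter(matches)
    # One entry per distinct pair, in the (W_N1, CL_N2, Freq) layout.
    return [(f"{w}_{n1}", f"{cl}_{n2}", freq)
            for (w, n1, cl, n2), freq in counts.items()]


# Hypothetical matches: romanized noun, semantic class code,
# romanized classifier, classifier type (1 = unit, 2 = collective).
matches = [("sunak", 13111, "tua", 1),
           ("sunak", 13111, "tua", 1),
           ("sunak", 13111, "fuung", 2)]
print(build_nca(matches))
```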
Step 5: Determine representative classifier 
Input: A list of noun-classifier with frequency 
information of co-occurrence. 
Output: Representative classifier of each noun and 
each semantic class of nouns. 
As can be observed in Fig. 3.1, each noun 
may be used with several possible classifiers. In the 
language generation process, however, we have to 
select only one of them. For each noun we select the 
classifier with the greatest co-occurrence 
frequency as the representative, separately for the 
representative unit classifier and the representative 
collective classifier. The first noun in Fig. 3.1 
(semantic class 111), for example, will have its most 
frequent classifier marked _1 as the representative unit 
classifier and its most frequent classifier marked _2 as 
the representative collective one. Collective 
classifiers are used instead of unit classifiers when the 
notion of 'group' is required. 
We also find the representative classifier for 
each semantic class of nouns in the same manner. For 
each semantic class of nouns (grouped by the semantic 
class attached to each noun), the classifier with the 
greatest co-occurrence frequency is selected 
as the representative. This classifier is used to 
assign a classifier to a noun which does 
not occur in the training corpus. For example, the 
representative unit classifiers for each semantic class 
extracted by the pattern (N--NCNM-CL) are shown in 
Fig. 3.3. 
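Step 5 can be sketched in Python as below. The text does not say whether class-level frequencies are summed over the nouns of a class or taken from single entries; this sketch assumes they are summed. The function name and sample entries are hypothetical.

```python
from collections import defaultdict


def representatives(entries):
    """entries: (noun, sem_class, classifier, cl_type, freq) tuples."""
    by_noun = defaultdict(lambda: defaultdict(int))
    by_class = defaultdict(lambda: defaultdict(int))
    for noun, sem, cl, cl_type, freq in entries:
        by_noun[(noun, cl_type)][cl] += freq
        by_class[(sem, cl_type)][cl] += freq  # assumption: sum per class

    def pick(table):
        # Keep the most frequent classifier for every key.
        return {key: max(freqs, key=freqs.get)
                for key, freqs in table.items()}

    return pick(by_noun), pick(by_class)


# Hypothetical entries; cl_type 1 = unit, 2 = collective.
entries = [("n1", 13111, "tua", 1, 9),
           ("n2", 13111, "tua", 1, 10),
           ("n2", 13111, "fuung", 2, 3),
           ("n3", 13111, "an", 1, 12)]
noun_rep, class_rep = representatives(entries)
print(noun_rep[("n3", 1)], class_rep[(13111, 1)])
```

Keeping unit and collective classifiers under separate keys gives each noun and each class one representative of each type, as described above.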
4. Classifier Resolution 
The associations produced in the previous section 
are useful for determining a proper classifier for a 
given noun. For a noun occurring in the corpus, 
the determination is accomplished in a 
straightforward manner by using its associated 
representative classifier, the one which occurs with it 
in the corpus more frequently than any other classifier. 
In the other case, where the given noun does not exist 
in the corpus, the determination is done by using the 
representative classifier of its class in the concept 
hierarchy. 
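The two cases can be sketched as a lookup with a fallback. The function name and the table layout are assumptions made for illustration; the classifier values follow the examples and Fig. 3.3 in this section.

```python
def resolve_classifier(noun, sem_class, cl_type, noun_rep, class_rep):
    """Pick a classifier for noun; cl_type is 1 (unit) or 2 (collective)."""
    # Nouns seen in the corpus use their own representative classifier.
    if (noun, cl_type) in noun_rep:
        return noun_rep[(noun, cl_type)]
    # Unseen nouns fall back to the representative of their semantic class.
    return class_rep.get((sem_class, cl_type))


# Assumed tables: /nakrian/ was seen in the corpus, /appern/ was not.
noun_rep = {("nakrian", 1): "kon"}
class_rep = {("fruit", 1): "luuk", ("animal", 2): "fuung"}
print(resolve_classifier("nakrian", "human", 1, noun_rep, class_rep))
print(resolve_classifier("appern", "fruit", 1, noun_rep, class_rep))
```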
Some examples of classifier determination are 
listed below. (1) and (3) show the case of nouns 
appearing in the corpus, while (2) and (4) show the 
case of nouns that do not. In (2), the unit classifier of 
/appern/ is obtained by using the representative unit 
classifier of its class 'fruit', which is /luuk/ according 
to Fig. 3.3. Similarly, in (4), the collective classifier of 
/gangken/ is determined by the representative 
collective classifier of its class 'animal', which is 
/fuung/. 
Semantic class   Unit classifier   Collective classifier 
animal           [cl]_1            /fuung/_2 
human            [cl]_1            [cl]_2 
plant            [cl]_1 
fruit            /luuk/_1 
Fig. 3.3 NCA for representative classifier 
Unit classifier 
(1) /nakrian kon tii sii/ 
    student <student> number four 
(2) /appern luuk nai/ 
    apple <apple> which 
Collective classifier 
(3) /kanagammagarn kana nan/ 
    committee <group> that 
(4) /gangken fuung nan/ 
    magpie <group> that 
5. Conclusion 
The proposed approach is a new method for 
manipulating the classifier phrase in Thai. The 
expression of some syntactic constituents 
requires a specific classifier, and the 
selection of a classifier for each noun or noun phrase 
depends on traditional use and semantic class. 
The corpus-based approach is well suited to 
detecting the traditional use and to searching for the 
most appropriate classifier when the noun does not yet 
occur in the corpus. The concept hierarchy of nouns 
provides another path for searching when the NCA does 
not cover the noun in question. 
In the future, the NCA will be included in the 
generation process of Machine Translation to solve 
classifier assignment, and incorporated in the analysis 
process to produce a proper syntactic and semantic 
structure. The classifier will then be a key for pattern 
disambiguation when it is fixed to one of the patterns 
illustrated in Fig. 2.1. 
Acknowledgement 
We wish to thank the National Electronics and 
Computer Technology Center (NECTEC) and the Center of 
the International Cooperation for Computerization 
(CICC), who provided facilities and a large corpus base 
for the research. 
References 
[1] Biber, Douglas. (1993). "Co-occurrence Patterns 
among Collocations: A Tool for Corpus-Based 
Lexical Knowledge Acquisition". Computational 
Linguistics, Vol. 19, No. 3, September 1993. 
[2] Nagao, Makoto. (1993). "Machine Translation: 
What Have We to Do". Proceedings of MT Summit 
IV, June 20-22, 1993, Kobe, Japan. 
[3] Noss, Richard B. (1964). Thai Reference Grammar, 
U.S. Government Printing Office, Washington, DC. 
[4] Smadja, Frank. (1993). "Retrieving Collocations 
from Text: Xtract". Computational Linguistics, 
Vol. 19, No. 1, March 1993. 
[5] Sornlertlamvanich, Virach. (1993). "Word 
Segmentation for Thai in Machine Translation System", 
Machine Translation, National Electronics and 
Computer Technology Center, (in Thai). 


