Chinese Classifier Assignment Using SVMs
Hui Guo and Huayan Zhong
Department of Computer Science
Stony Brook University
Stony Brook, NY 11794-4400, USA
{huguo, huayan}@cs.sunysb.edu
Abstract
In Chinese, nouns need numeral clas-
sifiers to express quantity. In this pa-
per, we explore the relationship be-
tween classifiers and nouns. We ex-
tract a set of lexical, syntactic and onto-
logical features and the corresponding
noun-classifier pairs from a corpus and
then train SVMs to assign classifers to
nouns. We analyse which features are
most important for this task.
1 Introduction
In English, numbers directly modify count nouns,
as in ‘two apples’ and ‘five computers’. Num-
bers cannot directly modify mass nouns; instead,
an embedded noun phrase must be formed, e.g.
‘five slices of bread’. However, in Chinese all
nouns need numeral classifiers to express quan-
tity1. When translating from English to Chinese,
we may need to choose Chinese classifiers to form
noun phrases. We can see the difference between
the two languages in the following two examples:
_d_973[liang] _d_977[ge]_d_5678_d_3428[pingguo] (Chinese)
two apples (English)
and
_d_1032[wu]_d_4252[pian]_d_7268_d_1479[mianbao] (Chinese)
five slices of bread (English)
Noun classifer combinations appear with high
frequency in Chinese. There are more than 500
classifiers although fewer than 200 of them are
frequently used. Each classifier can only be
1Proper nouns and bare noun phrases do not need classi-
fiers.
used with certain classes of noun. Nouns in a
class usually have similar properties. For exam-
ple, nouns that can be used with the classifier
‘_d_3495[gen]’ are: ‘_d_4919_d_5718’(straw), ‘_d_5022_d_2258’(chopstick),
‘_d_5035_d_2258’(pipe), etc. All these objects are long and
thin. However, sometimes nouns with similar
properties are in different classes. For example,
‘_d_4259’(cow), ‘_d_7406’(horse) and ‘_d_5306’(lamb) are all live-
stock, but they associate with different classifiers.
This means that classifier assignment is not totally
rule-based but partly idiomatic.
In this paper, we explore the relationship be-
tween classifiers and nouns. We extract a set of
features and the corresponding noun-classifier at-
tachments from a corpus and then train SVMs to
assign classifers to nouns. In Section 4 we de-
scribe our data set. In Section 5 we describe our
experiments. In Section 6 we present our results.
2 Related Work
Many Asian languages (e.g. Chinese, Korean,
Japanese and Thai) have numeral classifier sys-
tems. Previous work on noun-classifier match-
ing has been done in these languages. (Sorn-
lertlamvanich et al., 1994) present an algorithm
for selecting an appropriate classifier for a noun
in Thai. The general idea is to extract noun-
classifier collocations from a corpus, and output a
list of noun-classifier pairs with frequency infor-
mation. During noun phrase generation, the most
frequently co-occurring classifier for a given noun
is selected. However, no evaluation is reported for
this algorithm.
The algorithm described in (Paik and Bond,
2001) generates Japanese and Korean numeral
25
classifiers using semantic classes from an ontol-
ogy. The authors assigned classifiers to each
of the 2,710 semantic classes in the ontology
by hand. During generation, nouns in each se-
mantic class are assigned the associated classi-
fier. The classifier assignment accuracy is 81%
for Japanese classifiers and 62% for Korean clas-
sifiers. However, the evaluation set contains only
90 noun phrases, which is pretty small. Further-
more, it is hard work to attach classifiers to an on-
tology by hand, and with this approach it is hard
to deal with cases like the cattle example men-
tioned earlier.
(Paul et al., 2002) present a method for ex-
tracting classifier information from a bilingual
(Japanese-English) corpus based on phrasal cor-
respondences in the sentential context. Bilin-
gual sentence pairs are compared to find noun-
classifier collocations. The evaluation was done
by a human. The precision is high (84.2%) but
the recall is only about 40% because the algorithm
does not give output for half of the nouns.
In contrast to these algorithms, our approach: is
based on a large data set; uses machine learning;
and does not require the attachment of classifiers
to an ontology by hand.
3 Support Vector Machines
Support Vector Machines (SVMs) are a type of
classifier first introduced in (Boser et al., 1992).
In the last few years SVMs have become an im-
portant and active field in machine learning re-
search. The SVM algorithm detects and exploits
complex patterns in data.
A binary SVM is a maximum margin classifier.
Given a set of training data {x1,x2,...,xk}, with
corresponding labels y1,y2,...,yk ∈ {+1,−1}, a
binary SVM divides the input space into two re-
gions at a decision boundary, which is a separat-
ing hyperplane 〈w,x〉 + b = 0 (Figure 1). The
decision boundary should classify all points cor-
rectly, that is:
yi(〈w,xi〉 + b) > 0,∀i
Also, the decision boundary should have the
maximum separating margin with respect to the
two classes. If we rescale w and b to make
the closest point(s) to the hyperplane satisfy
b
w
x
x
x
x x
x
x
x
<w, x> + b = 0
Figure 1: The input space and hyperplane
|〈w,xi〉 + b| = 1, then the margin equals 1/||w||
and the problem can be formulated as:
minimize 12||w||2
subject to yi(〈w,xi〉 + b) ≥ 1,∀i
The generalized Lagrange Function is:
L(w,b,α) = 12〈w,w〉−
lsummationdisplay
i=1
αi[yi(〈w,xi〉+b)−1]
So we can transform the problem to its dual:
maximize
W(α) =
nsummationdisplay
i=1
αi − 12
nsummationdisplay
i=1,j=1
αiαjyiyj〈xi,xj〉
subject to αi ≥ 0,
nsummationdisplay
i=1
αiyi = 0
This is a quadratic programming (QP) problem
and we can always find the global maximum of
αi. We can recover w and b for the hyperplane
by:
w =
nsummationdisplay
i=1
αiyixi
b = −maxyi=−1(〈w,xi〉) + minyi=+1(〈w,xi〉)2
If the points in the input space are not linearly
separable, we allow ‘slack variables’ ξi in the
classification. We need to find a soft margin hy-
perplane, e.g.:
minimize 12||w||2 + C
nsummationdisplay
i=1
ξi
26
subject to yi(〈w,xi〉 + b) ≥ 1 −ξi,∀i
Once again, a QP solver can be used to find the
solution.
For our task we need multi-class SVMs. To get
multi-class SVMs, we can construct and combine
several binary SVMs (one-against-one), or we can
directly consider all data in one optimization for-
mula (one-against-all).
Many SVM implementations are available on
the web. We chose LIBSVM (Chang and Lin,
2001), which is an efficient multi-class imple-
mentation. LIBSVM uses the “one-against-one”
approach in which k(k−1)/2 classifiers are con-
structed and each one trains on data from two dif-
ferent classes (Hsu and Lin, 2002).
4 Data and Resources
We use the Penn Chinese Treebank (Xue et al.,
2002) as our corpus and the ontology/lexicon
HowNet (Dong and Dong, 2000) to get ontologi-
cal features for nouns. We train SVMs on differ-
ent feature sets to see which set(s) of features are
important for noun-classifier matching.
4.1 Penn Chinese Treebank
The Penn Chinese Treebank is a 500,000 word
Chinese corpus annotated with both part-of-
speech (POS) tags and syntactic brackets.
We automatically extract noun phrases that
contain classifiers from the corpus. An example
noun phrase (translation: ‘a major commercial
waterway’) is:
(IP
....
(NP (QP (CD_d_948) (CLP (M_d_3403)))
(NP (NN_d_3762_d_6655)) (ADJP (JJ_d_2101))
(NP (NN_d_1452_d_5471)))
...
)
The word in (CLP (M_d_3403[tiao])) is the classifier
and the head noun of the noun phrase is (NN _d_1452
_d_5471). In Section 5.3 we describe a set of features
we obtain from each noun phrase and the sentence
in which it is embedded.
In our corpus, there are 61587 noun occur-
rences (12225 unique nouns) and 3940 classifier-
noun co-occurrences (212 unique classifiers).
However, there is a trival rule determining
whether a noun needs a classifier. If a noun is
preceded by a quantifier or a determiner, then a
classifier is needed, otherwise it is not. Hence,
we only focus on noun-classifier pairs. The most
frequently occurring classifier in this corpus is
‘_d_977[ge]’, which occurs with 497 unique nouns. In
this corpus, 87 classifiers occur in only one noun-
classifier pair.
4.2 HowNet
We get ontological features of nouns from
HowNet. HowNet is a bilingual Chinese-English
lexicon and ontology. Each word sense is as-
signed to a concept containing ontological fea-
tures. HowNet uses basic meaning units named
sememes to construct concepts.
Table 1 shows an example entry in HowNet.
The entry in Table 1 is for the word ‘_d_1144
_d_2319’(writer). The sememe at the first position, ‘hu-
man(_d_1054)’, is the categorical attribute, which de-
scribes the general category of the concept. The
sememes following the first sememe are addi-
tional attributes, which give additional specific
features. There are two types of pointer, ‘#’ and
‘*’, in the definition. ‘#’ means ‘related’, so
‘#occupation’ shows that the concept has a re-
lationship with ‘occupation’. ‘*’ means ‘agent’,
so ‘*compile’ shows that ‘writer’ is the agent of
‘compile’. The sememes ‘#readings’ and ‘litera-
ture’ show that the job of ‘writer’ is to compile
‘readings’ about ‘literature’.
We use HowNet 2000, which contains 120,496
entries for about 65,000 Chinese words defined
with a set of 1503 sememes. It is big enough for
our task and we can get ontological features for
94.71% of the nouns from the Penn Chinese Tree-
bank. For the nouns that are not in HowNet, we
just leave the ontological features blank.
5 Experiments
We use six different feature sets to assign classi-
fiers to nouns. To evaluate each feature set, we
perform 10-fold cross validation. We report our
results in Section 6.
5.1 Baseline Algorithm
In the training data, we count the number of times
each classifier appears with a given noun. We as-
sign to each noun in the testing data its most fre-
27
No.: 114303
W C (word in Chinese): _d_1144_d_2319
E C (example in Chinese):
G C (POS tag in Chinese): N
W E (word in English): writer
E E (example in English):
G E (POS tag in English): N
DEF (concept definition): human(_d_1054),#occupation(_d_5385_d_1132),*compile(_d_5249_d_6625),
#readings(_d_6321_d_4266),literature(_d_3230)
Table 1: An entry in HowNet
Lexical Features Syntactic Features
noun POS of noun
first premod POS of first premod
second premod POS of second premod
main verb POS of main verb
total number of premodifiers sentType
embedded in vp or pp
quoted or not
Table 2: Features extracted from training data
quently co-occurring classifier (c.f. (Sornlertlam-
vanich et al., 1994)). If a noun does not appear in
the training data, we assign the classifier ‘_d_977[ge]’,
the classifier which appears most frequently over-
all in the corpus.
5.2 Noun Features
Since classifiers are assigned mostly based on the
noun, the most important features for classifier
prediction should be features of the nouns. We
ran four different experiments for noun features:
• (1) The feature set includes only the noun it-
self.
• (2) The feature set includes ontological fea-
tures of the noun only. If classifiers are as-
sociated with semantic categories (c.f. (Paik
and Bond, 2001)), we should be able to as-
sign classifiers based on the ontological fea-
tures of nouns.
• (3) The feature set includes the noun and on-
tological features.
• (4) Two SVMs are trained: one on the noun
only, and one on ontological features only.
During testing, nouns in the training set
are assigned classifiers using the first SVM;
other nouns are assigned classifiers using the
second SVM.
5.3 Context Features
In this set of experiments, we used features from
both the noun and the context. The features we
used can be categorized into two groups: lexical
features and syntactic features. They are shown
in Table 2.
We ran two experiments using this set of fea-
tures:
• (5) The feature set includes the noun, lexical
and syntactic features only.
• (6) The feature set includes the noun, lexical,
syntactic and ontological features.
6 Results and Discussion
We built SVMs using all the feature sets described
in Section 5 and tested using 10-fold cross valida-
tion. We tried the four types of kernel function in
LIBSVM: linear, polynomial, radial basis func-
tion (RBF) and sigmoid, then selected the RBF
kernal K(x,y) = e−γ||x−y||2, which gives the
28
Algorithm All nouns Nouns occuring 2+ times
Baseline 50.76% 50.69%
(1) noun only 57.81% (c = 4, γ = 0.5) 59.34% (c = 16, γ = 0.125)
(2) ontology only 58.69% (c = 4, γ = 0.5) 60.68% (c = 256, γ = 0.125)
(3) noun and ontology 57.81% (c = 16, γ = 0.5) 59.46% (c = 16, γ = 0.125)
(4) noun or ontology 58.71% 60.55%
(5) noun, syntactic and
lexical features 52.14% (c = 1024, γ = 0.5) 53.51% (c = 16, γ = 0.5)
(6) all features 52.06% (c = 1024, γ = 0.075) 53.55% (c = 16, γ = 0.5)
Table 3: Accuracy of different algorithms
Most common
noun
_d_1132[wei] _d_3664[ci] _d_977[ge] _d_1613[ming] _d_2384[jie] _d_7303[xiang]
_d_1132[wei] _d_2299_d_1661(official) 24.1 (57.1) 14.7 (34.7)
_d_3664[ci] _d_2101_d_1109 (conven-
tion)
22.3 (53.3) 1.1 (2.6) 7.6 (18.2)
_d_977[ge] _d_7303_d_4649(project) 1.0 (7.0) 0.7 (5.2) 0.2 (1.7) 3.3 (24.4)
_d_1613[ming] _d_1054_d_1661(person) 31.7 (55.2) 23.8 (41.4)
_d_2384[jie] _d_6655_d_1452_d_1109 (sports
tournament)
1.9 (2.1) 29.6 (34.0) 31.5 (36.2)
_d_7303[xiang] _d_2884_d_3428 (achieve-
ment)
6.6 (11.3) 35.2 (60.4) 1.1 (1.9)
Table 4: Most commonly misclassified classifiers; Cell shows percentage of total occurrences of row
value misclassified as column value and (percentage of total misclassifications of row value misclassi-
fied as column value)
highest accuracy. For each feature set, we sys-
tematically varied the values for the parameters C
(range from 2−5 to 215) and γ (range from 23 to
2−15); we report the best results with correspond-
ing values for C and γ. Finally, for each feature
set, we ran once on all nouns and once only on
nouns occurring twice or more in the corpus.
Classifier assignment accuracy is reported in
Table 3. The performance of all the SVMs is sig-
nificantly better than baseline (paired t-test, p <
0.005). There is no significant difference between
the performance with the 1st, 2nd, 3rd and 4th
feature sets. But the performance of the SVMs us-
ing lexical and syntactic features (experiments 5
and 6) is significantly worse than the performance
on feature sets 1-4 (df = 17.426, p < 0.05).
These results show that lexical and syntactic
contextual features do not have a positive effect
on the assignment of classifiers. They confirm the
intuition that the noun is the single most important
predictor of the classifier; however, the semantic
class of the noun works as well as the noun itself.
In addition, a combination approach that uses se-
mantic class information when the noun is previ-
ously unseen does not perform better.
We also computed the confusion matrix for the
most commonly misclassified classifiers. The re-
sults are reported in Table 4.
For these experiments we used automatic eval-
uation (cf. (Paul et al., 2002)). A classifier is only
judged to be correct if it is exactly the same as that
in the original test set. For some noun phrases,
there are multiple valid classifiers. For example,
we can say
‘_d_948[yi]_d_1966[kuai]_d_6872_d_4254[jinpai]’
or
‘_d_948[yi]_d_3427[mei]_d_6872_d_4254[jinpai]’
(a golden medal).
We did a subjective evaluation on part of our
data to evaluate how many automatically gener-
ated classifiers are acceptable to human readers.
We randomly selected 241 noun-classifier pairs
from our data. We presented the sentence con-
taining each pair to a human judge who is a na-
tive speaker of Mandarin Chinese. We asked the
judge to rate all the classifiers generated by our
29
Algorithm Number rated 1
or higher
Percent rated 1 or
higher
Average rating
Baseline 209 86.7% 1.59
(1) noun only 224 92.9% 1.76
(2) ontology only 226 93.8% 1.78
(3) noun and ontology 226 93.8% 1.77
(4) noun or ontology 227 94.2% 1.80
(5) noun, syntactic and
lexical features 218 90.5% 1.67
(6) all features 218 90.5% 1.67
Original 241 100% 1.95
Table 5: Human evaluation: Ratings of classifiers
algorithms as well as the original classifier by in-
dicating whether each is good (2), acceptable (1)
or bad (0) in that sentence context. The classifiers
were presented in random order; the judge was
blind to the source of the classifiers.
The results for our human evaluation are re-
ported in Table 5. Although our automatic eval-
uation indicates relatively poor accuracy, 94.2%
of generated classifiers using feature set 4) are
rated acceptable or good in our subjective evalua-
tion. Also, the performance of SVMs with the 1st,
2nd, 3rd and 4th feature sets is significantly bet-
ter than baseline (paired t-test, p < 0.005). There
is no significant difference between the perfor-
mance with the 1st, 2nd, 3rd and 4th feature sets.
But the performance of the SVMs using lexical
and syntactic features (experiments 5 and 6) is
significantly worse than those without (p < 0.05).
The ratings of the classifiers generated by all our
algorithms are significantly worse than the origi-
nal classifiers in the corpus. In future work, we
plan to extend this evaluation using more judges.
Which classifier to select also depends on the
emotional background of the discourse (Fang,
2003). For example, we can use different class-
fiers to express different affect for the same noun
(e.g. if a government official is in favor or dis-
grace). However, we cannot get this kind of in-
formation from our corpus.
7 Conclusions and Future Work
Our machine learning approach to classifier as-
signment in Chinese performs better than previ-
ously published rule-based approaches and works
for bigger data sets. The noun is clearly the most
important feature (experiment 1). However, we
still think ontological features may be useful in
classifier assignment, for example for previously
unseen nouns, and our experimental results show
a trend in this direction, although not a statisti-
cally significant one (experiments 2 and 4).
We used the Chinese Treebank for these ex-
periments because it is the only available corpus
of parsed Chinese text. Now that we have iso-
lated the relevant features for this task, we plan to
conduct further experiments using larger corpora,
such as the Chinese Gigaword (Graf and Chen,
2003).
Our use of ontological features could be im-
proved in several ways. First, the ontological fea-
tures we get from HowNet do not fit our pur-
pose well. For example, the definitions of ‘_d_4340’
(cat) and ‘_d_4259’ (cow) are both ‘livestock’; how-
ever, they should use different classifiers. In or-
der to improve the performance of our approach,
we need an ontology that correctly groups nouns
into classes according to their semantic properties
(e.g. type, shape, color, size).
For another knowledge-rich approach, we
could use a complex ontology plus a Chinese
classifier dictionary that describes the properties
of the objects each classifier can modify. By
comparing noun properties and classifier char-
acteristics, classifier assignment could be im-
proved as long as the nouns are in the ontol-
ogy. However, there are many idiomatic noun-
classifier matchings that can not be categorised
by dictionaries. Therefore, a combination of rule-
30
based and machine-learning approaches seems
most promising.
Third, we can classify Chinese classifers into
groups and focus on those that modify single ob-
jects. Certain Chinese classifiers can be used
before all plural nouns. Some classifiers spec-
ify the container of the objects, for example,
‘_d_948[yi] _d_5061_d_2258[lanzi] _d_5678_d_3428[pingguo]’ (a basket of
apples). The classifier changes when the con-
tainer changes. These can be treated differently
from sortal and anaphoric classifiers.
Acknowledgements
We thank the anonymous reviewers for their help-
ful feedback, HowNet for giving us permission to
download their knowledge base, and Chih-Chung
Chang and Chih-Jen Lin for the implementation
of LIBSVM.
This material is based upon work supported by
the Defense Advanced Research Projects Agency
(DARPA), through the Department of the Interior,
NBC, Acquisition Services Division, under Con-
tract No. NBCHD030010.
References
B. Boser, I. Guyon, and V. Vapnik. 1992. A training
algorithm for optimal margin classifiers. In Pro-
ceedings of the Fifth Annual ACM Conference on
Computational Learning Theory (COLT 1992).
Chih-Chung Chang and Chih-Jen Lin,
2001. LIBSVM: a library for support
vector machines. Software available at
http://www.csie.ntu.edu.tw/ cjlin/libsvm.
Z. Dong and Q. Dong. 2000. Introduction
to HowNet - Chinese message structure base.
http://www.keenage.com.
L. Fang. 2003. Research of Chinese lexicon teaching
- quantities. Journal of Secondary Education.
D. Graf and K. Chen. 2003. Chinese gigaword. LDC
Catalog Number LDC2003T09.
C. Hsu and C. Lin. 2002. A comparison of methods
for multi-class support vector machines. In IEEE
Transactions on Neural Networks.
K. Paik and F. Bond. 2001. Multilingual generation
of numeral classifiers using a common ontology. In
Proceedings of the 19th International Conference
on Computer Processing of Oriental Languages.
M. Paul, E. Sumita, and S. Yamamoto. 2002. Corpus-
based generation of numeral classifier using phrase
alignment. In Proceedings of the 19th International
Conference on Computational Linguistics.
V. Sornlertlamvanich, W. Pantachat, and S. Meknavin.
1994. Classifier assignment by corpus-based ap-
proach. In Proceedings of the 15th Conference on
Computational Linguistics.
N. Xue, F. Chiou, and M. Palmer. 2002. Building a
large-scale annotated Chinese corpus. In Proceed-
ings of the 19th International Conference on Com-
putational Linguistics.
31
