Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 217–220,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Designing Special Post-processing Rules for SVM-based
Chinese Word Segmentation
Muhua Zhu, Yilin Wang, Zhenxing Wang, Huizhen Wang, Jingbo Zhu
Natural Language Processing Lab
Northeastern University
No.3-11, Wenhua Road, Shenyang, Liaoning, China, 110004
{zhumh, wangyl, wangzx, wanghz}@ics.neu.edu.cn
zhujingbo@mail.neu.edu.cn
Abstract
We participated in the Third Interna-
tional Chinese Word Segmentation Bake-
off. Specifically, we evaluated our Chi-
nese word segmenter NEUCipSeg in
the close track, on all four corpora,
namely Academis Sinica (AS), City Uni-
versity of Hong Kong (CITYU), Mi-
crosoft Research (MSRA), and Univer-
sity of Pennsylvania/University of Col-
orado (UPENN). Based on Support Vec-
tor Machines (SVMs), a basic segmenter
is designed regarding Chinese word seg-
mentation as a problem of character-based
tagging. Moreover, we proposed post-
processing rules specially taking into ac-
count the properties of results brought
out by the basic segmenter. Our system
achieved good ranks in all four corpora.
1 SVM-based Chinese Word Segmenter
We built out segmentation system following (Xue
and Shen, 2003), regarding Chinese word segmen-
tation as a problem of character-based tagging.
Instead of Maximum Entropy, we utilized Sup-
port Vector Machines as an alternate. SVMs are
a state-of-the-art learning algorithm, owing their
success mainly to the ability in control of general-
ization error upper-bound, and the smooth integra-
tion with kernel methods. See details in (Vapnik,
1995). We adopted svm-light1 as the specific
implementation of the model.
1.1 Problem Formalization
By formalizing Chinese word segmentation into
the problem of character-based tagging, we as-
1http://svmlight.joachims.org/
signed each character to one and only one of the
four classes: word-prefix, word-suffix,
word-stem and single-character. For
example, given a two-word sequence o�2�
� p, the Chinese words for ”Southeast Asia(�
2�) people(�) ”, the character o� pis as-
signed to the category word-prefix, indicating
the beginning of a word; o2 pis assigned to the
category word-stem, indicating the middle po-
sition of a word; o� pbelongs to the category
word-suffix, meaning the ending of a Chinese
word; and last, o� pis assigned to the category
single-character, indicating that the single
character itself is a word.
1.2 Feature Templates
We utilized four of the five basic feature templates
suggested in (Low et al. , 2005), described as
follows:
• Cn(n = −2,−1,0,1,2)
• CnCn+1(n = −2,−1,0,1)
• Pu(C0)
• T(C−2)T(C−1)T(C0)T(C1)T(C2)
where C refers to a Chinese character. The first
two templates specify a context window with the
size of five characters, where C0 stands for the
current character: the former describes individual
characters and the latter presents bigrams within
the context window. The third template checks
if current character is a punctuation or not, and
the last one encodes characters’ type, including
four types: numbers, dates, English letters and
the type representing other characters. See de-
tail description and the example in (Low et al.
, 2005). We dropped template C−1C1, since,
217
in experiments, it seemed not to perform well
when incorporated by SVMs. Slightly different
from (Low et al. , 2005), character set repre-
senting dates are expanded to include o� p
 o p oM p oH p os p o
� p,
the Chinese characters for “day”, “month”, “year”,
“hour”,“minute”,“second”, respectively.
2 Post-processing Rules
Segmentation results of SVM-based segmenter
have their particular properties. In respect to the
properties of segmentation results produced by the
SVM-based segmenter, we extracted solely from
training data comprehensive and effective post-
processing rules, which are grouped into two cate-
gories: The rules, termed IV rules, make ef-
forts to fix segmentation errors of character se-
quences, which appear both in training and test-
ing data; Rules seek to recall some OOV(Out
Of Vocabulary) words, termed OOV rules. In
practice, we sampled out a subset from train-
ing dataset as a development set for the analysis
of segmentation results produced by SVM-based
segmenter. Note that, in the following, we defined
Vocabulary to be the collection of words ap-
pearing in training dataset and Segmentation
Unit to be any isolated character sequence as-
sumed to be a valid word by a segmenter. A
segmentation unit can be a correctly seg-
mented word or an incorrectly segmented charac-
ter sequence.
2.1 IV Rules
The following rules are named IV rules, pur-
suing the consistence between segmentation re-
sults and training data. The intuition underlying
the rules is that since training data give somewhat
specific descriptions for most of the words in it, a
character sequence in testing data should be seg-
mented in accordance with training data as much
as possible.
Ahead of post-processing, all words in the
training data are grouped into two distinct sets:
the uniquity set, which consists of words
with unique segmentation in training data and the
ambiguity set, which includes words having
more than one distinct segmentations in training
data. For example, the character sequence o�W
@ phas two kinds of segmentations, as o�W
@ p(new century) and o�W@ p(as a compo-
nent of some Named-Entity, such as the name of a
restaurant).
• For each word in the uniquity set, check
whether it is wrongly segmented into more
than one segmentation units by the SVM-
based segmenter. If true, the continuous seg-
mentation units corresponding to the word
are grouped into the united one. The in-
tuition underlying this post-processing rule
is that SVM-based segmenter prefers two-
character words or single-character words
when confronting the case that the segmenter
has low self-confidence in some character-
sequence segmentation. For example, o�
�� p(duplicate) was segmented as o�
�� pand odB p(unify) was split
into odB p. This phenomenon is
caused by the imbalanced data distribution.
Specifically, characters belonging to category
word-stem are much less than other three
categories.
• For each segmentation unit in the result
produced by SVM-based segmenter, check
whether the unit can be segmented into more
than one IV words and, meanwhile, the words
exist in a successive form for at least once in
training data . If true, replace the segmen-
tation unit with corresponding continuously
existing words. The intuition underlying this
rule is that SVM-based segmenter tends to
combine a word with some suffix, such as
 o� p a o� p, two Chinese characters
representing “person”. For example, o
 � p(Person in registration) tends to be
grouped as a single unit.
• For any sequence in the ambiguity set, such
as o�W@ p, check if the correct seg-
mentation can be determined by the con-
text surrounding the sequence. Without los-
ing the generality, in the following explana-
tion, we assume each sequence in the am-
biguity set has two distinct segmentations.
we collected from training data the word
preceding a sequence where each existence
of the sequence has one of its segmenta-
tions, into a collection, named preceding
word set, and, correspondingly, the fol-
lowing word into another set, which is
termed following word set. Analog-
ically, we can produce preceding word
218
set and following word set for an-
other case of segmentation. When an am-
biguous sequence appears in testing data, the
surrounding context (in fact, just one preced-
ing word and a following word) is extracted.
If the context has overlapping with either of
the pre-extracted contexts of the same se-
quence which are from training data, the seg-
mentation corresponding to one of the con-
texts is retained.
• More over, we took a look into the annotation
errors existing in training data. We assume
there unavoidably exist some annotation mis-
takes. For example, in UPENN, the sequence
 o�
� p(abbreviation for China and Amer-
ica) exists, for eighty-seven times, as a whole
word and only one time, exists as o�
� p.
We regarded the segmentation o�
� pas
an annotation error. Generally, when the ra-
tio of two kinds of segmentations is greater
than a pre-determined threshold (the value is
set seven in our system), the sequence is re-
moved from the ambiguity set and added as
a word of unique segmentation into the uniq-
uity set.
2.2 OOV Rules
The following rules are termed OOV rules,
since they are utilized to recall some of the
wrongly segmented OOV words. A OOV word
is frequently segmented into two continuous OOV
segmentation units. For example, the OOV
word o��� p(Vatican) was frequently seg-
mented as o��� p, where both o�
� pand o� pare OOV character sequences.
Continuous OOVs present a strong clue of po-
tential segmentation errors. A rule is designed
to merge some of continuous OOVs into a cor-
rect segmentation unit. The designed rule is ap-
plicable to all four corpora. Moreover, since dis-
tinction between different segmentation standards
frequently leads to very different segmentation of
a same OOV words in different corpora, we de-
signed rules particularly for MSRA and UPENN
respectively, to recall more OOVs.
• For two continuous OOVs, check whether
at least one of them is a single-character
word. If true, group the continuous OOVs
into a segmentation unit. The reason for
the constraint of at least one of continuous
OOVs being single-character word is that not
all continuous OOVs should be combined,
for example, o��9 p, both o�
 p(Germany merchant) and o�9 p(the
company name) are OOVs, but this sequence
is a valid segmentation unit. On the other
hand, we assume appropriately that most of
the cases for character being single-character
word have been covered by training data.
That is, once a single character is a OOV seg-
mentation unit, there exists a segmentation
error with high possibility.
• MSRA has very different segmentation stan-
dard from other three corpora, mainly be-
cause it requires to group several continuous
words together into a Name Entity. For ex-
ample, the word o�S��� p(the Min-
istry of Foreign Affairs of China) appear-
ing in MSRA is generally annotated into two
words in other corpora, as o�S p(China)
and o��� p(the Ministry of Foreign Af-
fairs). In our system, we first gathered all
the words from the training data whose length
are greater than six Chinese characters, filter-
ing out dates and numbers, which was cov-
ered by Finite State Automation as
a pre-processing stage. For each words col-
lected, regard the first two and three charac-
ters as NE prefix, which indicates the be-
ginning of a Name Entity. The collection of
prefixes is termed Sp(refix). Analogously, the
collection Ss(uffix) of suffixes is brought up
in the same way. Obviously not all the pre-
fixes (suffixes) are good indicators for Name
Entities. Partly inheriting from (Brill, 1995),
we applied error-driven learning to filter pre-
fixes in Sp and suffixes in Ss. Specifically,
if a prefix and a suffix are both matched in
a sequence, all the characters between them,
together with the prefix and the suffix, are
merged into a single segmentation unit. The
resulted unit is compared with corresponding
sequence in training data. If they were not ex-
actly matched, the prefix and suffix were re-
moved from collections respectively. Finally
resulted Sp and Ss are utilized to recognize
Name Entities in the initial segmentation re-
sults.
• UPENN has different segmentation standard
from other three corpora in that, for some
219
Corpus R P F ROOV RIV
AS 0.949 0.940 0.944 0.694 0.960
MSRA 0.955 0.956 0.956 0.650 0.966
UPENN 0.940 0.914 0.927 0.634 0.969
CITYU 0.965 0.971 0.968 0.719 0.981
Table 1: Our official SIGHAN bakeoff results
Locations, such as o�g p(Beijing
) and Organizations, such as o��
� p(the Ministry of Foreign Affairs), the
last Chinese character presents a clue that
the character with high possibility is a suf-
fix of some words. In fact, SVM-based seg-
menter sometimes mistakenly split an OOV
word into a segmentation unit followed by a
suffix. Thus, when some suffixes exist as a
single-character segmentation unit, it should
be grouped with the preceding segmentation
unit. Undoubtedly not all suffixes are appro-
priate to this rule. To gather a clean collec-
tion of suffixes, we first clustered together the
words with the same suffix, filtering accord-
ing to the number of instances in each clus-
ter. Second, the same as above, error-driven
method is utilized to retain effective suffixes.
3 Evaluation Results
We evaluated the Chinese word segmentation
system in the close track, on all four cor-
pora, namely Academis Sinica (AS), City Uni-
versity of Hong Kong (CITYU), Microsoft Re-
search (MSRA), and University of Pennsylva-
nia/University of Colorado (UPENN). The results
are depicted in Table 1, where columns R,P and
F refer to Recall, Precision, F measure
respectively, and ROOV , RIV for the recall of out-
of-vocabulary words and in-vocabulary words.
In addition to final results reported in Bake-
off, we also conducted a series of experiments to
evaluate the contributions of IV rules and OOV
rules. The experimental results are showed in
Table 2, where V1, V2, V3 represent versions
of our segmenters, which compose differently of
components. In detail, V1 represents the basic
SVM-based segmenter; V2 represents the seg-
menter which applied IV rules following SVM-
based segmentation; V3 represents the segmenter
composing of all the components, that is, includ-
ing SVM-based segmenter, IV rules and OOV
rules. Since the OOV ratio is much lower than IV
correspondence, the improvement made by OOV
rules is not so dramatic as IV rules.
Corpus V1 v2 v3
AS 0.932 0.94 0.944
MSRA 0.939 0.954 0.956
UPENN 0.914 0.923 0.927
CITYU 0.955 0.966 0.968
Table 2: Word segmentation accuracy(F Measure)
resulted from post-processing rules
4 Conclusions and future work
We added post-processing rules to SVM-based
segmenter. By doing so, we our segmentation sys-
tem achieved comparable results in the close track,
on all four corpora. But on the other hand, post-
processing rules have the problems of confliction,
which limits the number of rules. We expect to
transform rules into features of SVM-based seg-
menter, thus incorporating information carried by
rules in a more elaborate manner.
Acknowledgements
This research was supported in part by the Na-
tional Natural Science Foundation of China(No.
60473140) and by Program for New Century Ex-
cellent Talents in University(No. NCET-05-0287).

References
Nianwen Xue and Libin Shen. 2003. Chinese Word
segmentation as LMR tagging. In Proceedings of
the Second SIGHAN Workshop on Chinese Lan-
guage Processing,pages 176-179.
Vladimir N. Vapnik. 1995. The Nature of Statistical
Learning Theory. Berlin: Springer-Verlag.
Jin Kiat Low, Hwee Tou Ng and Wenyuan Guo. 2005.
A Maximum Entropy Approach to Chinese Word
Segmentation. In Proceeding of the Fifth SIGHAN
Workshop on Chinese Language Processing, pages
161-164.
Eric.Brill. 1995. Transformation-based error-driven
learning and natural language processing:A case
study in part-of-speech tagging. Computational Lin-
guistics, 21(4):543-565.
