Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 193–196,
New York, June 2006. © 2006 Association for Computational Linguistics
Subword-based Tagging by Conditional Random Fields for Chinese Word
Segmentation
Ruiqiang Zhang1,2 and Genichiro Kikui∗ and Eiichiro Sumita1,2
1National Institute of Information and Communications Technology
2ATR Spoken Language Communication Research Laboratories
2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0288, Japan
{ruiqiang.zhang,eiichiro.sumita}@atr.jp
Abstract
We proposed two approaches to improve Chi-
nese word segmentation: a subword-based tag-
ging and a confidence measure approach. We
found the former achieved better performance
than the existing character-based tagging, and
the latter improved segmentation further by
combining the former with a dictionary-based
segmentation. In addition, the latter can be
used to balance out-of-vocabulary rates and
in-vocabulary rates. With these techniques we achieved higher F-scores on the
CITYU, PKU and MSR corpora than the best results from Sighan Bakeoff 2005.
1 Introduction
The character-based "IOB" tagging approach has been widely used in Chinese word
segmentation recently (Xue and Shen, 2003; Peng and McCallum, 2004; Tseng et al.,
2005). Under this scheme, each character of a word is labeled as 'B' if it is the
first character of a multiple-character word, 'O' if the character functions as an
independent word, or 'I' otherwise. For example, "全 (whole) 北京市 (Beijing city)"
is labeled as "全 (whole)/O 北 (north)/B 京 (capital)/I 市 (city)/I".
We found that so far all the existing implementations used character-based IOB
tagging. In this work we propose a subword-based IOB tagging, which assigns tags
to a pre-defined lexicon subset consisting of the most frequent multiple-character
words in addition to single Chinese characters. If only Chinese characters are
used, the subword-based IOB tagging reduces to a character-based one. Taking the
same example mentioned above, "全 (whole) 北京市 (Beijing city)" is labeled as
"全 (whole)/O 北京 (Beijing)/B 市 (city)/I" in the subword-based tagging, where
"北京 (Beijing)/B" is labeled as one unit. We will give a detailed description of
this approach in Section 2.
∗ The second author is now affiliated with NTT.
In addition, we found a clear weakness with the IOB tagging approach: it yields a
very low in-vocabulary (IV) rate (R-iv) in return for a higher out-of-vocabulary
(OOV) rate (R-oov). In the results of the closed test in Bakeoff 2005 (Emerson,
2005), the work of (Tseng et al., 2005), using conditional random fields (CRF) for
the IOB tagging, yielded very high R-oovs in all four corpora used, but the R-iv
rates were lower. While OOV recognition is very important in word segmentation, a
higher IV rate is also desired. In this work we propose a confidence measure
approach to lessen this weakness. With this approach we can adjust R-oov and R-iv
and find an optimal tradeoff. This approach will be described in Section 2.2.
In what follows, we illustrate our word segmentation process in Section 2, where
the subword-based tagging is implemented by the CRF method. Section 3 presents our
experimental results. Section 4 describes current state-of-the-art methods for
Chinese word segmentation, with which our results are compared. Section 5 provides
the concluding remarks.
2 Our Chinese word segmentation process
Our word segmentation process is illustrated in Fig. 1. It
is composed of three parts: a dictionary-based N-gram
word segmentation for segmenting IV words, a subword-
based tagging by the CRF for recognizing OOVs, and a
confidence-dependent word segmentation used for merg-
ing the results of both the dictionary-based and the IOB
tagging. An example exhibiting each step’s results is also
given in the figure.
Since the dictionary-based approach is a well-known
method, we skip its technical descriptions. However,
keep in mind that the dictionary-based approach can pro-
duce a higher R-iv rate. We will use this advantage in the
confidence measure approach.
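For readers unfamiliar with it, here is a minimal sketch of a dictionary-based
segmenter, assuming a simple unigram word model for scoring; our actual system
uses a trigram LM built with the SRI LM toolkit (Section 3), and the function
name, the unknown-character fallback, and the maximum word length are illustrative
assumptions only.

```python
import math

def dict_segment(sentence, vocab, logprob, max_len=6):
    """Dictionary-based segmentation by dynamic programming: find the
    in-vocabulary word sequence with the highest unigram LM score.
    Unknown characters fall back to single-character words."""
    n = len(sentence)
    best = [float("-inf")] * (n + 1)  # best log-score of segmenting sentence[:i]
    back = [0] * (n + 1)              # start position of the last word
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = sentence[j:i]
            if word in vocab or i - j == 1:  # single chars always allowed
                score = best[j] + logprob.get(word, math.log(1e-8))
                if score > best[i]:
                    best[i], back[i] = score, j
    words, i = [], n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return list(reversed(words))
```

With a trigram LM, the same dynamic program runs over a word lattice, keeping LM
histories per position instead of a single best score.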
2.1 Subword-based IOB tagging using CRFs
There are several steps to train a subword-based IOB tag-
ger. First, we extracted a word list from the training data and sorted it in
decreasing order of frequency.
[Figure content: the input "HuangYingChun lives in Beijing city" is first split by
the dictionary-based word segmentation into "Huang Ying Chun lives in Beijing
city"; the subword-based IOB tagging and the confidence-based segmentation then
label it "Huang/B Ying/I Chun/I lives/O in/O Beijing/B city/I", giving the output
"HuangYingChun lives in Beijing city" with the OOV name recognized as one word.]

Figure 1: Outline of word segmentation process
We chose all the single characters and the top multi-character words as a lexicon
subset for the IOB tagging.
If the subset consists of Chinese characters only, it is a
character-based IOB tagger. We regard the words in the
subset as the subwords for the IOB tagging.
Second, we re-segmented the words in the training data into subwords belonging to
the subset, and assigned IOB tags to them. For a character-based IOB tagger, there
is only one possible re-segmentation. However, there are multiple choices for a
subword-based IOB tagger. For example, "北京市 (Beijing-city)" can be segmented as
"北京市 (Beijing-city)/O," or "北京 (Beijing)/B 市 (city)/I," or "北 (north)/B
京 (capital)/I 市 (city)/I." In this work we used forward maximal match (FMM) for
disambiguation, as sketched below. Backward maximal match (BMM) or other
approaches are also applicable. We did not conduct comparative experiments because
the minor differences between these approaches are unlikely to have significant
consequences for the subword-based approach.
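As a concrete illustration of this re-segmentation step, the following is a
minimal sketch of FMM and IOB tag assignment, assuming the subword lexicon is
given as a set of strings; the function names and the maximum match length are our
own, illustrative choices.

```python
def fmm_resegment(word, subwords, max_len=6):
    """Greedily split a word into subwords by forward maximal match."""
    pieces, i = [], 0
    while i < len(word):
        for length in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + length]
            if piece in subwords or length == 1:  # single chars always allowed
                pieces.append(piece)
                i += length
                break
    return pieces

def to_iob(segmented_sentence, subwords):
    """Convert a gold word segmentation into (subword, IOB-tag) training pairs."""
    pairs = []
    for word in segmented_sentence:
        pieces = fmm_resegment(word, subwords)
        if len(pieces) == 1:
            pairs.append((pieces[0], "O"))  # the word is itself one subword
        else:
            pairs.append((pieces[0], "B"))
            pairs.extend((p, "I") for p in pieces[1:])
    return pairs
```

For the example above, to_iob(["北京市"], {"北京"}) yields [("北京", "B"),
("市", "I")], while with "北京市" itself in the lexicon, FMM keeps the word whole
and tags it "O".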
In the third step, we used the CRF approach to train the IOB tagger (Lafferty et
al., 2001) on the training data. We downloaded and used the package "CRF++" from
http://www.chasen.org/~taku/software. Under the CRF model, the probability of an
IOB tag sequence, T = t0t1 · · · tM, given the word sequence, W = w0w1 · · · wM,
is defined by
p(T|W) = \exp\Big( \sum_{i=1}^{M} \big[ \sum_{k} \lambda_k f_k(t_{i-1}, t_i, W) + \sum_{k} \mu_k g_k(t_i, W) \big] \Big) \Big/ Z,

Z = \sum_{T = t_0 t_1 \cdots t_M} \exp\Big( \sum_{i=1}^{M} \big[ \sum_{k} \lambda_k f_k(t_{i-1}, t_i, W) + \sum_{k} \mu_k g_k(t_i, W) \big] \Big)    (1)
where we call fk(ti−1, ti, W) bigram feature functions because these features
depend on both the previous tag ti−1 and the current tag ti, and gk(ti, W) unigram
feature functions because they depend only on the current tag ti. λk and µk are
the model parameters corresponding to the feature functions fk and gk,
respectively. The model parameters were trained by maximizing the log-likelihood
of the training data with the L-BFGS optimization method. To prevent overfitting,
a Gaussian prior was imposed during training.
The unigram features used in our experiments were of the following types:
w0, w−1, w1, w−2, w2, w0w−1, w0w1, w−1w1, w−2w−1, w2w0
where w stands for a word and the subscripts are position indicators: 0 means the
current word; −1 and −2, the first and second words to the left; 1 and 2, the
first and second words to the right.
For the bigram features, we used only the previous and the current tags, t−1t0.
As for feature selection, we simply used absolute counts of each feature in the
training data. We defined a cutoff value for each feature type and selected the
features with occurrence counts over the cutoff, as in the sketch below.
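For concreteness, here is a sketch of how the unigram templates above can be
instantiated and filtered by a count cutoff. The feature-string format and the
single shared cutoff are our simplifications (we set a cutoff per feature type),
and in practice CRF++ performs this expansion itself from a template file.

```python
from collections import Counter

# Unigram templates from the paper: w0, w-1, w1, w-2, w2, w0w-1, w0w1,
# w-1w1, w-2w-1, w2w0 (offsets relative to the current subword).
UNIGRAM_TEMPLATES = [(0,), (-1,), (1,), (-2,), (2,),
                     (0, -1), (0, 1), (-1, 1), (-2, -1), (2, 0)]

def extract_unigram_features(words, i):
    """Instantiate the unigram templates at position i of one sentence's
    subword sequence, padding the boundaries with <S>/</S>."""
    def w(offset):
        j = i + offset
        return words[j] if 0 <= j < len(words) else ("<S>" if j < 0 else "</S>")
    return ["U%d:%s" % (k, "|".join(w(o) for o in offs))
            for k, offs in enumerate(UNIGRAM_TEMPLATES)]

def select_features(corpus, cutoff=2):
    """Keep only features whose absolute count in the training data
    exceeds the cutoff."""
    counts = Counter(f for words in corpus
                     for i in range(len(words))
                     for f in extract_unigram_features(words, i))
    return {f for f, c in counts.items() if c > cutoff}
```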
The forward-backward algorithm was used in training, and the Viterbi algorithm was
used in decoding.
2.2 Confidence-dependent word segmentation
Before moving to this step in Figure 1, we produced two
segmentation results: the one by the dictionary-based ap-
proach and the one by the IOB tagging. However, nei-
ther was perfect. The dictionary-based segmentation pro-
duced results with higher R-ivs but lower R-oovs while
the IOB tagging yielded the contrary results. In this sec-
tion we introduce a confidence measure approach to com-
bine the two results. We define a confidence measure,
CM(tiob|w), to measure the confidence of the results pro-
duced by the IOB tagging by using the results from the
dictionary-based segmentation. The confidence measure
comes from two sources: IOB tagging and dictionary-
based word segmentation. Its calculation is defined as:
CM(t_{iob}|w) = \alpha\, CM_{iob}(t_{iob}|w) + (1 - \alpha)\, \delta(t_w, t_{iob})_{ng}    (2)
where tiob is the word w’s IOB tag assigned by the IOB
tagging; tw, a prior IOB tag determined by the results of
the dictionary-based segmentation. After the dictionary-
based word segmentation, the words are re-segmented
into subwords by FMM before being fed to IOB tagging.
Each subword is given a prior IOB tag, tw. CMiob(t|w), a
confidence probability derived in the process of IOB tag-
ging, is defined as
CM_{iob}(t|w_i) = \sum_{T = t_0 t_1 \cdots t_M,\; t_i = t} p(T|W) \Big/ \sum_{T = t_0 t_1 \cdots t_M} p(T|W)

where the numerator is the sum over all tag sequences that label the word wi as t,
i.e., the marginal probability of tag t at position i.
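This marginal is exactly what the forward-backward algorithm computes over the CRF
lattice. The following is a toy sketch of that computation, assuming per-position
unary log-scores and a transition log-score matrix have already been assembled
from the weighted features; the array layout is our own choice, not CRF++'s.

```python
import numpy as np
from scipy.special import logsumexp

def tag_marginals(unary, trans):
    """Marginal tag probabilities p(t_i = t | W) for a linear-chain CRF,
    via forward-backward in log space.
    unary: (M, K) log-scores of each of K tags at each of M positions.
    trans: (K, K) log-scores of transitions (previous tag -> current tag)."""
    M, K = unary.shape
    fwd = np.empty((M, K))
    bwd = np.zeros((M, K))
    fwd[0] = unary[0]
    for i in range(1, M):
        # sum (in log space) over the previous tag
        fwd[i] = unary[i] + logsumexp(fwd[i - 1][:, None] + trans, axis=0)
    for i in range(M - 2, -1, -1):
        # sum over the next tag
        bwd[i] = logsumexp(trans + (unary[i + 1] + bwd[i + 1])[None, :], axis=1)
    log_z = logsumexp(fwd[-1])        # normalization constant Z
    return np.exp(fwd + bwd - log_z)  # (M, K); each row sums to one
```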
δ(tw,tiob)ng denotes the contribution of the dictionary-
based segmentation. It is a Kronecker delta function de-
fined as
\delta(t_w, t_{iob})_{ng} = \begin{cases} 1 & \text{if } t_w = t_{iob} \\ 0 & \text{otherwise} \end{cases}
In Eq. 2, α is a weight balancing the IOB tagging and the dictionary-based word
segmentation. We set α to 0.7 empirically.
By Eq. 2 the results of the IOB tagging were re-evaluated. A confidence threshold,
t, was defined for making the decision. If the confidence value was lower than t,
the IOB tag was rejected and the dictionary-based segmentation was used;
otherwise, the IOB tagging segmentation was used, and a new OOV word could thus be
recognized. For the two extreme cases, t = 0 gives the pure IOB tagging while
t = 1 gives the dictionary-based approach. In a real application, a satisfactory
tradeoff between R-iv and R-oov can be found by tuning the confidence threshold;
a sketch of this combination is given below. In Section 3.2 we will present the
experimental segmentation results of the confidence measure approach.
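Here is a minimal sketch of this confidence-dependent combination, assuming the
tagger's marginal probabilities CM_iob (e.g., from the forward-backward sketch
above) and the prior tags t_w from the dictionary-based segmentation are already
available; the function and argument names are illustrative.

```python
def combine(subwords, iob_tags, iob_marginals, prior_tags,
            alpha=0.7, threshold=0.8):
    """Confidence-dependent segmentation (Eq. 2).
    iob_marginals[i] = CM_iob(t|w_i) for the tagger's output tag;
    prior_tags[i] = t_w from the dictionary-based segmentation.
    alpha=0.7 and threshold=0.8 follow the settings in Section 3.2."""
    final_tags = []
    for t_iob, cm_iob, t_w in zip(iob_tags, iob_marginals, prior_tags):
        delta = 1.0 if t_w == t_iob else 0.0          # Kronecker delta
        cm = alpha * cm_iob + (1.0 - alpha) * delta   # Eq. 2
        # below the threshold, fall back to the dictionary-based tag
        final_tags.append(t_iob if cm >= threshold else t_w)
    # stitch subwords back into words according to the chosen IOB tags
    words, cur = [], ""
    for w, t in zip(subwords, final_tags):
        if t == "I" and cur:
            cur += w
        else:
            if cur:
                words.append(cur)
            cur = w
    if cur:
        words.append(cur)
    return words
```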
3 Experiments
We used the data provided by Sighan Bakeoff 2005 to
test our approaches described in the previous sections.
The data contain four corpora from different sources:
Academia Sinica (AS), City University of Hong Kong
(CITYU), Peking University (PKU) and Microsoft Re-
search in Beijing (MSR). Since this work was to evaluate
the proposed subword-based IOB tagging, we carried out
the closed test only. Five metrics were used to evaluate segmentation results:
recall (R), precision (P), F-score (F), OOV rate (R-oov) and IV rate (R-iv). For
detailed information on the corpora and these scores, refer to (Emerson, 2005).
For the dictionary-based approach, we extracted a
word list from the training data as the vocabulary. Tri-
gram LMs were generated using the SRI LM toolkit for
disambiguation. Table 1 shows the performance of the
dictionary-based segmentation. Since there were some single-character words
present in the test data but not in the training data, the R-oov rates were not
zero in this experiment; however, no OOV recognition was performed. Hence, this
approach produced lower F-scores, but the R-iv rates were very high.
3.1 Effects of the character-based and the subword-based taggers
The main difference between the character-based and the subword-based taggers lies
in the lexicon subset used for re-segmentation. For the character-based tagging,
we used all the Chinese characters. For the subword-based tagging, we added the
2,000 most frequent multiple-character words to this lexicon. The segmentation
results of the dictionary-based approach were re-segmented
R P F R-oov R-iv
AS 0.941 0.881 0.910 0.038 0.982
CITYU 0.928 0.851 0.888 0.164 0.989
PKU 0.948 0.912 0.930 0.408 0.981
MSR 0.968 0.927 0.947 0.048 0.993
Table 1: Our segmentation results by the dictionary-based approach for the closed
test of Bakeoff 2005. The R-oov rates are very low because no OOV recognition was
applied.
R P F R-oov R-iv
AS 0.951 0.942 0.947 0.678 0.964
0.953 0.940 0.947 0.647 0.967
CITYU 0.939 0.943 0.941 0.700 0.958
0.950 0.942 0.946 0.736 0.967
PKU 0.940 0.950 0.945 0.783 0.949
0.943 0.946 0.945 0.754 0.955
MSR 0.957 0.960 0.959 0.710 0.964
0.965 0.963 0.964 0.716 0.972
Table 2: Segmentation results by pure IOB tagging. For each corpus, the upper
numbers are from the character-based tagger and the lower numbers from the
subword-based tagger.
using the FMM and then labeled with IOB tags by the CRF tagger. The segmentation
results are shown in Table 2, where the upper numbers of each slot were produced
by the character-based approach and the lower numbers by the subword-based
approach. We found that the proposed subword-based approach was effective on the
CITYU and MSR corpora, raising the F-score from 0.941 to 0.946 for the CITYU
corpus and from 0.959 to 0.964 for the MSR corpus. There were no F-score changes
for the AS and PKU corpora, but the recall rates were improved. Comparing Tables 1
and 2, we found the CRF-modeled IOB tagging yielded better segmentation than the
dictionary-based approach. However, the R-iv rates became worse in return for
higher R-oov rates. We tackle this problem with the confidence measure approach.
3.2 Effect of the confidence measure
In Section 2.2, we proposed a confidence measure approach to re-evaluate the
results of the IOB tagging by combining them with the results of the
dictionary-based segmentation. The effect of the confidence measure is shown in
Table 3, where we used α = 0.7 and confidence threshold t = 0.8. In each slot, the
numbers on the top are from the character-based approach while the numbers on the
bottom are from the subword-based approach. We found the results in Table 3 were
better than those in Tables 1 and 2, which proves that the confidence measure
approach achieved better performance than either the dictionary-based segmentation
or the IOB tagging alone. The confidence measure made a tradeoff between R-iv and
R-oov, yielding higher R-oovs than Table 1 and higher R-ivs than Table 2.
R P F R-oov R-iv
AS 0.953 0.944 0.948 0.607 0.969
0.956 0.947 0.951 0.649 0.969
CITYU 0.943 0.948 0.946 0.682 0.964
0.952 0.949 0.951 0.741 0.969
PKU 0.942 0.957 0.949 0.775 0.952
0.947 0.955 0.951 0.748 0.959
MSR 0.960 0.966 0.963 0.674 0.967
0.972 0.969 0.971 0.712 0.976
Table 3: Effects of the combination using the confidence measure. The upper and
lower numbers are from the character-based and the subword-based approaches,
respectively.
AS CITYU MSR PKU
Bakeoff-best 0.952 0.943 0.964 0.950
Ours 0.951 0.951 0.971 0.951
Table 4: Comparison of our results with the best ones from Sighan Bakeoff 2005 in
terms of F-score.
Even with the use of the confidence measure, the subword-based IOB tagging still
outperformed the character-based IOB tagging. This proves that the proposed
subword-based IOB tagging is very effective.
4 Discussion and related work
The IOB tagging approach adopted in this work is not a new idea. It was first used
in Chinese word segmentation by (Xue and Shen, 2003), where maximum entropy
methods were used. Later, this approach was implemented with the CRF-based method
(Peng and McCallum, 2004), which was proved to achieve better results than the
maximum entropy approach because it can solve the label bias problem (Lafferty et
al., 2001).
Our main contribution is to extend the IOB tagging approach from character-based
to subword-based. We proved that the new approach enhances word segmentation
significantly. Our results are listed together with the best results from Bakeoff
2005 in Table 4 in terms of F-score. We achieved the highest F-scores on the
CITYU, PKU and MSR corpora. We think our proposed subword-based tagging played an
important role in these good results. Since it was a closed test, some
information, such as Arabic and Chinese numbers and alphabetical letters, could
not be used. We could yield better results than those shown in Table 4 using such
information. For example, inconsistent errors on foreign names can be fixed if
alphabetical characters are known. In the AS corpus, "Adam Smith" appears as two
words in the training data but as one word, "AdamSmith", in the test data. Our
approach produced wrong segmentations because of this labeling inconsistency.
Another advantage of the subword-based IOB tagging over the character-based is its
speed. The subword-based approach is faster because fewer units need to be
labeled. We observed a speed-up in both training and testing.
The idea of using a confidence measure has appeared in (Peng and McCallum, 2004),
where it was used to recognize OOVs. In this work we used it more delicately: by
way of the confidence measure we combined the results of the dictionary-based and
the IOB-tagging-based segmentations and, as a result, could achieve optimal
performance.
5 Conclusions
In this work, we proposed a subword-based IOB tagging method for Chinese word
segmentation, and proved that with CRFs it outperforms the character-based method.
We also successfully employed the confidence measure to make a
confidence-dependent word segmentation. This approach is effective for producing
the desired segmentation according to users' requirements on R-oov and R-iv.
Acknowledgements
The authors appreciate the reviewers’ effort and good ad-
vice for improving the paper.
References
Thomas Emerson. 2005. The second international Chinese word segmentation bakeoff.
In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju,
Korea.
John Lafferty, Andrew McCallum, and Fernando Pereira.
2001. Conditional random fields: probabilistic models
for segmenting and labeling sequence data. In Proc. of
ICML-2001, pages 591–598.
Fuchun Peng and Andrew McCallum. 2004. Chinese
segmentation and new word detection using condi-
tional random fields. In Proc. of Coling-2004, pages
562–568, Geneva, Switzerland.
Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel
Jurafsky, and Christopher Manning. 2005. A condi-
tional random field word segmenter for Sighan bakeoff
2005. In Proceedings of the Fourth SIGHAN Workshop
on Chinese Language Processing, Jeju, Korea.
Nianwen Xue and Libin Shen. 2003. Chinese word
segmentation as LMR tagging. In Proceedings of the
Second SIGHAN Workshop on Chinese Language Pro-
cessing.