Combining Segmenter and Chunker for Chinese Word Segmentation
Masayuki Asahara, Chooi Ling Goh, Xiaojie Wang, Yuji Matsumoto
Graduate School of Information Science
Nara Institute of Science and Technology, Japan
{masayu-a,ling-g,xiaoji-w,matsu}@is.aist-nara.ac.jp
Abstract
Our proposed method is to use a Hidden
Markov Model-based word segmenter and a
Support Vector Machine-based chunker for
Chinese word segmentation. Firstly, input sen-
tences are analyzed by the Hidden Markov
Model-based word segmenter. The word seg-
menter produces n-best word candidates to-
gether with some class information and confi-
dence measures. Secondly, the extracted words
are broken into character units and each char-
acter is annotated with the possible word class
and the position in the word, which are then
used as the features for the chunker. Finally, the
Support Vector Machine-based chunker brings
character units together into words so as to de-
termine the word boundaries.
1 Methods
We participate in the closed test for all four data sets
in the Chinese Word Segmentation Bakeoff. Our method is
based on the following two steps:
1. The input sentence is segmented into a word sequence
by a Hidden Markov Model-based word segmenter. The
segmenter assigns a word class with a confidence
measure to each word at the hidden states. The model
is trained by the Baum-Welch algorithm.
2. Each character in the sentence is annotated with a
word class tag and its position within the word. The
n-best word candidates derived from the word segmenter
are also extracted as features. A Support Vector
Machine-based chunker corrects the errors made by the
segmenter using the extracted features.
We will describe each of these steps in more detail.
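The character-level annotation in step 2 can be sketched as follows. Note that the BIES-style position tags (B = begin, I = inside, E = end, S = single-character word) and the feature layout are assumptions for illustration, not necessarily the authors' exact encoding.

```python
# Sketch of the per-character feature encoding (step 2), assuming a
# BIES-style position scheme; tag names are illustrative.

def char_features(words, classes):
    """Break segmenter output into character units, tagging each
    character with its word class and its position in the word:
    B (begin), I (inside), E (end), S (single-character word)."""
    features = []
    for word, cls in zip(words, classes):
        if len(word) == 1:
            features.append((word, cls, "S"))
        else:
            features.append((word[0], cls, "B"))
            for ch in word[1:-1]:
                features.append((ch, cls, "I"))
            features.append((word[-1], cls, "E"))
    return features

# Example: two segmenter words with hypothetical class ids 3 and 7.
print(char_features(["中国", "人"], [3, 7]))
# [('中', 3, 'B'), ('国', 3, 'E'), ('人', 7, 'S')]
```

These per-character tuples, together with the n-best candidate information, would then serve as the feature columns handed to the chunker.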
1.1 Hidden Markov Model-based Word Segmenter
Our word segmenter is based on a Hidden Markov Model
(HMM). We first decide the number of hidden states
(classes) and assume that each word can belong to
any of the classes with some probability. The problem is
defined as a search for the sequence of word classes
C = c_1, ..., c_n given a word sequence W = w_1, ..., w_n.
The target is to find the W and C for a given input S
that maximize the following probability:

    argmax_{W,C} P(W|C) P(C)
We assume that the word probability P(W|C) is constrained
only by its word class, and that the class probability
P(C) is constrained only by the class of the preceding
word. These probabilities are estimated by the
Baum-Welch algorithm using the training material (see
(Manning and Schütze, 1999)). The learning process is
the same as the well-known use of HMMs for the
part-of-speech tagging problem, except that the number
of states is arbitrarily determined and the initial
probabilities are randomly assigned in our model.
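Once the probabilities are estimated, decoding the best class sequence for a fixed word sequence can be done with the standard Viterbi algorithm. The sketch below is a minimal illustration with toy log-probability tables; in the paper, W and C are searched jointly over a word lattice, which is not shown here.

```python
# Minimal Viterbi sketch for argmax_C P(W|C) P(C) over a fixed word
# sequence. Emission/transition tables here are toy values, not the
# Baum-Welch estimates of the paper.
import math

def viterbi(words, n_classes, log_emit, log_trans, log_init):
    """log_emit[c][w] = log P(w|c); log_trans[p][c] = log P(c|p);
    unseen words get a large negative log-probability."""
    V = [[log_init[c] + log_emit[c].get(words[0], -1e9)
          for c in range(n_classes)]]
    back = []
    for w in words[1:]:
        prev = V[-1]
        row, ptr = [], []
        for c in range(n_classes):
            best = max(range(n_classes),
                       key=lambda p: prev[p] + log_trans[p][c])
            row.append(prev[best] + log_trans[best][c]
                       + log_emit[c].get(w, -1e9))
            ptr.append(best)
        V.append(row)
        back.append(ptr)
    c = max(range(n_classes), key=lambda k: V[-1][k])
    path = [c]
    for ptr in reversed(back):  # follow back-pointers
        c = ptr[c]
        path.append(c)
    return list(reversed(path))

# Toy example: class 0 prefers word "a", class 1 prefers word "b".
log_emit = [{"a": math.log(0.9), "b": math.log(0.1)},
            {"a": math.log(0.1), "b": math.log(0.9)}]
log_half = math.log(0.5)
log_trans = [[log_half, log_half], [log_half, log_half]]
print(viterbi(["a", "b"], 2, log_emit, log_trans, [log_half, log_half]))
# [0, 1]
```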
1.2 Correction by Support Vector Machine-based
Chunker
While the HMM-based word segmenter achieves good
accuracy for known words, it cannot identify compound
words and out-of-vocabulary words. Therefore, we in-
troduce a Support Vector Machine(below SVM)-based
chunker (Kudo and Matsumoto, 2001) to cover the er-
rors made by the segmenter. The SVM-based chunker
re-assigns new word boundaries to the output of the seg-
menter.
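The final boundary re-assignment amounts to merging characters back into words according to the chunk tags the classifier predicts. A minimal sketch, assuming standard B/I chunk tags (B = begins a word, I = continues it) rather than the paper's exact tag set:

```python
# Recover word boundaries from per-character chunk tags, as in the
# chunker's final step; B/I tag names follow the standard chunking
# convention, not necessarily the paper's.

def tags_to_words(chars, tags):
    """Merge characters into words: a 'B' tag opens a new word,
    an 'I' tag extends the current one."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)
        else:
            words.append(words.pop() + ch)
    return words

print(tags_to_words(list("中国人民"), ["B", "I", "B", "I"]))
# ['中国', '人民']
```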
An SVM (Vapnik, 1998) is a binary classifier. Suppose
we have a set of training data for a binary classification
problem: (x_1, y_1), ..., (x_N, y_N), where x_i ∈ R^n is
the feature vector of the i-th sample in the training data
and y_i ∈ {+1, -1} is its class label.

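To make the binary classifier concrete, the decision rule of a trained linear SVM can be sketched as below. The weight vector and bias here are illustrative values; in practice they come from solving the margin-maximization problem over the (x_i, y_i) pairs, which is not shown.

```python
# Decision rule of a trained linear binary SVM: sign(w . x + b).
# The weights are illustrative; training is omitted.

def svm_predict(w, b, x):
    """Return the class label +1 or -1 for feature vector x."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

print(svm_predict([0.5, -1.0], 0.2, [1.0, 0.3]))  # w.x + b = 0.4  -> 1
print(svm_predict([0.5, -1.0], 0.2, [0.0, 1.0]))  # w.x + b = -0.8 -> -1
```

Multi-class chunk tags are handled by combining such binary classifiers, as in the pairwise method of Kudo and Matsumoto (2001).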
References

T. Kudo and Y. Matsumoto. 2001. Chunking with Support
Vector Machines. In Proc. of NAACL 2001, pages
192–199.

C. D. Manning and H. Schütze. 1999. Foundations of
Statistical Natural Language Processing. Chapter 9:
Markov Models, pages 317–340.

Y. Matsumoto, A. Kitauchi, T. Yamashita, Y. Hirano, K.
Takaoka and M. Asahara. 2003. Morphological Analyzer
ChaSen-2.3.0 Users Manual. Tech. Report, Nara
Institute of Science and Technology, Japan.

L. A. Ramshaw and M. P. Marcus. 1995. Text chunking
using transformation-based learning. In Proc. of the 3rd
Workshop on Very Large Corpora, pages 83–94.

V. N. Vapnik. 1998. Statistical Learning Theory. A
Wiley-Interscience Publication.
