Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 428–435, Sydney, July 2006. ©2006 Association for Computational Linguistics
Unsupervised Segmentation of Chinese Text
by Use of Branching Entropy
Zhihui Jin and Kumiko Tanaka-Ishii
Graduate School of Information Science and Technology
University of Tokyo
Abstract
We propose an unsupervised segmentation method based on an assumption about language data: that the increasing point of the entropy of successive characters is the location of a word boundary. A large-scale experiment was conducted using 200 MB of unsegmented training data and 1 MB of test data, and a precision of 90% was attained with recall around 80%. Moreover, we found that the precision was stable at around 90%, independently of the training data size.
1 Introduction
The theme of this paper is the following assumption:

The uncertainty of tokens coming after a sequence helps determine whether a given position is at a boundary. (A)

Intuitively, as illustrated in Figure 1, the variety of successive tokens at each character inside a word monotonically decreases as the offset length grows, because the longer the preceding character n-gram, the longer the preceding context and the more it restricts the appearance of possible next tokens. For example, it is easier to guess which character comes after "natura" than after "na". On the other hand, the uncertainty at the position of a word border becomes greater, and the complexity increases, as the position is out of context. With the same example, it is difficult to guess which character comes after "natural". This suggests that a word border can be detected by focusing on the differentials of the uncertainty of branching.
Figure 1: Intuitive illustration of the variety of successive tokens and a word boundary

In this paper, we report our study on applying this assumption to Chinese word segmentation by formalizing the uncertainty of successive tokens via the branching entropy (which we mathematically define in the next section). Our intention in this paper is above all to study the fundamental and scientific statistical property underlying language data, so that it can be applied to language engineering.
The above assumption (A) dates back to the fundamental work of Harris (Harris, 1955), who observed that when the number of different tokens coming after each prefix of a word reaches a maximum, the location corresponds to a morpheme boundary. Recently, with the increasing availability of corpora, this property underlying language has been tested through segmentation into words and morphemes. Kempe (Kempe, 1999) reports a preliminary experiment to detect word borders in German and English texts by monitoring the entropy of successive characters for 4-grams. Also, the second author of this paper (Tanaka-Ishii, 2005) has shown how Japanese and Chinese can be segmented into words by formalizing the uncertainty with the branching entropy. Even though the test data in that work was limited to a small amount, the report suggested that assumption (A) holds better when each of the sequence elements forms a semantic unit. This motivated our work to conduct a further, larger-scale test in the Chinese language, which is the only human language consisting entirely of ideograms (i.e., semantic units). In this sense, the choice of Chinese as the language of our work is essential.
If the assumption holds well, the most important and direct application is unsupervised text segmentation into words. Many previous works on unsupervised segmentation can be interpreted as formulating assumption (A) in a similar sense, where branching stays low inside words but increases at a word or morpheme border. None of these works, however, is directly based on (A); they introduce other factors within their overall methodologies. Some works are based on in-word branching frequencies formulated in an original evaluation function, as in (Ando and Lee, 2000) (boundary precision = 84.5%, recall = 78.0%, tested on 12500 Japanese ideogram words). Sun et al. (Sun et al., 1998) use mutual information (boundary precision = 91.8%, no report for recall, 1588 Chinese characters), and Feng (Feng et al., 2004) incorporates branching counts in the evaluation function to be optimized for obtaining boundaries (word precision = 76%, recall = 78%, 2000 sentences). From the performance results listed here, we can see that unsupervised segmentation is far more difficult than supervised segmentation; therefore, the algorithms are complex, and previous studies have tended to be limited in terms of both the test corpus size and the target.

In contrast, as assumption (A) is simple, we keep this simplicity in our formalization and directly test the assumption on a large-scale test corpus consisting of 1001 KB of manually segmented data, with a training corpus consisting of 200 MB of Chinese text.
Chinese is such an important language that supervised segmentation methods are already very mature. The current state-of-the-art segmentation software developed by (Low et al., 2005), which ranked best in the SIGHAN bakeoff (Emerson, 2005), attains word precision and recall of 96.9% and 96.8%, respectively, on the PKU track. There is also free software such as (Zhang et al., 2003) whose performance is also high. Even so, as most supervised methods learn from manually segmented newspaper data, the performance can be insufficient when the input text is not from newspapers. Given that the construction of learning data is costly, we believe the performance can be raised by combining supervised and unsupervised methods.

Figure 2: Decrease in $H(X|X_n)$ for Chinese characters as $n$ is increased
Consequently, this paper veries assump-
tion (A) in a fundamental manner for Chinese
text and addresses the questions of why and to
what extent (A) holds, when applying it to the
Chinese word segmentation problem. We rst
formalize assumption (A) in a general manner.
2 The Assumption
Given a set of elements $\chi$ and a set of n-gram sequences $\chi_n$ formed of $\chi$, the conditional entropy of an element occurring after an n-gram sequence $X_n$ is defined as

$$H(X|X_n) = -\sum_{x_n \in \chi_n} P(x_n) \sum_{x \in \chi} P(x|x_n) \log P(x|x_n), \qquad (1)$$

where $P(x) = P(X = x)$, $P(x|x_n) = P(X = x \mid X_n = x_n)$, and $P(X = x)$ indicates the probability of occurrence of $x$.
A well-known observation on language data states that $H(X|X_n)$ decreases as $n$ increases (Bell et al., 1990). For example, Figure 2 shows how $H(X|X_n)$ shifts when $n$ increases from 1 to 8 characters, where $n$ is the length of a word prefix. This is calculated for all words existing in the test corpus, with the entropy being measured on the learning data (the learning and test data are defined in §4).

This phenomenon indicates that $X$ becomes easier to estimate as the context $X_n$ gets longer. This can be intuitively understood: it is easy to guess that "e" will follow after "Hello! How ar", but it is difficult to guess what comes after the short string "He". The term $-\log P(x|x_n)$ in the above formula indicates the information of a token $x$ coming after $x_n$, and thus the branching after $x_n$. The latter half of the formula, the local entropy value for a given $x_n$,

$$H(X|X_n = x_n) = -\sum_{x \in \chi} P(x|x_n) \log P(x|x_n), \qquad (2)$$

indicates the average information of branching for a specific n-gram sequence $x_n$. As our interest in this paper is this local entropy, we denote $H(X|X_n = x_n)$ simply as $h(x_n)$ in the rest of this paper.
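To make Eq. (2) concrete, the following is a minimal sketch (our own illustration, not the authors' code) that estimates $h(x_n)$ by maximum likelihood from successor counts over a training string; the function name and table layout are assumptions, and a real 200 MB corpus would call for a more memory-conscious n-gram index.

    import math
    from collections import Counter, defaultdict

    def branching_entropy(corpus, max_n=6):
        """Estimate h(x_n) = H(X | X_n = x_n), Eq. (2), for every n-gram
        x_n of length 1..max_n, by maximum likelihood over the counts of
        the characters observed immediately after x_n in the corpus."""
        successors = defaultdict(Counter)
        for n in range(1, max_n + 1):
            for i in range(len(corpus) - n):
                successors[corpus[i:i + n]][corpus[i + n]] += 1
        h = {}
        for gram, counts in successors.items():
            total = sum(counts.values())
            h[gram] = -sum((c / total) * math.log2(c / total)
                           for c in counts.values())
        return h

A table built this way over the training data is all that the boundary conditions introduced later consume.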
The decrease in $H(X|X_n)$ globally indicates that, given an $n$-length sequence $x_n$ and another $(n+1)$-length sequence $y_{n+1}$, the following inequality holds on average:

$$h(x_n) > h(y_{n+1}). \qquad (3)$$

One reason why inequality (3) holds for language data is that there is context in language, and $y_{n+1}$ carries a longer context as compared with $x_n$. Therefore, if we suppose that $x_n$ is the prefix of $x_{n+1}$, then it is very likely that

$$h(x_n) > h(x_{n+1}) \qquad (4)$$

holds, because the longer the preceding n-gram, the longer the shared context. For example, it is easier to guess what comes after $x_6$ = "natura" than what comes after $x_5$ = "natur". Therefore, the decrease in $H(X|X_n)$ can be expressed as the concept that if the context is longer, the uncertainty of the branching decreases on average. Taking the logical contraposition, if the uncertainty does not decrease, the context is not longer, which can be interpreted as follows:

If the entropy of successive tokens increases, the location is at a context border. (B)

For example, in the case of $x_7$ = "natural", the entropy $h(\text{"natural"})$ should be larger than $h(\text{"natura"})$, because it is uncertain what character will come after $x_7$. In the next section, we utilize assumption (B) to detect context boundaries.
Figure 3: Our model for boundary detection based on the entropy of branching
3 Boundary Detection Using the Entropy of Branching

Assumption (B) gives a hint on how to utilize the branching entropy as an indicator of a context boundary. When two semantic units, both longer than 1, are put together, the entropy would appear as in the first graph of Figure 3. The first semantic unit spans offsets 0 to 4, and the second spans offsets 4 to 8, with each unit formed by elements of $\chi$. In the figure, one possible transition of the branching degree is shown, where the plot at $k$ on the horizontal axis denotes the entropy $h(x_{0,k})$, and $x_{n,m}$ denotes the substring between offsets $n$ and $m$.

Ideally, the entropy takes a maximum at 4, because it decreases as $k$ is increased in the ranges $k < 4$ and $4 < k < 8$, and at $k = 4$ it rises. Therefore, the position $k = 4$ is detected as a local maximum when monitoring $h(x_{0,k})$ over $k$. The boundary condition after such observation can be defined as the following:

$B_{max}$: Boundaries are locations where the entropy is locally maximized.

A similar method was proposed by Harris (Harris, 1955), where morpheme borders can be detected by using the local maximum of the number of different tokens coming after a prefix.
This only holds, however, for semantic units longer than 1. Units often have a length of 1, especially in our case with Chinese characters as elements, so that there are many one-character words. If a unit has length 1, the situation will look like the second graph in Figure 3, where three semantic units, $x_{0,4}$, $x_{4,5}$, and $x_{5,8}$, are present, with the middle unit having length 1. First, at $k = 4$, the value of $h$ increases. At $k = 5$, the value may increase or decrease: a longer context results in an uncertainty decrease on average, but an uncertainty decrease does not necessarily accompany a longer context. When $h$ increases at $k = 5$, the situation looks like the second graph. In this case, the condition $B_{max}$ will not suffice, and we need a second boundary condition:

$B_{increase}$: Boundaries are locations where the entropy increases.

On the other hand, when $h$ decreases at $k = 5$, then even $B_{increase}$ cannot be applied to detect $k = 5$ as a boundary. We have other chances to detect $k = 5$, however, by considering $h(x_{i,k})$ for $0 < i < k$. According to inequality (3), a similar trend should be present in the plots of $h(x_{i,k})$: assuming that $h(x_{0,n}) > h(x_{0,n+1})$, we have

$$h(x_{i,n}) > h(x_{i,n+1}), \quad \text{for } 0 < i < n. \qquad (5)$$

The value $h(x_{i,k})$ would hopefully rise for some $i$ if the boundary at $k = 5$ is important, although $h(x_{i,k})$ can increase or decrease at $k = 5$, just as in the case of $h(x_{0,n})$.
Therefore, when the target language contains many one-element units, $B_{increase}$ is crucial for collecting all boundaries. Note that the boundaries detected by $B_{max}$ are included in those detected by $B_{increase}$, and also that $B_{increase}$ is the boundary condition representing assumption (B) more directly.
So far, we have considered only regular-order processing: the branching degree is calculated for the element following $x_n$. We can also consider the reverse order, which involves calculating $h$ for the element preceding $x_n$. In the case of the preceding element, the question is whether the head of $x_n$ forms the beginning of a context boundary.
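To make the two scanning directions concrete, here is a small sketch (ours; the helper names and the reversed-table trick are assumptions, not the authors' implementation) that produces one forward "line" and one backward "line" of entropy values for a fragment, given a table h built by branching_entropy on the training text and a table h_rev built the same way on the reversed training text.

    def entropy_profile(fragment, h, start=0, max_n=6):
        """Forward line: h(x_{start,k}) for k = start+1 .. start+max_n.
        Unseen n-grams yield None."""
        end = min(start + max_n, len(fragment))
        return [h.get(fragment[start:k]) for k in range(start + 1, end + 1)]

    def backward_profile(fragment, h_rev, end, max_n=6):
        """Backward line: branching entropy of the character preceding
        x_{k,end}, read from a table trained on the reversed corpus."""
        lo = max(end - max_n, 0)
        return [h_rev.get(fragment[k:end][::-1])
                for k in range(end - 1, lo - 1, -1)]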
Next, we explain how we actually applied the above formalization to the problem of Chinese segmentation.
4 Data
The whole data for training amounted to 200 MB, taken from the Contemporary Chinese Corpus of the Center of Chinese Linguistics at Peking University (Center for Chinese Linguistics, 2006). It consists of several years of People's Daily newspapers, contemporary Chinese literature, and some popular Chinese magazines. Note that as our method is unsupervised, this learning corpus is plain text without any segmentation.
The test data were constructed by selecting sentences from the manually segmented People's Daily corpus of Peking University. In total, the test data amount to 1001 KB, consisting of 147,026 Chinese words. The word boundaries indicated in the corpus were used as our gold standard.
As punctuation clearly marks boundaries in Chinese text, we pre-processed the test data by segmenting sentences at punctuation marks to form text fragments. Then, from all fragments, n-grams of up to 6 characters were obtained. The branching entropies for all these n-grams occurring within the test data were estimated from the 200 MB of training data.

We used 6 as the maximum n-gram length because Chinese words longer than 5 characters are rare; scanning n-grams up to a length of 6 was therefore sufficient. Another reason is that we actually conducted the experiment up to 8-grams, but the performance did not improve over 6-grams.

Using this list of strings ranging from unigrams to 6-grams and their branching entropies, the test data were processed to obtain the word boundaries.
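A sketch of this preprocessing step might look as follows; the punctuation set is our assumption (any full list of Chinese punctuation marks would do), as are the function names.

    import re

    PUNCT = re.compile(r"[，。、！？；：“”（）]")  # assumed punctuation set

    def fragments(text):
        """Split the test text into fragments at punctuation marks."""
        return [f for f in PUNCT.split(text) if f]

    def all_ngrams(frags, max_n=6):
        """Collect every n-gram of length <= max_n occurring in the
        fragments; their entropies are then looked up in the table
        estimated from the 200 MB training corpus."""
        grams = set()
        for f in frags:
            for n in range(1, max_n + 1):
                grams.update(f[i:i + n] for i in range(len(f) - n + 1))
        return grams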
5 Analysis for Small Examples
Figure 4 shows an actual graph of the entropy shift for the input phrase 未来发展的目标和指导方针 (wei lai fa zhan de mu biao he zhi dao fang zhen, "the aim and guideline of future development"). The upper figure shows the entropy shift for the forward case, and the lower figure shows the entropy shift for the backward case. Note that for the backward case, the branching entropy was calculated for the characters preceding $x_n$.
In the upper figure, there are two lines, each showing the branching entropy after substrings starting from a fixed character. The leftmost line plots h( ), h( ), …, h( ). There are two increasing points, indicating that the phrase was segmented between  and , and between  and . The second line plots h( ) … h( ); the increasing locations are between  and , between  and , and after .

The lower figure is analogous. There are two lines, each showing the branching entropy before substrings ending with a fixed suffix. The rightmost line plots h( ), h( ), …, h( ), running from back to front. We can see increasing points (as seen from back to front) between  and , and between  and . The last line also starts from  and runs from back to front, indicating boundaries between  and , between  and , and just before .
If we consider all the increasing points in all four lines and take their set union, we obtain the correct segmentation 未来 | 发展 | 的 | 目标 | 和 | 指导 | 方针, which is 100% correct in terms of both recall and precision.
In fact, as there are 12 characters in this input, there should be 12 lines, one starting from each character, covering all substrings. For readability, however, we show only two lines each for the forward and backward cases. Also, the maximum length of a line is 6, because we only took 6-grams from the learning data. If we consider all the increasing points in all 12 lines and take the set union, then we again obtain 100% precision and recall. It is amazing how all 12 lines indicate only correct word boundaries.

Also, note how the correct full segmentation is obtained with only partial information from 4 of the 12 lines. Based on this observation, we next explain the algorithm that we used for a larger-scale experiment.
6 Algorithm for Segmentation
Having determined the entropy for all n-grams in the learning data, we scan through each chunk of test data, in both the forward order and the backward order, to determine the locations of segmentation.

As our intention in this paper is above all to study the innate linguistic structure described by assumption (B), we do not want to add any artifacts beyond this assumption. For such exact verification, we have to scan through all possible substrings of an input, which amounts to $O(n^2)$ computational complexity, where $n$ is the input length in characters.

Usually, however, $h(x_{m,n})$ becomes impossible to measure when $n - m$ becomes large. Also, as noted in the previous section, words longer than 6 characters are very rare in Chinese text. Therefore, given a string $x$, all n-grams of length at most 6 are scanned, and the points where the boundary condition holds are output as boundaries.
As for the boundary conditions, we have $B_{max}$ and $B_{increase}$, and we also utilize $B_{ordinary}$, where location $n$ is considered a boundary when the branching entropy $h(x_n)$ is simply above a given threshold. Precisely, the three boundary conditions are:

$B_{max}$: $h(x_n) > val_{max}$, where $h(x_n)$ takes a local maximum,
$B_{increase}$: $h(x_{n+1}) - h(x_n) > val_{delta}$,
$B_{ordinary}$: $h(x_n) > val$,

where $val_{max}$, $val_{delta}$, and $val$ are arbitrary thresholds.
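As a concrete rendering of the scan, the sketch below (ours, under the assumptions stated for the earlier sketches) applies $B_{increase}$ in the forward direction: offset $k$ is marked as a boundary when $h(x_{i,k}) - h(x_{i,k-1}) > val_{delta}$ for some start offset $i$. The backward scan is symmetric, using a table trained on reversed text, and n-grams unseen in training are simply skipped.

    def segment(fragment, h, valdelta=0.0, max_n=6):
        """Forward-direction B_increase scan over all substrings x_{i,k}
        with k - i <= max_n; returns the detected boundary offsets."""
        boundaries = set()
        for i in range(len(fragment) - 1):
            prev = None
            for k in range(i + 1, min(i + max_n, len(fragment)) + 1):
                cur = h.get(fragment[i:k])
                if prev is not None and cur is not None \
                        and cur - prev > valdelta:
                    boundaries.add(k)  # entropy rose when extending to k
                prev = cur
        return sorted(b for b in boundaries if b < len(fragment))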
7 Large-Scale Experiments
7.1 Denition of Precision and Recall
Usually, when precision and recall are ad-
dressed in the Chinese word segmentation do-
main, they are calculated based on the number
of words. For example, consider a correctly
segmented sequence \aaajbbbjcccjddd", with
a,b,c,d being characters and \j" indicating a
word boundary. Suppose that the machine's
result is \aaabbbjcccjddd"; then the correct
words are only \ccc" and \ddd", giving a value
of 2. Therefore, the precision is 2 divided
by the number of words in the results (i.e., 3
for the words \aaabbb", \ccc", \ddd"), giving
67%, and the recall is 2 divided by the total
number of wordsin the golden standard (i.e., 4
for the words \aaa",\bbb", \ccc", \ddd") giv-
ing 50%. We call these values the word pre-
cision and recall, respectively, throughout this
paper.
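In code, the word-based scores of this example can be computed by comparing word spans (a sketch under our own naming):

    def word_prf(predicted_words, gold_words):
        """Word precision/recall: a predicted word counts as correct only
        if both of its boundaries match the gold segmentation."""
        def spans(words):
            out, pos = set(), 0
            for w in words:
                out.add((pos, pos + len(w)))
                pos += len(w)
            return out
        correct = len(spans(predicted_words) & spans(gold_words))
        return correct / len(predicted_words), correct / len(gold_words)

    # "aaabbb|ccc|ddd" vs. gold "aaa|bbb|ccc|ddd" -> (0.667, 0.5)
    print(word_prf(["aaabbb", "ccc", "ddd"], ["aaa", "bbb", "ccc", "ddd"]))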
In our case, we use slightly dierent mea-
sures for the boundary precision and recall,
which are based on the correct number of
boundaries. These scoresarealsoutilized espe-
cially in previous works on unsupervised seg-
mentation (Ando and Lee, 2000) (Sun et al.,
1998). Precisely,
Precision =
N
correct
N
test
(6)
Recall =
N
correct
N
true
; where (7)
N
correct
is the number of correct boundaries in
the result,
N
test
is the number of boundaries in the test
result, and,
N
true
is the number of boundaries in the
golden standard.
For example, in the case of the machine result
being \aaabbbjcccjddd", the precision is 100%
and the recall is 75%. Thus, we consider there
to be no imprecise result as a boundary in the
output of \aaabbbjcccjddd".
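Eqs. (6) and (7) translate directly into code. The example's 75% recall implies that the fragment-final boundary is counted; the offsets below follow that reading (our interpretation, not stated explicitly in the paper):

    def boundary_prf(predicted, gold):
        """Boundary precision (Eq. 6) and recall (Eq. 7) over sets of
        boundary offsets."""
        correct = len(predicted & gold)
        precision = correct / len(predicted) if predicted else 0.0
        recall = correct / len(gold) if gold else 0.0
        return precision, recall

    # "aaabbb|ccc|ddd" vs. gold "aaa|bbb|ccc|ddd":
    print(boundary_prf({6, 9, 12}, {3, 6, 9, 12}))  # -> (1.0, 0.75)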
The crucial reason for using the boundary precision and recall is that boundary detection and word extraction are not exactly the same task. In this sense, assumption (A) or (B) is a general assumption about a boundary (of a sentence, phrase, word, or morpheme). Therefore, the boundary precision and recall measures serve to evaluate boundaries directly.

Figure 5: Precision and recall for the three boundary conditions $B_{increase}$, $B_{ordinary}$, and $B_{max}$

Note that all precision and recall scores from now on in this paper are boundary precision and recall. Even in comparing the supervised methods with our unsupervised method later, the precision and recall values are all recalculated as boundary precision and recall.
7.2 Precision and Recall
The precision and recall graph is shown in Figure 5. The horizontal axis is the precision and the vertical axis is the recall. The three lines from right to left (top to bottom) correspond to $B_{increase}$ ($0.0 \le val_{delta} \le 2.4$), $B_{max}$ ($4.0 \le val_{max} \le 6.2$), and $B_{ordinary}$ ($4.0 \le val \le 6.2$). All are plotted at intervals of 0.1. For every condition, the larger the threshold, the higher the precision and the lower the recall.
We can see how $B_{increase}$ and $B_{max}$ maintain high precision as compared with $B_{ordinary}$. We can also see that boundaries are more easily detected from the change in $h(x_n)$ than from its absolute value.

For $B_{increase}$ in particular, even at $val_{delta} = 0.0$, the precision and recall are still 0.88 and 0.79, respectively. Upon increasing the threshold to $val_{delta} = 2.4$, the precision rises above 0.96 at the cost of a low recall of 0.29. For $B_{max}$, we observe a similar tendency but with lower recall, due to the smaller number of local maximum points as compared with the number of increasing points. Thus, $B_{increase}$ attains the best performance among the three conditions. This supports the correctness of assumption (B).

From now on, we consider only $B_{increase}$ in our remaining experiments.
Figure 6: Precision and recall depending on the training data size
Next, we investigated how the training data size affects the precision and recall (Figure 6). This time, the horizontal axis is the amount of learning data, varying from 10 KB up to 200 MB on a log scale, and the vertical axis shows the precision and recall. The boundary condition is $B_{increase}$ with $val_{delta} = 0.1$.

We can see how the precision always remains high, whereas the recall depends on the amount of data. The precision is stable at an amazingly high value, even when the branching entropy is obtained from a very small corpus of 10 KB. Also, the linear increase in the recall suggests that with more than 200 MB of data, we could expect an even higher recall. As the horizontal axis is on a log scale, however, gigabytes of data would be needed to achieve the last several percent of recall.
7.3 Error Analysis
According to our manual error analysis, the three most frequent error types were the following:

- Numbers: dates, years, quantities (example: 1998 written in Chinese numeral characters)
- One-character words (examples: the functional characters meaning "at", "again", "toward", and "and")
- Compound Chinese words (example: a compound meaning "open mind" being segmented into "open" and "mind")

The poor results with numbers probably arise because the branching entropy for digits is less biased than for ordinary ideograms. Also, for one-character words, our method is limited, as we explained in §3. Both of these problems, however, can be solved by applying special preprocessing for numbers and one-character words, given that many of the one-character words are functional characters, which are limited in number. Such improvements remain for our future work.
The third error type, in fact, is one that could be judged as correct segmentation. In the case of "open mind", the compound was not segmented into two words in the gold standard; therefore, our result was judged as incorrect, although it could also be judged as correct.

The structures of Chinese words and phrases are very similar, and there are no clear criteria for distinguishing between a word and a phrase. The unsupervised method determines the structure and segments words and phrases into smaller pieces. Manual recalculation of the accuracy covering such cases also remains for our future work.
8 Conclusion
We have reported an unsupervised Chinese segmentation method based on the branching entropy. The method rests on the assumption that "if the entropy of successive tokens increases, the location is at a context border." The entropies of n-grams were learned from an unsegmented 200 MB corpus, and the actual segmentation was conducted directly according to this assumption on 1 MB of test data. We found that the precision was as high as 90%, with recall around 80%. We also found an amazing tendency for the precision to remain high regardless of the size of the learning data.

There are two important considerations for our future work. The first is to figure out how to combine the supervised and unsupervised methods. In particular, as the performance of supervised methods can be insufficient on data that are not from newspapers, there is the possibility of combining the supervised and unsupervised methods to achieve higher accuracy on general data. The second is to verify our basic assumption in other languages; in particular, we should undertake experimental studies in languages written with phonogram characters.
References

R.K. Ando and L. Lee. 2000. Mostly-unsupervised statistical segmentation of Japanese: Applications to kanji. In ANLP-NAACL.

T.C. Bell, J.G. Cleary, and I.H. Witten. 1990. Text Compression. Prentice Hall.

Center for Chinese Linguistics. 2006. Chinese corpus. Visited 2006; searchable from http://ccl.pku.edu.cn/YuLiao Contents.Asp, part of it freely available from http://www.icl.pku.edu.cn.

T. Emerson. 2005. The second international Chinese word segmentation bakeoff. In SIGHAN.

H.D. Feng, K. Chen, C.Y. Kit, and X.T. Deng. 2004. Unsupervised segmentation of Chinese corpus using accessor variety. In IJCNLP, pages 255–261.

Z.S. Harris. 1955. From phoneme to morpheme. Language, pages 190–222.

A. Kempe. 1999. Experiments in unsupervised entropy-based corpus segmentation. In Workshop of EACL in Computational Natural Language Learning, pages 7–13.

J.K. Low, H.T. Ng, and W. Guo. 2005. A maximum entropy approach to Chinese word segmentation. In SIGHAN.

M. Sun, D. Shen, and B.K. Tsou. 1998. Chinese word segmentation without using lexicon and hand-crafted training data. In COLING-ACL.

K. Tanaka-Ishii. 2005. Entropy as an indicator of context boundaries: an experiment using a web search engine. In IJCNLP, pages 93–105.

H.P. Zhang, H.Y. Yu, D.Y. Xiong, and Q. Liu. 2003. HHMM-based Chinese lexical analyzer ICTCLAS. In SIGHAN. Visited 2006; available from http://www.nlp.org.cn.