A Word Segmentation Method With Dynamic Adapting To Text
Using Inductive Learning
Zhongjian Wang
Hokuto System Co.,LTD
Oyachi Higashi1-3-23, Atubetuku, Sapporo, 004-0041 Japan
wang@hscnet.co.jp
Kenji Araki
Graduate School of Engineering, Hokkaido University
N13-W8, Kita-ku, Sapporo, 060-8628 Japan
araki@media.eng.hokudai.ac.jp
Koji Tochinai
Graduate School of Business Administration, Hokkai-Gakuen University,
Asahimachi 4-1-40, Toyohira-ku, Sapporo, 062-8605 Japan
Abstract
We have proposed a method of word segmen-
tation for non-segmented language using Induc-
tive Learning. This method uses only surface
information of a text, so that it has an ad-
vantage that is entirely not dependent on any
speciﬁc language. In this method, we consider
that a character string of appearing frequently
in a text has a high possibility as a word. The
method predicts unknown words by recursively
extracting common character strings. With the
proposed method, the segmentation results can
adapt to diﬀerent users and ﬁelds. To evaluate
eﬀectivety for Chinese word segmentation and
adaptability for diﬀerent ﬁelds, we have done
the evaluation experiment with Chinese text of
the two ﬁelds.
1 Instruction
InNLPapplications,wordsegmentationofnon-
segmented language is a very necessary initial
stage(Sun et al., 1998). In the other hands,
with the development of the Internet and popu-
larization of computers, a large amount of text
information in diﬀerent languages on the In-
ternet are increasing explosively, so it is nec-
essary to develop a common method to deal
with multi-language(Yamasita and Matsumoto,
2000). Furthermore, the standard of word seg-
mentation is dependent on a user and destina-
tion of use(Sproatet al., 1996), so that it is nec-
essary that word segmentation can adapt users,
can deal with multi languages.
In our method, we extract recursively a
common character string that occur frequently
in text and call it a common part. When
some common parts contain still same charac-
ter strings, furthermore we extract the same
character string as high dimensional common
parts and the remain parts is called diﬀerent
parts. The high dimensional common parts
maybe have higher possibility as words because
it is extracted by multi steps. Those extracted
common parts and diﬀerent parts are called
WS(Word Segment), and classiﬁed into some
ranks according to extracting condition. The
proposed method segments a non-segmented
sentence into words using the ranks of WS in
order of the higher value of the certainty de-
grees as words. When there are multiple seg-
mentation candidates, the system gets a list of
segmentablecandidates,andpicksacorrectseg-
mentation candidate from the list by using a
value of LEF (Likelihood Evaluation Function,
Section 2.1) and so on. In addition, it is not
necessary to prepare a dictionary and any word
segmentation rules beforehand. A dictionary of
adapting to the user or the ﬁeld is generated
with increasing of processed text. Because only
surface information of a text is used, it is possi-
blethemethod isused todeal withgeneral non-
segmented language. Here Inductive Learning
is the procedure to extract recursively WS by
multi steps(Araki et al., 1995).
2 Algrithm
Fig. 1 shows the outline of the proposed
method.
(1) Input sentences are segmented by word
candidates that were acquired in the dictionary
so far.
Segment by a update dictionary
Prediction over
No
Yes
Input sentences
Predict unknown words using 
Inductive Learning 
Segment by known words
Output segmentation results
Proofreading
Feedback processing
Segmentation
results
Corrected results
Recursively
prediction
Dictionary
Figure 1: Outline
(2) For the remaining part of the charac-
ter strings that are unsegmented by the known
words, the system predicts unknown words by
extracting WS using Inductive Learning.
The system extracts WS as word candidates.
This process is based on the supposition that a
common character string of appearing repeat-
edly in text has high probability as a word.
(3) Theuser judgeswhether theresults ofthe
wordsegmentationiscorrect ornot. Ifthereare
errors in the result, the user will correct errors.
(4) The system compares the proofread re-
sults with the segmentation result to update
the information in the dictionary. Through this
procedure, thecertaintyof WS asa wordiscon-
ﬁrmed and increased.
Here, the WS those are used in correct seg-
mentation are called CW (Correct Word).
2.1 SegmentationbyKnownWords
Input sentence and then the system segments
it into words by registered CW and WS that
the system has got by using Inductive Learning
until that time.
(1) In the ﬁrst step, the system compares the
registered CW or WS in the dictionary with
the character string in the input sentence from
the beginning of the sentence, and ﬁnds out
the same character strings with the registered
words. Thesystemrepeatsthiscomparisonpro-
cess until the end of the sentence is reached.
A list of segmentation candidate is established.
Then the system segments the sentence into
words.
(2) In the second step, however, for the char-
acter strings of multiple segmentations, we use
the registered candidates in order of their ranks
in the dictionary(Section2.3). When there are
more than one word candidate with the same
rank, we decide the correct segmentation from
the list of segmentation candidates by the value
of LEF. We deﬁne LEF as follows:
LEF =
FR+αCS − βES+ γLE
FR+CS − ES+LE
(1)
Where: FR, CS, ES and LE are the fre-
quency of CW or WS appearing in the text, the
frequency of the correct segmentation, the fre-
quency of the erroneous segmentation and the
lengthofCWorWSrespectively. α, β and γ are
coeﬃcients. The optimum coeﬃcients of LEF
are decided by the preliminary experiments us-
ing Greedy method, α=10, β=1 and γ=5.
The word that has the maximum value of
LEFis decided asthe correctsegmentation can-
didate.
(3) When LEF value of the set of possible
segmentations is equal to each other, the cor-
rect segmentation candidate is decided by the
word candidate that the value of ES is mini-
mum, the value of CS is maximum, the value of
FR is maximum, the value of LE is the longest
or the location of segmentation is the leftmost
in a sentence in turn.
2.2 PredictionforUnknownWords
Fig. 2 shows an example of a non-segmented
sentence. In this example, every character rep-
resents a Chinese character, so we use this ex-
ample to express a general sentence of non-
segmented language to present the proposed
method. Those words that are not registered in
the dictionary are predicted by using Inductive
Learning. After the sentences were segmented
by known words, which have been registered in
the dictionary, the unsegmented part of char-
acter string will be used to extract WS. The
prediction method is to ﬁnd the common char-
acter string in text. The extraction procedure
is carried out as Fig. 3 shows: the extraction
of common parts, sift out the common part of
themostpossibility as a word, there-extraction
of common parts and the extraction of diﬀerent
parts.
αβχδκϕepsilon1φγΘπµτγβπαβχδ
ΘπµτγβηΨepsilon1φγτγβαζθ.
Figure 2: An example of non-segmented sen-
tence.
2.2.1 ExtractionofaCommonPart
A common part in non-segmented text is ex-
tracted by two steps:
(1) When a character string appears in text
frequently,wecallitacommoncharacterstring.
If the common character string consists of more
thantwocharacters,weextractitasawordcan-
didate and call it common part and represent it
by S1(Segment one). Here, we use length, fre-
quency andlocation of S1 in the sentence to sift
out it, to get the S1 of the most possible as a
word. At this step, we acquired S1 from the
sentence that is shown in Fig. 2: “αβχδ”,“epsilon1φγ”
and “Θπµτγβ”.
(2) When the character string appears in the
sentence only one times but meanwhile it is
included in other extracted common part and
made up by more than two characters, we also
extract it as a word candidate. For example in
Fig. 2: “τγβ”isincluded in “Θπµτγβ”. There-
fore “τγβ” is extracted and belongs to S1.
2.2.2 ExtractionofaHighDimensional
CommonPartandaDiﬀerent
Part
The extracted S1 at 2.2.1 may still include a
common character string. At this situation, the
common character string can be re-extracted
moreover from the extracted S1. We consider
it has a higher probability as a word that re-
extracted common parts at this procedure. The
conditions of re-extraction are presented as fol-
lows:
(1) The common part can be re-extracted
from the extracted S1 when it includes a com-
mon character string that is more than two
characters. For example, “Θπµτγβ” contains
“τγβ” whichcan be extracted from “Θπµτγβ”,
so “Θπµτγβ(S1)” is equal to “Θπµ(S2)” +
“τγβ(S3)”.
The part of re-extraction is called high di-
mensional common part and represented by S2
(Segment two). The part of remain is called
diﬀerent part and represented by S3 (Segment
three). The S1 is deleted from the dictionary
Extraction of high dimension 
common parts and different parts
Sift out common parts of the 
most possibility as words
Extraction of common parts
Get S1
Get S2 and S3
Unsegmented Character String
Figure 3: WS extraction procedure
when it is divided into S2 and S3.
(2)Furthermoreone charactercanalsobe ex-
tracted as a word candidate when both sides
of it are extracted as a word candidate or
both sides were segmented by known words.
Like “π”in“Θπµτγβπαβχδ” is surrounded by
“Θπµτγβ” and “αβχδ”, and “π” is extracted
as a word candidate belonging to S2.
The extraction procedure is carried out re-
peatedly untilthenew WS cannotbe extracted
and the input can not be segmented.
2.3 SegmentationbyaUpdate
Dictionary
The extracted WS are classiﬁed to “S1”, “S2”,
and “S3”. Those WS that are conﬁrmed
as a word by proofreading process are called
“CW” (Correct Word). Furthermore, the
FR(appearing FRequency), CS(Correct Seg-
mentation frequency), ES(Erroneous Segmen-
tation frequency), LE(LEength) and rankof a
word candidate are rigestered simultaneously.
Word Segmentation is carried out by the up-
date dictionary as2.1.
2.4 FeedbackProcess
After the system segments the sentence into
words, the results are judged whether they are
correct or not by the user. Then the user cor-
rects the errors if there are errors in the results.
The system updates the rankof the registered
CWandWS inthedictionary bycomparingthe
corrected results with the segmentation results.
And the system increases the priority degree of
the words that were used in correct segmenta-
tion and decreases the priority degree of words
thatwereusedin erroneoussegmentations. The
Table 1: Experimental results
Fields Economics Engineering Average
words 92,085 70,017 162,102
CSR[%] 87.50 90.80 89.44
ESR[%] 5.40 5.60 5.45
USR[%] 7.10 3.60 5.11
feedbackprocessisdescribedindetailasfollows:
(1) For the Correct Segmentation Results:
 When the result of segmentation is correct,
the value of FR and CS of a word that is
used to segment are added one.
 If the rankof the words does not belong to
CW, it is changed to CW.
(2) For the Erroneous Segmentation Results:
 If the dictionary does not has the correct
words,thesystemregistersthewordsinthe
dictionary. In this case, their FRs are 1,
their ranks are CW.
 If the dictionary has the correct words, the
system adds one to the value of FR for a
word and changes the value of CL to CW
if it does not belong to CW.
 If the reason of erroneous segmentation is
that theerroneous word wasused, thenthe
ES of erroneous word is added one.
(3) For the Unsegmented Parts:
 The system registers the words in the dic-
tionary, as FR of the words equal 1 and
rankequal CW.
3 Evaluation Experiments
3.1 ExperimentalDataAndProcedure
To evaluate the adaptability of the proposed
method fordiﬀerentﬁeldsandtheeﬀectivityfor
Chinese word segmentation. We use the Chi-
nese text of two specialized ﬁelds from Sinica
Corpus
1
: the economics contains 92,085 words
and the engineering contains 70,017 words. To-
tal words is 162,102. The economics consists of
the text of economic system, economic policy
and economic theory. The engineering consists
of the text of electronics, communication engi-
neering,machineengineeringandnuclearindus-
try.
1
http : //www.sinica.edu.tw/ftms −bin/kiwi.sh
0
10
20
30
40
50
60
70
80
90
100
0 2 4 6 8 10 12 14 16
Segmentation Rate(%)
Number of Words
Economics Engineering
Correct Segmentation Rate
Unsegmentation Rate
Erroneous Segmentation Rate
(x10,000)
Figure 4: The changes of segmentation rates
In order to conﬁrm the adaptability of pro-
posed method to user, we let the initial dic-
tionary empty. We input a paragraph about
hundred words one times and two ﬁelds text in
turns.
3.2 ExperimentalResults
The results of experiment are shown in Table
1. Fig. 4 shows the change of CSR, ESR and
USR. In our method, the correct segmentation
number is the number of correct segmentation
that is judged by a user. The unsegmentation
number is the number when all unsegmented
strings are segmented correctly. The erroneous
segmentation number is the number that sub-
tracts the number of correct segmentation and
unsegmentation from the number of all words
in the input text. To evaluate the experiment
result, we use these formulas of CSR (Correct
Segmentation Rate), ESR (Erroneous Segmen-
tation Rate) and USR (Unsegmented Rate) as
follows:
CSR[%] =
Correct segmentationnumber
Total number of words
×100 (2)
ESR[%] =
Erroneous segmentation number
Total number of words
×100 (3)
USR[%] =
Unsegmentationnumber
Total number of words
×100 (4)
4 Discussion
4.1 AdaptabilityToDiﬀerentFields
Fig. 4 shows the experimental results of two
ﬁelds. When the text is changed to diﬀerent
domain,because appearance of somenew words
of diﬀerent ﬁelds, the correct segmentation rate
0
10
20
30
40
50
60
70
80
90
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
Precision & Recall
Number of words (x10,000)
Precosion
Recall
Figure 5: The ability to predict unknown words
isfalldowntemporary. Howeverwithincreasing
of processed sentence, the correct segmentation
rate goes on increasing quickly.
We may consider that the proposed method
has adaptability for diﬀerent ﬁelds. Sometimes
the correct segmentation rate is a little lower
because the domain of text is a little diﬀerence,
for example: the economics consists of the text
of economic system, economic policy and eco-
nomic theory and so on.
4.2 EvaluationofAbilityforPredicting
UnknownWords
We use 50,000 words to discuss the predicting
abilityofproposedmethodforunknownwords.
Precision[%] =
CWN
TWN
×100 (5)
Recall[%] =
CWN
TUN
×100 (6)
Where, CWNis thenumberofwords thatare
predicted correctly. TWN is the total number
of words that are predicted. TUN is the total
number of unknown words.
The precision and recall are shown in Fig. 5.
The average precision is 26.0%. The average
recall is 31.0%. With increasing of registered
words in the dictionary, prediction eﬀect for
unknown words is becoming well, after 40,000
words are processed the precision and the recall
are 85.0%, 40.0% respectively.
4.3 AnalysisofErroneous
Segmentation
We select 1,000words from thebeginning of the
experimental date and the end of the experi-
mental date respectively, to analysis the reason
of an erroneous segmentation. At the begin-
ning, ESR that is because of unregistered words
is 18.0%, but after 16,000 words are processed,
ESR that is because of unregistered words is
0.9%. However ESR that is caused by ambigu-
ity goes on increasing from 1.6% to 7.0%. ESR
caused by ambiguity is increasing with increas-
ing of registered word in the dictionary. Am-
biguoussegmentation isstill adiﬃcult problem,
so that it is necessary to improve the ability to
deal with ambiguity.
5 Conclusion
The experiment results show the prediction
ability for unknown words by using Inductive
Learning. The experiment results of two ﬁelds
shown the proposed method can adapt to dif-
ferent ﬁelds text. In this paper, the emphasis is
to evaluate the adaptivity of the method to dif-
ferent user and ﬁelds. About comparison with
otherexistedmethodswillbedoneinthefuture.
The proposed method may be used to
computer-aided acquisition of language re-
source. The experimental results show our
proposed method has ability of learning, pre-
dictability for unknown words and eﬀectivity
for Chinese word segmentation. For the future
works, we plan to improve the ability of dealing
with segmentation ambiguity, and use this pro-
posed method for Chinese morphological anal-
ysis.

References
KenjiAraki, YoshioMomouchi, andKojiTochi-
nai. 1995. Evaluation for adaptability of
kana-kanji translation of non-segmentation
japanese kana sentences using inductive
learning. PACLING-II, pages 1–7.
Richard Sproat, Chilin Shih, William Gale,
and Nancy Chang. 1996. A stochastic ﬁnite-
state word-segmentation algorithm for Chi-
nese. Association for Computational Linguis-
tics, 22(3):377–404.
Maosong Sun, Dayang Shen, and Benjamin K
Tsou. 1998. Chinese word segmentation
withoutusing lexicon andhand-crafted train-
ing data. 17th International Conference on
Computational Linguistics, pages 1265–1271.
T. Yamasita and Y. Matsumoto. 2000. Journal
of natural language processing(in japanese).
Framework for Language Independent Mor-
phological Analysis, 7(3):39–56.
