Investigating the Relationship between Word Segmentation
Performance and Retrieval Performance in Chinese IR
Fuchun Peng and Xiangji Huang and Dale Schuurmans and Nick Cercone
School of Computer Science, University of Waterloo
200 University Ave. West, Waterloo, Ontario, Canada, N2L 3G1
ff3peng, jhuang, dale, ncerconeg@uwaterloo.ca
Abstract
It is commonly believed that word segmentation ac-
curacy is monotonically related to retrieval perfor-
mance in Chinese information retrieval. In this pa-
per we show that, for Chinese, the relationship be-
tween segmentation and retrieval performance is in
fact nonmonotonic; that is, at around 70% word
segmentation accuracy an over-segmentation phe-
nomenon begins to occur which leads to a reduction
in information retrieval performance. We demon-
strate this effect by presenting an empirical inves-
tigation of information retrieval on Chinese TREC
data, using a wide variety of word segmentation al-
gorithms with word segmentation accuracies ranging
from 44% to 95%. It appears that the main reason
for the drop in retrieval performance is that correct
compounds and collocations are preserved by accu-
rate segmenters, while they are broken up by less
accurate (but reasonable) segmenters, to a surpris-
ing advantage. This suggests that words themselves
might be too broad a notion to conveniently capture
the general semantic meaning of Chinese text.
1 Introduction
Automated processing of written languages such
as Chinese involves an inherent word segmentation
problem that is not present in western languages like
English. Unlike English, Chinese words are not ex-
plicitly delimited by whitespace, and therefore to
perform automated text processing tasks (such as
information retrieval) one normally has to first seg-
ment the text collection. Typically this involves seg-
menting the text into individual words. Although
the text segmentation problem in Chinese has been
heavily investigated recently (Brent and Tao, 2001;
Chang, 1997; Ge et al., 1999; Hockenmaier and
Brew, 1998; Jin, 1992; Peng and Schuurmans, 2001;
Sproat and Shih, 1990; Teahan et al, 2001) most
research has focused on the problem of segmenting
character strings into individual words, rather than
useful constituents. However, we have found that
focusing exclusively on words may not lead to the
most effective segmentation from the perspective of
broad semantic analysis (Peng et al, 2002).
In this paper we will focus on a simple form of se-
mantic text processing: information retrieval (IR).
Although information retrieval does not require a
deep semantic analysis, to perform effective retrieval
one still has to accurately capture the main topic of
discourse and relate this to a given query. In the con-
text of Chinese, information retrieval is complicated
by the fact that the words in the source text (and
perhaps even the query) are not separated by whites-
pace. This creates a significant amount of additional
ambiguity in interpreting sentences and identifying
the underlying topic of discourse.
There are two standard approaches to information
retrieval in Chinese text: character based and word
based. It is usually thought that word based ap-
proaches should be superior, even though character
based methods are simpler and more commonly used
(Huang and Robertson, 2000). However, there has
been recent interest in the word based approach, mo-
tivated by recent advances in automatic segmenta-
tion of Chinese text (Nie et al, 1996; Wu and Tseng,
1993). A common presumption is that word segmen-
tation accuracy should monotonically influence sub-
sequent retrieval performance (Palmer and Burger,
1997). Consequently, many researchers have focused
on producing accurate word segmenters for Chinese
text indexing (Teahan et al, 2001; Brent and Tao,
2001). However, we have recently observed that low
accuracy word segmenters often yield superior re-
trieval performance (Peng et al, 2002). This obser-
vation was initially a surprise, and motivated us to
conduct a more thorough study of the phenomenon
to uncover the reason for the performance decrease.
The relationship between Chinese word segmenta-
tion accuracy and information retrieval performance
has recently been investigated in the literature. Foo
and Li (2001) have conducted a series of experiments
which suggests that the word segmentation approach
does indeed have effect on IR performance. Specif-
ically, they observe that the recognition of words of
length two or more can produce better retrieval per-
formance, and the existence of ambiguous words re-
sulting from the word segmentation process can de-
crease retrieval performance. Similarly, Palmer and
Burger (1997) observe that accurate segmentation
tends to improve retrieval performance. All of this
previous research has indicated that there is indeed
some sort of correlation between word segmentation
performance and retrieval performance. However,
the nature of this correlation is not well understood,
and previous research uniformly suggests that this
relationship is monotonic.
One reason why the relationship between seg-
mentation and retrieval performance has not been
well understood is that previous investigators have
not considered using a variety of Chinese word seg-
menters which exhibit a wide range of segmenta-
tion accuracies, from low to high. In this paper,
we employ three families of Chinese word segmenta-
tion algorithms from the recent literature. The first
technique we employed was the standard maximum
matching dictionary based approach. The remaining
two algorithms were selected because they can both
be altered by simple parameter settings to obtain
different word segmentation accuracies. Specifically,
the second Chinese word segmenter we investigated
was the minimum description length algorithm of
Teahan et al. (2001), and the third was the EM
based technique of Peng and Schuurmans (2001).
Overall, these segmenters demonstrate word identi-
fication accuracies ranging from 44% to 95% on the
PH corpus (Brent and Tao, 2001; Hockenmaier and
Brew, 1998; Teahan et al, 2001).
Below we first describe the segmentation algo-
rithms we used, and then discuss the information
retrieval environment considered (in Sections 2 and
3 respectively). Section 4 then reports on the out-
come of our experiments on Chinese TREC data,
and in Section 5 we attempt to determine the reason
for the over-segmentation phenomenon witnessed.
2 Word Segmentation Algorithms
Chinese word segmentation has been extensively re-
searched. However, in Chinese information retrieval
the most common tokenziation methods are still
the simple character based approach and dictionary-
based word segmentation. In the character based
approach sentences are tokenized simply by taking
each character to be a basic unit. In the dictionary
based approach, on the other hand, one pre-defines a
lexicon containing a large number of words and then
uses heuristic methods such as maximum matching
to segment sentences. Below we experiment with
these standard methods, but in addition employ two
recently proposed segmentation algorithms that al-
low some control of how accurately words are seg-
mented. The details of these algorithms can be
found in the given references. For the sake of com-
pleteness we briefly describe the basic approaches
here.
2.1 Dictionary based word segmentation
The dictionary based approach is the most popu-
lar Chinese word segmentation method. The idea is
to use a hand built dictionary of words, compound
words, and phrases to index the text. In our experi-
ments we used the longest forward match method in
which text is scanned sequentially and the longest
matching word from the dictionary is taken at each
successive location. The longest matched strings are
then taken as indexing tokens and shorter tokens
within the longest matched strings are discarded. In
our experiments we used two different dictionaries.
The first is the Chinese dictionary used by Gey et
al. (1997), which includes 137,659 entries. The sec-
ond is the Chinese dictionary used by Beaulieu et al.
(1997), which contains 69,353 words and phrases.
2.2 Compression based word segmentation
The PPM word segmentation algorithm of Teahan et
al. (2001) is based on the text compression method
of Cleary and Witten (1984). PPM learns an n-gram
language model by supervised training on a given set
of hand segmented Chinese text. To segment a new
sentence, PPM seeks the segmentation which gives
the best compression using the learned model. This
has been proven to be a highly accurate segmenter
(Teahan et al, 2001). Its quality is affected both by
the amount of training data and by the order of the
n-gram model. By controlling the amount of train-
ing data and the order of language model we can
control the resulting word segmentation accuracy.
2.3 EM based word segmentation
The “self-supervised” segmenter of Peng and Schu-
urmans (2001) is an unsupervised technique based
on a variant of the EM algorithm. This method
learns a hidden Markov model of Chinese words, and
then segments sentences using the Viterbi algorithm
(Rabiner, 1989). It uses a heuristic technique to
reduce the size of the learned lexicon and prevent
the acquisition of erroneous word agglomerations.
Although the segmentation accuracy of this unsu-
pervised method is not as high as the supervised
PPM algorithm, it nevertheless obtains reasonable
performance and provides a fundamentally different
segmentation scheme from PPM. The segmentation
performance of this technique can be controlled by
varying the number of training iterations and by ap-
plying different lexicon pruning techniques.
3 Information Retrieval Method
We conducted our information retrieval experiments
using the OKAPI system (Huang and Robertson,
2000; Robertson et al., 1994). In an attempt to en-
sure that the phenomena we observe are not specific
to a particular retrieval technique, we experimented
with a parameterized term weighting scheme which
allowed us to control the quality of retrieval per-
formance. We considered a refined term weighting
scheme based on the the standard term weighting
function
w0 = logN ¡n+0:5n+0:5 (1)
where N is the number of indexed documents in the
collection, and n is the number of documents con-
taining a specific term (Spark Jones, 1979). Many
researchers have shown that augmenting this basic
function to take into account document length, as
well as within-document and within-query frequen-
cies, can be highly beneficial in English text retrieval
(Beaulieu et al., 1997). For example, one standard
augmentation is to use
w1 = w0 ⁄ (c1 +1)⁄tfK +tf ⁄ (c2 +1)⁄qtfc
2 +qtf
(2)
where
K = c1 ⁄
 
1¡c3 +c3 dlavdl
¶
Here tf is within-document term frequency, qtf is
within-query term frequency, dl is the length of
the document, avdl is the average document length,
and c1, c2, c3 are tuning constants that depend on
the database, the nature of the queries, and are
empirically determined. However, to truly achieve
state-of-the-art retrieval performance, and also to
allow for the quality of retrieval to be manipulated,
we further augmented this standard term weighting
scheme with an extra correction term
w2 = w1 ' kd ⁄y (3)
This correction allows us to more accurately account
for the length of the document. Here ' indicates
that the component is added only once per docu-
ment, rather than for each term, and
y =
8>
>><
>>>
:
ln( dlavdl)+ln(c4) if dl • rel avdl
¡ln(rel avdl
avdl )+ln(c4)
¢‡1¡ dl¡rel avdl
c5⁄avdl¡rel avdl
·
if dl > rel avdl
where rel avdl is the average relevant document
length calculated from previous queries based on the
same collection of documents. Overall, this term
weighting formula has five tuning constants, c1 to
c5, which are all set from previous research on En-
glish text retrieval and some initial experiments on
Chinese text retrieval. In our experiments, the val-
ues of the five arbitrary constants c1, c2, c3, c4 and
c5 were set to 2.0, 5.0, 0.75, 3 and 26 respectively.
The key constant is the quantity kd, which is the
new tuning constant that we manipulate to control
the influence of correction factor, and hence control
the retrieval quality. By setting kd to different val-
ues, we have different term weighting methods in our
experiments. In our experiments, we tested kd set
to values of 0, 6, 8, 10, 15, 20, 50.
4 Experiments
We conducted a series of experiments in word based
Chinese information retrieval, where we varied both
the word segmentation method and the information
retrieval method. We experimented with word seg-
mentation techniques of varying accuracy, and infor-
mation retrieval methods with varying performance.
In almost every case, we witness a nonmonotonic
relationship between word segmentation accuracy
and retrieval performance, robustly across retrieval
methods. Before describing the experimental results
in detail however, we first have to describe the per-
formance measures used in the experiments.
4.1 Measuring segmentation performance
We evaluated segmentation performance on the
Mandarin Chinese corpus, PH, due to Guo Jin. This
corpus contains one million words of segmented Chi-
nese text from newspaper stories of the Xinhua news
agency of the People’s Republic of China published
between January 1990 and March 1991.
To make the definitions precise, first define the
original segmented test corpus to be S. We then
collapse all the whitespace between words to make
a second unsegmented corpus U, and then use the
segmenter to recover an estimate ˆS of the original
segmented corpus. We measure the segmentation
performance by precision, recall, and F-measure on
detecting correct words. Here, a word is considered
to be correctly recovered if and only if (Palmer and
Burger, 1997)
1. a boundary is correctly placed in front of the
first character of the word
2. a boundary is correctly placed at the end of the
last character of the word
3. and there is no boundary between the first and
last character of the word.
Let N1 denote the number of words in S, let N2 de-
note the number of words in the estimated segmen-
tation ˆS, and let N3 denote the number of words
correctly recovered. Then the precision, recall and
F measures are defined
precision: p = N3N2
recall: r = N3N1
F-measure: F = 2£p£rp+r
In this paper, we only report the performance in
F-measure, which is a comprehensive measure that
combines precision and the recall.
4.2 Measuring retrieval performance
We used the TREC relevance judgments for each
topic that came from the human assessors of the
National Institute of Standards and Technology
(NIST). Our statistical evaluation was done by
means of the TREC evaluation program. The mea-
sures we report are Average Precision: average pre-
cision over all 11 recall points (0.0, 0.1, 0.2,..., 1.0);
and R Precision: precision after the number of doc-
uments retrieved is equal to the number of known
relevant documents for a query. Detailed descrip-
tions of these measures can be found in (Voorhees
and Harman, 1998).
4.3 Data sets
We used the information retrieval test collections
from TREC-5 and TREC-6 (Voorhees and Harman,
1998). (Note that the document collection used in
the TREC-6 Chinese track was identical to the one
used in TREC-5, however the topic queries differ.)
This collection of Chinese text consists of 164,768
documents and 139,801 articles selected from the
People’s Daily newspaper, and 24,988 articles se-
lected from the Xinhua newswire. The original arti-
cles are tagged in SGML, and the Chinese characters
in these articles are encoded using the GB (Guo-
Biao) coding scheme. Here 0 bytes is the minimum
file size, 294,056 bytes is the maximum size, and 891
bytes is the average file size.
To provide test queries for our experiments, we
considered the 54 Chinese topics provided as part
of the TREC-5 and TREC-6 evaluations (28 for
TREC-5 and 26 for TREC-6).
Finally, for the two learning-based segmentation
algorithms, we used two separate training corpora
but a common test corpus to evaluate segmentation
accuracy. For the PPM segmenter we used 72% of
the PH corpus as training data. For the the self-
supervised segmenter we used 10M of data from the
data set used in (Ge et al., 1999), which contains one
year of People’s Daily news service stories. We used
the entire PH collection as the test corpus (which
gives an unfair advantage to the supervised method
PPM which is trained on most of the same data).
4.4 Segmentation accuracy control
By using the forward maximum matching segmen-
tation strategy with the two dictionaries, Berkeley
and City, we obtain the segmentation performance
of 71% and 85% respectively. For the PPM algo-
rithm, by controlling the order of the n-gram lan-
guage model used (specifically, order 2 and order
3) we obtain segmenters that achieve 90% and 95%
word recognition accuracy respectively. Finally, for
the self-supervised learning technique, by controlling
the number of EM iterations and altering the lexi-
con pruning strategy we obtain word segmentation
accuracies of 44%, 49%, 53%, 56%, 59%, 70%, 75%,
and 77%. Thus, overall we obtain 12 different seg-
menters that achieve segmentation performances of
44%, 49%, 53%, 56%, 59%, 70%, 71%, 75%, 77%,
85%, 90%, and 95%.
4.5 Experimental results
Now, given the 12 different segmenters, we con-
ducted extensive experiments on the TREC data
sets using different information retrieval methods
(achieved by tuning the kd constant in the term
weighting function described in Section 3).
Table 1 shows the average precision and R-
precision results obtained on the TREC-5 and
TREC-6 queries when basing retrieval on word seg-
mentations at 12 different accuracies, for a single
retrieval method, kd = 10. To illustrate the results
graphically, we re-plot this data in Figure 1, in which
the x-axis is the segmentation performance and the
y-axis is the retrieval performance.
seg. accuracy TREC-5 TREC-6
44% 0.2231/0.2843 0.3424/0.3930
49% 0.2647/0.3259 0.3848/0.4201
53% 0.2999/0.3376 0.4492/0.4801
56% 0.3056/0.3462 0.4473/0.4727
59% 0.3097/0.3533 0.4740/0.4960
70% 0.3721/0.3988 0.5044/0.5072
71% 0.3656/0.4088 0.5133/0.5116
75% 0.3652/0.4000 0.4987/0.5097
77% 0.3661/0.4027 0.4968/0.4973
85% 0.3488/0.3898 0.5049/0.5047
90% 0.3213/0.3663 0.4983/0.5008
95% 0.3189/0.3669 0.4867/0.4933
Table 1: Average precision and R-precision results
on TREC queries when kd = 10.
0.4 0.6 0.8 10.2
0.25
0.3
0.35
0.4
0.45 Relation of segmentation performance and retrieval performance on TREC5 (kd=10)
P−precisionR−precision
0.4 0.6 0.8 10.2
0.3
0.4
0.5
0.6 Relation of segmentation performance and retrieval performance on TREC6 (kd=10)
P−precisionR−precision
Figure 1: Retrieval F-measure (y-axis) versus seg-
mentation accuracy (x-axis) for kd = 10.
Clearly these curves demonstrate a nonmonotonic
relationship between retrieval performance (on the
both P-precision and the R-precision) and segmen-
tation accuracy. In fact, the curves show a clear
uni-modal shape, where for segmentation accura-
cies 44% to 70% the retrieval performance increases
steadily, but then plateaus for segmentation accu-
racies between 70% and 77%, and finally decreases
slightlywhen thesegmentationperformance increase
to 85%,90% and 95%.
This phenomenon is robustly observed as we
alter the retrieval method by setting kd =
0;6;8;15;20;50, as shown in Figures 2 to 7 respec-
tively.
To give a more detailed picture of the results, Fig-
ures 8 and 9 we illustrate the full precision-recall
curves for kd = 10 at each of the 12 segmentation
accuracies, for TREC-5 and TREC-6 queries respec-
tively. In these figures, the 44%, 49% segmentations
are marked with stars, the 53%, 56%, 59% segmen-
tations are marked with circles, the 70%, 71%, 75%,
77% segmentations are marked with diamonds, the
85% segmentation is marked with hexagrams, and
the 90% and 95% segmentations are marked with
triangles. We can see that the curves with the dia-
monds are above the others, while the curves with
stars are at the lowest positions.
0.4 0.6 0.8 10.1
0.15
0.2
0.25
0.3
0.35
0.4 Relation of segmentation performance and retrieval performance on TREC5 (kd=0)
P−precisionR−precision
0.4 0.6 0.8 10.2
0.3
0.4
0.5 Relation of segmentation performance and retrieval performance on TREC6 (kd=0)
P−precisionR−precision
Figure 2: Results for kd = 0.
5 Discussion
The observations were surprising to us at first, al-
though they suggest that there is an interesting phe-
nomenon at work. To attempt to identify the under-
lying cause, we break the explanation into two parts:
one for the first part of the curves where retrieval
performance increases with increasing segmentation
accuracy, and a second effect for the region where
retrieval performance plateaus and eventually de-
creases with increasing segmentation accuracy.
The first part of these performance curves seems
easy to explain. At low segmentation accuracies the
segmented tokens do not correspond to meaningful
0.4 0.6 0.8 10.2
0.25
0.3
0.35
0.4
0.45 Relation of segmentation performance and retrieval performance on TREC5 (kd=6)
P−precisionR−precision
0.4 0.6 0.8 10.2
0.3
0.4
0.5
0.6 Relation of segmentation performance and retrieval performance on TREC6 (kd=6)
P−precisionR−precision
Figure 3: Results for kd = 6.
0.4 0.6 0.8 10.2
0.25
0.3
0.35
0.4
0.45 Relation of segmentation performance and retrieval performance on TREC5 (kd=8)
P−precisionR−precision
0.4 0.6 0.8 10.2
0.3
0.4
0.5
0.6 Relation of segmentation performance and retrieval performance on TREC6 (kd=8)
P−precisionR−precision
Figure 4: Results for kd = 8.
linguistic terms, such as words, which hampers re-
trieval performance because the term weighting pro-
cedure is comparing arbitrary tokens to the query.
However, assegmentationaccuracyimproves, theto-
kens behave more like true words and the retrieval
engine begins to behave more conventionally.
However, after a point, when the second regime
is reached, retrieval performance no longer increases
with improved segmentation accuracy, and eventu-
ally begins to decrease. One possible explanation
for this which we have found is that a weak word
segmenter accidentally breaks compound words into
smaller constituents, and this, surprisingly yields a
beneficial effect for Chinese information retrieval.
For example, one of the test queries, Topic 34,
is about the impact of droughts in various parts of
China. Retrieval based on the EM-70% segmenter
retrieved 84 of 95 relevant documents in the col-
lection, whereas retrieval based on the PPM-95%
segmenter retrieved only 52 relevant documents. In
fact, only 2 relevant documents were missed by EM-
70% but retrieved by PPM-95%, whereas 34 docu-
0.4 0.6 0.8 10.2
0.25
0.3
0.35
0.4
0.45 Relation of segmentation performance and retrieval performance on TREC5 (kd=15)
P−precisionR−precision
0.4 0.6 0.8 10.2
0.3
0.4
0.5
0.6 Relation of segmentation performance and retrieval performance on TREC6 (kd=15)
P−precisionR−precision
Figure 5: Results for kd = 15.
0.4 0.6 0.8 10.2
0.25
0.3
0.35
0.4
0.45 Relation of segmentation performance and retrieval performance on TREC5 (kd=20)
P−precisionR−precision
0.4 0.6 0.8 10.2
0.3
0.4
0.5
0.6 Relation of segmentation performance and retrieval performance on TREC6 (kd==20)
P−precisionR−precision
Figure 6: Results for kd = 20.
ments retrieved by EM-70% and not by PPM-95%.
In investigating this phenomenon, one finds that the
performance drop appears to be due to the inherent
nature of written Chinese. That is, in written Chi-
nese many words can often legally be represented
their subparts. For example, x40x2axd4(agriculture
plants) is sometimes represented asx2axd4(plants). So
for example in Topic 34, the PPM-95% segmenter
correctly segments x42xef as x42xef(drought disas-
ter) and x40x2axd4correctly as x40x2axd4 (agriculture
plants), whereas the EM-70% segmenter incorrectly
segmentsx42xefasx42(drought) andxef(disaster), and
incorrectly segments x40x2axd4as x40(agriculture) and
x2axd4(plants). However, by inspecting the relevant
documents for Topic 34, we find that there are many
Chinese character strings in these documents that
are closely related to the correctly segmented word
x42xef(drought disaster). These alternative words are
x13x42xc7x42xe0xc7x49x42xc7x9ax42xc7x1cx42xc7x42x4betc. For
example, in the relevant document “pd9105-832”,
which is ranked 60th by EM-70% and 823rd by
PPM-95%, the correctly segmented wordx42xefdoes
0.4 0.6 0.8 10.1
0.15
0.2
0.25
0.3
0.35
0.4 Relation of segmentation performance and retrieval performance on TREC5 (kd=50)
P−precisionR−precision
0.4 0.6 0.8 10.2
0.3
0.4
0.5 Relation of segmentation performance and retrieval performance on TREC6 (kd=50)
P−precisionR−precision
Figure 7: Results for kd = 50.
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Recall
P−Precision
Overview of the TREC5 results 44%49%
53%56%
59%70%
71%75%
77%85%
90%95%
Figure 8: TREC5 precision-recall comprehensive
view at kd = 10
not appear at all. Consequently, the correct seg-
mentation for x42xefby PPM-95% leads to a much
weaker match than the incorrect segmentation of
EM-70%. Here EM-70% segmentsx42xefintox42and
xef, which is not regarded as a correct segmentation.
However, there are many matches between the topic
and relevant documents which contain onlyx42. This
same phenomenon happens with the query wordx40
x2axd4since many documents only contain the frag-
ment x2axd4instead of x40x2axd4, and these documents
are all missed by PPM-95% but captured by EM-
70%.
Although straightforward, these observations sug-
gest a different trajectory for future research on Chi-
nese information retrieval. Instead of focusing on
achieving accurate word segmentation, we should
pay more attention to issues such as keyword weight-
ing (Huang and Robertson, 2000) and query key-
word extraction (Chien et al, 1997). Also, we find
that the weak unsupervised segmentation method
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1 Overview of the TREC−6 results
Recall
P−Precision
44%49%
53%56%
59%70%
71%75%
77%85%
90%95%
Figure 9: TREC6 precision-recall comprehensive
view at kd = 10
based yields better Chinese retrieval performance
than the other approaches, which suggests a promis-
ing new avenue to apply machine learning techniques
to IR (Sparck Jones, 1991). Of course, despite these
results we expect highly accurate word segmenta-
tion to still play an important role in other Chinese
information processing tasks such as information ex-
traction and machine translation. This suggests that
somedifferentevaluationstandardsforChineseword
segmentation should be given to different NLP ap-
plications.
6 Acknowledgments
Research supported by Bell University Labs, MI-
TACS and NSERC. We sincerely thank Dr. William
Teahan for supplying us the PPM segmenters.

References
Beaulieu, M. and Gatford, M. and Huang, X. and
Robertson, S. and Walker, S. and Williams, P.
1997. Okapi at TREC-5. In Proceedings TREC-5.
Brent, M. and Tao, X. 2001, Chinese Text Segmen-
tation With MBDP-1: Making the Most of Train-
ing Corpora. In Proceedings ACL-2001.

Buckley, C., Singhal, A., and Mitra, M. 1997. Us-
ing query zoning and correlation within SMART:
TREC-5. In Proceedings TREC-5.

Chang, J.-S. and Su, K.-Y. 1997, An Unsupervised
IterativeMethodforChineseNewLexiconExtrac-
tion, In Int J Comp Ling & Chinese Lang Proc.

Chen, A. and He, J. and Xu, L. and Gey, F. and
Meggs, J. 1997. Chinese Text Retrieval Without
Using a Dictionary. In Proceedings SIGIR-97.
Chien L. and Huang, T. and Chien, M. 1997 In
Proceedings SIGIR-97.

Cleary, J. and Witten, I. 1984. Data compression
using adaptive coding and partial string matching.
In IEEE Trans Communications, 32(4): 396-402.

Foo, S. and Li, H. 2001 Chinese Word Segmentation
Accuracy and Its Effects on Information Retrieval.
In TEXT Technology.

Ge, X., Pratt, W. and Smyth, P. 1999. Discover-
ing Chinese Words from Unsegmented Text. In
Proceedings SIGIR-99.

Gey, F., Chen, A., He, J., Xu, L. and Meggs, J.
1997 Term Importance, Boolean Conjunct Train-
ning Negative Terms, and Foreign Language Re-
trieval: Probabilistic Algorithms at TREC-5. In
Proceedings TREC-5.

Hockenmaier, J. and Brew C. 1998. Error driven
segmentation of Chinese. In Comm. COLIPS,
8(1): 69-84.

Huang, X. and Robertson, S. 2000. A probabilistic
approach to Chinese information retrieval: theory
and experiments. In Proceedings BCS-IRSG 2000.

Jin, W. 1992, Chinese Segmentation and its Disam-
biguation, Tech report, New Mexico State Univ.

Nie, J., Brisebois, M. and Ren, X. 1996. On Chinese
text retrieval. In Proceedings SIGIR-96.

Palmer, D. and Burger, J. 1997. Chinese Word Seg-
mentation and Information Retrieval. In AAAI
Symp Cross-Language Text and Speech Retrieval.

Peng, F., Huang, X., Schuurmans, D., Cercone, N.,
and Robertson, S. 2002. Using Self-supervised
Word Segmentation in Chinese Information Re-
trieval. In Proceedings SIGIR-02.

Peng, F. and Schuurmans, D. 2001. Self-supervised
Chinese Word Segmentation. In Proceedings IDA-
01, LNCS 2189.

Rabiner, L. 1989. A Tutorial on Hidden Markov
Models and Selected Applications in Speech
Recognition. In Proceedings of IEEE, 77(2).

Robertson, S. and Walker, S. 1994. Some Simple
Effective Approximations to the 2-Poisson Model
for Probabilistic Weighted Retrieval. SIGIR-94.

Sparck Jones, K. 1991 The Role of Artificial In-
telligence in Information Retrieval J. Amer. Soc.
Info. Sci., 42(8): 558-565.

Sparck Jones, K. 1979. Search Relevance Weight-
ing Given Little Relevance Information. In J. of
Documentation, 35(1).

Sproat, R. and Shih, C. 1990. A Statistical Method
for Finding Word Boundaries in Chinese Text, In
Comp Proc of Chinese and Oriental Lang, 4.

Teahan, W. J. and Wen, Y. and McNab, R. and
Witten I. H. 2001 A Compression-based Algo-
rithm for Chinese Word Segmentation. In Com-
put. Ling., 26(3):375-393.

Voorhees, E. and Harman, D. 1998. Overview of
the Sixth Text REtrieval Conference (TREC-6),
In Proceedings TREC-6.

Wu, Z. and Tseng, G. 1993. Chinese Text Segmen-
tationforTextRetrieval: AchievementsandProb-
lems. In J. Amer. Soc. Info. Sci., 44(9): 532-542.
