<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1148">
  <Title>Investigating the Relationship between Word Segmentation Performance and Retrieval Performance in Chinese IR</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Automated processing of written languages such as Chinese involves an inherent word segmentation problem that is not present in Western languages like English. Unlike English words, Chinese words are not explicitly delimited by whitespace, so to perform automated text processing tasks (such as information retrieval) one normally has to first segment the text collection. Typically this involves segmenting the text into individual words. Although the text segmentation problem in Chinese has been heavily investigated recently (Brent and Tao, 2001; Chang, 1997; Ge et al., 1999; Hockenmaier and Brew, 1998; Jin, 1992; Peng and Schuurmans, 2001; Sproat and Shih, 1990; Teahan et al., 2001), most research has focused on the problem of segmenting character strings into individual words, rather than useful constituents. However, we have found that focusing exclusively on words may not lead to the most effective segmentation from the perspective of broad semantic analysis (Peng et al., 2002).</Paragraph>
    <Paragraph position="1"> In this paper we will focus on a simple form of semantic text processing: information retrieval (IR).</Paragraph>
    <Paragraph position="2"> Although information retrieval does not require a deep semantic analysis, to perform effective retrieval one still has to accurately capture the main topic of discourse and relate this to a given query. In the context of Chinese, information retrieval is complicated by the fact that the words in the source text (and perhaps even the query) are not separated by whitespace. This creates a significant amount of additional ambiguity in interpreting sentences and identifying the underlying topic of discourse.</Paragraph>
    <Paragraph position="3"> There are two standard approaches to information retrieval in Chinese text: character-based and word-based. It is usually thought that word-based approaches should be superior, even though character-based methods are simpler and more commonly used (Huang and Robertson, 2000). However, there has been recent interest in the word-based approach, motivated by recent advances in automatic segmentation of Chinese text (Nie et al., 1996; Wu and Tseng, 1993). A common presumption is that word segmentation accuracy should monotonically influence subsequent retrieval performance (Palmer and Burger, 1997). Consequently, many researchers have focused on producing accurate word segmenters for Chinese text indexing (Teahan et al., 2001; Brent and Tao, 2001). However, we have recently observed that low-accuracy word segmenters often yield superior retrieval performance (Peng et al., 2002). This observation was initially a surprise, and it motivated us to conduct a more thorough study of the phenomenon to uncover why retrieval performance decreases when more accurate segmenters are used.</Paragraph>
    <Paragraph position="4"> The relationship between Chinese word segmentation accuracy and information retrieval performance has recently been investigated in the literature. Foo and Li (2001) conducted a series of experiments which suggest that the word segmentation approach does indeed have an effect on IR performance. Specifically, they observe that recognizing words of length two or more can produce better retrieval performance, and that ambiguous words resulting from the word segmentation process can decrease retrieval performance. Similarly, Palmer and Burger (1997) observe that accurate segmentation tends to improve retrieval performance. All of this previous research indicates that there is indeed some sort of correlation between word segmentation performance and retrieval performance. However, the nature of this correlation is not well understood, and previous research uniformly suggests that the relationship is monotonic.</Paragraph>
    <Paragraph position="5"> One reason why the relationship between segmentation and retrieval performance has not been well understood is that previous investigators have not considered a variety of Chinese word segmenters exhibiting a wide range of segmentation accuracies, from low to high. In this paper, we employ three families of Chinese word segmentation algorithms from the recent literature. The first is the standard maximum matching dictionary-based approach. The remaining two were selected because both can be altered by simple parameter settings to obtain different word segmentation accuracies. Specifically, the second Chinese word segmenter we investigated is the minimum description length algorithm of Teahan et al. (2001), and the third is the EM-based technique of Peng and Schuurmans (2001).</Paragraph>
    <Paragraph position="6"> Overall, these segmenters demonstrate word identification accuracies ranging from 44% to 95% on the PH corpus (Brent and Tao, 2001; Hockenmaier and Brew, 1998; Teahan et al., 2001).</Paragraph>
    <Paragraph position="7"> Below we first describe the segmentation algorithms we used and then discuss the information retrieval environment considered (in Sections 2 and 3, respectively). Section 4 then reports the outcome of our experiments on Chinese TREC data, and in Section 5 we attempt to determine the reason for the over-segmentation phenomenon observed.</Paragraph>
  </Section>
</Paper>