<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1148"> <Title>Investigating the Relationship between Word Segmentation Performance and Retrieval Performance in Chinese IR</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Word Segmentation Algorithms </SectionTitle> <Paragraph position="0"> Chinese word segmentation has been extensively researched. However, in Chinese information retrieval the most common tokenization methods are still the simple character based approach and dictionary-based word segmentation. In the character based approach, sentences are tokenized simply by taking each character to be a basic unit. In the dictionary based approach, on the other hand, one pre-defines a lexicon containing a large number of words and then uses heuristic methods such as maximum matching to segment sentences. Below we experiment with these standard methods, but in addition employ two recently proposed segmentation algorithms that allow some control over how accurately words are segmented. The details of these algorithms can be found in the given references. For the sake of completeness we briefly describe the basic approaches here.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Dictionary based word segmentation </SectionTitle> <Paragraph position="0"> The dictionary based approach is the most popular Chinese word segmentation method. The idea is to use a hand-built dictionary of words, compound words, and phrases to index the text. In our experiments we used the longest forward match method, in which text is scanned sequentially and the longest matching word from the dictionary is taken at each successive location. The longest matched strings are then taken as indexing tokens, and shorter tokens within the longest matched strings are discarded.
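The longest forward match procedure is simple enough to sketch directly. The following is a minimal illustration with a toy dictionary; the function name and the four-character maximum word length are our assumptions, and the dictionaries actually used in the experiments are described next:

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Scan left to right, taking the longest dictionary match at each
    position; characters with no dictionary match fall through as
    single-character tokens."""
    tokens = []
    i = 0
    n = len(text)
    while i != n:
        match = text[i]  # fall back to a single character
        # Try the longest candidate first, then shrink.
        for j in range(min(n, i + max_word_len), i + 1, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens
```

The single-character fallback mirrors the character based behavior of typical maximum matching implementations when the lexicon has no entry at the current position.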
In our experiments we used two different dictionaries.</Paragraph> <Paragraph position="1"> The first is the Chinese dictionary used by Gey et al. (1997), which includes 137,659 entries. The second is the Chinese dictionary used by Beaulieu et al. (1997), which contains 69,353 words and phrases.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Compression based word segmentation </SectionTitle> <Paragraph position="0"> The PPM word segmentation algorithm of Teahan et al. (2001) is based on the text compression method of Cleary and Witten (1984). PPM learns an n-gram language model by supervised training on a given set of hand-segmented Chinese text. To segment a new sentence, PPM seeks the segmentation which gives the best compression under the learned model. This has been shown to be a highly accurate segmenter (Teahan et al., 2001). Its quality is affected both by the amount of training data and by the order of the n-gram model. By controlling the amount of training data and the order of the language model we can control the resulting word segmentation accuracy.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 EM based word segmentation </SectionTitle> <Paragraph position="0"> The &quot;self-supervised&quot; segmenter of Peng and Schuurmans (2001) is an unsupervised technique based on a variant of the EM algorithm. This method learns a hidden Markov model of Chinese words, and then segments sentences using the Viterbi algorithm (Rabiner, 1989).
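Both learned-model segmenters ultimately search for the highest-scoring segmentation of an unsegmented string. Under a deliberately simplified unigram word model (the real systems use higher-order n-gram and hidden Markov models), that search reduces to a short Viterbi-style dynamic program; the word_logprob table and the fixed unknown-character penalty below are illustrative assumptions, not parameters from either paper:

```python
import math

def best_segmentation(text, word_logprob, max_word_len=4):
    """Dynamic-programming search for the segmentation with the highest
    total log-probability under a unigram word model.  word_logprob maps
    word to log P(word); unknown single characters get a heavy penalty,
    and unknown longer strings are disallowed."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: score of text[:i]
    back = [0] * (n + 1)             # back[i]: start index of last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            w = text[j:i]
            lp = word_logprob.get(w, -20.0 if len(w) == 1 else -math.inf)
            if best[j] + lp > best[i]:
                best[i] = best[j] + lp
                back[i] = j
    # Recover the word sequence by walking the backpointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))
```

The same skeleton serves both systems: PPM scores candidate segmentations by compressed length under its model, and the self-supervised segmenter decodes with Viterbi under its learned HMM.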
It uses a heuristic technique to reduce the size of the learned lexicon and prevent the acquisition of erroneous word agglomerations.</Paragraph> <Paragraph position="1"> Although the segmentation accuracy of this unsupervised method is not as high as that of the supervised PPM algorithm, it nevertheless obtains reasonable performance and provides a fundamentally different segmentation scheme from PPM. The segmentation performance of this technique can be controlled by varying the number of training iterations and by applying different lexicon pruning techniques.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Information Retrieval Method </SectionTitle> <Paragraph position="0"> We conducted our information retrieval experiments using the OKAPI system (Huang and Robertson, 2000; Robertson et al., 1994). In an attempt to ensure that the phenomena we observe are not specific to a particular retrieval technique, we experimented with a parameterized term weighting scheme which allowed us to control the quality of retrieval performance. We considered a refined term weighting scheme based on the standard term weighting function w = log((N - n + 0.5) / (n + 0.5)),</Paragraph> <Paragraph position="2"> where N is the number of indexed documents in the collection, and n is the number of documents containing a specific term (Sparck Jones, 1979). Many researchers have shown that augmenting this basic function to take into account document length, as well as within-document and within-query frequencies, can be highly beneficial in English text retrieval (Beaulieu et al., 1997). For example, one standard augmentation is to use w = log((N - n + 0.5) / (n + 0.5)) x ((c1 + 1) tf / (K + tf)) x ((c2 + 1) qtf / (c2 + qtf)), with K = c1 ((1 - c3) + c3 dl / avdl).</Paragraph> <Paragraph position="4"> Here tf is within-document term frequency, qtf is within-query term frequency, dl is the length of the document, avdl is the average document length, and c1, c2, c3 are tuning constants that depend on the database and the nature of the queries, and are determined empirically.
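This augmentation belongs to the Okapi BM25 family of weights. As a concrete sketch, here is one common instantiation; the mapping of c1, c2, c3 onto the usual BM25 roles of k1 (term-frequency saturation), k3 (query-term frequency), and b (document-length normalization) is our assumption rather than something stated explicitly in this section:

```python
import math

def bm25_weight(tf, qtf, n, N, dl, avdl, c1=2.0, c2=5.0, c3=0.75):
    """Sketch of an Okapi BM25-style term weight.  Assumes c1 plays the
    role of k1, c2 of k3, and c3 of b; the defaults are the values the
    paper reports for c1, c2, c3."""
    idf = math.log((N - n + 0.5) / (n + 0.5))
    K = c1 * ((1 - c3) + c3 * dl / avdl)       # length-normalized k1
    tf_part = (c1 + 1) * tf / (K + tf)          # term-frequency saturation
    qtf_part = (c2 + 1) * qtf / (c2 + qtf)      # query-term frequency
    return idf * tf_part * qtf_part
```

A document's score for a query is the sum of this weight over the query terms it contains; rarer terms (small n) and more frequent within-document occurrences (large tf) both raise the weight.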
However, to truly achieve state-of-the-art retrieval performance, and also to allow the quality of retrieval to be manipulated, we further augmented this standard term weighting scheme with an extra correction term</Paragraph> <Paragraph position="6"> This correction allows us to account more accurately for the length of the document. Here the prime indicates that the component is added only once per document, rather than once for each term, and</Paragraph> <Paragraph position="8"> if dl > rel avdl, where rel avdl is the average relevant document length calculated from previous queries on the same collection of documents. Overall, this term weighting formula has five tuning constants, c1 to c5, which are all set from previous research on English text retrieval and some initial experiments on Chinese text retrieval. In our experiments, the values of the five constants c1, c2, c3, c4 and c5 were set to 2.0, 5.0, 0.75, 3 and 26 respectively.</Paragraph> <Paragraph position="9"> The key constant is kd, a new tuning constant that we manipulate to control the influence of the correction factor, and hence the retrieval quality. Setting kd to different values yields the different term weighting methods used in our experiments. We tested kd set to values of 0, 6, 8, 10, 15, 20, and 50.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We conducted a series of experiments in word based Chinese information retrieval, where we varied both the word segmentation method and the information retrieval method. We experimented with word segmentation techniques of varying accuracy, and information retrieval methods of varying performance.</Paragraph> <Paragraph position="1"> In almost every case, we observe a nonmonotonic relationship between word segmentation accuracy and retrieval performance, robustly across retrieval methods.
Before describing the experimental results in detail, however, we first describe the performance measures used in the experiments.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Measuring segmentation performance </SectionTitle> <Paragraph position="0"> We evaluated segmentation performance on the Mandarin Chinese corpus PH, due to Guo Jin. This corpus contains one million words of segmented Chinese text from newspaper stories of the Xinhua news agency of the People's Republic of China published between January 1990 and March 1991.</Paragraph> <Paragraph position="1"> To make the definitions precise, first define the original segmented test corpus to be S. We then collapse all the whitespace between words to make a second, unsegmented corpus U, and then use the segmenter to recover an estimate ^S of the original segmented corpus. We measure the segmentation performance by precision, recall, and F-measure on detecting correct words. Here, a word is considered to be correctly recovered if and only if (Palmer and Burger, 1997): (1) a boundary is correctly placed in front of the first character of the word, (2) a boundary is correctly placed at the end of the last character of the word, and (3) there is no boundary between the first and last character of the word.</Paragraph> <Paragraph position="2"> Let N1 denote the number of words in S, let N2 denote the number of words in the estimated segmentation ^S, and let N3 denote the number of words correctly recovered.
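Computing N1, N2, and N3 amounts to comparing the character spans covered by the words of the two segmentations, since a word is correct exactly when both of its boundaries match and no boundary falls inside it. A minimal sketch (function and variable names are ours):

```python
def word_spans(words):
    """Map a token sequence to the set of (start, end) character spans
    its words occupy in the underlying unsegmented string."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def count_matches(reference, hypothesis):
    """Return N1 (words in S), N2 (words in the estimate ^S), and N3
    (correctly recovered words, i.e. spans present in both)."""
    ref = word_spans(reference)
    hyp = word_spans(hypothesis)
    return len(ref), len(hyp), len(ref.intersection(hyp))
```

Both segmentations must of course cover the same underlying character sequence for the span comparison to be meaningful.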
Then the precision, recall, and F-measure are defined as precision: p = N3/N2, recall: r = N3/N1, and F-measure: F = 2pr/(p + r). In this paper we only report performance in F-measure, a single summary measure that combines precision and recall.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Measuring retrieval performance </SectionTitle> <Paragraph position="0"> We used the TREC relevance judgments for each topic, which come from the human assessors of the National Institute of Standards and Technology (NIST). Our statistical evaluation was done by means of the TREC evaluation program. The measures we report are Average Precision: average precision over all 11 recall points (0.0, 0.1, 0.2, ..., 1.0); and R-Precision: precision after the number of documents retrieved equals the number of known relevant documents for a query. Detailed descriptions of these measures can be found in (Voorhees and Harman, 1998).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Data sets </SectionTitle> <Paragraph position="0"> We used the information retrieval test collections from TREC-5 and TREC-6 (Voorhees and Harman, 1998). (Note that the document collection used in the TREC-6 Chinese track was identical to the one used in TREC-5; however, the topic queries differ.) This collection of Chinese text consists of 164,768 documents: 139,801 articles selected from the People's Daily newspaper and 24,988 articles selected from the Xinhua newswire. The original articles are tagged in SGML, and the Chinese characters in these articles are encoded using the GB (Guo-Biao) coding scheme.
Here 0 bytes is the minimum file size, 294,056 bytes is the maximum, and 891 bytes is the average.</Paragraph> <Paragraph position="1"> To provide test queries for our experiments, we considered the 54 Chinese topics provided as part of the TREC-5 and TREC-6 evaluations (28 for TREC-5 and 26 for TREC-6).</Paragraph> <Paragraph position="2"> Finally, for the two learning-based segmentation algorithms, we used two separate training corpora but a common test corpus to evaluate segmentation accuracy. For the PPM segmenter we used 72% of the PH corpus as training data. For the self-supervised segmenter we used 10 MB of data from the data set used in (Ge et al., 1999), which contains one year of People's Daily news service stories. We used the entire PH collection as the test corpus (which gives an unfair advantage to the supervised PPM method, since it is trained on most of the same data).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Segmentation accuracy control </SectionTitle> <Paragraph position="0"> By using the forward maximum matching segmentation strategy with the two dictionaries, Berkeley and City, we obtain segmentation performance of 71% and 85% respectively. For the PPM algorithm, by controlling the order of the n-gram language model used (specifically, order 2 and order 3) we obtain segmenters that achieve 90% and 95% word recognition accuracy respectively. Finally, for the self-supervised learning technique, by controlling the number of EM iterations and altering the lexicon pruning strategy we obtain word segmentation accuracies of 44%, 49%, 53%, 56%, 59%, 70%, 75%, and 77%.
Thus, overall we obtain 12 different segmenters that achieve segmentation performances of 44%, 49%, 53%, 56%, 59%, 70%, 71%, 75%, 77%, 85%, 90%, and 95%.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.5 Experimental results </SectionTitle> <Paragraph position="0"> Now, given the 12 different segmenters, we conducted extensive experiments on the TREC data sets using different information retrieval methods (achieved by tuning the kd constant in the term weighting function described in Section 3).</Paragraph> <Paragraph position="1"> Table 1 shows the average precision and R-precision results obtained on the TREC-5 and TREC-6 queries when basing retrieval on word segmentations at the 12 different accuracies, for a single retrieval method, kd = 10. To illustrate the results graphically, we re-plot this data in Figure 1, in which the x-axis is the segmentation performance and the y-axis is the retrieval performance.</Paragraph> <Paragraph position="2"> (Table 1: segmentation accuracy versus TREC-5 and TREC-6 retrieval performance.) Clearly these curves demonstrate a nonmonotonic relationship between retrieval performance (in both average precision and R-precision) and segmentation accuracy. In fact, the curves show a clear uni-modal shape: for segmentation accuracies of 44% to 70% the retrieval performance increases steadily, it then plateaus for segmentation accuracies between 70% and 77%, and it finally decreases slightly when the segmentation performance increases to 85%, 90%, and 95%.</Paragraph> <Paragraph position="3"> This phenomenon is robustly observed as we alter the retrieval method by setting kd = 0, 6, 8, 15, 20, and 50, as shown in Figures 2 to 7 respectively. To give a more detailed picture of the results, in Figures 8 and 9 we illustrate the full precision-recall curves for kd = 10 at each of the 12 segmentation accuracies, for TREC-5 and TREC-6 queries respectively.
In these figures, the 44% and 49% segmentations are marked with stars; the 53%, 56%, and 59% segmentations with circles; the 70%, 71%, 75%, and 77% segmentations with diamonds; the 85% segmentation with hexagrams; and the 90% and 95% segmentations with triangles. We can see that the curves marked with diamonds lie above the others, while the curves marked with stars lie lowest.</Paragraph> </Section> </Section> </Paper>