<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1148"> <Title>Investigating the Relationship between Word Segmentation Performance and Retrieval Performance in Chinese IR</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> These observations surprised us at first, but they suggest that an interesting phenomenon is at work. To identify the underlying cause, we break the explanation into two parts: one for the first part of the curves, where retrieval performance increases with segmentation accuracy, and one for the region where retrieval performance plateaus and eventually decreases as segmentation accuracy increases.</Paragraph> <Paragraph position="1"> The first part of these performance curves seems easy to explain. At low segmentation accuracies, the segmented tokens do not correspond to meaningful linguistic terms, such as words, which hampers retrieval performance because the term weighting procedure is comparing arbitrary tokens to the query.</Paragraph> <Paragraph position="2"> However, as segmentation accuracy improves, the tokens behave more like true words and the retrieval engine begins to behave more conventionally.</Paragraph> <Paragraph position="3"> However, after a point, when the second regime is reached, retrieval performance no longer increases with improved segmentation accuracy, and eventually begins to decrease. One possible explanation we have found is that a weak word segmenter accidentally breaks compound words into smaller constituents, and this, surprisingly, yields a beneficial effect for Chinese information retrieval.</Paragraph> <Paragraph position="4"> For example, one of the test queries, Topic 34, is about the impact of droughts in various parts of China. 
Retrieval based on the EM-70% segmenter retrieved 84 of 95 relevant documents in the collection, whereas retrieval based on the PPM-95% segmenter retrieved only 52 relevant documents. In fact, only 2 relevant documents were missed by EM-70% but retrieved by PPM-95%, whereas 34 documents were retrieved by EM-70% but not by PPM-95%.</Paragraph> <Paragraph position="5"> In investigating this phenomenon, one finds that the performance drop appears to be due to the inherent nature of written Chinese. That is, in written Chinese many words can legally be represented by their subparts. For example, x40x2axd4 (agriculture plants) is sometimes represented as x2axd4 (plants). In Topic 34, for instance, the PPM-95% segmenter correctly segments x42xef as x42xef (drought disaster) and x40x2axd4 as x40x2axd4 (agriculture plants), whereas the EM-70% segmenter incorrectly segments x42xef as x42 (drought) and xef (disaster), and x40x2axd4 as x40 (agriculture) and x2axd4 (plants). However, by inspecting the relevant documents for Topic 34, we find that there are many Chinese character strings in these documents that are closely related to the correctly segmented word x42xef (drought disaster). These alternative words are x13x42xc7x42xe0xc7x49x42xc7x9ax42xc7x1cx42xc7x42x4b etc. For example, in the relevant document &quot;pd9105-832&quot;, which is ranked 60th by EM-70% and 823rd by PPM-95%, the correctly segmented word x42xef does not appear at all. Consequently, the correct segmentation of x42xef by PPM-95% leads to a much weaker match than the incorrect segmentation by EM-70%. Here EM-70% segments x42xef into x42 and xef, which is not regarded as a correct segmentation. However, there are many matches between the topic and relevant documents which contain only x42. 
This same phenomenon happens with the query word x40x2axd4, since many documents contain only the fragment x2axd4 instead of x40x2axd4, and these documents are all missed by PPM-95% but captured by EM-70%. Although straightforward, these observations suggest a different trajectory for future research on Chinese information retrieval. Instead of focusing on achieving accurate word segmentation, we should pay more attention to issues such as keyword weighting (Huang and Robertson, 2000) and query keyword extraction (Chien et al., 1997). Also, we find that the weak unsupervised segmentation method yields better Chinese retrieval performance than the other approaches, which suggests a promising new avenue for applying machine learning techniques to IR (Sparck Jones, 1991). Of course, despite these results, we expect highly accurate word segmentation to still play an important role in other Chinese information processing tasks such as information extraction and machine translation. This suggests that different evaluation standards for Chinese word segmentation should be applied to different NLP applications.</Paragraph> </Section> </Paper>
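
The Topic 34 argument in the discussion can be illustrated with a small sketch. It is not the retrieval system used in the paper. It assumes a plain token-overlap score, uses English glosses in place of the Chinese characters, and marks compound boundaries with "+". The function names (`segment_coarse`, `segment_fine`, `overlap`) and the sample document are hypothetical. The last lines only check that the retrieval counts reported for Topic 34 are mutually consistent.

```python
# Toy illustration only: English glosses stand in for Chinese characters,
# and "+" marks the boundary between the parts of a compound word.

def segment_coarse(words):
    """Mimic an accurate segmenter: every compound stays one token."""
    return list(words)

def segment_fine(words):
    """Mimic a weak segmenter: every compound is split into its parts."""
    return [part for word in words for part in word.split("+")]

def overlap(query_tokens, doc_tokens):
    """Count query tokens that also occur in the document."""
    doc = set(doc_tokens)
    return sum(1 for token in query_tokens if token in doc)

# Query terms glossed in the text for Topic 34.
query = ["drought+disaster", "agriculture+plants"]
# Hypothetical document that uses only related forms of the query words.
document = ["spring+drought", "plants"]

coarse_score = overlap(segment_coarse(query), segment_coarse(document))
fine_score = overlap(segment_fine(query), segment_fine(document))
print(coarse_score, fine_score)

# Consistency check on the counts reported for Topic 34.
ppm_total, ppm_only, em_only, relevant = 52, 2, 34, 95
both = ppm_total - ppm_only   # relevant documents retrieved by both segmenters
em_total = both + em_only     # should equal the 84 reported for EM-70%
print(both, em_total, round(em_total / relevant, 3), round(ppm_total / relevant, 3))
```

Under these assumptions the coarse segmentation finds no shared tokens, while the fine segmentation matches on the shared subparts, which mirrors the effect described for document pd9105-832.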