<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1719"> <Title>The First International Chinese Word Segmentation Bakeoff</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Chinese word segmentation is a difficult problem that has received a lot of attention in the literature; reviews of some of the various approaches can be found in (Wang et al., 1990; Wu and Tseng, 1993; Sproat and Shih, 2001). The problem with this literature has always been that it is very hard to compare systems, due to the lack of any common standard test set. Thus, an approach that seems very promising based on its published report is nonetheless hard to compare fairly with other systems, since the systems are often tested on their own selected test corpora.</Paragraph> <Paragraph position="1"> Part of the problem is also that there is no single accepted segmentation standard: There are several, including the four standards used in this evaluation.</Paragraph> <Paragraph position="2"> A number of segmentation contests have been held in recent years within Mainland China, in the context of more general evaluations for Chinese-English machine translation. See (Yao, 2001; Yao, 2002) for the first and second of these; the third evaluation will be held in August 2003. The test corpora were segmented according to the Chinese national standard GB 13715 (GB/T 13715-92, 1993), though some lenience was granted in the case of plausible alternative segmentations (Yao, 2001); so while GB 13715 specifies the segmentation 毛/泽东 for Mao Zedong, 毛泽东 was also allowed. Accuracies in the mid-80s to mid-90s were reported for the four systems that participated in the first evaluation, with higher scores (many in the high nineties) being reported for the second evaluation.</Paragraph> <Paragraph position="3"> The motivations for holding the current contest are twofold. 
First of all, by making the contest international, we are encouraging participation from people and institutions who work on Chinese word segmentation anywhere in the world. The final set of participants in the bakeoff includes two from Mainland China, three from Hong Kong, one from Japan, one from Singapore, one from Taiwan and four from the United States.</Paragraph> <Paragraph position="4"> Secondly, as we have already noted, there are at least four distinct standards in active use in the sense that large corpora are being developed according to those standards; see Section 2.1. It has also been observed that different segmentation standards are appropriate for different purposes: the segmentation standard that one might prefer for information retrieval applications is likely to be different from the one that one would prefer for text-to-speech synthesis; see (Wu, 2003) for useful discussion. Thus, while we do not subscribe to the view that any of the extant standards are, in fact, appropriate for any particular application, nevertheless, it seems desirable to have a contest where people are tested against more than one standard.</Paragraph> <Paragraph position="5"> A third point is that we decided early on that we would not be lenient in our scoring, so that alternative segmentations as in the case of 毛泽东 (Mao Zedong), cited above, would not be allowed. While it would be fairly straightforward (in many cases) to automatically score both alternatives, we felt we could provide a more objective measure if we went strictly by the particular segmentation standard being tested on, and simply did not get into the business of deciding upon allowable alternatives.</Paragraph> <Paragraph position="6"> Comparing segmenters is difficult. 
This is not only because of differences in segmentation standards but also because of differences in the design of systems: Systems based exclusively (or even primarily) on lexical and grammatical analysis will often be at a disadvantage in the comparison relative to systems trained exclusively on the training data. Competitions may also fail to predict the performance of the segmenter on new texts outside the training and testing sets. The handling of out-of-vocabulary words becomes a much larger issue in these situations than is accounted for within the test environment: A system that performs admirably in the competition may perform poorly on texts from different registers.</Paragraph> <Paragraph position="7"> Another issue that is not accounted for in the current collection of evaluations is the handling of short strings with minimal context, such as queries submitted to a search engine. This has been studied indirectly through the cross-language information retrieval work performed for the TREC 5 and TREC 6 competitions (Smeaton and Wilkinson, 1997; Wilkinson, 1998).</Paragraph> <Paragraph position="8"> This report summarizes the results of this First International Chinese Word Segmentation Bakeoff, provides some analysis of the results, and makes specific recommendations for future bakeoffs. One thing we do not do here is get into the details of specific systems; each of the participants was required to provide a four-page description of their system along with detailed discussion of their results, and these papers are published in this volume.</Paragraph> </Section> </Paper>