<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1719">
  <Title>The First International Chinese Word Segmentation Bakeoff</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Details of the contest
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Corpora
</SectionTitle>
      <Paragraph position="0"> The corpora are detailed in Table 1. Links to descriptions of the corpora can be found at http://www.sighan.org/bakeoff2003/bakeoff_instr.html; publications on specific corpora are (Huang et al., 1997) for Academia Sinica and (Xia, 1999) for the Chinese Treebank; the Beijing University standard is very similar to that outlined in (GB/T 13715-92, 1993). Table 1 lists the abbreviations for the four corpora that will be used throughout this paper. The suffixes &quot;o&quot; and &quot;c&quot; denote the open and closed tracks, respectively: thus &quot;ASo,c&quot; denotes the Academia Sinica corpus, both open and closed tracks, and &quot;PKc&quot; denotes the Beijing University corpus, closed track.</Paragraph>
      <Paragraph position="1"> During the course of this bakeoff, a number of inconsistencies in segmentation were noted in the CTB corpus by one of the participants. This was done early enough so that it was possible for the CTB developers to correct some of the more common cases, both in the training and the test data. The revised training data was posted for participants, and the revised test data was used during the testing phase.</Paragraph>
      <Paragraph position="2"> Inconsistencies were also noted by another participant for the AS corpus. Unfortunately this came too late in the process to correct the data. However, some informal tests on the revised testing data indicated that the differences were minor.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Rules and Procedures
</SectionTitle>
      <Paragraph position="0"> The contest followed a strict set of guidelines and a rigid timetable. The detailed instructions for the bakeoff can be found at http://www.sighan.org/bakeoff2003/bakeoff_instr.html (with simplified and traditional Chinese versions also available). Training material was available starting March 15, testing material was available April 22, and results had to be returned to the SIGHAN ftp site by April 25, no later than 17:00 EDT.</Paragraph>
      <Paragraph position="2"> Upon initial registration, sites were required to declare which corpora they would be training and testing on, and whether they would be participating in the open or closed track (or both) on each corpus, where these were defined as follows: - For the open track, sites were allowed to train on the training set for a particular corpus, and in addition they could use any other material, including material from other training corpora, proprietary dictionaries, material from the WWW, and so forth. However, a site selecting the open track was required to explain what percentage of its results came from which sources. For example, if a system did particularly well on out-of-vocabulary words, the participants were required to explain whether those results could mostly be attributed to, say, having a good dictionary.</Paragraph>
      <Paragraph position="3"> - For the closed track, participants could only use training material from the training data for the particular corpus being tested on. No other material was allowed.</Paragraph>
      <Paragraph position="4"> Other obvious restrictions applied: participants were prohibited from testing on corpora from their own sites, and by signing up for a particular track, participants implicitly declared that they had not previously seen the test corpus for that track. Scoring was completely automatic. Note that the scoring software does not correct for cases where a participant converted the data from one coding scheme into another; any such cases were counted as errors. Results were returned to participants within a couple of days of submission of the segmented test data. The script used for scoring can be downloaded from http://www.sighan.org/bakeoff2003/score; it is a simple Perl script that depends on a version of diff (e.g., GNU diffutils 2.7.2) that supports the -y flag for side-by-side output format.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Participating sites
</SectionTitle>
      <Paragraph position="0"> Participating sites are shown in Table 2. These are a subset of the sites that had registered for the bakeoff, as some sites withdrew due to technical difficulties.
3 Further details of the corpora
An unfortunate, and sometimes unforeseen, complexity in dealing with Chinese text on the computer is the plethora of character sets and character encodings used throughout Greater China. This is demonstrated in the Encoding column of Table 1:
1. Both AS and HK utilize complex-form (or &quot;traditional&quot;) characters, using variants of the Big Five character set. The Academia Sinica corpus is composed almost entirely of characters in pure Big Five (four characters, 0xFB5B, 0xFA76, 0xFB7A, and 0xFAAF, fall outside the encoding range of Big Five), while the City University corpus utilizes 38 (34 unique) characters from the Hong Kong Supplementary Character Set (HKSCS) extension to Big Five.</Paragraph>
      <Paragraph position="1"> 2. The CTB and PK corpora each use simple-form (or &amp;quot;simplified&amp;quot;) characters, using the EUC-CN encoding of the GB 2312-80 character set.</Paragraph>
      <Paragraph position="2">  However, the PK corpus includes characters that are not part of GB 2312-80 but are encoded in GBK. GBK is an extension of GB 2312-80 that incorporates some 18,000 hanzi found in Unicode 2.1 within the GB 2312-80 code space; GBK is implemented only in Microsoft's CP936 code page.</Paragraph>
      <Paragraph position="3"> This variation in encoding is exacerbated by the usual lack of an explicit declaration in the files. Generally a file is said to be &quot;Big Five&quot; or &quot;GB&quot; when in actuality it is encoded in a variation of these. This is problematic for systems that use Unicode internally, since transcoding back to the original encoding may lose information.</Paragraph>
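      <Paragraph> The pitfall just described can be made concrete in a few lines of Python. This is a minimal illustration, not part of the bakeoff tooling, and the byte value is chosen purely for demonstration: a two-byte sequence that is legal GBK but falls outside GB 2312-80, so a file containing it is not really &quot;GB&quot; in the strict sense.

```python
# Hypothetical example: a double-byte code that exists in GBK (CP936)
# but not in the GB 2312-80 subset.
raw = b'\x88\x40'

try:
    raw.decode('gb2312')
    strict_gb2312 = True
except UnicodeDecodeError:
    strict_gb2312 = False

print(strict_gb2312)                 # False: invalid as pure GB 2312
print(len(raw.decode('gbk')) == 1)   # True: GBK decodes it as one hanzi
```

Transcoding such a file through Unicode is lossless only if the full GBK repertoire is supported on both sides, which is exactly the information-loss risk noted above.</Paragraph>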
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Baseline and topline experiments
</SectionTitle>
      <Paragraph position="0"> We computed a baseline for each of the corpora by compiling a dictionary of all and only the words in the training portion of the corpus. We then used this dictionary with a simple maximum matching algorithm to segment the test corpus. The results of this experiment are presented in Table 3. In this and subsequent tables, we list the word count for the test corpus, test recall (R), test precision (P), F score (the harmonic mean of P and R), the out-of-vocabulary (OOV) rate for the test corpus, the recall on OOV words (R_oov), and the recall on in-vocabulary words (R_iv). Per normal usage, OOV is defined as the set of words in the test corpus not occurring in the training corpus. We expect systems to do at least as well as this baseline.</Paragraph>
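      <Paragraph> The maximum matching procedure used for the baseline can be sketched as follows. This is a minimal illustration with a toy dictionary standing in for the training-set vocabulary, not the actual baseline implementation; the greedy left-to-right strategy is the point.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy left-to-right maximum matching: at each position take the
    longest dictionary word; fall back to a single character otherwise."""
    words = []
    i = 0
    while len(text) > i:
        # Try candidate words from longest to shortest.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Toy dictionary in place of a real training vocabulary:
print(max_match('abcd', {'ab', 'abc', 'd'}))  # ['abc', 'd']
print(max_match('abe', {'ab', 'abc', 'd'}))   # ['ab', 'e']  ('e' is OOV)
```

The single-character fallback is also why the baseline's OOV recall can be above zero: an OOV word that happens to be a single character is segmented correctly by accident.</Paragraph>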
      <Paragraph position="1"> As a nominal topline we ran the same maximum matching experiments, but this time populating the dictionary only with words from the test corpus; this is of course a &quot;cheating&quot; experiment, since one could not reasonably know exactly the set of words that occur in the test corpus. (Note that the baseline's OOV recall is not necessarily zero, since the maximum matching algorithm might get lucky: if the dictionary contains no word starting with some character C, the algorithm will move on to the next character, leaving C segmented as a word on its own. If C is in fact a single-character word, the algorithm will have fortuitously done the right thing.) Since the topline dictionary is better than one could hope for in practice, we would expect systems to generally underperform this topline. The results of this &quot;cheating&quot; experiment are given in Table 4.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Raw scores
</SectionTitle>
      <Paragraph position="0"> Results for the closed tests are presented in Tables 5-8. Column headings are as above, except for &quot;c_r&quot; and &quot;c_p&quot;, which are explained in Section 4.3. Results for the open tests are presented in Tables 9-12; again, see Section 4.3 for the explanation of &quot;c_r&quot; and &quot;c_p&quot;.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Statistical significance of the results
</SectionTitle>
      <Paragraph position="0"> Let us assume that the recall rates for the various systems represent the probability p that a word will be successfully identified, and let us further assume that a binomial distribution is appropriate for this experiment. Given the Central Limit Theorem for Bernoulli trials -- e.g., (Grinstead and Snell, 1997, page 330) -- the 95% confidence interval is given as +/- 2 * sqrt(p(1-p)/n), where n is the number of trials (words). (If one did have the exact list of words occurring in the test corpus, one could still do better than the maximum matching algorithm, since the maximum matching algorithm cannot in general correctly resolve cases where more than one segmentation is possible given the dictionary. However, as we can see from the scores in Table 4, such cases constitute at most about 1.5%.)</Paragraph>
      <Paragraph position="1"> The values for 2 * sqrt(p(1-p)/n) are given in Tables 5-12 under the heading &quot;c_r&quot;. They can be interpreted as follows: to decide whether two sites are significantly different (at the 95% confidence level) in their performance on a particular task, one just has to compute whether their confidence intervals overlap. Similarly, one can treat the precision rates as the probability that a character string that has been identified as a word is really a word; these precision-based confidences are given as &quot;c_p&quot; in the tables.</Paragraph>
      <Paragraph position="2"> It seems reasonable to treat two systems as significantly different (at the 95% confidence level) if their confidence intervals fail to overlap on at least one of the recall-based or precision-based measures. Using this criterion, all systems are significantly different from each other, except that on PK closed S10 is not significantly different from S09, and S07 is not significantly different from S04.</Paragraph>
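      <Paragraph> The overlap test described above can be sketched as follows; the recall values and word count in the example are illustrative numbers, not actual bakeoff scores.

```python
import math

def interval(p, n):
    """95% confidence interval: p plus or minus 2*sqrt(p*(1-p)/n)."""
    half = 2 * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def significantly_different(p1, p2, n):
    """Two systems differ (at the 95% level) if their intervals do not overlap."""
    lo1, hi1 = interval(p1, n)
    lo2, hi2 = interval(p2, n)
    return lo2 > hi1 or lo1 > hi2

# With n = 40,000 test words, a half-point gap in recall is significant,
# but a tenth of a point is not:
print(significantly_different(0.950, 0.945, 40000))  # True
print(significantly_different(0.950, 0.949, 40000))  # False
```
</Paragraph>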
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Differences between &quot;open&quot; and &quot;closed&quot; performance
</SectionTitle>
      <Paragraph position="0"> In Figure 1 we plot the F scores for all systems, all tracks. We include as &quot;BASE&quot; and &quot;TOP&quot; the baseline and topline scores discussed previously. In most cases systems performed above the baseline, though well below the ideal topline; note, however, that the two participants in the Academia Sinica open track underperformed the baseline.</Paragraph>
      <Paragraph position="1"> Performance on the Penn Chinese Treebank (CTB) corpus was generally lower than on all the other corpora; omitting S02, which ran only on CTBo,c, the scores for the other systems were uniformly higher on the other corpora than they were on CTB, the single exception being S11, which did better on CTBo than on HKo. The baseline for CTB is also much lower than the baseline for the other corpora, so one might be inclined to ascribe the generally lower performance to the smaller training data for this corpus. The OOV rate for this corpus is also much higher than for all of the other corpora, and since error rates are generally higher on OOV words, this is surely a contributing factor. However, this would only explain why CTB showed lower performance on the closed test; on the open test, one might expect the size of the training data to matter less, yet there were still large differences between several systems' performance on CTB and their performance on other corpora. Note also that the topline for CTB is lower than for the other corpora. All of this suggests that the CTB may simply be less consistent than the other corpora in its segmentation; indeed, one of the participants (Andi Wu) noted a number of inconsistencies in both the training and the test data (though inconsistencies were also noted for the AS corpus). Systems that ran on both closed and open tracks for the same corpus generally did better on the open track, indicating (not surprisingly) that using additional data can help. However, the lower-than-baseline performance of S03 and S11 on ASo may reflect issues with tuning these additional resources to the particular standard in question.</Paragraph>
      <Paragraph position="2"> Finally note that the top performance of any system on any track was S09 on ASc (F=0.961). Since performances close to our ideal topline have occasionally been reported in the literature it is worth bearing the results of this bakeoff in mind when reading such reports.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Differences on OOV
</SectionTitle>
      <Paragraph position="0"/>
      <Paragraph position="2">  consistently segmented as two words in the training data, but as one word in the test data. Similarly, the word for &quot;(corporate) vice president&quot; is segmented as one word in the training data but as two words in the test data. As a final example, superlatives such as the word for &quot;best&quot; should be segmented as a single word if the adjective is monosyllabic and is not being used predicatively; however, this principle is not consistently applied.</Paragraph>
      <Paragraph position="3"> Wu also notes that the test data differs from the training data in several respects. Most of the training data comprise texts about Mainland China, whereas most of the test data is about Taiwan. The test data also contains classes of items, such as URLs and English page designations (&quot;p. 64&quot;), that never appeared in the training data.</Paragraph>
      <Paragraph position="4"> To be sure, the performance of the baseline is above 0.0 only fortuitously, as we noted in Section 4.1. Similarly, the topline performance is less than 1.0 only in cases where there are two or more possible decompositions of a string, and where the option with the longest prefix is not the correct one.</Paragraph>
      <Paragraph position="5"> It is with OOV recall that we see the widest variation among systems, which in turn is consistent with the observation that dealing with unknown words is the major outstanding problem of Chinese word segmentation. While some systems performed little better than the baseline, others achieved a very respectable 0.80 recall on OOV words. Again, many systems clearly benefited from using resources beyond the training data: a number of systems that ran on both closed and open tracks showed significant improvements on the open track. For the closed-track entries that did well on OOV, one must conclude that they have effective unknown-word detection methods.</Paragraph>
    </Section>
  </Section>
</Paper>