<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3017">
  <Title>The Second International Chinese Word Segmentation Bakeoff</Title>
  <Section position="7" start_page="126" end_page="129" type="evalu">
    <SectionTitle>
3 Results
</SectionTitle>
    <Paragraph position="0"> In order to provide hypothetical best and worst case results (i.e., we expect systems to do no worse than the base-line and to generally underperform the top-line), we used a simple left-to-right maximal matching algorithm implemented in Perl to generate &amp;quot;top-line&amp;quot; and &amp;quot;base-line&amp;quot;  numbers. This was done by generating word lists based only on the vocabulary in each truth (topline) and training (bottom-line) corpus and segmenting the respective test corpora. These results are presented in Tables.notdef.g00013 and 4. All of the results comprise the following data: test recall (R), test precision (P), balanced F score (where F = 2PR/(P + R)), the out-of-vocabulary (OOV) rate on the test corpus, the recall on OOV words (Roov), and the recall on in-vocabulary words (Riv). We use the usual definition of out-of-vocabulary words as the set of words occurring in the test corpus that are not in the training corpus.</Paragraph>
    <Paragraph position="1"> As in the previous evaluation, to test the confidence level that two trials are significantly different from each other we used the Central Limit Theorem for Bernoulli trials (Grinstead and Snell, 1997), assuming that the recall rates from the various trials represents the probability that a word will be successfully identified, and that a binomial distribution is appropriate for the experiment. We calculated these values at the 95% confidence interval with the formula +-2 .notdef.g0002(p  (1 - p)/n) where n is the number of words. This value appears in subsequent tables under the column cr. We also calculate the confidence that the a character string segmented as a word is actually a word by treating p as the precision rates of each system. This is referred to as cp in the result tables. Two systems are then considered to be statistically different (at a 95% confidence level) if one of their cr or cp are different. Tables 5-12 contain the results for each corpus and track (groups are referenced by their ID as</Paragraph>
  </Section>
  <Section position="8" start_page="129" end_page="130" type="evalu">
    <SectionTitle>
4 Discussion
</SectionTitle>
    <Paragraph position="0"> Across all of the corpora the best performing system, in terms of F score, achieved a 0.972, with an average of 0.918 and median of 0.941.</Paragraph>
    <Paragraph position="1"> As one would expect the best F score on the open tests was higher than the best on the closed tests, 0.972 vs. 0.964, both on the MSR corpus.</Paragraph>
    <Paragraph position="2"> This result follows from the fact that systems taking part on the open test can utilize more information than those on the closed. Also interesting to compare are the OOV recall rates between the Open and Closed tracks. The best OOV recall in the open evaluation was 0.872 compared to just 0.813 on the closed track.</Paragraph>
    <Paragraph position="3"> These data indicate that OOV handling is still the Achilles heel of segmentation systems, even when the OOV rates are relatively small. These OOV recall scores are better than those observed in the first bakeoff in 2003, with similar OOV values, which suggests that advances in unknown word recognition have occurred. Nevertheless OOV is still the most significant problem in segmentation systems.</Paragraph>
    <Paragraph position="4"> The best score on any track in the 2003 bakeoff was F=0.961, while the best for this evaluation was F=0.972, followed by 17 other scores above 0.961. This shows a general trend to a decrease in error rates, from 3.9% to 2.8%! These scores are still far below the theoretical 0.99 level reflected in the topline and the higher numbers often reflected in the literature. It is plain that one can construct a test set that any given system will achieve very high measures of precision and recall on, but these numbers must viewed with caution as they may not scale to other applications or other problem sets.</Paragraph>
    <Paragraph position="5"> Three participants that used the scoring script in their system evaluation observed different behavior from that of the organizers in the  generation of the recall numbers, thereby affecting the F score. We were unable to replicate the behavior observed by the participant, nor could we determine a common set of software versions that might lead to the problem. We verified our computed scores on two different operating systems and two different hardware architectures. In each case the difference was in the participants favor (i.e., resulted in an increased F score) though the impact was minimal. If there is an error in the scripts then it affects all data sets identically, so we are confident in the scores as reported here. Nevertheless, we hope that further investigation will uncover the cause of the discrepancy so that it can be rectified in the future.</Paragraph>
  </Section>
class="xml-element"></Paper>