<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3025">
  <Title>A Maximum Entropy Approach to Chinese Word Segmentation</Title>
  <Section position="5" start_page="162" end_page="163" type="metho">
    <SectionTitle>
3 Evaluation Results
</SectionTitle>
    <Paragraph position="0"> We evaluated our Chinese word segmenter in the open track on all four corpora: Academia Sinica (AS), City University of Hong Kong (CITYU), Microsoft Research (MSR), and Peking University (PKU). Table 1 shows our official SIGHAN bakeoff results. The columns R, P, and F show the recall, precision, and F measure, respectively. The columns ROOV and RIV show the recall on out-of-vocabulary words and in-vocabulary words, respectively. The version of our Chinese word segmenter that participated in the bakeoff was trained with the basic features (Section 1.1), and made use of the external dictionary (Section 1.2) and additional training corpora (Section 1.3). Our word segmenter achieved the highest F measure for AS, CITYU, and PKU, and the second highest for MSR.</Paragraph>
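The measures reported in the R, P, F, ROOV, and RIV columns can be illustrated with a minimal sketch. It assumes the usual word-level definition, in which a predicted word counts as correct only when both of its boundaries match a gold-standard word. It is not the official bakeoff scorer, and the names `to_spans`, `ratio`, and `score` are hypothetical helpers introduced for illustration. Both segmentations are assumed to cover the same character string, and `vocabulary` stands for the set of words seen in training.

```python
def to_spans(words):
    """Map each (start, end) character span to the word occupying it."""
    spans = {}
    start = 0
    for word in words:
        end = start + len(word)
        spans[(start, end)] = word
        start = end
    return spans


def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def score(gold_words, pred_words, vocabulary):
    """Compute word-level R, P, F, ROOV, and RIV for one segmented text."""
    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    # A predicted word is correct when its span also appears in the gold standard.
    correct = set(gold).intersection(pred)

    recall = ratio(len(correct), len(gold))
    precision = ratio(len(correct), len(pred))
    f_measure = ratio(2 * precision * recall, precision + recall)

    # Split the gold words into in-vocabulary and out-of-vocabulary spans.
    iv = {span for span, word in gold.items() if word in vocabulary}
    oov = set(gold) - iv
    return {
        "R": recall,
        "P": precision,
        "F": f_measure,
        "ROOV": ratio(len(correct.intersection(oov)), len(oov)),
        "RIV": ratio(len(correct.intersection(iv)), len(iv)),
    }


# Made-up example: the prediction merges the last two gold words into one.
gold = ["我", "喜欢", "北京", "大学"]
pred = ["我", "喜欢", "北京大学"]
vocab = {"我", "喜欢", "北京"}
print(score(gold, pred, vocab))
```

In this made-up example two of the four gold words are recovered, so R = 1/2, P = 2/3, and F = 4/7. The only out-of-vocabulary gold word is missed, giving ROOV = 0, while two of the three in-vocabulary words are found, giving RIV = 2/3.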
    <Paragraph position="1"> After the release of the official bakeoff results, we ran a series of experiments to determine the contribution of each component of our word segmenter, using the official scorer and test sets with gold-standard segmentations. Version V1 used only the basic features (Section 1.1); Version V2 used the basic features and additional features derived from our external dictionary (Section 1.2); Version V3 used the basic features but with additional training corpora (Section 1.3); and Version V4 is our official submitted version, combining basic features, external dictionary, and additional training corpora. Table 2 shows the word segmentation accuracy (F measure) of the different versions of our word segmenter when tested on the official test sets of the four corpora. The results indicate that the use of an external dictionary increases segmentation accuracy. Similarly, the use of additional training corpora that follow different segmentation standards also increases segmentation accuracy.</Paragraph>
  </Section>
</Paper>