<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0133"> <Title>Maximum Entropy Word Segmentation of Chinese Text</Title> <Section position="4" start_page="186" end_page="187" type="intro"> <SectionTitle> 2 Results </SectionTitle> <Paragraph position="0"> Table 1 lists our official results for the bakeoff.</Paragraph> <Paragraph position="1"> The columns show F scores, recall rates, precision rates, and recall rates on out-of-vocabulary and in-vocabulary words. Among the bakeoff participants whose scores were reported, our system achieved the highest F score for UPUC, the second-highest for CKIP, the seventh-highest for MSRA, and the third-highest for CITYU.</Paragraph> <Section position="1" start_page="186" end_page="187" type="sub_section"> <SectionTitle> The system's F score for MSRA was higher than </SectionTitle> <Paragraph position="0"> for UPUC or CKIP, but its ranking relative to the other contestants was notably worse on MSRA than on the other corpora. An analysis of the gold-standard files for the MSRA test data shows that, of all the corpora, MSRA had the highest percentage of single-character words and the smallest percentage of two-character and three-character words. Moreover, its proportion of words over 5 characters in length was five times that of the other corpora. Most of the errors our system made on the MSRA test set involved incorrect groupings of true single-character words. Another comparatively high proportion involved very long words, especially names with internal syntactic structure (e.g. -Zi}X,]!hh ').</Paragraph> <Paragraph position="1"> Our out-of-vocabulary scores were fairly high for all of the corpora, placing first, fourth, fifth, and third in UPUC, CKIP, MSRA, and CITYU respectively. Much of this can be attributed to the value of using an external dictionary and additional training data, as illustrated by the experiments run by Low et al. 
(2005) with their model.</Paragraph> </Section> </Section> </Paper>