File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/03/w03-1729_concl.xml
Size: 2,612 bytes
Last Modified: 2025-10-06 13:53:51
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1729"> <Title>SYSTRAN's Chinese Word Segmentation</Title> <Section position="7" start_page="4" end_page="4" type="concl"> <SectionTitle> 4 Conclusion </SectionTitle> <Paragraph position="0"> For an open track segmentation competition like the Bakeoff, we need to balance the following aspects: * Segmentation standards: differences between one's own standard and the reference standard. * Adaptation to the other standards: whether one should adapt to other standards.</Paragraph> <Paragraph position="1"> * Dictionary coverage: the coverage of one's own dictionary and of the dictionary obtained by training.</Paragraph> <Paragraph position="2"> * Algorithm: the combination of segmentation, unknown word identification, and named entity recognition.</Paragraph> <Paragraph position="3"> * Speed: the time needed to segment the corpora. * Training: the time and manpower used for training on each corpus and track. Few systems participated in all open tracks: only SYSTRAN and one university participated in all four. We devoted about 2 person-weeks to this evaluation. We rank in the top three in three of the four open tracks; only our PK open track scores are lower, probably because of encoding problems with numbers in this corpus (we did not adjust our segmenter to cope with this corpus-specific problem). Our results are very consistent across all open tracks, indicating a very robust approach to Chinese segmentation.</Paragraph> <Paragraph position="4"> Analysis of the results shows that SYSTRAN's Chinese word segmentation excels in dictionary coverage, robustness, and speed. The vast majority of divergences from the test corpora originate from differences in segmentation standards (over 55% for CAS-R and about 98% for CAS-T). True errors account for only 0% to 2%, the rest being attributed to the lack of either unknown word processing or a named entity recognizer. 
Although not yet integrated, unknown word identification and named entity recognition are under development as part of a terminology extraction tool.</Paragraph> <Paragraph position="5"> For future Chinese word segmentation evaluations, some of the issues that arose in this Bakeoff, such as word segmentation standards and encoding problems, would need to be addressed to obtain even more significant results. We would also welcome the introduction of a surprise track, similar to the surprise track of the DARPA MT evaluations, that would require participants to submit results within 24 hours on an unknown corpus.</Paragraph> </Section> </Paper>