<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1729">
  <Title>SYSTRAN's Chinese Word Segmentation</Title>
  <Section position="4" start_page="0" end_page="2" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Chinese word segmentation is one of the pre-processing steps of the SYSTRAN Chinese-English Machine Translation (MT) system. The development of the Chinese-English MT system began in August 1994, and this is when the Chinese word segmentation issue was first addressed. The algorithm of the early version of the segmentation module was borrowed from SYSTRAN's Japanese segmentation module. The program ran on a large word list, which contained 600,000 entries at the time. The basic strategy was to list all possible matches for an entire linguistic unit, then resolve the overlapping matches via linguistic rules. The development was focused on technical domains, and high accuracy was achieved after only three months of development. Since then, development has shifted to other areas of Chinese-English MT, including the enrichment of the bilingual word lists with part-of-speech, syntactic and semantic features. In 2001, the development of a prototype Chinese-Japanese MT system began. Although the project lasted only three months, some important changes were made to the segmentation convention regarding the distinction between words and phrases. Along with new developments of the SYSTRAN MT engine, the segmentation engine has recently been re-implemented. The dictionary and the general approach remain unchanged, but dictionary lookup and rule matching were re-implemented using finite-state technology, and linguistic rules for the segmentation module are now expressed using a context-free-based formalism, improving maintainability. The re-implementation generates multiple segmentation results with associated probabilities. This will allow for disambiguation at a later stage of the MT process, and will widen the possibilities for using word segmentation in other applications.</Paragraph>
  </Section>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
2 System Description
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.1 Segmentation Standard
</SectionTitle>
      <Paragraph position="0"> Our definition of words and our segmentation conventions are based on available standards, modified for MT purposes. The PRC standard (Liu et al., 1993) was initially used. Sample differences are listed as follows:</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.2 Methodology
</SectionTitle>
      <Paragraph position="0"> The SYSTRAN Chinese word segmentation module uses a rule-based approach and a large dictionary. The dictionary is derived from the Chinese-English MT dictionary. It currently includes about 400,000 words. The basic segmentation strategy is to list all possible matches for a translation unit (typically, a sentence), then to resolve overlapping matches via linguistic rules. The same segmentation module and the same dictionary are used to segment different types of text with comparable performance.</Paragraph>
      <Paragraph position="1"> All dictionary lookup and rule matching are performed using a low-level Finite State Automaton library. The segmentation speed is 3,500 characters per second on a 2.4 GHz Pentium 4 processor.</Paragraph>
      <Paragraph position="2"> Dictionary. The Chinese-English MT dictionary currently contains 400,000 words (e.g., Zhong Hua) and 200,000 multi-word expressions (e.g., Zhong Hua Ren Min Gong He Guo). Only words are used for segmentation.</Paragraph>
      <Paragraph position="3"> Specialized linguistic rules are associated with the dictionary. The dictionary is general purpose, with good coverage on several domains. Domain-specific dictionaries are also available, but were not used in the Bakeoff.</Paragraph>
      <Paragraph position="4"> The dictionary contains words from different Chinese-speaking regions, but the representation is mostly in simplified Chinese. The traditional characters are considered as &quot;variants&quot;, and they are not physically stored in the dictionary. For example, Yi Da Li and Yi Da Li are stored in the dictionary, and Yi Da Li can also be found via the character matching Yi - Yi.</Paragraph>
      <Paragraph position="5"> The dictionary is encoded in Unicode (UTF-8), and all internal operations manipulate UTF-8 strings. Major encoding conversions are supported, including GB2312-80, GB13000, BIG-5, BIG5-HKSCS, etc.</Paragraph>
      <Paragraph position="6"> Training. The segmentation module has been tested and fine-tuned on general texts, and on texts in the technical and military domains (because of specific customer requirements for the MT system). Due to the wide availability of news texts, the news domain has also recently been used for training and testing. The training process reduces to the customization of a SYSTRAN MT system. In the current version of the MT system, customization is achieved by building a User Dictionary (UD). A UD supplements the main dictionary: any word that is not found in the main MT system dictionary is added to a User Dictionary.</Paragraph>
      <Paragraph position="7"> Name-Entity Recognition and Unknown Words. Name entity recognition is still under development. Recognition of Chinese personal names is done via linguistic rules. Foreign name recognition is not yet implemented due to the difficulty of obtaining translations.</Paragraph>
      <Paragraph position="8"> Because no translation is available even when an unknown word has been successfully recognized, we treat unknown word recognition as part of the terminology extraction process. This feature was not integrated for the Bakeoff.</Paragraph>
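      <Paragraph position="9"> The first step of the basic strategy in this section, listing all possible dictionary matches and locating the overlapping ones, can be sketched as follows. This is a minimal illustration in Python, assuming a plain set stands in for the dictionary; the names all_matches and overlapping_pairs, the max_len cap and the toy entries are hypothetical and do not reflect SYSTRAN's finite-state implementation. The linguistic rules that resolve the overlaps are not specified here and are therefore not modeled.

```python
def all_matches(text, dictionary, max_len=8):
    """Return every (start, end, word) span of text that is a dictionary word."""
    spans = []
    for start in range(len(text)):
        # Try every candidate length, capped by max_len and by the end of the text.
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in dictionary:
                spans.append((start, end, word))
    return spans


def overlapping_pairs(spans):
    """Return pairs of words whose spans cross without one containing the other."""
    pairs = []
    for i, (s1, e1, w1) in enumerate(spans):
        for s2, e2, w2 in spans[i + 1:]:
            # Crossing overlap: the later span starts strictly inside the
            # earlier one and ends strictly after it.
            if s2 > s1 and e1 > s2 and e2 > e1:
                pairs.append((w1, w2))
    return pairs


# Hypothetical toy dictionary using the AB-C / A-BC notation.
TOY_DICTIONARY = {"A", "B", "C", "AB", "BC"}
spans = all_matches("ABC", TOY_DICTIONARY)
print(overlapping_pairs(spans))
```

For the string ABC, the sketch reports AB and BC as the only overlapping pair, which is the kind of conflict the linguistic rules then have to resolve.</Paragraph>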
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.3 Evaluation
</SectionTitle>
      <Paragraph position="0"> Our internal evaluation has been focused on the accuracy of segmentation using our own segmentation standard. Our evaluation process includes large-scale bilingual regression testing for the Chinese-English system, as well as regression testing of the segmenter itself using a test database of over 5 MB of test items. Two criteria are used: 1. Overlapping Ambiguity Strings (OAS): the reference segmentation and the segmenter segmentation overlap for some string, e.g., AB-C and A-BC. As shown below, this typically indicates an error from our segmenter.</Paragraph>
      <Paragraph position="1"> 2. Covering Ambiguity Strings (CAS): the test strings that cover the reference strings (CAS-T: ABC and AB-C), and the reference strings that cover the test strings (CAS-R: AB-C and ABC). These cases arise mostly from a difference between equally valid segmentation standards.</Paragraph>
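      <Paragraph position="2"> The two criteria can be illustrated with a small sketch that compares a reference segmentation with a test segmentation of the same string. It is written in Python under stated assumptions: the function names and the way divergences are counted (one count per diverging span) are hypothetical choices made for illustration, not the counting scheme of the actual regression tests.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def crosses(a, b):
    """True if the two spans overlap without either containing the other."""
    first, second = sorted([a, b])
    return second[0] > first[0] and first[1] > second[0] and second[1] > first[1]


def contains(outer, inner):
    """True if outer strictly covers inner."""
    return outer != inner and inner[0] >= outer[0] and outer[1] >= inner[1]


def classify(reference, test):
    """Count OAS, CAS-T and CAS-R divergences between two segmentations."""
    if "".join(reference) != "".join(test):
        raise ValueError("segmentations must cover the same text")
    ref, tst = to_spans(reference), to_spans(test)
    counts = {"OAS": 0, "CAS-T": 0, "CAS-R": 0}
    for span in tst - ref:
        if any(crosses(span, r) for r in ref):
            counts["OAS"] += 1      # test word crosses a reference boundary
        elif any(contains(span, r) for r in ref):
            counts["CAS-T"] += 1    # test word covers several reference words
    for span in ref - tst:
        no_cross = not any(crosses(span, t) for t in tst)
        if no_cross and any(contains(span, t) for t in tst):
            counts["CAS-R"] += 1    # reference word covers several test words
    return counts


print(classify(["AB", "C"], ["A", "BC"]))  # OAS example: AB-C and A-BC
print(classify(["AB", "C"], ["ABC"]))      # CAS-T example: test ABC covers AB-C
print(classify(["ABC"], ["AB", "C"]))      # CAS-R example: reference ABC covers AB-C
```

Each call takes the reference segmentation first and the test segmentation second, matching the three example pairs given in the two criteria.</Paragraph>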
    </Section>
  </Section>
  <Section position="6" start_page="2" end_page="4" type="metho">
    <SectionTitle>
3 Discussion of the Bakeoff
3.1 Results
</SectionTitle>
    <Paragraph position="0"> SYSTRAN participated in the four open tracks of the First International Chinese Word Segmentation Bakeoff (http://www.sighan.org/bakeoff2003/). Each track corresponds to one corpus, and each corpus has its own word segmentation standard that differs significantly from the others. The training process included building a User Dictionary that contains words found in the training corpora, but not in the SYSTRAN dictionary. Although each of these corpora was segmented according to its own standard, we made a single UD containing all the words gathered in all corpora.</Paragraph>
    <Paragraph position="1"> Although the ranking of the SYSTRAN segmenter differs across the four open tracks, SYSTRAN's segmentation performance is quite comparable across the four corpora. This contrasts with the scores obtained by other participants, where good performance was typically obtained on one corpus only. SYSTRAN scores for the four tracks are shown in Table 3 (Sproat and Emerson, 2003).</Paragraph>
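    <Paragraph position="2"> The construction of the single User Dictionary described above amounts to a set difference between the training vocabulary and the main dictionary. The following Python sketch shows that idea only; the function name build_user_dictionary, the representation of a corpus as a list of segmented sentences, and the toy data are assumptions made for illustration and say nothing about the actual UD file format or tooling.

```python
def build_user_dictionary(training_corpora, main_dictionary):
    """Collect every training-corpus word that the main dictionary lacks."""
    user_dictionary = set()
    for corpus in training_corpora:
        # Each corpus is assumed to be a list of sentences, and each
        # sentence a list of already segmented words.
        for sentence in corpus:
            for word in sentence:
                if word not in main_dictionary:
                    user_dictionary.add(word)
    return user_dictionary


# Hypothetical toy data: two corpora, one shared main dictionary.
MAIN_DICTIONARY = {"A", "B"}
CORPORA = [
    [["A", "CD"]],
    [["B", "EF", "CD"]],
]
print(sorted(build_user_dictionary(CORPORA, MAIN_DICTIONARY)))
```

Because the result is a single set, a word that appears in several corpora is stored once, regardless of which corpus standard produced it.</Paragraph>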
    <Section position="1" start_page="2" end_page="4" type="sub_section">
      <SectionTitle>
3.2 Discussions
</SectionTitle>
      <Paragraph position="0"> The segmentation differences between the reference corpora and SYSTRAN's results are further analyzed. Table 4 shows the partition of divergences between OAS, CAS-T, and CAS-R strings. The majority of OAS divergences show incorrect segmentation from SYSTRAN. However, differences in CAS do not necessarily indicate incorrect segmentation results. The reasons can be categorized as follows: a) different segmentation standards, b) unknown word problem, c) name entity recognition problem, and d) miscellaneous.</Paragraph>
      <Paragraph position="1"> The distributions of the differences are further analyzed in Table 5 and 6 for the AS</Paragraph>
      <Paragraph position="3"> This analysis shows that the segmentation results are greatly impacted by the difference in the segmentation standards. Other problems include, for example, the encoding of numbers using single bytes instead of the standard double-byte encoding in the PKo corpus, which accounts for about 12% of the differences in the PKo track scores.</Paragraph>
    </Section>
  </Section>
</Paper>