<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0121">
  <Title>Chinese Word Segmentation with Maximum Entropy and N-gram Language Model</Title>
  <Section position="5" start_page="139" end_page="140" type="evalu">
    <SectionTitle>
3 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> We have participated in both the closed and open tracks of all the four corpora. For MSRA corpus and other three corpora, we build System I and System II respectively. Both systems are based on the ME model and the Maximum Entropy Toolkit 1, provided by Zhang Le, is adopted.</Paragraph>
    <Paragraph position="1"> Four systems are derived from System I with regard to whether or not the n-gram language model andthreepostprocessingstrategiesareusedonthe closed track of MSRA corpus. Table 2 shows the results of four derived systems.</Paragraph>
    <Paragraph position="2">  model and three post processing strategies on the closed track of MSRA corpus.</Paragraph>
    <Paragraph position="3"> System IA only adopts the ME model. System IB integrates the ME model and the bigram language model. System IC integrates the division and combination strategy and the numeral words  processing strategy. System ID adds the long organization name processing strategy.</Paragraph>
    <Paragraph position="4"> For the open track of MSRA, an external dictionary is utilized to extract the e and f features. The external dictionary is built from six sources, including the Chinese Concept Dictionary from Institute of Computational Linguistics, Peking University(72,716 words), the LDC dictionary(43,120 words), the Noun Cyclopedia(111,633), the word segmentation dictionary from Institute of Computing Technology, Chinese Academy of Sciences(84,763 words), the dictionary from Institute of Acoustics, and the dictionary from Institute of Computational Linguistics, Peking University(68,200 words) and a dictionary collected by ourselves(63,470 words).</Paragraph>
    <Paragraph position="5"> The union of the six dictionaries forms a big dictionary, and those words appearing in five or six dictionaries are extracted to form a core dictionary. If a word belongs to one of the following dictionaries or word sets, it is added into the external dictionary.</Paragraph>
    <Paragraph position="6"> a) The core dictionary.</Paragraph>
    <Paragraph position="7"> b) The intersection of the big dictionary and the training data.</Paragraph>
    <Paragraph position="8"> c) The words appearing in the training data twice or more times.</Paragraph>
    <Paragraph position="9"> Those words in the external dictionaries will be eliminated, if in most cases they are divided in the training data. Table 3 shows the effect of ME model, n-gram language model, three post processing strategies on the open track of MSRA. Here System IO only adopts the basic features, while the external dictionary based features are used in four derived systems related to open track:  model, threepostprocessingstrategiesontheopen track of MSRA.</Paragraph>
    <Paragraph position="10"> System II only adopts ME model, the division and combination strategy and the numeral word processing strategy. In the open track of the corporaCKIPandCITYU,thetrainingsetandtestset null from the 2nd Chinese Word Segmentation Backoff are used for training. For the corpora UPUC and CITYU, the external dictionaries are used, which is constructed in the same way as that in the open track of MSRA Corpus. Table 4 shows the official results of system II on UPUC, CKIP and CITYU.</Paragraph>
  </Section>
  <Section position="6" start_page="140" end_page="140" type="evalu">
    <SectionTitle>
Table 4: Official results of System II on UPUC, CKIP and CITYU
</SectionTitle>
    <Paragraph position="0"> On the UPUC corpus, an interesting observation is that the performance of the open track is worse than the closed track. The investigation and analysis lead to a possible explanation. That is, the segmentation standard of the dictionaries, which are used to construct the external dictionary, is different from that of the UPUC corpus.</Paragraph>
  </Section>
class="xml-element"></Paper>