<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1723">
  <Title>A two-stage statistical word segmentation system for Chinese</Title>
  <Section position="5" start_page="21" end_page="21" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> Our system participated in both closed and open tests on Peking University corpora at the First International Chinese Word Segmentation Bakeoff.</Paragraph>
    <Paragraph position="1"> This section reports and discusses the evaluation results.</Paragraph>
    <Section position="1" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
4.1 Measures
</SectionTitle>
      <Paragraph position="0"> In the evaluation program of the First International Chinese Word Segmentation Bakeoff, six measures are used to score a word segmentation system: test recall (R), test precision (P), the balanced F-measure (F), the out-of-vocabulary (OOV) rate of the test corpus, the recall on OOV words (R_OOV), and the recall on in-vocabulary words (R_IV). In the closed test, the OOV words are the words in the test corpus that do not occur in the training corpus; in the open test, they are the words in the test corpus that do not occur in the lexicon used.</Paragraph>
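      The measures above can be made concrete with a minimal Python sketch. It assumes that the gold and system segmentations are word lists over the same character string, that a system word counts as correct when its character span matches a gold span, that F is the harmonic mean 2PR/(P+R), and that the OOV rate is computed over gold word tokens. The names `spans` and `score` and the toy data are hypothetical; this is not the bakeoff scoring program.

```python
def spans(words):
    """Map a word list to a list of (start, end, word) character spans."""
    out, pos = [], 0
    for w in words:
        out.append((pos, pos + len(w), w))
        pos += len(w)
    return out


def score(gold_words, test_words, vocabulary):
    """Return R, P, F, OOV rate, R_OOV and R_IV for one segmented text.

    `vocabulary` is the training-corpus word set (closed test) or the
    lexicon (open test); gold words outside it count as OOV.
    """
    gold = spans(gold_words)
    test = {(s, e) for s, e, _ in spans(test_words)}
    hit = [(s, e) in test for s, e, _ in gold]      # gold words recovered
    is_oov = [w not in vocabulary for _, _, w in gold]

    correct = sum(hit)
    recall = correct / len(gold)
    precision = correct / len(test)
    f = 2 * precision * recall / (precision + recall) if correct else 0.0

    n_oov = sum(is_oov)
    n_iv = len(gold) - n_oov
    oov_hit = sum(h and o for h, o in zip(hit, is_oov))
    return {
        "R": recall,
        "P": precision,
        "F": f,
        "OOV": n_oov / len(gold),
        "R_OOV": oov_hit / n_oov if n_oov else 0.0,
        "R_IV": (correct - oov_hit) / n_iv if n_iv else 0.0,
    }


# Toy data: letters stand in for Chinese characters.
gold = ["ab", "c", "de", "f"]
test = ["ab", "cd", "e", "f"]
result = score(gold, test, vocabulary={"ab", "c", "f"})
print(result)
```

      In the toy data the system recovers two of the four gold words and misses the single OOV word, so R, P and F are all 0.5 while R_OOV is 0.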
    </Section>
    <Section position="2" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
4.2 Experimental lexicons and corpora
</SectionTitle>
      <Paragraph position="0"> As shown in Table 1, we used only the training data from the Peking University corpus to train our system in both the open and closed tests. For the closed test, we compiled a dictionary of 55,226 words from the training corpus; for the open test, we used a dictionary of about 65,269 words.</Paragraph>
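      Compiling a closed-test dictionary amounts to collecting the distinct words of the segmented training corpus. The following is a minimal sketch, assuming one sentence per line with words separated by whitespace; the actual corpus format and any filtering applied when building the 55,226-word dictionary are not stated here, and `build_lexicon` is a hypothetical name.

```python
def build_lexicon(segmented_lines):
    """Collect the distinct words of a whitespace-segmented corpus."""
    lexicon = set()
    for line in segmented_lines:
        lexicon.update(line.split())
    return lexicon


# Toy training corpus: letters stand in for Chinese characters.
training = ["ab c de", "c f ab"]
lexicon = build_lexicon(training)
print(len(lexicon))  # 4 distinct words: ab, c, de, f
```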
    </Section>
    <Section position="3" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
4.3 Experimental results and discussion
</SectionTitle>
      <Paragraph position="0"> Segmentation speed: The test corpus contains about 28,458 characters. Running on an ACER notebook (TM632XC-P4M), our system takes about 3.21 seconds and 3.07 seconds to perform full segmentation (including known word segmentation and unknown word identification) on the closed and open test corpora respectively.</Paragraph>
      <Paragraph position="1"> This indicates that our system can process about 531,925 to 556,182 characters per minute.</Paragraph>
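      The throughput figures follow directly from the corpus size and the two timings. A short Python check of the arithmetic, using only the numbers reported above (the variable names are illustrative):

```python
CHARS = 28458                      # characters in the test corpus
SECONDS = {"closed": 3.21, "open": 3.07}

# characters per minute = characters / seconds * 60
rates = {name: CHARS / t * 60 for name, t in SECONDS.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} characters per minute")
```

      Rounded to whole characters, this gives 531,925 for the closed test and 556,182 for the open test, matching the reported range.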
      <Paragraph position="2"> Results and discussions: The results for the closed and open tests are presented in Table 2. We can draw some conclusions from these results.</Paragraph>
      <Paragraph position="3"> Firstly, the overall performance of our system is very stable in both the closed and open tests. As shown in Table 2, the out-of-vocabulary (OOV) rate is 6.9% in the closed test and 9.4% in the open test. However, the overall test F-measure drops by only 0.2 percent in the open test, compared with the closed test.</Paragraph>
      <Paragraph position="4"> Secondly, our approach can handle most unknown words in the input. As can be seen from Table 2, the recall on OOV words is 67.5% in the closed test and 76.2% in the open test. Wang et al. (2000) and Yao (1997) have proposed a character juncture model and word-formation patterns for Chinese unknown word identification. However, because these are character-based methods, they work only for unknown words made up purely of monosyllabic characters. To address this problem, we introduce both a word juncture model and word-based word-formation patterns into our system. As a result, our system can deal with unknown words that consist of different kinds of known words, including monosyllabic characters and multi-character words.</Paragraph>
      <Paragraph position="5"> Although our system is effective for most ambiguities and unknown words in the input, it has inherent deficiencies. Firstly, to avoid data sparseness, we do not differentiate between known words and unknown words when estimating the word juncture models and word-formation patterns from the training corpus. This simplification may introduce some noise into the models used for identifying unknown words. Our further investigations show that the precision on OOV words is still low: 67.1% for the closed test and 72.5% for the open test. As a result, our system may output a number of incorrectly identified unknown words.</Paragraph>
      <Paragraph position="6"> Secondly, we treat known word segmentation and unknown word identification as two independent stages in our system. This strategy is simple and easy to apply. However, it does not work when the input contains a mixture of ambiguities and unknown words. For example, the test corpus contains the sentence Zhong Xing Chang Ge Zhi Xing Zhu Zhong Jian Shen, in which the string Zhong Xing Chang Ge is a fragment that mixes ambiguity with unknown words. The correct segmentation is Zhong Xing / Chang Ge /, where Zhong Xing (Zhonghang, the Bank of China) is an abbreviation of an organization name, and Chang Ge (Changge) is a place name. In the first stage of our system, however, this fragment is segmented as Zhong / Xing Chang / Ge /. The unknown word identification stage has no mechanism to split the word Xing Chang (Hangzhang, president), so the final segmentation is wrong.</Paragraph>
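      This failure can be restated in terms of word boundaries. The sketch below assumes, as the paragraph above suggests, that the second stage can only merge adjacent first-stage words and never split them, so it can only remove boundaries. The units are the romanized syllables of the example, and the helper name `boundaries` is hypothetical; this illustrates the argument and is not the system's actual code.

```python
def boundaries(words):
    """Return the set of internal boundary offsets, counted in syllables."""
    cuts, pos = set(), 0
    for w in words[:-1]:
        pos += len(w.split())
        cuts.add(pos)
    return cuts


stage1 = ["Zhong", "Xing Chang", "Ge"]      # first-stage output
correct = ["Zhong Xing", "Chang Ge"]        # reference segmentation

# Merging adjacent words only deletes boundaries, so a merge-only second
# stage can reach the reference only if every reference boundary is
# already present after the first stage.
reachable = boundaries(correct).issubset(boundaries(stage1))
print(boundaries(stage1), boundaries(correct), reachable)
```

      Here the reference needs a boundary after the second syllable, which the first stage placed inside Xing Chang, so `reachable` is False under the merge-only assumption.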
    </Section>
  </Section>
</Paper>