<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3018">
  <Title>Combination of Machine Learning Methods for Optimum Chinese Word Segmentation Masayuki Asahara Chooi-Ling Goh Kenta Fukuoka</Title>
  <Section position="5" start_page="135" end_page="136" type="evalu">
    <SectionTitle>
4 Discussions and Conclusions
</SectionTitle>
    <Paragraph position="0"> Table 1 summarizes the results of the three models.</Paragraph>
    <Paragraph position="1"> The proposed systems employ purely corpus-based statistical/machine learning method. Now, we discuss what we observe in the three models. We remark two problems in word segmentation: OOV word problem and length-bias problem.</Paragraph>
    <Paragraph position="2"> OOV word problem is that simple word-based Markov Model family cannot analyze the words not included in the word list. One of the solutions is character-based tagging (Xue and Converse, 2002) (Goh et al., 2004a). The simple character-based tagging (Model b) achieved high ROOV but the precision is low. We tried to refine OOV extraction by voting and merging (Model a and c). However, the ROOV of Models a and c are not as good as that of Model b. Figure 1 shows type-precision and type-recall of each OOV extraction modules. While voting helps to make the precision higher, voting deteriorates the recall. Defining some hand written rules to prune false OOV words will help to improve the IV word segmentation (Goh et al., 2004b), because the precision of OOV word extraction becomes higher. Other types of OOV word extraction methods should be introduced.</Paragraph>
    <Paragraph position="3"> For example, (Uchimoto et al., 2001) embeded OOV models in MEMM-based word segmenter (with POS tagging). Less than six-character substrings are extracted as the OOV word candidates in the word trellis. (Peng and McCallum, 2004) proposed OOV word extraction methods based on CRF-based word segmenter. Their CRF-based word segmenter can compute a confidence in each segment. The high confident segments that are not in the IV word list are regarded as OOV word candidates. (Nakagawa, 2004) proposed integration of word and OOV word position tag in a trellis. These three OOV extraction method are different from our methods - character-based tagging.</Paragraph>
    <Paragraph position="4"> Future work will include implementation of these different sorts of OOV word extraction modules.</Paragraph>
    <Paragraph position="5"> Length bias problem means the tendency that the locally normalized Markov Model family prefers longer words. Since choosing the longer words reduces the number of words in a sentence, the state-transitions are reduced. The less the state-transitions, the larger the likelihood of the whole sentence. Actually, the length-bias reflects the real distribution in the corpus. Still, the length-bias problem is nonnegligible to achieve high accuracy due to small exceptional cases. We used CRF-based word segmenter which relaxes the problem (Kudo and Matsumoto, 2004). Actually, the CRF-based word segmenter achieved high RIV .</Paragraph>
    <Paragraph position="6"> We could not complete Model a and c for MSR.</Paragraph>
    <Paragraph position="7"> After the deadline, we managed to complete Model a (CRF + Voted Unk.) and c (CRF + Merged Unk.) The result of Model a was precesion 0.976, recall 0.966, F-measure 0.971, OOV recall 0.570 and IV recall 0.988. The result of Model c was precesion 0.969, recall 0.963, F-measure 0.966, OOV recall 0.571 and IV recall 0.974. While the results are quite good, unfortunately, we could not submit the outputs in time.</Paragraph>
    <Paragraph position="8"> While our results for the three data sets (AS, CITYU, MSR) are fairly good, the result for the PKU data is not as good. There is no correlation between scores and OOV word rates. We investigate unseen character distributions in the data set. There is no correlation between scores and unseen character distributions. null We expected Model c (merging) to achieve higher recall for OOV words than Model a (voting). However, the result was opposite. The noises in OOV word candidates should have deteriorated the F-value of overall word segmentation. One reason might be that our CRF-based segmenter could not encode the occurence of OOV words. We defined the 21st word class for OOV words. However, the training data for CRF-based segmenter did not contain the 21st class.</Paragraph>
    <Paragraph position="9"> We should include the 21st class in the training data  by regarding some words as pseudo OOV words.</Paragraph>
    <Paragraph position="10"> We also found a bug in the CRF-based OOV word extration module. The accuracy of the module might be slightly better than the reported results. However, the effect of the bug on overall F-value might be limited, since the module was only part of the OOV extraction module combination - voting and merging.</Paragraph>
  </Section>
class="xml-element"></Paper>