<?xml version="1.0" standalone="yes"?> <Paper uid="I05-3018"> <Title>Combination of Machine Learning Methods for Optimum Chinese Word Segmentation Masayuki Asahara Chooi-Ling Goh Kenta Fukuoka</Title> <Section position="3" start_page="0" end_page="134" type="intro"> <SectionTitle> 2 Models a and c </SectionTitle> <Paragraph position="0"> Models a and c use several modules. First, a hard clustering algorithm is used to define word classes and character classes. Second, three OOV extraction modules are trained on the training data. These modules then extract the OOV words from the test data. Third, the OOV word candidates produced by the three OOV extraction modules are refined by voting over them (Model a) or merging them (Model c). The final word list is composed by appending the OOV word candidates to the IV word list. Finally, a CRF-based word segmenter analyzes the sentence based on the new word list.</Paragraph> <Section position="1" start_page="0" end_page="134" type="sub_section"> <SectionTitle> 2.1 Clustering for word/character classes </SectionTitle> <Paragraph position="0"> We perform hard clustering of all words and characters in the training data using the k-means algorithm, as implemented in R 2.2.1 (http://www.r-project.org/).</Paragraph> <Paragraph position="1"> Since the number of word types is too large to run k-means clustering on the whole data at once, we randomly divide the word types into 4 groups. K-means clustering is performed for each group. The words in each group are divided into 5 disjoint classes, producing 20 classes in total. The preceding and succeeding words among the top 2000 words by frequency rank are used as the features for the clustering. We define the set of the OOV words as the 21st class, and two further classes for the begin-of-sentence (BOS) and end-of-sentence (EOS) markers, giving 23 classes in total.</Paragraph> <Paragraph position="2"> For characters, 20 classes are defined. 
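As an illustration of the word-clustering scheme above (random grouping of word types, then k-means within each group, plus three reserved classes), here is a hypothetical self-contained Python sketch. The paper itself used R's kmeans(); every function and variable name below is invented for this sketch.

```python
import random

def kmeans(vecs, k, iters=20, seed=0):
    """Tiny hard k-means over dense feature vectors (stand-in for R's kmeans)."""
    rng = random.Random(seed)
    cents = rng.sample(vecs, k)          # initial centroids: k random points
    assign = [0] * len(vecs)
    for _ in range(iters):
        for i, v in enumerate(vecs):     # assign each vector to nearest centroid
            assign[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(v, cents[c])))
        for c in range(k):               # recompute centroids as cluster means
            members = [vecs[i] for i in range(len(vecs)) if assign[i] == c]
            if members:
                cents[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def word_classes(word_vecs, n_groups=4, k_per_group=5, seed=0):
    """Split word types into n_groups random groups, cluster each into
    k_per_group disjoint classes (4 x 5 = 20 classes), and reserve three
    extra classes for OOV, BOS and EOS -- 23 classes in total."""
    rng = random.Random(seed)
    words = sorted(word_vecs)
    rng.shuffle(words)
    groups = [words[g::n_groups] for g in range(n_groups)]
    classes = {}
    for g, group in enumerate(groups):
        assign = kmeans([word_vecs[w] for w in group], k_per_group, seed=seed)
        for w, c in zip(group, assign):
            classes[w] = g * k_per_group + c
    base = n_groups * k_per_group
    reserved = {"OOV": base, "BOS": base + 1, "EOS": base + 2}
    return classes, reserved
```

In practice the feature vectors would encode co-occurrence counts with the top-2000 context words; here any dense vectors will do.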
K-means clustering is performed for all the characters in the training data. The preceding and succeeding characters and BIES position tags are used as features for the clustering: &quot;B&quot; stands for 'the first character of a word'; &quot;I&quot; stands for 'an intermediate character of a word'; &quot;E&quot; stands for 'the last character of a word'; &quot;S&quot; stands for 'a single-character word'. Characters that appear only in the test data are not assigned any character class.</Paragraph> </Section> <Section position="2" start_page="134" end_page="134" type="sub_section"> <SectionTitle> 2.2 Three OOV extraction modules </SectionTitle> <Paragraph position="0"> In Models a and c, we use three OOV extraction modules. The first and second OOV extraction modules use the output of a Maximum Entropy Markov Model (MEMM)-based word segmenter (McCallum et al., 2000) (Uchimoto et al., 2001). The word list is composed of the words appearing in 80% of the training data; the words occurring only in the remaining 20% of the training data are regarded as OOV words. All word candidates in a sentence are extracted to form a trellis. Each word is assigned a word class, and the word classes are used as the hidden states in the trellis. In encoding, MaxEnt estimates state transition probabilities based on the preceding word class (state) and observed features such as the first character, last character, first character class, and last character class of the current word. In decoding, a simple Viterbi algorithm is used.</Paragraph> <Paragraph position="1"> The output of the MEMM-based word segmenter is split character by character. Next, character-based chunking is performed to extract OOV words. We use two chunkers: one based on SVMs (Kudo and Matsumoto, 2001) and one based on CRFs (Lafferty et al., 2001). 
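The BIES conversion used above (turning a segmented word sequence into per-character position tags) is mechanical; a minimal sketch, with a hypothetical function name:

```python
def bies_tags(words):
    """Convert a segmented sentence (a list of words) into per-character
    BIES position tags: B = first, I = intermediate, E = last character
    of a multi-character word; S = a single-character word."""
    tagged = []
    for w in words:
        if len(w) == 1:
            tagged.append((w, "S"))
        else:
            tagged.append((w[0], "B"))
            tagged.extend((c, "I") for c in w[1:-1])
            tagged.append((w[-1], "E"))
    return tagged
```

The inverse direction (tags back to words) is equally simple, which is why character-based chunking can recover word boundaries from tag sequences.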
Each chunker annotates BIO position tags: &quot;B&quot; stands for 'the first character of an OOV word'; &quot;I&quot; stands for 'another character in an OOV word'; &quot;O&quot; stands for 'a character outside an OOV word'.</Paragraph> <Paragraph position="2"> The features used in the two chunkers are the characters, the character classes, and the corresponding information of the surrounding characters within a five-character window. The word sequence output by the MEMM-based word segmenter is converted into a character sequence with BIES position tags and the word classes. The position tags combined with the word classes are also introduced as features. The third module is a variation of the OOV module in section 3, which performs character-based tagging with a MaxEnt classifier. The difference is that we newly introduce the character classes from section 2.1 as features.</Paragraph> <Paragraph position="3"> In summary, we introduce three OOV word extraction modules: &quot;MEMM+SVM&quot;, &quot;MEMM+CRF&quot; and &quot;MaxEnt classifier&quot;.</Paragraph> </Section> <Section position="3" start_page="134" end_page="134" type="sub_section"> <SectionTitle> 2.3 Voting/Merging the OOV words </SectionTitle> <Paragraph position="0"> The word list for the final word segmenter is composed by voting or merging. Voting keeps the OOV words that are extracted by two or more of the OOV word extraction modules, whereas merging keeps the OOV words that are extracted by any of the modules. The model with the former (voting) OOV word list is used in Model a, and the model with the latter (merging) OOV word list is used in Model c.</Paragraph> </Section> <Section position="4" start_page="134" end_page="134" type="sub_section"> <SectionTitle> 2.4 CRF-based word segmenter </SectionTitle> <Paragraph position="0"> The final word segmentation is carried out by a CRF-based word segmenter (Kudo and Matsumoto, 2004) (Peng and McCallum, 2004). The word trellis is composed in the same way as for the MEMM-based word segmenter. 
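The voting and merging schemes of section 2.3 reduce to simple set operations over the three modules' candidate lists; a minimal sketch (function name hypothetical):

```python
from collections import Counter

def combine_oov(candidate_lists, mode="vote"):
    """Combine OOV candidates from several extraction modules.
    mode='vote'  (Model a): keep words proposed by two or more modules.
    mode='merge' (Model c): keep words proposed by any module."""
    # Count in how many modules each word appears (set() avoids double-counting
    # a word proposed twice by the same module).
    counts = Counter(w for cands in candidate_lists for w in set(cands))
    if mode == "vote":
        return {w for w, n in counts.items() if n >= 2}
    return set(counts)  # merge: the union of all candidates
```

Voting trades recall for precision relative to merging, which is the trade-off that distinguishes Models a and c.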
While the MaxEnt framework estimates state transition probabilities locally, the CRF-based method normalizes the probabilities over the whole sentence. This global normalization makes the CRF-based word segmenter robust to the length-bias problem (Kudo and Matsumoto, 2004). We will discuss the length-bias problem in section 4.</Paragraph> </Section> <Section position="5" start_page="134" end_page="134" type="sub_section"> <SectionTitle> 2.5 Note on MSR data </SectionTitle> <Paragraph position="0"> Unfortunately, we could not complete Models a and c for the MSR data due to time constraints. Therefore, we submitted the following two substitute models: Model a for the MSR data is the MEMM-based word segmenter with the OOV word list obtained by voting; Model c for the MSR data is the CRF-based word segmenter with no OOV word candidates.</Paragraph> </Section> </Section> </Paper>