<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1067"> <Title>Chinese and Japanese Word Segmentation Using Word-Level and Character-Level Information</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Word Segmentation Using Word-Level </SectionTitle> <Paragraph position="0"> and Character-Level Information We saw the two methods for word segmentation in the previous section. It is observed that the Markov model-based method has high overall accuracy, however, the accuracy drops for unknown words, and the character tagging method has high accuracy for unknown words but lower accuracy for known words (Yoshida et al., 2003; Xue, 2003; Sproat and Emerson, 2003). This seems natural because words are used as a processing unit in the Markov model-based method, and therefore much information about known words (e.g., POS or word bigram probability) can be used. However, unknown words cannot be handled directly by this method itself. On the other hand, characters are used as a unit in the character tagging method. In general, the number of characters is finite and far fewer than that of words which continuously increases. Thus the character tagging method may be robust for unknown words, but cannot use more detailed information than character-level information. Then, we propose a hybrid method which combines the Markov model-based method and the character tagging method to make the most of word-level and character-level information, in order to achieve high overall accuracy.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 A Hybrid Method </SectionTitle> <Paragraph position="0"> The hybrid method is mainly based on word-level Markov models, but both POC-tags and POS-tags are used in the same time and word segmentation for known words and unknown words are conducted simultaneously.</Paragraph> <Paragraph position="1"> Figure 3 shows an example of the method given a Japanese sentence &quot; &quot;, where the word &quot; &quot;(person's name) is an unknown word. First, given a sentence, nodes of lattice for known words are made as in the usual Markov model-based method. Next, for each character in the sentence, nodes of POC-tags (four nodes for each character) are made. Then, the most likely path is searched (the thick line indicates the correct path in the example). Unknown words are identified by the nodes with POC-tags. Note that some transitions of states are not allowed (e.g. from I to B, or from any POS-tags to E), and such transitions are ignored.</Paragraph> <Paragraph position="2"> Because the basic Markov models in Equation (1) are not expressive enough, we use the following equation instead to estimate probability of a path in a lattice more precisely:</Paragraph> <Paragraph position="4"> The probabilities in the equation above are estimated from a word segmented and POS-tagged corpus using the maximum-likelihood method, for ex-</Paragraph> <Paragraph position="6"> where f(w;t) is a frequency that the word w with the tag t occurred in training data. Unseen events in the training data are handled as they occurred 0.5 times for smoothing. ,1;,2;,3;,4 are calculated by deleted interpolation as described in (Brants, 2000). A word dictionary for a Markov model-based system is often constructed from a training corpus, and no unknown words exist in the training corpus in such a case. 
<Paragraph position="7"> ² As described in Equation (5), we used additive smoothing, which is simple and easy to implement. Although there are more sophisticated methods such as Good-Turing smoothing, they may not necessarily perform well here, because this operation changes the distribution of words. In order to handle various character-level features, we calculate the word emission probabilities for POC-tagged characters as follows:</Paragraph> <Paragraph position="9"> where T_POC = {B, I, E, S}, w_i is a character and t_i is a POC-tag. In the above equation, P(t_i) and P(w_i, t_i) are estimated by the maximum-likelihood method, and the probability of a POC-tag t_i given a character w_i (P(t_i|w_i), t_i ∈ T_POC) is estimated using ME models (Berger et al., 1996). We use the following features for the ME models, where c_x is the x-th character in the sentence, w_i = c_{i0}, and y_x is the character type of c_x (Table 2 shows the definition of the character types we used): (1) Characters (c_{i0-2}, c_{i0-1}, c_{i0}, c_{i0+1}, c_{i0+2}); (2) Pairs of characters (c_{i0-2}c_{i0-1}, c_{i0-1}c_{i0}, c_{i0-1}c_{i0+1}, c_{i0}c_{i0+1}, c_{i0+1}c_{i0+2}); (3) Character types (y_{i0-2}, y_{i0-1}, y_{i0}, y_{i0+1}, y_{i0+2}); (4) Pairs of character types (y_{i0-2}y_{i0-1}, y_{i0-1}y_{i0}, y_{i0-1}y_{i0+1}, y_{i0}y_{i0+1}, y_{i0+1}y_{i0+2}).</Paragraph> <Paragraph position="11"> We use the Generalized Iterative Scaling algorithm (Darroch and Ratcliff, 1972) for parameter estimation, and features that appear 10 times or fewer in the training data are ignored in order to avoid overfitting.</Paragraph> <Paragraph position="12"> What our method does for unknown words can be interpreted as follows: the method examines all possible unknown words in a sentence, and the probability of an unknown word of length k, w_i = c_{i0} ... c_{i0+k-1}, is computed from its POC-tagged characters given the history h of the sequence. In other words, the probability of the unknown word is approximated by the product of the probabilities of its composing characters, and this calculation is done within the framework of the word-level Markov model-based method.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> This section gives experimental results for Chinese and Japanese word segmentation with the hybrid method. The following values are used to evaluate word segmentation performance: R: recall (the number of correctly segmented words in the system's output divided by the number of words in the test data), P: precision (the number of correctly segmented words in the system's output divided by the number of words in the system's output), and F: F-measure (F = 2RP / (R + P)).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Experiments of Chinese Word Segmentation </SectionTitle> <Paragraph position="0"> We use three Chinese word-segmented corpora: the Academia Sinica corpus (AS), the Hong Kong City University corpus (HK) and the Beijing University corpus (PK), all of which were used in the First International Chinese Word Segmentation Bakeoff (Sproat and Emerson, 2003) at ACL-SIGHAN 2003.</Paragraph> <Paragraph position="1"> The three corpora are word-segmented, but POS-tags are not attached; we therefore need to attach to each word a POS-tag (state), which the Markov model-based method requires. We attached a state to each word using the Baum-Welch algorithm (Rabiner and Juang, 1993), which is used for Hidden Markov Models. The algorithm finds a locally optimal tag sequence that maximizes Equation (1) in an unsupervised way. The initial states are assigned randomly, and the number of states is set to 64.</Paragraph>
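The Character Tagging baseline introduced in the comparison below, like the ME model for POC-tags in Section 3.1, relies on the character-window feature templates (1)-(4). As a rough illustration only, not the authors' code, here is a sketch of extracting these features for one character position; the character-type function is a simplified stand-in for the paper's Table 2, and the feature-name strings are hypothetical.

```python
def char_type(c):
    """Very coarse character-type classes; the paper's Table 2 is finer-grained
    (e.g., kanji, hiragana, katakana, digits, Latin letters, symbols)."""
    if c.isdigit():
        return "DIGIT"
    if c.isascii() and c.isalpha():
        return "LATIN"
    return "OTHER"

def poc_features(sentence, i0):
    """Feature templates (1)-(4) of Section 3.1 for the character at position i0."""
    def c(k):  # character at position k, with an out-of-sentence placeholder
        return sentence[k] if 0 <= k < len(sentence) else "#"
    def y(k):  # character type at position k
        return char_type(c(k))
    feats = []
    for d in (-2, -1, 0, 1, 2):                                 # (1) characters
        feats.append(f"C{d}={c(i0 + d)}")
    for a, b in ((-2, -1), (-1, 0), (-1, 1), (0, 1), (1, 2)):   # (2) character pairs
        feats.append(f"C{a},{b}={c(i0 + a)}{c(i0 + b)}")
    for d in (-2, -1, 0, 1, 2):                                 # (3) character types
        feats.append(f"T{d}={y(i0 + d)}")
    for a, b in ((-2, -1), (-1, 0), (-1, 1), (0, 1), (1, 2)):   # (4) character-type pairs
        feats.append(f"T{a},{b}={y(i0 + a)}{y(i0 + b)}")
    return feats

# Example: features for the third character of a toy string.
print(poc_features("AB12CD", 2))
```

Feature (5), the previously assigned POC-tags, is specific to the Character Tagging baseline described next.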
<Paragraph position="2"> We use the following systems for comparison: Bakeoff-1, 2, 3: the top three systems that participated in the SIGHAN Bakeoff (Sproat and Emerson, 2003).</Paragraph> <Paragraph position="3"> Maximum Matching: a word segmentation system using the well-known maximum matching method.</Paragraph> <Paragraph position="4"> Character Tagging: a word segmentation system using the character tagging method. This system is almost the same as the one studied by Xue (2003). The features described in Section 3.1, (1)-(4), and the following (5) are used to estimate the POC-tag of a character c_{i0}, where t_x is the POC-tag of the x-th character in the sentence: (5) Unigram and bigram of the previous POC-tags (t_{i0-1}, t_{i0-2}t_{i0-1}).</Paragraph> <Paragraph position="6"> None of these systems, including ours, uses any knowledge or resources other than the training data.</Paragraph> <Paragraph position="7"> In these experiments, the word dictionaries used by the hybrid method and Maximum Matching are constructed from all the words in each training corpus. Statistical information on these data is shown in Table 3. The calculated values of λi in Equation (4) are shown in Table 4.</Paragraph> <Paragraph position="8"> The results are shown in Table 5. Our system achieved the best F-measure values on the three corpora. Although the hybrid system's recall values for known words are not high compared with those of the SIGHAN Bakeoff participants, its recall values for known and unknown words are relatively well balanced. The results of Maximum Matching and Character Tagging show the trade-off between the word-based approach and the character-based approach discussed in Section 3. Maximum Matching is word-based and has higher recall for known words than Character Tagging on the HK and PK corpora. Character Tagging is character-based and has the highest recall for unknown words on the AS, HK and PK corpora.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Experiments of Japanese Word Segmentation </SectionTitle> <Paragraph position="0"> We use the RWCP corpus, which is a Japanese word-segmented and POS-tagged corpus.</Paragraph> <Paragraph position="1"> We use the following systems for comparison: ChaSen: a word segmentation and POS-tagging system based on extended Markov models (Asahara and Matsumoto, 2000; Matsumoto et al., 2001). This system carries out unknown word processing using heuristic rules.</Paragraph> <Paragraph position="2"> Maximum Matching: the same system used in the Chinese experiments.</Paragraph> <Paragraph position="3"> Character Tagging: the same system used in the Chinese experiments.</Paragraph> <Paragraph position="4"> As the dictionary for ChaSen, Maximum Matching and the hybrid method, we use IPADIC (Matsumoto and Asahara, 2001), which is bundled with ChaSen. Statistical information on these data is shown in Table 3. The calculated values of λi in Equation (4) are shown in Table 4.</Paragraph> <Paragraph position="5"> The results are shown in Table 6.³ Compared with ChaSen, the hybrid method has a comparable F-measure and a higher recall for unknown words (the difference is statistically significant at the 95% confidence level). Character Tagging has the highest recall for unknown words, as in the Chinese experiments.</Paragraph> </Section> </Section>
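The Maximum Matching baseline used in both sets of experiments is described only as &quot;the well-known maximum matching method&quot;, so its exact variant is not specified; the following is a minimal sketch of the common forward longest-match version, assuming the dictionary is a plain set of words and assuming a maximum word length.

```python
def max_match(sentence, dictionary, max_len=10):
    """Greedy forward longest-match segmentation.
    dictionary: a set of known words; unmatched characters become 1-character words."""
    words, i = [], 0
    while i < len(sentence):
        match = None
        # try the longest candidate starting at position i first
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            match = sentence[i]  # fall back to a single character
        words.append(match)
        i += len(match)
    return words

# Toy example (Latin characters for readability):
print(max_match("thecatsat", {"the", "cat", "cats", "sat"}))
# -> ['the', 'cats', 'a', 't']: the greedy longest match commits to 'cats' and
# cannot recover, the kind of error that a lattice search over alternatives avoids.
```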
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> Several studies have been conducted on word segmentation and unknown word processing. Xue (2003) studied Chinese word segmentation using the character tagging method. As seen in the previous section, this method handles known and unknown words in the same way, based on character-level information. Our experiments showed that the method has quite high accuracy for unknown words, but its accuracy for known words tends to be lower than that of other methods.</Paragraph> <Paragraph position="1"> ³ Recall values for known and unknown words are calculated considering words in the dictionary as known words. Words which are in the training corpus but not in the dictionary are handled as unknown words in the calculations. The number of known/unknown words of the RWCP corpus shown in Table 3 is also calculated in the same way.</Paragraph> <Paragraph position="2"> Uchimoto et al. (2001) studied Japanese word segmentation using ME models. Although their method is word-based, no word dictionary is used directly, and known and unknown words are handled in the same way. The method uses ME to estimate how likely a string is to be a word. Given a sentence, it estimates this probability for every substring of the sentence, and word segmentation is conducted by finding the division of the sentence that maximizes the product of the probabilities that each divided substring is a word. Compared to our method, their method can handle some types of features for unknown words, such as &quot;the word starts with an alphabetic character and ends with a numeral&quot; or &quot;the word consists of four characters&quot;. Our method cannot handle such word-level features, because unknown words are handled with the character as the unit. On the other hand, their method seems to have a computational-cost problem: unknown words are processed with the word as the unit, so the number of candidate unknown words in a sentence of n characters is n(n + 1)/2. In fact, they did not consider every substring of a sentence, but limited substrings to at most five characters. In our method, the number of POC-tagged character nodes needed for unknown word processing is 4n, and there is no limit on the length of unknown words.</Paragraph> <Paragraph position="3"> Asahara et al. (2003) studied Chinese word segmentation based on a character tagging method with support vector machines. They preprocess a given sentence with a word segmenter based on Markov models, and its output is used as features for character tagging. Theirs is thus a character-based method incorporating word-level information, which is the reverse of our approach. They did not use some of the features we used, such as character types, and our method achieved higher accuracy than theirs on the AS, HK and PK corpora (Asahara et al., 2003).</Paragraph> </Section> </Paper>