<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0121"> <Title>Chinese Word Segmentation with Maximum Entropy and N-gram Language Model</Title> <Section position="4" start_page="0" end_page="139" type="metho"> <SectionTitle> 2 System Description </SectionTitle>
<Paragraph position="0"> Our systems are built from the ME model, an n-gram language model, and several post-processing strategies. Detailed descriptions of these components are given in the following subsections.</Paragraph>
<Section position="1" start_page="0" end_page="138" type="sub_section"> <SectionTitle> 2.1 Maximum Entropy Model </SectionTitle>
<Paragraph position="0"> The ME model used in our system is based on previous work (Jin Kiat Low et al., 2005; Hwee Tou Ng et al., 2004). As mentioned above, ME-model-based word segmentation is a four-class learning process. We denote the four classes, i.e. the beginning, middle, and end of a multi-character word, and a single-character word, as b, m, e and s respectively.</Paragraph>
<Paragraph position="1"> In the ME model, the following features (Jin Kiat Low et al., 2005) are selected:</Paragraph>
<Paragraph position="2"> $c_n \; (n = -2, -1, 0, 1, 2)$; \quad $c_n c_{n+1} \; (n = -2, -1, 0, 1)$; \quad $c_{-1} c_1$</Paragraph>
<Paragraph position="3"> where $c_n$ indicates the character at position $n$ to the left or right of the current character $c_0$.</Paragraph>
<Paragraph position="4"> For the open track in particular, three extended features are extracted with the help of an external dictionary, as follows:</Paragraph>
<Paragraph position="5"> $Pu(c_0)$; \quad $L\,t_0$; \quad $c_n t_0 \; (n = -1, 0, 1)$</Paragraph>
<Paragraph position="6"> where $Pu(c_0)$ denotes whether the current character is a punctuation mark, $L$ is the length of the longest word $W$, formed from the character and its context, that matches a word in the external dictionary, and $t_0$ is the boundary tag of the character in $W$.</Paragraph>
<Paragraph position="7"> With these features, an ME model is trained that outputs four scores for each character, one per class. Based on the scores of all characters, a triangular matrix covering all candidate words can be constructed. Each element $w_{ji}$ in this matrix represents a word that starts at the $i$th character and ends at the $j$th character, and its value $ME(j,i)$, the score for these $(j-i+1)$ characters to form a word, is calculated as follows:</Paragraph>
<Paragraph position="8"> $$ME(j,i) = \begin{cases} -\log p(s \mid c_i), & j = i \\ -\log p(b \mid c_i) - \sum_{k=i+1}^{j-1} \log p(m \mid c_k) - \log p(e \mid c_j), & j > i \end{cases} \eqno(1)$$</Paragraph>
<Paragraph position="9"> As a consequence, the optimal segmentation result, corresponding to the best path with the lowest overall score, can be reached via a dynamic programming algorithm. For example, consider a seven-character sentence meaning "I was 19 years old that year". Table 1 shows its corresponding matrix. In this example, the ultimate segmentation divides the sentence into four words of one, two, one, and three characters respectively.</Paragraph>
</Section>
<Section position="2" start_page="138" end_page="138" type="sub_section"> <SectionTitle> 2.2 Language Model </SectionTitle>
<Paragraph position="0"> The n-gram language model, a widely used method in natural language processing, can represent the contextual relations between words. In our systems, a bigram model is integrated with the ME model in the phase of calculating the path score. In detail, the score of a path is modified by adding the bigram scores of words, with a weight $\lambda$, at the word boundaries. The approach used for modifying the path score is based on the following formula:</Paragraph>
<Paragraph position="1"> $$V[j,i] = \min_{k}\left\{ V[i-1,k] - \lambda \log p(w_{i,j} \mid w_{k,i-1}) \right\} + ME(j,i) \eqno(2)$$</Paragraph>
<Paragraph position="2"> where $V[j,i]$ is the score of the locally best path that ends at the $j$th character and whose last word is $w_{i,j} = c_i \ldots c_j$. The parameter $\lambda$ is optimized on the test set used in the 2nd International Chinese Word Segmentation Bakeoff.</Paragraph>
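To make the recurrence in Eq. (2) concrete, the following is a minimal sketch of the decoder in Python; the helper names me_score, bigram_cost, the weight lam, and the word-length bound max_len are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the dynamic programming search, assuming
# hypothetical helpers (not the paper's actual implementation):
#   me_score(i, j):  the ME(j, i) score for characters i..j forming
#                    one word (lower is better)
#   bigram_cost(prev_word, word):  -log p(word | prev_word), including
#                    the backoff to a punished unigram for OOV words
#   lam:  the interpolation weight lambda for the language-model term
def segment(chars, me_score, bigram_cost, lam, max_len=10):
    n = len(chars)
    # V[j][i] = (score, k): best path ending at character j whose last
    # word is chars[i..j]; k is the start of the previous word.
    V = [dict() for _ in range(n)]
    for j in range(n):
        for i in range(max(0, j - max_len + 1), j + 1):
            word = chars[i:j + 1]
            base = me_score(i, j)
            if i == 0:  # first word of the sentence
                V[j][i] = (base + lam * bigram_cost("<s>", word), None)
                continue
            best = None
            for k, (prev_score, _) in V[i - 1].items():
                cost = prev_score + base + lam * bigram_cost(chars[k:i], word)
                if best is None or cost < best[0]:
                    best = (cost, k)
            V[j][i] = best
    # Trace back the path with the lowest overall score.
    words, j = [], n - 1
    i = min(V[j], key=lambda s: V[j][s][0])
    while True:
        words.append(chars[i:j + 1])
        prev_start = V[j][i][1]
        if prev_start is None:
            break
        j, i = i - 1, prev_start
    return list(reversed(words))
```

Here V mirrors $V[j,i]$ above: each cell keeps the lowest score of any path ending at character $j$ whose last word starts at character $i$.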
<Paragraph position="3"> When scoring a path, if one of the words $w_{k,i-1}$ and $w_{i,j}$ is out of the vocabulary, their bigram backs off to the unigram, and the unigram of the OOV word is calculated as:</Paragraph>
<Paragraph position="4"> $$p(w_{\mathrm{oov}}) = \hat{p}^{\,l}$$</Paragraph>
<Paragraph position="5"> where $\hat{p}$ is the minimal unigram value of the words in the vocabulary, and $l$, the length of the word, acts as a punishment factor that avoids overemphasizing long OOV words.</Paragraph>
</Section>
<Section position="3" start_page="138" end_page="139" type="sub_section"> <SectionTitle> 2.3 Post Processing Strategies </SectionTitle>
<Paragraph position="0"> Analysis of preliminary experiments with the ME model and the n-gram language model led to several post-processing strategies for our final systems.</Paragraph>
</Section>
<Section position="4" start_page="138" end_page="138" type="sub_section"> <SectionTitle> 2.3.1 Division and Combination Strategy </SectionTitle>
<Paragraph position="0"> To handle the combination ambiguity issue, we introduce a division and combination strategy that makes use of unigrams and bigrams. For any two adjacent words A and B, if their bigram does not exist in the training set while the unigram of the word AB does, they are conjoined into one word. For example, "八月 (August)" and "革命 (revolution)" are two segmented words; if the bigram of "八月" and "革命" is absent from the training set while the word "八月革命 (the August Revolution)" appears in it, the character string "八月革命" is conjoined into one word. Conversely, a word C that can be divided into AB is re-segmented if its unigram does not exist in the training set while the bigram of its subwords A and B does. Taking the word "经济体制改革 (economic system reform)" for instance: if its unigram is absent from the training set while the bigram of the two subwords "经济体制" and "改革" is present, it is segmented into the two words "经济体制 (economic system)" and "改革 (reform)".</Paragraph>
</Section>
<Section position="5" start_page="138" end_page="139" type="sub_section"> <SectionTitle> 2.3.2 Numeral Word Processing Strategy </SectionTitle>
<Paragraph position="0"> The ME model often segments a numeral word into several words. For instance, the word "4.34元 (RMB 4.34)" may be segmented into the two words "4." and "34元". To tackle this problem, a numeral word processing strategy is used. Under this strategy, words containing Arabic numerals are first manually marked in the training set; then a list of high-frequency characters that always appear alone between numbers in the training set is extracted. Based on this list, numeral words are handled as follows: when segmenting a sentence, if two adjacent words are numeral words and the last character of the former is in the list, they are combined into one word.</Paragraph>
</Section>
<Section position="6" start_page="139" end_page="139" type="sub_section"> <SectionTitle> 2.3.3 Long Organization Name Processing Strategy </SectionTitle>
<Paragraph position="0"> Since an organization name is usually an OOV word, it tends to be segmented into several words, especially when it is long, whereas the MSRA corpus requires it to be recognized as one word. In our systems, a corresponding strategy is employed to deal with this problem. First, a list of organization names is manually selected from the training set and stored in a character-based prefix tree, as sketched below.</Paragraph>
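A minimal sketch of such a character-based frequency trie; the node layout and names are illustrative assumptions, not the paper's actual data structure.

```python
# A character-based frequency trie for storing organization names.
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.freq = 0       # number of stored names passing through here

def build_trie(names, reverse=False):
    """Store each name character by character, counting frequencies.
    With reverse=True the character order is inverted, which yields the
    tree used for suffix extraction (see below)."""
    root = TrieNode()
    for name in names:
        node = root
        for ch in (reversed(name) if reverse else name):
            node = node.children.setdefault(ch, TrieNode())
            node.freq += 1
    return root
```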
<Paragraph position="1"> Then a list of prefixes is extracted by scanning the prefix tree: for each node, if the frequencies of its child nodes are all lower than both a predefined threshold k and half of the frequency of the current node, the string of the current node is extracted as a prefix; otherwise, if there exists a child node whose frequency is higher than the threshold k, the corresponding subtree is scanned in the same way. Suffixes are extracted likewise; the only difference is that the order of the characters in the lexical tree is reversed.</Paragraph>
<Paragraph position="2"> During the recognition phase, a string of 2-5 successive words is combined into one word if all of the following conditions are satisfied (a sketch of this check follows the list).</Paragraph>
<Paragraph position="3"> a) It does not include numbers, full stops, or commas.</Paragraph>
<Paragraph position="4"> b) It includes at least one OOV word.</Paragraph>
<Paragraph position="5"> c) It has a tail substring matching some suffix.</Paragraph>
<Paragraph position="6"> d) It appears more than twice in the test data.</Paragraph>
<Paragraph position="7"> e) It has a higher frequency than any of its substrings that is an OOV word or a combination of multiple words.</Paragraph>
<Paragraph position="8"> f) For any two successive words $w_1$ and $w_2$ in the string, $freq(w_1 w_2)/freq(w_1) \geq 0.1$, unless $w_1$ contains some prefix at its right end.</Paragraph>
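A minimal sketch of this filter, assuming that freq (training-set word frequencies), test_freq (test-data string frequencies), oov, prefixes, and suffixes were produced by the earlier steps; all names are illustrative assumptions.

```python
import re

# Illustrative check of conditions a)-f) for a candidate word string.
def combinable(words, freq, test_freq, oov, prefixes, suffixes):
    cand = "".join(words)
    # a) no numbers, full stop, or comma
    if re.search(r"[0-9。，.,]", cand):
        return False
    # b) at least one OOV word
    if not any(w in oov for w in words):
        return False
    # c) some tail substring matches a known suffix
    if not any(cand.endswith(s) for s in suffixes):
        return False
    # d) appears more than twice in the test data
    if test_freq.get(cand, 0) <= 2:
        return False
    # e) no strictly more frequent substring that is an OOV word or a
    #    combination of several of the candidate's words
    for a in range(len(words)):
        for b in range(a, len(words)):
            if a == 0 and b == len(words) - 1:
                continue  # skip the candidate itself
            sub = "".join(words[a:b + 1])
            if (b > a or sub in oov) and \
                    test_freq.get(sub, 0) > test_freq.get(cand, 0):
                return False
    # f) each adjacent pair must be cohesive enough, unless the first
    #    word ends with a known prefix
    for w1, w2 in zip(words, words[1:]):
        if any(w1.endswith(p) for p in prefixes):
            continue
        f1 = freq.get(w1, 0)
        if f1 and freq.get(w1 + w2, 0) / f1 < 0.1:
            return False
    return True
```
</Section> </Section> </Paper>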