<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1721">
  <Title>Chinese Word Segmentation Using Minimal Linguistic Knowledge</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Word segmentation
</SectionTitle>
    <Paragraph position="0"> New texts are segmented in four steps which are described in this section. New words are automatically extracted from the unsegmented testing texts and added to the base dictionary consisting of words from the training data before the testing texts are segmented, line by line.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Base segmentation algorithm
</SectionTitle>
      <Paragraph position="0"> Given a dictionary and a sentence, our base segmentation algorithm finds all possible segmentations of the sentence with respect to the dictionary, computes the probability of each segmentation, and chooses the segmentation with the highest probability. If a sentence of a6 characters, a1a8a7a10a9a12a11a13a9a13a14a16a15a12a15a17a15a18a9a17a19 , has a segmentation of a20 words, a1a21a7a23a22a24a11a13a22a25a14a26a15a17a15a12a15a27a22a29a28 , then the probability of the segmentation is estimated as a30a32a31 a1a34a33a18a35a2a36a37a7 a30a38a31a22a24a11a39a22a25a14a26a15a17a15a12a15a40a22a29a28a41a36a29a42a44a43</Paragraph>
      <Paragraph position="2"> where a35 denotes a segmentation of a sentence. The probability of a word is estimated from the training corpus as</Paragraph>
      <Paragraph position="4"> , where a53a54a31a22a2a36 is the number of times that the word a22 occurs in the training corpus, and a53 is the number of words in the training corpus. When a word is not in the dictionary, a frequency of 0.5 is assigned to the new word. The dynamic programming technique is applied to find the segmentation of the highest probability of a sentence without first enumerating all possible segmentations of the sentence with respect to the dictionary. Consider the text fragmenta55a44a56a44a57a44a58a60a59 with respect to a dictionary containing the wordsa55a61a56a62a59a63a55a61a56a61a57a62a59a63a57a61a58a62a59a63a57 anda58a62a59 it has three segmentations: (1) a55a64a56 / a57a65a58a62a66 (2) a55a65a56a64a57 / a58a60a66 and (3)a55a61a56 /a57 /a58a60a67 The probabilities of the three segmentations are computed as: (1) p(a55a64a56 )*p(a57a64a58 ); (2)</Paragraph>
      <Paragraph position="6"> bility of a word is estimated by its relative frequency in the training data. Assume the first segmentation has the highest probability, then the text fragment will be segmented into</Paragraph>
      <Paragraph position="8"/>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Combining single characters
</SectionTitle>
      <Paragraph position="0"> New words are usually two or more characters long and are often segmented into single characters. For example, the word a69a64a70 is segmented into a69 / a70 when it is not in the dictionary. After a sentence is segmented using the base algorithm, the consecutive single Hanzi characters are combined into a word if the in-word probabilities of the single characters are over a threshold which is empirically determined from the training data. The in-word probability of a character is the probability that the character occurs in a word of two or more characters.</Paragraph>
      <Paragraph position="1"> Some Hanzi characters, such as a71 and a72a73a59 occur as words on their own in segmented texts much more frequently than in words of two or more characters. For example, in the PK training corpus, the character a72 occurs as a word on its own 11,559 times, but in a word only 875 times.</Paragraph>
      <Paragraph position="2"> On the other hand, some Hanzi characters usually do not occur alone as words, instead they occur as part of a word.</Paragraph>
      <Paragraph position="3"> As an example, the character a74 occurs in a word 17,108 times, but as a word alone only 794 times in the PK training data. For each character in the training data, we compute its in-word probability as follow: a30a38a31a76a75</Paragraph>
      <Paragraph position="5"> where a53a54a31a92a75 a36 is the number of times that character a75 occurs in the training data, and a53a54a31a92a75</Paragraph>
      <Paragraph position="7"> a36 is the number of times that character a75 is in a word of two or more characters. We do not want to combine the single characters that occur as words alone more often than not. For both the PK training data and the AS training data, we divided the training data into two parts, two thirds for training, and one third for system development. We found that setting the threshold of the in-word probability to 0.85 or around works best on the development data. After the initial segmentation of a sentence, the consecutive single-characters are combined into one word if their in-word probabilities are over the threshold of 0.85. The text fragmenta96a44a96a8a97a73a98a44a99a44a100a44a101 contains a new worda99a44a100a44a101 which is not in the PK training data. After the initial segmentation, the text is segmented into a96a68a96 / a97a102a98 / a99 / a100 / a101 /, which is subsequently changed into a96a68a96 / a97a103a98 / a99a68a100a68a101 after combining the three consecutive characters. The in-word probabilities for the three characters a99a44a59a60a100a44a59 and a101 are 0.94, 0.98, and 0.99, respectively.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Combining suffixes
</SectionTitle>
      <Paragraph position="0"> A small set of characters , such as a104a44a59a60a105 and a106a44a59 frequently occur as the last character in words. We selected 145 such characters from the PK training corpus, and 113 from the AS corpus. After combining single characters, we combine a suffix character with the word preceding it if the preceding word is at least two-character long.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Consistency check
</SectionTitle>
      <Paragraph position="0"> The last step is to perform consistency checks. A segmented sentence, after combining single characters and suffixes, is checked against the training data to make sure that a text fragment in a testing sentence is segmented in the same way as in the training data if it also occurs in the training data. From the PK training corpus, we created a phrase segmentation table consisting of word quadgrams, trigrams, bigrams, and unigrams, together with their segmentations and frequencies. Our phrase table created from the AS corpus does not include word quad-grams to reduce the size of the phrase table. For example, from the training text a107a44a108 / a109 / a110 / a111a8a112a68a59 we create the following entries (only some are listed to save space): text fragment freq segmentation</Paragraph>
      <Paragraph position="2"> After a new sentence is processed by the first three steps, we look up every word quad-grams of the segmented sentence in the phrase segmentation table. When a word quad-gram is found in the phrase segmentation table with a different segmentation, we replace the segmentation of the word quad-gram in the segmented sentence by its segmentation found in the phrase table. This process is continued to word trigrams, word bigrams, and word unigrams. The idea is that if a text fragment in a new sentence is found in the training data, then it should be segmented in the same way as in the training data. As an example, in the PK testing data, the sentence a107a113a108a61a109a64a110a61a114a61a115a64a116a61a117a118a71a73a119a61a117a64a120a61a121a62a122 is segmented into a107a113a108 /a109a61a110 /a114a61a115 /a116 /a117 / a71 /a119a61a117a64a120 a121 /a122 after the first three steps (the two characters a116 and a117 are not, but should be, combined because the in-word probability of character a116a61a59 which is 0.71, is below the pre-defined threshold of 0.85). The word bigram a107a102a108a68a109 a110 is found in the phrase segmentation table with a different segmentation, a107a102a108 / a109 / a110a61a67 So the segmentation a107a102a108 / a109a68a110 is changed to the segmentation a107a102a108 / a109 / a110 in the final segmented sentence. In essence, when a text fragment has two or more segmentations, its surrounding context, which can be the preceding word, the following word, or both, is utilized to choose the most appropriate segmentation. When a text fragment in a testing sentence never occurred in the same context in the training data, then the most frequent segmentation found in the training data is chosen. Consider the text a109a68a110 again, in the testing data, a59a123a109a44a110a61a124a44a125 is segmented into a59 /a109a61a110 /a124a44a125 by our base algorithm. In this case, a109a64a110 never occurred in the context of a59a126a109a64a110a61a124a61a125a62a59a127a59a126a109a64a110 ora109a61a110a64a124a61a125a62a67 The consistency check step changes a59 /a109a61a110 / a124a61a125 into a59 /a109 /a110 / a124 a125 since a109a64a110 is segmented into a109 / a110 515 times, but is treated as one word a109a61a110 105 times in the training data.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 New words recognition
</SectionTitle>
    <Paragraph position="0"> We developed a few procedures to identify new words in the testing data. Our first procedure is designed to recognize numbers, dates, percent, time, foreign words, etc. We defined a set of characters consisting of characters such as the digits '0' to '9' (in ASCII and GB), the letters 'a' to 'z', 'A' to 'Z' (in ASCII and GB), 'a128a68a121a68a129a68a130a132a131a103a133a68a134a68a135 a136a68a137a68a138a65a139a68a140a68a141a68a142 a67a144a143a102a145a68a146a65a147a68a117a149a148 ', and the like. Any consecutive sequence of the characters that are in this pre-defined set of characters is extracted and post-processed.</Paragraph>
    <Paragraph position="1"> A set of rules is implemented in the post-processor. One such rule is that if an extracted text fragments ends with the character a117 and contains any character in a138a65a139a65a140a64a141 a142 a147a62a59 then remove the ending character a117 and keep the remaining fragment as a word. For example, our recognizer will extract the text fragment a131 a139a150a136a150a138 a117 and a147 a137 a117 since all the characters are in the pre-defined set of characters. The post-processor will strip off the trailing character  sonal names, we developed a program to extract the names preceding texts such as a151a153a152a118a154a103a155a157a156 and a151a153a158a159a156a27a59 a program to detect and extract names in a sequence of names separated by the Chinese punctuation &amp;quot;a160 &amp;quot;, such as a161a61a162a61a163  personal names (Chinese or foreign) following title or profession names, such asa180a64a181a183a182 in the texta184a64a185a61a186a61a187a118a188a73a180 a181a189a182a190a59 and a program to extract Chinese personal names based on the preceding word and the following word. For example, the string a191a171a192a64a107 in a193a64a163a171a71a113a191a171a192a65a107a113a194 is most likely a personal name (in this case, it is) sincea191 is a Chinese family name, the string is three-character long (a typical Chinese personal name is either three or two-character long). Furthermore, the preceding word a71 and the following word a194 are highly unlikely to appear in a Chinese personal name. For the personal names extracted from the PK testing data, if the name is two or three-character long, and if the first character or two is a Chinese family name, then the family name is separated from the given name. The family names are not separated from the given names for the personal names extracted from the AS testing data. In some cases, we find it difficult to decide whether or not the first character should be removed from a personal name. Consider the personal name a195a23a196a44a197 which looks like a Chinese personal name since the first character is a Chinese family name, and the name is three-character long. If it is a translated foreign name (in this case, it is), then the name should not be split into family name and given name. But if it is the name of a Chinese personal name, then the family name a195 should be separated from the given name. For place names, we developed a simple program to extract names of cities, counties, towns, villages, streets, etc, by extracting the strings of up to three characters appearing between two place name designators. For example, from the text a198a64a199</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> The last row (in boldface) in Table 1 gives our official results for the PK closed track. Other rows in the table present the results under different experimental conditions. The column labeled steps refers to the executed steps of our Chinese word segmentation algorithm. Step 1 segments a text using the base segmentation algorithm, step 2 combines single characters, step 3 attaches suffixes to the preceding words, and step 4 performs consistency checks. The four steps are described in details in section 2. The column labeled dict gives the dictionary used in each experiment. The pkd1 consists of only the words from the PK training corsteps dict R P F a176a37a177a27a177a40a178 a176a37a179a50a178  using words from the training data.</Paragraph>
    <Paragraph position="1"> pus, pkd2 consists of the words in pkd1 and the words converted from pkd1 by changing the GB encoding to ASCII encoding for the numeric digits and the English letters, and pkd3 consists of the words in pkd2 and the words automatically extracted from the PK testing texts using the procedures described in section 3. The columns labeled R, P and F give the recall, precision, and F score, respectively. The columns labeled a208 a77a81a77a40a209 and a208  a209 show the recall on out-of-vocabulary words and the recall on in-vocabulary words, respectively. All evaluation scores reported in this paper are computed using the score program written by Richard Sproat. We refer readers to (Sproat and Emerson, 2003) for details on the evaluation measures. For example, row 4 in table 1 gives the results using pkd3 dictionary when a sentence is segmented by the base algorithm, and then the single characters in the initial segmentation are combined, but suffixes are not attached and consistency check is not performed. The last row in table 2 presents our official results for the closed track using the AS corpus. The asd1 dictionary contains only the words from the AS training corpus, while the asd2 consists of the words in asd1 and the new words automatically extracted from the AS testing texts using the new words recognition described in section 3. The results show that new words recognition and joining single characters contributed the most to the increase in precision, while the consistency check contributed the most to the increase in recall. Table 3 gives the results of the maximum matching using only the words in the training data. While the difference between the F-scores of the maximum matching and the base algorithm is small for the PK corpus, the F-score difference for the AS corpus is much larger. Our base algorithm performed substantially better than the maximum matching for the AS corpus. The performances of our base algorithm on the testing data using the words from the training data are presented in row 1 in table 1 for the a3a5a4 corpus, and row 1 in table 2 for the a0a2a1 corpus.</Paragraph>
  </Section>
class="xml-element"></Paper>