<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1721">
  <Title>Chinese Word Segmentation Using Minimal Linguistic Knowledge</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Discussions
</SectionTitle>
    <Paragraph position="0"> In this section we examine in some detail the problem of segmentation inconsistencies within the training data, within the testing data, and between the training data and the testing data. Due to space limitations, we report only our findings on the PK corpus, though the same kinds of inconsistencies also occur in the AS corpus. We understand that it is difficult, or even impossible, to eliminate segmentation inconsistencies completely. However, a close look at the problem may tell us more about the impact of segmentation inconsistencies on a system's performance.</Paragraph>
    <Paragraph position="1"> We wrote a program that takes a segmented corpus as input and prints out the shortest text fragments in the corpus that have two or more segmentations. For each such fragment, the program also prints out each way the fragment is segmented and how many times it is segmented that way. While some of the text fragments, such as a210a171a211 and a212a61a213a214a59 truly have two different segmentations, depending on the contexts in which they occur or on their meanings, others are segmented inconsistently.</Paragraph>
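The program described above is not given in the paper, so the following is only a minimal sketch of one way it might work. It assumes that a corpus is a list of sentences, that each sentence is a list of words, and that a text fragment is a run of consecutive whole words. The function names, the max_words limit, and the Latin placeholder characters in the sample corpus are all hypothetical.

```python
from collections import Counter, defaultdict


def collect_segmentations(corpus, max_words=4):
    """Map each text fragment to a Counter of the ways it is segmented.

    corpus: iterable of sentences, each a list of words.
    A fragment is a run of 1..max_words consecutive words (an assumption,
    not something the paper specifies).
    """
    table = defaultdict(Counter)
    for sentence in corpus:
        for start in range(len(sentence)):
            stop_limit = min(len(sentence), start + max_words)
            for stop in range(start + 1, stop_limit + 1):
                words = tuple(sentence[start:stop])
                table["".join(words)][words] += 1
    return table


def shortest_inconsistent(table):
    """Keep fragments with two or more segmentations that contain no
    shorter fragment which also has two or more segmentations."""
    ambiguous = {text: ways for text, ways in table.items() if len(ways) >= 2}
    result = {}
    for text, ways in ambiguous.items():
        # Quadratic substring check: acceptable for a sketch, slow on large corpora.
        has_shorter = any(other != text and other in text for other in ambiguous)
        if not has_shorter:
            result[text] = ways
    return result


# Hypothetical toy corpus with Latin placeholders instead of Chinese characters.
corpus = [["AB", "C"], ["ABC"], ["D", "AB", "C"], ["D", "ABC"]]
found = shortest_inconsistent(collect_segmentations(corpus))
for text, ways in found.items():
    for words, count in ways.items():
        print(text, " / ".join(words), count)
```

In the toy corpus both ABC and DABC have two segmentations, but only ABC is reported, because DABC contains it. Telling a true ambiguity from an inconsistency still requires reading the contexts by hand.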
    <Paragraph position="2"> We ran this program on the PK testing data and found 21 unique shortest text fragments, occurring 87 times in total, that have two different segmentations. Some of the text fragments, such as a215a44a216a44a217a60a59 are inconsistently segmented. The fragment a215a64a216a65a217 occurs twice in the testing data and is segmented into a215a61a216 / a217 in one case, but treated as one word in the other case. We found 1,500 unique shortest text fragments in the PK training data that have two or more segmentations, and 97 unique shortest text fragments that are segmented differently in the training data and in the testing data. For example, the text a218a61a219a61a220a118a221 is treated as one word in the training data, but is segmented into a218 / a219 / a220 / a221 in the testing data. In the AS corpus, we found 11,136 unique shortest text fragments that have two or more segmentations in the training data, 21 that have two or more segmentations in the testing data, and 38 that are segmented differently in the training data and in the testing data.</Paragraph>
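The comparison between training and testing data could be sketched as follows. The paper does not define exactly when a fragment counts as segmented differently across the two sets, so this sketch assumes one reading: a fragment conflicts if it occurs in both sets and the testing data contains a segmentation never seen in training. The function name and the dictionary layout (fragment string mapped to a set of word tuples) are hypothetical.

```python
def cross_corpus_conflicts(train_ways, test_ways):
    """Return fragments whose testing segmentations include one never
    seen in training (an assumed definition, not the paper's)."""
    conflicts = {}
    for text, ways_in_test in test_ways.items():
        ways_in_train = train_ways.get(text)
        if ways_in_train is None:
            continue  # never seen in training: a new fragment, not a conflict
        if ways_in_test - ways_in_train:
            conflicts[text] = (ways_in_train, ways_in_test)
    return conflicts


# Hypothetical placeholders: WXYZ is one word in training but four words in testing.
train_ways = {"WXYZ": {("WXYZ",)}, "PQ": {("P", "Q")}}
test_ways = {"WXYZ": {("W", "X", "Y", "Z")}, "PQ": {("P", "Q")}, "RS": {("RS",)}}
conflicts = cross_corpus_conflicts(train_ways, test_ways)
print(sorted(conflicts))
```

Under this reading, a fragment that is one word in every training occurrence but split in even one testing occurrence is reported, while fragments that appear only in the testing data are left out.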
    <Paragraph position="3"> Segmentation inconsistencies exist not only within the training data and within the testing data, but also between the training data and the testing data. For example, the text fragment a222a61a119a61a223 occurs 35 times in the PK training data and is consistently segmented into a222a68a119 / a223a61a59 but the same text fragment, occurring twice in the testing data, is segmented into a222 / a119 / a223 in both cases. The text a224a65a225 occurs 67 times in the training data and is treated as one word a224a68a225 in all 67 cases, but the same text, occurring 4 times in the testing data, is segmented into a224 / a225 in all 4 cases. The text a226a61a227a113a228 occurs 16 times in the training data and is treated as one word in all cases, but in the testing data it is treated as one word in three cases and segmented into a226a61a227 / a228 in one case. The text a229a65a230 is segmented into a229 / a230 in 8 cases, but treated as one word in one case, in the training data. A couple of text fragments seem to be incorrectly segmented. The text  Our segmented texts of the PK testing data differ from the reference segmented texts for 580 text fragments (427 unique). Out of these 580 text fragments, 126 are among the shortest text fragments that have one segmentation in the training data but another in the testing data. This implies that up to 21.7% of the mistakes committed by our system may have been affected by the segmentation inconsistencies between the PK training data and the PK testing data. Since only 38 unique shortest text fragments in the AS corpus are segmented differently in the training data and the testing data, the inconsistency problem probably had less impact on our AS results. Out of the same 580 text fragments, 359 (62%) are new words in the PK testing data.
For example, the proper name a239a241a240a64a242a60a59 which is a new word, is incorrectly segmented into a239 / a240a64a242 by our system. Another example is the new word a243a113a244a171a243a113a245 which is treated as one word in the testing data, but is segmented into a243 / a244 / a243 / a245 by our system. Some of the longer incorrectly segmented text fragments may also involve new words, so at least 62%, but under 80%, of the incorrectly segmented text fragments either are new words or involve new words.</Paragraph>
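The percentages quoted above follow directly from the reported counts. The short check below recomputes them; the variable names are hypothetical and only the three counts come from the text.

```python
# Counts reported in the text for the PK testing data.
total_errors = 580   # fragments where the system output differs from the reference
inconsistent = 126   # of those, fragments segmented differently in training and testing
new_words = 359      # of those, fragments that are new words


def percent(part, whole):
    """Share of part in whole, as a percentage."""
    return 100.0 * part / whole


inconsistency_share = percent(inconsistent, total_errors)
new_word_share = percent(new_words, total_errors)
print(f"{inconsistency_share:.1f}%")  # prints 21.7%
print(f"{new_word_share:.1f}%")       # prints 61.9%, reported as 62% in the text
```

Both figures are shares of the same 580 mismatching fragments.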
  </Section>
</Paper>