<?xml version="1.0" standalone="yes"?> <Paper uid="I05-3029"> <Title>Maximal Match Chinese Segmentation Augmented by Resources Generated from a Very Large Dictionary for Post-Processing</Title> <Section position="5" start_page="176" end_page="548" type="evalu"> <SectionTitle> 3 Results and Analysis </SectionTitle> <Paragraph position="0"> The results of the different stages of segmentation are shown in Table 1 and Table 2.</Paragraph> <Paragraph position="1"> In both test corpora, the primary dictionary-based segmentation alone achieves a significant score (over 95% in recall and over 90% in precision). This demonstrates that our rich vocabulary is a useful resource for language engineering and provides a solid platform for further enhancement.</Paragraph> <Paragraph position="2"> Post-processing with supplementary features from the dictionary shows consistent incremental improvement in segmentation. The F-measure scores due to FMM and BMM with heuristic rules show a relatively substantial gap at the very beginning, largely because of the heuristic rules developed and accumulated through precise and systematic processing. The performance of unknown word detection can be seen from the leap in ROOV after the operation: it increases remarkably from 0.663 to 0.715, offsetting the fall in RIV, which drops by 0.001, possibly because some monosyllabic morphemes that should be independent words are concatenated.</Paragraph> <Paragraph position="3"> The results of the comparison between FMM and BMM are summarized in Table 3 and Table 4. A noticeable drawback of such a comparison is that some phrases will be mis-segmented in either direction. For example, the phrase & - will be segmented backwardly into &/ /- but forwardly into &/ -/. The correct segmentation, &/ /-, cannot be attained in either case. 
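The forward and backward maximal match procedures compared here can be sketched as follows. This is a minimal illustration of the general FMM/BMM technique, not the authors' implementation; the toy dictionary and Latin-letter strings are stand-ins, since the paper's Chinese examples did not survive text extraction.

```python
def fmm(text, dictionary, max_len=5):
    """Forward maximal match: greedily take the longest dictionary word from the left."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

def bmm(text, dictionary, max_len=5):
    """Backward maximal match: greedily take the longest dictionary word from the right."""
    words, i = [], len(text)
    while i > 0:
        # Smallest j gives the longest candidate ending at position i.
        for j in range(max(0, i - max_len), i):
            if text[j:i] in dictionary or j == i - 1:
                words.insert(0, text[j:i])
                i = j
                break
    return words

d = {"ab", "bc", "c", "a"}
print(fmm("abc", d))  # ['ab', 'c']
print(bmm("abc", d))  # ['a', 'bc']
```

The example shows the ambiguity the paper exploits: the same string segments differently in the two directions, so the disagreements mark exactly the spans worth re-examining.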
Hence as an experiment, for any combination of five characters which is segmented into a 1/2/2 pattern by BMM and a 2/2/1 pattern by FMM, the 2/1/2 pattern will also be tested against the overall probabilities. For the former example, &/ /- will override the other two.</Paragraph> <Paragraph position="4"> Table 3 shows that the number of correct replacements from FMM is 399 in the AS test corpus; combining the gain from the reshuffling of 5-character strings, the total is 408.* (* The reported figures differ from those computed on our platform, probably due to system differences. The official scorer program is publicly available and described in (Sproat and Emerson, 2003).)</Paragraph> <Paragraph position="5"> Since the default choice is the BMM-segmented text, the sum 408 is the total gain from this BMM/FMM comparison, while 77 correctly segmented strings have been mis-replaced, giving a gain/loss ratio of 5.30. This means that our system loses only 1 correct segmentation in exchange for gaining 5.3 correct ones.</Paragraph> <Paragraph position="6"> Likewise in the case of the PK test corpus in Table 4, the gain/loss ratio is 4.67. The ratio is smaller than that for the AS test corpus. It is thus evident that the comparison and replacement by means of BMM and FMM offers a substantial improvement in the accuracy of the segmentation process.</Paragraph> <Section position="1" start_page="548" end_page="548" type="sub_section"> <SectionTitle> for PKo </SectionTitle> <Paragraph position="0"> We are aware that the performance of replacement may be improved by using probabilities of n-grams, conditional probabilities involving the boundary words, and perhaps by considering all possible segmentations of the same string of text, as in some other segmentation systems. On the semantic level, the overall message of a paragraph can be examined as well by gathering statistics of collocating words. 
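The reshuffling experiment described above (testing a 2/1/2 split whenever BMM yields 1/2/2 and FMM yields 2/2/1) can be sketched by scoring each candidate split under a simple unigram word-probability model. The probability table and smoothing floor below are invented placeholders; the paper does not specify how its "overall probabilities" are computed. The final line sanity-checks the reported AS gain/loss ratio of 408 gains against 77 losses.

```python
import math

def score(words, prob):
    """Log-probability of a segmentation under a unigram word model (assumed)."""
    floor = 1e-8  # smoothing for unseen words (an assumption)
    return sum(math.log(prob.get(w, floor)) for w in words)

def best_split(chars, prob):
    """Score the three candidate patterns for a 5-character string and keep the best."""
    patterns = [(1, 2, 2), (2, 2, 1), (2, 1, 2)]
    candidates = []
    for pat in patterns:
        words, i = [], 0
        for n in pat:
            words.append(chars[i:i + n])
            i += n
        candidates.append(words)
    return max(candidates, key=lambda ws: score(ws, prob))

# Toy probabilities for Latin-letter stand-in "words".
prob = {"ab": 0.01, "cd": 0.02, "e": 0.05, "a": 0.03, "bc": 0.04, "de": 0.01}
print(best_split("abcde", prob))  # ['a', 'bc', 'de']

# Sanity check of the reported AS gain/loss ratio: 408 gains vs 77 losses.
print(round(408 / 77, 2))  # 5.3
```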
The order in which these algorithms are applied, however, is important, and how they interplay with one another remains an area to explore.</Paragraph> <Paragraph position="1"> Although we have not incorporated such enhancement measures into our system in this exercise, the dictionary can nevertheless support such extensions with the necessary statistical data. All previous results are based on the first stage of segmentation with a large dictionary. Since we have processed texts from different Chinese speech communities, including Beijing, Hong Kong, Taipei, and others, the dictionary used for segmentation consists of all words appearing in any of these communities. To investigate the effect of locality on the dictionary used in segmentation, two independent dictionaries were generated from the Beijing portion and the Taipei portion, and all the above stages were repeated for the two test corpora, with results shown in Table 5 and Table 6.</Paragraph> <Paragraph position="2"> The results show that dictionaries derived from specific communities alone yield slightly smaller F-measures than the dictionary derived from all communities together. The largest difference lies in ROOV, which is 0.605 versus 0.663 for ASo and 0.815 versus 0.851 for PKo, confirming the importance of adopting a large, comprehensive dictionary for word segmentation.</Paragraph> </Section> </Section> <Section position="6" start_page="548" end_page="548" type="evalu"> <SectionTitle> 4 Error Analysis </SectionTitle> <Paragraph position="0"> We have examined the discrepancies between the gold standard files and our segmented output, and found that the segmentation errors fall into several categories.</Paragraph> <Paragraph position="1"> The errors due to standard divergence have the most impact. 
For example, H/0 is considered the correct segmentation in the AS test corpus, while H0 is a single word in our large dictionary.</Paragraph> <Paragraph position="2"> Inconsistencies within the same corpus (in both the training and test corpora) also give rise to performance fluctuations: there are cases where the same phrase is segmented differently. For example, in the AS training corpus, both UW//O and UW/-O are found. Similar cases are also found in the test corpus, e.g. /0-0 O vs. /0-O .</Paragraph> <Paragraph position="3"> Another factor affecting segmentation performance on the PK corpus is encoding conversion. Our production system is based primarily on materials in BIG5 encoding, specifically traditional Chinese characters in the BIG5 code space. Since the given test data are in simplified Chinese characters, they must first be converted to BIG5. Such a conversion is a one-to-many mapping, so some original words are distorted, reducing segmentation accuracy.</Paragraph> </Section> </Paper>
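The one-to-many nature of simplified-to-traditional conversion can be illustrated with a minimal sketch. The single mapping entry below (simplified 发 corresponding to both 發 and 髮) is a well-known case standing in for a full conversion table; a real converter such as the one in the authors' pipeline would need context to pick the right candidate.

```python
# One simplified character can correspond to several traditional characters,
# so naive per-character GB-to-BIG5 mapping can distort words.
ONE_TO_MANY = {
    "发": ["發", "髮"],  # "emit/develop" vs "hair" share one simplified form
}

def candidate_conversions(simplified):
    """Enumerate all traditional-character renderings of a simplified string."""
    results = [""]
    for ch in simplified:
        options = ONE_TO_MANY.get(ch, [ch])
        results = [prefix + o for prefix in results for o in options]
    return results

print(candidate_conversions("发"))  # ['發', '髮']
```

Because the converter must commit to one candidate per character, a wrong choice yields a character string that no longer matches any dictionary entry, which is the distortion the paragraph above describes.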