<?xml version="1.0" standalone="yes"?>
<Paper uid="J05-4005">
  <Title>Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach</Title>
  <Section position="13" start_page="570" end_page="570" type="concl">
    <SectionTitle>
9. Conclusions
</SectionTitle>
    <Paragraph position="0"> This article presents a pragmatic approach to Chinese word segmentation. Our main contributions are threefold. First, we view Chinese words as segmentation units whose definition is pragmatic in nature and depends on how they are used and processed (differently) in realistic applications, while theoretical linguists define words using purely linguistic criteria. Second, we propose a pragmatic mathematical framework for Chinese word segmentation, where various problems of word segmentation (i.e., word breaking, morphological analysis, factoid detection, NER, and NWI) are solved in a unified approach. The approach is based on linear models where component models are inspired by source-channel models of Chinese sentence generation. Third, we describe in detail an adaptive Chinese word segmenter, MSRSeg. This pragmatic system consists of two components: (1) a generic segmenter that is based on the mathematical framework of word segmentation and unknown word detection, and that can adapt to different domain-specific vocabularies, and (2) a set of output adaptors for adapting the output of the former to different application-specific standards. Evaluation on five test sets with different standards shows that the adaptive system achieves state-of-the-art performance on all the test sets.</Paragraph>
    <Paragraph position="1"> One area of our future work is to apply MSRSeg in a wide range of practical applications. We believe that some application-specific features can also be integrated into the framework. For instance, in MT, it would be interesting to investigate how to jointly optimize the performance of both word segmentation and word alignment.</Paragraph>
    <Paragraph position="2"> As one of the reviewers pointed out, though the reliable high performance of MSRSeg is impressive, it is by far one of the most complex systems with access to the richest resources. Hence, another interesting area of our future work is to explore whether the performance should be attributed to a superior architecture or simply to the richer resources. We have developed a simplified version of MSRSeg, called S-MSRSeg. It does not use the morph-lexicon and is trained using one-fifth of the MSR training data in Table 4, which are only partially labeled (i.e., LWs are not annotated). Interestingly, S-MSRSeg achieves very similar (or slightly worse) performance on the five test sets in Table 4. This demonstrates again the potential of our pragmatic approach to Chinese word segmentation. The work reported in this article represents not an end but a beginning of yet another view of Chinese word segmentation. Toward this end, S-MSRSeg and its training and test data sets are publicly available (e.g., at http://research.microsoft.com/~jfgao) for the sake of encouraging others to improve upon the work we have carried out.</Paragraph>
  </Section>
</Paper>