<?xml version="1.0" standalone="yes"?>
<Paper uid="J00-3004">
  <Title>A Compression-based Algorithm for Chinese Word Segmentation</Title>
  <Section position="5" start_page="390" end_page="391" type="concl">
    <SectionTitle>
7. Conclusions
</SectionTitle>
    <Paragraph position="0"> The problem of word segmentation of Chinese text is important in a variety of contexts, particularly with the burgeoning interest in digital libraries and other systems that store and process text on a massive scale. Existing techniques either are linguistically based (using a dictionary of words), rely on hand-crafted segmentation rules, or use adaptive models created specifically for Chinese word segmentation. We have developed an alternative based on a general-purpose character-level model of text--the kind of model used in the very best text compression schemes. These models are formed adaptively from training text.</Paragraph>
    <Paragraph position="1"> The advantage of using character-level models is that they do not rely on a dictionary and therefore do not necessarily fail on unusual words. In effect, they can fall back on general properties of language statistics to process novel text. The advantage of basing models on a corpus of training text is that particular characteristics of the text are automatically taken into account in language statistics--as exemplified by the significant differences between the models formed for the PH and Rocling corpora.</Paragraph>
    <Paragraph position="2"> Encouraging results have been obtained using the new scheme. Our results compare very favorably with the results of Hockenmaier and Brew (1998) on the PH corpus; unfortunately no other researchers have published quantitative results on a standard corpus. Further work is needed to analyze the results on the Rocling corpus in more detail.</Paragraph>
    <!-- Running header: Computational Linguistics Volume 26, Number 3 -->
    <Paragraph position="3"> The next step is to use automatically segmented text to investigate the digital library applications we have described: information retrieval, text summarization, document clustering, and keyphrase extraction.</Paragraph>
  </Section>
</Paper>