<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2056">
  <Title>Unsupervised Segmentation of Chinese Text by Use of Branching Entropy</Title>
  <Section position="3" start_page="0" end_page="428" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The theme of this paper is the following assumption: null The uncertainty of tokens coming after a sequence helps determine whether a given position is at a boundary. (A) Intuitively, as illustrated in Figure 1, the variety of successive tokens at each character inside a word monotonically decreases according tothe osetlength, because thelonger the preceding character n-gram, the longer the preceding context and the more it restricts the appearance of possible next tokens. For example, it is easier to guess which character comes after \natura&amp;quot; than after \na&amp;quot;. On the other hand, the uncertainty at the position of a word border becomes greater, and the complexity increases, as the position is out of context. With the same example, it is dicult to guess which character comes after \natural &amp;quot;. This suggests that a word border can be detected by focusing on the dierentials of the uncertainty of branching.</Paragraph>
    <Paragraph position="1"> In this paper, we report our study on applying this assumption to Chinese word seg- null successive tokens and a word boundary mentation by formalizing the uncertainty of successive tokens via the branching entropy (which we mathematically dene in the next section). Our intention in this paper is above all to study the fundamental and scientic statistical property underlying language data, so that it can be applied to language engineering. The above assumption (A) dates back to the fundamental work done by Harris (Harris, 1955), where he says that when the number of dierent tokens coming after every prex of a word marks the maximum value, then the location corresponds to the morpheme boundary. Recently, with the increasing availability of corpora, this property underlying language has been tested through segmentation into words and morphemes. Kempe (Kempe, 1999) reports a preliminary experiment to detectwordborders in Germanand English texts by monitoring the entropy of successive characters for 4-grams. Also, the second author of this paper (Tanaka-Ishii, 2005) have shown how Japanese and Chinese can be segmented into words by formalizing the uncertainty with the branching entropy. Even though the test data was limited to a small amount in this work, the report suggested how assumption  (A) holds better when each of the sequence elements forms a semantic unit. This motivated our work to conduct a further, larger-scale test in the Chinese language, which is the only human language consisting entirely of ideograms (i.e., semantic units). In this sense, the choice of Chinese as the language in our work is essential. null If the assumption holds well, the most important and direct application is unsupervised text segmentation into words. Many works in unsupervised segmentation so far could be interpreted as formulating assumption (A) in a similar sense where branching stays low inside words but increases at a word or morpheme border. None of these works, however, is directly based on (A), and they introduce other factors within their overall methodologies. Some works are based on in-word branching frequencies formulated in an original evaluation function, as in (Ando and Lee, 2000) (boundary precision=84.5%,recall=78.0%, tested on 12500 Japanese ideogram words). Sun et al. (Sun et al., 1998) uses mutual information (boundary p=91.8%, no report for recall, 1588 Chinese characters), and Feng(Feng et al., 2004) incorporates branching counts in the evaluation function to be optimized for obtaining boundaries (wordprecision=76%, recall=78%, 2000sentences). Fromthe performance results listed here, we can see that unsupervised segmentation is more dicult, by far, than supervised segmentation; therefore, the algorithms are complex, and previous studies have tended to be limited in terms of both the test corpus size and the target.</Paragraph>
    <Paragraph position="2"> In contrast, as assumption (A) is simple, we keep this simplicity in our formalization and directly test the assumption on a large-scale test corpus consisting of 1001 KB manually segmented data with the training corpus consisting of 200 MB of Chinese text.</Paragraph>
    <Paragraph position="3"> Chinese is such an important language that supervised segmentation methods are already very mature. The current state-of-the-art segmentation software developed by (Low et al., 2005), which ranks as the best in the SIGHAN bakeo (Emerson, 2005), attains word precision and recall of 96.9% and 96.8%, respectively, on the PKU track. There is also free  n ) for Chinese characters when n is increased software such as (Zhang et al., 2003) whose performance is also high. Even then, as most supervised methods learn on manually segmented newspaper data, when the input text is not from newspapers, the performance can be insucient. Given that the construction of learning data is costly, we believe the performance can be raised by combining the supervised and unsupervised methods.</Paragraph>
    <Paragraph position="4"> Consequently, this paper veries assumption (A) in a fundamental manner for Chinese text and addresses the questions of why and to what extent (A) holds, when applying it to the Chinese word segmentation problem. We rst formalize assumption (A) in a general manner.</Paragraph>
  </Section>
class="xml-element"></Paper>