
<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2032">
  <Title>Mostly-Unsupervised Statistical Segmentation of Japanese: Applications to Kanji</Title>
  <Section position="6" start_page="245" end_page="246" type="relat">
    <SectionTitle>
5 Related Work
</SectionTitle>
    <Paragraph position="0"> Japanese Many previously proposed segmentation methods for Japanese text make use of either a pre-existing lexicon (Yamron et al., 1993; Matsumoto and Nagao, 1994; Takeuchi and Matsumoto, 1995; Nagata, 1997; Fuchi and Takagi, 1998) or pre-segmented training data (Nagata, 1994; Papa: georgiou, 1994; Nagata, 1996a; Kashioka et al., 1998; Mori and Nagao, 1998). Other approaches bootstrap from an initial segmentation provided by a baseline algorithm such as Juman (Matsukawa et al., 1993; Yamamoto, 1996).</Paragraph>
    <Paragraph position="1"> Unsupervised, non-lexicon-based methods for Japanese segmentation do exist, but they often have limited applicability. Both Tomokiyo and Ries (1997) and Teller and Batchelder (1994) explicitly avoid working with kanji charactes. Takeda and Fujisaki (1987) propose the short unit model, a type of Hidden Markov Model with linguisticallydetermined topology, to segment kanji compound words. However, their method does not handle three-character stem words or single-character stem words with affixes, both of which often occur in proper nouns. In our five test datasets, we found that 13.56% of the kanji sequences contain words that cannot be handled by the short unit model.</Paragraph>
    <Paragraph position="2"> Nagao and Mori (1994) propose using the heuris- null tic that high-frequency character n-grams may represent (portions of) new collocations and terms, but the results are not experimentally evaluated, nor is a general segmentation algorithm proposed.</Paragraph>
    <Paragraph position="3"> The work of Ito and Kohda (1995) similarly relies on high-frequency character n-grams, but again, is more concerned with using these frequent n-grams as pseudo-lexicon entries; a standard segmentation algorithm is then used on the basis of the induced lexicon. Our algorithm, on the hand, is fundamentally different in that it incorporates no explicit notion of word, but only &amp;quot;sees&amp;quot; locations between characters.</Paragraph>
    <Paragraph position="4"> Chinese According to Sproat et al. (1996), most prior work in Chinese segmentation has exploited lexical knowledge bases; indeed, the authors assert that they were aware of only one previously published instance (the mutual-information method of Sproat and Shih (1990)) of a purely statistical approach. In a later paper, Palmer (1997) presents a transformation-based algorithm, which requires pre-segmented training data.</Paragraph>
    <Paragraph position="5"> To our knowledge, the Chinese segmenter most similar to ours is that of Sun et al. (1998). They also avoid using a lexicon, determining whether a given location constitutes a word boundary in part by deciding whether the two characters on either side tend to occur together; also, they use thresholds and several types of local minima and maxima to make segmentation decisions. However, the statistics they use (mutual information and t-score) are more complex than the simple n-gram counts that we employ.</Paragraph>
    <Paragraph position="6"> Our preliminary reimplementation of their method shows that it does not perform as well as the morphological analyzers on our datasets, although we do not want to draw definite conclusions because some aspects of Sun et al's method seem incomparable to ours. We do note, however, that their method incorporates numerical differences between statistics, whereas we only use indicator functions; for example, once we know that one trigram is more common than another, we do not take into account the difference between the two frequencies. We conjecture that using absolute differences may have an adverse effect on rare sequences.</Paragraph>
  </Section>
class="xml-element"></Paper>