<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2032">
  <Title>Mostly-Unsupervised Statistical Segmentation of Japanese: Applications to Kanji</Title>
  <Section position="4" start_page="242" end_page="243" type="metho">
    <SectionTitle>
3 Experimental Framework
</SectionTitle>
    <Paragraph position="0"> Our experimental data was drawn from 150 megabytes of 1993 Nikkei newswire (see Figure 1). Five 500-sequence held-out subsets were obtained from this corpus, the rest of the data serving as the unsegmented corpus from which to derive character n-gram counts. Each held-out subset was hand-segmented and then split into a 50-sequence parameter-training set and a 450-sequence test set.</Paragraph>
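The character n-gram counts mentioned above can be gathered with a simple sliding-window pass over the unsegmented corpus. The sketch below is illustrative only; the function name and the choice of maximum n-gram order are our own assumptions, not taken from the paper:

```python
from collections import Counter

def char_ngram_counts(corpus: str, max_n: int = 3) -> Counter:
    """Count character n-grams of order 1..max_n in an unsegmented string.

    Illustrative sketch: the paper derives such counts from ~150 MB of
    unsegmented Nikkei newswire; here we simply slide a window of each
    order over a single string.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1
    return counts
```

In practice one would stream the corpus rather than hold it in memory, but the counting logic is the same.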
    <Paragraph position="1"> Finally, any sequences occurring in both a test set and its corresponding parameter-training set were discarded from the parameter-training set, so that these sets were disjoint. (Typically no more than five sequences were removed.)</Paragraph>
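The held-out preparation described above (five 500-sequence subsets, each split into a 50-sequence parameter-training set and a 450-sequence test set, with overlapping sequences dropped from the training side) can be sketched as follows. All names here are illustrative, not from the paper:

```python
import random

def make_heldout_splits(sequences, n_subsets=5, subset_size=500,
                        train_size=50, seed=0):
    """Sketch of the held-out preparation: sample 500-sequence subsets,
    split each into a 50-sequence parameter-training set and a
    450-sequence test set, then drop any training sequence that also
    occurs in the corresponding test set."""
    rng = random.Random(seed)
    pool = list(sequences)
    rng.shuffle(pool)
    splits = []
    for k in range(n_subsets):
        subset = pool[k * subset_size:(k + 1) * subset_size]
        train, test = subset[:train_size], subset[train_size:]
        test_set = set(test)
        # Enforce disjointness; the paper reports this typically
        # removed no more than five sequences per split.
        train = [s for s in train if s not in test_set]
        splits.append((train, test))
    return splits
```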
    <Section position="1" start_page="242" end_page="242" type="sub_section">
      <SectionTitle>
3.1 Held-out set annotation
</SectionTitle>
      <Paragraph position="0"> Each held-out set contained 500 randomly extracted kanji sequences at least ten characters long (about twelve on average), lengthy sequences being the most difficult to segment (Takeda and Fujisaki, 1987). To obtain the gold-standard annotations, we segmented the sequences by hand, using an observation of Takeda and Fujisaki (1987) that many kanji compound words consist of two-character stem words together with one-character prefixes and suffixes. Using this terminology, our two-level bracketing annotation may be summarized as follows. (A complete description of the annotation policy, including the treatment of numeric expressions, may be found in a technical report (Ando and Lee, 1999).) At</Paragraph>
      <Paragraph position="1"> the word level, a stem and its affixes are bracketed together as a single unit. At the morpheme level, stems are divided from their affixes. For example, although both naga-no (Nagano) and shi (city) can appear as individual words, naga-no-shi (Nagano city) is bracketed as [[naga-no][shi]], since here shi serves as a suffix. Loosely speaking, word-level bracketing demarcates discourse entities, whereas morpheme-level brackets enclose strings that cannot be further segmented without loss of meaning. For instance, if one segments naga-no in naga-no-shi into naga (long) and no (field), the intended meaning disappears. Three native Japanese speakers participated in the annotation: one segmented all the held-out data based on the above rules, and the other two reviewed 350 sequences in total. The percentage of agreement with the first annotator's bracketing was 98.42%: only 62 of 3927 locations were contested by a verifier. Interestingly, all disagreement was at the morpheme level.</Paragraph>
    </Section>
    <Section position="2" start_page="242" end_page="243" type="sub_section">
      <SectionTitle>
3.2 Baseline algorithms
</SectionTitle>
      <Paragraph position="0"> We evaluated our segmentation method by comparing its performance against Chasen 1.05 (Matsumoto et al., 1997) and Juman 3.61 (Kurohashi and Nagao, 1998), two state-of-the-art, publicly available, user-extensible morphological analyzers. In both cases, the grammars were used as distributed, without modification. The sizes of Chasen's and Juman's default lexicons are approximately 115,000 and 231,000 words, respectively.</Paragraph>
      <Paragraph position="1"> Comparison issues: An important question that arose in designing our experiments was how to enable the morphological analyzers to make use of the parameter-training data, since they do not have parameters to tune. The only significant way they can be updated is by changing their grammars or lexicons, which is quite tedious (for instance, we had to add part-of-speech information to new entries by hand). We took what we felt to be a reasonable, but not too time-consuming, course: creating new lexical entries for all the bracketed words in the parameter-training data. Evidence that this</Paragraph>
      <Paragraph position="2"> was appropriate comes from the fact that these additions never degraded test set performance, and indeed improved it by one percent in some cases (only small improvements are to be expected because the parameter-training sets were fairly small).</Paragraph>
      <Paragraph position="3"> It is important to note that in the end, we are comparing algorithms with access to different sources of knowledge. Juman and Chasen use lexicons and grammars developed by human experts. Our algorithm, not having access to such pre-compiled knowledge bases, must of necessity draw on other information sources (in this case, a very large unsegmented corpus and a few pre-segmented examples) to compensate for this lack. Since we are interested in whether using simple statistics can match the performance of labor-intensive methods, we do not view these information sources as conveying an unfair advantage, especially since the annotated training sets were small, available to the morphological analyzers, and disjoint from the test sets.</Paragraph>
    </Section>
  </Section>
</Paper>