File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/98/p98-1068_concl.xml

Size: 5,891 bytes

Last Modified: 2025-10-06 13:58:02

<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1068">
  <Title>Japanese Morphological Analyzer using Word Co-occurrence -- JTAG- Takeshi FUCHI NTT Information and Communication Systems Laboratories</Title>
  <Section position="5" start_page="410" end_page="412" type="concl">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"> In this section, Japanese morphological analyzers are evaluated on three criteria: segmentation, part-of-speech tagging, and phonological representation. JTAG is compared with JUMAN 4 and CHASEN 5. A single &quot;correct analysis&quot; is meaningless because these taggers use different parts-of-speech, grammars, and segmentation policies. We therefore checked the outputs of each system and counted as errors only those analyses that the grammar maker of that system would not have intended.</Paragraph>
    <Section position="1" start_page="410" end_page="411" type="sub_section">
      <SectionTitle>
3.1 Comparison
</SectionTitle>
      <Paragraph position="0"> To make the outputs of the systems comparable, we reduce their tags to 21 parts-of-speech and 14 verb-inflection types. In addition, we assume that the part-of-speech of unrecognized words is Noun.</Paragraph>
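The reduction described above can be sketched as follows. This is a minimal illustration and not the authors' procedure: the paper does not list its 21 shared parts-of-speech, so the entries in `REDUCTION`, the `UNRECOGNIZED` marker, and the function names are all hypothetical. Only the fallback of unrecognized words to Noun comes from the text.

```python
# Hypothetical mapping from system-specific tags to a shared tagset.
# The paper does not list its 21 shared tags; these entries are placeholders.
REDUCTION = {
    "common-noun": "Noun",
    "proper-noun": "Noun",
    "verb-godan": "Verb",
    "case-particle": "Particle",
}

# Assumed marker that a tagger emits for a word it does not recognize.
UNRECOGNIZED = "UNKNOWN"


def reduce_tag(system_tag):
    """Map one system-specific tag to the shared tagset."""
    if system_tag == UNRECOGNIZED:
        # From the text: unrecognized words are assumed to be nouns.
        return "Noun"
    return REDUCTION[system_tag]


def reduce_analysis(analysis):
    """Reduce a list of (surface, tag) pairs to the shared tagset."""
    return [(surface, reduce_tag(tag)) for surface, tag in analysis]
```

Keeping the mapping in a plain table makes it easy to audit which fine-grained distinctions are discarded before the systems are compared.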
      <Paragraph position="1"> The segmentation policies are not unified.</Paragraph>
      <Paragraph position="2"> Therefore, the number of words in a sentence differs from system to system.</Paragraph>
      <Paragraph position="3"> Table II shows the system accuracy. We used 500 sentences 6 (19,519 characters) in the EDR 7 corpus. For segmentation, the accuracy of JTAG is the same as that of JUMAN. [Displaced table note: per sentence. Average 38 characters in one sentence. Sun Ultra-1 170 MHz.]</Paragraph>
      <Paragraph position="4"> Table II shows that JTAG assigns the correct phonological representations to unsegmented Japanese sentences more precisely than the other systems do.</Paragraph>
      <Paragraph position="5"> Table III shows the ratio of sentences that are converted to the correct phonological representation when segmentation errors are ignored. We used 80,000 sentences 8 (3,038,713 characters, no Arabic numerals) from the EDR corpus. The average number of characters in one sentence is 38. JTAG converts 88.5% of sentences correctly. The ratio is much higher than that of the other systems.</Paragraph>
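One way to score sentences while ignoring segmentation errors is to concatenate the readings of all output words and compare the result with the reference reading of the whole sentence. The sketch below assumes that each system output is a list of (surface, reading) pairs; the function names and the sample data are illustrative and do not come from the paper.

```python
def reading_of(words):
    """Concatenate per-word readings so that word boundaries no longer matter."""
    return "".join(reading for _surface, reading in words)


def sentence_accuracy(system_outputs, reference_readings):
    """Fraction of sentences whose full reading matches the reference."""
    if len(system_outputs) != len(reference_readings):
        raise ValueError("outputs and references must align one-to-one")
    if not system_outputs:
        raise ValueError("no sentences to score")
    correct = sum(
        1
        for words, reference in zip(system_outputs, reference_readings)
        if reading_of(words) == reference
    )
    return correct / len(system_outputs)


# Illustrative data: the first two outputs segment the sentence differently
# but yield the same reading; the third output has a wrong reading.
outputs = [
    [("東京", "とうきょう"), ("都", "と")],
    [("東京都", "とうきょうと")],
    [("東", "ひがし"), ("京都", "きょうと")],
]
references = ["とうきょうと", "とうきょうと", "とうきょうと"]
print(sentence_accuracy(outputs, references))
```

Because only the concatenated reading is compared, two systems with different segmentation policies can both be counted as correct on the same sentence.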
      <Paragraph position="6"> Table III also shows the processing time of each system. JTAG analyzes Japanese text more than four times faster than the other taggers. The simplicity of the JTAG selection algorithm contributes to the fast processing speed.</Paragraph>
    </Section>
    <Section position="2" start_page="411" end_page="412" type="sub_section">
      <SectionTitle>
3.2 Adjustment Process
</SectionTitle>
      <Paragraph position="0"> To show the adjustability of JTAG, we tuned it for a specific set of 10,000 sentences 9. The average number of words in a sentence is 21.</Paragraph>
      <Paragraph position="1"> Graph 1 shows the transition of the number of sentences converted correctly to their phonological representation. We finished the adjustment when the system could no longer be tuned in the framework of JTAG. The last accuracy rating (99.8% per sentence) shows the maximum ability of JTAG.</Paragraph>
      <Paragraph position="2"> The features of each adjustment phase are described below.</Paragraph>
      <Paragraph position="3"> Phase I. In this phase, the grammar of JTAG was changed. New attribute values were introduced and the costs of connection rules were changed. [Footnote 8: In the EDR corpus, 2.3% of sentences have errors and 1.5% of sentences have phonological representation inconsistencies. In this case, the sentences are not revised.]</Paragraph>
      <Paragraph position="4"> [Footnote 9: 311,330 characters without Arabic numerals.</Paragraph>
      <Paragraph position="5"> Average 31 characters per sentence. In this case, we fixed all errors in the sentences and the inconsistencies in their phonological representations.]</Paragraph>
      <Paragraph position="6"> [Graph 1: number of sentences correctly converted to phonological representation.]</Paragraph>
      <Paragraph position="7"> The Phase I adjustments caused considerable degradation in our tagger.</Paragraph>
      <Paragraph position="8"> Phase II. The grammar was almost fixed. One of the authors added unregistered words to the dictionaries, changed the costs of registered words, and supplied word co-occurrence information. The changes in the costs of words caused a small degree of degradation.</Paragraph>
      <Paragraph position="9"> Phase III. In this phase, all unrecognized words were registered together. The unrecognized words were extracted automatically and checked manually. The time taken for this phase is the time spent on the manual checking.</Paragraph>
      <Paragraph position="10"> Phase IV. Mainly, co-occurrence information was supplied. This phase caused some degradation, but the instances were very few.</Paragraph>
      <Paragraph position="11"> Graph 1 shows that JTAG converts 91.9% of open sentences to the correct phonological representation, and 99.8% of closed sentences. Without the co-occurrence information, the closed-sentence ratio is 97.5%. Therefore, the co-occurrence information corrects 2.3% of the sentences. Without the newly registered words, the ratio is 95.6%, so unrecognized words caused an error in 4.2% of the sentences. Table IV shows the percentage breakdown of the causes.</Paragraph>
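The two contributions quoted above are differences from the 99.8% closed-sentence ratio. The short check below reproduces that arithmetic from the reported figures; the variable names are illustrative, and `Decimal` is used only to avoid binary floating-point rounding in the subtraction.

```python
from decimal import Decimal

closed_full = Decimal("99.8")        # closed sentences, fully tuned system
without_cooc = Decimal("97.5")       # co-occurrence information removed
without_new_words = Decimal("95.6")  # newly registered words removed

# Share of sentences corrected by each resource, in percentage points.
cooc_gain = closed_full - without_cooc
new_word_gain = closed_full - without_new_words

print(cooc_gain)      # 2.3
print(new_word_gain)  # 4.2
```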
      <Paragraph position="12"> Conclusion. We developed a Japanese morphological analyzer that analyzes unsegmented Japanese sentences more precisely than other popular analyzers. Our system uses the co-occurrence of words to select the correct sequence of words. The effectiveness of the co-occurrence information was shown through experimental results. The precision of our current tagger is 98.7% and the recall is 99.1%. The accuracy of the tagger can be expected to increase because the risk of degradation is small when using the co-occurrence information.</Paragraph>
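The paper does not state how its precision and recall figures were computed. One common convention for segmenters and taggers, shown here purely as an assumption, treats each word as a (start, end, tag) span over the character sequence and counts a system word as correct only when the same span appears in the reference. The function names are hypothetical.

```python
def to_spans(words):
    """Convert (surface, tag) pairs into a set of (start, end, tag) spans."""
    spans = set()
    start = 0
    for surface, tag in words:
        end = start + len(surface)
        spans.add((start, end, tag))
        start = end
    return spans


def precision_recall(system_words, reference_words):
    """Return (precision, recall) of system spans against reference spans."""
    system = to_spans(system_words)
    reference = to_spans(reference_words)
    if not system or not reference:
        raise ValueError("both analyses must contain at least one word")
    matched = len(system.intersection(reference))
    return matched / len(system), matched / len(reference)
```

Under this convention a segmentation error lowers both scores, because a merged or split word matches no reference span.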
    </Section>
  </Section>
</Paper>