<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1036">
  <Title>A Part of Speech Estimation Method for Japanese Unknown Words using a Statistical Model of Morphology and Context</Title>
  <Section position="7" start_page="280" end_page="282" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="280" end_page="281" type="sub_section">
      <SectionTitle>
4.1 Training and Test Data for the
Language Model
</SectionTitle>
      <Paragraph position="0"> We used the EDR Japanese Corpus Version 1.0 (EDR, 1991) to train the language model. It is a manually word segmented and tagged corpus of approximately 5.1 million words (208 thousand sentences). It contains a variety of Japanese sentences taken from newspapers, magazines, dictionaries, encyclopedias, textbooks, etc..</Paragraph>
      <Paragraph position="1"> In this experiment, we randomly selected two sets of 100 thousand sentences. The first 100 thousand sentences are used for training the language model.</Paragraph>
      <Paragraph position="2"> The second 100 thousand sentences are used for testing. The remaining 8 thousand sentences are used as a heldout set for smoothing the parameters.</Paragraph>
      <Paragraph position="3"> For the evaluation of the word segmentation accuracy, we randomly selected 5 thousand sentences from the test set of 100 thousand sentences. We call the first test set (100 thousand sentences) &amp;quot;test set-l&amp;quot; and the second test set (5 thousand sentences) &amp;quot;test set-T'. Table 4 shows the number of sentences, words, and characters of the training and test sets.</Paragraph>
      <Paragraph position="4"> There were 94,680 distinct words in the training test. We discarded the words whose frequency was one, and made a dictionary of 45,027 words. After replacing the words whose frequency was one with the corresponding unknown word tags, there were 474,155 distinct word bigrams. We discarded the bigrams with frequency one, and the remaining 175,527 bigrams were used in the word segmentation model.</Paragraph>
      <Paragraph position="5"> As for the unknown word model, word-based character bigrams are computed from the words with  frequency one (49,653 words). There were 3,120 distinct character unigrams and 55,486 distinct character bigrams. We discarded the bigram with frequency one and remaining 20,775 bigrams were used. There were 12,633 distinct character unigrams and 80,058 distinct character bigrams when we classified them for each word type and part of speech. We discarded the bigrams with frequency one and remaining 26,633 bigrams were used in the unknown word model.</Paragraph>
      <Paragraph position="6"> Average word lengths for each word type and part of speech were also computed from the words with frequency one in the training set.</Paragraph>
    </Section>
    <Section position="2" start_page="281" end_page="281" type="sub_section">
      <SectionTitle>
4.2 Cross Entropy and Perplexity
</SectionTitle>
      <Paragraph position="0"> Table 5 shows the cross entropy per word and character perplexity of three unknown word model. The first model is Equation (5), which is the combination of Poisson distribution and character zerogram (Poisson + zerogram). The second model is the combination of Poisson distribution (Equation (6)) and character bigram (Equation (7)) (Poisson + bigram). The third model is Equation (11), which is a set of word models trained for each word type (WT + Poisson + bigram). Cross entropy was computed over the words in test set-1 that were not found in the dictionary of the word segmentation model (56,121 words). Character perplexity is more intuitive than cross entropy because it shows the average number of equally probable characters out of 6,879 characters in JIS-X-0208.</Paragraph>
      <Paragraph position="1"> Table 5 shows that by changing the word spelling model from zerogram to big-ram, character perplexity is greatly reduced. It also shows that by making a separate model for each word type, character perplexity is reduced by an additional 45% (128 -~ 71). This shows that the word type information is useful for modeling the morphology of Japanese words.</Paragraph>
    </Section>
    <Section position="3" start_page="281" end_page="281" type="sub_section">
      <SectionTitle>
4.3 Part of Speech Prediction Accuracy
without Context
</SectionTitle>
      <Paragraph position="0"> Figure 3 shows the part of speech prediction accuracy of two unknown word model without context.</Paragraph>
      <Paragraph position="1"> It shows the accuracies up to the top 10 candidates.</Paragraph>
      <Paragraph position="2"> The first model is Equation (12), which is a set of word models trained for each part of speech (POS + Poisson + bigram). The second model is Equation (13), which is a set of word models trained for  each part of speech and word type (POS + WT + Poisson + bigram). The test words are the same 56,121 words used to compute the cross entropy.</Paragraph>
      <Paragraph position="3"> Since these unknown word models give the probability of spelling for each part of speech P(wlt), we used the empirical part of speech probability P(t) to compute the joint probability P(w, t). The part of speech t that gives the highest joint probability is selected.</Paragraph>
      <Paragraph position="5"> The part of speech prediction accuracy of the first and the second model was 67.5% and 74.4%, respectively. As Figure 3 shows, word type information improves the prediction accuracy significantly.</Paragraph>
    </Section>
    <Section position="4" start_page="281" end_page="282" type="sub_section">
      <SectionTitle>
4.4 Word Segmentation Accuracy
</SectionTitle>
      <Paragraph position="0"> Word segmentation accuracy is expressed in terms of recall and precision as is done in the previous research (Sproat et al., 1996). Let the number of words in the manually segmented corpus be Std, the number of words in the output of the word segmenter be Sys, and the number of matched words be M.</Paragraph>
      <Paragraph position="1"> Recall is defined as M/Std, and precision is defined as M/Sys. Since it is inconvenient to use both recall and precision all the time, we also use the F-measure to indicate the overall performance. It is calculated</Paragraph>
      <Paragraph position="3"> where P is precision, R is recall, and f~ is the relative importance given to recall over precision. We set  words 64.1%.</Paragraph>
      <Paragraph position="4"> Other than the usual recall/precision measures, we defined another precision (prec2 in Table 8), which roughly correspond to the tagging accuracy in English where word segmentation is trivial. Prec2 is defined as the percentage of correctly tagged unknown words to the correctly segmented unknown words. Table 8 shows that tagging precision is improved from 88.2% to 96.6%. The tagging accuracy in context (96.6%) is significantly higher than that without context (74.4%). This shows that the word bigrams using unknown word tags for each part of speech are useful to predict the part of speech.  put equal importance on recall and precision. Table 6 shows the word segmentation accuracy of four unknown word models over test set-2. Compared to the baseline model (Poisson + bigram), by using word type and part of speech information, the precision of the proposed model (POS + WT + Poisson + bigram) is improved by a modest 0.6%. The impact of the proposed model is small because the out-of-vocabulary rate of test set-2 is only 3.1%. To closely investigate the effect of the proposed unknown word model, we computed the word segmentation accuracy of unknown words. Table 7 shows the results. The accuracy of the proposed model (POS + WT + Poisson + bigram) is significantly higher than the baseline model (Poisson + bigram). Recall is improved from 31.8% to 42.0% and precision is improved from 65.0% to 66.4%.</Paragraph>
      <Paragraph position="5"> Here, recall is the percentage of correctly segmented unknown words in the system output to the all unknown words in the test sentences. Precision is the percentage of correctly segmented unknown words in the system's output to the all words that system identified as unknown words.</Paragraph>
      <Paragraph position="6"> Table 8 shows the tagging accuracy of unknown words. Notice that the baseline model (Poisson + bigram) cannot predict part of speech. To roughly estimate the amount of improvement brought by the proposed model, we applied a simple tagging strategy to the output of the baseline model. That is, words that include numbers are tagged as numbers, and others are tagged as nouns.</Paragraph>
      <Paragraph position="7"> Table 8 shows that by using word type and part of speech information, recall is improved from 28.1% to 40.6% and precision is improved from 57.3% to</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML