<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2202">
  <Title>Word Extraction from Corpora and Its Part-of-Speech Estimation Using Distributional Analysis</Title>
  <Section position="5" start_page="1119" end_page="1121" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> We conducted two experiments, in each using a range of different thresholds for word measure.</Paragraph>
    <Paragraph position="1"> One experiment used the El)\]{ corl)us a.s a raw corpus (ignoring the POS tags) in order to cal('ulate recall and precision. The other experiment  used articles fl'om one year of the J apanese versiorl of Scientific A meT~ican in order to test whether we could incre~Lse the accuracy of the morphological analyzer (tagger) by this method.</Paragraph>
    <Section position="1" start_page="1119" end_page="1119" type="sub_section">
      <SectionTitle>
4.1 Conditions of the Experiments
</SectionTitle>
      <Paragraph position="0"> For I)oth experiments, we considered tire five POSs to which almost all unknown words in  Japanese belong: I. verbal noun, e.g. N'~ (-~- 6) 'benkyou(.~,,'u)' &amp;quot;to study&amp;quot; 2. nora,s, e.g. '-&amp;quot;~ 'gakkou' &amp;quot;school&amp;quot; 3. re-type verb, e.g. :1~ -'e. ( 5 ) 'tal)e(ru)' &amp;quot;to eat&amp;quot; 4. i-type adjective, e.g. -~- (v,) 'samu(i)' &amp;quot;cold&amp;quot; 5. na-type adjective, e.g. ~'bv, (~av) 'kirei(na)'  &amp;quot;cleaI|&amp;quot; POS environments were defined as one POS-tagged string (assumed to be one morpheme), and were limited to strings made up only of h*ragana characters plus comma and period. The aim of this limitation was to reduce computational time during inatehing, and it was \['ell, that morl)hemes using kanji and katakana characters are too infrequent ~s contexts to exert much intluence on the results.</Paragraph>
      <Paragraph position="1"> Candidate for unknown words were limited to strings of two or more characters appearing in the corpus at least ten times and not containing any symbols such as parentheses. Since there are very few unknown words which consist of only one character, this limitation will not have much effect on the recall.</Paragraph>
    </Section>
    <Section position="2" start_page="1119" end_page="1121" type="sub_section">
      <SectionTitle>
4.2 Experiment 1: Word Extraction
</SectionTitle>
      <Paragraph position="0"> For evaluation purposes, we conducted a word extraction ext)eriment using the El)l{. corpus as a raw corpus, and calculated recall and precision \['or each threshold value (see Table 3). First, we calculated f'mi,~and p for all character n-grams,  of hiragana characters. Then, for each threshold level, our algorithm decided which of the candidate strings were words, and assigned a POS to each instance of the word-strings.</Paragraph>
      <Paragraph position="1"> Recall was computed as the percent, of all POS-tagged strings in the EDR corpus that were successfully identified by our algorithm as words and as belonging to the correct POS. In calculation of the recalls and the precisions, both POS and string is distinguished. Precision was calculated using the estimated frequency f((~,pos) = p(posl~ ) .f(tx) where f(,x)is the frequency of the string ~t in the corpus, and p(poslot) is the estimated probability that ct belongs to the pos.</Paragraph>
      <Paragraph position="2"> Judgement whether the string ~ belongs to pos or not was made by hand. The recalls are calculated for ones with the estimated probability more than or equal to 0.1. The reason for this is that the amount of the output is too enormous to check by hand. For the same reason we did not calculate the precisions for thresholds more than 0.25 in Table 3. This table tells us that the lower the threshold is, the higher the precision is. This result is consistent with the result derived from the hypothesis that we described in section 2.2. Besides, there is a tendency that in proportion ,as the frequency increases the precision rises.</Paragraph>
    </Section>
    <Section position="3" start_page="1121" end_page="1121" type="sub_section">
      <SectionTitle>
4.3 Experiment 2: Improvement of
Stochastic Tagging
</SectionTitle>
      <Paragraph position="0"> In order to test how much the accuracy of a tagger could be improved by adding extracted words to its dictionary, we developed a tagger based on a simple Markov model and analyzed one journal article 1. Using statistical parameters estimated from the EDR corpus, and an unknown word model based on character set heuristics (any kauji sequence is a noun, etc.), tagging accuracy was 95.9% (the percent of output morphemes which were correctly segmented and tagged).</Paragraph>
      <Paragraph position="1"> Next, we extracted words from the Japanese version of Scientific American (1990; 617,837 characters) using a threshold of 0.25. Unknown words were considered to he those which could not be divided into morphemes appearing in the learning corpus of the Markov model. Table 4 shows examples of extracted words, with unknown words a&amp;quot;Progress in Gallium Arsenide Semiconductors&amp;quot; (Scientific American; February, 1990) starred. Notice that some extracted words consist of more than one type of character, such as &amp;quot;3~ :/ -'~'~ (protein).&amp;quot; This is one of the advantages of our method over heuristics based on character type, which can never recognize mixed-character words. Another advantage is that our method is applicable to words belonging to more than one POS. For example, in Table 4 &amp;quot;Et;~ (nature)&amp;quot; is both a noun and the stem of a na-type adjective.</Paragraph>
      <Paragraph position="2"> We added the extracted unknown words to the dictionary of the stochastic tagger, where they are recorded with a frequency calculated by the following fo,'mula: (size~/size,)f(c~,pos), where size~ and size, are the size of the EDR corpus and the size of the Scientific A merican corpus respectively. Using this expanded dictionary, the tagger's accuracy improved to 98.2%. This result tells us that our method is useful as a preprocessor for a tagger.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>