<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0120"> <Title>A Self-Organizing Japanese Word Segmenter using Heuristic Word Identification and Re-estimation</Title> <Section position="7" start_page="3984" end_page="3984" type="evalu"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="3984" end_page="3984" type="sub_section"> <SectionTitle> 6.1 The Nature of the Word Unigram Model </SectionTitle> <Paragraph position="0"> First, we will clarify the nature of the word unigram model. Roughly speaking, word unigram based word segmenters maximize the product of the word frequencies under the fewest word principle, which subsumes the longest match principle.</Paragraph> <Paragraph position="1"> If two word segmentation hypotheses differ in the number of words, the one with fewer words is almost always selected. For example, suppose the input string is c1c2 and the dictionary includes the three words c1c2, c1, c2. To prefer segmentation hypothesis c1c2 over c1|c2, the following relation must hold. C(c1c2)/N > C(c1)/N * C(c2)/N (8) If two word segmentation hypotheses have the same number of words, the one with the larger product of word frequencies is selected. For example, suppose the input string is c1c2c3 and the dictionary includes the four words c1c2, c3, c1, c2c3. To prefer segmentation hypothesis c1c2|c3 over c1|c2c3, the following relation must hold. C(c1c2)/N * C(c3)/N > C(c1)/N * C(c2c3)/N (9) Since the denominator N is cancelled, it is obvious that the segmentation with the larger product of frequencies is preferred.</Paragraph> </Section> <Section position="2" start_page="3984" end_page="3984" type="sub_section"> <SectionTitle> 6.2 Classification of Segmentation Errors </SectionTitle> <Paragraph position="0"> There are three major types of segmentation errors. The first type is not an error but the ambiguity resulting from inconsistent manual segmentation, or the intrinsic indeterminacy of Japanese word segmentation. 
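As a concrete illustration of the unigram scoring in Section 6.1, the following sketch segments a string by dynamic programming, maximizing the product of relative word frequencies C(w)/N. All counts are hypothetical toy values; this is not the authors' implementation.

```python
# Toy word-unigram segmenter: pick the segmentation whose product of
# relative frequencies C(w)/N is largest (counts are assumed toy values).
counts = {"ab": 4, "a": 50, "b": 30, "c": 20, "bc": 10}  # C(w), assumed
N = sum(counts.values())                                 # corpus size N

def best_segmentation(s):
    # best[i] holds (score, words) for the best segmentation of s[i:]
    best = [None] * (len(s) + 1)
    best[len(s)] = (1.0, [])
    for i in range(len(s) - 1, -1, -1):
        cands = []
        for j in range(i + 1, len(s) + 1):
            w = s[i:j]
            if w in counts and best[j] is not None:
                tail_score, tail = best[j]
                cands.append((counts[w] / N * tail_score, [w] + tail))
        best[i] = max(cands) if cands else None
    return best[0]

score, words = best_segmentation("abc")
print(words)  # ['a', 'bc']: a two-word hypothesis beats the three-word a|b|c
```

Because every factor C(w)/N is at most 1, each additional word multiplies in another factor below 1, which is why the hypothesis with fewer words usually wins, as noted in 6.1.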
For example, in the manually segmented corpus, we found that the string ~-\[\].&~ (foreign laborer) is identified as one word in some places, while in others it is divided into two words ~l-m)~ (foreigner) and ~ (laborer). However, the word unigram based segmenter consistently identifies it as a single word. We assume 3-5% of the segmentation &quot;errors&quot; belong to this type.</Paragraph> <Paragraph position="1"> The second type is the breakdown of unknown words. For example, the word ~#~ (funny) is segmented into two word hypotheses ~ (rare) and ~ (strange). This is because ~'~ is included in the dictionary. When a substring of an unknown word coincides with another word in the dictionary, the unknown word is very likely to be broken down into the dictionary word and the remaining substring. This is a major flaw of our word model using character unigrams: it assigns too little probability to longer word hypotheses, especially those of more than three characters.</Paragraph> <Paragraph position="2"> The third type is erroneous longest match. This happens frequently in sequences of grammatical function words written in hiragana. For example, the phrase ~$1~ (gather) | C/ (INFL) | (and) | ~ (come) | ~= (past-AUXV), which means &quot;came and gathered&quot;, is segmented into ~ | -~'C (TOPIC) | -~1c (north), because the number of words is fewer. The larger the initial word list is, the more often a hiragana word happens to coincide with a sequence of other hiragana words, because the number of character types in hiragana is small (&lt; 100). 
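The penalty against long unknown words under a character-unigram word model can be made concrete with toy numbers (all constants here are assumed for illustration, not taken from the paper): a k-character hypothesis scores roughly p_char**k, so splitting off a coinciding dictionary word almost always scores higher.

```python
# Why a character-unigram word model tends to break up long unknown words
# (illustrative numbers only; not the paper's actual model constants).
p_char = 1 / 80.0   # assumed average per-character probability
N = 100_000         # assumed corpus size
C_dict = 50         # assumed frequency of the coinciding dictionary word

whole = p_char ** 4                 # 4-char string kept as one unknown word
split = (C_dict / N) * p_char ** 2  # 2-char dictionary word + 2-char unknown rest
print(whole < split)  # the split hypothesis scores higher, so the word breaks up
```

Each extra character divides the unknown-word score by roughly 1/p_char, which matches the observation above that hypotheses of more than three characters receive too little probability.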
This is the major reason why word segmentation accuracy levels off or decreases at a certain point as the size of the initial word list increases.</Paragraph> </Section> <Section position="3" start_page="3984" end_page="3984" type="sub_section"> <SectionTitle> 6.3 Classification of the Effects of Re-estimation </SectionTitle> <Paragraph position="0"> There are two major types of changes in segmentation brought about by re-estimation: word boundary adjustment and subdivision. The former moves a word boundary while keeping the number of words unchanged.</Paragraph> <Paragraph position="1"> The latter breaks down a word into two or more words.</Paragraph> <Paragraph position="2"> Re-estimation usually improves a sequence of grammatical function words written in hiragana at the sentence-final predicate phrase if the initial segmentation and the correct segmentation have the same number of words. For example, the incorrect initial segmentation ~ (take away) | (INFL + passive-AUXV) | ~=~ (ball) | ~t~ (not yet) is correctly adjusted to ~i~'l,~ (take away) | ~,h, (INFL + passive-AUXV) | ft. (past-AUXV) | ~ (still) | fr~ (COPULA), which means &quot;still be taken away&quot;.</Paragraph> <Paragraph position="3"> Re-estimation subdivides an erroneous longest match if the frequencies of the shorter words are sufficiently large. For example, the incorrect initial segmentation ~ (restrain) | fr.~ (sea bream) is correctly subdivided into ~ (restrain) | tr. (want-AUXV) | b~ (INFL), which means &quot;want to restrain&quot;.</Paragraph> <Paragraph position="4"> One of the most frequent undesirable effects of re-estimation is subdividing an infrequent word into highly frequent words, or into a frequent word and an unknown word. 
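The re-estimation whose effects are classified in this subsection can be sketched as a Viterbi-style loop: segment the raw corpus with the current counts, re-count the words that were actually used, and repeat. This is a schematic with hypothetical toy data, not the authors' full procedure; note that entries the segmenter never selects simply drop out of the dictionary.

```python
# Schematic Viterbi re-estimation loop (toy counts; not the paper's procedure).
from collections import Counter

def segment(s, counts, N):
    # Unigram DP: best segmentation of s under relative frequencies C(w)/N.
    best = [None] * (len(s) + 1)
    best[len(s)] = (1.0, [])
    for i in range(len(s) - 1, -1, -1):
        cands = []
        for j in range(i + 1, len(s) + 1):
            w = s[i:j]
            if w in counts and best[j] is not None:
                sc, tail = best[j]
                cands.append((counts[w] / N * sc, [w] + tail))
        best[i] = max(cands) if cands else None
    return best[0][1]

def reestimate(corpus, counts, iterations=3):
    # Re-segment the corpus and re-count the words until the counts settle.
    for _ in range(iterations):
        new_counts = Counter()
        N = sum(counts.values())
        for line in corpus:
            for w in segment(line, counts, N):
                new_counts[w] += 1
        counts = dict(new_counts)
    return counts

# "xy" was hypothesized by initial word identification but never wins:
initial = {"xy": 1, "x": 10, "y": 10, "z": 10}
final = reestimate(["xy", "xz"], initial)
print(final)  # {'x': 2, 'y': 1, 'z': 1}: the spurious entry 'xy' is gone
```

In the toy run the infrequent hypothesis "xy" loses to the higher-scoring split x|y, so after re-counting it disappears from the dictionary, mirroring both the subdivision effect and the hypothesis-removal effect discussed in this subsection.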
For example, the correct infrequent word ~ (ambassador) is subdivided into two frequent words, ~ (use-ROOT) and ~ (node).</Paragraph> <Paragraph position="5"> As we said before, one of the major virtues of re-estimation is its ability to remove inappropriate word hypotheses generated by the initial word identification procedure. For example, from the phrase Y~ (Soviet Union) | ~ (made-SUFFIX) | l~ (tank), which means &quot;Soviet Union-made tank&quot;, the initial word identifier extracts two word hypotheses Y and ~K, where the former is written in katakana and the latter in kanji. If ~ and ~ are in the dictionary, the two erroneous word hypotheses >' and ~I~iK are removed and the correct word t~ is added to the dictionary after re-estimation.</Paragraph> </Section> </Section> </Paper>