<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1054">
  <Title>JAPANESE WORD SEGMENTATION BY HIDDEN MARKOV MODEL</Title>
  <Section position="4" start_page="0" end_page="283" type="metho">
    <SectionTitle>
2. JAPANESE WORD SEGMENTATION
TECHNIQUES
</SectionTitle>
    <Paragraph position="0"> Current Japanese word segmentation techniques consistently rely on large lexicons of words in their decision making procedure. A typical word processor will have both a knowledge base which encodes rules of Japanese grammar, as well as a lexicon of over 50,000 words \[Mori, et. al. 1990\].</Paragraph>
    <Paragraph position="1"> Hypothesized segments of incoming text are analyzed to determine if they have any semantics and therefore are more likely to be correct segments; the grammar rules are then invoked to make final segmentation decisions. This technology achieves 95 percent accuracy in segmentation.</Paragraph>
    <Paragraph position="2"> An alternative approach to Japanese word processing technology is the development of an architecture for Japanese segmentation and part of speech labeling shown in Figure 1 \[Matsukawa, et. al. 1993\].</Paragraph>
    <Paragraph position="3">  and part of speech labeling architecture.</Paragraph>
    <Paragraph position="4"> The architecture of the system is as follows: JUMAN, a rule-based morphological processor which uses a 40,000 word lexicon and a connectivity matrix to determine word segmentation and part of speech labeling, 2. AMED, a rule-based segmentation and part of speech correction system trained on parallel hypothesized and correct annotations of identical text, 3. POST, a hidden Markov model which disambiguates segmentation and part of speech decisions.</Paragraph>
    <Paragraph position="5">  This unified architecture achieved an error rate of 8.3% in word segmentation; this level of error can be attributed to JUMAN's relatively small lexicon and lack of sufficient training data for the AMED and POST modules.</Paragraph>
    <Paragraph position="6"> Dragon Systems' LINGSTAT machine translation system \[Yamron, et. al., 1993\] uses a maximum likelihood segmentation algorithm which, in essence, calculates all possible segmentations of a sentence using a large lexicon and chooses the one with the best score, or likelihood. The implementation uses a dynamic programming algorithm to make this search efficient.</Paragraph>
    <Paragraph position="7"> MAJESTY is a recently developed morphological preprocessor for Japanese text \[Kitani, et. al., 1993\]. On a test corpus, it achieved better than 98% accuracy in word segmentation and part of speech determination; this represents the state-of-the-art in such technology.</Paragraph>
    <Paragraph position="8"> Teller, et. al. \[1994\] present a probabilistic algorithm which uses character type information and bi-gram frequencies on characters in conjunction with a small knowledge base to segment non-kanji stnngs. While it is related to our hidden Markov model approach in that it is character-based and does not rely on any lexicons, it differs in that it reties on a certain amount of a priori knowledge about the morphology of the Japanese language. This algorithm achieved 94.4% accuracy in segmenting words in a test corpus.</Paragraph>
  </Section>
  <Section position="5" start_page="283" end_page="285" type="metho">
    <SectionTitle>
3. HIDDEN MARKOV MODEL
</SectionTitle>
    <Paragraph position="0"> Hidden Markov models are widely used stochastic processes which have two components. The first is an observable stochastic process that produces sequences of symbols from a given alphabet. This process depends on a separate hidden stochastic process that yields a sequence of states. Hidden Markov models can be viewed as finite state machines which generate sequences of symbols by jumping from one state to another and &amp;quot;emitting&amp;quot; an observation at each state.</Paragraph>
    <Paragraph position="1"> The general recognition problem, as stated in the literature, is: given a hidden Markov model, M, with n symbols and m states, and a sequence of observations, O = OlO2...o t, determine the most likely sequence of states, S = SlS2...s t which could yield the observed sequence of symbols.</Paragraph>
    <Section position="1" start_page="283" end_page="284" type="sub_section">
      <SectionTitle>
3.1. Model Development
</SectionTitle>
      <Paragraph position="0"> The hidden Markov model for Japanese word segmentation was designed with several goals in mind: * Avoiding an approach which relies on having a large lexicon of Japanese words.</Paragraph>
      <Paragraph position="1"> * Allow the model to be easily extensible (with new training, of course) to accommodate more data or a different language.</Paragraph>
      <Paragraph position="2"> While not of paramount importance, an algorithm which segments rapidly would be preferred; word segmentation is a pre-processing step in most Japanese text systems and should be as unobtrusive and transparent as possible.</Paragraph>
      <Paragraph position="3"> One possible algorithm for segmentation is to use a hidden Markov model to find the most likely sequence of words based on a brute force computation of every possible sequence of words; the POST component of the word segmentation architecture described in Section 2 uses a similar model. This, though, violates the above constraint of no reliance on a Japanese word lexicon. Given that we would like to avoid the overhead associated with constructing and using a word-based lexicon, we are therefore forced to approach the problem in a manner which focuses on discrete characters and their interrelationships.</Paragraph>
      <Paragraph position="4"> The segmentation model we developed avoids the need for both a lexicon of Japanese words and explicit rules. It takes advantage of the effectiveness of subsequences of two text characters in determining the presence or absence of a word boundary. In essence, we will show that the morphology of the Japanese language is such that 2-character sequences have some underlying meaning or significance with respect to word boundaries.</Paragraph>
      <Paragraph position="5"> To solidify this idea, let us focus on two unspecified text characters, k 1 and k 2. Suppose that out of 100 places in the training data where k 1 is followed by k 2, the vast majority of these occur at word boundaries. From a probabilistic viewpoint, we are justified in coming to the conclusion that &amp;quot;klk 2 denotes a word boundary&amp;quot;. To complicate things, assume that out of 100 places where k 1 is followed by k 2, 50 of these are at word boundaries and 50 are inside words. It would seem that no conclusions could be drawn from this situation. On the other hand, if we notice that the 50 instances of klk 2 at word boundaries all had word boundaries between k 1 and the character preceding k 1, but none of the instances of klk 2 within a word had word boundaries before the kl, then we can hypothesize the following relationship, where 'T' denotes a word boundary and k x is the character preceding kl: ifk x I k 1, then k 1 I k 2 otherwise klk 2 This is exactly the sort of hidden structure that HMMs are geared towards uncovering.</Paragraph>
      <Paragraph position="6"> Proceeding in this manner, a model for Japanese word segmentation was developed which capitalizes on this idea of the significance of 2-character sequences in word boundary determination; the state transition diagram is shown in Figure 2. In the model there are just two possible states, either a word boundary (B) or a word continuation (C). The observation symbols are all possible 2-character sequences The kanji alphabet consists of approximately 50,000 characters; of these, 6,877 form a standard character set which suits most text processing purposes \[Miyazawa, 1990; Mori, et. al.</Paragraph>
      <Paragraph position="7"> 1990\]. Factoring in the size of the hiragana and katakana alphabets, the number of possible 2-character sequences generated exclusively by this subset approaches 5&amp;quot;107, a clearly unmanageable amount of data. An implicit assumption  of our model is that there is a small subset of all possible 2-character sequences which in fact accounts for a large percentage of the 2-character sequences normally used in written text. It is such a subset which the model hopes to uncover and use in further classification.</Paragraph>
      <Paragraph position="9"> The algorithm proceeds by sliding a 2-character window over an input sentence and calculating how likely it is that each of these 2-character sequences is a word boundary or within a word, given the previous 2-character sequence's status as either a word boundary or continuation. In this manner, the model is a bi-gram model \[Meteer, et. al., 1991\] over 2-character sequences since it relies only on the previous state. It is important to note that consecutive 2-character windows overlap by one character. Figure 3 portrays the progression of the window across part of a line of text emitting the 2-character observation symbols.</Paragraph>
    </Section>
    <Section position="2" start_page="284" end_page="284" type="sub_section">
      <SectionTitle>
3.2. Training
</SectionTitle>
      <Paragraph position="0"> The model is trained using supervised training over a previously annotated corpus. Specifically, training is accomplished by taking the corpus of segmented text and simply counting the number of times each 2-character sequence  sequences that were absent from the training data) leads to observation probabilities of 0. To rectify this, upon coming across an unknown 2-character sequence in the test data, the algorithm assigns the observation an a priori probability by postulating that the sequence was actually seen once in the training data. This probability is a sufficiently low value, balancing the fact that the sequence was never seen when all possible symbols were being gathered, with the hypothesis that it might be a valid observation. This procedure is an admission that even extensive training might not attain complete coverage of the domain.</Paragraph>
    </Section>
    <Section position="3" start_page="284" end_page="285" type="sub_section">
      <SectionTitle>
3.4. Implementation Issues
</SectionTitle>
      <Paragraph position="0"> Algorithm -- There are generally two basic algorithms for hidden Markov model recognition: the forward-backward algorithm and the Viterbi algorithm \[Viterbi, 1967\]. The forward-backward algorithm (Baum-Welch) computes the likelihood that the sequence of observation symbols was produced by any possible state sequence. The Viterbi model, on the other hand, computes the likelihoods based on the best possible state sequence and is more efficient to compute and train. The word segmentation HMM implementation uses the Viterbi approach. This difference is transparent and matters only at the implementation level.</Paragraph>
      <Paragraph position="1"> Kanji and Kana -- Due to their vast numbers, two bytes are needed to represent Japanese text characters rather than the conventional one byte for English characters. The implementation can easily support either one or two byte characters with few modifications.</Paragraph>
      <Paragraph position="3"> Sentence by Sentence Input -- The only assumption on the input to the hidden Markov model is that the text be predivided into sentences. Periods were the sole indicators of sentence endings that were used. This assumption is made to provide for the incremental processing of a body of text.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>