<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1054">
  <Title>JAPANESE WORD SEGMENTATION BY HIDDEN MARKOV MODEL</Title>
  <Section position="7" start_page="286" end_page="287" type="concl">
    <SectionTitle>5. CONCLUSION</SectionTitle>
    <Paragraph position="0">We have described and implemented a hidden Markov model for Japanese word segmentation. The bi-gram model is characterized by an unconventional set of observation symbols, namely the set of two-character sequences. The model is also extremely simple in that it consists of only two states, which encode the presence or absence of a word boundary between any two characters. This probabilistic model was trained on a large corpus of annotated data and then tested on a separate data set to measure performance; it achieves a word segmentation accuracy of 91.15% and determines 96.48% of all word boundaries correctly. When contrasted with the state of the art, the HMM emerges as a worthy contender to related algorithms, based on several observations: 1. First and foremost, this HMM approach completely circumvents the need for the Japanese word lexicons on which other approaches rely heavily; the storage costs and the overhead of word look-up are thus avoided.</Paragraph>
    <Paragraph position="1">including initialization time.</Paragraph>
    <Paragraph position="2">4. The model is designed to be easily extensible with additional data or to a different language; no lexicons are needed, only a sufficiently large body of text on which the algorithm can be trained.</Paragraph>
    <Paragraph position="3">Most disappointing about the performance of the model is the large discrepancy between the word accuracy and the word boundary accuracy. This is surely a side effect of the bi-gram model topology: there is no way to relate the beginning and ending boundaries of a single word under this model unless the word begins and ends in consecutive states (that is, a one-character word).</Paragraph>
    <Paragraph position="4">Regardless, it is interesting and impressive that a two-state bi-gram model can capture Japanese word boundaries so effectively. With additional training data, we anticipate that the algorithm's performance will improve. The next generation of this model should incorporate the relationship between the boundaries of a single word, in an effort to raise the word segmentation accuracy closer to the level of the word boundary accuracy. Another modification that might improve performance is extending the model to a tri-gram model. The HMM could also be trained and tested on a different language, Chinese for instance, to see how well it carries over.</Paragraph>
    <Paragraph position="5">The results of this research are encouraging; the re-training and extensions noted above should be pursued, both to increase accuracy and to gauge how broadly this hidden Markov model applies to comparable domains.</Paragraph>
  </Section>
</Paper>
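The two-state bi-gram HMM summarized above is simple enough to sketch in full. The following Python is a minimal, hypothetical rendering of such a model, not the authors' implementation: the parameter names (trans, emit, start) and the floor probability for unseen bigrams are assumptions, and the parameters themselves would be estimated by counting over the annotated training corpus, as the paper describes. Each between-character gap is decoded as boundary or no-boundary from the overlapping two-character observation sequence.

```python
# Sketch of a two-state bi-gram HMM segmenter (assumed parameterization,
# not the authors' code). State 1 = word boundary in the current gap,
# state 0 = no boundary. Observations are overlapping 2-char strings.
from math import log

def viterbi_segment(text, trans, emit, start, unk=1e-8):
    """Return a boundary decision (0/1) for each gap between adjacent chars.

    trans[s][t] -- P(next state t | current state s)
    emit[s]     -- dict mapping a 2-char string to P(bigram | state s)
    start[s]    -- P(first state s)
    unk         -- floor probability for unseen bigrams (an assumption)
    """
    if len(text) < 2:
        return []  # no gaps to label
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    states = (0, 1)
    # Initialization with the first observed bigram.
    v = [{s: log(start[s]) + log(emit[s].get(bigrams[0], unk)) for s in states}]
    back = [{}]
    # Recursion over the remaining bigrams.
    for obs in bigrams[1:]:
        scores, ptr = {}, {}
        for t in states:
            best = max(states, key=lambda s: v[-1][s] + log(trans[s][t]))
            scores[t] = v[-1][best] + log(trans[best][t]) + log(emit[t].get(obs, unk))
            ptr[t] = best
        v.append(scores)
        back.append(ptr)
    # Backtrace the most likely state sequence.
    path = [max(states, key=lambda s: v[-1][s])]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    path.reverse()
    return path  # path[i] == 1 means a word boundary after text[i]

def segment(text, path):
    """Split text into words according to the decoded boundary decisions."""
    words, start_idx = [], 0
    for i, b in enumerate(path):
        if b == 1:
            words.append(text[start_idx:i + 1])
            start_idx = i + 1
    words.append(text[start_idx:])
    return words
```

A caller would do something like `segment(text, viterbi_segment(text, trans, emit, start))`. Note how the two-state topology is visible in the code: each gap decision depends only on its neighbor, which is exactly why the model cannot relate the two boundaries of a multi-character word.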
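The discrepancy the conclusion highlights between word accuracy and word boundary accuracy can also be made concrete. A hypothetical scoring sketch (function names assumed, not taken from the paper): boundary accuracy scores each gap independently, while a word counts as correct only if both of its boundaries are predicted and no spurious boundary falls inside it, so a single gap error can invalidate two words at once.

```python
def spans(bounds):
    """Convert gap labels (1 = boundary after char i) into word spans."""
    out, start = set(), 0
    for i, b in enumerate(bounds):
        if b == 1:
            out.add((start, i + 1))
            start = i + 1
    out.add((start, len(bounds) + 1))  # final word runs to the last character
    return out

def boundary_accuracy(pred, gold):
    """Fraction of between-character gaps labeled correctly."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def word_accuracy(pred, gold):
    """Fraction of gold words whose exact span is reproduced in pred."""
    gold_spans = spans(gold)
    return len(spans(pred) & gold_spans) / len(gold_spans)

# Illustration: over a 5-character string, pred = [1, 0, 1, 0] against
# gold = [1, 1, 1, 0] gets 3 of 4 gaps right (75% boundary accuracy)
# but reproduces only 2 of 4 gold words (50% word accuracy), since the
# one wrong gap breaks both words adjacent to it.
```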