<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1054">
  <Title>JAPANESE WORD SEGMENTATION BY HIDDEN MARKOV MODEL</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> The segmentation of Japanese words is one of the main challenges in the automatic processing of Japanese text.</Paragraph>
    <Paragraph position="1"> Unlike English text, in which spaces separate consecutive words, Japanese (kanji and kana) text contains no such word boundary indicators.</Paragraph>
    <Paragraph position="2"> The algorithms used to obtain robust segmentation of Japanese text generally rely on two techniques: lexicon-based and rule-based approaches. Large lexicons are inevitably used in conjunction with, or as part of, the text segmentation algorithms that have been developed. These lexicons are often time-consuming to build and are thus not an optimal solution.</Paragraph>
    <Paragraph position="3"> Knowledge-based approaches typically entail a significant amount of human effort to specify the rules that determine word segmentation, and they do not provide sufficient coverage of the language's grammar rules.</Paragraph>
    <Paragraph position="4"> This paper introduces a hidden Markov model (HMM) which has been developed for Japanese word segmentation. Hidden Markov models belong to the larger class of probabilistic algorithms. These approaches use large sets of data to learn the structure of the domain, representing it as probabilities. We will see that with sufficient data such a stochastic process can achieve 91% accuracy in word segmentation, which approaches the state of the art in segmentation techniques.</Paragraph>
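To make the HMM framing above concrete, the following is a minimal sketch of segmentation as character tagging with Viterbi decoding. It assumes a two-state model in which each character is tagged B (begins a word) or I (inside a word); this tagging scheme, the toy probabilities, and all function names are illustrative assumptions, not details from the paper, whose actual model and parameters would be estimated from a large segmented corpus.

```python
import math

STATES = ("B", "I")

# Toy log-probabilities; a real model learns these from segmented data.
start = {"B": 0.0, "I": float("-inf")}  # a word must begin at the sentence start
trans = {
    ("B", "B"): math.log(0.4), ("B", "I"): math.log(0.6),
    ("I", "B"): math.log(0.5), ("I", "I"): math.log(0.5),
}

def emit(state, ch):
    # Placeholder emission model (uniform). A trained model would score
    # each character, or character class, differently per state.
    return math.log(0.5)

def viterbi(chars):
    """Return the most likely B/I tag sequence for the input characters."""
    V = [{s: start[s] + emit(s, chars[0]) for s in STATES}]
    back = []
    for ch in chars[1:]:
        col, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: V[-1][p] + trans[(p, s)])
            col[s] = V[-1][best] + trans[(best, s)] + emit(s, ch)
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    # Trace the best path backwards from the highest-scoring final state.
    tags = [max(STATES, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        tags.append(ptr[tags[-1]])
    return list(reversed(tags))

def segment(text):
    # Cut the text before every character tagged B.
    tags = viterbi(list(text))
    words, cur = [], ""
    for ch, tag in zip(text, tags):
        if tag == "B" and cur:
            words.append(cur)
            cur = ""
        cur += ch
    words.append(cur)
    return words
```

With the uniform emission model the tag sequence is driven entirely by the transition probabilities; the point of the sketch is only the decoding machinery, in which training on a large corpus replaces the toy parameters.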
  </Section>
</Paper>