<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0126">
  <Title>A Statistical Approach to Thai Morphological Analyzer*</Title>
  <Section position="2" start_page="0" end_page="289" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> One of the major problems in many languages, such as Japanese, Chinese, Korean and Thai, is word boundary ambiguity, because these languages do not have any delimiters between words. The second problem is tagging ambiguity, which occurs when there is more than one tag for one word.</Paragraph>
    <Paragraph position="1"> Another problem is implicit spelling error, which occurs because some incorrect words can themselves be found in a dictionary. This problem is very hard to solve with a simple approach, such as a dictionary approach. Thai morphological analysis must face these three problems, which produce many possible alternative and/or erroneous chains of words. These problems generate a lot of unnecessary work for the parser. In order to simplify the parser and speed it up, three important points to bear in mind when considering morphological processing are neat segmentation of characters into words, part-of-speech tag selection, and implicit spelling error detection. This work attempts to provide a computational solution, called Word Filtering, to handle those three points prior to parsing.</Paragraph>
    <Paragraph position="2"> The proposed model of Thai morphological analysis consists of three steps: sentence segmenting, spell checking and word filtering. In the first step, word formation rules and a dictionary look-up algorithm produce all possible word groups with all possible tags. If there is any explicit error, the second step, spell checking, suggests a set of most likely words. However, implicit spelling errors may still exist and will affect the parser. That is, the parser must search a large set of tagged word combinations in order to choose the right one. Thus, the main goal of word filtering is to reduce the number of useless tagged-word combinations and to identify implicit spelling errors.</Paragraph>
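    <Paragraph> The first step can be illustrated with a minimal sketch, assuming a plain set of known words in place of the paper's dictionary and word formation rules, which are not specified in this section. The function name segmentations, the toy lexicon and the English example string are hypothetical stand-ins; tags are omitted for brevity. The sketch enumerates every way to split an undelimited string into dictionary words, which is the source of the word boundary ambiguity described above.

```python
from typing import List, Set


def segmentations(text: str, lexicon: Set[str]) -> List[List[str]]:
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


# Toy lexicon (an assumption for illustration, not the paper's dictionary).
lexicon = {"god", "is", "now", "here", "nowhere"}

for words in segmentations("godisnowhere", lexicon):
    print(" ".join(words))
```

    With this toy lexicon the string has two valid segmentations, so a later step must choose between them. Plain recursion is exponential in the worst case; a practical system would likely memoize or build a word lattice.</Paragraph>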
    <Paragraph position="3"> *A Statistical Approach to Thai Morphological Analyzer is a part of WPA (Writing Production Assistant) Project supported by the National Research Council of Thailand.</Paragraph>
    <Paragraph position="5"> The proposed Word Filtering method consists of two steps: a filtering process and a scanning process. The first process will try to filter out any incorrect word boundary and any unsuitable tag. The second process detects and corrects the implicit spelling error by generating the new words for the detected error.</Paragraph>
    <Paragraph position="6"> The basic idea of the filtering process is to calculate the probabilities of all possible chains of tagged words by using a trigram Markov model. The most likely sequences of tagged words are the ones that maximize the chain probability. Nevertheless, such a sequence may still be an erroneous chain that contains implicit spelling errors. Thus, Word Filtering also includes the scanning process to detect and correct these errors. At this step, a set of words is generated by a generating function and substituted for the detected word. The most likely sequences of correct words are the ones that maximize the chain probability. Both the filtering and the scanning processes use statistical information collected from a hand-tagged corpus. According to the results of experiments on a small corpus (about 10,000 sentences), the word filter can eliminate alternative word sequences and can correct implicit errors quite well.</Paragraph>
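    <Paragraph> The filtering idea can be sketched as follows. This section does not give the exact formula, so the sketch assumes a common trigram formulation in which a chain is scored by the product of tag trigram probabilities P(tag | two previous tags) and lexical probabilities P(word | tag). The names chain_log_prob and best_chain, the BOS padding tag, the floor value for unseen events, and all probabilities below are invented for illustration and are not taken from the paper's corpus.

```python
import math
from typing import Dict, List, Tuple

TaggedWord = Tuple[str, str]  # (word, tag)


def chain_log_prob(chain: List[TaggedWord],
                   trigram: Dict[Tuple[str, str, str], float],
                   lexical: Dict[TaggedWord, float],
                   floor: float = 1e-6) -> float:
    """Log-probability of a tagged chain under the assumed trigram model."""
    tags = ["BOS", "BOS"] + [tag for _, tag in chain]  # pad the left context
    total = 0.0
    for i, (word, tag) in enumerate(chain):
        context = (tags[i], tags[i + 1], tag)
        # Unseen events get a small floor probability (an assumed smoothing).
        total += math.log(trigram.get(context, floor))
        total += math.log(lexical.get((word, tag), floor))
    return total


def best_chain(chains, trigram, lexical):
    """Pick the candidate chain with the highest log-probability."""
    return max(chains, key=lambda c: chain_log_prob(c, trigram, lexical))


# Invented toy probabilities, for illustration only.
trigram = {("BOS", "BOS", "N"): 0.5, ("BOS", "N", "V"): 0.6,
           ("N", "V", "ADV"): 0.3, ("V", "ADV", "ADV"): 0.1}
lexical = {("god", "N"): 0.01, ("is", "V"): 0.2, ("nowhere", "ADV"): 0.02,
           ("now", "ADV"): 0.05, ("here", "ADV"): 0.05}

chain_a = [("god", "N"), ("is", "V"), ("nowhere", "ADV")]
chain_b = [("god", "N"), ("is", "V"), ("now", "ADV"), ("here", "ADV")]

print(best_chain([chain_a, chain_b], trigram, lexical))
```

    Working in log space avoids numerical underflow on long chains. In the scanning process, candidate replacement words could be scored in the same way, although the paper's generating function is not described in this section.</Paragraph>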
    <Paragraph position="7"> In the following section, key problems in Thai morphological analysis are described. Then, we present an overview of computational morphological processing in section 3. In section 4, we explain how statistical information is used to handle word boundary ambiguity, tagging ambiguity and implicit spelling error. Finally, we present the conclusion and the results of the experiment.</Paragraph>
  </Section>
</Paper>