<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2194"> <Title>A Gradual Refinement Model for A Robust Thai Morphological Analyzer</Title> <Section position="2" start_page="0" end_page="1087" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> One of the important requirements for developing a practical natural language processing system is a morphological analyzer that can automatically assign the correct part-of-speech (POS) tag to each word with time and space efficiency. For languages written without word separators, such as Japanese, Korean, Chinese and Thai, the morphological analyzer has an additional task, i.e., segmenting an input sentence into the right words (Nobesawa et al., 1994; Seung-Shik Kang et al., 1994). However, there is another problematic aspect, called implicit spelling error, that should be solved at the morphological processing level. Implicit spelling errors are spelling errors that produce other valid, meaningful words. This work attempts to provide a robust morphological analyzer by using a gradual refinement module to weed out the many possible alternatives and/or erroneous chains of words caused by three non-trivial problems: word boundary ambiguity, POS tagging ambiguity and implicit spelling error.</Paragraph> <Paragraph position="1"> Many researchers have used a corpus-based approach to POS tagging, such as the trigram model (Charniak, 1993) and the feature structure tagger (Kemp, 1994); to word segmentation, such as D-bigram (Nobesawa et al., 1994); to both POS tagging and word segmentation (Nagata, 1994); and to spelling error detection as well as correction (Araki et al., 1994; Kawtrakul et al., 1995(b)).
Even though a corpus-based approach exhibits seemingly high average accuracy, it requires a large amount of training data and validation data (Franz, 1995).</Paragraph> <Paragraph position="2"> Instead of using a corpus-based approach, a new simple hybrid technique which incorporates heuristic, syntactic and semantic knowledge is proposed for a Thai morphological analyzer. It consists of word-boundary preference, syntactic coarse rules and semantic strength measurement. To implement this technique, a three-stage approach is adopted for the gradual refinement module: preference-based pruning, syntactic-based pruning and semantic-based pruning. Each stage gradually weeds out word boundary ambiguities, tag ambiguities and implicit spelling errors.</Paragraph> <Paragraph position="3"> Our preliminary experiment shows that the proposed model is time-efficient and increases the accuracy of word boundary and tagging disambiguation as well as implicit spelling error correction.</Paragraph> <Paragraph position="4"> In the following sections, we begin by reviewing three non-trivial problems of Thai morphological analysis. An overview of the gradual refinement module is then given. We then show the algorithm, with examples, for pruning the erroneous word chains prior to parsing. Finally, the results of applying this algorithm are presented.</Paragraph> <Section position="1" start_page="0" end_page="1086" type="sub_section"> <SectionTitle> 2. Three Nontrivial Problems of Thai Morphological Processing. 2.1 Word Boundary Ambiguity </SectionTitle> <Paragraph position="0"> As in many other languages such as Japanese, Chinese and Korean, Thai sentences are formed from a sequence of words, mostly without explicit delimiters. In particular, for Thai and for Japanese written in Hiragana (Nobesawa, 1994), a word is a stream of characters. This causes the problem of word boundary ambiguity (see Fig.
1).</Paragraph> <Paragraph position="1"> [Figure 1 diagram: a stream of characters grouped into words W1 and W2]</Paragraph> <Paragraph position="3"> Figure 1. Two possible groupings of characters into words: longest possible segment or shortest possible segment. There are two possible ways of grouping characters into words, i.e., the shortest possible segment, as in (1), and the longest possible segment, as in (2) in Fig. 1.</Paragraph> <Paragraph position="4"> Each word given by either way of grouping has a meaning. In our corpus, more than 50% of sentences include word boundary ambiguity. This produces many alternative chains of words, some of which are meaningless.</Paragraph> </Section> <Section position="2" start_page="1086" end_page="1086" type="sub_section"> <SectionTitle> 2.2 Tagging Ambiguity </SectionTitle> <Paragraph position="0"> A Thai word can have more than one part of speech. In our corpus, only 2% of sentences are written using only one-tagged words. Accordingly, tag ambiguity in Thai causes a large set of tagged word combinations. We found that a sentence with 12 words can generate 3027 syntactic patterns of word chains. Word boundary ambiguity and tag ambiguity together also create complexity in syntactic analysis.</Paragraph> </Section> <Section position="3" start_page="1086" end_page="1087" type="sub_section"> <SectionTitle> 2.3 Implicit Spelling Error </SectionTitle> <Paragraph position="0"> Spelling errors in Thai are classified into two types (Kawtrakul, 1995 (b)): explicit spelling errors and implicit spelling errors. The former can be detected easily by using a dictionary-based approach.</Paragraph> <Paragraph position="1"> The latter cannot be detected by simply using a dictionary, since the error can lead to words that are unintended but spelled correctly. Table 1 shows three kinds of spelling errors caused by carelessness and lack of knowledge.</Paragraph> <Paragraph position="2"> In Thai, implicit spelling errors can occur more easily than in English because each key carries two distinct characters.
From the result of our experiment, 2,286 words can generate 6,609 implicit spelling error words, of which 75.68% have new syntactic categories. This causes erroneous patterns of word chains, which add a great deal of unnecessary work for the parser. Accordingly, Thai morphological analysis is expected not only to assign the right tag to the right word but also to correct implicit spelling errors prior to parsing.</Paragraph> <Paragraph position="3"> 3. An Overview of Thai Morphological Analyzer with a Gradual Refinement Module. Instead of using a corpus-based approach, which requires a large amount of training data and validation data, a new simple hybrid technique which incorporates heuristic, syntactic and semantic knowledge is proposed for a gradual refinement module, which gradually weeds out the alternative and/or erroneous chains of words caused by those three nontrivial problems. The technique is implemented by using word boundary preference, syntactic coarse rules and semantic dependency strength measurement. Fig. 2 shows an overview of the system.</Paragraph> <Paragraph position="4"> The system consists of four steps: Step 0: This step provides all possible word groupings with all possible tags by using word formation rules and Lexicon base (Kawtrakul et al., 1995 (a)). Any explicit spelling error is detected, and a correction is suggested. At this stage a temporary dictionary is created for the remaining steps.</Paragraph> <Paragraph position="5"> Steps 1-3: These steps are preference-based pruning, syntactic-based pruning and semantic-based pruning. Each step gradually weeds out word boundary ambiguities, tag ambiguities and implicit spelling errors.</Paragraph> </Section> </Section> </Paper>