<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0126"> <Title>A Statistical Approach to Thai Morphological Analyzer*</Title>
<Section position="3" start_page="289" end_page="293" type="metho"> <SectionTitle> 2. Key Problems in Thai Morphological Analysis </SectionTitle>
<Paragraph position="0"> There are three nontrivial problems in Thai morphological processing: word boundary ambiguity, tagging ambiguity and implicit spelling errors.</Paragraph>
<Section position="1" start_page="289" end_page="289" type="sub_section"> <SectionTitle> 2.1 Word Boundary Ambiguity </SectionTitle>
<Paragraph position="0"> Thai sentences are similar to Japanese and Chinese ones in that no blank space marks the words within a sentence. Additionally, most Thai words are multisyllabic.</Paragraph>
<Paragraph position="1"> Some of them contain smaller words as parts of their components. This causes word boundary ambiguity.</Paragraph>
<Paragraph position="2"> Let C be a sequence of characters: $C = c_1 c_2 c_3 \ldots$</Paragraph>
<Paragraph position="3"> Let W be a sequence of words: $W = w_1 w_2 \ldots w_n$, where each $w_i$ is a substring $c_j \ldots c_k$ of C. Given a stream of characters, the possible word segmentations are as follows: [Figure: the character stream $c_1 c_2 c_3 c_4 c_5$ segmented either as the words $c_1 c_2$ | $c_3 c_4 c_5$ or as $c_1 c_2 c_3$ | $c_4 c_5$]</Paragraph>
<Paragraph position="5"> As shown above, a string matching the pattern "$c_1 c_2 c_3 c_4 c_5$" has two ambiguous segmentations. One is "$c_1 c_2$" and "$c_3 c_4 c_5$"; the other is "$c_1 c_2 c_3$" and "$c_4 c_5$". In our corpus, more than 50% of sentences include word boundary ambiguity.</Paragraph>
<Paragraph position="6"> The assignment of parts of speech to the segmented words is also affected by the word boundary ambiguity. This causes ambiguous patterns in a sentence, as shown in the following example: [Example 1: a Thai sentence with four alternative segmentations and taggings, labelled a) to d); the Thai text was lost in extraction]</Paragraph>
<Paragraph position="8"> In the above example, only c) and d) are meaningful sentences.</Paragraph> </Section>
<Section position="2" start_page="289" end_page="291" type="sub_section"> <SectionTitle> 2.2 Tag Ambiguity </SectionTitle>
<Paragraph position="0"> A Thai word can have more than one part of speech. This tag ambiguity can produce a large set of tagged word combinations. Consider the following example: [Example 2: a Thai sentence with multiple-tagged words; the Thai text was lost in extraction]</Paragraph>
<Paragraph position="2"> The multiple-tagged words above give 1024 combinations of word chains. However, only one word chain is correct. Figure 1 shows tag ambiguity in our corpus. As we can see, about 95% of the words are ambiguous with regard to the tags they take.</Paragraph>
<Paragraph position="3"> [Figure 1: distribution of words by number of tags, with columns: number of tags, number of words, percentage] Both word boundary and tag ambiguity increase the complexity of syntax analysis. They also increase the time needed to parse sentences. Besides these two ambiguities, spelling errors in Thai, called implicit spelling errors, also create much more work for the parser.</Paragraph> </Section>
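The compounding of the two ambiguities can be made concrete with a small sketch. The following Python fragment is illustrative only; the toy lexicon, its entries and its tag sets are hypothetical and not taken from the paper. It enumerates every segmentation of a character stream that is covered by lexicon entries and counts how many tagged word chains each segmentation fans out into:

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon: each "word" maps to its possible tags.
LEXICON: Dict[str, Set[str]] = {
    "c1c2":   {"N"},
    "c1c2c3": {"N", "V"},
    "c3c4c5": {"V"},
    "c4c5":   {"N", "prep"},
}

def segmentations(chars: List[str], start: int = 0) -> List[List[str]]:
    """Enumerate every way to cover chars[start:] with lexicon entries."""
    if start == len(chars):
        return [[]]
    results: List[List[str]] = []
    for end in range(start + 1, len(chars) + 1):
        candidate = "".join(chars[start:end])
        if candidate in LEXICON:
            for rest in segmentations(chars, end):
                results.append([candidate] + rest)
    return results

if __name__ == "__main__":
    stream = ["c1", "c2", "c3", "c4", "c5"]  # stands in for a run of Thai characters
    for seg in segmentations(stream):
        tag_chains = 1
        for word in seg:
            tag_chains *= len(LEXICON[word])  # each extra tag multiplies the chains
        print(seg, "->", tag_chains, "tagged word chains")
```

With a realistic Thai lexicon this fan-out is what produces candidate sets of the size mentioned above, such as the 1024 tagged word chains of Example 2.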
<Section position="3" start_page="291" end_page="293" type="sub_section"> <SectionTitle> 2.3 Implicit Spelling Error </SectionTitle>
<Paragraph position="0"> Implicit spelling errors, a kind of ill-formedness usually encountered in documents, are caused by either carelessness or lack of knowledge. This type of error cannot be detected simply by using a dictionary approach. There are three kinds of typing errors caused by carelessness: missing characters, [...].</Paragraph>
<Paragraph position="2"> [Example: a Thai sentence containing an implicit spelling error; the Thai text was lost in extraction, leaving only the tag line "pron V N prep N conj N mod" and the gloss "I push raft down water until leg expensive".] Implicit spelling errors can occur much more easily in Thai than in English or Japanese (in Hiragana) because the errors always involve using a word that has a similar pronunciation. About 20-30% of Thai words can cause this kind of confusion for a typist. Additionally, there are two characters on one key of the keyboard (see Figure 2). Thus, keyboard mistyping adds further ways of implicit misspelling which cannot be detected easily using the dictionary-based approach.</Paragraph>
<Paragraph position="4"> In this work, we attempt to provide a computational solution that handles these three nontrivial problems in order to make the job of a parser much easier. The next section presents an overview of the system.</Paragraph>
<Paragraph position="5"> 3. An Overview of Computational Morphological Processing for Thai. A computational model consisting of word segmenting, spelling checking and word filtering processes is proposed to handle the morphological problems mentioned earlier (see Figure 3). The input sentence is a stream of characters without explicit delimiters. Using word formation rules and a lexicon-based lookup algorithm [KAW95(a)], the word segmenting process will provide all possible word groupings with all possible tags. If there is any explicit error, then the second step, the spell checking process, will be called to give a suggestion with a set of the most likely words [KAW95(a)]. However, an implicit spelling error may still exist. In order to choose the right tagged word combination, the word filtering process will use the statistical association among words, collected as a statistical base, to eliminate the alternative and/or erroneous chains of words caused by word boundary and tagging ambiguities and implicit spelling errors. This paper concentrates only on the word filtering process. The details of the process will be discussed next.</Paragraph> </Section> </Section>
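As a reading aid, here is a minimal Python sketch of the three-stage flow described in Section 3. The function names segment_with_tags, suggest_explicit_corrections and filter_word_chains are hypothetical placeholders for the processes cited from [KAW95(a)] and the word filtering process of this paper, not the authors' actual interfaces, and the trigger condition for spell checking is an assumption.

```python
from typing import List, Tuple

TaggedChain = List[Tuple[str, str]]  # [(word, tag), ...]

def analyze(characters: str,
            segment_with_tags,
            suggest_explicit_corrections,
            filter_word_chains) -> TaggedChain:
    """Run word segmenting, spell checking and word filtering in sequence."""
    # 1. Word segmenting: all possible word groupings with all possible tags,
    #    produced by word formation rules and lexicon lookup.
    candidates: List[TaggedChain] = segment_with_tags(characters)

    # 2. Spell checking: invoked only when an explicit, dictionary-detectable
    #    error blocks segmentation (assumed trigger: no grouping covers the input).
    if not candidates:
        candidates = suggest_explicit_corrections(characters)

    # 3. Word filtering: statistical elimination of the alternative or erroneous
    #    chains caused by boundary ambiguity, tag ambiguity and implicit errors.
    return filter_word_chains(candidates)
```

Keeping the three stages as separate callables mirrors the description in the text, where spell checking is a fallback for explicit errors and word filtering is the only stage developed in this paper.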
<Section position="4" start_page="293" end_page="295" type="metho"> <SectionTitle> 4. Word Filtering </SectionTitle>
<Paragraph position="0"> All of word boundary, part-of-speech tag and implicit spelling error can be disambiguated by using a trigram model [CHAR 93] to calculate the probabilities of a word cluster. Among the sentences shown in Example 1, 1c) and 1d) are the meaningful sentences. In other words, the strength of association of the words in their chains is higher than that of 1a) and 1b). The association between the words in "the boat shakes" is stronger than in "the boat ox". In Example 2, we can also find the most likely sequence of parts of speech by considering the previous parts of speech. Since an implicit spelling error affects both meaning and tag (for instance, a Thai verb meaning "fly" confused with a similarly pronounced preposition meaning "on"), a special process is needed.</Paragraph>
<Paragraph position="1"> Consequently, word filtering consists of two processes: a filtering process used to eliminate useless tagged word combinations, and a scanning process used to detect and correct an implicit spelling error by generating a new set of words according to the causes of errors and selecting the one that maximizes the probability of the word cluster. Both processes need to look up statistical information collected from the hand-tagged corpus.</Paragraph>
<Section position="1" start_page="293" end_page="293" type="sub_section"> <SectionTitle> 4.1 The Training Corpus </SectionTitle>
<Paragraph position="0"> The training corpus is a set of sentences, divided into two groups. Each sentence in the first group is prepared to give a context for a word that could turn into an implicit spelling error, and a context for a sequence of words that have word boundary ambiguity. The second group consists of sentences prepared to give a context for multiple-tagged words. All of these sentences have already been segmented and tagged. Statistical information is collected from them as a statistical base to support both the filtering process and the scanning process. Thus, the collected statistics emphasize not only the frequency of individual words but also clusters of words.</Paragraph> </Section>
<Section position="2" start_page="293" end_page="295" type="sub_section"> <SectionTitle> 4.2 Markov Model as a Statistical Model of the Filtering Process </SectionTitle>
<Paragraph position="0"> A trigram model [CHAR 93] is utilized to calculate the probabilities of a word cluster, i.e.</Paragraph>
<Paragraph position="1"> how the previous two words affect the probability of the next word. This can be expressed as in equation (1), where $P_E(X)$ is the estimated probability for X based on some count C:

$P(w_1 \ldots w_n) \approx \prod_{i=1}^{n} P_E(w_i \mid w_{i-2}, w_{i-1})$  (1)

$P_E(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}$  (2)

So, to estimate the probability that $w_i$ appears after "$w_{i-2}, w_{i-1}$", we count how many times the pair "$w_{i-2}, w_{i-1}$" appears in our corpus and how many times "$w_{i-2}, w_{i-1}, w_i$" appears, and divide. Because of the sparse-data problem of the trigram model, rather than equation (1) we instead use

$P(w_1 \ldots w_n) \approx \prod_{i=1}^{n} \left[ \lambda_1 P_E(w_i) + \lambda_2 P_E(w_i \mid w_{i-1}) + \lambda_3 P_E(w_i \mid w_{i-2}, w_{i-1}) \right]$  (3)

Thus, we can compute better probabilities even though the relevant trigram or bigram data may be missing. The result of the experiment shows that assigning the values .1, .3, .6 to $\lambda_1, \lambda_2, \lambda_3$ [CHAR 93], respectively, gives a satisfactory solution for Thai word sequence probability. Using equation (3), the strength of association of the words in a chain can be calculated. In order to handle the tagging ambiguity problem, a trigram part-of-speech model is also used [DeRose 88]:

$P(t_1 \ldots t_n \mid w_1 \ldots w_n) \approx \prod_{i=1}^{n} P(t_i \mid w_i)\, P(t_i \mid t_{i-2}, t_{i-1})$  (4)

Since the proposed model is provided for disambiguating both word boundary and tag, we use the average of the probabilities calculated by equations (3) and (4) as the strength of a chain of tagged words and select the highest one as the most likely sequence of corrected words with their tags. For example, the strength of the word chain (see Example 1) in 1c) is higher than in 1a), while the probabilities of the sequences of parts of speech of 1a) and 1c) are equal. Based on the average of the strength of the word chain and the most likely sequence of parts of speech, 1c) will be selected as the solution of word segmentation and tagging.</Paragraph>
<Paragraph position="2"> 4.2.2 Two Parts of Word Filtering. There are two parts in word filtering (see the figure below). [Figure: a set of tagged word combinations enters the filtering process, whose output is passed to the scanning process.] The first part of word filtering, i.e., the filtering process, calculates the strength of each tagged word combination. The combination(s) that give the highest value will be the most likely sequence(s) of tagged words. In the second part, the scanning process, an implicit spelling error will be detected and corrected [KAW95(b)]. That is, the word cluster with the weakest strength will be assumed to contain an implicit spelling error. Then the words of a new set, generated according to the causes of error, will replace the detected word one by one. The replacement word that gives the highest value of the strength of the word chain will be the solution of the implicit spelling error.</Paragraph> </Section> </Section>
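To make the scoring concrete, here is a minimal Python sketch, under assumed data structures, of the interpolated word trigram of equation (3), the tag model of equation (4), and the averaging used as the chain strength. The count tables, the lexical P(t|w) table and all helper names are illustrative assumptions rather than the authors' implementation; only the interpolation weights 0.1, 0.3 and 0.6 come from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

L1, L2, L3 = 0.1, 0.3, 0.6  # lambda weights reported in the paper, after [CHAR 93]

def p_interp(x, x1, x2, uni: Counter, bi: Counter, tri: Counter, total: int) -> float:
    """lambda1*P(x) + lambda2*P(x|x1) + lambda3*P(x|x2,x1), unseen counts treated as 0."""
    p_uni = uni[x] / total if total else 0.0
    p_bi = bi[(x1, x)] / uni[x1] if uni[x1] else 0.0
    p_tri = tri[(x2, x1, x)] / bi[(x2, x1)] if bi[(x2, x1)] else 0.0
    return L1 * p_uni + L2 * p_bi + L3 * p_tri

def chain_strength(chain: List[Tuple[str, str]],
                   word_stats: Tuple[Counter, Counter, Counter, int],
                   tag_stats: Tuple[Counter, Counter, Counter, int],
                   lexical: Dict[Tuple[str, str], float]) -> float:
    """Average of the word-chain score (eq. 3) and the tag-chain score (eq. 4)."""
    word_score, tag_score = 1.0, 1.0
    padded = [("<s>", "<s>"), ("<s>", "<s>")] + chain  # two dummy start symbols
    for i in range(2, len(padded)):
        (w2, t2), (w1, t1), (w, t) = padded[i - 2], padded[i - 1], padded[i]
        word_score *= p_interp(w, w1, w2, *word_stats)          # eq. (3) factor
        tag_score *= lexical.get((t, w), 0.0) * p_interp(t, t1, t2, *tag_stats)  # eq. (4) factor
    return (word_score + tag_score) / 2.0
```

Under this sketch, the filtering process would keep the candidate chain(s) maximizing chain_strength, and the scanning process would re-score chains in which the weakest word has been replaced, one by one, by each word generated from the possible causes of error.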
<Section position="5" start_page="295" end_page="295" type="metho"> <SectionTitle> 5. Conclusion </SectionTitle>
<Paragraph position="0"> As the results of the experiment shown below indicate, the word filter can eliminate many of the alternative word sequences and correct implicit errors. This makes the job of the parser much easier and speeds it up.</Paragraph> </Section> </Paper>