<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2194">
  <Title>A Gradual Refinement Model for A Robust Thai Morphological Analyzer</Title>
  <Section position="3" start_page="1087" end_page="1088" type="metho">
    <SectionTitle>
4. A Gradual Refinement Module
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1087" end_page="1087" type="sub_section">
      <SectionTitle>
4.1 Preference based Pruning
</SectionTitle>
      <Paragraph position="0"> From Fig. 1, Thai word segmentation can be implemented as follows: case i - only the longest segmentation or only the shortest segmentation is possible; case ii - both the longest segmentation and the shortest segmentation are possible.</Paragraph>
      <Paragraph position="1"> The former case is processed at this stage by looking up the preferred words (see Table 2). Some of them are determined by the co-occurring word on the left or right. The latter case is processed by the next steps.</Paragraph>
      <Paragraph position="3"> means a word on the right.</Paragraph>
      <Paragraph position="4"> In summary, word-boundary preference is used to prune word chains that contain impossible or rarely occurring word segmentations.</Paragraph>
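The longest- vs. shortest-match distinction above can be sketched with a greedy dictionary segmenter. This is a minimal illustration, not the paper's implementation: the lexicon is a toy Latin-script word list, and all names are invented for the example.

```python
# Toy lexicon; the paper's Thai lexicon and preference table (Table 2)
# are not reproduced here.
LEXICON = {"ta", "tak", "in", "kin"}

def segment(text, longest=True):
    """Greedy dictionary matching from the left.

    longest=True  tries the longest lexicon prefix first;
    longest=False tries the shortest prefix first.
    Returns a word list, or None when no segmentation exists.
    """
    if text == "":
        return []
    # prefix lengths whose prefix is a lexicon word
    lengths = [n for n in range(1, len(text) + 1) if text[:n] in LEXICON]
    if not lengths:
        return None
    for n in sorted(lengths, reverse=longest):
        rest = segment(text[n:], longest)
        if rest is not None:
            return [text[:n]] + rest
    return None

print(segment("takin", longest=True))   # longest-match segmentation
print(segment("takin", longest=False))  # shortest-match segmentation
```

With this toy lexicon the two strategies disagree on "takin" (case ii in the paper's terms), which is exactly the situation the later pruning stages must resolve.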
    </Section>
    <Section position="2" start_page="1087" end_page="1087" type="sub_section">
      <SectionTitle>
4.2 Syntactic based Pruning and Implicit Spelling
Correction
</SectionTitle>
      <Paragraph position="0"> At this stage, the syntactic coarse rules are used for pruning the remaining erroneous word chains caused by the word boundary ambiguities, tagging ambiguities and/or implicit spelling errors.</Paragraph>
      <Paragraph position="1"> Syntactic Coarse Rules: An example of the syntactic coarse rules for a set of two consecutive words (Wi, Wi+1) in Thai grammar is given as follows: if Wi is a noun then Wi+1 might be: noun, verb, modifier, ...; if Wi is a verb then Wi+1 might be: noun, postverb, modifier, .... The POS matrix (PM) given below is used to implement the finite state automaton model of the syntactic coarse rules, where the syntactic category of Wi is cat_i and that of Wi+1 is cat_i+1.</Paragraph>
      <Paragraph position="2">
                cat_i+1:
  cat_i    stop  noun  verb  mod.  postv.  cl.  ...
  start           1     1
  noun      0     1     1
  verb      1     1     1
  mod.      0     0     0
  postv.    0
  cl.       0
      </Paragraph>
      <Paragraph position="4"> Note: start means the beginning of a sentence, stop means the end of a sentence.</Paragraph>
      <Paragraph position="5"> Together with the POS matrix, some constraints, called flags, are used to change PMij from 0 to 1. For example: if there exists a "verb" before a "modifier" then flag = 1, else flag = 0. According to the above constraint, PMij, where i = modifier and j = postverb, can be changed from 0 to 1 if flag equals 1. Based on the POS matrix and the constraints, we can now use the following definition to detect the position of an error in the word chains: check(cat_i, cat_i+1) = True if PMij = 1; True if (PMij = 0) and (flag = 1); False if (PMij = 0) and (flag = 0). Consider the following example: W1 W2 W3 W4 = {tube-shaped container: n, cl} {is: v} {on: prep} {table: n}. As shown above, W1 has two tags: noun and classifier. However, "classifier" will be pruned since it violates the syntactic coarse rule that a "classifier" cannot be an initial word. The POS matrix is used to disambiguate word boundaries as well.</Paragraph>
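The check definition above can be sketched in code. The 0/1 entries below are invented for the illustration (they are not the paper's Table 3), and the single flag constraint is the "verb before modifier" example from the text.

```python
# PM[(cat_i, cat_i+1)] = 1 means the pair is allowed; missing pairs
# default to 0. Entries are illustrative, not the paper's actual matrix.
PM = {
    ("start", "noun"): 1, ("start", "verb"): 1,
    ("noun", "verb"): 1, ("noun", "noun"): 1,
    ("verb", "mod"): 1, ("mod", "stop"): 1,
}

def allowed(cat_i, cat_next, flag=0):
    """check(cat_i, cat_i+1): True if PMij = 1, or if PMij = 0 but flag = 1."""
    if PM.get((cat_i, cat_next), 0) == 1:
        return True
    return flag == 1  # PMij = 0: allowed only when a constraint raised the flag

def first_error(tags):
    """Return the index of the first disallowed pair, or None if the chain is OK.

    Simplification: the flag is raised once a verb has been seen, standing in
    for the paper's "verb before modifier" constraint.
    """
    chain = ["start"] + tags + ["stop"]
    flag = 0
    for i in range(len(chain) - 1):
        if chain[i] == "verb":
            flag = 1
        if not allowed(chain[i], chain[i + 1], flag):
            return i  # error marker position
    return None
```

On the W1 example in the text, the classifier reading fails at position 0 because PM[(start, cl)] = 0 and no flag is raised, so that chain is pruned.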
      <Paragraph position="6"> Finally, if there is no word chain with an entirely correct POS sequence, the erroneous word chain whose error marker is at the most remote position will be selected, on the assumption that it contains an implicit spelling error. The word generating function is then called to generate a set of candidate words for that position, and the process starts pruning at this stage again.</Paragraph>
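The fall-back step in the paragraph above can be sketched as a selection over precomputed error-marker positions. The function name and interface are hypothetical, not the paper's.

```python
def select_suspect(chains, error_pos):
    """Pick the chain to repair for a suspected implicit spelling error.

    chains:    candidate tag chains
    error_pos: error_pos[i] is the first-error index of chains[i],
               or None if that chain is fully grammatical.
    Returns (position, chain) for the chain whose error marker is at the
    most remote position, or None when some chain needs no repair.
    """
    if any(p is None for p in error_pos):
        return None  # a grammatical chain exists; no spelling repair needed
    i = max(range(len(chains)), key=lambda k: error_pos[k])
    return error_pos[i], chains[i]
```

The returned position is where the (hypothetical) word generating function would propose candidate words before pruning restarts.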
    </Section>
    <Section position="3" start_page="1087" end_page="1088" type="sub_section">
      <SectionTitle>
4.3 Semantic based Pruning and Implicit Spelling
Correction
</SectionTitle>
      <Paragraph position="0"> Since the syntactic coarse rules only weed out the erroneous POS word chains, some errors still remain. At this stage, the semantic information from the Lexicon Matrix (Kawtrakul et al., 1995a) is accessed and used to calculate the semantic dependency strength between certain pairs of words. Consider the following example, which contains a spelling error: {he: pron.} {bent: verb} {so much: modifier}, where the dependency strength between {he} and {bent} is weak while that between {bent} and {so much} is strong. As shown in the above example, there is no POS chain error, but there exists an implicit spelling error which can be detected by using the semantic dependency strength. The word generating function will be called to generate a set of candidate words for the two consecutive words that have weak strength; the process then returns to step 2 for pruning the erroneous POS chains and goes to step 3 for calculating the semantic strength again. The chain with the strongest strength is selected as the most likely sequence of the right words in the right tags. The final solution for the above example is {he: pron.} {think: verb} {so much: modifier}.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="1088" end_page="1088" type="metho">
    <SectionTitle>
5. Experimentation Results
</SectionTitle>
    <Paragraph position="0"> We tested the system on a PC-486 DX2 using a corpus of two hundred sentences. The percentages of words correctly segmented, tagged and spelled based on the gradual refinement module, together with time efficiency, are compared with the results of a statistical approach to word filtering on a small training corpus (Kawtrakul, 1995b), as shown in Table 4.</Paragraph>
  </Section>
</Paper>