File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/c04-1175_intro.xml

Size: 2,093 bytes

Last Modified: 2025-10-06 14:02:13

<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1175">
  <Title>Combining Prediction by Partial Matching and Logistic Regression for Thai Word Segmentation</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In Thai, text is written without explicit word boundaries. Depending on the context, a string can be broken into words in more than one way. For instance, "`aacch`ng" can be segmented as "`aa*cch`ng" or "`aacch*`ng", and "nangtaaklm" can be segmented as "nang*taa*klm" or "nang*taak*lm". This ambiguity complicates the task of identifying word boundaries.</Paragraph>
    <Paragraph position="1"> Longest matching is the most popular approach to Thai word segmentation (Pooworawan, 1986).</Paragraph>
    <Paragraph position="2"> The algorithm scans the text from left to right and, at each point, greedily selects the longest string that matches a dictionary entry. However, the longest possible word does not always agree with the intended meaning. For example, "chaawbaanr`kraabphra" is segmented by longest matching as "chaawbaan-r`kraab-phra" instead of the correct segmentation "chaawbaanr`-kraab-phra". This type of ambiguity is referred to as character-level ambiguity. Similarly, "ekhaarabr`ngethaacchaakephuue`n" is segmented as "ekhaa-rabr`ng-ethaa-cchaak-ephuue`n" instead of the correct segmentation "ekhaa-rab-r`ngethaa-cchaak-ephuue`n". This is referred to as syllable-level ambiguity.</Paragraph>
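The greedy scan described above can be sketched as follows. This is a minimal illustration in Python, not the implementation of Pooworawan (1986): the function name `longest_matching`, the toy English dictionary, and the single-character fallback for text with no dictionary match are all assumptions made for the example.

```python
def longest_matching(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation (illustrative sketch).

    At each position, take the longest substring found in `dictionary`.
    """
    if max_len is None:
        # Longest dictionary entry bounds how far ahead we need to look.
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i != len(text):
        # Try the longest candidate first and shrink until an entry matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                break
        else:
            # Assumption: an unmatched character becomes a one-character token.
            candidate = text[i]
        words.append(candidate)
        i += len(candidate)
    return words


# Hypothetical toy dictionary; the string has two valid segmentations.
toy_dictionary = {"the", "them", "mend", "end"}
print(longest_matching("themend", toy_dictionary))  # ['them', 'end']
```

With this toy dictionary the string also admits the segmentation "the" + "mend", which the greedy scan never considers because it commits to "them" first. This mirrors the kind of error shown in the examples above.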
    <Paragraph position="3"> We propose a two-step approach to word segmentation. In the first step, the text is segmented into a sequence of syllables, whose structure is better defined than that of words. This reduces the character-level ambiguity. The second step addresses the remaining syllable-level ambiguity by combining those syllables into words.</Paragraph>
  </Section>
</Paper>