<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1175"> <Title>Combining Prediction by Partial Matching and Logistic Regression for Thai Word Segmentation</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In the Thai language, characters are written without explicit word boundaries. Depending on the context, there can be many ways to break a string into words; for instance, &quot;`aacch`ng&quot; can be segmented as &quot;`aa*cch`ng&quot; or &quot;`aacch*`ng&quot;, and &quot;nangtaaklm&quot; can be segmented as &quot;nang*taa*klm&quot; or &quot;nang*taak*lm&quot;. This complicates the task of identifying word boundaries.</Paragraph> <Paragraph position="1"> Longest matching is the most popular approach to Thai word segmentation (Pooworawan, 1986).</Paragraph> <Paragraph position="2"> The algorithm scans text from left to right and, at each point, greedily selects the longest match with a dictionary entry. However, the longest possible word may not correspond to the intended meaning. For example, &quot;chaawbaanr`kraabphra&quot; is segmented by longest matching as &quot;chaawbaan-r`kraab-phra&quot; instead of the correct segmentation &quot;chaawbaanr`-kraab-phra&quot;. This type of ambiguity is referred to as character-level ambiguity. In addition, &quot;ekhaarabr`ngethaacchaakephuue`n&quot; is segmented as &quot;ekhaa-rabr`ng-ethaa-cchaak-ephuue`n&quot; instead of the correct segmentation &quot;ekhaa-rab-r`ngethaa-cchaak-ephuue`n&quot;. This is referred to as syllable-level ambiguity.</Paragraph> <Paragraph position="3"> The technique we propose is a two-step approach to word segmentation. In the first step, text is segmented into a sequence of syllables, whose structures are better defined. This reduces the character-level ambiguity. 
Resolving the remaining syllable-level ambiguity then becomes the task of combining those syllables into words.</Paragraph> </Section> </Paper>
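<!--
The longest matching procedure described in the introduction can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of Pooworawan (1986) or of this paper: the function name longest_matching, the toy English dictionary standing in for a Thai lexicon, and the fallback of emitting a single character when no entry matches are all assumptions made for the example.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation: at each position, take the
    longest dictionary entry that matches there."""
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            # Assumption: if nothing matches, emit one character and move on.
            j = i + 1
        words.append(text[i:j])
        i = j
    return words


# Hypothetical toy dictionary (English stand-in for a Thai lexicon).
TOY_DICTIONARY = {"the", "them", "theme", "meat", "eat", "at"}

# Greedy matching commits to "theme" and never considers "the" + "meat".
print(longest_matching("themeat", TOY_DICTIONARY))  # ['theme', 'at']
```

The toy string has several valid dictionary segmentations ("the" + "meat", "them" + "eat", "theme" + "at"), yet the greedy scan always returns the one that starts with the longest word, which mirrors the ambiguity problem the introduction attributes to longest matching.
-->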