<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1057">
  <Title>Non-Dictionary-Based Thai Word Segmentation Using Decision Trees</Title>
  <Section position="3" start_page="0" end_page="1" type="intro">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Word segmentation is a crucial topic in the analysis of languages without word boundary markers. Many researchers have developed and implemented segmentation methods in pursuit of higher accuracy.</Paragraph>
    <Paragraph position="1"> Word segmentation in Thai, as in many other Asian languages, is more complex than in English because the language has no explicit word boundary delimiter, such as a space, to separate words. Identifying word boundaries precisely is further complicated because Thai characters occupy several levels and play several roles, which can lead to ambiguity in segmentation. Most previous Thai word segmentation systems have been based on a dictionary ([2], [3], [4], [6], [7]). A dictionary-based segmenter has to cope with the unknown word problem. To date, most research on dictionary-based Thai word segmentation has suffered from this problem and has introduced some special process to handle it. In our preliminary experiment, we extracted words from a pre-segmented corpus to form a dictionary, randomly deleted some words from the dictionary, and used the modified dictionary for segmentation with two well-known techniques: maximum matching and longest matching. The result is shown in Figure 1, which reports accuracy at different percentages of unknown words. With no unknown words, the accuracy is around 97% for both maximum matching and longest matching, but it drops to 54% and 48%, respectively, when 50% of the words are unknown. As the percentage of unknown words rises, the accuracy drops continuously. This result reflects the seriousness of the unknown word problem in word segmentation.</Paragraph>
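The effect of a missing dictionary entry can be illustrated with a minimal sketch of greedy longest matching. This is not the system used in the preliminary experiment: the function name, the single-character fallback for unknown text, and the unspaced English toy string (standing in for Thai text) are all assumptions made for illustration.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation (illustrative sketch).

    At each position, take the longest dictionary word that starts there.
    If no dictionary word matches, emit one character as an unknown token
    (an assumed fallback, not taken from the paper).
    """
    max_len = max(map(len, dictionary), default=1)
    tokens = []
    i = 0
    while text[i:]:
        # Try candidate lengths from longest to shortest.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                break
        else:
            # No dictionary word starts at position i.
            candidate = text[i]
        tokens.append(candidate)
        i += len(candidate)
    return tokens


# Hypothetical toy dictionary; unspaced English stands in for Thai.
full_dictionary = {"the", "cat", "sat", "a", "at"}
reduced_dictionary = full_dictionary - {"cat"}  # simulate one unknown word

print(longest_matching("thecatsat", full_dictionary))
print(longest_matching("thecatsat", reduced_dictionary))
```

With the full dictionary the toy string is split into its three words. Once "cat" is deleted, the segmenter emits a stray "c" and then wrongly matches "at", so a single unknown word damages more than one boundary.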
    <Paragraph position="2"> [Figure 1: accuracy versus percentage of unknown words] In this paper, to handle both known and unknown words, we propose a non-dictionary-based system whose knowledge is based on the decision tree model ([5]). This model attempts to identify the word boundaries of a Thai text. To do this, specific information about the structure of Thai words is needed. We call this information the syntactic attributes of Thai words. In the learning stage, a training corpus is used to construct a decision tree with the C4.5 algorithm. In the segmentation process, a Thai text is segmented according to the rules produced by the resulting decision tree. The rest of the paper presents the proposed method, experimental results, discussion, and conclusion. (Footnote: National Electronics and Computer Technology Center (NECTEC), 539/2 Sriyudhya Rd., Rajthevi, Bangkok 10400, Thailand.)</Paragraph>
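To make the learning stage more concrete, the following sketch computes the gain ratio, the criterion C4.5 uses to choose which attribute to split on. It is only an illustration under stated assumptions: the attribute names prev_type and next_type, their values, and the four labelled samples are hypothetical stand-ins and are not the syntactic attributes or data of the proposed method.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def gain_ratio(samples, attribute):
    """Gain ratio of one attribute over (features, label) samples."""
    labels = [label for _, label in samples]
    # Group the labels by the value the attribute takes in each sample.
    groups = {}
    for features, label in samples:
        groups.setdefault(features[attribute], []).append(label)
    total = len(samples)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy([features[attribute] for features, _ in samples])
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical training samples: each describes a candidate boundary position
# by the (assumed) types of the neighbouring characters.
samples = [
    ({"prev_type": "vowel", "next_type": "consonant"}, "boundary"),
    ({"prev_type": "vowel", "next_type": "consonant"}, "boundary"),
    ({"prev_type": "consonant", "next_type": "vowel"}, "inside"),
    ({"prev_type": "consonant", "next_type": "consonant"}, "inside"),
]

for name in ("prev_type", "next_type"):
    print(name, round(gain_ratio(samples, name), 4))
```

On this toy data prev_type separates the two labels perfectly and therefore receives the higher gain ratio, so a C4.5-style learner would place it at the root of the tree. The paths of the finished tree then correspond to the segmentation rules applied to new text.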
  </Section>
</Paper>