<?xml version="1.0" standalone="yes"?> <Paper uid="H01-1057"> <Title>Non-Dictionary-Based Thai Word Segmentation Using Decision Trees</Title> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 2. PREVIOUS APPROACHES </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.1 Longest Matching </SectionTitle> <Paragraph position="0"> Most early work on Thai word segmentation is based on the longest matching method ([4]). The method scans an input sentence from left to right and selects the longest match with a dictionary entry at each point. If the selected match prevents the algorithm from finding the remaining words in the sentence, the algorithm backtracks to the next longest match and continues from there. Because of its greedy character, this algorithm fails to find the correct segmentation in many cases. For example, aiphaamehsii (go to see the queen) will be incorrectly segmented as aip (go) haam (carry) eh (deviate) sii (color), while the correct segmentation, which the algorithm cannot find, is aip (go) haa (see) mehsii (queen).</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.2 Maximum Matching </SectionTitle> <Paragraph position="0"> The maximum matching algorithm was proposed to solve the problem of the longest matching algorithm described above ([7]).</Paragraph> <Paragraph position="1"> This algorithm first generates all possible segmentations of a sentence and then selects the one that contains the fewest words, which can be done efficiently with a dynamic programming technique. Because the algorithm finds a true maximum matching instead of guessing with a local greedy heuristic, it always outperforms the longest matching method. Nevertheless, when several alternatives contain the same number of words, the algorithm cannot determine the best candidate, and other heuristics have to be applied. The heuristic often used is again a greedy one: prefer the longest match at each point. For example, taak (expose) lm (wind) is preferred to taa (eye) klm (round).</Paragraph> </Section> <Section position="3" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.3 Feature-based Approach </SectionTitle> <Paragraph position="0"> A number of feature-based methods have been developed in ([3]) for resolving ambiguity in word segmentation. In this approach, the system generates multiple possible segmentations for a string that has segmentation ambiguity; the problem is then how to select the best segmentation from the set of candidates. For this purpose, the research applies and compares two learning techniques, RIPPER and Winnow. RIPPER is a propositional learning algorithm that constructs a set of rules, while Winnow is a weighted-majority learning algorithm that learns a network in which each node is called a specialist. Each specialist looks at a particular value of an attribute of the target concept and votes for a value of the target concept based on its specialty, i.e., based on the value of the attribute it examines. The global algorithm combines the votes from all specialists and makes a decision. This approach is dictionary-based. It correctly segments 91-99% of sentences, measured as the ratio of correctly segmented sentences to the total number of sentences.</Paragraph> </Section>
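To make the contrast between the two dictionary-based strategies concrete, the following sketch (ours, not from the paper) implements both greedy longest matching and fewest-words maximum matching. The toy dictionary of romanized words, and all function names, are illustrative assumptions built from the examples above.

# A rough sketch (ours, not the paper's code) contrasting greedy longest
# matching with fewest-words maximum matching on the paper's example.
DICT = {"aip", "haam", "haa", "eh", "sii", "mehsii"}
MAX_WORD_LEN = max(len(w) for w in DICT)

def longest_matching(text):
    """Greedy left-to-right longest match (backtracking omitted for brevity)."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_WORD_LEN), i, -1):
            if text[i:j] in DICT:
                words.append(text[i:j])
                i = j
                break
        else:
            return None  # stuck: no dictionary entry matches at this point
    return words

def maximum_matching(text):
    """Dynamic programming: pick the full segmentation with the fewest words."""
    n = len(text)
    best = [None] * (n + 1)   # best[i]: fewest-words segmentation of text[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            if best[j] is not None and text[j:i] in DICT:
                candidate = best[j] + [text[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]

print(longest_matching("aiphaamehsii"))  # ['aip', 'haam', 'eh', 'sii'] -- wrong
print(maximum_matching("aiphaamehsii"))  # ['aip', 'haa', 'mehsii'] -- 3 words

The greedy scan commits to "haam" and never recovers, while the dynamic program considers every dictionary-consistent split of each prefix and keeps the one with the fewest words.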
<Section position="4" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.4 Thai Character Cluster </SectionTitle> <Paragraph position="0"> In Thai, some contiguous characters tend to form an inseparable unit, called a Thai character cluster (TCC). Unlike word segmentation, which is a very difficult task, segmenting a text into TCCs is easily realized by applying a set of rules. A method to segment a text into TCCs was proposed in ([8]). The method needs no dictionary and always segments a text correctly at every word boundary, in the sense that a TCC never crosses a word boundary.</Paragraph> </Section> </Section> <Section position="5" start_page="1" end_page="1" type="metho"> <SectionTitle> 3. WORD SEGMENTATION WITH DECISION TREE MODELS </SectionTitle> <Paragraph position="0"> In this paper, we propose a word segmentation method that (1) uses a set of rules to combine contiguous characters into inseparable, syllable-like units and (2) then applies a learned decision tree to combine these contiguous units into words. This section briefly describes the concept of the TCC and the proposed method based on decision trees.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.1 Segmenting a Text into TCCs </SectionTitle> <Paragraph position="0"> In Thai, some contiguous characters tend to form an inseparable unit, called a Thai character cluster (TCC). Unlike word segmentation, which is a very difficult task, segmenting a text into TCCs is easily realized by applying a set of rules (in our system, 42 BNF rules). The method to segment a text into TCCs was proposed in ([8]); it needs no dictionary and always segments a text correctly at every word boundary. As the first step of our word segmentation approach, a set of rules is applied to group contiguous characters in a text into TCCs. The accuracy of this process is 100% in the sense that no resulting unit ever spans a word boundary, i.e., no unit contains substrings belonging to two or more different words. The process can be implemented without a dictionary, using only a set of simple linguistic rules based on the types of characters. Figure 2 displays the types of Thai characters. As an example rule, a front vowel and the consonant that follows it must belong to the same unit. Figure 3 shows a fragment of a text segmented into TCCs by this method, together with its correct word segmentation; the character '|' indicates a segmentation point. A corpus whose characters have been grouped into TCCs is called a TCC corpus.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.2 Learning a Decision Tree for Word Segmentation </SectionTitle> <Paragraph position="0"> To learn a decision tree for this task, some attributes are needed for classifying whether two contiguous TCCs form a single unit or not. In this paper, eight types of attributes are proposed to identify possible word boundaries. The answers (or classes) in the decision tree for this task are of two types: combine and not combine. Moreover, to decide whether two contiguous TCCs should be combined or not, the TCC in front of the current two TCCs and the TCC behind them are also taken into account. That is, there are four sets of attributes: two for the current two TCCs and two for the TCCs in front of and behind them. Therefore, the total number of attributes is 32 (that is, 8x4), and there is one dependent variable indicating whether the current two contiguous TCCs should be combined.</Paragraph>
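As a concrete reading of this four-TCC windowing scheme, the sketch below (ours, not the paper's code) builds labeled training instances from a list of TCCs and the word boundaries of a human-segmented corpus. The attr() function is a placeholder: the paper's actual eight attribute types are not reproduced here.

# A minimal sketch (our assumption of the procedure) of building training
# instances over a 4-TCC window: 8 attributes per TCC, 32 per instance.

def attr(tcc):
    """Placeholder for the paper's eight attribute values of one TCC."""
    return [len(tcc)] + [0] * 7  # 8 dummy values per TCC

def training_instances(tccs, word_boundaries):
    """tccs: TCC strings in order; word_boundaries: set of character offsets
    that are word boundaries in the human-segmented (correct) corpus."""
    ends, offset = [], 0
    for t in tccs:                      # character offset at each TCC's end
        offset += len(t)
        ends.append(offset)
    instances = []
    for i in range(1, len(tccs) - 2):   # buffer holds blocks i-1 .. i+2
        window = tccs[i - 1:i + 3]
        features = [v for t in window for v in attr(t)]  # 8 x 4 = 32 values
        # Combine (1) iff the junction between the second and third blocks
        # is NOT a word boundary in the correct corpus; NotCombine (0) else.
        label = 0 if ends[i] in word_boundaries else 1
        instances.append(features + [label])
    return instances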
<Paragraph position="2"> Figure 5 illustrates the process of extracting attributes from the TCC corpus to build a training corpus. The process works by investigating the current TCCs in a buffer and recording their attribute values. The dependent variable is set by comparing the combination of the second and third blocks of characters in the buffer with the same string in the correct word-segmented corpus, i.e., the corpus segmented by humans. This comparison determines whether the second and third blocks in the buffer should be merged or not.</Paragraph> <Paragraph position="3"> The result is kept as a training instance with the dependent variable "Combine (1)" or "NotCombine (0)". The start of the buffer is then shifted by one block, and the process repeats until the buffer reaches the end of the corpus. The resulting training set is used as input to the C4.5 program ([5]) for learning a decision tree.</Paragraph> <Paragraph position="4"> The C4.5 program examines the training set and constructs a decision tree using statistical values calculated from the observed events.</Paragraph> <Paragraph position="5"> After the decision tree is created, a certainty factor is calculated and assigned to each leaf as a final decision-making factor. The certainty factor indicates how certain the answer at each terminal node is; it is calculated from the counts of class answers at each leaf of the tree. For example, suppose leaf node i holds ten class answers, six of which are "Combine" and the rest "Not Combine"; the answer at this node is "Combine" with a certainty factor of 0.6 (6/10). On the other hand, if leaf node j holds five elements, two "Combine" and three "Not Combine", the answer at this node is "Not Combine" with a certainty factor of 0.6 (3/5). The general formula for the certainty factor (CF) is:

CF_i = (number of elements of the answer class at leaf node i) / (total number of elements at leaf node i)

We also calculate recall, precision, and accuracy, defined as follows:

Precision = (number of correct '|'s in the system answer) / (number of '|'s in the system answer)

Recall = (number of correct '|'s in the system answer) / (number of '|'s in the correct answer)

Accuracy = (number of correctly segmented units in the system answer) / (total number of segmented units in the correct answer)</Paragraph> </Section> </Section> </Paper>
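As a closing illustration (ours, not part of the paper), the sketch below computes a leaf's answer and certainty factor from its class counts, and the boundary-based precision and recall from sets of '|' positions. All names and counts here are hypothetical.

# A small worked sketch (our illustration) of the certainty factor and the
# boundary-based evaluation metrics defined above.

def certainty_factor(class_counts):
    """class_counts: mapping from class label to count at one leaf node."""
    answer = max(class_counts, key=class_counts.get)
    return answer, class_counts[answer] / sum(class_counts.values())

def precision_recall(system_cuts, correct_cuts):
    """Each argument is a set of character offsets where a '|' was placed."""
    hits = len(system_cuts & correct_cuts)
    return hits / len(system_cuts), hits / len(correct_cuts)

print(certainty_factor({"Combine": 6, "NotCombine": 4}))  # ('Combine', 0.6)
print(certainty_factor({"Combine": 2, "NotCombine": 3}))  # ('NotCombine', 0.6)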