<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1175">
  <Title>Combining Prediction by Partial Matching and Logistic Regression for Thai Word Segmentation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Syllable Segmentation
</SectionTitle>
    <Paragraph position="0"> Prediction by Partial Matching (PPM) (Bell et al., 1990; Cleary and Witten, 1984), a symbolwise compression scheme, is used to build the model for Thai text. PPM generates a prediction for each input symbol based on its previous context (i.e., a few, say k, forecoming symbols in the text). The prediction is encoded in form of conditional probability, conditioned on the preceding context.</Paragraph>
    <Paragraph position="1"> PPM maintains predictions, computed from the training data, for the largest context (k) as well as all shorter contexts in tables, as shown in Table 2.</Paragraph>
    <Paragraph position="2"> Syllable segmentation can be viewed as the problem of inserting spaces between pairs of characters in the text. Thai language consists of 66 distinct characters. Treating each character individually as in (Teahan et al., 2000) requires a large amount of training data in order to calculate all the probabilities in the tables, as well as a large amount of table space and time to lookup data from the tables. We reduce the amount of training data required by partitioning the characters into 16 types, as shown in Table 1. As a side effect of the character classification, the algorithm can handle syllables not present in the training data. Each character is represented by its respective type symbol. For instance &amp;quot;thMaa*Rthay*dii*suu*elh*ehliiym*e`aa*aiw&amp;quot; is represented as: &amp;quot;de*zdps*mu*hlt*asthg*ahsutss* aor*fst&amp;quot;. We then compute the predictions for each symbol as described in the previous section, and the results are shown in Table 2.</Paragraph>
    <Paragraph position="3">  We illustrate the insertion of spaces between characters using text &amp;quot;chMaaainkkhuu&amp;quot;. In Thai, tonals are not useful for the segmentation purpose, thus are first filtered out, and the text is converted to &amp;quot;de*fs*mu*hl&amp;quot;.</Paragraph>
    <Paragraph position="4"> Given an order of k, the algorithm computes the likelihood of each possible next symbol (i.e., the next character in the text or a space) by considering a context of size k at a time and then proceed to the next symbol in the text. The process is repeated until the text is exhausted. From the text &amp;quot;de*fs*mu*hl&amp;quot;, the model for space insertion becomes a tree-like structure, as shown in Figure 1.</Paragraph>
    <Paragraph position="5"> In order to predict the next symbol, the algorithm follows the concept of PPM by attempting to find first the context of length k (k = 2 in this example) for this symbol in the context table (i.e., e*-&gt;f). If the context is not found, it passes the probability of the escape character at this level and goes down one level to the (k-1) context table to find the current context of length k-1 (i.e., *-&gt;f). The process is repeated until a context is found. If it continues to fail to find a context, it may go down ultimately to order (-1) corresponding to equiprobable level for which the probability of any next character is 1/|A|, where A is the number of distinct characters.</Paragraph>
    <Paragraph position="6"> If, on the other hand, a context of length q, 0&lt;=q &lt;=k, is found, then the probability of this next character is estimated to be the product of probabilities of escape characters at levels k, k-1, ..., q+1 multiplied by the probability for the context found at the q-th level.</Paragraph>
    <Paragraph position="7"> To handle zero frequency, we use method D (PPMD) (Witten and Bell, 1991) where the escape character gets a probability of (d/2n), and the symbol gets a probability of (2c-1)/2n where n is the total number of symbols seen previously, d is the total number of distinct contexts, and c is the total number of contexts that appear in the string.</Paragraph>
    <Paragraph position="8"> After the tree-like structure is created, the algorithm selects as the final result the path with the highest probability at the lowest node. This corresponds to the path that gives the best compression according to the PPM text  To improve the efficiency of the algorithm, the structure can be pruned by the following set of rules, generated from the language analysis: The nodes surrounded by a rectangle in Figure 1 are pruned according to the rules above. Thus, they do not generate further subtrees.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="43" type="metho">
    <SectionTitle>
4 Combining Syllables into Words
</SectionTitle>
    <Paragraph position="0"> In this section, we propose a technique to form words by combining syllables together. In order to combine syllables into words, for each sentence we first locate ambiguous sequences of syllables, i.e., syllable sequences that can be combined in many ways. The forward and backward syllable-level longest matching are performed. These algorithms are modified from the original longest matching, described in Section 1, by considering syllable as a unit, instead of character. For instance, a syllable sequence &amp;quot;raay*ngaan*epn*tn*chbab&amp;quot; is processed according to the forward longest matching as &amp;quot;raay ngaan*epntn*chbab&amp;quot;, while as &amp;quot;raayngaan*epn*tnchbab&amp;quot; according to the backward longest matching. The inconsistencies between the two algorithms suggest ambiguous sequences of syllables in the sentence. In this example, an ambiguous sequence of syllables is &amp;quot;epn*tn*chbab&amp;quot;.</Paragraph>
    <Paragraph position="1"> After identifying ambiguous syllable sequences, we perform the following steps: Step 1: Between the results of the forward and backward longest matching, the one with all words appearing in the dictionary is selected as the result of the ambiguous sequence. If both results satisfy this condition, go to Step 2.</Paragraph>
    <Paragraph position="2"> Step 2: The result with the least number of words is taken as the answer. If the number of words are equal, go to Step 3.</Paragraph>
    <Paragraph position="3"> Step 3: A logistic regression model for combining syllables is consulted. This step will be discussed in details below.</Paragraph>
    <Paragraph position="4">  Syllable 1</Paragraph>
    <Section position="1" start_page="0" end_page="43" type="sub_section">
      <SectionTitle>
4.1 Logistic Regression Model for Combining Syllables
</SectionTitle>
      <Paragraph position="0"> The model to combine syllables is built upon Binary Logistic Regression whose answers are either combine or not combine. The model considers four consecutive syllables at a time when modeling the decision of whether to combine the middle two syllables together. The first and the fourth syllables are considered the context of the two middle ones. Table 3 shows the organization of data for the model. In the first row, the training data specifies that syllables &amp;quot;rab&amp;quot; and &amp;quot;r`ng&amp;quot; (with the preceding contextual syllable &amp;quot;ekhaa&amp;quot; and the following contextual syllable &amp;quot;ethaa&amp;quot;) should not be combined. The model is trained by every row of the training data. The result is a trained logistic regression model that can be used for guiding whether the middle two syllables should be combined in the context of the surrounding syllables (the first and the fourth syllables).</Paragraph>
      <Paragraph position="1"> In the model, each syllable (in Table 3) is represented by a set of features. The syllables under consideration (the second and the third syllables) are represented by 65 features, listed in  The contextual syllables (the first and the fourth) are represented by a fewer number of features to make it less specific to the training contexts. The variables for contextual syllables are those statistically significant to the prediction, returned with the regression. The final set consists of 35 variables, as shown in Table 5. The value of each variable is either 1 or -1 which means either the syllable contains or does not contain that particular character, respectively.</Paragraph>
    </Section>
  </Section>
</Paper>