<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1080">
  <Title>Self-Organizing n-gram Model for Automatic Word Spacing</Title>
  <Section position="5" start_page="633" end_page="633" type="metho">
    <SectionTitle>
about BDBC
BG
</SectionTitle>
    <Paragraph position="0"> , which implies that the number of bi-grams reaches BDBC</Paragraph>
  </Section>
  <Section position="6" start_page="633" end_page="634" type="metho">
    <SectionTitle>
BK
</SectionTitle>
    <Paragraph position="0"> . In order to obtain stable statistics for all bigrams, a great large volume of corpora will be required. If higher order D2-gram is adopted for better accuracy, the volume of corpora required will be increased exponentially.</Paragraph>
    <Paragraph position="1"> The main drawback of D2-gram model is that it suffers from data sparseness however large the corpus is. That is, there are many D2-grams of which frequency is zero. To avoid this problem, many smoothing techniques have been proposed for construction of D2-gram models (Chen and Goodman, 1996). Most of them belongs to one of two categories. One is to pretend each D2-gram occurs once more than it actually did (Mitchell, 1996). The other is to interpolate D2-grams with lower dimensional data (Jelinek and Mercer, 1980; Katz, 1987). However, these methods artificially modify the original distribution of corpus. Thus, the final probabilities used in learning with D2-grams are the ones distorted by a smoothing technique. null A maximum entropy model can be considered as another way to avoid zero probability in D2-gram models (Rosenfeld, 1996). Instead of constructing separate models and then interpolate them, it builds a single, combined model to capture all the information provided by various knowledge sources. Even though a maximum entropy approach is simple, general, and strong, it is computationally very expensive. In addition, its performance is mainly dependent on the relevance of knowledge sources, since the prior knowledge on the target problem is very important (Park and Zhang, 2002). Thus, when prior knowledge is not clear and computational cost is an important factor, D2-gram models are more suitable than a maximum entropy model.</Paragraph>
    <Paragraph position="2"> Adapting features or contexts has been an important issue in language modeling (Siu and Ostendorf, 2000). In order to incorporate long-distance features into a language model, (Rosenfeld, 1996) adopted triggers, and (Mochihashi and Mastumoto, 2006) used a particle filter. However, these methods are restricted to a specific language model. Instead of long-distance features, some other researchers tried local context extension. For this purpose, (Sch&amp;quot;utze and Singer, 1994) adopted a variable memory Markov model proposed by (Ron et al., 1996), (Kim et al., 2003) applied selective extension of features to POS tagging, and (Dickinson and Meurers, 2005) expanded context of D2-gram models to find errors in syntactic anno- null tation. In these methods, only neighbor words or features of the target D2-grams became candidates to be added into the context. Since they required more information for better performance or detecting errors, only the context extension was considered. null</Paragraph>
  </Section>
  <Section position="7" start_page="634" end_page="634" type="metho">
    <SectionTitle>
3 Automatic Word Spacing by D2-gram
</SectionTitle>
    <Paragraph position="0"> Model The problem of automatic word spacing can be regarded as a binary classification task. Let a sentence be given as CB BP DB</Paragraph>
  </Section>
  <Section position="8" start_page="634" end_page="634" type="metho">
    <SectionTitle>
BD
DB
BE
BMBMBMDB
</SectionTitle>
    <Paragraph position="0"> . If i.i.d. sampling is assumed, the data from this sentence are given as BW BPBO B4DC  ,istrue.Itisfalse otherwise. Therefore, the automatic word spacing is to estimate a function CU BM CA  AX CUD8D6D9CTBNCUCPD0D7CTCV. That is, our task is to determine whether a space should be put after a syllable DB</Paragraph>
    <Paragraph position="2"> context.</Paragraph>
    <Paragraph position="3"> The probabilistic method is one of the strong and most widely used methods for estimating CU. That is, for each DB  cording to the values of D2 in automatic word spacing. null where BV is a counting function. Determining the context size, the value of D2,in D2-gram models is closely related with the corpus size. The larger is D2, the larger corpus is required to avoid data sparseness. In contrast, though low-order D2-grams do not suffer from data sparseness severely, they do not reflect the language characteristics well, either. Typically researchers have used D2 BP BE or D2 BP BF, and achieved high performance in many tasks (Bengio et al., 2003). Figure 1 supports that bigram and trigram outperform low-order (D2 BP BD) and high-order (D2 AL BG) D2-grams in automatic word spacing. All the experimental settings for this figure follows those in Section 5. In this figure, bigram model shows the best accuracy and trigram achieves the second best, whereas unigram model results in the worst accuracy. Since the bigram model is best, a self-organizing D2-gram model explained below starts from bigram.</Paragraph>
  </Section>
  <Section position="9" start_page="634" end_page="636" type="metho">
    <SectionTitle>
4 Self-Organizing n-gram Model
</SectionTitle>
    <Paragraph position="0"> To tackle the problem of fixed window size in D2-gram models, we propose a self-organizing structure for them.</Paragraph>
    <Section position="1" start_page="634" end_page="635" type="sub_section">
      <SectionTitle>
4.1 Expanding n-grams
</SectionTitle>
      <Paragraph position="0"> When D2-grams are compared with B4D2B7BDB5-grams, their performance in many tasks is lower than that of B4D2 B7BDB5-grams (Charniak, 1993). Simultaneously the computational cost for B4D2 B7BDB5-grams is far higher than that for D2-grams. Thus, it can be justified to use B4D2 B7BDB5-grams instead of D2- null grams only when higher performance is expected.</Paragraph>
      <Paragraph position="1"> In other words, B4D2 B7BDB5-grams should be different from D2-grams. Otherwise, the performance would not be different. Since our task is attempted with a probabilistic method, the difference can be measured with conditional distributions. If the conditional distributions of D2-grams and B4D2 B7BDB5-grams are similar each other, there is no reason to adopt B4D2 B7BDB5-grams.</Paragraph>
      <Paragraph position="2">  how large D2-grams should be used. It recursively finds the optimal expanding window size. For instance, let bigrams (D2 BPBE) be used at first. When the difference between bigrams and trigrams (D2 BP</Paragraph>
    </Section>
    <Section position="2" start_page="635" end_page="636" type="sub_section">
      <SectionTitle>
4.2 Shrinking n-grams
</SectionTitle>
      <Paragraph position="0"> Shrinking D2-grams is accomplished in the direction opposite to expanding D2-grams. After comparing D2-grams with B4D2A0BDB5-grams, B4D2A0BDB5-grams are used instead of D2-grams only when they are similar enough. The difference BWB4D2BND2 A0 BDB5 between D2-grams and B4D2A0 BDB5-grams is, once again, measured by Kullback-Leibler divergence. That  shrinking window size, but can not be further reduced when the current model is an unigram.</Paragraph>
      <Paragraph position="1"> The merit of shrinking D2-grams is that it can construct a model with a lower dimensionality.</Paragraph>
      <Paragraph position="2"> Since the maximum likelihood estimate is used in calculating probabilities, this helps obtaining stable probabilities. According to the well-known curse of dimensionality, the data density required is reduced exponentially by reducing dimensions.</Paragraph>
      <Paragraph position="3"> Thus, if the lower dimensional model is not different so much from the higher dimensional one, it is highly possible that the probabilities from lower dimensional space are more stable than those from higher dimensional space.</Paragraph>
    </Section>
    <Section position="3" start_page="636" end_page="636" type="sub_section">
      <SectionTitle>
4.3 Overall Self-Organizing Structure
</SectionTitle>
      <Paragraph position="0"> For a given i.i.d. sample DC</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="636" end_page="636" type="metho">
    <SectionTitle>
CX
</SectionTitle>
    <Paragraph position="0"> , there are three possibilities on changing D2-grams. First one is not to change D2-grams. It is obvious when D2-grams are not changed. This occurs when both BWB4D2BND2B7</Paragraph>
    <Paragraph position="2"> are met.</Paragraph>
    <Paragraph position="3"> This is when the expanding results in too similar distribution to that of the current D2-grams and the distribution after shrinking is too different from that of the current D2-grams.</Paragraph>
    <Paragraph position="4"> The remaining possibilities are then expanding and shrinking. The application order between them can affect the performance of the proposed method. In this paper, an expanding is checked prior to a shrinking as shown in Figure 4. The function ChangingWindowSize first calls HowLargeExpand. The non-zero return value of HowLargeExpand implies that the window size of the current D2-grams should be enlarged. Otherwise, ChangingWindowSize checks if the window size should be shrinked by calling HowSmall-Shrink.IfHowSmallShrink returns a negative integer, the window size should be shrinked to (D2 + shr). If both functions return zero, the window size should not be changed.</Paragraph>
    <Paragraph position="5"> The reason why HowLargeExpand is called prior to HowSmallShrink is that the expanded D2-grams handle more specific data. (D2 B7BD)-grams, in general, help obtaining higher accuracy than D2grams, since (D2 B7BD)-gram data are more specific than D2-gram ones. However, it is time-consuming to consider higher-order data, since the number of kinds of data increases. The time increased due to expanding is compensated by shrinking. After shrinking, only lower-oder data are considered, and then processing time for them decreases.</Paragraph>
    <Section position="1" start_page="636" end_page="636" type="sub_section">
      <SectionTitle>
4.4 Sequence Tagging
</SectionTitle>
      <Paragraph position="0"> Since natural language sentences are sequential as their nature, the word spacing can be considered as a special POS tagging task (Lee et al., 2002) for which a hidden Markov model is usually adopted.</Paragraph>
      <Paragraph position="1"> The best sequence of word spacing for the sen- null Since this equation follows Markov assumption, the best sequence is found by applying the Viterbi algorithm.</Paragraph>
    </Section>
  </Section>
  <Section position="11" start_page="636" end_page="637" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="636" end_page="637" type="sub_section">
      <SectionTitle>
5.1 Data Set
</SectionTitle>
      <Paragraph position="0"> The data set used in this paper is the HANTEC corpora version 2.0 distributed by KISTI  . From this corpus, we extracted only the HKIB94 part which consists of 22,000 news articles in 1994 from Hankook Ilbo. The reason why HKIB94 is chosen is that the word spacing of news articles is relatively more accurate than other texts. Even though this data set is composed of totally 12,523,688 Korean syllables, the number of unique syllables is just  ods for automatic word spacing.</Paragraph>
      <Paragraph position="1"> 2,037 after removing all special symbols, digits, and English alphabets.</Paragraph>
      <Paragraph position="2"> The data set is divided into three parts: training (70%), held-out (20%), and test (10%). The held-out set is used only to estimate AI EXP and</Paragraph>
    </Section>
  </Section>
  <Section position="12" start_page="637" end_page="637" type="metho">
    <SectionTitle>
AI
SHR
</SectionTitle>
    <Paragraph position="0"> . The number of instances in the training set is 8,766,578, that in the held-out set is 2,504,739, and that in test set is 1,252,371. Among the 1,252,371 test cases, the number of positive instances is 348,278, and that of negative instances is 904,093. Since about 72% of test cases are negative, this is the baseline of the automatic word spacing.</Paragraph>
    <Section position="1" start_page="637" end_page="637" type="sub_section">
      <SectionTitle>
5.2 Experimental Results
</SectionTitle>
      <Paragraph position="0"> To evaluate the performance of the proposed method, two well-known machine learning algorithms are compared together. The tested machine learning algorithms are (i) decision tree and (ii) support vector machines. We use C4.5 release 8 (Quinlan, 1993) for decision tree induction and</Paragraph>
      <Paragraph position="2"> (Joachims, 1998) for support vector machines. For all experiments with decision trees and support vector machines, the context size is set to two since the bigram shows the best performance in Figure 1.</Paragraph>
      <Paragraph position="3"> Table 1 gives the experimental results of various methods including machine learning algorithms and self-organizing D2-gram model. The 'self-organizing bigram' in this table is the one proposed in this paper. The normal D2-grams achieve an accuracy of around 88%, while decision tree and support vector machine produce that of around 89%. The self-organizing D2-gram model achieves 91.31%. The accuracy improvement by the self-organizing D2-gram model is about 19% over the baseline, about 3% over the normal D2-gram model, and 2% over decision trees and support vector machines. null In order to organize the context size for D2-grams  cation order of context expanding and shrinking. online, two operations of expanding and shrinking were proposed. Table 2 shows how much the number of errors is affected by their application order. The number of errors made by expanding first is 108,831 while that by shrinking first is 114,343. That is, if shrinking is applied ahead of expanding, 5,512 additional errors are made. Thus, it is clear that expanding should be considered first.</Paragraph>
      <Paragraph position="4"> The errors by expanding can be explained with two reasons: (i) the expression power of the model and (ii) data sparseness. Since Korean is a partially-free word order language and the omission of words are very frequent, D2-gram model that captures local information could not express the target task sufficiently. In addition, the class-conditional distribution after expanding could be very different from that before expanding due to data sparseness. In such cases, the expanding should not be applied since the distribution after expanding is not trustworthy. However, only the difference between two distributions is considered in the proposed method, and the errors could be made by data sparseness.</Paragraph>
      <Paragraph position="5"> Figure 5 shows that the number of training instances does not matter in computing probabilities of D2-grams. Even though the accuracy increases slightly, the accuracy difference after 900,000 instances is not significant. It implies that the errors made by the proposed method is not from the lack of training instance but from the lack of its expression power for the target task. This result also complies with Figure 1.</Paragraph>
    </Section>
    <Section position="2" start_page="637" end_page="637" type="sub_section">
      <SectionTitle>
5.3 Effect of Right Context
</SectionTitle>
      <Paragraph position="0"> All the experiments above considered left context only. However, Kang reported that the probabilistic model using both left and right context outperforms the one that uses left context only (Kang, 2004). In his work, the word spacing probabil- null are computed respectively based on the syllable frequency.</Paragraph>
      <Paragraph position="1"> In order to reflect the idea of bidirectional context in the proposed model, the model is enhanced  of which values are determined using a held-out data.</Paragraph>
      <Paragraph position="2"> The change of accuracy by the context is shown in Table 3. When only the right context is used, the accuracy gets 88.26% which is worse than the left context only. That is, the original D2-gram is a relatively good model. However, when both left and right context are used, the accuracy becomes 92.54%. The accuracy improvement by using additional right context is 1.23%. This results coincide with the previous report (Lee et al., 2002). The AB</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>