<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1724">
  <Title>Integrating Ngram Model and Case-based Learning For Chinese Word Segmentation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Ngram model and training
</SectionTitle>
    <Paragraph position="0"> An ngram model can be utilized to nd the most probable segmentation of a sentence. Given a Chinese sentence s = c1c2 cm (also denoted as cn1 ), its probabilistic segmentation into a word sequence w1w2 wk (also denoted as wk1) with the aid of an ngram model can be formulated as</Paragraph>
    <Paragraph position="2"> where denotes string concatenation, wi 1i n+1 the context (or history) of wi, and n is the order of the ngram model in use. We have opted for uni-gram for the sake of simplicity. Accordingly, p(wijwi 1i n+1) in (1) becomes p(wi), which is commonly estimated as follows, given a corpus C for training.</Paragraph>
    <Paragraph position="4"> In order to estimate a reliable p(wi), the ngram model needs to be trained with the EM algorithm using the available training corpus. Each EM iteration aims at approaching to a more reliable f(w) for estimating p(w), as follows:</Paragraph>
    <Paragraph position="6"> where k denotes the current iteration, S(s) the set of all possible segmentations for s, and f k(w 2s0) the occurrences of w in a particular segmentation s0.</Paragraph>
    <Paragraph position="7"> However, assuming that every sentence always has a segmentation, the following equation holds:</Paragraph>
    <Paragraph position="9"> Accordingly, we can adjust (3) as (5) with a normalization factor = Ps02S(s) pk(s0), to avoid favoring words in shorter sentences too much. In general, shorter sentences have higher probabilities.</Paragraph>
    <Paragraph position="11"> Following the conventional idea to speed up the EM training, we turned to the Viterbi algorithm. The underlying philosophy is to distribute more probability to more probable events. The Viterbi segmentation, by utilizing dynamic programming techniques to go through the word trellis of a sentence ef ciently, nds the most probable segmentation under the current parameter estimation of the language model, ful lling (1)). Accordingly, (6) becomes</Paragraph>
    <Paragraph position="13"> where the normalization factor is skipped, for only the Viterbi segmentation is used for EM reestimation. Equation (7) makes the EM training with the Viterbi algorithm very simple for the uni-gram model: iterate word segmentation, as (1), and word count updating, via (7), sentence by sentence through the training corpus until there is a convergence. null Since the EM algorithm converges to a local maxima only, it is critical to start the training with an initial f0(w) for each word not too far away from its true value. Our strategy for initializing f 0(w) is to assume all possible words in the training corpus as equiprobable and count each of them as 1; and then p0(w) is derived using (2). This strategy is supposed to have a weaker bias to favor longer words than maximal matching segmentation.</Paragraph>
    <Paragraph position="14"> For the bakeoff, the ngram model is trained with the unsegmented training corpora together with the test sets. It is a kind of unsupervised training.</Paragraph>
    <Paragraph position="15"> Adding the test set to the training data is reasonable, to allow the model to have necessary adaptation towards the test sets. Experiments show that the training converges very fast, and the segmentation performance improves signi cantly from iteration to iteration. For the bakeoff experiments, we carried out the training in 6 iterations, because more iterations than this have not been observed to bring any significant improvement on segmentation accuracy to the training sets.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Case-based learning for disambiguation
</SectionTitle>
    <Paragraph position="0"> No matter how well the language model is trained, probabilistic segmentation cannot avoid mistakes on ambiguous strings, although it resolves most ambiguities by virtue of probability. For the remaining unresolved ambiguities, however, we have to resort to other strategies and/or resources. Our recent study (Kit et al., 2002) shows that case-based learning is an effective approach to disambiguation.</Paragraph>
    <Paragraph position="1"> The basic idea behind the case-based learning is to utilize existing resolutions for known ambiguous strings to do disambiguation if similar ambiguities occur again. This learning strategy can be implemented in two straightforward steps:  1. Collection of correct answers from the training corpus for ambiguous strings together with their contexts, resulting in a set of context-dependent transformation rules; 2. Application of appropriate rules to ambiguous  strings.</Paragraph>
    <Paragraph position="2"> A transformation rule of this type is actually an example of segmentation, indicating how an ambiguous string is segmented within a particular context. It has the following general form:</Paragraph>
    <Paragraph position="4"> where is the ambiguous string, Cl and Cr its left and right contexts, respectively, and w1 w2 wk the correct segmentation of given the contexts.</Paragraph>
    <Paragraph position="5"> In our implementation, we set the context length on each side to two words.</Paragraph>
    <Paragraph position="6"> For a particular ambiguity, the example with the most similar context in the example (or, rule) base is applied. The similarity is measured by the sum of the length of the common suf x and pre x of, respectively, the left and right contexts. The details of computing this similarity can be found in (Kit et al., 2002) . If no rule is applicable, its probabilistic segmentation is retained.</Paragraph>
    <Paragraph position="7"> For the bakeoff, we have based our approach to ambiguity detection and disambiguation rule extraction on the assumption that only ambiguous strings cause mistakes: we detect the discrepancies of our probabilistic segmentation and the standard segmentation of the training corpus, and turn them into transformation rules. An advantage of this approach is that the rules so derived carry out not only disambiguation but also error correction. This links our disambiguation strategy to the application of Brill's (1993) transformation-based error-driven learning to Chinese word segmentation (Palmer, 1997; Hockenmaier and Brew, 1998).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 System architecture
</SectionTitle>
    <Paragraph position="0"> The overall architecture of our word segmentation system is presented in Figure 1.</Paragraph>
  </Section>
class="xml-element"></Paper>