<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1655">
  <Title>A Hybrid Markov/Semi-Markov Conditional Random Field for Sequence Segmentation</Title>
  <Section position="4" start_page="465" end_page="469" type="metho">
    <SectionTitle>
2 Hybrid Markov/Semi-Markov CRF
</SectionTitle>
    <Paragraph position="0"> The model we describe is formally a type of semi-Markov CRF, distinguished only in that it also involves CRF-style features. So we first describe the semi-Markov model in its general form.</Paragraph>
    <Section position="1" start_page="465" end_page="466" type="sub_section">
      <SectionTitle>
2.1 Semi-Markov CRF
</SectionTitle>
      <Paragraph position="0"> An (unlabeled) semi-Markov conditional random field is a log-linear model defining the conditional probability of a segmentation given an observation sequence. The general form of a log-linear model is as follows: given an input x [?] X, an output y[?]Y , a feature mapping Ph : XxY mapsto-Rn, and a weight vector w, the conditional probability of y given x is estimated as:</Paragraph>
      <Paragraph position="2"> is typically chosen to maximize the conditional likelihood of a labeled training set. In the word 2We assume the weight assigned to the log-probability feature is positive.</Paragraph>
      <Paragraph position="3">  segmentation task, x is an ordered sequence of characters (x1,x2,...,xn), and y is a set of indices corresponding to the start of each word: {y1,y2,...,ym} such that y1 = 1, ym [?] n, and for all j, yj &lt; yj+1. A log-linear model in this space is an order-1 semi-CRF if its feature map Ph decomposes according to</Paragraph>
      <Paragraph position="5"> where phS is a local feature map that only considers one chunk at a time (defining ym+1 = n+1). This decomposition is responsible for the characteristic independence assumptions of the semi-CRF.</Paragraph>
      <Paragraph position="6"> Hand-in-hand with the feature decomposition and independence assumptions comes the capacity for exact decoding using the Viterbi algorithm, and exact computation of the objective gradient using the forward-backward algorithm, both in time quadratic in the lengths of the sentences.</Paragraph>
      <Paragraph position="7"> Furthermore, if the model is constrained to propose only chunkings with maximum word length k, then the time for inference and training becomes linear in the sentence length (and in k). For Chinese word segmentation, choosing a moderate value of k does not pose any significant risk, since the vast majority of Chinese words are only a few characters long: in our training set, 91% of word tokens were one or two characters, and 99% were five characters or less.</Paragraph>
      <Paragraph position="8"> Using a semi-CRF as opposed to a traditional Markov CRF allows us to model some aspects of word segmentation that one would expect to be very informative. In particular, it makes possible the use of local indicator function features of the type &amp;quot;the chunk consists of character sequence kh1,...,khlscript,&amp;quot; or &amp;quot;the chunk is of length lscript.&amp;quot; It also enables &amp;quot;pseudo-bigram language model&amp;quot; features, firing when a given word occurs in the context of a given character unigram or bigram.3 And crucially, although it is slightly less natural to do so, any feature used in an order-1 Markov CRF can also be represented in a semi-CRF. As Markov CRFs are used in the most competitive Chinese word segmentation models to date, one might expect that incorporating both types of features could yield a superior model.</Paragraph>
      <Paragraph position="9"> 3We did not experiment with this type of feature.</Paragraph>
    </Section>
    <Section position="2" start_page="466" end_page="467" type="sub_section">
      <SectionTitle>
2.2 CRF vs. Semi-CRF
</SectionTitle>
      <Paragraph position="0"> In order to compare the two types of linear CRFs, it is convenient to define a representation of the segmentation problem in terms of character labels as opposed to sets of whole words. Denote by L(y) [?]{B,C}n (for BEGIN vs. CONTINUATION) the sequence {L1,L2,...Ln} of labels such that Li = B if and only if yi [?]y. It is clear that if we constrain L1 = B, the two representations y and L(y) are equivalent. An order-1 Markov CRF is a log-linear model in which the global feature vector Ph decomposes into a sum over local feature vectors that consider bigrams of the label sequence:</Paragraph>
      <Paragraph position="2"> (where Ln+1 is defined as B). The local features that are most naturally expressed in this context are indicators of some joint event of the label bi-gram (Li,Li+1) and nearby characters in x. For example, one might use the feature &amp;quot;the current character xi is kh and Li = C&amp;quot;, or &amp;quot;the current and next characters are identical and Li = Li+1 = B.&amp;quot; Although we have heretofore disparaged the CRF as being incapable of representing such powerful features as word identity, the type of features that it most naturally represents should be helpful in CWS for generalizing to unseen words. For example, the first feature mentioned above could be valuable to rule out certain word boundaries if kh were a character that typically occurs only as a suffix but that combines freely with a variety of root forms to create new words. This type of feature (specifically, a feature indicating the absence as opposed to the presence of a chunk boundary) is a bit less natural in a semi-CRF, since in that case local features phS(yj,yj+1,x) are defined on pairs of adjacent boundaries. Information about which tokens are not on boundaries is only implicit, making it a bit more difficult to incorporate that information into the features. Indeed, neither Liang (2005) nor Sarawagi and Cohen (2004) nor any other system using a semi-Markov CRF on any task has included this type of feature to our knowledge. We hypothesize (and our experiments confirm) that the lack of this feature explains the failure of the semi-CRF to outperform the CRF for word segmentation in the past.</Paragraph>
      <Paragraph position="3"> Before showing how CRF-type features can be used in a semi-CRF, we first demonstrate that the semi-CRF is indeed strictly more expressive than  the CRF, meaning that any global feature map Ph that decomposes according to (2) also decomposes according to (1). It is sufficient to show that for any feature map PhM of a Markov CRF, there exists a semi-Markov-type feature map PhS such that for</Paragraph>
      <Paragraph position="5"> To this end, note that there are only four possible label bigrams: BB, BC, CB, and CC. As a direct result of the definition of L(y), we have that</Paragraph>
      <Paragraph position="7"> length one begins at i, or equivalently, there exists a word j such that yj = i and yj+1[?]yj = 1. Similarly, (Li,Li+1) = (B,C) if and only if some word of length &gt; 1 begins at i, etc. Using these conditions, we can define phS to satisfy equation 3 as follows:</Paragraph>
      <Paragraph position="9"> otherwise. Defined thus,summationtextmj=1 phS will contain exactly n phM terms, corresponding to the n label bigrams.4 null</Paragraph>
    </Section>
    <Section position="3" start_page="467" end_page="468" type="sub_section">
      <SectionTitle>
2.3 Order-1 Markov Features in a Semi-CRF
</SectionTitle>
      <Paragraph position="0"> While it is fairly intuitive that any feature used in a 1-CRF can also be used in a semi-CRF, the above argument reveals an algorithmic difficulty that is likely another reason that such features are not typically used. The problem is essentially an effect of the sum for CC label bigrams in (4): quadratic time training and decoding assumes that the features of each chunk phS(yj,yj+1,x) can be multiplied with the weight vector w in a number of operations that is roughly constant over all chunks, 4We have discussed the case of Markov order-1, but the argument can be generalized to show that an order-M CRF has an equivalent representation as an order-M semi-CRF, for any M.</Paragraph>
      <Paragraph position="2"> scores sab with 1-CRF-type features.</Paragraph>
      <Paragraph position="3"> but if one na&amp;quot;ively distributes the product over the sum, longer chunks will take proportionally longer to score, resulting in cubic time algorithms.5 In fact, it is possible to use these features without any asymptotic decrease in efficiency by means of a dynamic program. Both Viterbi and forward-backward involve the scores sab = w* phS(a,b,x). Suppose that before starting those algorithms, we compute and cache the score sab of each chunk, so that remainder the algorithm runs in quadratic time, as usual. This pre-computation can be done quickly if we first compute the values sCCi = w*phM(C,C,i,x), and use them to fill in the values of sab as shown in Figure 1.</Paragraph>
      <Paragraph position="4"> In addition, computing the gradient of the semi-CRF objective requires that we compute the expected value of each feature. For CRF-type features, this is tantamount to being able to compute the probability that each label bigram (Li,Li+1) takes any value. Assume that we have already run standard forward-backward inference so that we have for any (a,b) the probability that the subsequence (xa,xa+1,...,xb[?]1) segments as a chunk, P(chunk(a,b)). Computing the probability that (Li,Li+1) takes the values BB, BC or CB is simple to compute:</Paragraph>
      <Paragraph position="6"> 5Note that the problem would arise even if only zero-order Markov (label unigram) features were used, only in that case the troublesome features would be those that involved the label unigram C.</Paragraph>
      <Paragraph position="8"> but the same method of summing over chunks cannot be used for the value CC since for each label bigram there are quadratically many chunks corresponding to that value. In this case, the solution is deceptively simple: using the fact that for any given label bigram, the sum of the probabilities of the four labels must be one, we can deduce that</Paragraph>
      <Paragraph position="10"> One might object that features of the C and CC labels (the ones presenting algorithmic difficulty) are unnecessary, since under certain conditions, their removal would not in fact change the expressivity of the model or the distribution that maximizes training likelihood. This will indeed be the case when the following conditions are fulfilled:  1. All label bigram features are of the form</Paragraph>
      <Paragraph position="12"> for some label bigram a and predicate pred, and any such feature with a given predicate has variants for all four label bigrams a.</Paragraph>
      <Paragraph position="13"> 2. No regularization is used during training.</Paragraph>
      <Paragraph position="14"> A proof of this claim would require too much space for this paper, but the key is that, given a model satisfying the above conditions, one can obtain an equivalent model via adding, for each feature type over pred, some constant to the four weights corresponding to the four label bigrams, such that the CC bigram features all have weight zero.</Paragraph>
      <Paragraph position="15"> In practice, however, one or both of these conditions is always broken. It is common knowledge that regularization of log-linear models with a large number of features is necessary to achieve high performance, and typically in NLP one defines feature templates and chooses only those features that occur in some positive example in the training set. In fact, if both of these conditions are fulfilled, it is very likely that the optimal model will have some weights with infinite values. We conclude that it is not a practical alternative to omit the C and CC label features.</Paragraph>
    </Section>
    <Section position="4" start_page="468" end_page="469" type="sub_section">
      <SectionTitle>
2.4 Generative Features in a Discriminative
Model
</SectionTitle>
      <Paragraph position="0"> When using the output of a generative model as a feature in a discriminative model, Raina et al.</Paragraph>
      <Paragraph position="1"> (2004) provide a justification for the use of log conditional odds as opposed to log-probability: they show that using log conditional odds as features in a logistic regression model is equivalent to discriminatively training weights for the features of a Na&amp;quot;ive Bayes classifier to maximize conditional likelihood.6 They demonstrate that the resulting classifier, termed a &amp;quot;hybrid generative/discriminative classifier&amp;quot;, achieves lower test error than either pure Na&amp;quot;ive Bayes or pure logistic regression on a text classification task, regardless of training set size.</Paragraph>
      <Paragraph position="2"> The hybrid generative/discriminative classifier also uses a unique method for using the same data used to estimate the parameters of the component generative models for training the discriminative model parameters w without introducing bias. A &amp;quot;leave-one-out&amp;quot; strategy is used to choose w, whereby the feature values of the i-th training example are computed using probabilities estimated with the i-th example held out. The beauty of this approach is that since the probabilities are estimated according to (smoothed) relative frequency, it is only necessary during feature computation to maintain sufficient statistics and adjust them as necessary for each example.</Paragraph>
      <Paragraph position="3"> In this paper, we experiment with the use of a single &amp;quot;hybrid&amp;quot; local semi-CRF feature, the smoothed log conditional odds that a given sub-</Paragraph>
      <Paragraph position="5"> where wordcount(xab) is the number of times xab forms a word in the training set, and nonwordcount(xab) is the number of times xab occurs, not segmented into a single word. The models we test are not strictly speaking hybrid generative/discriminative models, since we also use indicator features not derived from a generative model. We did however use the leave-one-out approach for computing the log conditional odds feature during training.</Paragraph>
      <Paragraph position="6"> 6In fact, one more step beyond what is shown in that paper is required to reach the stated conclusion, since their features are not actually log conditional odds, but log P(x|y)P(x|!y). It is simple to show that in the given context this feature is equivalent to log conditional odds.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="469" end_page="469" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> To test the ideas discussed in this paper, we compared the performance of semi-CRFs using various feature sets on a Chinese word segmentation task. The data used was the Microsoft Research Beijing corpus from the Second International Chinese Word Segmentation Bakeoff (Emerson, 2005), and we used the same train/test split used in the competition. The training set consists of 87K sentences of Beijing dialect Chinese, hand segmented into 2.37M words. The test set contains 107K words comprising roughly 4K sentences.</Paragraph>
    <Paragraph position="1"> We used a maximum word length k of 15 in our experiments, which accounted for 99.99% of the word tokens in our training set. The 249 training sentences that contained words longer than 15 characters were discarded. We did not discard any test sentences.</Paragraph>
    <Paragraph position="2"> In order to be directly comparable to the Bakeoff results, we also worked under the very strict &amp;quot;closed test&amp;quot; conditions of the Bakeoff, which require that no information or data outside of the training set be used, not even prior knowledge of which characters represent Arabic numerals, Latin characters or punctuation marks.</Paragraph>
    <Section position="1" start_page="469" end_page="469" type="sub_section">
      <SectionTitle>
3.1 Features Used
</SectionTitle>
      <Paragraph position="0"> We divide our main features into two types according to whether they are most naturally used in a CRF or a semi-CRF.</Paragraph>
      <Paragraph position="1"> The CRF-type features are indicator functions that fire when the character label (or label bigram) takes some value and some predicate of the input at a certain position relative to the label is satisfied. For each character label unigram L at position i, we use the same set of predicate templates checking:</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="469" end_page="470" type="metho">
    <SectionTitle>
ABAB sequence for j = (i−4)...i
</SectionTitle>
    <Paragraph position="0"> The latter four feature templates are designed to detect character or word reduplication, a morphological phenomenon that can influence word segmentation in Chinese. The first two of these were also used by Tseng et al. (2005).</Paragraph>
    <Paragraph position="1"> For label bigrams (Li,Li+1), we use the same templates, but extending the range of positions by one to the right.7 Each label uni- or bigram also has a &amp;quot;prior&amp;quot; feature that always fires for that label configuration. All configurations contain the above features for the label unigram B, since these are easily used in either a CRF or semi-CRF model. To determine the influence of CRF-type features on performance, we also test configurations in which both B and C label features are used, and configurations using all label uni- and bigrams.</Paragraph>
    <Paragraph position="2"> In the semi-Markov conditions, we also use as feature templates indicators of the length of a word lscript, for lscript = 1...k, and indicators of the identity of the corresponding character sequence.</Paragraph>
    <Paragraph position="3"> All feature templates were instantiated with values that occur in positive training examples. We found that excluding CRF-type features that occur only once in the training set consistently improved performance on the development set, so we use a count threshold of two for the experiments. We do not do any thresholding of the semi-CRF features, however.</Paragraph>
    <Paragraph position="4"> Finally, we use the single generative feature, log conditional odds that the given string forms a word. We also present results using the more typical log conditional probability instead of the odds, for comparison. In fact, these are both semi-Markov-type features, but we single them out to determine what they contribute over and above the other semi-Markov features.</Paragraph>
    <Section position="1" start_page="469" end_page="470" type="sub_section">
      <SectionTitle>
3.2 Results
</SectionTitle>
      <Paragraph position="0"> The results of test set runs are summarized in table 3.2. The columns indicate which CRF-type features were used: features of only the label B, features of label unigrams B and C, or features of all label unigrams and bigrams. The rows indicate which semi-Markov-type features were used:  dices are chosen so that the feature set exhibits no asymmetry with respect to direction: for each feature considering some boundary and some property of the character(s) at a given offset to the left, there is a corresponding feature considering that boundary and the same property of the character(s) at the same offset to the right, and vice-versa.</Paragraph>
      <Paragraph position="1">  figurations.</Paragraph>
      <Paragraph position="2"> &amp;quot;semi&amp;quot; means length and word identity features were used, &amp;quot;prob&amp;quot; means the log-probability feature was used, and &amp;quot;odds&amp;quot; means the log-odds feature was used.</Paragraph>
      <Paragraph position="3"> To establish the impact of each type of feature (C label unigrams, label bigrams, semi-CRF-type features, and the log-odds feature), we look at the reduction in error brought about by adding each type of feature. First consider the effect of the CRF-type features. Adding the C label features reduces error by 31% if no semi-CRF features are used, by 16% when semi-CRF indicator features are turned on, and by 13% when all semi-CRF features (including log-odds) are used. Using all label bigrams reduces error by 44%, 25%, and 15% in these three conditions, respectively.</Paragraph>
      <Paragraph position="4"> Contrary to previous conclusions, our results show a significant impact due to the use of semi-CRF-type features, when CRF-type features are held constant. Adding semi-CRF indicator features results in a 38% error reduction without CRF-type features, and 18% with them. Adding semi-CRF indicator features plus the log-odds feature gives 52% and 27% in these two conditions, respectively.</Paragraph>
      <Paragraph position="5"> Finally, across configurations, the log conditional odds does much better than log conditional probability. When the log-odds feature is added to the complete CRF model (uni+bi) as the only semi-CRF-type feature, errors are reduced by 24%, compared to only 7.6% for the logprobability. Even when the other semi-CRF-type features are present as well, log-odds reduces error by 13% compared to 2.5% for log-probability.</Paragraph>
      <Paragraph position="6"> Our best model, combining all features, resulted in an error reduction of 12% over the highest score on this dataset from the 2005 Sighan closed test competition (96.4%), achieved by the pure CRF system of Tseng et al. (2005).</Paragraph>
    </Section>
    <Section position="2" start_page="470" end_page="470" type="sub_section">
      <SectionTitle>
3.3 Discussion
</SectionTitle>
      <Paragraph position="0"> Our results indicate that both Markov-type and semi-Markov-type features are useful for generalization to unseen data. This may be because the two types of features are in a sense complementary: semi-Markov-type features such as wordidentity are valuable for modeling the tendency of known strings to segment as words, while label based features are valuable for modeling properties of sub-lexical components such as affixes, helping to generalize to words that have not previously been encountered. We did not explicitly test the utility of CRF-type features for improving recall on out-of-vocabulary items, but we note that in the Bakeoff, the model of Tseng et al. (2005), which was very similar to our CRF-only system (only containing a few more feature templates), was consistently among the best performing systems in terms of test OOV recall (Emerson, 2005).</Paragraph>
      <Paragraph position="1"> We also found that for this sequence segmentation task, the use of log conditional odds as a feature results in much better performance than the use of the more typical log conditional probability. It would be interesting to see the log-odds applied in more contexts where log-probabilities are typically used as features. We have presented the intuitive argument that the log-odds may be advantageous because it does not exhibit the 0-1 asymmetry of the log-probability, but it would be satisfying to justify the choice on more theoretical grounds.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="470" end_page="471" type="metho">
    <SectionTitle>
4 Relation to Previous Work
</SectionTitle>
    <Paragraph position="0"> There is a significant volume of work exploring the use of CRFs for a variety of chunking tasks, including named-entity recognition, gene prediction, shallow parsing and others (Finkel et al., 2005; Culotta et al., 2005; Sha and Pereira, 2003). The current work indicates that these systems might be improved by moving to a semi-CRF model.</Paragraph>
    <Paragraph position="1"> There have not been a large number of studies using the semi-CRF, but the few that have been done found only marginal improvements over pure CRF systems (Sarawagi and Cohen, 2004; Liang, 2005; Daum'e III and Marcu, 2005). Notably, none of those studies experimented with features of chunk non-boundaries, as is achieved by the use of CRF-type features involving the label C, and we take this to be the reason for their not obtaining higher results.</Paragraph>
    <Paragraph position="2">  Although it has become fairly common in NLP to use the log conditional probabilities of events as features in a discriminative model, we are not aware of any work using the log conditional odds.</Paragraph>
  </Section>
class="xml-element"></Paper>