File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/w06-1655_intro.xml
Size: 6,821 bytes
Last Modified: 2025-10-06 14:03:58
<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1655">
<Title>A Hybrid Markov/Semi-Markov Conditional Random Field for Sequence Segmentation</Title>
<Section position="3" start_page="0" end_page="465" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> The problem of segmenting sequence data into chunks arises in many natural language applications, such as named-entity recognition, shallow parsing, and word segmentation in East Asian languages. Two popular discriminative models that have been proposed for these tasks are the conditional random field (CRF) (Lafferty et al., 2001) and the semi-Markov conditional random field (semi-CRF) (Sarawagi and Cohen, 2004).</Paragraph>
<Paragraph position="1"> A CRF in its basic form is a model for labeling tokens in a sequence; however, it can easily be adapted to perform segmentation by labeling each token as BEGIN or CONTINUATION, or according to some similar scheme. CRFs using this technique have been shown to be very successful at the task of Chinese word segmentation (CWS), starting with the model of Peng et al. (2004). In the Second International Chinese Word Segmentation Bakeoff (Emerson, 2005), two of the highest-scoring systems in the closed-track competition were based on a CRF model (Tseng et al., 2005; Asahara et al., 2005). While the CRF is quite effective compared with other models designed for CWS, one wonders whether it may be limited by its restrictive independence assumptions on non-adjacent labels: an order-M CRF satisfies the order-M Markov assumption that, globally conditioned on the input sequence, each label is independent of all other labels given the M labels to its left and right.</Paragraph>
<Paragraph position="2"> Consequently, the model only &quot;sees&quot; word boundaries within a moving window of M + 1 characters, which prohibits it from explicitly modeling the tendency of strings longer than that window to form words, or from modeling the lengths of the words. Although the window can in principle be widened by increasing M, this is not a practical solution, as the complexity of training and decoding a linear-sequence CRF grows exponentially with the Markov order.</Paragraph>
<Paragraph position="3"> The semi-CRF is a sequence model designed to address this difficulty via a careful relaxation of the Markov assumption. Rather than recasting the segmentation problem as a labeling problem, the semi-CRF directly models the distribution of chunk boundaries. (As it was originally described, the semi-CRF also assigns a label to each chunk, effectively performing joint segmentation and labeling, but in a pure segmentation problem such as CWS the use of labels is unnecessary.)</Paragraph>
<Paragraph position="4"> In terms of independence, using an order-M semi-CRF entails the assumption that, globally conditioned on the input sequence, the position of each chunk boundary is independent of all other boundaries given the positions of the M boundaries to its left and right, regardless of how far away they are. Even with an order-1 model, this enables several classes of features that one would expect to be of great utility to the word segmentation task, in particular word length and word identity.</Paragraph>
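To make the two views concrete, the following minimal Python sketch (not from the paper; the function names and the toy example are illustrative assumptions) contrasts the CRF reduction of segmentation to per-character BEGIN/CONTINUATION labels with the segment-level word-length and word-identity features that a semi-CRF can score directly.

```python
# Illustrative sketch only (not the paper's implementation): the two views of
# segmentation discussed above, with made-up helper names.

from typing import Dict, List


def segmentation_to_bc_labels(words: List[str]) -> List[str]:
    """CRF view: one label per character, B = BEGIN, C = CONTINUATION."""
    labels: List[str] = []
    for w in words:
        labels.append("B")
        labels.extend("C" * (len(w) - 1))
    return labels


def semicrf_segment_features(chars: str, start: int, end: int) -> Dict[str, float]:
    """Semi-CRF view: features of the whole candidate segment chars[start:end],
    e.g. word length and word identity, which a token-labeling CRF with a small
    Markov window cannot express directly."""
    segment = chars[start:end]
    return {
        f"length={len(segment)}": 1.0,   # word-length feature
        f"identity={segment}": 1.0,      # word-identity feature
    }


if __name__ == "__main__":
    # Toy Romanized example; real CWS operates on Chinese characters.
    words = ["wo", "xihuan", "ni"]
    sentence = "".join(words)
    print(segmentation_to_bc_labels(words))           # ['B', 'C', 'B', 'C', 'C', 'C', 'C', 'C', 'B', 'C']
    print(semicrf_segment_features(sentence, 2, 8))   # features for the segment "xihuan"
```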
<Paragraph position="5"> Despite this, the only work of which we are aware exploring the use of a semi-Markov CRF for Chinese word segmentation did not find significant gains over the standard CRF (Liang, 2005). This is surprising, not only because the additional features a semi-CRF enables are intuitively very useful, but because, as we will show, an order-M semi-CRF is strictly more powerful than an order-M CRF, in the sense that any feature that can be used in the latter can also be used in the former, or, equivalently, the semi-CRF makes strictly weaker independence assumptions. Given a judicious choice of features (or simply enough training data), the semi-CRF should be superior.</Paragraph>
<Paragraph position="6"> We propose that the reason for this discrepancy may be that, despite the greater representational power of the semi-CRF, there are some valuable features that are more naturally expressed in a CRF segmentation model, and so they are not typically included in semi-CRFs (indeed, to our knowledge they have not to date been used in any semi-CRF model for any task). In this paper, we show that semi-CRFs are strictly more expressive, and we also demonstrate how CRF-type features can be used in a semi-CRF model for Chinese word segmentation. Our experiments show that a model incorporating both types of features can outperform models using only one or the other type.</Paragraph>
<Paragraph position="7"> Orthogonally, we explore in this paper the use of a very powerful feature for the semi-CRF derived from a generative model.</Paragraph>
<Paragraph position="8"> It is common in statistical NLP to use as features in a discriminative model the (logarithm of the) estimated probability of some event according to a generative model. For example, Collins (2000) uses a discriminative classifier to choose among the top N parse trees output by a generative baseline model, and uses the log-probability of a parse according to the baseline model as a feature in the reranker. Similarly, the machine translation system of Och and Ney uses log-probabilities of phrasal translations and other events as features in a log-linear model (Och and Ney, 2002; Och and Ney, 2004). There are many reasons for incorporating these types of features, including the desire to combine the higher accuracy of a discriminative model with the simple parameter estimation and inference of a generative one, and the fact that generative models are more robust in data-sparse scenarios (Ng and Jordan, 2001).</Paragraph>
<Paragraph position="9"> For word segmentation, one might want to use as a local feature the log-probability that a segment is a word, given the character sequence it spans. A curious property of this feature is that it induces a counterintuitive asymmetry between the is-word and is-not-word cases: the component generative model can effectively dictate that a certain chunk is not a word by assigning it a very low probability (driving the feature value to negative infinity), but it cannot dictate that a chunk is a word, because the log-probability is bounded above. If instead the log conditional odds $\log \frac{P(y \mid x)}{P(\neg y \mid x)}$ is used, the asymmetry disappears. We show that such a log-odds feature provides much greater benefit than the log-probability, and that it is useful to include such a feature even when the model also includes indicator-function features for every word in the training corpus.</Paragraph>
</Section>
</Paper>
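To illustrate the asymmetry described in the final paragraph, here is a small Python sketch (illustrative only; the probability values are made up rather than produced by an actual generative component): the log-probability feature is bounded above by zero, so it can only veto a word, whereas the log conditional odds can swing strongly in either direction.

```python
# Illustrative sketch only (not the paper's code): log-probability vs.
# log conditional odds as segment features.

import math


def log_prob_feature(p_word: float) -> float:
    """log P(segment is a word | characters): bounded above by 0."""
    return math.log(p_word)


def log_odds_feature(p_word: float) -> float:
    """log [P(word | x) / P(not-word | x)]: unbounded in both directions."""
    return math.log(p_word) - math.log(1.0 - p_word)


if __name__ == "__main__":
    for p in (1e-6, 0.5, 1.0 - 1e-6):   # made-up probabilities for a candidate segment
        print(f"p={p:.6f}  log-prob={log_prob_feature(p):8.2f}  "
              f"log-odds={log_odds_feature(p):8.2f}")
    # A tiny p drives both features toward -inf (the model can veto a word),
    # but only the log-odds can become strongly positive to assert word-hood;
    # the log-probability never exceeds 0.
```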