<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1004">
  <Title>Discriminative Hidden Markov Modeling with Long State Dependence using a kNN Ensemble</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> A Hidden Markov Model (HMM) is a model where a sequence of observations is generated in addition to the Markov state sequence. It is a latent variable model in the sense that only the observation sequence is known while the state sequence remains &amp;quot;hidden&amp;quot;. In recent years, HMMs have enjoyed great success in many tagging applications, most notably part-of-speech (POS) tagging (Church 1988; Weischedel et al 1993; Merialdo 1994) and named entity recognition (Bikel et al 1999; Zhou et al 2002).</Paragraph>
    <Paragraph position="1"> Moreover, there have been also efforts to extend the use of HMMs to word sense disambiguation (Segond et al 1997) and shallow/full parsing (Brants et al 1997; Skut et al 1998; Zhou et al 2000).</Paragraph>
    <Paragraph position="2"> Traditionally, a HMM segments and labels sequential data in a generative way, assigning a joint probability to paired observation and state sequences. More formally, a generative (first-order) HMM (GHMM) is given by a finite set of states including an designated initial state and an designated final state, a set of possible observation , two conditional probability distributions: a state transition model from s to , for and an output model, for  . A sequence of observations is generated by starting from the designated initial state, transmiting to a new state according to , emitting an observation selected by that new state according to p , transmiting to another new state and so on until the designated final state is generated.</Paragraph>
    <Paragraph position="4"> There are several problems with this generative approach. First, many tasks would benefit from a richer representation of observations--in particular a representation that describes observations in terms of many overlapping features, such as capitalization, word endings, part-of-speech in addition to the traditional word identity. Note that these features always depends on each other.</Paragraph>
    <Paragraph position="5"> Furthermore, to define a joint probability over the observation and state sequences, the generative approach needs to enumerate all the possible observation sequences. However, in some tasks, the set of all the possible observation sequences is not reasonably enumerable. Second, the generative approach fails to effectively model the dependence in the observation sequence. Moreover, it is difficult for the generative approach to model the long state dependence since it is not reasonably practical for ngram modeling(e.g. bigram for the first-order GHMM and trigram for the secnodorder GHMM) to be beyond trigram. Third, the generative approach normally estimates the parameters to maximize the likelihood of the observation sequence. However, in many NLP tasks, the goal is to predict the state sequence given the observation sequence. In other words, the generative approach inappropriately applies a generative joint probability model for a conditional probability problem. In summary, the main reasons behind these problems of the generative approach are the strong context independent assumption and the generative nature in modeling sequential data.</Paragraph>
    <Paragraph position="6"> While the dependence between successive states can be directly modeled by its state transition model, the generative approach fails to directly capture the observation dependence in the output model. From this viewpoint, a GHMM can be also called an observation independent HMM.</Paragraph>
    <Paragraph position="7"> To resolve above problems in GHMMs, some researches have been done to move from the generative approach to the discriminative approach.</Paragraph>
    <Paragraph position="8"> Discriminative HMMs (DHMMs) do not expend modeling effort on the observation sequnce, which are fixed at test time. Instead, DHMMs model the state sequence depending on arbitrary, non-independent features of the observation sequence, normally without forcing the model to account for the distribution of those dependencies. Punyakanok and Roth (2000) proposed a projection-based DHMM (PDHMM) which represents the probability of a state transition given not only the current observation but also past and future observations and used the SNoW classifier (Roth 1998, Carlson et al 1999) to estimate it (SNoW-PDHMM thereafter). McCallum et al (2000) proposed the extact same model and used maximum entropy to estimate it (ME-PDHMM thereafter). Lafferty et al (2001) extanded ME-PDHMM using conditional random fields by incorporating the factored state representation of the same model (that is, representing the probability of a state given the observation sequence and the previous state) to alleviate the label bias problem in projection-based DHMMs, which can be biased towards states with few successor states (CRF-DHMM thereafter). Similar work can also be found in Bouttou (1991).</Paragraph>
    <Paragraph position="9"> Punyakanok and Roth (2000) also proposed a nonprojection-based DMM which separates the dependence of a state on the previous state and the observation sequence, by rewriting the GHMM in a discriminative way and heuristically extending the notation of an observation to the observation sequence. Zhou et al (2000) systematically derived the exact same model as in Punyakanok and Roth (2000) and used back-off modeling to esimate the probability of a state given the observation sequence (Backoff-DHMM thereafter) while Punyakanok and Roth (2000) used the SNoW classifier to estimate it(SNoW-DHMM thereafter).</Paragraph>
    <Paragraph position="10"> This paper follows our previous work in Zhou et al (2000) and proposes an alternative nonprojection-based DHMM with long state dependence (LSD-DHMM), which separates the dependence of a state on the previous states and the observation sequence. Moreover, a variable-length mutual information based modeling approach (VLMI) is proposed to capture the long state dependence of a state on the previous states.</Paragraph>
    <Paragraph position="11"> In addition, an ensemble of kNN probability estimators is proposed to capture the observation dependence of a state on the observation sequence.</Paragraph>
    <Paragraph position="12"> Experimentation shows that VLMI effectively captures the long state dependence. It also shows that the kNN ensemble captures the dependence between the features of the observation sequence more effectively than classifier-based approaches, by forcing the model to account for the distribution of those dependencies.</Paragraph>
    <Paragraph position="13"> The layout of this paper is as follows. Section 2 first proposes the LSD-DHMM and then presents the VLMI to capture the long state dependence.</Paragraph>
    <Paragraph position="14"> Section 3 presents the kNN probability estimator to capture the observation dependence while Section 4 presents the kNN ensemble. Section 5 introduces shallow parsing, while experimental results are given in Section 6. Finally, some conclusion will be drawn in Section 7.</Paragraph>
  </Section>
class="xml-element"></Paper>