<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1004">
  <Title>Discriminative Hidden Markov Modeling with Long State Dependence using a kNN Ensemble</Title>
  <Section position="3" start_page="0" end_page="111" type="metho">
    <SectionTitle>
2. LSD-DHMM: Discriminative HMM with
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="111" type="sub_section">
      <SectionTitle>
Long State Dependence
</SectionTitle>
      <Paragraph position="0"> In principle, given an observation sequence , the goal of a conditional probability model is to find a stochastic optimal state sequence s that maximizes</Paragraph>
      <Paragraph position="2"/>
      <Paragraph position="4"/>
      <Paragraph position="6"> Obviously, the second term MI captures the mutual information between the state sequence and the observation sequence o . To compute efficiently, we propose a novel mutual information independence assumption: ),(</Paragraph>
      <Paragraph position="8"/>
      <Paragraph position="10"/>
      <Paragraph position="12"/>
      <Paragraph position="14"> That is, we assume a state is only dependent on the observation sequence o and independent on other states in the state sequence s . This assumption is reasonable because the dependence among the states in the state sequence has been</Paragraph>
      <Paragraph position="16"/>
      <Paragraph position="18"> The above model consists of two models: the state transition model [?] which measures the state dependence of a state given the previous states, and the output model which measures the observation dependence of a state given the observation sequence in a discriminative way. Therefore, we call the above model as in equation (4) a discriminative HMM (DHMM) with long state dependence (LSD-DHMM). The LSD-DHMM separates the dependence of a state on the previous states and the observation sequence. The main difference between a GHMM and a LSD-DHMM lies in their output models in that the output model of a LSD-DHMM directly captures the context dependence between successive observations in determining the &amp;quot;hidden&amp;quot; states while the output model of the GHMM fails to do so. That is, the output model of a LSD-DHMM overcomes the strong context independent assumption in the GHMM and becomes observation context dependent. Therefore, the LSD-DHMM can also be called an observation context dependent HMM. Compared with other DHMMs, the LSD-DHMM explicitly models the long state dependence and the non-projection nature of the LSD-DHMM alleviates the label bias problem inherent in projection-based DHMMs.</Paragraph>
      <Paragraph position="20"> Computation of a LSD-DHMM consists of two parts. The first is to compute the state transition model: . Traditionally, ngram modeling(e.g. bigram for the first-order GHMM and trigram for the second-order GHMM) is used to estimate the state transition model. However, such approach fails to capture the long state dependence since it is not reasonably practical for ngram modeling to be beyond trigram. In this paper, a variable-length mutual information-based modeling approach (VLMI) is proposed as follow:</Paragraph>
      <Paragraph position="22"> )2( ni [?][?] , we first find a minimal )i0( kk p[?] where the frequency of s is bigger than a threshold (e.g. 10) and then estimate</Paragraph>
      <Paragraph position="24"> In this way, the long state dependence can be captured maximally in a dynamical way. Here, the frequencies of variable-length state sequences are estimated using the simple Good-Turing approach (Gale et al 1995).</Paragraph>
      <Paragraph position="26"> The second is to estimate the output model: . Ideally, we would have sufficient training data for every event whose conditional probability we wish to calculate. Unfortunately, there is rarely enough training data to compute accurate probabilities when decoding on new data. Traditionally, there are two existing approaches to resolve this problem: linear interpolation (Jelinek 1989) and back-off (Katz 1987). However, these two approaches only work well when the number of different information sources is limited. When a long context is considered, the number of different information sources is exponential and not reasonably enumerable. The current tendency is to recast it as a classification problem and use the output of a classifier, e.g. the maximum entropy classifier (Ratnaparkhi 1999) to estimate the state probability distribution given the observation sequence. In the next two sections, we will propose a more effective ensemble of kNN probability estimators to resolve this problem.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="111" end_page="111" type="metho">
    <SectionTitle>
3. kNN Probability Estimator
</SectionTitle>
    <Paragraph position="0"> The main challenge for the LSD-DHMM is how to reliably estimate p in its output model.</Paragraph>
    <Paragraph position="1"> For efficiency, we can always assume , where the pattern entry . That is, we only consider the observation dependence in a window of 2N+1 observations (e.g. we only consider the current observation, the previous observation and the next observation when N=1). For convenience, we denote P as the conditional state probability distribution of the states given E and</Paragraph>
    <Paragraph position="3"> as the conditional state probability of given .</Paragraph>
    <Paragraph position="5"> The kNN probability estimator estimates by first finding the K nearest neighbors of frequently occurring pattern entries and then aggregating them to make a proper estimation of . Here, the conditional state probability distribution is estimated instead of the classification in a traditional kNN classifier. To do so, all the frequently occurring pattern entries are extracted from the training corpus in an exhaustive way and stored in a dictionary . In order to limit the dictionary size and keep efficiency, we constrain a valid set of pattern entry forms ValidEntry to consider only the most informative information sources. Generally, ValidEntry can be determined manually or automatically according to the applications. In Section 5, we will give an example.</Paragraph>
    <Paragraph position="7"> Given a pattern entry E and a dictionary of frequently occurring pattern entries , a simple algorithm is applied to find the K nearest neighbors of the pattern entry from the dictionary as follows:</Paragraph>
    <Paragraph position="9"> * compare with each entry in the dictionary and find all the compatible entries</Paragraph>
    <Paragraph position="11"> * compute the cosine similarity between E and each of the compatible entries</Paragraph>
    <Paragraph position="13"> * sort out the K nearest neighbors according to their cosine similarities Finally, the conditional state probability distribution of the pattern entry is aggregated over those of its K nearest neighbors weighted by their frequencies and cosine similarities</Paragraph>
    <Paragraph position="15"> In the literature, an ensemble has been widely used in the classification problem to combine several classifiers (Breiman 1996; Hamamoto 1997; Dietterich 1998; Zhou Z.H. et al 2002; Kim et al 2003). It is well known that an ensemble often outperforms the individual classifiers that make it up (Hansen et al 1990).</Paragraph>
    <Paragraph position="16"> In this paper, an ensemble of kNN probability estimators is proposed to estimate the conditional state probability distribution P instead of the classification. This is done through a bagging technique (Breiman 1996) to aggregate several kNN probability estimators. In bagging, the M kNN probability estimators in the ensemble</Paragraph>
    <Paragraph position="18"> independently via a bootstrap technique and then they are aggregated via an appropriate aggregation method. Usually, we have a single training set and need M training sample sets to construct a kNN ensemble with M independent kNN probability estimators. From the statistical viewpoint, we need to make the training sample sets different as much as possible in order to obtain a higher aggregation performance. For doing this, we often use the bootstrap technique which builds M replicate data sets by randomly re-sampling with replacement from the given training set repeatedly. Each example in the given training set may appear repeatedly or not at all in any particular replicate training sample set. Each training sample set is used to train a certain kNN probability estimator.</Paragraph>
    <Paragraph position="19"> Finally, the conditional state probability distribution of the pattern entry E is averaged over those of the M kNN probability estimators in the ensemble:</Paragraph>
    <Paragraph position="21"/>
  </Section>
  <Section position="5" start_page="111" end_page="211" type="metho">
    <SectionTitle>
5. Shallow Parsing
</SectionTitle>
    <Paragraph position="0"> In order to evaluate the LSD-DHMM and the proposed variable-length mutual information modeling approach for the long state dependence in the state transition model and the kNN ensemble for the observation dependence in the output model, we have applied it in the application of shallow parsing.</Paragraph>
    <Paragraph position="1"> For shallow parsing, we have o , where is the word sequence and is the part-of-speech (POS) sequence, while the &amp;quot;hidden&amp;quot; states are represented as structural tags to bracket and differentiate various categories of phrases. The basic idea of using the structural tags to represent the &amp;quot;hidden&amp;quot; states is similar to Skut et al (1998) and Zhou et al (2000). Here, a structural tag consists of three parts:  = * Boundary Category (BOUNDARY): it is a set of four values: &amp;quot;O&amp;quot;/&amp;quot;B&amp;quot;/&amp;quot;M&amp;quot;/&amp;quot;E&amp;quot;, where &amp;quot;O&amp;quot; means that current word is a whOle phrase and &amp;quot;B&amp;quot;/&amp;quot;M&amp;quot;/&amp;quot;E&amp;quot; means that current word is at the Beginning/in the Middle/at the End of a phrase. * Phrase Category (PHRASE): it is used to denote the category of the phrase.</Paragraph>
    <Paragraph position="2"> * Part-of-Speech (POS): Because of the limited number of boundary and phrase categories, the POS is added into the structural tag to represent more accurate state transition and output models.</Paragraph>
    <Paragraph position="3"> For example, given the following POS tagged sentence as the observation sequence:</Paragraph>
    <Paragraph position="5"> We can have a corresponding sequence of structural tags as the &amp;quot;hidden&amp;quot; state sequence:</Paragraph>
    <Paragraph position="7"> and an equivalent phrase chunked sentence as the shallow parsing result:</Paragraph>
  </Section>
class="xml-element"></Paper>