<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1023">
  <Title>A Second-Order Hidden Markov Model for Part-of-Speech Tagging</Title>
  <Section position="3" start_page="175" end_page="175" type="metho">
    <SectionTitle>
2 Hidden Markov Models
</SectionTitle>
    <Paragraph position="0"> A hidden Markov model (HMM) is a statistical construct that can be used to solve classification problems that have an inherent state sequence representation. The model can be visualized as an interlocking set of states. These states are connected by a set of transition probabilities, which indicate the probability of traveling between two given states. A process begins in some state, then at discrete time intervals, the process &amp;quot;moves&amp;quot; to a new state as dictated by the transition probabilities. In an HMM, the exact sequence of states that the process generates is unknown (i.e., hidden). As the process enters each state, one of a set of output symbols is emitted by the process. Exactly which symbol is emitted is determined by a probability distribution that is specific to each state. The output of the HMM is a sequence of output symbols.</Paragraph>
    <Section position="1" start_page="175" end_page="175" type="sub_section">
      <SectionTitle>
2.1 Basic Definitions and Notation
</SectionTitle>
      <Paragraph position="0"> According to (Rabiner, 1989), there are five elements needed to define an HMM: 1. N, the number of distinct states in the model. For part-of-speech tagging, N is the number of tags that can be used by the system. Each possible tag for the system corresponds to one state of the HMM.</Paragraph>
      <Paragraph position="1"> 2. M, the number of distinct output symbols in the alphabet of the HMM. For part-of-speech tagging, M is the number of words in the lexicon of the system.</Paragraph>
      <Paragraph position="2">  3. A = {a/j}, the state transition probabil- null ity distribution. The probability aij is the probability that the process will move from state i to state j in one transition. For part-of-speech tagging, the states represent the tags, so aij is the probability that the model will move from tag ti to tj -- in other words, the probability that tag tj follows ti. This probability can be estimated using data from a training corpus.</Paragraph>
      <Paragraph position="3"> 4. B = {bj(k)), the observation symbol probability distribution. The probability bj(k) is the probability that the k-th output symbol will be emitted when the model is in state j. For part-of-speech tagging, this is the probability that the word Wk will be emitted when the system is at tag tj (i.e., P(wkltj)). This probability can be estimated using data from a training corpus. 5. 7r = {Tri}, the initial state distribution. 7ri is the probability that the model will start in state i. For part-of-speech tagging, this is the probability that the sentence will begin with tag ti.</Paragraph>
      <Paragraph position="4"> When using an HMM to perform part-of-speech tagging, the goal is to determine the most likely sequence of tags (states) that generates the words in the sentence (sequence of output symbols). In other words, given a sentence V, calculate the sequence U of tags that maximizes P(VIU ). The Viterbi algorithm is a common method for calculating the most likely tag sequence when using an HMM. This algorithm is explained in detail by Rabiner (1989) and will not be repeated here.</Paragraph>
    </Section>
    <Section position="2" start_page="175" end_page="175" type="sub_section">
      <SectionTitle>
2.2 Calculating Probabilities for
Unknown Words
</SectionTitle>
      <Paragraph position="0"> In a standard HMM, when a word does not occur in the training data, the emit probability for the unknown word is 0.0 in the B matrix (i.e., bj(k) = 0.0 if wk is unknown). Being able to accurately tag unknown words is important, as they are frequently encountered when tagging sentences in applications. Most work in the area of unknown words and tagging deals with predicting part-of-speech information based on word endings and affixation information, as shown by work in (Mikheev, 1996), (Mikheev, 1997), (Weischedel et al., 1993), and (Thede, 1998). This section highlights a method devised for HMMs, which differs slightly from previous approaches.</Paragraph>
      <Paragraph position="1"> To create an HMM to accurately tag unknown words, it is necessary to determine an estimate of the probability P(wklti) for use in the tagger. The probability P(word contains sjl tag is ti) is estimated, where sj is some &amp;quot;suffix&amp;quot; (a more appropriate term would be word ending, since the sj's are not necessarily morphologically significant, but this terminology is unwieldy). This new probability is stored in a matrix C = {cj(k)), where cj(k) = P(word has suffix ski tag is tj), replaces bj(k) in the HMM calculations for unknown words. This probability can be estimated by collecting suffix information from each word in the training corpus.</Paragraph>
      <Paragraph position="2"> In this work, suffixes of length one to four characters are considered, up to a maximum suffix length of two characters less than the length of the given word. An overall count of the number of times each suffix/tag pair appears in the training corpus is used to estimate emit probabilities for words based on their suffixes, with some exceptions. When estimating suffix probabilities, words with length four or less are not likely to contain any word-ending information that is valuable for classification, so they are ignored. Unknown words are presumed to be open-class, so words that are not tagged with an open-class tag are also ignored.</Paragraph>
      <Paragraph position="3"> When constructing our suffix predictor, words that contain hyphens, are capitalized, or contain numeric digits are separated from the main calculations. Estimates for each of these categories are calculated separately. For example, if an unknown word is capitalized, the probability distribution estimated from capitalized words is used to predict its part of speech. However, capitalized words at the beginning of a sentence are not classified in this way-the initial capitalization is ignored. If a word is not capitalized and does not contain a hyphen or numeric digit, the general distribution is used. Finally, when predicting the possible part of speech for an unknown word, all possible matching suffixes are used with their predictions smoothed (see Section 3.2).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="175" end_page="178" type="metho">
    <SectionTitle>
3 The Second-Order Model for Part-of-Speech Tagging
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="175" end_page="176" type="sub_section">
      <Paragraph position="0"> The model described in Section 2 is an example of a first-order hidden Markov model. In part-of-speech tagging, it is called a bigram tagger. This model works reasonably well in part-of-speech tagging, but captures a more limited  amount of the contextual information than is available. Most of the best statistical taggers use a trigram model, which replaces the bigram transition probability aij = P(rp = tjITp_ 1 -~ ti) with a trigram probability aijk : P(7&amp;quot;p = tklrp_l = tj, rp-2 = ti). This section describes a new type of tagger that uses trigrams not only for the context probabilities but also for the lexical (and suffix) probabilities. We refer to this new model as a full second-order hidden Markov model.</Paragraph>
    </Section>
    <Section position="2" start_page="176" end_page="176" type="sub_section">
      <SectionTitle>
3.1 Defining New Probability
Distributions
</SectionTitle>
      <Paragraph position="0"> The full second-order HMM uses a notation similar to a standard first-order model for the probability distributions. The A matrix contains state transition probabilities, the B matrix contains output symbol distributions, and the C matrix contains unknown word distributions.</Paragraph>
      <Paragraph position="1"> The rr matrix is identical to its counterpart in the first-order model. However, the definitions of A, B, and C are modified to enable the full second-order HMM to use more contextual information to model part-of-speech tagging. In the following sections, there are assumed to be P words in the sentence with rp and Vp being the p-th tag and word in the sentence, respectively.</Paragraph>
      <Paragraph position="2">  The A matrix defines the contextual probabilities for the part-of-speech tagger. As in the trigram model, instead of limiting the context to a first-order approximation, the A matrix is defined as follows: A = {aijk), where&amp;quot; aija= P(rp = tklrp_l = tj, rp-2 = tl), 1 &lt; p &lt; P Thus, the transition matrix is now three dimensional, and the probability of transitioning to a new state depends not only on the current state, but also on the previous state. This allows a more realistic context-dependence for the word tags. For the boundary cases of p = 1 and p = 2, the special tag symbols NONE and SOS are used.</Paragraph>
      <Paragraph position="3">  The B matrix defines the lexical probabilities for the part-of-speech tagger, while the C matrix is used for unknown words. Similarly to the trigram extension to the A matrix, the approximation for the lexical and suffix probabilities can also be modified to include second-order information as follows:</Paragraph>
      <Paragraph position="5"> In these equations, the probability of the model emitting a given word depends not only on the current state but also on the previous state. To our knowledge, this approach has not been used in tagging. SOS is again used in the p = 1 case.</Paragraph>
    </Section>
    <Section position="3" start_page="176" end_page="178" type="sub_section">
      <SectionTitle>
3.2 Smoothing Issues
</SectionTitle>
      <Paragraph position="0"> While the full second-order HMM is a more precise approximation of the underlying probabilities for the model, a problem can arise from sparseness of data, especially with lexical estimations. For example, the size of the B matrix is T2W, which for the WSJ corpus is approximately 125,000,000 possible tag/tag/word combinations. In an attempt to avoid sparse data estimation problems, the probability estimates for each distribution is smoothed. There are several methods of smoothing discussed in the literature. These methods include the additive method (discussed by (Gale and Church, 1994)); the Good-Turing method (Good, 1953); the Jelinek-Mercer method (Jelinek and Mercer, 1980); and the Katz method (Katz, 1987).</Paragraph>
      <Paragraph position="1"> These methods are all useful smoothing algorithms for a variety of applications. However, they are not appropriate for our purposes. Since we are smoothing trigram probabilities, the additive and Good-Turing methods are of limited usefulness, since neither takes into account bi-gram or unigram probabilities. Katz smoothing seems a little too granular to be effective in our application--the broad spectrum of possibilities is reduced to three options, depending on the number of times the given event occurs.</Paragraph>
      <Paragraph position="2"> It seems that smoothing should be based on a function of the number of occurances. Jelinek-Mercer accommodates this by smoothing the n-gram probabilities using differing coefficients (A's) according to the number of times each n-gram occurs, but this requires holding out training data for the A's. We have implemented a model that smooths with lower order information by using coefficients calculated from the number of occurances of each trigram, bigram, and unigram without training. This method is explained in the following sections.</Paragraph>
      <Paragraph position="3">  To estimate the state transition probabilities, we want to use the most specific information.</Paragraph>
      <Paragraph position="4">  However, that information may not always be available. Rather than using a fixed smoothing technique, we have developed a new method that uses variable weighting. This method attaches more weight to triples that occur more often.</Paragraph>
      <Paragraph position="5"> The tklrp-1 P=ka formula for the estimate /3 of P(rp =</Paragraph>
      <Paragraph position="7"> number of times tk occurs number of times sequence tjta occurs number of times sequence titjtk occurs total number of tags that appear number of times tj occurs number of times sequence titj occurs where:</Paragraph>
      <Paragraph position="9"> The formulas for k2 and k3 are chosen so that the weighting for each element in the equation for/3 changes based on how often that element occurs in the training data. Notice that the sum of the coefficients of the probabilities in the equation for/3 sum to one. This guarantees that the value returned for/3 is a valid probability.</Paragraph>
      <Paragraph position="10"> After this value is calculated for all tag triples, the values are normalized so that ~ /3 -- 1, tkET creating a valid probability distribution.</Paragraph>
      <Paragraph position="11"> The value of this smoothing technique becomes clear when the triple in question occurs very infrequently, if at all. Consider calculating /3 for the tag triple CD RB VB. The information for this triple is:</Paragraph>
      <Paragraph position="13"> If smoothing were not applied, the probability would have been 0.000, which would create problems for tagger generalization. Smoothing allows tag triples that were not encountered in the training data to be assigned a probability of occurance.</Paragraph>
      <Paragraph position="14">  For the lexical and suffix probabilities, we do something somewhat different than for context probabilities. Initial experiments that used a formula similar to that used for the contextual estimates performed poorly. This poor performance was traced to the fact that smoothing allowed too many words to be incorrectly tagged with tags that did not occur with that word in the training data (over-generalization). As an alternative, we calculated the smoothed probability/3 for words as follows:</Paragraph>
      <Paragraph position="16"> Notice that this method assigns a probability of 0.0 to a word/tag pair that does not appear in the training data. This prevents the tagger from trying every possible combination of word and tag, something which both increases running time and decreases the accuracy. We believe the low accuracy of the original smoothing scheme emerges from the fact that smoothing the lexical probabilities too far allows the contextual information to dominate at the expense of the lexical information. A better smoothing approach for lexical information could possibly be created by using some sort of word class idea, such as the genotype idea used in (Tzoukermann and Radev, 1996), to improve our /5 estimate.</Paragraph>
      <Paragraph position="17">  In addition to choosing the above approach for smoothing the C matrix for unknown words, there is an additional issue of choosing which suffix to use when predicting the part of speech. There are many possible answers, some of which are considered by (Thede, 1998): use the longest matching suffix, use an entropy measure to determine the &amp;quot;best&amp;quot; affix to use, or use an average. A voting technique for cij(k) was determined that is similar to that used for contextual smoothing but is based on different length suffixes. null Let s4 be the length four suffix of the given word. Define s3, s2, and sl to be the length three, two, and one suffixes respectively. If the length of the word is six or more, these four suffixes are used. Otherwise, suffixes up to length n - 2 are used, where n is the length of the word. Determine the longest suffix of these that matches a suffix in the training data, and calculate the new smoothed probability:</Paragraph>
      <Paragraph position="19"> curs in the training data.</Paragraph>
      <Paragraph position="20"> * ~ij(Sk) -- the estimate of Cij(8k) from the previous lexical smoothing.</Paragraph>
      <Paragraph position="21"> After calculating/5, it is normalized. Thus, suffixes of length four are given the most weight, and a suffix receives more weight the more times it appears. Information provided by suffixes of length one to four are used in estimating the probabilities, however.</Paragraph>
    </Section>
    <Section position="4" start_page="178" end_page="178" type="sub_section">
      <SectionTitle>
3.3 The New Viterbi Algorithm
</SectionTitle>
      <Paragraph position="0"> Modification of the lexical and contextual probabilities is only the first step in defining a full second-order HMM. These probabilities must also be combined to select the most likely sequence of tags that generated the sentence.</Paragraph>
      <Paragraph position="1"> This requires modification of the Viterbi algorithm. First, the variables ~ and C/ from (Rabiner, 1989) are redefined, as shown in Figure 1. These new definitions take into account the added dependencies of the distributions of A, B, and C. We can then calculate the most likely tag sequence using the modification of the Viterbi algorithm shown in Figure 1. The running time of this algorithm is O (NT3), where N is the length of the sentence, and T is the number of tags. This is asymptotically equivalent to the running time of a standard trigram tagger that maximizes the probability of the entire tag sequence.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML