<?xml version="1.0" standalone="yes"?> <Paper uid="P94-1025"> <Title>PART-OF-SPEECH TAGGING USING A VARIABLE MEMORY MARKOV MODEL Hinrich Schütze Center for the Study of Language and Information</Title> <Section position="3" start_page="0" end_page="182" type="metho"> <SectionTitle> VARIABLE MEMORY MARKOV MODELS </SectionTitle> <Paragraph position="0"> Markov models are a natural candidate for language modeling and temporal pattern recognition, mostly due to their mathematical simplicity. However, it is obvious that finite memory Markov models cannot capture the recursive nature of language, nor can they be trained effectively with long memories. The notion of variable context length also appears naturally in the context of universal coding (Rissanen, 1978; Rissanen and Langdon, 1981). This information theoretic notion is now known to be closely related to efficient modeling (Rissanen, 1988). The natural measure that appears in information theory is the description length, as measured by the statistical predictability via the Kullback-Leibler (KL) divergence. The VMM learning algorithm is based on minimizing the statistical prediction error of a Markov model, measured by the instantaneous KL divergence of the following symbols, i.e. the current statistical surprise of the model. The memory is extended precisely when such a surprise is significant, until the overall statistical prediction of the stochastic model is sufficiently good. For the sake of simplicity, a POS tag is termed a symbol and a sequence of tags is called a string. We now briefly describe the algorithm for learning a variable memory Markov model. See (Ron et al., 1993; Ron et al., 1994) for a more detailed description of the algorithm.</Paragraph> <Paragraph position="1"> We first introduce notational conventions and define some basic concepts. Let $\Sigma$ be a finite alphabet. Denote by $\Sigma^*$ the set of all strings over $\Sigma$. A string $s$ over $\Sigma^*$ of length $n$ is denoted by $s = s_1 s_2 \ldots s_n$. We denote by $e$ the empty string. The length of a string $s$ is denoted by $|s|$ and the size of an alphabet $\Sigma$ is denoted by $|\Sigma|$. Let $Prefix(s) = s_1 s_2 \ldots s_{n-1}$ denote the longest prefix of a string $s$, and let $Prefix^*(s)$ denote the set of all prefixes of $s$, including the empty string. Similarly, $Suffix(s) = s_2 s_3 \ldots s_n$ and $Suffix^*(s)$ is the set of all suffixes of $s$. A set of strings $S$ is called a suffix (prefix) free set if $\forall s \in S: S \cap Suffix^*(s) = \emptyset$ ($S \cap Prefix^*(s) = \emptyset$).</Paragraph> <Paragraph position="2"> We call a probability measure $P$ over the strings in $\Sigma^*$ proper if $P(e) = 1$, and for every string $s$, $\sum_{\sigma \in \Sigma} P(s\sigma) = P(s)$. Hence, for every prefix free set $S$, $\sum_{s \in S} P(s) \le 1$, and specifically for every integer $n \ge 0$, $\sum_{s \in \Sigma^n} P(s) = 1$.</Paragraph> <Paragraph position="3"> A prediction suffix tree $T$ over $\Sigma$ is a tree of degree $|\Sigma|$. The edges of the tree are labeled by symbols from $\Sigma$, such that from every internal node there is at most one outgoing edge labeled by each symbol. The nodes of the tree are labeled by pairs $(s, \gamma_s)$ where $s$ is the string associated with the walk starting from that node and ending in the root of the tree, and $\gamma_s : \Sigma \rightarrow [0,1]$ is the output probability function of $s$ satisfying $\sum_{\sigma \in \Sigma} \gamma_s(\sigma) = 1$. A prediction suffix tree induces probabilities on arbitrarily long strings in the following manner.
The probability that $T$ generates a string $w = w_1 w_2 \ldots w_n$ in $\Sigma^n$, denoted by $P_T(w)$, is $\prod_{i=1}^{n} \gamma_{s^{i-1}}(w_i)$, where $s^0 = e$, and for $1 \le i \le n-1$, $s^i$ is the string labeling the deepest node reached by taking the walk corresponding to $w_1 \ldots w_i$ starting at the root of $T$. By definition, a prediction suffix tree induces a proper measure over $\Sigma^*$, and hence for every prefix free set of strings $\{w^1, \ldots, w^m\}$, $\sum_{i=1}^{m} P_T(w^i) \le 1$, and specifically for every $n \ge 1$, $\sum_{s \in \Sigma^n} P_T(s) = 1$.</Paragraph> <Paragraph position="4"> A Probabilistic Finite Automaton (PFA) $A$ is a 5-tuple $(Q, \Sigma, \tau, \gamma, \pi)$, where $Q$ is a finite set of $n$ states, $\Sigma$ is an alphabet of size $k$, $\tau : Q \times \Sigma \rightarrow Q$ is the transition function, $\gamma : Q \times \Sigma \rightarrow [0,1]$ is the output probability function, and $\pi : Q \rightarrow [0,1]$ is the probability distribution over the start states.</Paragraph> <Paragraph position="5"> The functions $\gamma$ and $\pi$ must satisfy the following requirements: for every $q \in Q$, $\sum_{\sigma \in \Sigma} \gamma(q, \sigma) = 1$, and $\sum_{q \in Q} \pi(q) = 1$. The probability that $A$ generates a string $s = s_1 s_2 \ldots s_n \in \Sigma^n$ is $P_A(s) = \sum_{q^0 \in Q} \pi(q^0) \prod_{i=1}^{n} \gamma(q^{i-1}, s_i)$, where $q^{i+1} = \tau(q^i, s_i)$. $\tau$ can be extended to be defined on $Q \times \Sigma^*$ as follows: $\tau(q, s_1 s_2 \ldots s_l) = \tau(\tau(q, s_1 \ldots s_{l-1}), s_l) = \tau(\tau(q, Prefix(s)), s_l)$. The distribution over the states, $\pi$, can be replaced by a single start state, denoted by $e$, such that $\gamma(e, s) = \pi(q)$, where $s$ is the label of the state $q$. Therefore, $\pi(e) = 1$ and $\pi(q) = 0$ if $q \ne e$.</Paragraph> <Paragraph position="6"> For POS tagging, we are interested in learning a sub-class of finite state machines which have the following property. Each state in a machine $M$ belonging to this sub-class is labeled by a string of length at most $L$ over $\Sigma$, for some $L \ge 0$. The set of strings labeling the states is suffix free. We require that for every two states $q^1, q^2 \in Q$ and for every symbol $\sigma \in \Sigma$, if $\tau(q^1, \sigma) = q^2$ and $q^1$ is labeled by a string $s^1$, then $q^2$ is labeled by a string $s^2$ which is a suffix of $s^1 \cdot \sigma$. Since the set of strings labeling the states is suffix free, if there exists a string having this property then it is unique. Thus, in order that $\tau$ be well defined on a given set of strings $S$, not only must the set be suffix free, but it must also have the property that for every string $s$ in the set and every symbol $\sigma$, there exists a string in the set which is a suffix of $s\sigma$. For our convenience, from this point on, if $q$ is a state in $Q$ then $q$ will also denote the string labeling that state.</Paragraph> <Paragraph position="7"> A special case of these automata is the case in which $Q$ includes all $|\Sigma|^L$ strings of length $L$.</Paragraph> <Paragraph position="8"> These automata are known as Markov processes of order $L$. We are interested in learning automata for which the number of states, $n$, is much smaller than $|\Sigma|^L$, which means that few states have long memory and most states have a short one. We refer to these automata as variable memory Markov (VMM) processes. In the case of Markov processes of order $L$, the identity of the states (i.e. the identity of the strings labeling the states) is known and learning such a process reduces to approximating the output probability function.</Paragraph> <Paragraph position="9"> Given a sample consisting of $m$ POS tag sequences of lengths $l_1, l_2, \ldots, l_m$ we would like to find a prediction suffix tree that will have the same statistical properties as the sample and thus can be used to predict the next outcome for sequences generated by the same source. At each stage we can transform the tree into a variable memory Markov process. The key idea is to iteratively build a prediction tree whose probability measure equals the empirical probability measure calculated from the sample.</Paragraph>
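To make the definitions concrete, here is a minimal Python sketch (ours, not the authors' code) of how a prediction suffix tree assigns a probability to a tag string by walking, for each position, to the deepest node whose label is a suffix of the preceding context. The class and method names are illustrative assumptions.

```python
# Minimal prediction suffix tree (PST) sketch; names and data layout are
# illustrative assumptions, not taken from the paper.

class PSTNode:
    def __init__(self, label, gamma):
        self.label = label       # suffix labeling this node, as a tuple of tags
        self.gamma = gamma       # output distribution gamma_s: dict tag -> probability
        self.children = {}       # child nodes, keyed by the next-older context tag

class PST:
    def __init__(self, root):
        self.root = root         # the root is labeled by the empty string e

    def deepest_node(self, context):
        """Read the context from the most recent tag backwards and return the
        deepest existing node, i.e. the longest suffix of the context that
        labels a node of the tree."""
        node = self.root
        for tag in reversed(context):
            if tag in node.children:
                node = node.children[tag]
            else:
                break
        return node

    def sequence_probability(self, tags):
        """P_T(w) = prod_i gamma_{s^{i-1}}(w_i), where s^{i-1} is the deepest
        node reached for the prefix w_1 ... w_{i-1}."""
        prob = 1.0
        for i, tag in enumerate(tags):
            node = self.deepest_node(tags[:i])
            prob *= node.gamma.get(tag, 0.0)
        return prob

if __name__ == "__main__":
    # toy example with two tags: after "AT" (article) a noun is very likely
    root = PSTNode((), {"AT": 0.5, "NN": 0.5})
    root.children["AT"] = PSTNode(("AT",), {"AT": 0.1, "NN": 0.9})
    tree = PST(root)
    print(tree.sequence_probability(["AT", "NN"]))  # 0.5 * 0.9 = 0.45
```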
<Paragraph position="10"> We start with a tree consisting of a single node and add nodes which we have reason to believe should be in the tree. A node $\sigma s$ must be added to the tree if it statistically differs from its parent node $s$. A natural measure to check the statistical difference is the relative entropy (also known as the Kullback-Leibler (KL) divergence) (Kullback, 1959) between the conditional probabilities $P(\cdot | s)$ and $P(\cdot | \sigma s)$. Let $X$ be an observation space and $P_1, P_2$ be probability measures over $X$; then the KL divergence between $P_1$ and $P_2$ is $D_{KL}(P_1 \| P_2) = \sum_{x \in X} P_1(x) \log \frac{P_1(x)}{P_2(x)}$. In our case, the KL divergence measures how much additional information is gained by using the suffix $\sigma s$ for prediction instead of the shorter suffix $s$. There are cases where the statistical difference is large yet the probability of observing the suffix $\sigma s$ itself is so small that we can neglect those cases.</Paragraph> <Paragraph position="11"> Hence we weigh the statistical error by the prior probability of observing $\sigma s$. The statistical error measure in our case is,</Paragraph> <Paragraph position="12"> $Err(\sigma s, s) = P(\sigma s) \, D_{KL}\big(P(\cdot | \sigma s) \,\|\, P(\cdot | s)\big) = P(\sigma s) \sum_{\sigma' \in \Sigma} P(\sigma' | \sigma s) \log \frac{P(\sigma' | \sigma s)}{P(\sigma' | s)}$</Paragraph> <Paragraph position="13"> Therefore, a node $\sigma s$ is added to the tree if the statistical difference (defined by $Err(\sigma s, s)$) between the node and its parent $s$ is larger than a predetermined accuracy $\epsilon$. The tree is grown level by level, adding a son of a given leaf in the tree whenever the statistical error is large. The problem is that the requirement that a node statistically differs from its parent node is a necessary condition for belonging to the tree, but is not sufficient. The leaves of a prediction suffix tree must differ from their parents (or they are redundant) but internal nodes might not have this property. Therefore, we must continue testing further potential descendants of the leaves in the tree up to depth $L$. In order to avoid exponential growth in the number of strings tested, we do not test strings which belong to branches which are reached with small probability. The set of strings tested at each step is denoted by $S$, and can be viewed as a kind of frontier of the growing tree $T$.</Paragraph> </Section>
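A compact Python sketch of the growth procedure just described, assuming the empirical prior and conditional probabilities estimated from the sample are available as dictionaries. The function names, the frontier bookkeeping, and the pruning threshold are our own simplifications; see Ron et al. (1994) for the full algorithm.

```python
import math

def kl_divergence(p, q, alphabet):
    """D_KL(P || Q) = sum_x P(x) log(P(x)/Q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in alphabet if p.get(a, 0.0) > 0.0)

def err(sigma_s, s, prior, cond, alphabet):
    """Statistical error of candidate node sigma_s relative to its parent s:
    the KL divergence between the next-symbol distributions, weighted by the
    empirical probability of observing the suffix sigma_s."""
    return prior[sigma_s] * kl_divergence(cond[sigma_s], cond[s], alphabet)

def grow_tree(alphabet, prior, cond, epsilon, min_prior, max_depth):
    """Grow the set of suffixes kept as tree nodes, level by level, starting
    from the empty suffix.  A candidate sigma_s is added when it is observed
    often enough and its prediction differs from its parent's by more than
    epsilon; its extensions stay on the frontier up to max_depth, since a
    node's own error being small is not sufficient to stop testing deeper."""
    nodes = {()}                         # the tree initially holds only the empty suffix
    frontier = [(a,) for a in alphabet]  # candidate suffixes still to be tested
    while frontier:
        sigma_s = frontier.pop()
        if prior.get(sigma_s, 0.0) < min_prior:
            continue                     # branch reached with small probability: prune
        s = sigma_s[1:]                  # parent suffix: drop the oldest symbol
        if err(sigma_s, s, prior, cond, alphabet) > epsilon:
            nodes.add(sigma_s)
        if len(sigma_s) < max_depth:
            frontier.extend((a,) + sigma_s for a in alphabet)
    return nodes
```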
<Section position="4" start_page="182" end_page="183" type="metho"> <SectionTitle> USING A VMM FOR POS TAGGING </SectionTitle> <Paragraph position="0"> We used a tagged corpus to train a VMM. The syntactic information, i.e. the probability of a specific word belonging to a tag class, was estimated using maximum likelihood estimation from the individual word counts. The states and the transition probabilities of the Markov model were determined by the learning algorithm, and tag output probabilities were estimated from word counts (the static information present in the training corpus). The whole structure, for two states, is depicted in Fig. 1. $S_i$ and $S_{i+1}$ are strings of tags corresponding to states of the automaton. $P(t_i | S_i)$ is the probability that tag $t_i$ will be output by state $S_i$ and $P(t_{i+1} | S_{i+1})$ is the probability that the next tag $t_{i+1}$ is the output of state $S_{i+1}$.</Paragraph> <Paragraph position="2"> When tagging a sequence of words $w_{1,n}$, we want to find the tag sequence $t_{1,n}$ that is most likely for $w_{1,n}$. We can maximize the joint probability of $w_{1,n}$ and $t_{1,n}$ to find this sequence:</Paragraph> <Paragraph position="3"> $T(w_{1,n}) = \arg\max_{t_{1,n}} P(t_{1,n} | w_{1,n}) = \arg\max_{t_{1,n}} P(t_{1,n}, w_{1,n})$</Paragraph> <Paragraph position="4"> $P(t_{1,n}, w_{1,n})$ can be expressed as a product of conditional probabilities as follows:</Paragraph> <Paragraph position="5"> $P(t_{1,n}, w_{1,n}) = \prod_{i=1}^{n} P(t_i | t_{1,i-1}, w_{1,i-1}) \, P(w_i | t_{1,i}, w_{1,i-1})$</Paragraph> <Paragraph position="6"> With the simplifying assumption that the probability of a tag only depends on previous tags and that the probability of a word only depends on its tags, we get:</Paragraph> <Paragraph position="7"> $P(t_{1,n}, w_{1,n}) = \prod_{i=1}^{n} P(t_i | t_{1,i-1}) \, P(w_i | t_i)$</Paragraph> <Paragraph position="8"> Given a variable memory Markov model $M$, the conditional probabilities</Paragraph> <Paragraph position="9"> $P(t_i | t_{1,i-1})$</Paragraph> <Paragraph position="10"> are represented by the transition probabilities of the corresponding automaton. The tags $t_{1,n}$ for a sequence of words $w_{1,n}$ are therefore chosen according to the following equation using the Viterbi algorithm:</Paragraph> <Paragraph position="11"> $T(w_{1,n}) = \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(t_i | S_i) \, \frac{P(t_i | w_i) P(w_i)}{P(t_i)}$</Paragraph> <Paragraph position="12"> The terms $P(w_i)$ are constant for a given sequence $w_{1,n}$ and can therefore be omitted from the maximization. We perform a maximum likelihood estimation for $P(t_i)$ by calculating the relative frequency of $t_i$ in the training corpus. The estimation of the static parameters $P(t_i | w_i)$ is described in the next section.</Paragraph>
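The decoding step can be pictured with a short Viterbi sketch over the automaton states. This is not the authors' implementation: the data structures (transition, next_state, p_tag_given_word, p_tag) are our own illustrative assumptions. Each step is scored by the state's output probability for the tag times $P(t_i | w_i)/P(t_i)$, with the constant $P(w_i)$ dropped as noted above.

```python
import math

def viterbi_tag(words, start_state, transition, next_state, p_tag_given_word, p_tag):
    """Viterbi decoding over the states of a variable memory Markov model.

    transition[s][t]     -- P(t | s), probability that state s outputs tag t
    next_state[s][t]     -- state reached from s after emitting tag t
                            (the longest suffix of s+t that labels a state)
    p_tag_given_word[w]  -- dict of smoothed static parameters P(t | w)
    p_tag[t]             -- relative frequency of tag t in the training corpus

    No special handling of dead ends or unseen words is attempted here.
    """
    NEG_INF = float("-inf")
    delta = {start_state: 0.0}   # best log-score of any path ending in each state
    back = []                    # backpointers: one dict state -> (prev_state, tag) per word
    for w in words:
        new_delta, pointers = {}, {}
        for s, score in delta.items():
            for t, p_trans in transition[s].items():
                p_lex = p_tag_given_word.get(w, {}).get(t, 0.0)
                if p_trans <= 0.0 or p_lex <= 0.0:
                    continue
                cand = score + math.log(p_trans) + math.log(p_lex) - math.log(p_tag[t])
                s2 = next_state[s][t]
                if cand > new_delta.get(s2, NEG_INF):
                    new_delta[s2], pointers[s2] = cand, (s, t)
        delta = new_delta
        back.append(pointers)
    # follow backpointers from the best final state to recover the tag sequence
    state = max(delta, key=delta.get)
    tags = []
    for pointers in reversed(back):
        state, t = pointers[state]
        tags.append(t)
    return list(reversed(tags))
```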
<Paragraph position="13"> We trained the variable memory Markov model on the Brown corpus (Francis and Kučera, 1982), with every tenth sentence removed (a total of 1,022,462 tags). The four stylistic tag modifiers &quot;FW&quot; (foreign word), &quot;TL&quot; (title), &quot;NC&quot; (cited word), and &quot;HL&quot; (headline) were ignored, reducing the complete set of 471 tags to 184 different tags.</Paragraph> <Paragraph position="14"> The resulting automaton has 49 states: the null state ($e$), 43 first order states (one symbol long) and 5 second order states (two symbols long). This means that 184 - 43 = 141 states were not (statistically) different enough to be included as separate states in the automaton. An analysis reveals two possible reasons. Frequent symbols such as &quot;ABN&quot; (&quot;half&quot;, &quot;all&quot;, &quot;many&quot; used as prequantifiers, e.g. in &quot;many a younger man&quot;) and &quot;DTI&quot; (determiners that can be singular or plural, &quot;any&quot; and &quot;some&quot;) were not included because they occur in a variety of diverse contexts or often precede unambiguous words. For example, when tagged as &quot;ABN&quot;, the words &quot;half&quot;, &quot;all&quot;, and &quot;many&quot; tend to occur before the unambiguous determiners &quot;a&quot;, &quot;an&quot; and &quot;the&quot;.</Paragraph> <Paragraph position="15"> Some rare tags were not included because they did not improve the optimization criterion, minimum description length (measured by the KL divergence). For example, &quot;HVZ*&quot; (&quot;hasn't&quot;) is not a state although a following &quot;-ed&quot; form is always disambiguated as belonging to class &quot;VBN&quot; (past participle). But since this is a rare event, describing all &quot;HVZ* VBN&quot; sequences separately is cheaper than the added complexity of an automaton with state &quot;HVZ*&quot;. We in fact lost some accuracy in tagging because of the optimization criterion: Several &quot;-ed&quot; forms after forms of &quot;have&quot; were mistagged as &quot;VBD&quot; (past tense).</Paragraph> <Paragraph position="16"> For the five second order states, the next-tag distribution is significantly different when using a longer suffix for prediction. Those states are identified automatically by the VMM learning algorithm. A better prediction and classification of POS tags is achieved by adding those states with only a small increase in the computation time.</Paragraph> <Paragraph position="17"> The two-symbol states were &quot;AT JJ&quot;, &quot;AT NN&quot;, &quot;AT VBN&quot;, &quot;JJ CC&quot;, and &quot;MD RB&quot; (article adjective, article noun, article past participle, adjective conjunction, modal adverb). Table 1 lists two of the largest differences in transition probabilities for each state. The varying transition probabilities are based on differences between the syntactic constructions in which the two competing states occur. For example, adjectives after articles (&quot;AT JJ&quot;) are almost always used attributively, which makes a following preposition impossible and a following noun highly probable, whereas a predicative use favors modifying prepositional phrases. Similarly, an adverb preceded by a modal (&quot;MD RB&quot;) is followed by an infinitive (&quot;VB&quot;) half the time, whereas other adverbs occur less often in pre-infinitival position. On the other hand, a past participle is virtually impossible after &quot;MD RB&quot;, whereas adverbs that are not preceded by modals modify past participles quite often.</Paragraph> <Paragraph position="18"> While it is known that Markov models of order 2 give a slight improvement over order-1 models (Charniak et al., 1993), the number of parameters in our model is much smaller than in a full order-2 Markov model ($49 \cdot 184 = 9{,}016$ vs. $184 \cdot 184 \cdot 184 = 6{,}229{,}504$).</Paragraph> </Section> <Section position="5" start_page="183" end_page="184" type="metho"> <SectionTitle> ESTIMATION OF THE STATIC PARAMETERS </SectionTitle> <Paragraph position="0"> We have to estimate the conditional probabilities $P(t^i | w^j)$, the probability that a given word $w^j$ will appear with tag $t^i$, in order to compute the static parameters $P(w^j | t^i)$ used in the tagging equations described above. A first approximation would be to use the maximum likelihood estimator: $P(t^i | w^j) = \frac{C(t^i, w^j)}{C(w^j)}$, where $C(t^i, w^j)$ is the number of times $w^j$ is tagged as $t^i$ in the training text and $C(w^j)$ is the number of times $w^j$ occurs in the training text. However, some form of smoothing is necessary, since any new text will contain new words, for which $C(w^j)$ is zero. Also, words that are rare will only occur with some of their possible parts of speech in the training text. One solution to this problem is Good-Turing estimation:</Paragraph> <Paragraph position="1"> $P(t^i | w^j) = \frac{C(t^i, w^j) + 1}{C(w^j) + I}$</Paragraph> <Paragraph position="2"> where $I$ is the number of tags, 184 in our case.</Paragraph> <Paragraph position="3"> It turns out that Good-Turing is not appropriate for our problem. The reason is the distinction between closed-class and open-class words. Some syntactic classes like verbs and nouns are productive, others like articles are not. As a consequence, the probability that a new word is an article is zero, whereas it is high for verbs and nouns. We need a smoothing scheme that takes this fact into account.</Paragraph> <Paragraph position="4"> Extending an idea in (Charniak et al., 1993), we estimate the probability of tag conversion to find an adequate smoothing scheme.
Open and closed classes differ in that words often add a tag from an open class, but rarely from a closed class.</Paragraph> <Paragraph position="5"> For example, a word that is first used as a noun will often be used as a verb subsequently, but closed classes such as possessive pronouns (&quot;my&quot;, &quot;her&quot;, &quot;his&quot;) are rarely used with new syntactic categories after the first few thousand words of the Brown corpus. We only have to take stock of these &quot;tag conversions&quot; to make informed predictions on new tags when confronted with unseen text. Formally, let $W_l^{i \neg k}$ be the set of words that have been seen with $t^i$, but not with $t^k$, in the training text up to word $w_l$. Then we can estimate the probability that a word with tag $t^i$ will later be seen with tag $t^k$ as:</Paragraph> <Paragraph position="6"> $P_{l,m}(i \rightarrow k) = \frac{|\{ w \in W_l^{i \neg k} : w \text{ is seen with } t^k \text{ between } w_{l+1} \text{ and } w_m \}|}{|W_l^{i \neg k}|}$</Paragraph> <Paragraph position="7"> This formula also applies to words we haven't seen so far, if we regard such words as having occurred with a special tag &quot;U&quot; for &quot;unseen&quot;. (In this case, $W_l^{U \neg k}$ is the set of words that haven't occurred up to $w_l$.) $P_{l,m}(U \rightarrow k)$ then estimates the probability that an unseen word has tag $t^k$. Table 2 shows the estimates of tag conversion we derived from our training text for $l = 1022462 - 100000$, $m = 1022462$, where 1,022,462 is the number of words in the training text. To avoid sparse data problems we assumed zero probability for types of tag conversion with less than 100 instances in the training set.</Paragraph> <Paragraph position="8"> Our smoothing scheme is then the following heuristic modification of Good-Turing:</Paragraph> <Paragraph position="10"> where $T_j$ is the set of tags that $w^j$ has in the training set and $T$ is the set of all tags. This scheme has the following desirable properties: * As with Good-Turing, smoothing has a small effect on estimates that are based on large counts.</Paragraph> <Paragraph position="11"> * The difference between closed-class and open-class words is respected: The probability for conversion to a closed class is zero and is not affected by smoothing.</Paragraph> <Paragraph position="12"> * Prior knowledge about the probabilities of conversion to different tag classes is incorporated. For example, an unseen word $w^j$ is five times as likely to be a noun as an adverb. Our estimate for $P(t^i | w^j)$ is correspondingly five times higher for &quot;NN&quot; than for &quot;RB&quot;.</Paragraph> </Section>
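As a rough illustration of a smoothing scheme with the three properties listed above, the following Python sketch backs off the raw counts toward the tag-conversion priors $P_{l,m}(j \rightarrow i)$. It is our own assumption-based reconstruction, not the authors' exact formula, and all names in it are illustrative.

```python
def smoothed_p_tag_given_word(tag, word, counts, word_counts, conversion, tags_of, all_tags):
    """Heuristic smoothing of P(tag | word) in the spirit of the scheme described above.

    counts[(tag, word)]  -- C(t, w), training-set count of word w with tag t
    word_counts[word]    -- C(w); 0 for unseen words
    conversion[(j, i)]   -- P_{l,m}(j -> i), estimated tag-conversion probability
                            (zero for conversions to closed classes)
    tags_of[word]        -- T_j, the set of tags seen with w in training
                            (unseen words get the pseudo-tag "U")
    all_tags             -- T, the full tag set
    """
    seen_tags = tags_of.get(word, {"U"})
    # prior mass added for `tag`: how likely words carrying w's known tags
    # are to later acquire `tag` as an additional tag
    add = sum(conversion.get((j, tag), 0.0) for j in seen_tags)
    total_add = sum(conversion.get((j, i), 0.0) for j in seen_tags for i in all_tags)
    numerator = counts.get((tag, word), 0) + add
    denominator = word_counts.get(word, 0) + total_add
    return numerator / denominator if denominator > 0 else 0.0
```

For an unseen word the counts are zero, so the estimate reduces to the normalized conversion priors from &quot;U&quot;; for frequent words the added mass is negligible, and tags belonging to closed classes receive no smoothed mass at all.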
<Section position="6" start_page="184" end_page="185" type="metho"> <SectionTitle> ANALYSIS OF RESULTS </SectionTitle> <Paragraph position="0"> Our result on the test set of 114,392 words (the tenth of the Brown corpus not used for training) was 95.81%. Table 3 shows the 20 most frequent errors.</Paragraph> <Paragraph position="1"> Three typical examples of the most common error (tagging nouns as adjectives) are &quot;Communist&quot;, &quot;public&quot; and &quot;homerun&quot; in the following sentences.</Paragraph> <Paragraph position="2"> The words &quot;public&quot; and &quot;communist&quot; can be used as adjectives or nouns. Since in the above sentences an adjective is syntactically more likely, this was the tagging chosen by the VMM. The noun &quot;homerun&quot; didn't occur in the training set, therefore the priors for unknown words biased the tagging towards adjectives, again because the position is more typical of an adjective than of a noun.</Paragraph> <Paragraph position="3"> Two examples of the second most common error (tagging past tense forms (&quot;VBD&quot;) as past participles (&quot;VBN&quot;)) are &quot;called&quot; and &quot;elected&quot; in the following sentences: * the party called for government operation of all utilities * When I come back here after the November election you'll think, you're my man - elected. Most of the VBD/VBN errors were caused by words that have a higher prior for &quot;VBN&quot;, so that in a situation in which both forms are possible according to local syntactic context, &quot;VBN&quot; is chosen. More global syntactic context is necessary to find the right tag &quot;VBD&quot; in the first sentence. The second sentence is an example of one of the tagging mistakes in the Brown corpus: &quot;elected&quot; is clearly used as a past participle, not as a past tense form.</Paragraph> <Paragraph position="4"> Comparison with other Results. Charniak et al.'s result of 95.97% (Charniak et al., 1993) is slightly better than ours. This difference is probably due to the omission of rare tags that permit reliable prediction of the following tag (the case of &quot;HVZ*&quot; for &quot;hasn't&quot;). Kupiec achieves up to 96.36% correctness (Kupiec, 1992), without using a tagged corpus for training as we do. But the results are not easily comparable with ours since a lexicon is used that lists only possible tags. This can result in increasing the error rate when tags are listed in the lexicon that do not occur in the corpus. But it can also decrease the error rate when errors due to bad tags for rare words are avoided by looking them up in the lexicon. Our error rate on words that do not occur in the training text is 57%, since only the general priors are used for these words in decoding. This error rate could probably be reduced substantially by incorporating outside lexical information.</Paragraph> </Section> </Paper>