PART-OF-SPEECH TAGGING USING 
A VARIABLE MEMORY MARKOV MODEL 
Hinrich Schiitze 
Center for the Study of 
Language and Information 
Stanford, CA 94305-4115 
Internet: schuetze~csli.stanford.edu 
Yoram Singer 
Institute of Computer Science and 
Center for Neural Computation 
Hebrew University, Jerusalem 91904 
Internet: singer@cs.huji.ac.il 
Abstract 
We present a new approach to disambiguating syn- 
tactically ambiguous words in context, based on 
Variable Memory Markov (VMM) models. In con- 
trast to fixed-length Markov models, which predict 
based on fixed-length histories, variable memory 
Markov models dynamically adapt their history 
length based on the training data, and hence may 
use fewer parameters. In a test of a VMM based 
tagger on the Brown corpus, 95.81% of tokens are 
correctly classified. 
INTRODUCTION 
Many words in English have several parts of speech 
(POS). For example "book" is used as a noun in 
"She read a book." and as a verb in "She didn't 
book a trip." Part-of-speech tagging is the prob- 
lem of determining the syntactic part of speech of 
an occurrence of a word in context. In any given 
English text, most tokens are syntactically am- 
biguous since most of the high-frequency English 
words have several parts of speech. Therefore, a 
correct syntactic classification of words in context 
is important for most syntactic and other higher- 
level processing of natural language text. 
Two stochastic methods have been widely 
used for POS tagging: fixed order Markov models 
and Bidden Markov models. Fixed order Markov 
models are used in (Church, 1989) and (Charniak 
et al., 1993). Since the order of the model is as- 
sumed to be fixed, a short memory (small order) is 
typically used, since the number of possible combi- 
nations grows exponentially. For example, assum- 
ing there are 184 different tags, as in the Brown 
corpus, there are 1843 = 6,229,504 different or- 
der 3 combinations of tags (of course not all of 
these will actually occur, see (Weischedel et al., 
1993)). Because of the large number of param- 
eters higher-order fixed length models are hard 
to estimate. (See (Brill, 1993) for a rule-based 
approach to incorporating higher-order informa- 
tion.) In a Hidden iarkov Model (HMM) (Jelinek, 
1985; Kupiec, 1992), a different state is defined 
for each POS tag and the transition probabilities 
and the output probabilities are estimated using 
the EM (Dempster et al., 1977) algorithm, which 
guarantees convergence to.a local minimum (Wu, 
1983). The advantage of an HMM is that it can be 
trained using untagged text. On the other hand, 
the training procedure is time consuming, and a 
fixed model (topology) is assumed. Another dis- 
advantage is due to the local convergence proper- 
ties of the EM algorithm. The solution obtained 
depends on the initial setting of the model's pa- 
rameters, and different solutions are obtained for 
different parameter initialization schemes. This 
phenomenon discourages linguistic analysis based 
on the output of the model. 
We present a new method based on vari- 
able memory Markov models (VMM) (Ron et al., 
1993; Ron et al., 1994). The VMM is an approx- 
imation of an unlimited order Markov source. It 
can incorporate both the static (order 0) and dy- 
namic (higher-order) information systematically, 
while keeping the ability to change the model due 
to future observations. This approach is easy to 
implement, the learning algorithm and classifica- 
tion of new tags are computationally efficient, and 
the results achieved, using simplified assumptions 
for the static tag probabilities, are encouraging. 
VARIABLE MEMORY MARKOV 
MODELS 
Markov models are a natural candidate for lan- 
guage modeling and temporal pattern recognition, 
mostly due to their mathematical simplicity. How- 
ever, it is obvious that finite memory Markov mod- 
els cannot capture the recursive nature of lan- 
guage, nor can they be trained effectively with 
long memories. The notion of variable contez~ 
length also appears naturally in the context of uni- 
versal coding (Rissanen, 1978; Rissanen and Lang- 
don, 1981). This information theoretic notion is 
now known to be closely related to efficient mod- 
eling (Rissanen, 1988). The natural measure that 
181 
appears in information theory is the description 
length, as measured by the statistical predictabil- 
ity via the Kullback-Leibler (KL) divergence. 
The VMM learning algorithm is based on min- 
imizing the statistical prediction error of a Markov 
model, measured by the instantaneous KL diver- 
gence of the following symbols, the current statisti- 
cal surprise of the model. The memory is extended 
precisely when such a surprise is significant, until 
the overall statistical prediction of the stochastic 
model is sufficiently good. For the sake of sim- 
plicity, a POS tag is termed a symbol and a se- 
quence of tags is called a string. We now briefly de- 
scribe the algorithm for learning a variable mem- 
ory Markov model. See (Ron et al., 1993; Ron et 
al., 1994) for a more detailed description of the 
algorithm. 
We first introduce notational conventions and 
define some basic concepts. Let \]E be a finite al- 
phabet. Denote by \]~* the set of all strings over 
\]E. A string s, over L TM of length n, is denoted 
by s = sls2...sn. We denote by • the empty 
string. The length of a string s is denoted by 
Isl and the size of an alphabet \]~ is denoted by 
\[\]~1. Let Prefix(s) = SlS2...Sn_l denote the 
longest prefix of a string s, and let Prefix*(s) 
denote the set of all prefixes of s, including the 
empty string. Similarly, Suffix(s) = s2sz...s, 
and Suffix* (s) is the set of all suffixes of s. A set 
of strings is called a suffix (prefix) free set if, V s E 
S: SNSuffiz*(s ) = $ (SNPrefiz*(s) = 0). 
We call a probability measure P, over the strings 
in E* proper if P(o) = 1, and for every string s, 
Y~,er P(sa) = P(s). Hence, for every prefix free 
set S, ~'~,es P(s) < 1, and specifically for every 
integer n > O, ~'~se~, P(s) = 1. 
A prediction suffix tree T over \]E, is a tree 
of degree I~l. The edges of the tree are labeled 
by symbols from ~E, such that from every internal 
node there is at most one outgoing edge labeled 
by each symbol. The nodes of the tree are labeled 
by pairs (s,%) where s is the string associated 
with the walk starting from that node and end- 
ing in the root of the tree, and 7s : ~ ---* \[0,1\] 
is the output probability function of s satisfying 
)"\]~o~ 7s (a) = 1. A. prediction suffix, tree. induces 
probabilities on arbitrarily long strings m the fol- 
lowing manner. The probability that T gener- 
ates a string w = wtw2...wn in E~, denoted by 
PT(w), is IIn=l%.i-,(Wi), where s o = e, and for 
1 < i < n - 1, s J is the string labeling the deep- 
est node reached by taking the walk corresponding 
to wl...wi starting at the root of T. By defini- 
tion, a prediction suffix tree induces a proper mea- 
sure over E*, and hence for every prefix free set 
of strings {wX,...,wm}, ~=~ PT(w i) < 1, and 
specifically for n > 1, then ~,E~, PT(S) = 1. 
A Probabilistic Finite Automaton (PFA) A is 
a 5-tuple (Q, E, r, 7, ~), where Q is a finite set of 
n states, ~ is an alphabet of size k, v : Q x E --~ Q 
is the transition function, 7 : Q × E ~ \[0,1\] is the 
output probability function, and ~r : Q ~ \[0,1\] is 
the probability distribution over the start states. 
The functions 3' and r must satisfy the following 
requirements: for every q E Q, )-'~oe~ 7(q, a) = 
1, and ~e~O rr(q) = 1. The probability that 
A generates a string s = sls2...s. E En 0 n 
is PA(s) = ~-~qoEq lr(q ) I-Ii=x 7(q i-1, sl), where 
qi+l ~_ r(qi,si). 7" can be extended to be de- 
fined on Q x E* as follows: 7"(q, sts2...st) = 
7"(7"(q, st...st-x),st) = 7"(7"(q, Prefiz(s)),st). 
The distribution over the states, 7r, can be re- 
placed by a single start state, denoted by e such 
that r(¢, s) = 7r(q), where s is the label of the state 
q. Therefore, r(e) = 1 and r(q) = 0 if q # e. 
For POS tagging, we are interested in learning 
a sub-class of finite state machines which have the 
following property. Each state in a machine M 
belonging to this sub-class is labeled by a string 
of length at most L over E, for some L _> O. The 
set of strings labeling the states is suffix free. We 
require that for every two states qX, q2 E Q and 
for every symbol a E ~, if r(q 1,or) = q2 and qt 
is labeled by a string s 1, then q2 is labeled by 
a string s ~ which is a suffix of s 1 • or. Since the 
set of strings labeling the states is suffix free, if 
there exists a string having this property then it 
is unique. Thus, in order that r be well defined on 
a given set of string S, not only must the set be 
suffix free, but it must also have the property, that 
for every string s in the set and every symbol a, 
there exists a string which is a suffix of scr. For our 
convenience, from this point on, if q is a state in 
Q then q will also denote the string labeling that 
state. 
A special case of these automata is the case 
in which Q includes all I~l L strings of length L. 
These automata are known as Markov processes of 
order L. We are interested in learning automata 
for which the number of states, n, is much smaller 
than IEI L, which means that few states have long 
memory and most states have a short one. We re- 
fer to these automata as variable memory Markov 
(VMM) processes. In the case of Markov processes 
of order L, the identity of the states (i.e. the iden- 
tity of the strings labeling the states) is known and 
learning such a process reduces to approximating 
the output probability function. 
Given a sample consisting of m POS tag se- 
quences of lengths Ix,12,..., l,~ we would like to 
find a prediction suffix tree that will have the 
same statistical properties as the sample and thus 
can be used to predict the next outcome for se- 
c;uences generated by the same source. At each 
182 
stage we can transform the tree into a variable 
memory Markov process. The key idea is to iter- 
atively build a prediction tree whose probability 
measure equals the empirical probability measure 
calculated from the sample. 
We start with a tree consisting of a single 
node and add nodes which we have reason to be- 
lieve should be in the tree. A node as, must be 
added to the tree if it statistically differs from its 
parent node s. A natural measure to check the 
statistical difference is the relative entropy (also 
known as the Kullback-Leibler (KL) divergence) 
(Kullback, 1959), between the conditional proba- 
bilities P(.Is) and P(.las). Let X be an obser- 
vation space and P1, P2 be probability measures 
over X then the KL divergence between P1 and 
P1 x P2 is, D L(PIlIP )= • In 
our case, the KL divergence measures how much 
additional information is gained by using the suf- 
fix ~rs for prediction instead of the shorter suffix s. 
There are cases where the statistical difference is 
large yet the probability of observing the suffix as 
itself is so small that we can neglect those cases. 
Hence we weigh the statistical error by the prior 
probability of observing as. The statistical error 
measure in our case is, 
Err(as, s) 
= P(crs)DgL (P(.las)llP(.ls)) 
= P(as) P(a'las) log 
: ~,0,~ P(asa')log p(P/s°;p'() ) 
Therefore, a node as is added to the tree if the sta- 
tistical difference (defined by Err(as, s)) between 
the node and its parrent s is larger than a prede- 
termined accuracy e. The tree is grown level by 
level, adding a son of a given leaf in the tree when- 
ever the statistical error is large. The problem is 
that the requirement that a node statistically dif- 
fers from its parent node is a necessary condition 
for belonging to the tree, but is not sufficient. The 
leaves of a prediction suffix tree must differ from 
their parents (or they are redundant) but internal 
nodes might not have this property. Therefore, 
we must continue testing further potential descen- 
dants of the leaves in the tree up to depth L. In 
order to avoid exponential grow in the number of 
strings tested, we do not test strings which belong 
to branches which are reached with small prob- 
ability. The set of strings, tested at each step, 
is denoted by S, and can be viewed as a kind of 
frontier of the growing tree T. 
USING A VMM FOR POS 
TAGGING 
We used a tagged corpus to train a VMM. The 
syntactic information, i.e. the probability of a spe- 
183 
cific word belonging to a tag class, was estimated 
using maximum likelihood estimation from the in- 
dividual word counts. The states and the transi- 
tion probabilities of the Markov model were de- 
termined by the learning algorithm and tag out- 
put probabilities were estimated from word counts 
(the static information present in the training cor- 
pus). The whole structure, for two states, is de- 
picted in Fig. 1. Si and Si+l are strings of tags cor- 
responding to states of the automaton. P(ti\[Si) 
is the probability that tag ti will be output by 
state Si and P(ti+l\]Si+l) is the probability that 
the next tag ti+l is the output of state Si+l. 
P(Si+llSi) 
V 7 
P(TilSi) P(Ti+IlSi+I) 
Figure 1: The structure of the VMM based POS 
tagger. 
When tagging a sequence of words Wl,,, we 
want to find the tag sequence tl,n that is most 
likely for Wl,n. We can maximize the joint proba- 
bility of wl,, and tl,n to find this sequence: 1 
T(Wl,n) = arg maxt,, P(tl,nlWl,n) P(t,..,~,,.) 
= arg maxt~,. P(wl,.) 
= arg maxt~,.P(tl,.,wl,. ) 
P(tl,., Wl,.) can be expressed as a product of con- 
ditional probabilities as follows: 
P(tl,., Wl,.) = 
P(ts)P(wl Itl)P(t~ltl, wl)e(w21tl,2, wl) 
... P(t. It 1,._ 1, Wl,.-1)P(w. It1,., wl,.- 1) 
= fi P(tiltl,i-1, wl,i-1)P(wiltl,i, Wl,/-1) 
i=1 
With the simplifying assumption that the proba- 
bility of a tag only depends on previous tags and 
that the probability of a word only depends on its 
tags, we get: 
P(tl,n, wl,.) = fix P(tiltl,i-1) P(wilti) 
i=1 
Given a variable memory Markov model M, 
P(tilQ,i-1) is estimated by P(tilSi-l,M) where 
1 Part of the following derivation is adapted from 
(Charniak et al., 1993). 
Si = r(e, tx,i), since the dynamics of the sequence 
are represented by the transition probabilities of 
the corresponding automaton. The tags tl,n for 
a sequence of words wt,n are therefore chosen ac- 
cording to the following equation using the Viterbi 
algorithm: 
t% 
7-M(Wl,n) -- arg maxq.. H P(tilSi-l' M)P(wilti) 
i=1 
We estimate P(wilti) indirectly from P(tilwi) us- 
ing Bayes' Theorem: 
P(wilti) = P(wi)P(tilwi) P(ti) 
The terms P(wi) are constant for a given sequence 
wi and can therefore be omitted from the maxi- 
mization. We perform a maximum likelihood es- 
timation for P(ti) by calculating the relative fre- 
quency of ti in the training corpus. The estima- 
tion of the static parameters P(tilwi) is described 
in the next section. 
We trained the variable memory Markov 
model on the Brown corpus (Francis and Ku~era, 
1982), with every tenth sentence removed (a total 
of 1,022,462 tags). The four stylistic tag modifiers 
"FW" (foreign word), "TL" (title), "NC" (cited 
word), and "HL" (headline) were ignored reduc- 
ing the complete set of 471 tags to 184 different 
tags. 
The resulting automaton has 49 states: the 
null state (e), 43 first order states (one symbol 
long) and 5 second order states (two symbols 
long). This means that 184-43=141 states were 
not (statistically) different enough to be included 
as separate states in the automaton. An analy- 
sis reveals two possible reasons. Frequent symbols 
such as "ABN" ("half", "all", "many" used as pre- 
quantifiers, e.g. in "many a younger man") and 
"DTI" (determiners that can be singular or plu- 
ral, "any" and "some") were not included because 
they occur in a variety of diverse contexts or often 
precede unambiguous words. For example, when 
tagged as "ABN .... half", "all", and "many" tend 
to occur before the unambiguous determiners "a", 
"an" and "the". 
Some rare tags were not included because they 
did not improve the optimization criterion, min- 
imum description length (measured by the KL- 
divergence). For example, "HVZ*" ("hasn't") is 
not a state although a following "- ed" form is al- 
ways disambiguated as belonging to class "VBN" 
(past participle). But since this is a rare event, de- 
scribing all "HVZ* VBN" sequences separately is 
cheaper than the added complexity of an automa- 
ton with state "HVZ*". We in fact lost some ac- 
curacy in tagging because of the optimization cri- 
terion: Several "-ed" forms after forms of "have" 
were mistagged as "VBD" (past tense). 
transition to one-symbol two-symbol 
state state 
NN JJ: 0.45 AT JJ: 0.69 
IN JJ: 0.06 AT JJ: 0.004 
IN NN: 0.27 AT NN: 0.35 
NN: 0.14 AT NN: 0.10 
NN 
IN 
NN 
JJ 
VB 
VBN 
VBN: 0.08 AT VBN: 0.48 
VBN: 0.35 AT VBN: 0.003 
CC: 0.12 JJ CC: 0.04 
CC: 0.09 JJ CC: 0.58 
RB: 0.05 MD RB: 0.48 
RB: 0.08 MD RB: 0.0009 
Table 1: States for which the statistical predic- 
tion is significantly different when using a longer 
suffix for prediction. Those states are identified 
automatically by the VMM learning algorithm. A 
better prediction and classification of POS-tags is 
achieved by adding those states with only a small 
increase in the computation time. 
The two-symbol states were "AT JJ", "AT 
NN", "AT VBN", "JJ CC", and "MD RB" (ar- 
ticle adjective, article noun, article past partici- 
ple, adjective conjunction, modal adverb). Ta- 
ble 1 lists two of the largest differences in transi- 
tion probabilities for each state. The varying tran- 
sition probabilities are based on differences be- 
tween the syntactic constructions in which the two 
competing states occur. For example, adjectives 
after articles ("AT JJ") are almost always used 
attributively which makes a following preposition 
impossible and a following noun highly probable, 
whereas a predicative use favors modifying prepo- 
sitional phrases. Similarly, an adverb preceded by 
a modal ("MD RB") is followed by an infinitive 
("VB") half the time, whereas other adverbs oc- 
cur less often in pre-infinitival position. On the 
other hand, a past participle is virtually impossi- 
ble after "MD RB" whereas adverbs that are not 
preceded by modals modify past participles quite 
often. 
While it is known that Markov models of order 
2 give a slight improvement over order-1 models 
(Charniak et al., 1993), the number of parameters 
in our model is much smaller than in a full order-2 
Markov model (49"184 = 9016 vs. 184"184"184 -- 
6,229,504). 
ESTIMATION OF THE STATIC 
PARAMETERS 
We have to estimate the conditional probabilities 
P(ti\[wJ), the probability that a given word ufi will 
appear with tag t i, in order to compute the static 
parameters P(w j It/) used in the tagging equations 
described above. A first approximation would be 
184 
to use the maximum likelihood estimator: 
p(ti\[w j) = C( ti, w i) 
c(w ) 
where C(t i, w j) is the number of times t i is tagged 
as w~ in the training text and C(wJ) is the num- 
ber of times w/ occurs in the training text. How- 
ever, some form of smoothing is necessary, since 
any new text will contain new words, for which 
C(w j) is zero. Also, words that are rare will only 
occur with some of their possible parts of speech 
in the training text. One solution to this problem 
is Good-Turing estimation: 
p(tilwj) _ C(t', wJ) + 1 
c(wJ) + I 
where I is the number of tags, 184 in our case. 
It turns out that Good-Turing is not appropri- 
ate for our problem. The reason is the distinction 
between closed-class and open-class words. Some 
syntactic classes like verbs and nouns are produc- 
tive, others like articles are not. As a consequence, 
the probability that a new word is an article is 
zero, whereas it is high for verbs and nouns. We 
need a smoothing scheme that takes this fact into 
account. 
Extending an idea in (Charniak et al., 1993), 
we estimate the probability of tag conversion to 
find an adequate smoothing scheme. Open and 
closed classes differ in that words often add a tag 
from an open class, but rarely from a closed class. 
For example, a word that is first used as a noun 
will often be used as a verb subsequently, but 
closed classes such as possessive pronouns ("my", 
"her", "his") are rarely used with new syntactic 
categories after the first few thousand words of the 
Brown corpus. We only have to take stock of these 
"tag conversions" to make informed predictions on 
new tags when confronted with unseen text. For- 
mally, let W\] ''~ be the set of words that have been 
seen with t i, but not with t k in the training text up 
to word wt. Then we can estimate the probability 
that a word with tag t i will later be seen with tag 
t ~ as the proportion of words allowing tag t i but 
not t k that later add tk: 
P~m(i --* k) = 
I{nll<n<m ^ i ~k , ~k wnEW I" OW,,- t ^t~=t~}l 
iw~'.-kl 
This formula also applies to words we haven't seen 
so far, if we regard such words as having occurred 
with a special tag "U" for "unseen". (In this case, 
W~ '-'k is the set of words that haven't occurred up 
to l.) PI,n(U ---* k) then estimates the probability 
that an unseen word has tag t k. Table 2 shows 
the estimates of tag conversion we derived from 
our training text for 1 = 1022462- 100000, m = 
1022462, where 1022462 is the number of words in 
the training text. To avoid sparse data problems 
we assumed zero probability for types of tag con- 
version with less than 100 instances in the training 
set. 
tag conversion 
U --* NN 
U~JJ 
U --~ NNS 
U --* NP 
U ~ VBD 
U ~ VBG 
U --~ VBN 
U --~ VB 
U---, RB 
U ~ VBZ 
U --* NP$ 
VBD -~ VBN 
VBN --* VBD 
VB --* NN 
NN ~ VB 
estimated probability 
0.29 
0.13 
0.12 
0.08 
0.07 
0.07 
0.06 
0.05 
0.05 
0.01 
0.01 
0.09 
0.05 
0.05 
0.01 
Table 2: Estimates for tag conversion 
Our smoothing scheme is then the following 
heuristic modification of Good-Turing: 
C(t i, W j) -k ~k,ETi Rim(k1 --+ i) 
g(tilwi) = C(wi) + Ek,ETi,k2E T Pam(kz --" ks) 
where Tj is the set of tags that w/has in the train- 
ing set and T is the set of all tags. This scheme 
has the following desirable properties: 
• As with Good-Turing, smoothing has a small ef- 
fect on estimates that are based on large counts. 
• The difference between closed-class and open- 
class words is respected: The probability for 
conversion to a closed class is zero and is not 
affected by smoothing. 
• Prior knowledge about the probabilities of con- 
version to different tag classes is incorporated. 
For example, an unseen word w i is five times as 
likely to be a noun than an adverb. Our esti- 
mate for P(ti\]w j) is correspondingly five times 
higher for "NN" than for "RB". 
ANALYSIS OF RESULTS 
Our result on the test set of 114392 words (the 
tenth of the Brown corpus not used for training) 
was 95.81%. Table 3 shows the 20 most frequent 
errors. 
Three typical examples for the most common 
error (tagging nouns as adjectives) are "Commu- 
nist", "public" and "homerun" in the following 
sentences. 
185 
VMM: 
correct: 
NN 
VBD 
NNS 
VBN 
JJ 
VB 
"'CS 
'NP 
IN 
VBG 
RB 
QL 
\]1  JIVBNI NIVB°I INI °sI 
259 102 
110 
63 
227 
165 
142 
194 
94 
219 
112 
63 
103 
RPIQLI B 
100 
71 
76 
Table 3: Most common errors. 
VB I VBG 
69 66 
* the Cuban fiasco and the Communist military 
victories in Laos 
• to increase public awareness of the movement 
• the best homerun hitter 
The words "public" and "communist" can be used 
as adjectives or nouns. Since in the above sen- 
tences an adjective is syntactically more likely, 
this was the tagging chosen by the VMM. The 
noun "homerun" didn't occur in the training set, 
therefore the priors for unknown words biased the 
tagging towards adjectives, again because the po- 
sition is more typical of an adjective than of a 
noun. 
Two examples of the second most common er- 
ror (tagging past tense forms ("VBD") as past 
participles ("VBN")) are "called" and "elected" 
in the following sentences: 
• the party called for government operation of all 
utilities 
• When I come back here after the November elec- 
tion you'll think, you're my man - elected. 
Most of the VBD/VBN errors were caused by 
words that have a higher prior for "VBN" so that 
in a situation in which both forms are possible ac- 
cording to local syntactic context, "VBN" is cho- 
sen. More global syntactic context is necessary 
to find the right tag "VBD" in the first sentence. 
The second sentence is an example for one of the 
tagging mistakes in the Brown corpus, "elected" 
is clearly used as a past participle, not as a past 
tense form. 
Comparison with other Results 
Charniak et al.'s result of 95.97% (Charniak et al., 
1993) is slightly better than ours. This difference 
is probably due to the omission of rare tags that 
permit reliable prediction of the following tag (the 
case of "HVZ." for "hasn't"). 
Kupiec achieves up to 96.36% correctness 
(Kupiec, 1992), without using a tagged corpus for 
training as we do. But the results are not eas- 
ily comparable with ours since a lexicon is used 
that lists only possible tags. This can result in in- 
creasing the error rate when tags are listed in the 
lexicon that do not occur in the corpus. But it can 
also decrease the error rate when errors due to bad 
tags for rare words are avoided by looking them up 
in the lexicon. Our error rate on words that do not 
occur in the training text is 57%, since only the 
general priors are used for these words in decod- 
ing. This error rate could probably be reduced 
substantially by incorporating outside lexical in- 
formation. 
DISCUSSION 
While the learning algorithm of a VMM is efficient 
and the resulting tagging algorithm is very simple, 
the accuracy achieved is rather moderate. This is 
due to several reasons. As mentioned in the intro- 
ductory sections, any finite memory Markov model 
cannot capture the recursive nature of natural lan- 
guage. The VMM can accommodate longer sta- 
tistical dependencies than a traditional full-order 
Markov model, but due to its Markovian nature 
long-distance statistical correlations are neglected. 
Therefore, a VMM based tagger can be used for 
pruning many of the tagging alternatives using its 
prediction probability, but not as a complete tag- 
ging system. Furthermore, the VMM power can 
be better utilized in low level language process- 
ing tasks such as cleaning up corrupted text as 
demonstrated in (Ron et al., 1993). 
We currently investigate other stochastic 
models that can accommodate long distance sta- 
tistical correlation (see (Singer and Tishby, 1994) 
for preliminary results). However, there are theo- 
retical clues that those models are much harder to 
learn (Kearns et al., 1993), including HMM based 
models (Abe and Warmuth, 1992). 
186 
Another drawback of the current tagging 
scheme is the independence assumption of the un- 
derlying tags and the observed words, and the ad- 
hoc estimation of the static probabilities. We are 
pursuing a systematic scheme to estimate those 
probabilities based on Bayesian statistics, by as- 
signing a discrete probability distribution, such as 
the Dirichlet distribution (Berger, 1985), to each 
tag class. The a-posteriori probability estimation 
of the individual words can be estimated from the 
word counts and the tag class priors. Those priors 
can be modeled as a mixture of Dirichlet distribu- 
tions (Antoniak, 1974), where each mixture com- 
ponent would correspond to a different tag class. 
Currently we estimate the state transition prob- 
abilities from the conditional counts assuming a 
uniform prior. The same technique can be used to 
estimate those parameters as well. 
ACKNOWLEDGMENT 
Part of this work was done while the second au- 
thor was visiting the Department of Computer 
and Information Sciences, University of California, 
Santa-Cruz, supported by NSF grant IRI-9123692. 
We would like to thank Jan Pedersen and Naf- 
tali Tishby for helpful suggestions and discussions 
of this material. Yoram Singer would like to thank 
the Charles Clore foundation for supporting this 
research. We express our appreciation to faculty 
and students for the stimulating atmosphere at 
the 1993 Connectionist Models Summer School at 
which the idea for this paper took shape. 

References 
N. Abe and M. Warmuth, On the computational 
complexity of approximating distributionsby 
probabilistic automata, Machine Learning, 
Vol. 9, pp. 205-260, 1992. 
C. Antoniak, Mixture of Dirichlet processes with 
applications to Bayesian nonparametric prob- 
lems, Annals of Statistics, Vol. 2, pp. 1152- 
174, 1974. 
J. Berger, Statistical decision theory and Bayesian 
analysis, New-York: Springer-Verlag, 1985. 
E. Brill. Automatic grammar induction and pars- 
ing free text: A transformation-based ap- 
proach. In Proceedings of ACL 31, pp. 259- 
265, 1993. 
E. Charniak, Curtis Hendrickson, Neil Jacobson, 
and Mike Perkowitz, Equations for Part-of- 
Speech Tagging, Proceedings of the Eleventh 
National Conference on Artificial Intelligence, 
pp. 784-789, 1993. 
K.W. Church, A Stochastic Parts Program and 
Noun Phrase Parser for Unrestricted Text, 
Proceedings of ICASSP, 1989. 
A. Dempster, N. Laird, and D. Rubin, Maximum 
Likelihood estimation from Incomplete Data 
via the EM algorithm, J. Roy. Statist. Soc., 
Vol. 39(B), pp. 1-38, 1977. 
W.N. Francis and F. Ku~era, Frequency Analysis 
of English Usage, Houghton Mifflin, Boston 
MA, 1982. 
F. Jelinek, Robust part-of-speech tagging using 
a hidden Markov model, IBM Tech. Report, 
1985. 
M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, 
R. Schapire, L. Sellie, On the Learnability of 
Discrete Distributions, The 25th Annual ACM 
Symposium on Theory of Computing, 1994. 
S. Kullback, Information Theory and Statistics, 
New-York: Wiley, 1959. 
J. Kupiec, Robust part-of-speech tagging using a 
hidden Markov model, Computer Speech and 
Language, Vol. 6, pp. 225-242, 1992. 
L.R. Rabiner and B. H. Juang, An Introduction 
to Hidden Markov Models, IEEE ASSP Mag- 
azine, Vol. 3, No. 1, pp. 4-16, 1986. 
J. Rissanen, Modeling by shortest data discription, 
Automatica, Vol. 14, pp. 465-471, 1978. 
J. Rissanen, Stochastic complexity and modeling, 
The Annals of Statistics, Vol. 14, No. 3, pp. 
1080-1100, 1986. 
J. Rissanen and G. G. Langdon, Universal model- 
ing and coding, IEEE Trans. on Info. Theory, 
IT-27, No. 3, pp. 12-23, 1981. 
D. Ron, Y. Singer, and N. Tishby, The power 
of Amnesia, Advances in Neural Information 
Processing Systems 6, 1993. 
D. Ron, Y. Singer, and N. Tishby, Learning 
Probabilistic Automata with Variable Memory 
Length, Proceedings of the 1994 Workshop on 
Computational Learning Theory, 1994. 
Y. Singer and N. Tishby, Inferring Probabilis- 
tic Acyclic Automata Using the Minimum 
Description Length Principle, Proceedings of 
IEEE Intl. Symp. on Info. Theory, 1994. 
R. Weischedel, M. Meteer, R. Schwartz, L. 
Ramshaw, and :I. Palmucci. Coping with am- 
biguity and unknown words through prob- 
abilistic models. Computational Linguistics, 
19(2):359-382, 1993. 
J. Wu, On the convergence properties of the EM 
algorithm, Annals of Statistics, Vol. 11, pp. 
95-103, 1983. 
