Handling Sparse Data by Successive Abstraction 
Christer Samuelsson 
Universit£t des Saarlandes, FR 8.7, Computcrlinguistik 
Postfach 1150, D-66041 Saarbrfickcn, Germany 
Internet: christer©coli.uni-sb, de 
Abstract 
A general, practical method for hand- 
ling sparse data that avoids held-out 
data and iterative reestimation is derived 
from first principles. It has been tested 
on a part-of-speech tagging task and out- 
performed (deleted) interpolation with 
context-independent weights, even when 
the latter used a globally optimal para- 
meter setting determined a posteriori. 
1 Introduction 
Sparse data is a perennial problem when applying 
statistical techniques to natural language proces- 
sing. The fundamental problem is that there is of- 
ten not enough data to estimate the required sta- 
tistical parameters, i.e., the probabilities, directly 
from the relative frequencies. This problem is ac- 
centuated by the fact that in the search for more 
accurate probabilistic language models, more and 
more contextual information is added, resulting in 
more and more complex conditionings of the cor- 
responding conditional probabilities. This in turn 
means that the number of observations tends to 
be quite small for such contexts. Over the years, 
a number of techniques have been proposed to 
handle this problem. 
One of two different main ideas behind these 
techniques is that complex contexts can be gene- 
ralized, and data from more general contexts can 
be used to improve the probability estimates for 
more specific contexts. This idea is usually re- 
ferred to as back-off smoothing, see (Katz 1987). 
These techniques typically require that a sepa- 
rate portion of the training data be held out from 
the parameter-estimation phase and saved for de- 
termining appropriate back-off weights. Further~ 
more, determining the back-off weights usually re- 
quires resorting to a time-consuming iterative ree- 
stimation procedure. A typical example of such a 
technique is "deleted interpolation", which is de- 
scribed in' Section 5.1 below. 
The other main idea is concerned with im- 
proving the estimates of low-frequency, or no- 
frequency, outcomes apparently without trying to 
generalize the conditionings. Instead, these tech- 
niques are based on considerations of how popu- 
lation frequencies in general tend to behave. Ex- 
amples of this are expected likelihood estimation 
(ELE), see Section 5.2 below, and Good-Turing 
estimation, see (Good 1953). 
We will here derive from first principles a practi- 
cal method for handling sparse data that does not 
need separate training data for determining the 
back-off weights and which lends itself to direct 
calculation, thus avoiding time-consuming reesti- 
mation procedures. 
2 Linear Successive Abstraction 
Assume that we want to estimate the conditional 
probability P(x I C) of tile outcome x given a 
context C from the number of times N~ it occurs 
in N = ICI trials, but that this data is sparse. 
Assume further that there is abundant data in a 
more general context C t D C that we want to use 
to get a better estimate of P(x I C). The idea is to 
let the probability estimate/5(x I C) in context C 
be a flmction g of the relative frequency f(x I C) 
of the outcome x in context C and the probability 
estimate P(x \[C') ill context C': 
IV) = g(f(  I IV')) 
Let us generalize this scenario slightly to the si- 
tuation were wc have a sequence of increasingly 
more general contexts Cm C Urn-1 C ... C C1, 
i.e., where there is a linear order of the various 
contexts Ck. We can then build the estimate of 
P(x I Ck) on the relative frequency f(x I Ck) 
in context Ck and the previously established esti- 
mate of P(x I Ck-1). Wc call this method li- 
near successive abstraction. A simple example is 
estimating the probability P(x I/n-j+l¢..., In) of 
word class x given l,-j+l,...,ln, tile last j let- 
ters of a word ll,...,l,. In this case, the esti- 
mate will be based on the relative frequencies 
f(x I l,,_~+,,..., l,,),..., f(x \[ In), f(x). 
We will here consider the special case when the 
flmction g is a weighted sum of the relative fre- 
quency and the previous estimate, appropriately 
895 
renormalized: 
f(x I + 0 P(x I P(x I Ck) = 
1+0 
We want the weight 0 to depend on the context 
Ck, and in particular be proportional to some 
measure of how spread out the relative frequen- 
cies of the various outcomes in context Ck are from 
the statistical mean. The variance is the quadratic 
moment w.r.t, the mean, and is thus such a mea- 
sure. However, we want the weight to have the 
same dimension as the statistical mean, and the 
dimension of the variance is obviously the square 
of the dimension of the mean. The square root 
of the variance, which is the standard deviation, 
should thus be a suitable quantity. For this rea- 
son we will use the standard deviation in Ck as 
a weight, i.e., 0 = ~r(Ck). One could of course 
multiply this quantity with any reasonable real 
constant, but we will arbitrarily set this constant 
to one, i.e., use ~r(Ck) itself. 
In linguistic applications, the outcomes are 
usually not real numbers, but pieces of lingui- 
stic structure such as words, part-of-speech tags, 
grammar rules, bits of semantic tissue, etc. This 
means that it is not quite obvious what the stan- 
dard deviation, or the statistical mean for that 
matter, actually should be. To put it a bit more 
abstractly, we need to calculate the standard de- 
viation of a non-numerical random variable. 
2.1 Deriving the Standard Deviation 
So how do we find the standard deviation of a 
non-numerical random variable? One way is to 
construct an equivalent numerical random varia- 
ble and use the standard deviation of the latter. 
This can be done in several different ways. The 
one we will use is to construct a numerical random 
variable with a uniform distribution that has the 
same entropy as the non-numerical one. Whether 
we use a discrete or continuous random variable 
is, as we shall see, of no importance. 
We will first factor out the dependence on the 
context size. Quite in general, if ~N is the sample 
mean of N independent observations of any nu- 
merical random variable ( with variance a0 2, i.e., 
-} N = ~-,i=1 (i, then 
~2 = Var\[~N\] = 
1N 1 N ~ 
---- Var\[ '~-~(i\] = ~ ~Var\[(i\]---- 
i=1 i=1 
In our case, the number of observations N is sim- 
ply the size of the context Ck, by which we mean 
the number of times Ck occurred in the training 
data, i.e., the frequency count of Ck, which we will 
denote \]Ck\[. Since the standard deviation is the 
square root of the variance, we have 
o-(cn = Vic,,I 
Here ~r0 does not depend on the number of obser- 
vations in'cofftext Ck, only on the underlying pro- 
bability distribution conditional on context Ck. 
To estimate cr0(Ck), we assume that we have eit- 
her a discrete uniform distribution on {1,..., M} 
or a continuous uniform distribution on \[0, M\] 
that is as hard to predict as the one in Ck in the 
sense that the entropy is the same. The entropy 
H\[~\] of a random variable ~ is the expectation va- 
lue of the logarithm of P((). In the discrete case 
we thus have 
H\[(\] = E\[-lnP(()\] : ~-~-P(xi) lnP(xi) 
i 
tIere P(xi) is the probability of the random va- 
riable ( taking the value xi, which is ~ for all 
possible outcomes xi and zero otherwise. Thus, 
the entropy is In M: 
M 
E-P(xi) lnP(xi) = E-~-lnM------ = lnM 
i i=1 
The continuous case is similar. We thus have that 
ln M = H\[Ck\] or M = e IIICk\] 
The variance of these uniform distributions is M 2 
1--T in the continuous case and ~ in the dis- 
crete case. We thus have 
M 1 1 
cr°(Ck) = X/~ x/r~M -1 - xff2e -H\[ckl 
Unfortunately, the entropy It\[Ck\] depends on the 
probability distribution of context Ck and thus 
on Cro(Ck). Since we want to avoid trying to solve 
highly nonlinear equations, and since we have ac- 
cess to an estimate of the probability distribution 
of context Ck-1, we will make the following ap- 
proximation: 
O'0(Ck-1) 1 ~(Ck) ~ 
It is starting to look sensible to specify ~r- 1 instead 
of ~, i.e., instead of ~ we will write lq-o" ' c~-I q-1 " 
2.2 The Final Recurrence Formula 
We have thus established a recurrence formula for 
the estimate of the probability distribution in con- 
text Ck given the estimate of the probability dis- 
tribution in context Ck-1 and the relative frequen- 
cies in context Ck: 
P(xICk) = (1) 
,r(Ck)-ly(x C~) + p(x I C~-l) 
~(Ck) -~ + 1 
and 
 (cn = 
We will start by estimating the probability distri- 
bution in the most general context C1, if necessary 
896 
directly from the relative frequencies. Since this 
is the most general context, this will be the con- 
text with the most training data. Thus it stands 
the best chances of the relative frequencies being 
acceptably accurate estimates. This will allow us 
to calculate an estimate of the probability distri- 
bution in context C2, which in turn will allow ns 
to calculate an estimate of the probability distri- 
bution in context Ca, etc. We can thus calcu- 
late estimates of the probability distributions in 
all contexts C1,..., Cm. 
We will next consider some examples from part- 
of-speech tagging. 
3 Examples from PoS Tagging 
Part-of-speech (PoS) tagging consists in assigning 
to each word of an input text a (set of) tag(s) from 
a finite set of possible tags, a tag palette or a tag 
set. The reason that this is a research issue is that 
a word can in general be assigned different tags 
depending on context. In statistical tagging, the 
relevant information is extracted from a training 
text and fitted into a statistical language model, 
which is then used to assign the most likely tag to 
each word in the input text. 
The statistical language model usually consists 
of lexical probabilities, which determine the pro- 
bability of a particular tag conditional on the par- 
ticular word, and contextual probabilities, which 
determine the probability of a particular tag con- 
ditional on the surrounding tags. The latter con- 
ditioning is usually on the tags of the neighbouring 
words, and very often on the N - 1 previous tags, 
so-called (tag) N-gram statistics. These proba- 
bilities can bc estimated either from a pretagged 
training corpus or from untagged text, a lexicon 
and an initial bias. We will here consider the for- 
mer case. 
Statistical taggers usually work as follows: 
First, each word in the input word string 
1471, • .., W, is assigned all possible tags according 
to the lexicon, thereby creating a lattice. A dyna- 
mic programming technique is then used to find 
tag the sequence 5/\],..., ~, that maximizes 
P(T1,...,Tn I Wl, . .., Wn) = 
tt 
= IIP(Tk T1,...,Tk-1;Wl,...,Wn) 
k=l 
1:=1 
P(Tk Tk-N+I,...,Tk-1; VIZk) 
• 7' ".P(T~ wk) P(Tk ~k-N+l,..., k-l) \[ 
k=l 
fl P(Tk 
P(T,) 
Tk-N+~,..., ~-~)" P(Wk I Tk) 
~:~ P(Wk) 
Since the maximum does not depend on the fac- 
tors P(Wk), these can be omitted, yielding the 
standard statistical PoS tagging task: 
max \]-\[ P(Tk IU~-~V+~,...,Tk-J.P(Wk JT~) 
TI ,...,T~, t~l= 
This is well-described in for example (DeRose 
1988). 
We thus have to estimate the two following sets 
of probabilities: 
• Lexical probabilities: 
The probability of each tag T i conditional on 
the word W that is to be tagged, p(r' I I wr! i 
Often the converse probabilities P(W 
are given instead, but we will for reasons soou 
to become apparent use the former formula- 
tion. 
® Tag N-grams: 
The probability of tag T i at position k in 
the input string, denoted T~, given that tags 
7~.-N+1 T ,..., k-1 have been assigned to the 
previous N- 1 words• Often N is set to two or 
three, and thus bigralns or trigrams are em- 
ployed. When using trigram statistics, this 
quantity is P(T~ \]7' k-~,Tk-1). 
3.1 N-gram Back-off Smoothing 
We will first consider estimating the N-gram pro- 
babilities P(T~ \]Tk-N+I,...,Tk-1). IIere, there 
is an obvious sequence of generalizations of the 
context 5/~-N+1,..., 7~-1 with a linear order, na- 
mely ~/~--N+I ~ C Tk-N+2, ,Tk-1 C ,..., k-1 ... 
• cT • . k-1 C fl, where f~ means "no information", 
corresponding to the nnigram probabilities. Tiros 
we will repeatedly strip off the tag furthest from 
the current word and use the estimate of the pro- 
bability distribution in this generalized context to 
improve the estimate in the current context. This 
means that when estimating the (j + 1)-gram pro- 
babilities, we back off to the estimate of the j- 
gram probabilities. 
7' So when estimating P(T\[ I Tk-j,..., ~-~), we 
simply strip off the tag 5~_j and apply Eq. (1): 
~-~(~'~ \[ Tk-j,...,Tk-i) = 
--1 ,i r, f(% I :tk-j,..., rk-~) = + 
+ P(T  
and 
~-1+1 
~-~+1,..., Tk-O 
~-1 + 1 
• ., 7}~_ 1\[ e-ll\[Tk_j+l ..... T~._ 11 
3.2 Handling Unknown Words 
We will next consider improving the probability 
estimates for unknown words, i.e., words that do 
not occur in tile training corpus, and for which we 
therefore have no lexical probabilities, The same 
technique could actually be used for improving the 
estimates of the lexical probabilities of words that 
897 
do occur in the training corpus. The basic idea is 
thaF there is a substantial amount of information 
in the word suffixes, especially for languages with 
a richer morphological structure than English. For 
this reason, we will estimate the probability distri- 
bution conditional on an unknown word from the 
statistical data available for words that end with 
the same sequence of letters. Assume that the 
word consists of the letters I1,. • •, I,~. We want to 
know the probabilities P(T i I ll,...,ln) for the 
various tags Ti. 1 Since the word is unknown, this 
data is not available. However, if we look at the 
sequence of generalizations of "ending with same 
last j letters", here denoted ln-j+l, • •., In, we rea- 
lize that sooner or later, there will be observations 
available, in the worst case looking at the last zero 
letters, i.e., at the unigram probabilities. 
So when estimating P(T i I In-j+l,...,ln), we 
simply omit the jth last letter In-j+l and apply 
Eq. (1): 
P(T I = 
= er-lf( Ti I l,-j+l,...,ln) + 
~-1+1 
P( Ti I In-i+2,..., 1,,) + 
a-l+l 
and 
e r-1 = ln-j+l,...,l, l e -tl\[1"-j+= ..... M 
This data can be collected from the words in the 
training corpus with frequencies below some thres- 
hold, e.g., words that occur less than say ten 
times, and can be indexed in a tree on reversed 
suffixes for quick access. 
4 Partial Successive Abstraction 
If there is only a partial order of the various gene- 
ralizations, the scheme is still viable. For example, 
consider generalizing symmetric trigram statistics, 
i.e., statistics of the form P(T I Tz, Tr). Here, both 
Tt, the tag of the word to the left, and Tr, the tag 
of the word to the right, are one-step generaliza- 
tions of the context 7}, Tr, and both have in turn 
the common generalization ~ ("no information"). 
We modify Eq. (1) accordingly: 
D(T I Tt,T~ ) = cr(Tt,T~) -~ f(T I Tz,T~ ) 
o'(r/,Tr) -1 q- 1 -~ 
I P(TIT0 + P(TIT~ ) +- 
2 ~(T,,T~)-~ + i 
and 
P(T I T~) 
P(TIT,) 
a(Ti) -1 f(T I T~) + P(T) 
a(Tz) -1 + 1 
~r(Tr) -1 f(T ITs) + P(T) 
~(T~) -I + 1 
1Or really, P(T i I 1o, 11,..., ln) where lo is a special 
symbol indicating the beginning o1 the word. 
We call this partial successive abstraction. Since 
we really want to estimate cr in the more specific 
context, and since the standard deviation (with 
the dependence on context size factored out) will 
most likely not increase when we specialize the 
context, we will use: 
1 ~r(Tt, Tr) = ~ min(a0(T~), c~0(Tr)) 
In the general case, where we have M one-step 
generalizations C~ of C, we arrive at the equation 
P(x I c) = 
-1/(x I c) + EY=I I el) 
and 
+ 1 
1 min a0(C~) 
o'(C) -- \[X//~ 16{i ..... M} 
-1 = v, iT -ntca 
By calculating the estimates of the probability 
distributions in such an order that whenever esti- 
mating the probability distribution in some parti- 
cular context, the probability distributions in all 
more general contexts have already been estima- 
ted, we can guarantee that all quantities necessary 
for the calculations are available. 
5 Relationship to Other Methods 
We will next compare the proposed method to, 
in turn, deleted interpolation, expected likelihood 
estimation and Katz's back-off scheme. 
5.1 Deleted Interpolation 
Interpolation requires that the training corpus is 
divided into one part used to estimate the relative 
frequencies, and a separate held-back part used 
to cope with sparse data through back-off smoo- 
thing. For example, tag trigram probabilities carl 
be estimated as follows: 
P(Tj \[Tk-2, Tk-1) ~ Alf(7~) + 
+ Auf(Tik ITk_1) + Aaf(Tik I Tk-2,T~_I) 
Since the probability estimate is a linear combina- 
tion of the various observed relative frequencies, 
this is called linear interpolation. The weights 13. 
may depend on the conditionings, but are required 
to be nonnegative and to sum to one over j. An 
enhancement is to partition the training set into 
n parts and in turn perform linear interpolation 
with each of the n parts held out to determine 
the back-off weights and use the remaining n - 1 
parts for parameter estimation. The various back- 
off weights are combined in the process. This is 
usually referred to as deleted interpolation. 
The weights Aj are determined by maximizing 
the probability of the held-out part of the trai- 
ning data, see (Jelinek & Mercer 1980). A locally 
898 
optimal weight setting can be found using Baum- 
Welch reestimation, see (Baum 1972). Baum- 
Welch reestimation is however prohibitively time- 
consuming for complex contexts if the weights are 
allowed to depend on the contexts, while succes- 
sive abstraction is clearly tractable; the latter ef- 
fectively determines these weights directly from 
the same data as the relative frequcncies. 
5.2 Expected Likelihood Estimation 
Expected likelihood estimation (ELE) consists in 
assigning an extra half a count to all outcomes. 
Thus, an outcome that didn't occur in the trai- 
ning data receives half a count, an outcome that 
occurred once receives three half counts. This is 
equivalent to assigning a count of one to the oc- 
curring, and one third to the non-occurring out- 
comes. To give an indication of how successive 
abstraction is related to ELE, consider the fol- 
lowing special case: If we indeed have a uniform 
distribution with M outcomes of probability M ! in 
context Ck-1 and there is but one observation of 
one single outcome in context Ck, then Eq. (1) will 
assign to this outcome the probability vqh+l and vqh+M 
to the other, non-occurring, outcomes 1 So v~+m' 
if we had used 2 instead of vq2 in Eq. (1), this 
would have been equivalent to assigning a count 
of one to the outcome that occurred, anti a count 
of one third to the ones that didn't. As it is, the 
latter outcomes are assigned a count of 1 4i5+~" 
5.3 Katz's Back-Off Scheme 
The proposed method is identical to Katz's back- 
off method (Katz 1987) up to the point of sugge- 
sting a, in the general case non-linear, retreat to 
more general contexts: 
P(= I c) = g(f(= I It')) 
Blending the involved distributions f(x \] C) and 
/5( x I C'), rather than only backing oft" to C' if f(x \] C) 
is zero, and in particular, instantiating 
the flmction g(f, P) to a weighted sum, distinguis- 
hes the two approaches. 
6 Experiments 
A standard statistical trigram tagger has been im- 
plemented that uses linear successive abstraction 
for smoothing the trigram and bigram probabili- 
ties, as described in Section 3.1, and that handles 
unknown words using a reversed suffix tree, as de- 
scribed in Section 3.2, again using linear succes- 
sive abstraction to improve the probability esti- 
mates. This tagger was tested on the Susanne 
Corpus, (Sampson 1995), using a reduced tag set 
of 62 tags. The size of the training corpus A was 
almost 130,000 words. There were three separate 
test corpora B, C and D consisting of approxima- 
tely 10,000 words each. 
Test corpus 
Tagger 
Error rate (%) 
- tag omissions 
-unknown words 
Unknown words 
Error rate (%) 
B 
bigram trigram HMM 
4.41 4,36 4.49 
0.67 
1.36 1.20 1.52 
6.18 
22.1 19.4 24,5 
Test corpus C 
bigram HMM trigram Tagger 
Error rate (%) 
- tag omissions 
- unknown words 
Unknown words 
Error rate (%) 
4.26 3.93 4.03 
0.68 
1.43 1.30 1.34 
7.78 
18.3 16.8 17.3 
Test corpus D 
bigram HMM Tagger 
Error rate (%) 
- tag omissions 
- unknown words 
Unknown words 
Error rate (%) 
trigram 
5.14 4.81 5.13 
0.94 
1.80 1.63 2.02 
8.06 
22.3 20.2 25.0 
Figure 1: Results on the Susanne Corpus 
Tile performance of the tagger was compared 
with that of an tlMM-based trigram tagger that 
uses linear interpolation for N-gram smoothing, 
but where the back-off weights do not depend on 
the eonditionings. An optimal weight, setting was 
determined for each test corpus individually, and 
used in the experiments. Incidentally, this setting 
varied considerably from corpus to corpus. Thus, 
this represented the best possible setting of back- 
off weights obtainable by linear interpolation, and 
in particular by linear deleted interpolation, when 
these are not allowed to depend on the context. 
In contrast, the successive abstraction scheme 
determined the back-off weights automatically 
from the training corpus alone, and the same 
weight setting was nsed for all test corpora, yiel- 
ding results that were at least on par with those 
obtained using linear interpolation with a globally 
optimal setting of contcxt-independent back-off 
weights determined a posteriori. Both taggers 
handled unknown words by inspecting the suffi- 
xes, but the HMM-based tagger did not smooth 
the probability distributions. 
The experimental results are shown in Figure 1. 
Note that the absolute performance of the trigram 
tagger is around 96 % accuracy in two cases and 
distinctly above 95 % accuracy in all cases, which 
is clearly state-of-the-art results. Since each test 
corpus consisted of about 10,000 words, and the 
error rates are between 4 and 5 %, the 5 percent 
significance level for differences in error rate is bet- 
ween 0.39 and 0.43 % depending on the error rate, 
and the 10 percent significance level is between 
0.32 and 0,36 %. 
899 
We see that the trigram tagger is better than 
the bigram tagger in all three cases and signifi- 
cantly better at significance level 10 percent, but 
not at 5 percent, in case C. So at this signifi- 
cance level, we can conclude that smoothed tri- 
gram statistics improve on bigram statistics alone. 
The trigram tagger performed better than the 
HMM-based one in all three cases, but not sig- 
nificantly better at any significance level below 
10 percent. This indicates that the successive 
abstraction scheme yields back-off weights that 
are at least as good as the best ones obtainable 
through linear deleted interpolation with context- 
independent back-off weights. 
7 Summary and Further Directions 
In this paper, we derived a general, practical me- 
thod for handling sparse data from first principles 
that avoids held-out data and iterative reestima- 
tion. It was tested on a part-of-speech tagging 
task and outperformed linear interpolation with 
context-independent weights, even when the lat- 
ter used a globally optimal parameter setting de- 
termined a posteriori. 
Informal experiments indicate that it is possible 
to achieve slightly better performance by replacing 
the expression for ~ro~(Ck) with a fixed global con- 
1 stant (while retaining the factor I~kl' which is 
most likely a quite accurate model of the depen- 
dence on context size). However, the optimal va- 
lue for this parameter varied more than an order 
of magnitude, and the improvements in perfor- 
mance were not very large. Furthermore, subop- 
timal choices of this parameter tended to degrade 
performance, rather than improve it. This indi- 
cates that the proposed formula is doing a pretty 
good job of approximating an optimal parameter 
choice. It would nonetheless be interesting to see 
if the formula could be improved on, especially 
seeing that it was theoretically derived, and then 
directly applied to the tagging task, immediately 
yielding the quoted results. 
Acknowledgements 
The work presented in this article was funded by 
the N3 "Bidirektionale Linguistische Deduktion 
(BiLD)" project in the Sonderforschungsbereich 
314 Kiinstliche Intelligeuz -- Wissensbasierte Sy- 
steme. 
I wish to thank greatly Thorsten Brants, Slava 
Katz, Khalil Sima'an, the audiences of seminars 
at the University of Pennsylvania and the Uni- 
versity of Sussex, in particular Mark Liberman, 
and the anonymous reviewers of Coling and ACL 
for pointing out inaccuracies and supplying useful 
comments and suggestions to improvements. 
References 
L. E. Baum. 1972. "An inequality and as- 
sociated maximization technique in statistical 
estimation for probabilistic functions of Markov 
processes". Inequalities IIh 1-8. 
Steven J. DeRose. 1988. "Grammatical Cate- 
gory Disambiguation by Statistical Optimiza- 
tion". Computational Linguistics, 14(1): 31- 
39. 
I. J. Good. 1953. "The population frequencies of 
species and the estimation of population para- 
meters". Biometrika, 40: 237-264. 
Frederick Jelinek and Robert L. Mercer. 1980. 
"Interpolated Estimation of Markov Source Pa- 
ramenters from Sparse Data". Pattern Recogni- 
tion in Practice: 381-397. North Holland. 
Slava M. Katz. 1987. "Estimation of Probabi- 
lities from Sparse Data for the Language Mo- 
del Component of a Speech Recognizer". IEEE 
Transactions on Acoustics, Speech, and Signal 
Processing, 35(3): 400-401. 
Geoffrey Sampson. 1995. English for the Compu- 
ter. Oxford University Press. 
900 
