COMBINATION OF N-GRAMS AND STOCHASTIC 
CONTEXT-FREE GRAMMARS FOR LANGUAGE MODELING* 
Jos6-Miguel Benedf and Joan-Andreu SSnchez 
De, i)artmnento de Sistemas Informgti(:os y Coml)utaci6n 
Universidad Polil;(!ClfiCa (te Valencia 
Cmnino de \Sera s/n, d6022 Valencia (Slmin) 
e-mail: {.jt)enedi,j andreu} (@dsic.ul)v.es 
Abstract 
This t)al)t;r de, scribes a hybrid prol)osal to 
combine n-grams and Stochastic Context-Free 
Grammars (SCFGs) tbr  modeling. A 
classical n-gram model is used to cat)lure the 
local relations between words, while a stochas- 
tic grammatical inodel is considered to repre- 
sent the hmg-term relations between syntactical 
stru(:tm'es. In order to define this granmlatical 
model, which will 1)e used on large-vo(:almlary 
comph'~x tasks, a eategory-t)ased SCFG and a 
prol)abilisti(" model of' word (tistrilmtion in the 
categories have been 1)rol)osed. Methods for 
leanfing these stochastic models tTor complex 
tasks are described, and algorithms for con> 
puting the word transition probal)ilities are also 
1)resented. Filmily, ext)erilnents using the Penn 
Treel)ank corpus improved by 30% the test; set; 
l)erph~xity with regard to the classical n-gram 
models. 
1 Introduction 
Language modeling is an important asl)e('t to 
consider in large-vocabulary Sl)eeeh recognition 
systenls (Bahl et al., 1983; ,lelinek, 1998). The 
n--grain models are the most widely-used for a 
wide range of domains (Bahl et al., 1983). The 
n-grams are simple and robust models and ad- 
equately capture the local restrictions between 
words. Moreover, it is well-known how to es- 
timate the parameters of the lnodet and how 
to integrate them in a speech recognition sys- 
tem. However, the n-grmn models cannot ad- 
equately characterize the long-term constraints 
of the sentences of the tasks. 
On the other hand, Stochastic Context-Free 
Grammars (SCI)'Gs) allow us a better model- 
* This work has been partially SUl)l)<)rted by the S1)anish 
CICYT under contract (TIC98/0423-C(16). 
ing of long-term relations and work well on 
lhnited-domain tasks of low perplexity. How- 
ever, SCFGs work poorly for large-vocabulary, 
general-purpose tasks because learning SCFGs 
and the (:Olnlmtation of word transition 1)roba - 
bilities present serious 1)roblenls tLr ('olnplex real 
tasks. 
In the literature, a nulnber of works have pro- 
posed ways to generalize the n-gram models (.le- 
linek, 1998; Siu and Ostendorf, 2000) or com- 
1)ining with other structural models (Bellegarda, 
1998; Gilet and Ward, 1998; Chellm and Jelinek, 
1998). 
In this iml)er, we present a confl)ined lan- 
guage model defined as a linear combination of 
n-grams, whk'h are llse(t to capture the local re- 
lations between words, and a stoehasti(: gram- 
matieal model whi(:h is used to represent the 
glottal relation 1)etw(x'dl synl:aetie strll(;tllrt~s, hi 
or(ter to (:at)turc, these lollg-terltl relations an(t to 
solve the main 1)rolflems derived Dora the large- 
vocabulary complex tasks, we 1)l'Ol)ose here to 
detine: a eategory--ba,~ed SCFG and a prolmbilis- 
tic model of word distrilmt;ion in the categories. 
Taking into a(:count this proposal, we also de- 
scribe here how to solve the learning of these 
stochastic models and their integrati(m prol> 
1CIlIS. 
With regard to the learning problem, several 
algorithms that learn SCFGs by means of es- 
timation algorithms have been 1)reposed (Lari 
and Young, 1990; Pereira and Schal)es, 1992; 
Sfinehez and Benedi, 1998), and pronfising re- 
suits have been achieved with category-based 
SCFGs on real tasks (Sfi.nchez and Benedi, 
\]999). 
In relation to the integration problem, we 
l)resent two algorithms that compute the word 
transition 1)robability: the first algorithm is 
based on the l~efl;-to-ll,ight Inside algorithln 
55 
(LRI) (Jelinek and Lafferty, 1991), and the sec- 
ond is based on an application of a Viterbi 
scheme to the LRI algorithm (the VLRI algo- 
rithm) (S~nehez and Benedf, 1997). 
Finally, in order to evaluate the behavior of 
this proposal, experiments with a part of the 
Wall Street Journal processed in the Penn Tree- 
bank project were carried out and significant im- 
provements with regard to the classical n-gram 
models were achieved. 
2 The  model 
An important problem related to  mod- 
eling is the evaluation of Pr(wk I wl... wk-1). 
In order to compute this probability, we pro- 
pose a hybrid  model defined as a sire- 
ple linear combination of n-gram models and a 
stochastic grammatical model G~: 
Pr(wklwl...w~_l) = c~Pr(~klwk-n...wt~-l) 
+(1 - c~) P"(wklw~... wk-,, G~), (1) 
where 0 < c~ < 1 is a weight factor which de- 
pends on the task. 
The expression Pr(w/~lwk_n...wk-,) is the 
word probability of occurrence of w/~ given by 
the n-gram model. The parameters of this 
model can be easily estinmted, and the ex- 
pression Pr(wl~lWl~_n... wtc-1) can be efficiently 
computed (Bahl et al., 1983; Jelinek, 1998). 
In order to define the stochastic gram- 
matical model G~ of the expression 
Pr(wk\]w~ . . . wk_j, G~) for large-vocabulary 
complex tasks, we propose a combination of 
two different stochastic models: a category- 
based SCFG (G~), that allows us to represent 
the long-term relations between these syntacti- 
cal structures and a probabilistic model of word 
distribution into categories (Cw). 
This proposal introduces two imlmrtant as- 
peels, which are the estimation of the parame- 
ters of the stochastic models, Gc and Cw, and 
the computation of the following expression: 
Pr(~klWl... Wk-1, ac, Cw) 
= Pr(wl... wk... la~, c~) (2) 
IC , C,,,) 
3 Training of the models 
The parameters of the described model are es- 
timated fi'om a training sample, that is, from 
a set of sentences. Each word of the sentence 
has a part-of speech tag (POStag) associated 
to it. These POStags are considered as word 
categories and are the terminal symbols of the 
SCFG. h'om this training sample, the parame- 
ters of G~ and C~ can be estimated as tbllows. 
First, tile parameters of Cw, represented by 
Pr(w\[c), are computed as: 
= E,o, (3) 
where N(w,c) is the number of times that the 
word w has been labeled with the POStagc. It 
is important to note that a word w can belong to 
different categories. In addition, it may hapt)en 
that a word in a test set does not appear in 
the training set, and therefore some smoothing 
technique has to be carried out. 
With regard to the estimation of the category- 
based SCFGs, one of the most widely-known 
methods is the Inside-Outside (IO) algo- 
rithln (Lari and Young, 1990). The application 
of this algorithm presents important problems 
which are accentuated in real tasks: the time 
complexity per iteration and the large number 
of iterations that are necessary to converge. An 
alternative to the IO algorithm is a.n algorithm 
based on the Viterbi score (VS algorithm) (Ney, 
1992). The convergence of the VS algorithm 
is faster than the IO algorithm. However, the 
SCFGs obtained are, in genera.l, not as well 
learned (Simchez et al., 1996). 
Another possibility for estimating SCFCs, 
which is somewhere between the IO and VS al- 
gorithms, has recently been proposed. This ap- 
proach considers only a certain subset of deriva- 
tions in the estimation process. In order to 
select this subset of derivations, two alterna- 
tives have been considered: froln structural in- 
formation content in a bracketed corpus (Pereira 
and Schabes, 1992; Amaya et al., 1999), and 
from statistical information content in the k- 
best derivations (Sgmchez and Benedl, 1998). 
In the first alternative, the IOb and VSb algo- 
rithms which learn SCFGs from partially brack- 
eted corpora were defined (Pereira and Schabes, 
1992; Amaya et al., 1999). In the second alter- 
native, the kVS algorithm for the estimation of 
the probability distributions of a SCFG fl'om the 
k-best derivations was proposed (Shnchez and 
Benedi, 1998). 
56 
All of these algorithms have a tilne (:omi)lexity 
O('n,a\[PI), where 'n is the length of the input 
st;ring, and \[1)1 is the size. of the SCFG. 
These algorithms have been tested in 
real tasks fl)r estimating cat(,gory-1)ased 
SCFOs (Sfinchez and Benedf, 1999) and the 
results obtained justify their applicatiol, in 
complex real tasks. 
4 Integration of the model 
l?rom exl)ression (2), it can bee se(m that in or- 
der to integrate the too(M, it is necessary to 
efli(:iently ('oml)ute the expression: 
P~0,,~... ',,,k... la,.., <,,). (4) 
In order to describo how this computation (:an 
l)e m~de, we tirst introduce some notation. 
A Court:el-Free, Grammar G is a four-tul)le 
(N, E, P, S), wher(; N is the tinit(; set of nont(;r- 
minals, )2 is the tinite sol; of terminals (N ~-/E = 
0), S ~ N is the axiom or initial symbol and 
1' is the finite set of t)rodu(:tions or ruh;s of the 
tbrm A -+ it, where A c N a.nd c~ C (N U E) + 
(only grmmmtrs with non (;mt)ty rules ar(; con- 
sidered). FOI" siml)li('ity (but without loss of g('.n- 
erality) only (:ontext-iYee grammars in Ch, om.s'ky 
Normal Form are. considere(l, that is, grammars 
with rules of the form A -+ HC or A -> v wh(n'(: 
A,B,C C N and v ~ )2. 
A Stoch, a.stic Contcxt-l';rc.c U'raw, w, wl" G.~ is a 
pair (G,p), where G is a (:ontext-fr(,.(; grain- 
mar and p : P -+\]0,1\] is a 1)robal)ility tim(:- 
{;ion of rule ai)l)li('al;ion su(:h that VA ~ N: }~,c(Nu>~)+ 
p(A --+ ,~) -- i. 
Now, we pr(:sent two algorithms ill order to 
compute the word transition 1)rol)at)ility. The 
first algorithm is based on the Ll/i algorithm, 
a.nd the second is based on an apt)li('atiou of a 
Viterbi s(:heme to the LRI algorithln (the VLI/\] 
a.lgorithm). 
Probability of generating an initial 
substring 
The COmlmtation of (4) is l)as('.d on an algo- 
rithm which is a modith:ation of the I,RI algo- 
rithm (aelinek and Lafl'erty, 1991). This new 
algorithln is based on the detlnition of Pr(A << 
i,j)) = Pr(A ~ wi...wj ... Ic.. c,,,) as th(, 
1)robability that A generates the initial sul)string 
wi... wj... given Gc and C.,,,. This can l)e com- 
puted with the following (lynamic 1)rogrmmning 
s(:henl(;: 
c 
+ ~ Q(A ~ D)p(D ~ ~)P"(~"~l~)), 
D 
.i-1 
Pr(d<<i,j)= E ~ Q(A ~ BC) 
B,CGN l=i 
pl.(J~ < ~:, 1 >) p,.(c << 1 + 1, j). 
::: this way, Pr(~,~...~,k...IG,:,6',,,) = 
Pr(S << l,k). 
In this cxi)ression, Q(A ~ D) is the proba- 
bility that D is the leftmost nol:terminal in all 
sentential fOHllS which are derived from A. The 
vahu; Q(A ~ BC) is the probability that BC 
is th(; initial substring of all sentential forms de- 
riv(;d from i\. Pr(H < i,l >) is th{; probability 
that the substring "wi... wz is generated from/~ 
given G,: and C.,,,. Its contlmt;ation will be de- 
fined \]ater. 
It shouh:l be noted that th(; combination of 
the models G,. and C~,, in carried out in the vah:e 
P'r(A << i, i). This is the lnain difl:'crcnce with 
resp(wt the \]A/I algorithm. 
Probability of the best derivation 
generating an initial substring 
An algorithm whi(:\]l is similar to the previous 
(>he (-m~ l)e (l(~fin(~d t)ased on the \;iterl)i ,~(:lmme. 
In this way, it is l)onsil)le to obtain the \])cst; pars- 
ing of an initial sul)string. This new algorithm is 
also related to the \/'Lll.I algol'ithni (Shn(:hez and 
B('aw, di, 1997) and is 1)ased on the (lciinition of 
P,~'(A << ',:, J)) = P,~'(A ~ ",,i . . . ',,j . . . IG~:, Cw) 
as the probability of the most probal)le 1)arsing 
which generates wi...wj.., from A given G,: 
and C,,. This can 1)(i (:omputcd as follows: 
p:(A << i,,:)= m:?x(p(A ~ c)l),(~,,~l,.'), 
m~Lx(Q(A => D)'p(D ~ c)Pr('wi\[c))), D 
PI~'(A << i,j) = max :mix (Q(A ~ \]3C) H,CcN l=i...j-I 
A 
Pr(\]3 < i,l >)Pr(C << l + 1,j)) . 
A A Thorcfo,o >4",~...'wk... \[a,:, <,,) -- p,.(s << 
1,k). 
In this expression, Q(A ~ D) is the t)rob - 
al)ility that D is the leftmost nontermina, l in 
57 
the most t)robable sentential form which is de- 
rived ti'om d. The value Q(A ~ BC) is the 
probability that BC is the initial substring of 
most the probable sentential form derived from 
A. Pr(B < i, 1 >) is the probability of the most 
probable parse which generates wi • • • wl froli1 B. 
Probability of generating a string 
The wflue Pr(A < i,j >) = Pr(A d> wi...'u;jlG~,Go) 
is defined as the probability 
that the substring wi... wj is generated from 
A given G~ and C,~. To calculate this proba- 
bility a modification of the well-known Inside 
algorithm (Lari and Young, 1990) is proposed. 
This computation is carried out by using the 
following dynamic progralmning scheme: 
Pr(A < i,i >) -- ~p(A-~ c) Pr(wilc) , 
c 
j--I 
Pr(A<i,j>) : E E p(A -+ BC) 
B,CcN l=i 
pr(B < i,l >)Pr(C < 1 + 1,j >). 
In this way, Pr(w~ ...whiGs, C,,) = Pr(S < 
1,n >). 
As we have commented above, the combina- 
tion of the two parts of the grammatical model 
is carried out in the value Pr(A < i, i >). 
Probability of the best derivation 
generating a string 
The t)rol)abitity of the best derivation that 
genel'~-gtes a string, Pr('u,1... ~t/2,~l~c, 6'w) , can 
be evaluated using a Viterbi-like scheme (Ney, 
1992). As in the previous case, the computation 
of this probability is based on the definition of 
p .(A < g,j >) = pU-(A <,,) as 
the probability of the best derivation that gen- 
erates the substring wi... wj fi'om A given Gc 
and Cw. Similarly: 
Pr(A<i,i>) = 
Pr(A < i,j >) = 
Therefore, 
1, n >). 
inax \])(a -9 C)I)I'('//)ilC) , C 
max n,ax -+Be) B,CCN ... 
Pr(B < i,l >)Pr(C < 1 + 1,j >) . 
 nla , C,,o) = P -(X < 
Finally, the time complexity of these algo- 
rithms is the same as the algorithms they are 
related to, there%re the time colnplexity is O(k:alrl), 
where tc is the length of the input 
string and IPI is the size of the SCFG. 
5 Experiments with the Penn 
Treebank Corpus 
The corpus used in the experiments was the part 
of the Wall Street Journal which had been pro- 
cessed in the Petal %'eebank project 1 (Marcus el: 
al., 1993). This corpus consists of English texts 
collected from the Wall Street Journal from edi- 
tions of the late eighties. It contains approx- 
imately one million words. This corpus was 
automatically labelled, analyzed and manually 
checked as described in (Marcus et 31., 1993). 
There are two kinds of labelling: a POStag la- 
belling and a syntactic labelling. The size of 
the vocalmlary is greater than 25,000 diil'erent 
words, the POStag vocabulary is composed of 
45 labels 2 and the syntactic vocabulary is com- 
posed of 14 labels. 
The corpus was divided into sentences accord- 
ing to the bracketing. In this way, we obtained 
a corpus whose main characteristics are shown 
in Table 1. 
Table 1: Characteristics of the Petal Treebank 
corpus once it; was divided into sentences. 
No. of Av. Std. Min. Max. 
senten, length deviation length length 
49,207 23.61 11.13 1 249 
We took advantage of the category-based 
SCFGs estimated in a previous work (Simchez 
and Benedf, 1998). These SCFGs were esti- 
mated with sentences which had less than 15 
words. Therefore, in this work, we assumed such 
restriction. The vocabulary size of the new cor- 
pus was 6,333 different words. For the exper- 
iments, the corpus was divided into a training 
corpus (directories 00 to 1.9) and a test corpus 
(directories 20 to 24). The characteristics of 
these sets can be seen in Table 2. The part of the 
1Release 2 of this data set can be ob- 
tained t'rmn the Linguistic Data Consor- 
tium with Catalogue number LDC94T4B 
(http://www.ldc.upenn.edu/ldc/nofranm.html) 
2There are 48 labels defined in (Marcus et al., 1993), 
however, three of ttmm do not appear in the corpus. 
58 
(-orlms lal)(:led with l)()Stags was used to (:st;i- 
mate the p~wameters of tlm grammati(:al me(M, 
while the non-lad)e\](;(l part was u,s(',d i;o estimate 
th(; parameters (it" the n-grmn lnodc.l. \¥(~ now 
des('ribe the estinmtion l)roec,,~s in (l('%ail. 
Table 2: Chm'acteristics of th(: data. s('A;s (h~iined 
for the eXl)eriments wh(',n the senl:en(:(~s wi(;h 
more l;lmn 15 l)OSl;ags were r(;moved. 
Da.ta \[ No. of I Av. Std. \] 
l(,ngl h deviation S(~, J; SellI;ell. \]_. 
~l.i;st . 2,295 1 .1~ 3.55 
The 1)a.rmn(%er,q of a 3-grmn too(l(;1 were ('~s- 
timatcd with the softw~re tool des('rit)('.(1 in 
(l/,osenfehl, 1995) :t. W(~ u,qed tlm linear ini;(',rl/o- 
la.tion ~qmooth t('~(:lmiqu(~ SUpl)orted by ,;hi,~ tool. 
Th('~ o1:l;-of-v(/(:al)lflary words wcr('~ groul/e(l in 
the same (:\]as,~ and w(u'e used in th(~ ('omt)ula- 
1;ion of i;\]~('~ perl)h~'xity. ~.l'h(,. I:(~sl ,~(:I l)(~rl)l('~xity 
with t:his mo(lel was 180.4. 
T\]w, values ()f ('.xt)r(~,qsi()n (3) wure (:()ml)ut(!(t 
from the t~:.gged and l:on-l:agg(:(t i)alq; O1' \[;lie, 
l:raining corpus. In or(h'a' to avoid mill val- 
ues, the m~seen (~'vents wer('~ lal)ele(1 with ~ Sl)C- 
cial symbol 'w' wlfich did not a pl)ear in (;he 
i, s.,:h -¢ 0, 
Vc C (/, whtu'(~ (/ was I:h('~ set ()f (:at('.g()ri(~s. 
That is, all th(: ('at(',gori(~s could g(',n(:rat(', i;\]:(: uu-- 
,q('x'~n evenI, This l)rolml)ility took a. v(~ry small 
valu(; (s(',v('.ral ()rd(;rs of magnii;u(l('~ h'~,qs tlmn 
minw~v,c(:c l)r('wlc), where V was the \.'o('alm- 
lary of the tra.ining corpus), and (liffer(mI; vahte.~ 
of this i)robability did not chang('~ tlm r(~sults. 
The i)aramet('~rs of an initial ergodic SCFG 
were estimated with each one of the estimation 
methods mentioned in Se('tion 3. This SCFG 
had 3,374 rules, (:omt)osed fl'om 45 terminal 
syml)ols (the numl)er of l)()Stags) and \]d non- 
terminal symbols (the nmnber of synl;a('l;i(: la- 
bels). The prol)z~l)ilitics were rmidolnly gem'a'- 
ated mid t;hree different seeds were tested, lint 
only one of them is reported given that the re- 
suits were very similar. The training (:orlms was 
the labele(l part of the des('ril)ed (:orlm,q. The 
1)erl)lexity of the labeled part of the test; s(:t for 
all.clcas(~ 2.04 is availal)le at htl;l)://svr- 
www.cng.cmn.ac.uk/~ 1)rcl4/toolkit.html. 
diti'(n'('~nt (~stimation algorithms (;a.n l)c. ,~cen in 
Tal)le 3. 
\[I.'abl(~ 3: PCxl)lexity of the labeled 1)¢tl't; Of I;\]lO 
test set with the SCFC, estimated with the 
methods mentioned in Section 3. 
7 1 /'~ vs l a,\ s l ,ot, I\ sb I 
Once we had estimated the lmramei;ers of the 
defined model, we applied expression (1) 1)y us- 
ing the IAI,\] algorithm m~d the VI.\[/,I algorithm 
in ('~xt)w, ssion (d). Th(; test set lWa'l)lexitly that 
was ol)I;ained in flmction t)f (t %r difl't'a'(:nt; esti- 
nm~tion algorithms (VS, kVS, lOb mid VSb) can 
be seen in \]rig. 1. 
In the best case, the tn'ot)osed l~mguage model 
el)rained more than a. 30% inlI)rOVellle:l|; OVer 
re,~ults ol)taincd 1)y the 3-gram lmlguagc motM 
(s(w. ~l'at)le d). This result wa,q ol)t;ainc.d wh('~n 
th(: SCFG usl;imat(~d with the lOb algorit;hm 
wa,~ u,~(;(1. The SCFGs ('.stimat('.(l with ()ther al- 
g()ril;hms also ()l)tain('.d iml)ortanl; ilnlirovt,.nw.lfl;,~ 
(:Oinl)ar(;d l;o \[;he 3-grain. In ~(t(lition, ii; can be 
oliserv('.d i;hat t)oth th(,. LI-(.I algorithm and the 
VIA/I algoril;hm obtained good results. 
Tal)le d: \]3(~,st tx~st lW, rl)lexity for difl'(~,rem; SCFG 
(:.%imation algori thins, and I;h(~ \])er(:cntage of im- 
i)rovt'.mc~lll; with resi)(wl; I;o i;hc 3-gram model. 
VSm kVS lOb VSI) 
\]All 133.6 130.3 124.6 136.3 
i % improv. 25.9% 27.8% 30.9% 24.5% 
\[ V\]~l:l I ~ 137.2 I 13Z4 I V,9.7 I 
\[ %iml)rov. 20.5% 23.0% 26.6% 17.0% 
An important aspect to note is (;hat the 
weight; of the grmmnatical part was approxi- 
mat;ely 50%, which means that this part pro- 
vided iml)ori;mlI; inform~tion to the  
model. 
6 Conclusions 
A nc'w  model has been introduced. 
This new  model is detined as a~ lin- 
ear ('olnl)in~ttion of an n-gram which repre- 
s(mts relations betwe('~n words, and a stochastic 
59 
20O 
190 
180 
> 
"~ 17(1 
160 
150 
14(\] 
13(\] 
120 
20O 
19(\] 
180 
'~ 17(\] 
160 
15(\] 
14(1 
130 
120 
;,, /l:j~ 
3-ffrgtn: 
VSb 
kVS VS 
IOb 
0 3 0 4 0.5 0.6 0.1 0.2 
~/ 
3~gt'anl 
VSb 
VS 
kVS 
lOb 
0.7 0,8 0.9 
/ 
0.1 0 2 0.3 0.4 0.5 0.6 0 7 0.8 0.9 0 (t 
Figure 1: Test set perplexity obtained with 
the proposed  models in function of 
gamma. Different curves correspond to SCFGs 
estimated with different algorithms. The up- 
per graphic correst)onds to the results obtained 
when the LRI algorithm was used in the lan- 
guage models, and the lower graphic corre- 
sponds to the results obtained with the VLRI 
algorithm. 
grammatical model which is used to represent 
the global relation between syntactic structures. 
The stochastic graminatical model is composed 
of a category-based SCFG and a probabilistic 
model of word distribution in the categories. 
Several algorithms have been described to esti- 
mate the parameters of the model flom a the 
smnple. In addition, efficient algorithms tbr 
solving the problem of the interpretation with 
this model have been presented. 
The proposed model has been tested on the 
part of Wall Street .Journal processed in the 
Penn Treebank project, and the results obtained 
improved by more tlmn 30% the test set; per- 
plexity over results obtained by a simple 3-grain 
model. 

References 

F. Amaya, J.M. Benedi, and J.A. Shuchez. 
1999. Learning of stochastic context-free 
grammars from bracketed corpora by means 
of reestimation algorithms. In M.I. Torres 
and A. Sanfeliu, editors, Proc. VIII Spanish 
Symposium on Pattern Recognition and Im- 
age Analysis, pages 119 126, Bilbao, Est)afia, 
May. AERFAI. 

L.R. Bahl, F. Jelinek, and ILL. Mercer. 1983. 
A maximmn likelihood approach to continu- 
ous speech recognition. IEEE Trans. Pattern 
Analysis and Machine Intelligence, PAMI- 
5(2):179 190. 

J.R. Bellegarda. 1998. A multispan  
modeling frmnework tbr large vocabulary 
speech recognition. IEEE Trans. Speech and 
Audio Processing, 6 (5):456-476. 

C. Chelba and F. Jelinek. 1998. Exploiting syntactic structure for  modeling. In 
Proc. COLING, Montreal, Canada. University of Montreal. 

J. Gilet and W. Ward. 1998. A  model 
combining trigrams and stochastic context- 
fl'ee grammars. In In 5th International Con- 
.fercnce on Spoken Language Processing, pages 
2319 2322, Sidney, Australia. 

F. Jelinek and J.D. Lafferty. 1991. Comlputation of the probability of initial substring generation by stochastic context-free grammars. 
Computational Linguistics, 17(3):315 323. 

F. Jelinek. 1998. Statistical Meth, ods for Speech 
Recognition. MIT Press. 

K. Lari and S.J. Young. 1990. The estimation 
of stochastic context-fl'ee grmnmars using the 
inside-outside algorithm. Computer, Speech 
and Language, 4:35 56. 

M.P. Marcus, B. Santorini, and M.A. Marcinkiewicz. 1993. Building a large annotated corpus of english: the penn treebank. 
Computational Linguistics, 19 (2):313-330. 

H. Ney. 1992. Stochastic grmmnars and pattern 
recognition. In P. Laface and R. De Mori, ed- 
itors, Speech Recognition and Understanding. 
Recent Advances, pages 319 344. Springer- 
Verlag. 

F. Pereira and Y. Schabes. 1992. Inside-outside reestimation from partially brackcted corpora In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, 1)ages 128 135. University 
of l)elawarc. 

R. Rosenfeld. 1995. The cmu statistical  modeling toolkit and its use in the 1994 arpa csr evaluation. In ARPA Spoken Language Technology Workshop, Austin, Texas, USA. 

M. Siu and M. Osto.ndorf. 2000. \;al'iablc 
n-grams mid (;xl;(~.n,siolLs for convcrsatiomtl 
speech langu~g(' mo(h;ling. IEEI~' Tm,',,s. on 
Speech and A'udio P'roc(;s.sing, 8 (1) :63 75. 

.I.A. Sfi.nch(;z and ,J.M. B(;ncdf. 11997. ()Olnl)llta- 
tion of the probability of the best (hwiw~t;ion of 
an initial substring fi'om a stochastic (:ontcxt- 
fl'ec; grammar. In A. Smffeliu, .\]..l. Villa.mleVa, 
and .J. \;itri;t, editors, Prm:. VII Spanish, Sym- 
posi'um, on Pattern Rccogu, itio'n and image 
Analysi. L pages 181-186, Barco.hma., Eslmfia, 
April. AERFAI. 

J.A. SSn(:h¢;z mid J.M. Bcn('(lf. 1998. Esti- 
mation of the l)robability di,~tributions of 
sto(:ha,sti(: (;oni;('~xt-frc(; grammar,s from th(; \],:- 
1)(;st derivations. In b~, 5th, b~,ter'national Co'n- 
.f('r(:nc(: on Spoke',, Langua9(: l)'ro(:c,ssing, imgc,s 
2d95 2498, Sidney, Australia. 

.J.A. Sgmchcz mid ,\].M. Bcn(;(li. 1999. lx~rn- 
ing of ,%ochast, ic cont(~xt-fl'(~'c grammm's by 
memos of ('stima.tion ~flgol'ithms. In 1)'roe. EU- 
H, OSl)EECl\]'99, volume 4, lmges 1799 1802, 
Budal)cSt , Hungary. 

J.A. S~nchez, J.M. B(;nedf, and F. Casacu- 
berta. 1996. Comparison lmi,ween the insi(h~- 
outside algorithm mid the vitcrl)i a.lgorithm 
for stochastic context-fl'ee grammars. In 
P. Perncr, P. Wang, amd A. I{osenfe.ld, edi- 
tors, Advances in Str'uct'm'al and Syntactical 
Pattcrn ll, ccogu, ition, pages 50 59. Springer- 
Verlag. 
