Modeling Topic Coherence for Speech Recognition 
Satoshi Sekine 
Computer Scion(:(; \])eI)ai'tmcnt 
New York University 
715 Broadway, 7th floor 
New York, NY 10003, USA 
sekine@cs, nyu. edu 
Abstract 
St, atist,ical langmtge models play a ma- 
jor role in current spee~(:h re.cognition sys- 
tems. Most of these models have ti)- 
cussed on relatively local interactions be- 
tween words. R(.'(:ently, however, |her('. 
have been sevcr;d attempts to incorpo- 
rate other knowlcdg(; source.s, in par- 
ticular long(x-range word (tet)(;nden(:ies, 
in order to improve. Sl)(.~ech r(;(:ognize.rs. 
We will 1)rcs(~.nt one. such m('.t;ho(l, which 
tries to autonmticatly utilize t)rolmri;ics 
of topic continuity. Whim a l)asc-linc. 
spee.ch re.(x)gnil;ion sysl;em gencra, l;('.s a.\[- 
ternativ(', hypothe.s(~s for a senl;enc( L we 
will ul~ilize the word prefercn(:(~s based 
on topic coherence to sele(:t tim b(;st hy~ 
pothesis. In our experiment, we achi(wed 
a 0.65% imI)rovenmn|; in the wor(1 ei- 
ror rat(', on top of t;h(; base-lin(! sysi;em. 
It corre.sponds to 10A0% (if tlm possit)le 
word error improvement. 
1 Introduction 
Statistical language models play ~ major role. in 
current language i)rocessing applications. Most 
of these models have fbcussed on r('~lative.ly local 
interactions betwe.en words, in t~articular, large. 
vocabulary sl)eech recognition systems have. used 
primarily bi-gram and tri~grmn language mod('.ls. 
Recently, howev(;r, there have been several at- 
te.mt)t,s to incorl)orate other knowl(~dge. SOlll'C(~,8~ 
and in pa.rticular longer-range word (tepe.nd(mcies, 
in order l;o improve speech recognizers, th'.rc, 
'loi~ger-range d(~tienden(:ics' me,ms dependencies 
extending beyoiM several words or beyond sen- 
l;(mce boundmies. 
There have be.en several al;l;eml)l;s in the last 
few years to make use of these prot)erti('~s. One 
of them is the "cache language mode.l" (Kulm, 
1988) (aelinek et al., 1.991) (Kul,icc, 1989). This 
is a dynamic language model which ul,ilizes the 
partially dictated document ("cache.") in order to 
predict the next word. In essence, it is based on 
the ol)se.rvation l;ll&(; ;~ wor(t whi(:h has ah'cady al)- 
1)care(1 in a (locum(.,nt has ;m incr(;ased i)rol)ability 
of r(;at)ticaring. Jelind¢ showed tim usefuhmss of 
this method in terms of spe(;ch re.(:ognii;ion quality. 
For sllorI; (locllIll(.~II~,s~ how('.v(w~ SllCh ~)~s lieWSt)3I)cr 
f~rl;icl(;s, th('. mnnl)cr of words whi(;h can 1)e a(:-- 
cunml~t(;(t fi'om tim l/rior text will be small and 
accordingly the b(;nelit of |;tie niethod will gen('x- 
ally tie small. 
l~os('.nti'M proliosed t;he %rigger model" to try 
to ov('.r(:om(~' this limitation (RosentbM, 1992). lie 
used a large (:orpus to build a s('.t of "trigger tmirs", 
each of which consists of it l)a.ir of words al)t)('.ar- 
big in a single (to(:um(mt ot: a liirg(~ corpus. Th(~se 
pairs ar('. used as a (:omI)oncnl; in th('~ t)rot)nt)ilisti(: 
mo(M. If a particular word w apt)('.ars in the 1)re- 
ceding t('~xt of do(:mncnt, the model will 1)red|(:( a. 
lmightened t)rot)al)ility not just for w t)ut also for 
all th(; wor(ls related to w through a trigger I/a.ir. 
Our apt)roa(:h can b(; briefly summarize(l as 
follows. The topic or sut)jecl; matter of ;m m- 
title influcn(:(;s its linguistic l)rot)eri;i(;s, su(:h as 
word (;hoic(~ and (:o-oc(:m'ren(:e patt(~rns; in ctl'(~ct 
it giv('.s rise |,o a very Sl)ccializc(l "sublmlguag(~" 
for th;tt topic. We try to find the sul)languag(~ 
to which t,h(', art|el(; 1Mongs based on the sen- 
tences already recognized. At a cert;fin stag(; of 
the st)ee(:h recognition processing of an art|el('., 
words in |;hi; pr(;viotls ll|;|;(2r~rlic(?s &ro s(~.le(;i;(Rt ~)~s 
keywords. Then, based on the keywords, simi- 
lm' arti('.h.'s are retri('.ved from a large corpus t)y 
a nmthod similar to that used iil information re- 
trieval. They are asseml)l('~d into a subla.nguag(', 
"mini-cortms" for tim mti(:h',. Then wc analyze the 
mini-(:orlms in or(lcr to d(~tc, rminc word l)rcf(;rcn(:c 
whi(:h will l)e used in analyzing the following sen- 
t(mcc. The &;tails (if e, ach stet) will be described 
lat(~u'. 
Our work is similar to I;ll~I; using trigger t)~firs. 
ltowever, the triggea' |);fir approach does a. v(uy 
broad s('~a.rch, retrieving m'ti(:h;s which have any 
word in common wil;h the, 1)rior discourse'. ()llr ap- 
pro~,ch, in contrast, makes a Inllch lnorc fOCllSSe(t 
sear(:h, taking only a small set of articles most 
similar to the prior discourse. This may allow us 
to make, sharper t)redictions in the case of well- 
913 
defined topics or sublanguages, and reduce the 
problems due to homographs by searching for a 
(:onjunction of words. (Rosenfeld has indicated 
that it may be t)ossible to achieve similar results 
by an enhancement to trigger pairs which uses 
multiple triggers (Rosenfeht, 1992).) In addition, 
our approach needs less machine power. This 
was one of tile major problems of Rosenfeld's ap- 
proach. 
Sekine has reported on the effectiveness of sub- 
language identification measured in terms of the 
Dequency of overlapping words between an arti- 
cle and the extracted sublanguage corpus (Sekine, 
1994). In this paper, we report on its practical 
benefits for speech recognition. 
2 Speech Recognition System 
This research is being done in collaboration wittl 
SRI, which is providing the base of the com- 
bined speech recognition system. (Digalakis et.al., 
1995). We use the N-best hypotheses produced 
by the Sill system, alon G with their acoustic and 
language model scores. There are two acoustic 
scores and four language scores. Language scores 
are namely the word trigram model, two kinds of 
part of speech 5-gram model and the number of 
tokens. Note that none of their language models 
take long-range dependencies into account. We 
combine these scores with the score produced by 
our sublanguage (:omponent an(1 our cache inodel 
score, and then select the hypothesis with the, 
highest combined score as the output of our sys- 
tem. The system structure is shown in Figure 1. 
The relative weights of the eight scores are deter- 
mined by an optimization procedure on a train- 
ing data set, which was produced under the same 
conditions as our evaluation data set, trot has no 
overlap with tile evaluation data set. The actual 
conditions will be presented later. 
SRI Speech ~ __~YU I 
Recognizer 1 N-best / hi( 
Ii-grain scores 
Combine sco ic 
Best hypothesis 
Languag( t 
Models / 
,anguage score 
Figure 1: Structure of the system 
3 Sublanguage Component 
The sublanguage component performs the follow- 
ing four steps: 
1. Select keywords from previously uttered sen- 
tellces 
2. Collect silnilar articles flom a large corlms 
based tm the keywords 
3. Extract sublanguage words fl'om the similar 
articles 
4. Compute scores of N-best hypotheses based 
on tile sublanguage wor(ls 
A sublanguage analysis is performed separately 
for each sentence in an artieh; (afl;er the first sen- 
tence). There are several parameters in these pro~ 
eesses, and the values of the parameters we usetl 
for this experiment will be summarized at the end 
of each section below. We generally tried several 
parameter values and tile vahles shown in this pa- 
per are the best ones on our training data set;. 
We used a large corpus in the experiment as 
the source for similar articles. This corpus in- 
cludes 146,000 articles, or 76M tokens, from Jan- 
uary 1992 to .hfly 1995 of North American Busi- 
ness News which consists of Dow Jones Informa- 
tion Services, New York Times, Reuters North 
American Business Report, Los Angeles Times, 
and Washington Post. This corpus has no overlap 
with the evaluation data set, which is drawn from 
August 1995 North American Business News. 
Now, each step of our sublanguage component 
will be described in detail. 
Select Keywords 
The keywords which will be used in retrieving sim- 
ilar articles are selected from previously dictated 
sentences. Tile system we will describe here is an 
incremental adal)tation system, which uses only 
the inlbrmation the syst;em has acquired from tile 
previous utterances. St) it does not know the cor- 
rect transcriptions of prior sentences or any infor- 
mation about subsequent sentences in the article. 
Not all of l, he words from the prior sentences 
are used as keywords for retrieving similar arti- 
cles. As is the practice in information retrieval, 
we filtered out several types of words. First of 
all, we know that closed class words and high fre- 
quency words appear in most of the documents re- 
gardless of the topic, st) it is not useflfl to include 
these as keywords. On the other hand, very low 
frequency words soinetimes introduce noise into 
the retrieval process because of their peculiarity. 
Only open-class words of intermediate flequeney 
(actually frequency from 6 to 100000 in tile corpus 
of 146,000 articles) are retained as keywords and 
used in finding the similar articlem Also, because 
the N-best sentences inevitably contain errors, we 
set at threshold for the appearance of words in tile 
N-best sentences. Specifically, we require that a 
914 
word aptiear at least 15 tilnes ill the tilt) 20 N-besl; 
senten(;es (as ranke,(l l/y Sl{.\['s s(:ore) 1;o (luali(y as 
a keyword for retriewd. 
~ Pat"anmq,er £ -Vahm 
Max freqnen(:y of a keywor(l | 100000 I 
Minfrequeneyofakeyword (i0 / 
N-best for ke.ywor(1 sele(:tion 
Min word ai)pearanees in N-1)est i 15 • 
Collect Similar Articles 
The sex of keywords is used in order to retrieve 
similar artMes a.c(:ording to the folh)wing formu- 
las. Ilere Weigh,;('w) is the weight of word w, 
F'('w) is the, frequency of w(ird 'W in the 20 N- 
best senten(:es, M is the total Itumb(_!\]' ()\[' t()kens 
in the corpus, t/(w) is the Dequen(:y of word w in 
the corpllS, AScorc(a) is artMe seorc of a.rtMe a, 
which indicates the similarity between the set of 
keywords an(l the artMe, and n(a) is the mmfl)er 
of tokens in article a. 
M w,.,::jh~.(.,,,) : 1.'(.,,,) • 1,,:j(~76,,~) (0 
mgoo,.(~(.) -- F.,,,~,, w,@h.4,,,) lo.(,,(.)) (2) 
Fm, ch keyword is weighted by t;he l)rodu(:t of two 
factors. One (if them is the fr(;(luen(:y of the wor(t 
in the 20 N-1lest senten(:es, and the other is the 
log of the inverse t)robability ()f the wor(l in the 
large eortms. This is a standard metric ()f infer 
mation retrieval based on the assumption dmt the 
higher fre(luency w(irds provide less intormation 
about topics (Si)arck-Jones, 1973). Article scores 
(AScore) for all articles in the large (:ortms are 
(:oInputed as the sum of the weighted scores of 
the selecte(t keywords in each arti(:le, an(1 are n(ir- 
realized t)y the log of the size of each article. This 
s(:ore in(li(:ates the similarity b(!tween the set, of 
keywords and the article. We (:olleet the most 
similar 50 artMes D()m the corpus. These forln 
the. "sift)language. set", whi(:h will I)e use(l in ana- 
lyzing the f()llowing sentenc4; in the test m'ti(:le. 
I w l"e I 
umber of artMes ill sublanguage set 
Extract Snblanguage words 
Sublanguage words are extra(:ted from the col- 
lected sublanguage artMes. This extraction was 
done ill order to filter out tot)i(:-um'elated words. 
ltere, we exehtde flmetion WOl'dS, as we did for key- 
word selection, 1)eta,use flm(:don words are gener- 
ally coInIllOll throughout (ti\[thrent sul)languages. 
Next, to find strongly tot)ie related words, we ex- 
tracl;ed words which apl/ear in at least 3 (lilt (if the 
50 sublmlguage articles. Also, the do(:tnnent De- 
quen(:y in sublanguage articles has to be at least 
3 times the word Dequency in the large corpus: 
D t/('w) /50 > 3 (3) 
F(w)/M 
Itere, DF(w) is the number of do(:uments in whi(:h 
the wor(l apl)ears. We (:m~ expe(:t that these me;h- 
e(Is eliminate, less topi(: relate(l words, s() dtat only 
str(mgly tot)i(: related wor(ts are extra(:ted as the 
sul)language words. 
I'al'a,~,ete,' - \[ ~il.~ \] 
Min'lumof(l(/eumentswiththew')7'(l\[ 33 l 
Thr(!shol(1 ratio of wor(1 
in the set and in general 
Compute Scores of N-best Hypotheses 
Finally, we ('omput(,. scores of the N-best hypothe- 
s(;s gen('.rnted l)y the speech recognizer. 'Fhe top 
100 N-t)(;st hypotheses (ae(:ording to SIH's score) 
are r(>s(:or(~(l. The sul)language score we assign t;() 
each word is the logarithm of the ratio of d()cu- 
nlent h'e(lueal(:y in the sublanguage m'ticles to the 
word frequen(:y of the word in tile large corpus. 
The larger this s(:ore of a word, tile more strongly 
the word is related to the sublmlguage we found 
through the tirior discourse,. 
The s(:ore \['or ea(:h sentenc(; is eal(:ulated by a(:- 
(:unmlating the score of 1he sele(:te(l wor(ls ill the 
hyt)othesis, tlere l\[Sco'rc(h) is the sut)language 
score of hypothesis h. 
, ../oF(,. O/so, tls'(~.o,.(,(h) = ~ ,(,~ F~;,))X7 ~ (4) 
w in h 
This formula can be motivated by the fact that 
die sublanguage score will be combilm(t linearly 
with general language nlo(M s(:ores, whMl mainly 
consist of the logarithm of the tri-gram 1)robabil- 
ides. The denominator of the log in Formula 4 is 
the unigram probability of wor(t w. Sin(:e it is the 
(h!nonlinator ()f a logarithm, it; winks to reduce the 
effect of the general laltgliage model whMl may be 
(:;robed(led in the trigranl language mo(M score. 
The nmnerat()r is a pure sublanguage score and it 
works to ad(l tim s(;or(~ of the sublanguage mo(M 
to the ()ther s(:ores. 
4 Cache model 
A cache model was also used in o/lr (;xpetilll(!ilt. 
We (lid not use all the words in tile previous ul> 
teran(:e, but rather filtered out several types of 
words in order to retain only lopi(: relate(1 words. 
We, actually used all of the "selected keywor(ls" as 
explained in the last section for our ca(:he model. 
Seo~'es for the words iil (:~ehe (CS,:o,',<,,,)) a,e 
(xmq)ut(;d in a similar way to that for sublanguage 
words. Here, N' is the number of tokens in the 
previously uttere(t N-best sentences. 
1,"(,,,)/:v' cs',~o.,.,..(1,.)--_. ~, 
lo~( .... ) Is) 
,,, i, ..... I.. F(w)/M 
5 Experiment 
The speech recognition experiment has been con- 
ducted as a part of the 1995 AI1,PA continuous 
915 
speech recognition evaluation under the supervi- 
sion of NIST(NIST, 1996). The conditions of the 
experiment are: 
• The input is read speech of unlimited vocab- 
ulary texts, selected from several sources of 
North American Business (NAB) news from 
the period 1-31 August 1995 
• Three non-close talking microphones are used 
anonymously for each article 
• All speech is recorded in a room with back- 
ground noise in the range of 47 to 61 dB (A 
weighted) 
• The test involves 20 speakers and each 
speaker reads 15 sentences which are taken 
in sequence from a single article 
• Speaker gender iv unknown 
The SRI system, which we used as the base sys- 
tem, produces N-bent (with N=I.00) sentences and 
six kinds of scores, as they are explained before. 
We produce two additional scores based on the 
sublanguage model and the cache model. The 
two scores are linearly combined with SRI's six 
scores. The weights of the eight scores are deter- 
mined by minimizing the word error on the train- 
ing data set. The training data set, has speech 
data recorded under the same conditions as the 
evaluation data set. The training data set con- 
sists of 256 sentences, 17 articles (a part of the 
ARPA 1995 CSR "dev test" data distributed by 
NIST) and does not overlap the evaluation data 
set. 
The evaluation is clone with the tuned pa- 
rameters of the sublanguage component and the 
weights of the eight scores decided by the training 
optimization. Then the evaluation is conducted 
using 300 sentences, 20 articles, (the ARPA 1995 
CSR "eval test" distributed by NIST) disjoint 
fl'om the dev test and training corpus. The eval- 
uation of the sublanguage method has to be done 
by comparing the word error rate (WER) of the 
system with sublanguage scores to that of the SRI 
system without sublanguage scores. 
Inevitably, this evaluation is affected by the per- 
formance of the base system. In particular, the 
number of errors for the base system and the min- 
immn number of errors obtainable by choosing 
the N-best hypotheses with minimmn error, are 
important. (We will call the latter kinds of er- 
ror "MNE" for "minimal N-best errors".) The 
difference of these nmnbers indicates the possible 
improvement we (:an achieve by restoring the hy- 
potheses using additional components. 
We can't expect our sublanguage model to fix 
all of the 375 word errors (non-MNE). For one 
thing, there are a lot of word errors unrelated to 
the article topic, for exmnple function word re- 
placement ("a" replaced by "the"), or deletion 
or insertion of topic unrelated words (missing 
Num. of error WER 
SRI system 1522 25.37 % 
MNE 1147 19.12 % 
Possible 
hnprovement 375 6.25 % 
Figure 2: Word Error of the base system and MNE 
"over"). Also, the word errors in the first sen- 
tence of each article are not withii~ our means to 
tlX. \] 
6 Result 
The absolute improvement using the sublanguage 
component over SRI's system is 0.65%, from 
25.37% to 24.72%, as shown in Table 3. That is, 
the number of word errors is reduced froin 1522 
to 1483. This means that 1.0.40% of the possible 
improvement was achieved (39 out; of 375). The 
System 
SRI 
SRI+SL 
WEB. 
25.37 % 
24.72 % 
Num. of 
Error 
1522 
1483 
Improve 
exel. MNE 
10.40% 
Figure 3: Word Error Rate 
absolute improvement looks tiny, however, the reJ- 
ative improvement excluding MNE, 10.40 %, is 
quite impressive, becmlse there are several types 
of error which can not be corrected by the sublan- 
guage model, as was explained before. 
The following is an example of the actual outtmt 
of the system. (This is a relatively badly recog- 
nized example.) 
.... Example .... 
in recent weeks hyundai corporation and 
fujitsu limited announced plans for 
memory chip plants in oregon at projected 
costs of over one billion dollars each 
in recent weeks CONTINENTAL VERSION 
SUGGESTS ONLY limited announced plans for 
MEMBERSHIP FINANCING FOR IT HAD projected 
COST of one DAY each 
in recent weeks CONTINENTAL VERSION 
SUGGESTS ONLY limited announced plans for 
memory chip plants in WORTHINGTON PROJECT 
COST of one MILLION each 
1Note that, in our experiment, a few errors in tat- 
tim sentences were corrected, because of the weight 
optimization based oil the eight scores which includes 
all of the SRI's scores. But it; is very minor and these 
improvements are offset by a similar number of dis- 
improvements caused by tile same reason. 
916 
The first sentence is the correct transeriI)tion, the 
second one is SRI's best scored hypothesis, and 
the third one is the hypothesis with the highest 
combined score of SRI and our models. This sen- 
tence is the 15th in an article on memory chip 
production. As you can see, a mistake in SRI's hy- 
pothesis, membership instead of memory and chip, 
was replaced by the correct wor(ts. Ilowever, other 
parts of the sentence, like hyundai corporation 
and fuj itsu, were not amended. V~Te lkmnd that 
this particular error is one (If the MNE, for which 
there is no ¢'orreet candidate in the N-best hy- 
potheses. Another error, million or day instead 
of billion, is not a MNE. There exist some hy- 
potheses which have billion at the right spot, 
(the 47th ean(lidate is the top candidate which 
has the word). Our sublanguage model works to 
replace word day by million, but this was not 
the correct word. 
7 Discussion 
Although the actual improvement in word error 
rate is relatively small, partially because of fa(:- 
tors we (:ould not control, of which the probleni of 
MNE is the most important, the results suggest 
that the sul)language technique may lie useful in 
improving the si)eeeh recognition system. One of 
the methods for increasing the t)ossibility (if im- 
provement is to make N (of N-best) larger, thus 
including more. corre(:t hypotheses in the. N-best. 
We tried this, becmlse SRI actually provided us 
with 2000 N-best hypotheses. However, parame- 
ter optimization showed us that 100 is the oi)ti- 
real numl)er for this parmneter. This result can 
be explained by the folh)wing statistic. Table 4 
describes the nuinber of MNE as a function of N 
for the training data set; and evaluation (lata set:. 
Also in parentheses, the numl)er of possit)le im- 
proveinents for each case is shown. Ae('or(ling to 
N MNE MNE 
(evaluation) (training) 
1 1522 
50 1163 (359) 
100 1147 (375) 
200 \] 134 (388) 
500 1116 (406) 
1000 1109 (41.3) 
2000 1107 (415) 
1258 
991 (2(~7) 
960 (298) 
947 (311) 
935 (323) 
93() (328) 
929(329) 
Figure 4: N and Word Error 
the table, the number of MNE decreases rapidly 
for N up to 100; however, after that point, the 
number decreases only slightly. For example, in 
the ewduation data set, increasing N fl'om 500 to 
2000 introduces only 9 new possible word error im- 
provements. We believe this small number gives 
our colnponent greater opt)ortunity to include er- 
rors rather than improvenlents. 
Improvements will no doubt be possible through 
better adjustment of the parameter settings. 
There are parameters involved in the similarity 
calculation, the size of the sublanguage set, the 
ratio threshold, etc. To date, we have tuned them 
by manual optimization using a relatively small 
mnnt)er of trials and a very small training set (the 
20 articles for which we have N-best transcrit)- dons). 
We will need to use automatic optimiza- 
tion methods and a substantially larger training 
set;. Since we do not have a much larger set of 
articles with speech data, one possibility is to op- 
timize the systeln in terlns of perplexity using a 
nnlch larger text corpus for training, and apply 
the optimized parameters to the speech recogni- 
tion system. With regard to the size of sublan- 
guagc set, a constant size may not be optimal. 
Sekine (Sekine, 1994) reported on an extmriment 
which selects the size automatically by seeking the 
ininimum ratio ()\[ the docunlent set, perple.xity to 
|;lie estimated t)erplexity of randomly schooled doc- 
ument sets of that size. This approach can be 
applit'abh'~ to our systeln. 
Wc may also need to reconsider the strategy 
for incorporating the sublanguage component into 
the speech recognition system. For example, it 
might be worthwhile to reconsider how to mix our 
score with SRI's language model score. SRI pro- 
vides language model scores for each hyi)othesis, 
not for words. However, we can imagine that, if 
their language score can be computed with high 
confidence for a particular word, then our model 
should have. relatively little weight. On the other 
hand, if the language model has low confidence, 
sublanguage should have strong weight. In other 
words, the combination of the scores should not be 
done by linear combination at the sentence level, 
but should be done at the word level. 
Also there are several things we need to re- 
ewduate regm:ding our sublanguage model. ()lie 
of thenl is the threshold method we adopt here, 
which introduces undesirable discontinuities into 
our la.nguage model. The method for retrieving 
similar articles may also need to be modified. We. 
used a silnple technique whit:h is conunoil in in- 
formation retrieval research. However, the pur~ 
pose of our system is slightly different from that 
of information retrieval systems. So, one fllture 
direction is to look for a more suitable retrieval 
method for our purpose. 
in closing, we wish to mention that the sub- 
language technique we have described is a gen- 
eral approach to enhancing a statistical language 
model, and is therefore applicable to tasks be- 
sides speech recognition, such as optical (:haracter 
recognition and machine translation. For exam- 
lflC, if a machine translation system uses a statis- 
tical model for target language word choice, our 
917 
approach could improve word choice by selecting 
more topic related words. 
8 Acknowledgment 
The work reported here was supported by the Ad- 
vanced Research Projects Agency under contract 
DABT63-93-C-0058 from the Department of the 
Army. We would like to thank the collaboration 
partners at SRI, in particular Mr.Ananth Sankar 
and Mr.Victor S. Abrash. Also we thank for use- 
ful discussions and suggestions Prof. Grishman 
and Slava Katz. 
References 
Satoshi Sekine, John Sterling and Ralph Grish- 
mail 1995 NYU/BBN 1994 CSR evaluation 
In Proceedings of the ARPA Spoken Language 
Systems Technology Worksh, op 
R Kuhn. 1988 Speech Recognition and the Fre- 
quency of Recently Used Words: A Modified 
Markov Model for Natural Language In Pro- 
ceedings o:f I2th International Conference on 
Computational Linguistic 
F Jelinek, B Merialdo, S Roukos, and M Strauss. 
1991 A Dynamic Language Model for Speech 
Recognition In Proceedings of Speech and Nat- 
ural Language DARPA Workshop 
J Kupiec. 1989 Probabilistic Models of Short and 
Long Distance Word Dependencies in Running 
Text In Proceedings of Speech and Natural Lan- 
guage DARPA Workshop 
Ronald Rosenfeld and Xuedong Huang. 1992 Im- 
provements in Stochastic Language Modeling 
In Proceedings of DARPA Speech and Natural 
Language Workshop 
Satoshi Sekine 1994 A New Direction for Sub- 
language NLP In Prvcecdings of International 
conference on New Mcthods in Language Pro- 
cessing 
K Sparck-Jones. 1973 Index Term Weighting In 
Information Storage and Retrieval, Vol.9, p619- 
633 
Vassilios Digalakis, Mitch Weintraub, Ananth 
Sankar, Horaeio Franco, Leonardo Neumeyer, 
and Hy Murveit 1995 Continuous Speech Dic- 
tation on ARPA's North Business News Domain 
in Proceedings of the ARPA Spoken Language 
Systems Technology Workshop, p88-93 
David S. Pallett et.al, to appear 1995 Benchmark 
Test for the ARPA Spoken Language Program 
In Proceedings of the ARPA Spoken Languagc 
Systems Technology Workshop 
918 
