Statistical Language Modeling with Performance Benchmarks
using Various Levels of Syntactic-Semantic Information
Dharmendra KANEJIYA∗, Arun KUMAR†, Surendra PRASAD∗
∗Department of Electrical Engineering
†Centre for Applied Research in Electronics
Indian Institute of Technology
New Delhi 110016 INDIA
kanejiya@hotmail.com, arunkm@care.iitd.ernet.in, sprasad@ee.iitd.ernet.in
Abstract
Statistical language models using the n-gram
approach have been criticized for
neglecting large-span syntactic-semantic information
that influences the choice of the
next word in a language. One of the approaches
that has helped recently is the use of
latent semantic analysis to capture the se-
mantic fabric of the document and enhance
the n-gram model. Similarly there have
been some approaches that used syntactic
analysis to enhance the n-gram models. In
this paper, we explain a framework called
syntactically enhanced latent semantic anal-
ysis and its application in statistical lan-
guage modeling. This approach augments
each word with its syntactic descriptor in
terms of the part-of-speech tag, phrase type
or the supertag. We observe that given this
syntactic knowledge, the model outperforms
LSA based models significantly in terms of
perplexity measure. We also present some
observations on the effect of the knowledge
of content or function word type in language
modeling. This paper also poses the prob-
lem of better syntax prediction to achieve
the benchmarks.
1 Introduction
Statistical language modeling consists of estimating
the probability distribution of a word given
the history of words used so far. The standard
n-gram language model considers two histories
to be equivalent if they end in the same n − 1
words. Due to the tradeoff between predictive
power and reliability of estimation, n is typically
chosen to be 2 (bi-gram) or 3 (tri-gram). Even the
tri-gram model suffers from the sparse-data estimation
problem, but various smoothing techniques
(Goodman, 2001) have led to significant improvements
in many applications. Still, the
criticism that n-grams are unable to capture the
long-distance dependencies that exist in a language
remains largely valid.
In order to model the linguistic structure that
spans a whole sentence or a paragraph or even
more, various approaches have been taken re-
cently. These can be categorized into two main
types : syntactically motivated and semanti-
cally motivated large span consideration. In
the first type, the probability of a word is decided
based on parse-tree information like grammatical
headwords in a sentence (Charniak, 2001)
(Chelba and Jelinek, 1998), or based on part-
of-speech (POS) tag information (Galescu and
Ringger, 1999). Examples of the second type
are (Bellegarda, 2000) (Coccaro and Jurafsky,
1998), where latent semantic analysis (LSA)
(Landauer et al., 1998) is used to derive large-
span semantic dependencies. LSA uses word-
document co-occurrence statistics and a matrix
factorization technique called singular value de-
composition to derive semantic similarity mea-
sure between any two text units - words or doc-
uments. Each of these approaches, when inte-
grated with n-gram language model, has led to
improved performance in terms of perplexity as
well as speech recognition accuracy.
While each of these approaches has been stud-
ied independently, it would be interesting to see
how they can be integrated in a unified framework
which looks at syntactic as well as semantic
information in the large span. Towards this
direction, we describe in this paper a mathematical
framework called syntactically enhanced latent
semantic analysis (SELSA). The basic hypothesis is
that considering a word along with its syntactic
descriptor as a unit of knowledge representation in
an LSA-like framework gives us an approach to joint
syntactic-semantic analysis of a document. It also
provides a finer resolution in each word’s semantic
description for each of the syntactic contexts it
occurs in. Here the syntactic descriptor
can come from various levels, e.g. part-of-speech
tag, phrase type, supertag etc. This syntactic-
semantic representation can be used in language
modeling to allocate the probability mass to
words in accordance with their semantic simi-
larity to the history as well as syntactic fitness
to the local context.
In the next section, we present the mathe-
matical framework. Then we describe its ap-
plication to statistical language modeling. In
section 4 we explain the use of various levels
of syntactic information in SELSA. That is
followed by experimental results and conclusion.
2 Syntactically Enhanced LSA
Latent semantic analysis (LSA) is a statistical,
algebraic technique for extracting and inferring
relations of expected contextual usage of words
in documents (Landauer et al., 1998). It is
based on word-document co-occurrence statis-
tics, and thus is often called a ‘bag-of-words’
approach. It neglects the word-order or syn-
tactic information in a language which if prop-
erly incorporated, can lead to better language
modeling. In an effort to include syntactic in-
formation in the LSA framework, we have de-
veloped a model which characterizes a word’s
behavior across various syntactic and semantic
contexts. This can be achieved by augmenting
a word with its syntactic descriptor and
considering the pair as a unit of knowledge
representation. The resultant LSA-like analysis is termed
syntactically enhanced latent semantic analysis
(SELSA). This approach can better model
the finer resolution in a word’s usage compared
to an average representation by LSA. This finer
resolution can be used to better discriminate
semantically ambiguous sentences for cognitive
modeling as well as to predict a word using
syntactic-semantic history for language model-
ing. We explain below, a step-by-step procedure
for building this model.
2.1 Word-Tag-Document Structure
The syntactic description of a word can be in
many forms like part-of-speech tag, phrase type
or supertags. In the description hereafter we
refer to any such syntactic information as the tag of the
word. Now, consider a tagged training corpus
of sufficient size in the domain of interest. The
first step is to construct a matrix whose rows
correspond to word-tag pairs and columns cor-
respond to documents in the corpus. A docu-
ment can be a sentence, a paragraph or a larger
unit of text. If the vocabulary V consists of
I words, the tagset T consists of J tags and the
number of documents in the corpus is K, then the
matrix will be IJ × K. Let $c_{i\_j,k}$ denote the
frequency of word $w_i$ with tag $t_j$ in the
document $d_k$. The notation i_j (i underscore j) in the
subscript is used for convenience and indicates
word $w_i$ with tag $t_j$, i.e. the $((i-1)J + j)$th row of
the matrix. Then we find the entropy $\varepsilon_{i\_j}$ of each
word-tag pair and scale the corresponding row
of the matrix by $(1 - \varepsilon_{i\_j})$. Document length
normalization is also applied to each column of the
matrix by dividing the entries of the kth document
by $n_k$, the number of words in document
$d_k$. Let $c_{i\_j}$ be the frequency of the i_jth word-tag
pair in the whole corpus, i.e. $c_{i\_j} = \sum_{k=1}^{K} c_{i\_j,k}$.
Then,

$$x_{i\_j,k} = (1 - \varepsilon_{i\_j}) \, \frac{c_{i\_j,k}}{n_k} \tag{1}$$

$$\varepsilon_{i\_j} = -\frac{1}{\log K} \sum_{k=1}^{K} \frac{c_{i\_j,k}}{c_{i\_j}} \log \frac{c_{i\_j,k}}{c_{i\_j}} \tag{2}$$
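As a minimal sketch of eqs. (1) and (2), the weighting can be written in plain Python as below. The function name `build_weighted_matrix` and the input format (a list of documents, each a list of (word, tag) pairs) are assumptions made for illustration. The sketch also keeps only the word-tag pairs that actually occur, rather than all IJ rows, and assumes at least two non-empty documents so that log K is non-zero.

```python
import math
from collections import Counter


def build_weighted_matrix(tagged_docs):
    """Return (row_keys, X), where X[r][k] follows eqs. (1) and (2).

    tagged_docs: list of documents, each a non-empty list of (word, tag) pairs.
    Only observed word-tag pairs get a row (an assumption of this sketch).
    """
    K = len(tagged_docs)
    counts = [Counter(doc) for doc in tagged_docs]   # c_{i_j,k}
    totals = Counter()                               # c_{i_j}
    for c in counts:
        totals.update(c)
    row_keys = sorted(totals)

    X = []
    for key in row_keys:
        # Normalized entropy of the pair across documents, eq. (2).
        ent = 0.0
        for c in counts:
            if c[key] > 0:
                p = c[key] / totals[key]
                ent -= p * math.log(p)
        ent /= math.log(K)
        # Entropy scaling and document-length normalization, eq. (1).
        X.append([(1.0 - ent) * c[key] / len(doc)
                  for c, doc in zip(counts, tagged_docs)])
    return row_keys, X
```

A pair that is spread evenly over all documents has entropy 1 and is scaled to zero, while a pair concentrated in a single document keeps its full length-normalized count.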
Once the matrix X is obtained, we perform
its singular value decomposition (SVD) and ap-
proximate it by keeping the largest R singular
values and setting the rest to zero. Thus,
$$X \approx \hat{X} = U S V^{T} \tag{3}$$
where, U(IJ ×R) and V(K×R) are orthonor-
mal matrices and S(R×R) is a diagonal matrix.
It is this dimensionality reduction step through
SVD that captures major structural associa-
tions between words-tags and documents, re-
moves ‘noisy’ observations and allows the same
dimensional representation of words-tags and
documents (albeit in different bases). This same
dimensional representation is used (eq. 12) to
find syntactic-semantic correlation between the
present word and the history of words and then
to derive the language model probabilities. This
R-dimensional space can be called either syntac-
tically enhanced latent semantic space or latent
syntactic-semantic space.
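The rank-R truncation of eq. (3) can be illustrated with NumPy as follows. This is only a sketch: the helper name `truncated_svd` is an assumption, and a dense SVD is practical only for small matrices. A corpus-sized word-tag-document matrix would need a sparse truncated solver, which is not shown here.

```python
import numpy as np


def truncated_svd(X, R):
    """Keep the R largest singular values of X, as in eq. (3).

    Returns U (rows x R), S (R x R diagonal) and V (columns x R).
    """
    U, s, Vt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    # numpy returns singular values in descending order, so slicing keeps the largest.
    return U[:, :R], np.diag(s[:R]), Vt[:R, :].T
```

The product `U @ S @ V.T` then gives the rank-R approximation of X, and each row of U is the R-dimensional representation of one word-tag pair.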
2.2 Document Projection in SELSA
Space
After the knowledge is represented in the latent
syntactic-semantic space, we can project any
new document as an R-dimensional vector $\bar{v}_{selsa}$
in this space. Let the new document consist
of a word sequence $w_{i_1}, w_{i_2}, \ldots, w_{i_n}$ and let the
corresponding tag sequence be $t_{j_1}, t_{j_2}, \ldots, t_{j_n}$,
where $i_p$ and $j_p$ are the indices of the pth word
and its tag in the vocabulary V and the tagset
T respectively. Let d be the IJ × 1 vector
representing this document, whose elements $d_{i\_j}$ are
the frequency counts, i.e. the number of times word
$w_i$ occurs with tag $t_j$, weighted by the
corresponding entropy measure $(1 - \varepsilon_{i\_j})$. It can be
thought of as an additional column in the matrix
X, and therefore as having
its corresponding vector v in the matrix V.
Then, $d = U S v^{T}$ and

$$\bar{v}_{selsa} = v S = d^{T} U = \frac{1}{n} \sum_{p=1}^{n} (1 - \varepsilon_{i_p\_j_p}) \, u_{i_p\_j_p} \tag{4}$$

which is a 1 × R dimensional vector representation
of the document in the latent space. Here
$u_{i_p\_j_p}$ represents the row vector of the SELSA
U matrix corresponding to the word $w_{i_p}$ and
tag $t_{j_p}$ in the current document.
We can also define a syntactic-semantic simi-
larity measure between any two text documents
as the cosine of the angle between their pro-
jection vectors in the latent syntactic-semantic
space. With this measure we can address the
problems that LSA has been applied to, namely
natural language understanding, cognitive mod-
eling, statistical language modeling etc.
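The projection of eq. (4) and the cosine similarity between two projected documents can be sketched as below. The names `project_document` and `cosine`, the dictionary-based storage of the U rows and entropies, and the choice to skip word-tag pairs that were not seen in training are all assumptions of this sketch.

```python
import math


def project_document(tagged_doc, U_rows, entropy):
    """Eq. (4): average of entropy-weighted U rows over the document.

    tagged_doc: non-empty list of (word, tag) pairs.
    U_rows:     dict mapping (word, tag) to its R-dimensional row of U.
    entropy:    dict mapping (word, tag) to its normalized entropy.
    """
    R = len(next(iter(U_rows.values())))
    v = [0.0] * R
    for pair in tagged_doc:
        if pair not in U_rows:      # assumption: unseen pairs contribute nothing
            continue
        weight = 1.0 - entropy[pair]
        for r in range(R):
            v[r] += weight * U_rows[pair][r]
    n = len(tagged_doc)
    return [x / n for x in v]


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two documents are then compared by projecting each one and taking `cosine` of the two resulting vectors.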
3 Statistical Language Modeling
using SELSA
3.1 Framework
We follow the framework in (Bangalore, 1996)
to define a class-based language model where
classes are defined by the tags. Here the probability
of a sequence $W_n$ of n words is given by

$$P(W_n) = \sum_{t_1} \cdots \sum_{t_n} \prod_{q=1}^{n} P(w_q \mid t_q, W_{q-1}, T_{q-1}) \, P(t_q \mid W_{q-1}, T_{q-1}) \tag{5}$$

where $t_i$ is a tag variable for the word $w_i$. To
compute this probability in real time based on
local information, we make certain assumptions:

$$\begin{aligned}
P(w_q \mid t_q, W_{q-1}, T_{q-1}) &\approx P(w_q \mid t_q, w_{q-1}, w_{q-2}) \\
P(t_q \mid W_{q-1}, T_{q-1}) &\approx P(t_q \mid t_{q-1})
\end{aligned} \tag{6}$$
where probability of a word is calculated by
renormalizing the tri-gram probability to those
words which are compatible with the tag in con-
text. Similarly, tag probability is modeled using
a bi-gram model. Other models like tag based
likelihood probability of a word or tag tri-grams
can also be used. Similarly there is a motiva-
tion for using the syntactically enhanced latent
semantic analysis method to derive the word
probability given the syntax of tag and seman-
tics of word-history.
The calculation of perplexity is based on con-
ditional probability of a word given the word
history, which can be derived in the following
manner using recursive computation.
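$$\begin{aligned}
P(w_q \mid W_{q-1}) &= \sum_{t_q} P(w_q \mid t_q, W_{q-1}) \, P(t_q \mid W_{q-1}) \\
&\approx \sum_{t_q} P(w_q \mid t_q, w_{q-1}, w_{q-2}) \sum_{t_{q-1}} P(t_q \mid t_{q-1}) \, P(t_{q-1} \mid W_{q-1}) \\
&= \sum_{t_q} P(w_q \mid t_q, w_{q-1}, w_{q-2}) \sum_{t_{q-1}} P(t_q \mid t_{q-1}) \, \frac{P(W_{q-1}, t_{q-1})}{\sum_{t_{q-1}} P(W_{q-1}, t_{q-1})}
\end{aligned} \tag{7}$$

where,

$$P(W_q, t_q) = \left[ \sum_{t_{q-1}} P(W_{q-1}, t_{q-1}) \, P(t_q \mid t_{q-1}) \right] P(w_q \mid t_q, w_{q-1}, w_{q-2}) \tag{8}$$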
A further reduction in computation is
achieved by restricting the summation over only
those tags which the target word can anchor. A
similar expression using the tag tri-gram model
can be derived which includes double summa-
tion. The efficiency of this model depends upon
the prediction of tag tq using the word his-
tory Wq−1. When the target tag is correctly
known, we can derive a performance bench-
mark in terms of lower bound on the perplexity
achievable. Furthermore, if we assume a tagged
corpus, then the $t_q$’s and $T_q$’s become deterministic
variables and (5) and (7) can be written as

$$P(W_n) = \prod_{q=1}^{n} P(w_q \mid t_q, W_{q-1}, T_{q-1}) \tag{9}$$

$$P(w_q \mid W_{q-1}) = P(w_q \mid t_q, W_{q-1}, T_{q-1}) \tag{10}$$

respectively, in which case the SELSA language
model described next can easily be applied to
calculate the benchmarks.
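The recursion of eqs. (7) and (8) for an untagged word sequence can be sketched as follows. The function name `word_probabilities`, the callable interfaces `p_word`, `p_tag` and `p_tag_init`, and the use of `None` as history padding are assumptions of this sketch. In particular, the start condition `p_tag_init` is not specified in the text above.

```python
def word_probabilities(words, tags, p_word, p_tag, p_tag_init):
    """Return the list of P(w_q | W_{q-1}) for each word, via eqs. (7) and (8).

    p_word(w, t, w1, w2): P(w | t, previous word w1, word before that w2)
    p_tag(t, t_prev):     tag bi-gram probability P(t | t_prev)
    p_tag_init(t):        tag probability at the first position (assumed)
    """
    alpha = None              # alpha[t] = P(W_{q-1}, t_{q-1} = t)
    w1, w2 = None, None       # previous two words, None as padding
    result = []
    for w in words:
        if alpha is None:
            inner = {t: p_tag_init(t) for t in tags}
            z = 1.0
        else:
            # Bracketed sum of eq. (8): sum over t_{q-1} of alpha * P(t_q | t_{q-1}).
            inner = {t: sum(alpha[tp] * p_tag(t, tp) for tp in tags) for t in tags}
            z = sum(alpha.values())           # denominator in eq. (7)
        emit = {t: p_word(w, t, w1, w2) for t in tags}
        result.append(sum(emit[t] * inner[t] for t in tags) / z)   # eq. (7)
        alpha = {t: inner[t] * emit[t] for t in tags}              # eq. (8)
        w1, w2 = w, w1
    return result
```

The product of the returned values equals the sequence probability of eq. (5) under the approximations of eq. (6). For long sequences the unnormalized `alpha` values would underflow, so a practical version would rescale them at each step.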
3.2 SELSA Language Model
A SELSA model using tag information for each
word can also be developed and used along the
lines of the LSA based language model. We can
observe in the above framework the need for
a probability of the form $P(w_q \mid t_q, W_{q-1}, T_{q-1})$,
which can be evaluated using the SELSA representation
of the word-tag pair corresponding
to $w_q$ and $t_q$ and the history $W_{q-1} T_{q-1}$. The
former is given by the row $u_{i_q\_j_q}$ of the SELSA U
matrix and the latter can be projected onto the
SELSA space as a vector $\tilde{\bar{v}}_{q-1}$ using (4). The
length of the history can be tapered to reduce the
effect of far distant words using an exponential
forgetting factor 0 < λ < 1 as below:

$$\tilde{\bar{v}}_{q-1} = \frac{1}{q-1} \sum_{p=1}^{q-1} \lambda^{q-1-p} (1 - \varepsilon_{i_p\_j_p}) \, u_{i_p\_j_p} \tag{11}$$
The next step is to calculate the cosine measure
reflecting the syntactic-semantic ‘closeness’
between the word $w_q$ and the history $W_{q-1}$ as
below:

$$K(w_q, W_{q-1}) = \frac{u_{i_q\_j_q} \, \tilde{\bar{v}}_{q-1}^{T}}{\| u_{i_q\_j_q} S^{1/2} \| \; \| \tilde{\bar{v}}_{q-1} S^{-1/2} \|} \tag{12}$$
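Eqs. (11) and (12) can be sketched in plain Python as below. The function names `history_vector` and `closeness`, and the representation of S by the list of its R diagonal entries, are assumptions of this sketch. It also assumes a non-empty history, positive singular values and non-zero vectors.

```python
import math


def history_vector(history_rows, weights, lam):
    """Eq. (11): exponentially decayed average of weighted U rows.

    history_rows[p]: U row of the p-th word-tag pair in the history.
    weights[p]:      its (1 - entropy) factor.
    lam:             forgetting factor, 0 < lam < 1.
    """
    m = len(history_rows)                  # m = q - 1
    R = len(history_rows[0])
    v = [0.0] * R
    for p, (row, w) in enumerate(zip(history_rows, weights)):
        decay = lam ** (m - 1 - p)         # the most recent pair gets lam ** 0
        for r in range(R):
            v[r] += decay * w * row[r]
    return [x / m for x in v]


def closeness(u_row, v_hist, s):
    """Eq. (12), with s the list of singular values on the diagonal of S."""
    dot = sum(u * v for u, v in zip(u_row, v_hist))
    norm_u = math.sqrt(sum((u * math.sqrt(x)) ** 2 for u, x in zip(u_row, s)))
    norm_v = math.sqrt(sum((v / math.sqrt(x)) ** 2 for v, x in zip(v_hist, s)))
    return dot / (norm_u * norm_v)
```

When all singular values are equal to 1, `closeness` reduces to the ordinary cosine between the word-tag row and the history vector.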
Then the SELSA based probability
$P^{(sel)}(w_q \mid W_{q-1})$ is calculated by allocating
the total probability mass in proportion to this
closeness measure such that the least likely word
has a probability of 0 and all probabilities sum
to 1:

$$K_{min}(W_{q-1}) = \min_{w_i \in V} K(w_i, W_{q-1}) \tag{13}$$

$$\hat{P}(w_q \mid W_{q-1}) = \frac{K(w_q, W_{q-1}) - K_{min}(W_{q-1})}{\sum_{w_i \in V} \left( K(w_i, W_{q-1}) - K_{min}(W_{q-1}) \right)} \tag{14}$$
But this results in a very limited dynamic range
for SELSA probabilities which leads to poor
performance. This is alleviated by raising the
above derived probability to a power γ > 1 and
then normalizing as follows (Coccaro and Jurafsky,
1998):

$$P^{(sel)}(w_q \mid W_{q-1}) = \frac{\hat{P}(w_q \mid W_{q-1})^{\gamma}}{\sum_{w_i \in V} \hat{P}(w_i \mid W_{q-1})^{\gamma}} \tag{15}$$
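The mapping from closeness scores to probabilities in eqs. (13) to (15) can be sketched as follows. The function name `selsa_probabilities` and the uniform fallback used when all scores are equal are assumptions of this sketch. The text above does not say how that degenerate case is handled.

```python
def selsa_probabilities(closeness_by_word, gamma):
    """Turn closeness scores K(w_i, W_{q-1}) into probabilities, eqs. (13)-(15).

    closeness_by_word: dict mapping each vocabulary word to its score.
    gamma:             exponent greater than 1 that widens the dynamic range.
    """
    k_min = min(closeness_by_word.values())                        # eq. (13)
    shifted = {w: k - k_min for w, k in closeness_by_word.items()}
    total = sum(shifted.values())
    if total == 0.0:
        # Assumption: if every word is equally close, fall back to uniform.
        n = len(shifted)
        return {w: 1.0 / n for w in shifted}
    p_hat = {w: x / total for w, x in shifted.items()}             # eq. (14)
    powered = {w: p ** gamma for w, p in p_hat.items()}
    z = sum(powered.values())
    return {w: x / z for w, x in powered.items()}                  # eq. (15)
```

With scores 0.9, 0.5 and 0.1 and γ = 2, the intermediate distribution of eq. (14) is (2/3, 1/3, 0), and raising to the power γ sharpens it further in favour of the closest word.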
This probability gives more importance to the
large span syntactic-semantic dependencies and
thus would be higher for those words which are
syntactic-semantically regular in the recent his-
tory as compared to others. But it will not
predict very well certain locally regular words
like of, the etc whose main role is to support
the syntactic structure in a sentence. On the
other hand, n-gram language models are able
to model them well because of maximum likeli-
hood estimation from training corpus and var-
ious smoothing techniques. So the best perfor-
mance can be achieved by integrating the two.
One way to derive the ‘SELSA + N-gram’ joint
probability $P^{(sel+ng)}(w_q \mid W_{q-1})$ is to use the
geometric mean based integration formula given
for LSA in (Coccaro and Jurafsky, 1998) as follows:

$$P^{(sel+ng)}(w_q \mid W_{q-1}) = \frac{[P^{(sel)}(w_q \mid W_{q-1})]^{\xi_{i_q}} \, [P(w_q \mid w_{q-1}, \ldots, w_{q-n+1})]^{1-\xi_{i_q}}}{\sum_{w_i \in V} [P^{(sel)}(w_i \mid W_{q-1})]^{\xi_i} \, [P(w_i \mid w_{q-1}, \ldots, w_{q-n+1})]^{1-\xi_i}} \tag{16}$$

where $\xi_{i_q} = \frac{1 - \varepsilon_{i_q\_j_q}}{2}$ and $\xi_i = \frac{1 - \varepsilon_{i\_j_q}}{2}$ are the geometric
mean weights for SELSA probabilities
for the current word $w_q$ and any word $w_i \in V$
respectively.
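The geometric interpolation of eq. (16) can be sketched as below. The function name `interpolate` and the dictionary-based representation of the two distributions and of the per-word weights are assumptions of this sketch.

```python
def interpolate(p_sel, p_ng, xi):
    """Eq. (16): per-word weighted geometric mean, renormalized over the vocabulary.

    p_sel: dict word -> SELSA probability given the history.
    p_ng:  dict word -> n-gram probability given the local context.
    xi:    dict word -> weight (1 - entropy) / 2, which lies in [0, 0.5].
    """
    raw = {w: (p_sel[w] ** xi[w]) * (p_ng[w] ** (1.0 - xi[w])) for w in p_ng}
    z = sum(raw.values())
    return {w: x / z for w, x in raw.items()}
```

A word with entropy close to 1 (typically a function word) has a weight close to 0, so its combined probability is driven almost entirely by the n-gram model, while low-entropy words receive up to equal weight from both models.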
4 Various Levels of Syntactic
Information
In this section we explain various levels of syntactic
information that can be incorporated
within the SELSA framework. They are supertags,
phrase type and content/function word type.
These are in decreasing order of complexity and
provide finer to coarser levels of syntactic information.
4.1 Supertags
Supertags are the elementary structures of Lex-
icalized Tree Adjoining Grammars (LTAGs)
(Bangalore and Joshi, 1999). They are com-
bined by the operations of substitution and ad-
junction to yield a parse for the sentence. Each
supertag is lexicalized i.e. associated with at
least one lexical item - the anchor. Further,
all the arguments of the anchor of a supertag
are localized within the same supertag which al-
lows the anchor to impose syntactic and seman-
tic (predicate-argument) constraints directly on
its arguments. As a result, a word is typically
associated with one supertag for each syntac-
tic configuration the word may appear in. Su-
pertags can be seen as providing a much more
refined set of classes than do part-of-speech tags
and hence we expect supertag-based language
models to be better than part-of-speech based
language models.
4.2 Phrase-type
Words in a sentence are not just strung together
as a sequence of parts of speech, but
rather they are organized into phrases, groupings
of words that are clumped as a unit. A
sentence normally rewrites as a subject noun
phrase (NP) and a verb phrase (VP), which are
the major types of phrases apart from prepositional
phrases, adjective phrases etc (Manning
and Schutze, 1999). Using the two major phrase
types and the rest considered as other type, we
constructed a model for SELSA. This model as-
signs each word three syntactic descriptions de-
pending on its frequency of occurrence in each
of three phrase types across a number of documents.
This model captures the semantic behaviour
of each word in each phrase type. Generally
nouns occur in noun phrases and verbs
occur in verb phrases while prepositions occur
in the other type. So this framework brings in
the finer syntactic resolution in each word’s se-
mantic description as compared to LSA based
average description. This is particularly
important for words that occur as both
noun and verb.
4.3 Content or Function Word Type
If a text corpus is analyzed by counting word
frequencies, it is observed that there are cer-
tain words which occur with very high frequen-
cies e.g. the, and, a, to etc. These words have
a very important grammatical behaviour, but
they do not convey much of the semantics. These
words are called function or stop words. Similarly,
in a text corpus, there are certain words
with frequencies in moderate to low range e.g.
car, wheel, road etc. They each play an impor-
tant role in deciding the semantics associated
with the whole sentence or document. Thus
they are known as content words. Generally
a list of vocabulary consists of a few hundred
function words and a few tens of thousands of
content words. However, they span more or less
the same frequency space of a corpus. So it is
also essential to give them equal importance by
treating them separately in a language modeling
framework, as they both convey some sort
of orthogonal information: syntactic vs semantic.
LSA is better at predicting topic bearing
content words while parsing based models are
better for function words. Even n-gram models
are quite good at modeling function words,
but they lack the large-span semantics that can
be captured by LSA. On the other hand, the SELSA
model is suitable for both types of words as it
captures semantics of a word in a syntactic con-
text.
We performed experiments with LSA and
SELSA with various levels of syntactic informa-
tion in both the situations - content words only
vs content and function words together. In the
former case, the function words are treated by
n-gram model only.
5 Experiments and Discussion
A statistical language model is evaluated by
how well it predicts some hitherto unseen text
- test data - generated by the source to be
modeled. A commonly used quality measure
for a given model M is related to the entropy
of the underlying source and is known
as perplexity (PPL). Given a word sequence
$w_1, w_2, \ldots, w_N$ to be used as a test corpus, the
perplexity of a language model M is given by:

$$PPL = \exp\left( -\frac{1}{N} \sum_{q=1}^{N} \log P^{(M)}(w_q \mid W_{q-1}) \right) \tag{17}$$

Perplexity also indicates the (geometric) average
branching factor of the language according
to the model M and thus indicates the difficulty
of a speech recognition task (Jelinek, 1999). The
lower the perplexity, the better the model; usu-
ally a reduction in perplexity translates into a
reduction in word error rate of a speech recog-
nition system.
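Eq. (17) amounts to a few lines of Python, sketched below. The function name `perplexity` is an assumption. Its input is the list of model probabilities assigned to each test word given its history, all of which must be strictly positive.

```python
import math


def perplexity(word_probs):
    """Eq. (17): exponential of the negative mean log-probability."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)
```

A model that assigns probability 1/4 to every test word has a perplexity of 4, which matches the reading of perplexity as an average branching factor.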
We have implemented both the LSA and
SELSA models using the BLLIP corpus (footnote 1), which
consists of machine-parsed English news stories
from the Wall Street Journal (WSJ) for the
years 1987, 1988 and 1989. We used the
supertagger (Bangalore and Joshi, 1999) to
supertag each word in the corpus. This had a
tagging accuracy of 92.2%. The training corpus
consisted of about 40 million words from the WSJ
1987, 1988 and some portion of 1989. This con-
sists of about 87000 documents related to news
stories. The test corpus was a section of WSJ
1989 with around 300,000 words. The baseline
tri-gram model had a perplexity of 103.12 and
bi-gram had 161.06. The vocabulary size for
words was 20106 and for supertags was 449.
5.1 Perplexity Results
In the first experiment, we performed SELSA
using supertag information for each word. The
word-supertag vocabulary was about 60000.
This resulted in a matrix of about 60000 × 87000
for which we performed SVD at various dimensions.
Similarly we trained the LSA matrix and performed
its SVD. Then we used this knowledge to
calculate the language model probability and then
integrated it with the tri-gram probability using the
geometric interpolation method (Coccaro and Jurafsky,
1998). In the process, we had assumed
the knowledge of the content/function word
type for the next word being predicted. Furthermore,
in this experiment, we had used only
content words for LSA as well as SELSA representation,
while the function words were treated
by the tri-gram model only. We also used the
supertagged test corpus, thus we knew the
supertag of the next word being predicted. These
results thus set benchmarks for the content word
based SELSA model. With these assumptions,
we obtained the perplexity values shown in
Table 1.

Footnote 1: Available from the Linguistic Data
Consortium (LDC), www.ldc.upenn.edu
SVD dimensions R      LSA + tri-gram    SELSA + tri-gram
tri-gram only             103.12            103.12
0 (uniform prob)           78.92             60.83
2                          78.05             60.88
10                         74.92             57.88
20                         72.91             56.15
50                         69.85             52.80
125                        68.42             50.39
200                        67.79             49.50
300                        67.34             48.84

Table 1: Perplexity at different SVD dimensions
with content/function word type knowledge assumed.
For SELSA, these are benchmarks with
correct supertag knowledge.
These benchmark results show that given the
knowledge of the content or function word as
well as the supertag of the word being predicted,
SELSA model performs far better than the LSA
model. This improvement in the performance is
attributed to the finer level of syntactic infor-
mation available now in the form of supertag.
Thus given the supertag, the choice of the word
becomes very limited and thus perplexity de-
creases. The decrease in perplexity across the
SVD dimension shows that the SVD also plays
an important role and thus for SELSA it is truly
a latent syntactic-semantic analysis. Thus if
we devise an algorithm to predict the supertag
of the next word with very high accuracy, then
there is a guarantee of performance improvement
by this model compared to LSA.
Our next experiment was based on no knowledge
of the content or function word type of the
next word. Thus the LSA and SELSA matrices
had all the words in the vocabulary. We also
kept the SVD dimensions for both SELSA and
LSA to 125. The results are shown in Table
2. In this case, we observe that LSA achieves
the perplexity of 88.20 compared to the base-
line tri-gram 103.12. However this is more than
LSA perplexity of 68.42 when the knowledge of
content/function words was assumed. This rel-
ative increase is mainly due to poor modeling
of function words in the LSA-space. However
for SELSA, we can observe that its perplexity
of 36.37 is less than 50.39 value in the case of
knowledge about content/function words. This
is again attributed to better modeling of syntac-
tically regular function words in SELSA. This
can be better understood from the observation
that there were 305 function words compared
to 19801 content words in the vocabulary span-
ning 19.8 and 20.3 million words respectively in
the training corpus. Apart from this, there were
152, 145 and 147 supertags anchoring function
words only, content words only and both types
of words respectively. Thus given a supertag
belonging to function word specific supertags,
the ‘vocabulary’ for the target word is reduced
by orders of magnitude compared to the case for
content word specific supertags. It is also worth
observing that the 125-dimensional SVD case of
SELSA is better than the 0-dimensional SVD
or uniform SELSA case. Thus the SVD plays
a role in deciphering the syntactic-semantically
important dimensions of the information space.
Model                        Perplexity
tri-gram only                  103.12
LSA(125)+tri-gram               88.20
SELSA(125)+tri-gram             36.37
uniform-SELSA+tri-gram          41.79

Table 2: Perplexity without content/function
word knowledge. For SELSA, these are benchmarks
with correct supertag knowledge.
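To make the comparison in Table 2 concrete, the relative perplexity reduction of each model with respect to the tri-gram baseline can be computed as below. The helper name `relative_reduction` is an assumption, and the perplexity values are copied from Table 2.

```python
def relative_reduction(baseline_ppl, model_ppl):
    """Fractional perplexity reduction of a model relative to a baseline."""
    return (baseline_ppl - model_ppl) / baseline_ppl


# Perplexities copied from Table 2.
TABLE_2 = {
    "tri-gram only": 103.12,
    "LSA(125)+tri-gram": 88.20,
    "SELSA(125)+tri-gram": 36.37,
    "uniform-SELSA+tri-gram": 41.79,
}

baseline = TABLE_2["tri-gram only"]
for name, ppl in TABLE_2.items():
    print(f"{name}: {100 * relative_reduction(baseline, ppl):.1f}% below tri-gram")
```

The same helper can be applied with LSA(125)+tri-gram as the baseline to compare SELSA against LSA directly.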
We also performed experiments using the
phrase-type (NP, VP, others) knowledge and
incorporated it within the SELSA framework.
The resultant model was also used to calculate
perplexity values, and the results under the
content/function word type assumption compare
favourably with LSA (64.78 against 68.42, Table 3).
In another experiment we used the
part-of-speech tag of the previous word (prevtag)
within SELSA, but at 69.12 it did not improve on
plain LSA. These results show that phrase
level information is somewhat useful if it can be
predicted correctly, but previous POS tags are
not useful.
Model                            Perplexity
tri-gram only                      103.12
LSA(125)+tri-gram                   68.42
phrase-SELSA(125)+tri-gram          64.78
prevtag-SELSA(125)+tri-gram         69.12

Table 3: Perplexity of phrase/prevtag based
SELSA with the knowledge of content/function
word type and the correct phrase/prevtag
Finally the utility of this language model can
be tested in a speech recognition experiment.
Here it can be most suitably applied in a second-
pass rescoring framework where the output of
first-pass could be the N-best list of either joint
word-tag sequences (Wang and Harper, 2002) or
word sequences which are then passed through
a syntax tagger. Both these approaches allow a
direct application of the results shown in above
experiments, however there is a possibility of
error propagation if some word is incorrectly
tagged. The other approach is to predict the
tag left-to-right from the word-tag partial prefix
followed by word prediction and then repeating
the procedure for the next word.
6 Conclusions and Research
Direction
We presented the effect of incorporating vari-
ous levels of syntactic information in a statisti-
cal language model that uses the mathematical
framework called syntactically enhanced LSA.
SELSA is an attempt to develop a unified frame-
work where syntactic and semantic dependen-
cies can be jointly represented. It general-
izes the LSA framework by incorporating var-
ious levels of the syntactic information along
with the current word. This provides a mech-
anism for statistical language modeling where
the probability of a word given the semantics
of the preceding words is constrained by the
adjacent syntax. The results on the WSJ corpus
set benchmarks for the performance
improvements possible with these types of syn-
tactic information. The supertag based infor-
mation is very fine-grained and thus leads to a
large reduction in perplexity if correct supertag
is known. It is also observed that the knowledge
of the phrase type also helps to reduce the per-
plexity compared to LSA. Even the knowledge
of the content/function word type helps addi-
tionally in each of the SELSA based language
models. These benchmarks can be approached
with better algorithms for predicting the nec-
essary syntactic information. Our experiments
are still continuing in this direction as well as
toward better understanding of the overall sta-
tistical language modeling problem with appli-
cations to speech recognition.

References

S. Bangalore and A. K. Joshi. 1999. Supertagging:
an approach to almost parsing. Computational
Linguistics, 25(2):237–265.

S. Bangalore. 1996. "Almost parsing" technique
for language modeling. In Proc. Int. Conf.
Spoken Language Processing, Philadelphia,
PA, USA.

J. R. Bellegarda. 2000. Exploiting la-
tent semantic information in statistical lan-
guage modeling. Proceedings of the IEEE,
88(8):1279–1296.

E. Charniak. 2001. Immediate-head parsing for
language models. In Proc. 39th Annual Meet-
ing of the Association for Computational Lin-
guistics.

C. Chelba and F. Jelinek. 1998. Exploiting syn-
tactic structure for language modeling. In
Proc. COLING-ACL, volume 1, Montreal,
Canada.

N. Coccaro and D. Jurafsky. 1998. To-
wards better integration of semantic predic-
tors in statistical language modeling. In Proc.
ICSLP-98, volume 6, pages 2403–2406, Syd-
ney.

L. Galescu and E. R. Ringger. 1999. Aug-
menting words with linguistic information for
n-gram language models. In Proc. 6th Eu-
roSpeech, Budapest, Hungary.

J. T. Goodman. 2001. A bit of progress in lan-
guage modeling. Microsoft Technical Report
MSR-TR-2001-72.

F. Jelinek. 1999. Statistical methods for speech
recognition. The MIT Press.

T. K. Landauer, P. W. Foltz, and D. Laham.
1998. Introduction to latent semantic analy-
sis. Discourse Processes, 25:259–284.

C. Manning and H. Schutze. 1999. Foundations
of statistical natural language processing. The
MIT Press.

W. Wang and M. P. Harper. 2002. The super-
ARV language model: Investigating the effec-
tiveness of tightly integrating multiple knowl-
edge sources. In Proc. Conf. Empirical Meth-
ods in Natural Language Processing, pages
238–247, Philadelphia.
