Improvement of a Whole Sentence Maximum Entropy Language Model Using Grammatical Features*

Fredy Amaya† and José Miguel Benedí
Departamento de Sistemas Informáticos y Computación
Universidad Politécnica de Valencia
Camino de vera s/n, 46022-Valencia (Spain)
{famaya, jbenedi}@dsic.upv.es
Abstract
In this paper, we propose adding long-term grammatical information to a Whole Sentence Maximum Entropy Language Model (WSME) in order to improve the performance of the model. The grammatical information was added to the WSME model in the form of features obtained from a Stochastic Context-Free Grammar. Finally, experiments using a part of the Penn Treebank corpus were carried out, and significant improvements were achieved.
1 Introduction
Language modeling is an important component in
computational applications such as speech recog-
nition, automatic translation, optical character
recognition, information retrieval, etc. (Jelinek, 1997; Borthwick, 1997). Statistical language models have gained considerable acceptance due to the efficiency they have demonstrated in the fields in which they have been applied (Bahl et al., 1983; Jelinek et al., 1991; Ratnaparkhi, 1998; Borthwick, 1999).
Traditional statistical language models calculate the probability of a sentence s using the chain rule:

p(s) = p(w_1 w_2 \ldots w_n) = \prod_{i=1}^{n} p(w_i | h_i)   (1)
*This work has been partially supported by the Spanish CYCIT under contract (TIC98/0423-C06).
†Granted by Universidad del Cauca, Popayán (Colombia).
where h_i = w_1 \ldots w_{i-1}, which is usually known as the history of w_i. The effort in language modeling techniques is usually directed to the estimation of p(w_i | h_i). The language model defined by the expression p(w_i | h_i) is named the conditional language model. In principle, the determination of the conditional probability in (1) is expensive, because the number of possible word sequences is very large. Traditional conditional language models assume that the probability of the word w_i does not depend on the entire history; the history is limited by an equivalence relation \Phi, and (1) is rewritten as:
p(s) = p(w_1 w_2 \ldots w_n) \approx \prod_{i=1}^{n} p(w_i | \Phi(h_i))   (2)
The most commonly used conditional language model is the n-gram model. In the n-gram model, the history is reduced (by the equivalence relation) to the last n-1 words. The power of the n-gram model resides in its consistency with the training data, its simple formulation, and its easy implementation. However, the n-gram model only uses the information provided by the last n-1 words to predict the next word and so only makes use of local information. In addition, the value of n must be low (n \leq 3), because for n > 3 there are problems with the parameter estimation.
Hybrid models have been proposed, in an at-
tempt to supplement the local information with
long-distance information. They combine dif-
ferent types of models, like n-grams, with long-
distance information, generally by means of lin-
ear interpolation, as has been shown in (Belle-
garda, 1998; Chelba and Jelinek, 2000; Benedí and Sánchez, 2000).
A formal framework to include long-distance
and local information in the same language model
is based on the Maximum Entropy principle
(ME). Using the ME principle, we can combine
information from a variety of sources into the
same language model (Berger et al., 1996; Rosen-
feld, 1996). The goal of the ME principle is that, given a set of features (pieces of desired information contained in the sentence), a set of functions f_1, \ldots, f_N (measuring the contribution of each feature to the model) and a set of constraints¹, we have to find the probability distribution that satisfies the constraints and minimizes the relative entropy (Kullback-Leibler divergence) D(p \| p_0) with respect to a distribution p_0.
The general Maximum Entropy probability distribution relative to a prior distribution p_0 is given by the expression:
p(s) = \frac{1}{Z} p_0(s) \exp \left( \sum_i \lambda_i f_i(s) \right)   (3)
where Z is the normalization constant and the \lambda_i are the parameters to be found. The \lambda_i represent the contribution of each feature to the distribution.
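To make (3) concrete, the following minimal sketch scores a sentence in log space; the names prior_logprob, features and lambdas are illustrative assumptions rather than anything from the paper, and the constant log Z is omitted since it does not change the ranking of sentences.

```python
def me_log_score(sentence, prior_logprob, features, lambdas):
    """Unnormalized log-probability under the ME distribution (3):
    log p(s) = log p0(s) + sum_i lambda_i * f_i(s) - log Z,
    with the constant log Z dropped (it is the same for every sentence)."""
    return prior_logprob(sentence) + sum(
        lam * f(sentence) for lam, f in zip(lambdas, features))
```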
From (3) it is easy to derive the Maximum Entropy conditional language model (Rosenfeld, 1996): if X is the context space and V is the vocabulary, then X \times V is the state space, and if (x, w) \in X \times V then:

p(w | x) = \frac{1}{Z(x)} \exp \left( \sum_i \lambda_i f_i(x, w) \right)   (4)
and Z(x) is:

Z(x) = \sum_{w \in V} \exp \left( \sum_i \lambda_i f_i(x, w) \right)   (5)
where Z(x) is the normalization constant depending on the context x. Although the conditional ME language model is more flexible than n-gram models, there is an important obstacle to its general use: conditional ME language models have a high computational cost (Rosenfeld, 1996), especially the evaluation of the normalization constant (5).
¹The constraints usually involve the equality between the theoretical expectation and the empirical expectation over the training corpus.
Although we can incorporate local information
(like n-grams) and some kinds of long-distance
information (like triggers) within the conditional
ME model, the global information contained in
the sentence is poorly encoded in the ME model,
as happens with the other conditional models.
There is a language model which is able to take
advantage of the local information and at the same
time allows for the use of the global properties of
the sentence: the Whole Sentence Maximum En-
tropy model (WSME) (Rosenfeld, 1997). We can
include classical information such as n-grams, distance n-grams or triggers, together with global properties of the sentence, as features in the WSME framework. Moreover, the WSME model training procedure is less expensive than that of the conditional ME model, and its most important training step is based on well-developed statistical sampling techniques. In recent works (Chen
and Rosenfeld, 1999a), WSME models have been
successfully trained using features of n-grams and
distance n-grams.
In this work, we propose adding information to
the WSME model which is provided by the gram-
matical structure of the sentence. The informa-
tion is added in the form of features by means
of a Stochastic Context-Free Grammar (SCFG).
The grammatical information is combined with
features of n-grams and triggers.
In section 2, we describe the WSME model and
the training procedure in order to estimate the pa-
rameters of the model. In section 3, we define
the grammatical features and the way of obtaining
them from the SCFG. Finally, section 4 presents the experiments carried out using a part of the Wall Street Journal corpus in order to evaluate the behavior of this proposal.
2 Whole Sentence Maximum Entropy
Model
The whole sentence Maximum Entropy model di-
rectly models the probability distribution of the
complete sentence². The WSME language model
has the form of (3).
In order to simplify the notation, we write \alpha_i = e^{\lambda_i} and define:

²By sentence, we understand any sequence of linguistic units that belongs to a certain vocabulary.
g(s) = \prod_{i=1}^{N} \alpha_i^{f_i(s)}   (6)
so (3) is written as:
p(s) = \frac{1}{Z} p_0(s) g(s)   (7)

where s is a sentence and the \alpha_i are now the parameters to be learned.
The training procedure used to estimate the parameters of the model is the Improved Iterative Scaling algorithm (IIS) (Della Pietra et al., 1995). IIS is based on the change of the log-likelihood over the training corpus \Omega when each of the parameters changes from \lambda_i to \lambda_i + \delta_i, \delta_i \in \mathbb{R}. Mathematical considerations on the change in the log-likelihood give the training equation:

\sum_{s} p(s) f_i(s) e^{\delta_i f_{\#}(s)} - \sum_{s \in \Omega} \tilde{p}(s) f_i(s) = 0   (8)
where f_{\#}(s) = \sum_{i=1}^{N} f_i(s). In each iteration of the IIS, we have to find the value of the improvement \delta_i in the parameters by solving (8) with respect to \delta_i, for each i = 1, \ldots, N.
The main obstacle in the WSME training process resides in the calculation of the first sum in (8). The sum extends over all the sentences s of a given length. The great number of such sentences makes it impossible, from a computational perspective, to calculate the sum, even for a moderate length³. Nevertheless, such a sum is the statistical expected value of a function of s with respect to the distribution p: E_p[f_i e^{\delta_i f_{\#}}]. As is well known, it can be estimated using the sampling expectation as:

E_p[f_i e^{\delta_i f_{\#}}] \approx \frac{1}{M} \sum_{j=1}^{M} f_i(s_j) \beta_i^{f_{\#}(s_j)}   (9)

where s_1, \ldots, s_M is a random sample from p and \beta_i = e^{\delta_i}.
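To make the use of (8) and (9) concrete, the sketch below solves the IIS equation for a single feature by bisection, with the model expectation replaced by the sampling estimate (9). All the names (sample, corpus, f_i, f_sharp) are illustrative; non-negative features are assumed, so the left-hand side is monotone in delta, and the root is assumed to lie in the chosen bracket.

```python
import math

def iis_delta(sample, corpus, f_i, f_sharp, tol=1e-6):
    """One IIS step for feature f_i: find delta such that
    (1/M) sum_j f_i(s_j) exp(delta * f#(s_j)) = E_corpus[f_i],
    i.e. equation (8) with the first sum estimated as in (9)."""
    target = sum(f_i(s) for s in corpus) / len(corpus)   # empirical expectation
    def g(delta):
        model = sum(f_i(s) * math.exp(delta * f_sharp(s))
                    for s in sample) / len(sample)       # sampled model term
        return model - target
    lo, hi = -10.0, 10.0          # assumed bracket for the root
    while hi - lo > tol:          # bisection: g is monotone for f_i, f# >= 0
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)        # the improvement delta_i of equation (8)
```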
Note that in (7) the constant Z is unknown, so direct sampling from p is not possible. For sampling from such types of probability distributions, Markov Chain Monte Carlo (MCMC) sampling methods have been successfully used when the distribution is not completely known (Neal, 1993). MCMC methods are based on the convergence of certain Markov chains to a target distribution p. In MCMC, a path of the Markov chain is run for a long time, after which the visited states are considered as sampling elements. MCMC sampling methods have been used in the parameter estimation of WSME language models, especially the Independence Metropolis-Hastings (IMH) and the Gibbs sampling algorithms (Chen and Rosenfeld, 1999a; Rosenfeld, 1997). The best results have been obtained using the IMH algorithm.

³The number of sentences of length m is |V|^m, where V is the vocabulary.
Although MCMC performs well, the distribution from which the sample is obtained is only an approximation of the target sampling distribution. Therefore, samples obtained from such distributions may produce some bias in sample statistics, like the sampling mean. Recently, another sampling technique which is also based on Markov chains has been developed by Propp and Wilson (Propp and Wilson, 1996): the Perfect Sampling (PS) technique. PS is based on the concept of Coupling From the Past. In PS, several paths of the Markov chain are run from the past (one path from each state of the chain). In all the paths, the transition rule of the Markov chain uses the same set of random numbers to transit from one state to another. Thus, if two paths coincide in the same state at time t, they will remain in the same states for the rest of the time. In such a case, we say that the two paths have collapsed.
Now, if all the paths collapse at any given time, from that point in time on, we are sure that we are sampling from the true target distribution p. The Coupling From the Past algorithm systematically goes back into the past, runs paths from all states, and repeats this procedure until a time T has been found such that all the paths that begin at time -T have collapsed at time t = 0. Then we run a path of the chain from the state at time t = -T to the current time (t = 0), and the last state reached is a sample from the target distribution. The reason for going from the past to the current time is technical, and is detailed in (Propp and Wilson, 1996). If the state space is huge (as is the case where the state space is the set of all sentences), we must define a stochastic order over the state space and then run only two paths: one beginning in the minimum state and the other in the maximum state, following the same mechanism described above for the two paths until they collapse. In this way, it is proved that we get a sample from the exact target distribution and not from an approximate distribution as in MCMC algorithms (Propp and Wilson, 1996). Thus, we hope that in samples generated with perfect sampling, statistical parameter estimators may be less biased than those generated with MCMC.
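A generic sketch of the monotone Coupling From the Past scheme just described may help; the chain, its stochastic order, and the names used here (step, s_min, s_max) are illustrative assumptions, not the paper's actual sentence-space construction.

```python
import random

def monotone_cftp(step, s_min, s_max):
    """Coupling From the Past with two paths (Propp and Wilson, 1996):
    run one path from the minimum state and one from the maximum state,
    driving both with the SAME random numbers; once they have collapsed
    by time 0, the common state is an exact sample from the target
    distribution. `step(state, u)` is the deterministic transition rule
    driven by a random number u."""
    seeds = []                    # seeds[k] drives the transition at time -(k+1)
    T = 1
    while True:
        while len(seeds) < T:     # go further into the past, REUSING the
            seeds.append(random.random())  # randomness already drawn
        lo, hi = s_min, s_max
        for u in reversed(seeds[:T]):      # replay times -T, ..., -1
            lo, hi = step(lo, u), step(hi, u)
        if lo == hi:              # paths collapsed: exact sample at time 0
            return lo
        T *= 2                    # not collapsed: restart twice as far back
```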
Recently (Amaya and Benedí, 2000), PS was successfully used to estimate the parameters of a WSME language model. In that work, a comparison was made between the performance of WSME models trained using MCMC and WSME models trained using PS. Features of n-grams and features of triggers were used in both kinds of models, and the WSME model trained with PS had the better performance. We therefore considered it appropriate to use PS in the training procedure of the WSME model.
The estimation of the model parameters is completed with the estimation of the global normalization constant Z. Using (7), we can deduce that Z = E_{p_0}[g(s)], and thus estimate Z using the sampling expectation:
E_{p_0}[g(s)] \approx \frac{1}{M} \sum_{j=1}^{M} g(s_j)
where s_1, \ldots, s_M is a random sample from p_0. Because we have total control over the distribution p_0, it is easy to sample from it in the traditional way.
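In code, this estimate is a plain Monte Carlo average; the sketch below assumes sentences already drawn from the prior and an implementation of g from (6), with illustrative names throughout.

```python
def estimate_z(prior_sample, g):
    """Estimate Z = E_{p0}[g(s)] as the sample mean of g over sentences
    drawn from the prior p0 (here, the baseline n-gram model), which can
    be sampled from directly in the traditional way."""
    return sum(g(s) for s in prior_sample) / len(prior_sample)
```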
3 The grammatical features
The main goal of this paper is the incorporation of grammatical features into the WSME model. Grammatical information may be helpful in many applications of computational linguistics. The grammatical structure of the sentence provides long-distance information to the model, thereby complementing the information provided by other sources and improving the performance of the model. Grammatical features give greater weight to grammatically correct sentences than to grammatically incorrect ones, thereby helping the model to assign better probabilities to correct sentences of the language of the application. To capture the grammatical information, we use Stochastic Context-Free Grammars (SCFGs).
Over the last decade, there has been an increas-
ing interest in Stochastic Context-Free Grammars
(SCFGs) for use in different tasks (Baker, 1979; Jelinek, 1991; Ney, 1992; Sakakibara, 1990).
The reason for this can be found in the capa-
bility of SCFGs to model the long-term depen-
dencies established between the different lexical
units of a sentence, and the possibility to incor-
porate the stochastic information that allows for
an adequate modeling of the variability phenom-
ena. Thus, SCFGs have been successfully used on
limited-domain tasks of low perplexity. However,
SCFGs work poorly for large vocabulary, general-
purpose tasks, because the parameter learning and
the computation of word transition probabilities
present serious problems for complex real tasks.
To capture the long-term relations and to solve the main problem derived from the use of SCFGs in large-vocabulary complex tasks, we consider the proposal in (Benedí and Sánchez, 2000): define a category-based SCFG and a probabilistic model of word distribution in the categories. The use of categories as terminals of the grammar reduces the number of rules to be taken into account and, thus, the time complexity of the SCFG learning procedure. The use of the probabilistic model of word distribution in the categories allows us to obtain the best derivation of the sentences in the application.
Actually, we have to solve two problems: the
estimation of the parameters of the models and
their integration to obtain the best derivation of a
sentence.
The parameters of the two models are esti-
mated from a training sample. Each word in the
training sample has a part-of-speech tag (POStag)
associated to it. These POStags are considered as
word categories and are the terminal symbols of
our SCFG.
Given a category, the probability distribution of a word is estimated by means of the relative frequency of the word in the category, i.e., the relative frequency with which the word w has been labeled with the POStag (a word w may belong to different categories).
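A minimal sketch of this relative-frequency estimate, assuming the training sample is given as (word, POStag) pairs; the names are illustrative, and the special "unknown"-symbol treatment described in section 4 is not included.

```python
from collections import Counter, defaultdict

def word_given_category(tagged_words):
    """Relative-frequency estimate of P(w | c): the fraction of times
    word w was labeled with POStag c in the training sample."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[tag][word] += 1
    probs = {}
    for tag, words in counts.items():
        total = sum(words.values())
        probs[tag] = {w: n / total for w, n in words.items()}
    return probs
```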
To estimate the SCFG parameters, several algorithms have been presented (Lari and Young, 1991; Pereira and Schabes, 1992; Amaya et al., 1999; Sánchez and Benedí, 1999). Taking into account the good results achieved on real tasks (Sánchez and Benedí, 1999), we used the latter to learn our category-based SCFG.
To solve the integration problem, we used an
algorithm that computes the probability of the
best derivation that generates a sentence, given
the category-based grammar and the model of word distribution in the categories (Benedí and Sánchez, 2000). This algorithm is based on the well-known Viterbi-like scheme for SCFGs.
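The paper does not spell this algorithm out, but a Viterbi-like scheme for a SCFG in Chomsky normal form can be sketched as the CKY-style dynamic program below, run over the POStag string of a sentence. The rule tables and names are assumptions for illustration; the word-distribution model would enter by multiplying P(w|c) into the lexical step.

```python
import math

def viterbi_cky(categories, lexical, binary, start):
    """Log-probability of the best derivation of a string of categories
    under a CNF SCFG. `lexical[(A, c)]` is the probability of rule A -> c
    and `binary[(A, B, C)]` that of A -> B C."""
    n = len(categories)
    best = {}                                  # best[(i, j, A)]: max log-prob
    for i, c in enumerate(categories):         # spans of length 1: A -> c
        for (A, cc), p in lexical.items():
            if cc == c and math.log(p) > best.get((i, i + 1, A), -math.inf):
                best[(i, i + 1, A)] = math.log(p)
    for span in range(2, n + 1):               # longer spans: A -> B C
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for (A, B, C), p in binary.items():
                    lb = best.get((i, k, B))
                    rb = best.get((k, j, C))
                    if lb is not None and rb is not None:
                        cand = math.log(p) + lb + rb
                        if cand > best.get((i, j, A), -math.inf):
                            best[(i, j, A)] = cand
    return best.get((0, n, start))             # None if no derivation exists
```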
Once the grammatical framework is defined, we are in a position to make use of the information provided by the SCFG. In order to define the
grammatical features, we first introduce some no-
tation.
A Context-Free Grammar G is a four-tuple (N, \Sigma, P, S), where N is the finite set of non-terminals, \Sigma is the finite set of terminals (N \cap \Sigma = \emptyset), S \in N is the initial symbol of the grammar, and P is the finite set of productions or rules of the form A \to \alpha, where A \in N and \alpha \in (N \cup \Sigma)^*. We consider only context-free grammars in Chomsky normal form, that is, grammars with rules of the form A \to BC or A \to v, where A, B, C \in N and v \in \Sigma.
A Stochastic Context-Free Grammar G_s is a pair (G, p), where G is a context-free grammar and p is a probability distribution over the grammar rules.
The grammatical features are defined as follows: let s = w_1 \ldots w_n be a sentence of the training set. As mentioned above, we can compute the best derivation of the sentence s using the defined SCFG and obtain the parse tree of the sentence. Once we have the parse trees of all the sentences in the training corpus, we can collect the set of all the production rules used in the derivation of the sentences in the corpus.
Formally, we define the set E(s) = \{ (x, y, Z) \mid Z \to xy \}, where x, y, Z \in \Sigma \cup N. E(s) is the set of all grammatical rules used in the derivation of s. To include the rules of the form A \to v, where A \in N and v \in \Sigma, in the set E(s), we make use of a special symbol $ which is neither in the terminals nor in the non-terminals. If a rule of the form A \to v occurs in the derivation tree of s, the corresponding element of E(s) is written as (A, v, $). The set E = \bigcup_{s \in \Omega} E(s) (where \Omega is the corpus) is the set of grammatical features.
E is the set representation of the grammatical information contained in the derivation trees of the sentences, and it may be incorporated into the WSME model by means of the characteristic functions defined as:

f_{(x,y,Z)}(s) = 1 if (x, y, Z) \in E(s); 0 otherwise   (10)
Thus, whenever the WSME model processes a sentence s, if it is looking for a specific grammatical feature, say (a, b, c), we get the derivation tree for s and the set E(s) is calculated from the derivation tree. Finally, the model asks whether the tuple (a, b, c) is an element of E(s). If it is, the feature is active; if not, the feature (a, b, c) does not contribute to the sentence probability. Therefore, a sentence may be considered grammatically incorrect (relative to the SCFG used) if derivations with low frequency appear in it.
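Under these definitions, extracting E(s) from a derivation tree is a simple traversal. The sketch below assumes a nested-tuple tree encoding, (label, terminal) for lexical rules and (label, left, right) for binary rules; both the encoding and the function name are illustrative.

```python
def rule_tuples(tree):
    """Collect E(s) from a CNF derivation tree: (x, y, Z) for each binary
    rule Z -> x y, and (A, v, '$') for each lexical rule A -> v."""
    feats = set()
    def walk(node):
        if len(node) == 2:                     # (A, v): lexical rule
            feats.add((node[0], node[1], '$'))
        else:                                  # (Z, left, right): Z -> x y
            _, left, right = node
            feats.add((left[0], right[0], node[0]))
            walk(left)
            walk(right)
    walk(tree)
    return feats
```

For example, rule_tuples(('S', ('NP', 'DT'), ('VP', 'VB'))) yields {('NP', 'VP', 'S'), ('NP', 'DT', '$'), ('VP', 'VB', '$')}; the characteristic function (10) then reduces to a membership test on this set.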
4 Experimental Work
A part of the Wall Street Journal (WSJ) corpus which had been processed in the Penn Treebank Project (Marcus et al., 1993) was used in the experiments. This corpus was automatically labelled and manually checked. There were two kinds of labelling: POStag labelling and syntactic labelling. The POStag vocabulary was composed of 45 labels; there were 14 syntactic labels. The corpus was divided into sentences according to the bracketing.
We selected 12 sections of the corpus at random. Six were used as the training corpus, three as the test set, and the other three sections were used as held-out data for tuning the smoothing of the WSME model. The sets are described as follows: the training corpus has 11,201 sentences; the test set has 6,350 sentences; and the held-out set has 5,796 sentences.
A baseline Katz back-off smoothed trigram model was trained using the CMU-Cambridge Statistical Language Modeling Toolkit⁴ and used as the prior distribution p_0 in (3). The vocabulary generated by the trigram model was used as the vocabulary of the WSME model. The size of the vocabulary was 19,997 words.
⁴Available at: http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
The estimation of the word-category probabil-
ity distribution was computed from the training
corpus. In order to avoid null values, the unseen events were labeled with a special "unknown" symbol which did not appear in the vocabulary, so that the probabilities of the unseen events were positive for all the categories.
The SCFG had the maximum number of rules that can be composed of 45 terminal symbols (the number of POStags) and 14 non-terminal symbols (the number of syntactic labels). The initial probabilities were randomly generated, and three different seeds were tested. However, only one of them is reported here, given that the results were very similar.
The size of the sample used in the IIS was estimated by means of an experimental procedure and was set at 10,000 elements. The procedure used to generate the sample made use of the "diagnosis of convergence" (Neal, 1993), a method by means of which an initial portion of each run of the Markov chain, of sufficient length, is discarded. Thus, the states in the remaining portion come from the desired equilibrium distribution. In this work, a discarded portion of 3,000 elements was established. Thus, in practice, we had to generate 13,000 instances of the Markov chain.
During the IIS, every sample was tagged using
the grammar estimated above, and then the gram-
matical features were extracted, before combining
them with the other kinds of features. The adequate number of iterations of the IIS was established experimentally at 13.
We trained several WSME models using the
Perfect Sampling algorithm in the IIS and a dif-
ferent set of features (including the grammatical
features) for each model. The different sets of features used in the models were: n-grams (1-grams, 2-grams, 3-grams); triggers; n-grams and grammatical features; triggers and grammatical features; n-grams, triggers and grammatical features.
The n-gram features (N) were selected by means of their frequency in the corpus. We selected all the unigrams, the bigrams with frequency greater than 5, and the trigrams with frequency greater than 10, in order to maintain the proportion of each type of n-gram in the corpus.
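A sketch of this frequency-based selection with the thresholds just stated (all unigrams, bigrams seen more than 5 times, trigrams more than 10); the sentence-list input format is an assumption.

```python
from collections import Counter

def select_ngram_features(corpus, bigram_min=6, trigram_min=11):
    """Select n-gram features: every unigram, bigrams with frequency > 5,
    trigrams with frequency > 10. `corpus` is a list of sentences, each a
    list of words."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for s in corpus:
        uni.update(s)
        bi.update(zip(s, s[1:]))
        tri.update(zip(s, s[1:], s[2:]))
    return (set(uni),
            {g for g, n in bi.items() if n >= bigram_min},
            {g for g, n in tri.items() if n >= trigram_min})
```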
The triggers (T) were generated using the trigger toolkit developed by Adam Berger⁵. The triggers were selected in accordance with their mutual information: the triggers selected were those with mutual information greater than 0.0001.

Feat.       N         T         N+T
Without     143.197   145.432   129.639
With        125.912   122.023   116.42
% Improv.   12.10%    16.10%    10.2%

Table 1: Comparison of the perplexity between models with grammatical features and models without grammatical features for WSME models over part of the WSJ corpus. N means features of n-grams, T means features of triggers. The perplexity of the trained n-gram model was PP=162.049.
The grammatical features (G) were selected using the parse trees of all the sentences in the training corpus to obtain the sets E(s) and their union E, as defined in section 3.
The size of the initial set of features was 12,023 n-grams, 39,428 triggers and 258 grammatical features: 51,709 features in total. At the end of the training procedure, the number of active features was significantly reduced to 4,000 features on average.
During the training procedure, some of the \alpha_i \approx 0, so we smoothed the model. We smoothed it using a Gaussian prior technique: we assumed that the \lambda_i parameters had a Gaussian (normal) prior probability distribution (Chen and Rosenfeld, 1999b) and found the maximum a posteriori parameter distribution. The prior distribution was \lambda_i \sim N(0, \sigma_i^2), and we used the held-out data to find the \sigma_i^2 parameters.
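Under this Gaussian prior, the root condition of each IIS step gains a penalty term that pulls parameters toward zero, following the MAP update form of Chen and Rosenfeld (1999b). The earlier bisection sketch would change roughly as below; the names remain illustrative and the bracket is assumed to contain the root.

```python
import math

def iis_delta_gaussian(sample, corpus, f_i, f_sharp, lam_i, sigma2, tol=1e-6):
    """IIS step for feature f_i under a Gaussian prior lambda_i ~ N(0, sigma2):
    the sampled model term plus the penalty (lambda_i + delta)/sigma2 must
    match the empirical expectation of f_i."""
    target = sum(f_i(s) for s in corpus) / len(corpus)
    def g(delta):
        model = sum(f_i(s) * math.exp(delta * f_sharp(s))
                    for s in sample) / len(sample)
        return model + (lam_i + delta) / sigma2 - target
    lo, hi = -10.0, 10.0           # assumed bracket for the root
    while hi - lo > tol:           # g is strictly increasing in delta
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```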
Table 1 shows the experimental results: the
first row represents the set of features used. The
second row shows the perplexity of the models
without using grammatical features. The third
row shows the perplexity of the models using
grammatical features and the fourth row shows
the improvement in perplexity of each model us-
ing grammatical features over the corresponding
model without grammatical features. As can be
seen in Table 1, all the WSME models performed better than the n-gram model; however, that is natural because, in the worst case (if all \alpha_i = 1), the WSME models perform like the n-gram model.

⁵Available at: http://www.cs.cmu.edu/afs/cs/user/aberger/www/
In Table 1, we see that all the models using grammatical features perform better than the models that do not use them. Since the training procedure was the same for all the models described, and since the only difference between the two kinds of models compared was the grammatical features, we conclude that the improvement must be due to the inclusion of such features in the feature set. The average improvement in perplexity was about 13%.
Also, although the N+T model performs better than the other models without grammatical features (N, T), it behaves worse than all the models with grammatical features (N+G improved 2.9% and T+G improved 5.9% over N+T).
5 Conclusions and future work
In this work, we have successfully added grammatical features to a WSME language model, using a SCFG to extract the grammatical information. We have shown that the use of grammatical features in a WSME model improves the performance of the model. By adding grammatical features to the WSME model, we have obtained a reduction in perplexity of 13% on average over models that do not use grammatical features, as well as a reduction in perplexity of between approximately 22% and 28% over the n-gram model.
We are working on the implementation of other kinds of grammatical features which are based on the POStag sentences obtained using the SCFG that we have defined. The preliminary experiments have shown promising results.
We will also be working on the evaluation of the word-error rate (WER) of the WSME model. In the case of the WSME model, the WER may be evaluated in a type of post-processing using the n-best utterances.
References
F. Amaya and J. M. Benedí. 2000. Using Perfect Sampling in Parameter Estimation of a Whole Sentence Maximum Entropy Language Model. Proc. Fourth Computational Natural Language Learning Workshop, CoNLL-2000.
F. Amaya, J. A. Sánchez, and J. M. Benedí. 1999. Learning stochastic context-free grammars from bracketed corpora by means of reestimation algorithms. Proc. VIII Spanish Symposium on Pattern Recognition and Image Analysis, pages 119–126.
L.R. Bahl, F. Jelinek, and R. L. Mercer. 1983. A maximum likelihood approach to continuous speech recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 5(2):179–190.
J. R. Bellegarda. 1998. A multispan language model-
ing framework for large vocabulary speech recogni-
tion. IEEE Transactions on Speech and Audio Pro-
cessing, 6 (5):456–467.
J.M. Benedí and J.A. Sánchez. 2000. Combination of n-grams and stochastic context-free grammars for language modeling. Proc. International Conference on Computational Linguistics (COLING-ACL), pages 55–61.
A.L. Berger, V.J. Della Pietra, and S.A. Della Pietra. 1996. A Maximum Entropy approach to natural language processing. Computational Linguistics, 22(1):39–72.
A. Borthwick. 1997. Survey paper on statistical lan-
guage modeling. Technical report, New York Uni-
versity.
A. Borthwick. 1999. A Maximum Entropy Approach
to Named Entity Recognition. PhD Dissertation
Proposal, New York University.
C. Chelba and F. Jelinek. 2000. Structured lan-
guage modeling. Computer Speech and Language,
14:283–332.
S. Chen and R. Rosenfeld. 1999a. Efficient sampling
and feature selection in whole sentence maximum
entropy language models. Proc. IEEE Int. Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP).
S. Chen and R. Rosenfeld. 1999b. A gaussian prior
for smoothing maximum entropy models. Techni-
cal Report CMU-CS-99-108, Carnegie Mellon Uni-
versity.
S. Della Pietra, V. Della Pietra, and J. Lafferty. 1995.
Inducing features of random fields. Technical Re-
port CMU-CS-95-144, Carnegie Mellon University.
F. Jelinek, B. Merialdo, S. Roukos, and M. Strauss. 1991. A dynamic language model for speech recognition. Proc. of the Speech and Natural Language DARPA Workshop, pages 293–295.
F. Jelinek. 1991. Up from trigrams! The struggle for improved language models. Proc. of EUROSPEECH, European Conference on Speech Communication and Technology, 3:1034–1040.
F. Jelinek. 1997. Statistical Methods for Speech Recognition. The MIT Press, Massachusetts Institute of Technology. Cambridge, Massachusetts.
K. Lari and S.J. Young. 1991. Applications of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, pages 237–257.
J.K. Baker. 1979. Trainable grammars for speech recognition. Speech communication papers for the 97th meeting of the Acoustical Society of America, pages 547–550.
M. P. Marcus, B. Santorini, and M.A. Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19.
R. M. Neal. 1993. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto.
H. Ney. 1992. Stochastic grammars and pattern
recognition. In P. Laface and R. De Mori, editors,
Speech Recognition and Understanding. Recent Ad-
vances, pages 319–344. Springer Verlag.
F. Pereira and Y. Schabes. 1992. Inside-outside reestimation from partially bracketed corpora. Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pages 128–135. University of Delaware.
J. G. Propp and D. B. Wilson. 1996. Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms, 9:223–252.
A. Ratnaparkhi. 1998. Maximum Entropy models for natural language ambiguity resolution. PhD Dissertation, University of Pennsylvania.
R. Rosenfeld. 1996. A Maximum Entropy approach to adaptive statistical language modeling. Computer Speech and Language, 10:187–228.
R. Rosenfeld. 1997. A whole sentence Maximum Entropy language model. IEEE Workshop on Speech Recognition and Understanding.
Y. Sakakibara. 1990. Learning context-free grammars from structural data in polynomial time. Theoretical Computer Science, 76:233–242.
J. A. Sánchez and J. M. Benedí. 1999. Learning of stochastic context-free grammars by means of estimation algorithms. Proc. of EUROSPEECH, European Conference on Speech Communication and Technology, 4:1799–1802.
