Automatic Learning of Language Model Structure
Kevin Duh and Katrin Kirchho 
Department of Electrical Engineering
University of Washington, Seattle, USA
fduh,katring@ee.washington.edu
Abstract
Statistical language modeling remains a challeng-
ing task, in particular for morphologically rich lan-
guages. Recently, new approaches based on factored
language models have been developed to address
this problem. These models provide principled ways
of including additional conditioning variables other
than the preceding words, such as morphological or
syntactic features. However, the number of possible
choices for model parameters creates a large space of
models that cannot be searched exhaustively. This
paper presents an entirely data-driven model selec-
tion procedure based on genetic search, which is
shown to outperform both knowledge-based and ran-
dom selection procedures on two di erent language
modeling tasks (Arabic and Turkish).
1 Introduction
In spite of novel algorithmic developments and
the increased availability of large text corpora,
statistical language modeling remains a di -
cult problem, particularly for languages with
rich morphology. Such languages typically ex-
hibit a large number of word types in relation
to word tokens in a given text, which leads
to high perplexity and a large number of un-
seen word contexts. As a result, probability es-
timates are often unreliable, even when using
standard smoothing and parameter reduction
techniques. Recently, a new language model-
ing approach, called factored language models
(FLMs), has been developed. FLMs are a gen-
eralization of standard language models in that
they allow a larger set of conditioning variables
for predicting the current word. In addition to
the preceding words, any number of additional
variables can be included (e.g. morphological,
syntactic, or semantic word features). Since
such features are typically shared across mul-
tiple words, they can be used to obtained bet-
ter smoothed probability estimates when train-
ing data is sparse. However, the space of pos-
sible models is extremely large, due to many
di erent ways of choosing subsets of condition-
ing word features, backo procedures, and dis-
counting methods. Usually, this space cannot
be searched exhaustively, and optimizing mod-
els by a knowledge-inspired manual search pro-
cedure often leads to suboptimal results since
only a small portion of the search space can
be explored. In this paper we investigate the
possibility of determining the structure of fac-
tored language models (i.e. the set of condition-
ing variables, the backo procedure and the dis-
counting parameters) by a data-driven search
procedure, viz. Genetic Algorithms (GAs). We
apply this technique to two di erent tasks (lan-
guage modeling for Arabic and Turkish) and
show that GAs lead to better models than ei-
ther knowledge-inspired manual search or ran-
dom search. The remainder of this paper is
structured as follows: Section 2 describes the
details of the factored language modeling ap-
proach. The application of GAs to the problem
of determining language model structure is ex-
plained in Section 3. The corpora used in the
present study are described in Section 4 and ex-
periments and results are presented in Section 5.
Section 6 compares the present study to related
work and Section 7 concludes.
2 Factored Language Models
A standard statistical language model com-
putes the probability of a word sequence W =
w1; w2; :::; wT as a product of conditional prob-
abilities of each word wi given its history, which
is typically approximated by just one or two pre-
ceding words (leading to bigrams, and trigrams,
respectively). Thus, a trigram language model
is described by
p(w1; :::; wT )  
TY
i=3
p(wijwi 1; wi 2) (1)
Even with this limitation, the estimation of
the required probabilities is challenging: many
word contexts may be observed infrequently
or not at all, leading to unreliable probabil-
ity estimates under maximum likelihood estima-
tion. Several techniques have been developed
to address this problem, in particular smooth-
ing techniques (Chen and Goodman, 1998) and
class-based language models (Brown and oth-
ers, 1992). In spite of such parameter reduc-
tion techniques, language modeling remains a
di cult task, in particular for morphologically
rich languages, e.g. Turkish, Russian, or Arabic.
Such languages have a large number of word
types in relation to the number of word tokens
in a given text, as has been demonstrated in
a number of previous studies (Geutner, 1995;
Kiecza et al., 1999; Hakkani-T ur et al., 2002;
Kirchho et al., 2003). This in turn results
in a high perplexity and in a large number of
out-of-vocabulary (OOV) words when applying
a trained language model to a new unseen text.
2.1 Factored Word Representations
A recently developed approach that addresses
this problem is that of Factored Language Mod-
els (FLMs) (Kirchho et al., 2002; Bilmes and
Kirchho , 2003), whose basic idea is to decom-
pose words into sets of features (or factors) in-
stead of viewing them as unanalyzable wholes.
Probabilistic language models can then be con-
structed over (sub)sets of word features instead
of, or in addition to, the word variables them-
selves. For instance, words can be decomposed
into stems/lexemes and POS tags indicating
their morphological features, as shown below:
Word: Stock prices are rising
Stem: Stock price be rise
Tag: Nsg N3pl V3pl Vpart
Such a representation serves to express lexical
and syntactic generalizations, which would oth-
erwise remain obscured. It is comparable to
class-based representations employed in stan-
dard class-based language models; however, in
FLMs several simultaneous class assignments
are allowed instead of a single one. In general,
we assume that a word is equivalent to a  xed
number (K) of factors, i.e. W  f1:K. The task
then is to produce a statistical model over the
resulting representation - using a trigram ap-
proximation, the resulting probability model is
as follows:
p(f1:K1 ; f1:K2 ; :::; f1:KT )  
TY
t=3
p(f1:Kt jf1:Kt 1 ; f1:Kt 2 )
(2)
Thus, each word is dependent not only on a sin-
gle stream of temporally ordered word variables,
but also on additional parallel (i.e. simultane-
ously occurring) features. This factored repre-
sentation can be used in two di erent ways to
improve over standard LMs: by using a product
model or a backo model. In a product model,
Equation 2 can be simpli ed by  nding con-
ditional independence assumptions among sub-
sets of conditioning factors and computing the
desired probability as a product of individual
models over those subsets. In this paper we
only consider the second option, viz. using the
factors in a backo procedure when the word
n-gram is not observed in the training data.
For instance, a word trigram that is found in
an unseen test set may not have any counts in
the training set, but its corresponding factors
(e.g. stems and morphological tags) may have
been observed since they also occur in other
words.
2.2 Generalized parallel backo 
Backo is a common smoothing technique in
language modeling. It is applied whenever
the count for a given n-gram in the training
data falls below a certain threshold  . In that
case, the maximum-likelihood estimate of the
n-gram probability is replaced with a probabil-
ity derived from the probability of the lower-
order (n  1)-gram and a backo weight. N-
grams whose counts are above the threshold re-
tain their maximum-likelihood estimates, dis-
counted by a factor that re-distributes proba-
bility mass to the lower-order distribution:
pBO(wtjwt 1; wt 2) (3)
=
 d
cpML(wtjwt 1; wt 2) if c >  3
 (wt 1; wt 2)pBO(wtjwt 1) otherwise
where c is the count of (wt; wt 1; wt 2), pML
denotes the maximum-likelihood estimate and
dc is a discounting factor that is applied to the
higher-order distribution. The way in which the
discounting factor is estimated determines the
actual smoothing method (e.g. Good-Turing,
Kneser-Ney, etc.) The normalization factor
 (wt 1; wt 2) ensures that the entire distribu-
tion sums to one. During standard backo , the
most distant conditioning variable (in this case
wt 2) is dropped  rst, then the second most dis-
tant variable etc. until the unigram is reached.
This can be visualized as a backo path (Fig-
ure 1(a)). If the only variables in the model are
words, such a backo procedure is reasonable.
tW 1tW- 2tW- 3tW-
tW 1tW- 2tW-
tW 1tW-
tW
(a)
F 1F 2F 3F
F
F 1F 2F F 1F 3F F 2F 3F
F 1F F 3FF 2F
(b)
Figure 1: Standard backo path for a 4-gram lan-
guage model over words (left) and backo graph for
4-gram over factors (right).
However, if variables occur in parallel, i.e. do
not form a temporal sequence, it is not imme-
diately obvious in which order they should be
dropped. In this case, several backo paths are
possible, which can be summarized in a backo 
graph (Figure 1(b)). In principle, there are sev-
eral di erent ways of choosing among di erent
paths in this graph:
1. Choose a  xed, predetermined backo path
based on linguistic knowledge, e.g. always drop
syntactic before morphological variables.
2. Choose the path at run-time based on statis-
tical criteria.
3. Choose multiple paths and combine their
probability estimates.
The last option, referred to as parallel backo ,
is implemented via a new, generalized backo 
function (here shown for a 4-gram):
pGBO(fjf1; f2; f3) (4)
=
 d
cpML(fjf1; f2; f3) if c >  4
 (f1; f2; f3)g(f; f1; f2; f3) otherwise
where c is the count of (f; f1; f2; f3),
pML(fjf1; f2; f3) is the maximum likeli-
hood distribution,  4 is the count threshold,
and  (f1; f2; f3) is the normalization factor.
The function g(f; f1; f2; f3) determines the
backo strategy. In a typical backo proce-
dure g(f; f1; f2; f3) equals pBO(fjf1; f2). In
generalized parallel backo , however, g can be
any non-negative function of f; f1; f2; f3. In
our implementation of FLMs (Kirchho et al.,
2003) we consider several di erent g functions,
including the mean, weighted mean, product,
and maximum of the smoothed probability
distributions over all subsets of the conditioning
factors. In addition to di erent choices for g,
di erent discounting parameters can be chosen
at di erent levels in the backo graph. For
instance, at the topmost node, Kneser-Ney
discounting might be chosen whereas at a
lower node Good-Turing might be applied.
FLMs have been implemented as an add-on
to the widely-used SRILM toolkit1 and have
been used successfully for the purpose of
morpheme-based language modeling (Bilmes
and Kirchho , 2003), multi-speaker language
modeling (Ji and Bilmes, 2004), and speech
recognition (Kirchho et al., 2003).
3 Learning FLM Structure
In order to use an FLM, three types of para-
meters need to be speci ed: the initial con-
ditioning factors, the backo graph, and the
smoothing options. The goal of structure learn-
ing is to  nd the parameter combinations that
create FLMs that achieve a low perplexity on
unseen test data. The resulting model space
is extremely large: given a factored word rep-
resentation with a total of k factors, there areP
k
n=1
 k
n
 possible subsets of initial condition-
ing factors. For a set of m conditioning factors,
there are up to m! backo paths, each with its
own smoothing options. Unless m is very small,
exhaustive search is infeasible. Moreover, non-
linear interactions between parameters make it
di cult to guide the search into a particular
direction, and parameter sets that work well
for one corpus cannot necessarily be expected
to perform well on another. We therefore need
an automatic way of identifying the best model
structure. In the following section, we describe
the application of genetic-based search to this
problem.
3.1 Genetic Algorithms
Genetic Algorithms (GAs) (Holland, 1975) are a
class of evolution-inspired search/optimization
techniques. They perform particularly well
in problems with complex, poorly understood
search spaces. The fundamental idea of GAs is
to encode problem solutions as (usually binary)
strings (genes), and to evolve and test successive
populations of solutions through the use of ge-
netic operators applied to the encoded strings.
Solutions are evaluated according to a  tness
function which represents the desired optimiza-
tion criterion. The individual steps are as fol-
1We would like to thank Je Bilmes for providing and
supporting the software.
lows:
Initialize: Randomly generate a set (popula-
tion) of strings.
While  tness improves by a certain threshold:
Evaluate  tness: calculate each string’s  tness
Apply operators: apply the genetic operators
to create a new population.
The genetic operators include the probabilis-
tic selection of strings for the next genera-
tion, crossover (exchanging subparts of di er-
ent strings to create new strings), and muta-
tion (randomly altering individual elements in
strings). Although GAs provide no guarantee
of  nding the optimal solution, they often  nd
good solutions quickly. By maintaining a pop-
ulation of solutions rather than a single solu-
tion, GA search is robust against premature
convergence to local optima. Furthermore, solu-
tions are optimized based on a task-speci c  t-
ness function, and the probabilistic nature of ge-
netic operators helps direct the search towards
promising regions of the search space.
3.2 Structure Search Using GA
In order to use GAs for searching over FLM
structures (i.e. combinations of conditioning
variables, backo paths, and discounting op-
tions), we need to  nd an appropriate encoding
of the problem.
Conditioning factors
The initial set of conditioning factors F are
encoded as binary strings. For instance, a
trigram for a word representation with three
factors (A,B,C) has six conditioning variables:
fA 1; B 1; C 1; A 2; B 2; C 2g which can be
represented as a 6-bit binary string, with a bit
set to 1 indicating presence and 0 indicating ab-
sence of a factor in F. The string 10011 would
correspond to F = fA 1; B 2; C 2g.
Backo graph
The encoding of the backo graph is more dif-
 cult because of the large number of possible
paths. A direct approach encoding every edge
as a bit would result in overly long strings, ren-
dering the search ine cient. Our solution is to
encode a binary string in terms of graph gram-
mar rules (similar to (Kitano, 1990)), which
can be used to describe common regularities in
backo graphs. For instance, a node with m
factors can only back o to children nodes with
m  1 factors. For m = 3, the choices for pro-
ceeding to the next-lower level in the backo 
1. {X1 X2 X3} −> {X1 X2}
2. {X1 X2 X3} −> {X1 X3}
3. {X1 X2 X3} −> {X2 X3}
4.      {X1 X2}  −> {X1}
5.      {X1 X2}  −> {X2}
PRODUCTION RULES:
AB
ABC
AB
ABC
BC AB
ABC
BC
A B
AB
ABC
BC
A B
a0
a0a1
a1
a2
a2a3
a3
a4
a4a5
a5
0
1
4 4
10110
3
(b) Generation of Backoff Graph by rules 1, 3, and 4
(a) Gene activates production rules
GENE:
Figure 2: Generation of Backo Graph from pro-
duction rules selected by the gene 10110.
graph can thus be described by the following
grammar rules:
RULE 1: fx1; x2; x3g ! fx1; x2g
RULE 2: fx1; x2; x3g ! fx1; x3g
RULE 3: fx1; x2; x3g ! fx2; x3g
Here xi corresponds to the factor at the ith
position in the parent node. Rule 1 indicates
a backo that drops the third factor, Rule 2
drops the second factor, etc. The choice of
rules used to generate the backo graph is en-
coded in a binary string, with 1 indicating the
use and 0 indicating the non-use of a rule, as
shown schematically in Figure 2. The presence
of two di erent rules at the same level in the
backo graph corresponds to parallel backo ;
the absence of any rule (strings consisting only
of 0 bits) implies that the corresponding backo 
graph level is skipped and two conditioning vari-
ables are dropped simultaneously. This allows
us to encode a graph using few bits but does not
represent all possible graphs. We cannot selec-
tively apply di erent rules to di erent nodes at
the same level { this would essentially require
a context-sensitive grammar, which would in
turn increase the length of the encoded strings.
This is a fundamental tradeo between the most
general representation and an encoding that is
tractable. Our experimental results described
below con rm, however, that su ciently good
results can be obtained in spite of the above
limitation.
Smoothing options
Smoothing options are encoded as tuples of in-
tegers. The  rst integer speci es the discount-
ing method while second indicates the minimum
count required for the n-gram to be included in
the FLM. The integer string consists of succes-
sive concatenated tuples, each representing the
smoothing option at a node in the graph. The
GA operators are applied to concatenations of
all three substrings describing the set of factors,
backo graph, and smoothing options, such that
all parameters are optimized jointly.
4 Data
We tested our language modeling algorithms on
two di erent data sets from two di erent lan-
guages, Arabic and Turkish.
The Arabic data set was drawn from the
CallHome Egyptian Conversational Arabic
(ECA) corpus (LDC, 1996). The training,
development, and evaluation sets contain
approximately 170K, 32K, and 18K words,
respectively. The corpus was collected for the
purpose of speech recognizer development for
conversational Arabic, which is mostly dialectal
and does not have a written standard. No
additional text material beyond transcriptions
is available in this case; it is therefore im-
portant to use language models that perform
well in sparse data conditions. The factored
representation was constructed using linguistic
information from the corpus lexicon, in combi-
nation with automatic morphological analysis
tools. It includes, in addition to the word, the
stem, a morphological tag, the root, and the
pattern. The latter two are components which
when combined form the stem. An example
of this factored word representation is shown
below:
Word:il+dOr/Morph:noun+masc-sg+article/
Stem:dOr/Root:dwr/Pattern:CCC
For our Turkish experiments we used a mor-
phologically annotated corpus of Turkish
(Hakkani-T ur et al., 2000). The annotation
was performed by applying a morphological
analyzer, followed by automatic morphological
disambiguation as described in (Hakkani-T ur
et al., 2002). The morphological tags consist
of the initial root, followed by a sequence of
in ectional groups delimited by derivation
boundaries (^DB). A sample annotation (for
the word yararlanmak, consisting of the root
yarar plus three in ectional groups) is shown
below:
yararmanlak:
yarar+Noun+A3sg+Pnon+Nom
^DB+Verb+Acquire+Pos
^DB+Noun+Inf+A3sg+Pnon+Nom
We removed segmentation marks (for titles
and paragraph boundaries) from the corpus
but included punctuation. Words may have
di erent numbers of in ectional groups, but
the FLM representation requires the same
number of factors for each word; we therefore
had to map the original morphological tags to
a  xed-length factored representation. This
was done using linguistic knowledge: according
to (O azer, 1999), the  nal in ectional group
in each dependent word has a special status
since it determines in ectional markings on
head words following the dependent word.
The  nal in ectional group was therefore
analyzed into separate factors indicating the
number (N), case (C), part-of-speech (P) and
all other information (O). Additional factors
for the word are the root (R) and all remaining
information in the original tag not subsumed
by the other factors (G). The word itself is
used as another factor (W). Thus, the above
example would be factorized as follows:
W:yararlanmak/R:yarar/P:NounInf-N:A3sg/
C:Nom/O:Pnon/G:NounA3sgPnonNom+Verb
+Acquire+Pos
Other factorizations are certainly possible;
however, our primary goal is not to  nd the
best possible encoding for our data but to
demonstrate the e ectiveness of the FLM
approach, which is largely independent of the
choice of factors. For our experiments we used
subsets of 400K words for training, 102K words
for development and 90K words for evaluation.
5 Experiments and Results
In our application of GAs to language model
structure search, the perplexity of models with
respect to the development data was used as
an optimization criterion. The perplexity of
the best models found by the GA were com-
pared to the best models identi ed by a lengthy
manual search procedure using linguistic knowl-
edge about dependencies between the word fac-
tors involved, and to a random search procedure
which evaluated the same number of strings as
the GA. The following GA options gave good
results: population size 30-50, crossover proba-
bility 0.9, mutation probability 0.01, Stochastic
Universal Sampling as the selection operator, 2-
point crossover. We also experimented with re-
initializing the GA search with the best model
found in previous runs. This method consis-
tently improved the performance of normal GA
search and we used it as the basis for the results
reported below. Due to the large number of fac-
N Word Hand Rand GA  (%)
Dev Set
2 593.8 555.0 556.4 539.2 -2.9
3 534.9 533.5 497.1 444.5 -10.6
4 534.8 549.7 566.5 522.2 -5.0
Eval Set
2 609.8 558.7 525.5 487.8 -7.2
3 545.4 583.5 509.8 452.7 -11.2
4 543.9 559.8 574.6 527.6 -5.8
Table 1: Perplexity for Turkish language models. N
= n-gram order, Word = word-based models, Hand
= manual search, Rand = random search, GA =
genetic search.
tors in the Turkish word representation, models
were only optimized for conditioning variables
and backo paths, but not for smoothing op-
tions. Table 1 compares the best perplexity re-
sults for standard word-based models and for
FLMs obtained using manual search (Hand),
random search (Rand), and GA search (GA).
The last column shows the relative change in
perplexity for the GA compared to the better
of the manual or random search models. For
tests on both the development set and evalu-
ation set, GA search gave the lowest perplex-
ity. In the case of Arabic, the GA search was
N Word Hand Rand GA  (%)
Dev Set
2 229.9 229.6 229.9 222.9 -2.9
3 229.3 226.1 230.3 212.6 -6.0
Eval Set
2 249.9 230.1 239.2 223.6 -2.8
3 285.4 217.1 224.3 206.2 -5.0
Table 2: Perplexity for Arabic language models
(w/o unknown words).
performed over conditioning factors, the back-
o graph, and smoothing options. The results
in Table 2 were obtained by training and test-
ing without consideration of out-of-vocabulary
(OOV) words. Our ultimate goal is to use these
language models in a speech recognizer with a
 xed vocabulary, which cannot recognize OOV
words but requires a low perplexity for other
N Word Hand Rand GA  (%)
Dev Set
2 236.0 195.5 198.5 193.3 -1.1
3 237.0 199.0 202.0 188.1 -5.5
Eval Set
2 235.2 234.1 247.7 233.4 -0.7
3 253.9 229.2 219.0 212.2 -3.1
Table 3: Perplexity for Arabic language models
(with unknown words).
word combinations. In a second experiment,
we trained the same FLMs from Table 2 with
OOV words included as the unknown word to-
ken. Table 3 shows the results. Again, we see
that the GA outperforms other search methods.
The best language models all used parallel back-
o and di erent smoothing options at di erent
backo graph nodes. The Arabic models made
use of all conditioning variables (Word, Stem,
Root, Pattern, and Morph) whereas the Turkish
models used only the W, P, C, and R variables
(see above Section 4).
6 Related Work
Various previous studies have investigated the
feasibility of using units other than words for
language modeling (e.g. (Geutner, 1995; C arki
et al., 2000; Kiecza et al., 1999)). However,
in all of these studies words were decomposed
into linear sequences of morphs or morph-like
units, using either linguistic knowledge or data-
driven techniques. Standard language models
were then trained on the decomposed represen-
tations. The resulting models essentially ex-
press statistical relationships between morphs,
such as stems and a xes. For this reason, a
context larger than that provided by a trigram
is typically required, which quickly leads to
data-sparsity. In contrast to these approaches,
factored language models encode morphological
knowledge not by altering the linear segmenta-
tion of words but by encoding words as parallel
bundles of features.
The general possibility of using multiple con-
ditioning variables (including variables other
than words) has also been investigated by
(Dupont and Rosenfeld, 1997; Gildea, 2001;
Wang, 2003; Zitouni et al., 2003). Mostly, the
additional variables were general word classes
derived by data-driven clustering procedures,
which were then arranged in a backo lattice
or graph similar to the present procedure. All
of these studies assume a  xed path through
the graph, which is usually obtained by an
ordering from more speci c probability distri-
butions to more general distributions. Some
schemes also allow two or more paths to be
combined by weighted interpolation. FLMs, by
contrast, allow di erent paths to be chosen at
run-time, they support a wider range of combi-
nation methods for probability estimates from
di erent paths, and they o er a choice of dif-
ferent discounting options at every node in the
backo graph. Most importantly, however, the
present study is to our knowledge the  rst to
describe an entirely data-driven procedure for
identifying the best combination of parameter
choices. The success of this method will facili-
tate the rapid development of FLMs for di er-
ent tasks in the future.
7 Conclusions
We have presented a data-driven approach to
the selection of parameters determining the
structure and performance of factored language
models, a class of models which generalizes
standard language models by including addi-
tional conditioning variables in a principled
way. In addition to reductions in perplexity ob-
tained by FLMs vs. standard language models,
the data-driven model section method further
improved perplexity and outperformed both
knowledge-based manual search and random
search.
Acknowledgments
We would like to thank Sonia Parandekar for the ini-
tial version of the GA code. This material is based
upon work supported by the NSF and the CIA un-
der NSF Grant No. IIS-0326276. Any opinions,  nd-
ings, and conclusions expressed in this material are
those of the authors and do not necessarily re ect
the views of these agencies.

References

Je A. Bilmes and Katrin Kirchho . 2003. Factored
language models and generalized parallel backo .
In Proceedings of HLT/NACCL, pages 4{6.

P.F. Brown et al. 1992. Class-based n-gram models
of natural language. Computational Linguistics,
18(4):467{479.

K. C arki, P. Geutner, and T. Schultz. 2000. Turkish
LVCSR: towards better speech recognition for ag-
glutinative languages. In Proceedings of ICASSP.

S. F. Chen and J. Goodman. 1998. An empirical
study of smoothing techniques for language mod-
eling. Technical Report Tr-10-98, Center for Re-
search in Computing Technology, Harvard Uni-
versity.

P. Dupont and R. Rosenfeld. 1997. Lattice based
language models. Technical Report CMU-CS-97-
173, Department of Computer Science, CMU.

P. Geutner. 1995. Using morphology towards better
large-vocabulary speech recognition systems. In
Proceedings of ICASSP, pages 445{448.

D. Gildea. 2001. Statistical Language Understand-
ing Using Frame Semantics. Ph.D. thesis, Uni-
versity of California, Berkeley.

D. Hakkani-T ur, K. O azer, and G okhan T ur. 2000.
Statistical morphological disambiguation for ag-
glutinative languages. In Proceedings of COLING.

D. Hakkani-T ur, K. O azer, and G okhan T ur. 2002.
Statistical morphological disambiguation for ag-
glutinative languages. Journal of Computers and
Humanities, 36(4).

J.H. Holland. 1975. Adaptation in Natural and Ar-
ti cial Systems. University of Michigan Press.
Gand Ji and Je Bilmes. 2004. Multi-speaker lan-
guage modeling. In Proceedings of HLT/NAACL,
pages 137{140.

D. Kiecza, T. Schultz, and A. Waibel. 1999. Data-
driven determination of appropriate dictionary
units for Korean LVCSR. In Proceedings of
ICASSP, pages 323{327.

K. Kirchho et al. 2002. Novel speech recognition
models for Arabic. Technical report, Johns Hop-
kins University.

K. Kirchho et al. 2003. Novel approaches to Ara-
bic speech recognition: Report from 2002 Johns-
Hopkins summer workshop. In Proceedings of
ICASSP, pages I{344{I{347.

Hiroaki Kitano. 1990. Designing neural networks
using genetic algorithms with graph generation
system. Complex Systems, pages 461{476.

LDC. 1996. http://www.ldc.upenn.edu/Catalog/-
LDC99L22.html.

K. O azer. 1999. Dependency parsing with an ex-
tended  nite state approach. In Proceedings of the
37th ACL.

W. Wang. 2003. Factorization of language
models through backing o lattices. Com-
putation and Language E-print Archive,
oai:arXiv.org/cs/0305041.

I. Zitouni, O. Siohan, and C.-H. Lee. 2003. Hierar-
chical class n-gram language models: towards bet-
ter estimation of unseen events in speech recogni-
tion. In Proceedings of Eurospeech - Interspeech,
pages 237{240.
