Multi-Class Composite N-gram Language Model for Spoken Language
Processing Using Multiple Word Clusters
Hirofumi Yamamoto
ATR SLT
2-2-2 Hikaridai Seika-cho
Soraku-gun, Kyoto-fu, Japan
yama@slt.atr.co.jp
Shuntaro Isogai
Waseda University
3-4-1 Okubo, Shinjuku-ku
Tokyo-to, Japan
isogai@shirai.info.waseda.ac.jp
Yoshinori Sagisaka
GITI / ATR SLT
1-3-10 Nishi-Waseda
Shinjuku-ku, Tokyo-to, Japan
sagisaka@slt.atr.co.jp
Abstract
In this paper, a new language model, the Multi-Class Composite N-gram, is proposed to avoid the data sparseness problem of spoken language, for which it is difficult to collect sufficient training data. The
Multi-Class Composite N-gram main-
tains an accurate word prediction ca-
pability and reliability for sparse data
with a compact model size based on
multiple word clusters, called Multi-
Classes. In a Multi-Class, the statistical connectivity at each position of the N-gram is regarded as a separate word attribute, and one word cluster is created to represent each positional attribute. Furthermore, by introducing higher order
word N-grams through the grouping of
frequent word successions, Multi-Class
N-grams are extended to Multi-Class
Composite N-grams. In experiments,
the Multi-Class Composite N-grams re-
sult in 9.5% lower perplexity and a 16%
lower word error rate in speech recogni-
tion with a 40% smaller parameter size
than conventional word 3-grams.
1 Introduction
Word N-grams have been widely used as a sta-
tistical language model for language processing.
Word N-grams are models that give the transition probability of the next word from the previous N-1 words, based on a statistical analysis of a huge text corpus. Though word N-grams
are more effective and flexible than rule-based
grammatical constraints in many cases, their per-
formance strongly depends on the size of training
data, since they are statistical models.
In word N-grams, the accuracy of word prediction increases with the order N, but the number of word transition combinations also increases exponentially. Moreover, the size of the training data required for reliable transition probability estimates increases dramatically. This is a critical problem for spoken language, for which it is difficult to collect enough training data for a reliable model. As a solution to this problem, class N-grams have been proposed.
In class N-grams, multiple words are mapped
to one word class, and the transition probabilities
from word to word are approximated to the proba-
bilities from word class to word class. The perfor-
mance and model size of class N-grams strongly
depend on the definition of word classes. In fact,
the performance of class N-grams based on part-of-speech (POS) word classes is usually considerably lower than that of word N-grams. Based on
this fact, effective word class definitions are re-
quired for high performance in class N-grams.
In this paper, the Multi-Class assignment is
proposed for effective word class definitions. The
word class is used to represent word connectiv-
ity, i.e. which words will appear in a neigh-
boring position with what probability. In Multi-
Class assignment, the word connectivity in each
position of the N-grams is regarded as a differ-
ent attribute, and multiple classes corresponding
to each attribute are assigned to each word. For
the word clustering of each Multi-Class, a method is used in which word classes are formed automatically and statistically from a corpus, without using a priori knowledge such as POS information. Furthermore, by introducing higher order
word N-grams through the grouping of frequent
word successions, Multi-Class N-grams are ex-
tended to Multi-Class Composite N-grams.
2 N-gram Language Models Based on
Multiple Word Classes
2.1 Class N-grams
Word N-grams are models that statistically give
the transition probability of the next word from
the previous N-1 word sequence. This transition probability is given by the next formula:

p(w_i | w_{i-N+1}, ..., w_{i-2}, w_{i-1})   (1)
In word N-grams, accurate word prediction can be
expected, since a word dependent, unique connec-
tivity from word to word can be represented. On
the other hand, the number of estimated param-
eters, i.e., the number of combinations of word
transitions, is V^N for vocabulary size V. As V^N increases exponentially with N, reliable estimation of each word transition probability is difficult for large N.
Class N-grams are proposed to resolve the
problem that a huge number of parameters is re-
quired in word N-grams. In class N-grams, the
transition probability of the next word from the
previous N-1 word sequence is given by the next formula:

p(c_i | c_{i-N+1}, ..., c_{i-2}, c_{i-1}) p(w_i | c_i)   (2)
where c_i represents the word class to which the word w_i belongs.
In class N-grams with C classes, the number of estimated parameters is decreased from V^N to C^N. However, the accuracy of word prediction will be lower than that of word N-grams trained on a sufficient amount of data, since the representation of the word-dependent, unique connectivity is lost in the approximation by word classes.
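As an illustration, the class 2-gram factorization of formula (2) can be sketched in Python; the toy corpus and word-to-class map are hypothetical, and smoothing is omitted for brevity.

```python
from collections import Counter

def train_class_bigram(corpus, word2class):
    """Unsmoothed class 2-gram, formula (2) with N = 2:
    p(w_i | w_{i-1}) ~= p(c_i | c_{i-1}) * p(w_i | c_i)."""
    class_bi, class_uni, word_uni = Counter(), Counter(), Counter()
    for sentence in corpus:
        classes = [word2class[w] for w in sentence]
        class_bi.update(zip(classes, classes[1:]))  # class-to-class transitions
        class_uni.update(classes)                   # class frequencies
        word_uni.update(sentence)                   # word frequencies
    def prob(w_prev, w):
        c_prev, c = word2class[w_prev], word2class[w]
        return (class_bi[(c_prev, c)] / class_uni[c_prev]) \
             * (word_uni[w] / class_uni[c])
    return prob

# Toy example: the two articles share one class, the two nouns another.
corpus = [["a", "cat"], ["an", "apple"]]
word2class = {"a": "ART", "an": "ART", "cat": "N", "apple": "N"}
prob = train_class_bigram(corpus, word2class)
```

Note the parameter saving: both article-noun transitions are pooled into a single class transition, so even the unseen-looking pair "a apple" receives the same class probability mass.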
2.2 Problems in the Definition of Word
Classes
In class N-grams, word classes are used to represent the connectivity between words. In the conventional word class definition, the connectivity to the following words and that to the preceding words are treated as the same neighboring characteristics, without distinction. Therefore, only words that share both the same following-word connectivity and the same preceding-word connectivity can belong to the same word class, and this class definition cannot represent the word connectivity attributes efficiently. Take "a" and "an" as an example. Both are classified by POS as indefinite articles and are assigned to the same word class. In this case, information about the difference in their following-word connectivity is lost. On the other hand, assigning different classes to the two words loses the information about the commonality in their preceding-word connectivity. This directional distinction is quite crucial for inflected languages such as French and Japanese.
2.3 Multi-Class and Multi-Class N-grams
As in the previous example of "a" and "an", following and preceding word connectivity are not always the same. Consider the case where the connectivity differs between the preceding and the following side. Multiple word classes are then assigned to each word to represent the following and preceding word connectivity separately. As the connectivity to the word preceding "a" and "an" is the same, it is efficient to assign both words to the same class to represent the preceding-word connectivity, while assigning them different classes to represent the following-word connectivity. Applying these word class definitions to formula (2) gives the next formula:
p(c^t_i | c^{f_{N-1}}_{i-N+1}, ..., c^{f_2}_{i-2}, c^{f_1}_{i-1}) p(w_i | c^t_i)   (3)

In the above formula, c^t_i represents the word class in the target position to which the word w_i belongs, and c^{f_N}_i represents the word class of the word in the N-th position of the conditional word sequence.
We call this multiple word class definition, a
Multi-Class. Similarly, we call class N-grams
based on the Multi-Class, Multi-Class N-grams
(Yamamoto and Sagisaka, 1999).
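To make the notation concrete, here is a minimal Python sketch of formula (3) with N = 2; the class names and probability tables below are hypothetical illustrations, not values from the paper.

```python
def multiclass_bigram_prob(w_prev, w, p_class, p_word, c_t, c_f):
    """Formula (3) with N = 2:
    p(w_i | w_{i-1}) ~= p(c_t(w_i) | c_f(w_{i-1})) * p(w_i | c_t(w_i))."""
    return p_class[(c_f[w_prev], c_t[w])] * p_word[(w, c_t[w])]

# "a" and "an" share a target class (same preceding-word connectivity)
# but get different conditional classes (different following-word
# connectivity), as argued in the text.
c_t = {"a": "ART", "an": "ART", "apple": "N", "book": "N"}
c_f = {"a": "ART_cons", "an": "ART_vow", "apple": "N", "book": "N"}
p_class = {("ART_vow", "N"): 0.9, ("ART_cons", "N"): 0.9}   # hypothetical
p_word = {("apple", "N"): 0.4, ("book", "N"): 0.6}          # hypothetical
```

The two dictionaries c_t and c_f are exactly the "multiple classes per word" of the Multi-Class definition.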
3 Automatic Extraction of Word
Clusters
3.1 Word Clustering for Multi-Class
2-grams
For word clustering in class N-grams, POS in-
formation is sometimes used. Though POS in-
formation can be used for words that do not ap-
pear in the corpus, this is not always an optimal word classification for N-grams, since POS information does not accurately represent the statistical word connectivity characteristics. Better word clustering should therefore be based on the word connectivity actually observed in the corpus, i.e., on the neighboring characteristics of each word. In this paper, vectors are
used to represent word neighboring characteris-
tics. The elements of the vectors are the smoothed forward or backward word 2-gram probabilities of the clustering target word, and word pairs with a small distance between their vectors are considered to have similar neighboring characteristics (Brown et al., 1992; Bai et al., 1998). In this method, the same vector is
assigned to words that do not appear in the cor-
pus, and the same word cluster will be assigned to
these words. To avoid excessively rough cluster-
ing over different POS, we cluster the words un-
der the condition that only words with the same
POS can belong to the same cluster. Parts-of-
speech that have the same connectivity in each
Multi-Class are merged. For example, if differ-
ent parts-of-speech are assigned to "a" and "an", these parts-of-speech are regarded as the same
for the preceding word cluster. Word clustering is
thus performed in the following manner.
1. Assign one unique class to each word.
2. Assign a vector to each class or to each word x. This represents the word connectivity attribute:

v_t(x) = [p_t(w_1|x), p_t(w_2|x), ..., p_t(w_N|x)]   (4)

v_f(x) = [p_f(w_1|x), p_f(w_2|x), ..., p_f(w_N|x)]   (5)

where v_t(x) represents the preceding word connectivity, v_f(x) represents the following word connectivity, p_t is the value of the probability of the succeeding class-word 2-gram or word 2-gram, and p_f is the same for the preceding one.
3. Merge two classes. We choose the pair of classes whose dispersion, weighted with the 1-gram probability, rises the least by the merge, and merge these two classes:

U_new = sum_w p(w) D(v(c_new(w)), v(w))   (6)

U_old = sum_w p(w) D(v(c_old(w)), v(w))   (7)

where we merge the classes whose merge cost U_new - U_old is the lowest. D(v_c, v_w) represents the square of the Euclidean distance between the vectors v_c and v_w, c_old represents the classes before merging, and c_new represents the classes after merging.
4. Repeat step 2 until the number of classes is
reduced to the desired number.
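The merge loop above can be sketched compactly in Python, assuming the connectivity vectors and 1-gram probabilities are already computed; the quadratic pair search is kept simple rather than efficient, and the POS constraint is omitted.

```python
def greedy_cluster(vectors, unigram, n_classes):
    """Bottom-up clustering: repeatedly merge the pair of classes whose
    1-gram-weighted dispersion (equations (6), (7)) rises the least."""
    clusters = [[w] for w in vectors]              # step 1: one class per word
    dim = len(next(iter(vectors.values())))

    def dispersion(cluster):
        # 1-gram-weighted centroid of the connectivity vectors
        total_p = sum(unigram[w] for w in cluster)
        centroid = [sum(unigram[w] * vectors[w][k] for w in cluster) / total_p
                    for k in range(dim)]
        # weighted squared Euclidean distance to the centroid
        return sum(unigram[w] * sum((vectors[w][k] - centroid[k]) ** 2
                                    for k in range(dim))
                   for w in cluster)

    while len(clusters) > n_classes:               # step 4: repeat until done
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                rise = (dispersion(clusters[i] + clusters[j])
                        - dispersion(clusters[i]) - dispersion(clusters[j]))
                if best is None or rise < best[0]:
                    best = (rise, i, j)            # lowest cost U_new - U_old
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 2-dimensional connectivity vectors for four words.
vectors = {"a": (0.0, 0.0), "b": (0.1, 0.0), "c": (5.0, 5.0), "d": (5.1, 5.0)}
unigram = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
clusters = greedy_cluster(vectors, unigram, 2)
```

With these vectors the loop merges "a" with "b" and "c" with "d", since those merges raise the weighted dispersion the least.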
3.2 Word Clustering for Multi-Class
3-grams
To apply the multiple clustering for 2-grams to 3-grams, the clustering target in the conditional part is extended from the single word of the 2-gram case to a word pair. The number of clustering targets for the preceding classes increases from V to V^2, and the length of the vector for the succeeding classes also increases to V^2. Therefore, efficient word clustering is needed to keep both the reliability of the 3-grams after clustering and a reasonable calculation cost.
To avoid losing reliability due to the data sparseness of the word pairs in the 3-gram history, an approximation using distance-2 2-grams is employed. The validity of this approximation is based on a report that the combination of word 2-grams and distance-2 2-grams by the maximum entropy method gives a good approximation of word 3-grams (Zhang et al., 1999). The vector for clustering is given by the next equation:
v_{f2}(x) = [p_{f2}(w_1|x), p_{f2}(w_2|x), ..., p_{f2}(w_N|x)]   (8)

where p_{f2}(y|x) represents the distance-2 2-gram value from word x to word y. The POS constraints for clustering are the same as in the clustering for preceding words.
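A distance-2 2-gram simply skips one position between the conditioning word and the predicted word; a minimal count-based sketch (unsmoothed, with hypothetical data):

```python
from collections import Counter

def distance2_bigram(sentences):
    """Unsmoothed distance-2 2-gram p_f2(y | x): the probability that word y
    appears exactly two positions after word x."""
    joint, context = Counter(), Counter()
    for s in sentences:
        for x, y in zip(s, s[2:]):   # word pairs separated by one position
            joint[(x, y)] += 1
            context[x] += 1
    return lambda y, x: joint[(x, y)] / context[x]

p_f2 = distance2_bigram([["i", "want", "to", "go"],
                         ["i", "need", "to", "eat"]])
```

Here p_f2("to", "i") is 1.0, since "to" follows "i" at distance 2 in both toy sentences.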
4 Multi-Class Composite N-grams
4.1 Multi-Class Composite 2-grams
Introducing Variable Length Word
Sequences
Consider the condition that only the word sequence (A,B,C) has sufficient frequency within the sequence (X,A,B,C,D). In this case, the value of the word 2-gram p(B|A) can be used as a reliable estimate for word B, as the frequency of the sequence (A,B) is sufficient. The value of the word 3-gram p(C|A,B) can be used for the estimation of word C for the same reason. For the estimation of words A and D, it is reasonable to use the values of class 2-grams, since the word N-gram values are unreliable (note that the frequencies of the sequences (X,A) and (C,D) are insufficient). Based on this idea, the transition probability of the word sequence (A,B,C,D) from word X is given by the next equation in the Multi-Class 2-gram:
P = p(c_t(A)|c_f(X)) p(A|c_t(A))
  x p(B|A)
  x p(C|A,B)
  x p(c_t(D)|c_f(C)) p(D|c_t(D))   (9)
When the word succession A+B+C is introduced as a variable-length word sequence for (A,B,C), equation (9) can be rewritten exactly as the next equation (Deligne and Bimbot, 1995; Masataki et al., 1996):
P = p(c_t(A)|c_f(X)) p(A+B+C|c_t(A))
  x p(c_t(D)|c_f(C)) p(D|c_t(D))   (10)
Here, we find the following properties. The preceding word connectivity of the word succession A+B+C is the same as that of word A, the first word of A+B+C, and the following connectivity is the same as that of the last word C. With these assignments, no new cluster is required, whereas conventional class N-grams require a new cluster for each new word succession.
c_t(A+B+C) = c_t(A)   (11)

c_f(A+B+C) = c_f(C)   (12)

Applying these relations to equation (10), the next equation is obtained:
P = p(c_t(A+B+C)|c_f(X))
  x p(A+B+C|c_t(A+B+C))
  x p(c_t(D)|c_f(A+B+C))
  x p(D|c_t(D))   (13)
Equation (13) means that if the frequency of an N-word sequence is sufficient, we can partially introduce higher order word N-grams using N-length word successions, while maintaining the reliability of the estimated probabilities and the formulation of the Multi-Class 2-grams. We call Multi-Class 2-grams created by partially introducing higher order word N-grams through word successions Multi-Class Composite 2-grams. In addition, equation (13) shows that the number of parameters does not increase much when frequent word successions are added to the word entries: only a 1-gram of the word succession A+B+C has to be added to the conventional N-gram parameters. Multi-Class Composite 2-grams are created in the following manner.
1. Create a Multi-Class 2-gram as the initial state.
2. Find a word pair whose frequency is above
the threshold.
3. Create a new word succession entry for the frequent word pair and add it to the lexicon. The following connectivity class of the word succession is the same as the following class of the last word in the pair, and its preceding class is the same as the preceding class of the first word in it (cf. equations (11) and (12)).
4. Replace the frequent word pair in the training data with the word succession, and recalculate the frequencies of the word and word succession pairs. In this way, the summation of the probabilities is always kept at 1.
5. Repeat step 2 with the newly added word
succession, until no more word pairs are
found.
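The grouping loop above resembles iterative pair merging; a Python sketch with a hypothetical corpus and threshold (this greedy version always joins the most frequent remaining pair, and notes the Multi-Class assignment of each new entry in a comment rather than maintaining class tables):

```python
from collections import Counter

def build_successions(sentences, threshold):
    """Sketch of steps 2-5: repeatedly join the most frequent adjacent word
    pair with frequency >= threshold into a word succession "A+B", then
    recount the pair frequencies over the rewritten corpus."""
    sents = [list(s) for s in sentences]
    while True:
        pairs = Counter()
        for s in sents:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        (w1, w2), freq = pairs.most_common(1)[0]
        if freq < threshold:
            break
        joined = w1 + "+" + w2
        for s in sents:                      # step 4: rewrite the corpus
            i = 0
            while i < len(s) - 1:
                if s[i] == w1 and s[i + 1] == w2:
                    s[i:i + 2] = [joined]
                i += 1
        # Multi-Class assignment of the new entry (equations (11), (12)):
        # c_t(joined) = c_t of its first word, c_f(joined) = c_f of its
        # last word, so no new cluster is needed.
    return sents

sents = build_successions([["i", "want", "to", "go"]] * 10
                          + [["to", "go"]] * 5, threshold=10)
```

With the toy corpus, "to+go" is formed first (frequency 15), then the remaining pairs chain into "i+want+to+go".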
4.2 Extension to Multi-Class Composite
3-grams
Next, we put the word successions into the formulation of Multi-Class 3-grams. The transition probability to the word sequence (A,B,C,D,E,F) from the word pair (X,Y) is given by the next equation:
P = p(c_t(A+B+C+D)|c_{f2}(X), c_{f1}(Y))
  x p(A+B+C+D|c_t(A+B+C+D))
  x p(c_t(E)|c_{f2}(Y), c_{f1}(A+B+C+D))
  x p(E|c_t(E))
  x p(c_t(F)|c_{f2}(A+B+C+D), c_{f1}(E))
  x p(F|c_t(F))   (14)
where the Multi-Classes for the word succession A+B+C+D are given by the next equations:

c_t(A+B+C+D) = c_t(A)   (15)

c_{f2}(A+B+C+D) = c_{f2}(D)   (16)

c_{f1}(A+B+C+D) = c_{f2}(C), c_{f1}(D)   (17)
In equation (17), notice that a class sequence (not a single class) is assigned as the preceding class of the word succession: the sequence consists of the preceding class of the last word of the word succession and the pre-preceding class of the second-to-last word. Applying these class assignments to equation (14) gives the next equation:
P = p(c_t(A)|c_{f2}(X), c_{f1}(Y))
  x p(A+B+C+D|c_t(A))
  x p(c_t(E)|c_{f2}(C), c_{f1}(D))
  x p(E|c_t(E))
  x p(c_t(F)|c_{f2}(D), c_{f1}(E))
  x p(F|c_t(F))   (18)
In the above formulation, the only parameter added to the Multi-Class 3-gram is p(A+B+C+D|c_t(A)). Expanding this term gives the next equation:
P = p(c_t(A)|c_{f2}(X), c_{f1}(Y))
  x p(A|c_t(A))
  x p(B|A)
  x p(C|A,B)
  x p(D|A,B,C)
  x p(c_t(E)|c_{f2}(C), c_{f1}(D))
  x p(E|c_t(E))
  x p(c_t(F)|c_{f2}(D), c_{f1}(E))
  x p(F|c_t(F))   (19)
In equation (19), all words except B are estimated by models that are the same as or more accurate than Multi-Class 3-grams (Multi-Class 3-grams for words A, E and F, and a word 3-gram and a word 4-gram for words C and D). For word B, however, a word 2-gram is used instead of the Multi-Class 3-gram, though its accuracy is lower. To prevent this decrease in estimation accuracy, the next process is introduced.
First, the 3-gram entry p(c_t(E)|c_{f2}(Y), A+B+C+D) is removed. After this deletion, back-off smoothing is applied to this entry as follows:
p(c_t(E)|c_{f2}(Y), c_{f1}(A+B+C+D))
  = b(c_{f2}(Y), c_{f1}(A+B+C+D))
  x p(c_t(E)|c_{f1}(A+B+C+D))   (20)
Next, we assign the following value to the back-off parameter in equation (20). This value is used to correct the decrease in the accuracy of the estimation of word B:
b(c_{f2}(Y), c_{f1}(A+B+C+D))
  = p(c_t(B)|c_{f2}(Y), c_{f1}(A))
  x p(B|c_t(B)) / p(B|A)   (21)
After this assignment, the probabilities of words B and E are locally incorrect. However, the total probability is correct, since the back-off parameter corrects the decrease in the accuracy of the estimation of word B. In fact, applying equations (20) and (21) to equation (14) according to the above definitions gives the next equation, in which the probability of word B is changed from a word 2-gram to a class 3-gram:
P = p(c_t(A)|c_{f2}(X), c_{f1}(Y))
  x p(A|c_t(A))
  x p(c_t(B)|c_{f2}(Y), c_{f1}(A))
  x p(B|c_t(B))
  x p(C|A,B)
  x p(D|A,B,C)
  x p(c_t(E)|c_{f2}(C), c_{f1}(D))
  x p(E|c_t(E))
  x p(c_t(F)|c_{f2}(D), c_{f1}(E))
  x p(F|c_t(F))   (22)
In the above process, only two kinds of parameters are added. One is the word 1-grams of the word successions, such as p(A+B+C+D); the other is the word 2-grams of the first two words of the word successions. The number of combinations of the first two words is at most the number of word successions. Therefore, the number of additional parameters in the Multi-Class Composite 3-gram is at most twice the number of introduced word successions.
5 Evaluation Experiments
5.1 Evaluation of Multi-Class N-grams
We have evaluated Multi-Class N-grams in perplexity with the next equations:

Entropy = -(1/N) sum_i log_2 p(w_i)   (23)

Perplexity = 2^Entropy   (24)
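Equations (23) and (24) can be computed directly from the per-word probabilities a model assigns to the test set; a small sketch:

```python
import math

def perplexity(word_probs):
    """Equations (23)-(24): Entropy = -(1/N) * sum_i log2 p(w_i),
    Perplexity = 2 ** Entropy."""
    n = len(word_probs)
    entropy = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** entropy
```

For instance, a model that assigns a uniform probability of 0.25 to every test word has a perplexity of 4.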
The Good-Turing discount is used for smooth-
ing. The perplexity is compared with those of
word 2-grams and word 3-grams. The evaluation
data set is the ATR Spoken Language Database
(Takezawa et al., 1998). The total number of
words in the training set is 1,387,300, the vocab-
ulary size is 16,531, and 5,880 words in 42 con-
versations which are not included in the training
set are used for the evaluation.
Figure 1 shows the perplexity of Multi-Class 2-
grams for each number of classes. In the Multi-
Class, the numbers of following and preceding
classes are fixed to the same value just for com-
parison. As shown in the figure, the Multi-Class
2-gram with 1,200 classes gives the lowest per-
plexity of 22.70, and it is smaller than the 23.93
in the conventional word 2-gram.
Figure 2 shows the perplexity of Multi-Class
3-grams for each number of classes. The num-
ber of following and preceding classes is 1,200
(which gives the lowest perplexity in Multi-Class
2-grams). The number of pre-preceding classes is
Table 1: Evaluation of Multi-Class Composite N-grams in Perplexity

Kind of model                  Perplexity  Number of parameters
Word 2-gram                    23.93       181,555
Multi-Class 2-gram             22.70       81,556
Multi-Class Composite 2-gram   19.81       92,761
Word 3-gram                    17.88       713,154
Multi-Class 3-gram             17.38       438,130
Multi-Class Composite 3-gram   16.20       455,431
Word 4-gram                    17.45       1,703,207
changed from 100 to 1,500. As shown in this fig-
ure, Multi-Class 3-grams result in lower perplex-
ity than the conventional word 3-gram, indicating
the validity of word clustering based on the distance-2 2-grams.
5.2 Evaluation of Multi-Class Composite
N-grams
We have also evaluated Multi-Class Composite
N-grams in perplexity under the same conditions
as the Multi-Class N-grams stated in the previ-
ous section. The Multi-Class 2-gram is used for
the initial condition of the Multi-Class Compos-
ite 2-gram. The threshold of frequency for in-
troducing word successions is set to 10 based on
a preliminary experiment. The same word suc-
cession set as that of the Multi-Class Composite
2-gram is used for the Multi-Class Composite 3-
gram. The evaluation results are shown in Table
1. Table 1 shows that the Multi-Class Compos-
ite 3-gram results in 9.5% lower perplexity with a
40% smaller parameter size than the conventional
word 3-gram, and that it is in fact a compact and
high-performance model.
5.3 Evaluation in Continuous Speech
Recognition
Though perplexity is a good measure for the per-
formance of language models, it does not al-
ways have a direct bearing on performance in lan-
guage processing. We have evaluated the pro-
posed model in continuous speech recognition.
The experimental conditions are as follows:
- Evaluation set
  - The same 42 conversations as used in the evaluation of perplexity

[Figure 1: Perplexity of Multi-Class 2-grams (perplexity vs. number of classes, compared with the word 2-gram)]

[Figure 2: Perplexity of Multi-Class 3-grams (perplexity vs. number of classes, compared with the word 3-gram)]
- Acoustic features
  - Sampling rate: 16 kHz
  - Frame shift: 10 msec
  - Mel-cepstrum: 12 + power and their deltas, 26 dimensions in total
- Acoustic models
  - 800-state 5-mixture HMnet model based on ML-SSS (Ostendorf and Singer, 1997)
  - Automatic selection of gender-dependent models
- Decoder (Shimizu et al., 1996)
  - 1st pass: frame-synchronized Viterbi search
  - 2nd pass: full search after changing the language model and LM scale
The Multi-Class Composite 2-gram and 3-gram are compared with the word 2-gram, Multi-Class 2-gram, word 3-gram and Multi-Class 3-gram. The number of classes is 1,200 for all class-based models. For the evaluation of each 2-gram, the 2-gram is used in both the 1st and the 2nd pass of the decoder. For each 3-gram, the corresponding 2-gram is replaced by the 3-gram in the 2nd pass. The evaluation measures are conventional word accuracy and %correct, calculated as follows:
WordAccuracy = (W - D - I - S) / W x 100 (%)

%Correct = (W - D - S) / W x 100 (%)

(W: total number of words, D: deletion errors, I: insertion errors, S: substitution errors)
Table 2: Evaluation of Multi-Class Composite N-grams in Continuous Speech Recognition

Kind of model                  Word Acc.  %Correct
Word 2-gram                    84.15      88.42
Multi-Class 2-gram             85.45      88.80
Multi-Class Composite 2-gram   88.00      90.84
Word 3-gram                    86.07      89.76
Multi-Class 3-gram             87.11      90.50
Multi-Class Composite 3-gram   88.30      91.48
Table 2 shows the evaluation results. As in the
perplexity results, the Multi-Class Composite 3-
gram shows the highest performance of all mod-
els, and its error reduction from the conventional
word 3-gram is 16%.
6 Conclusion
This paper proposes an effective word clustering
method called Multi-Class. In the Multi-Class
method, multiple classes are assigned to each
word by clustering the following and preceding
word characteristics separately. This word clus-
tering is performed based on the word connec-
tivity in the corpus. Therefore, Multi-Class N-grams can improve reliability with a compact model size without losing accuracy.
Furthermore, Multi-Class N-grams are ex-
tended to Multi-Class Composite N-grams. In
the Multi-Class Composite N-grams, higher or-
der word N-grams are introduced through the
grouping of frequent word successions. Therefore, they add the accuracy of higher order word N-grams to the reliability of the Multi-Class N-grams. Moreover, the number of parameters added by the introduction of word successions is at most twice the number of word successions, so Multi-Class Composite 3-grams maintain the compact model size of the Multi-Class N-grams. Nevertheless, Multi-Class Composite 3-grams are represented in the usual formulation of 3-grams. This formulation is easily handled by language processors, especially those with huge calculation costs such as speech recognizers.
In experiments, the Multi-Class Composite 3-
gram resulted in 9.5% lower perplexity and 16%
lower word error rate in continuous speech recog-
nition with a 40% smaller model size than the
conventional word 3-gram. This confirms that Multi-Class Composite 3-grams achieve high performance with a small model size.
Acknowledgments
We would like to thank Michael Paul and Rainer
Gruhn for their assistance in writing some of the
explanations in this paper.

References
Shuanghu Bai, Haizhou Li, and Baosheng Yuan.
1998. Building class-based language models with
contextual statistics. In Proc. ICASSP, pages 173-176.

P.F. Brown, V.J.D. Pietra, P.V. de Souza, J.C. Lai, and
R.L. Mercer. 1992. Class-based n-gram models
of natural language. Computational Linguistics,
18(4):467-479.

Sabine Deligne and Frederic Bimbot. 1995. Language
modeling by variable length sequences. Proc.
ICASSP, pages 169-172.

Hirokazu Masataki, Shoichi Matsunaga, and Yoshinori Sagisaka. 1996. Variable-order n-gram generation by word-class splitting and consecutive word grouping. Proc. ICASSP, pages 188-191.

M. Ostendorf and H. Singer. 1997. HMM topol-
ogy design using maximum likelihood successive
state splitting. Computer Speech and Language,
11(1):17-41.

Tohru Shimizu, Hirofumi Yamamoto, Hirokazu Masataki, Shoichi Matsunaga, and Yoshinori Sagisaka. 1996. Spontaneous dialogue speech recognition using cross-word context constrained word graphs. Proc. ICASSP, pages 145-148.

Toshiyuki Takezawa, Tsuyoshi Morimoto, and Yoshinori Sagisaka. 1998. Speech and language
databases for speech translation research in ATR.
In Proc. of the 1st International Workshop on East-Asian Language Resource and Evaluation, pages 148-155.

Hirofumi Yamamoto and Yoshinori Sagisaka. 1999.
Multi-class composite n-gram based on connection
direction. Proc. ICASSP, pages 533-536.

S. Zhang, H. Singer, D. Wu, and Y. Sagisaka. 1999.
Improving n-gram modeling using distance-related
unit association maximum entropy language modeling. In Proc. EuroSpeech, pages 1611-1614.
