Proceedings of the 43rd Annual Meeting of the ACL, pages 515–522,
Ann Arbor, June 2005. ©2005 Association for Computational Linguistics
A Phonotactic Language Model for Spoken Language Identification 
 Haizhou Li and Bin Ma 
Institute for Infocomm Research 
Singapore 119613 
{hli,mabin}@i2r.a-star.edu.sg 
 
Abstract 
We have established a phonotactic lan-
guage model as the solution to spoken 
language identification (LID). In this 
framework, we define a single set of 
acoustic tokens to represent the acoustic 
activities in the world’s spoken languages. 
A voice tokenizer converts a spoken 
document into a text-like document of 
acoustic tokens. Thus a spoken document 
can be represented by a count vector of 
acoustic tokens and token n-grams in the 
vector space. We apply latent semantic 
analysis to the vectors, in the same way 
that it is applied in information retrieval, 
in order to capture salient phonotactics 
present in spoken documents. The vector 
space modeling of spoken utterances con-
stitutes a paradigm shift in LID technol-
ogy and has proven to be very successful. 
It presents a 12.4% error rate reduction 
over one of the best reported results on 
the 1996 NIST Language Recognition 
Evaluation database. 
1 Introduction 
Spoken language and written language are similar 
in many ways. Therefore, much of the research in 
spoken language identification, LID, has been in-
spired by text-categorization methodology. Both 
text and voice are generated from language de-
pendent vocabulary. For example, both can be seen 
as stochastic time-sequences corrupted by a chan-
nel noise. The n-gram language model has 
achieved equal amounts of success in both tasks, 
e.g. n-character slice for text categorization by lan-
guage (Cavnar and Trenkle, 1994) and Phone Rec-
ognition followed by n-gram Language Modeling, 
or PRLM (Zissman, 1996).
Orthographic forms of language, ranging from 
Latin alphabet to Cyrillic script to Chinese charac-
ters, are far more unique to the language than their 
phonetic counterparts. From the speech production 
point of view, thousands of spoken languages from 
all over the world are phonetically articulated us-
ing only a few hundred distinctive sounds or pho-
nemes (Hieronymus, 1994). In other words, 
common sounds are shared considerably across 
different spoken languages. In addition, spoken 
documents¹, in the form of digitized wave files, are
far less structured than written documents and need 
to be treated with techniques that go beyond the 
bounds of written language. All of this makes the 
identification of spoken language based on pho-
netic units much more challenging than the identi-
fication of written language. In fact, the challenge 
of LID is inter-disciplinary, involving digital signal 
processing, speech recognition and natural lan-
guage processing.  
A LID system usually has three fundamental
components:
1) A voice tokenizer which segments incoming 
voice feature frames and associates the seg-
ments with acoustic or phonetic labels, called 
tokens; 
2) A statistical language model which captures 
language dependent phonetic and phonotactic 
information from the sequences of tokens; 
3) A language classifier which identifies the lan-
guage based on discriminatory characteristics 
of acoustic score from the voice tokenizer and 
phonotactic score from the language model.  
In this paper, we present a novel solution to the
three problems, focusing on the second and third
problems from a computational linguistic perspective.
The paper is organized as follows: In Section 2, we
summarize relevant existing approaches to the LID
task, highlighting their shortcomings and our
attempts to address them. In Section 3, we propose
the bag-of-sounds paradigm to turn the LID task into
a typical text categorization problem. In Section 4,
we study the effects of different settings in
experiments on the 1996 NIST Language Recognition
Evaluation (LRE) database². In Section 5, we conclude
our study and discuss future work.

¹ A spoken utterance is regarded as a spoken document in this paper.
2 Related Work 
Formal evaluations conducted by the National Institute
of Standards and Technology (NIST) in recent
years demonstrated that the most successful ap-
proach to LID used the phonotactic content of the 
voice signal to discriminate between a set of lan-
guages (Singer et al., 2003). We briefly discuss 
previous work cast in the formalism mentioned 
above: tokenization, statistical language modeling, 
and language identification. A typical LID system 
is illustrated in Figure 1 (Zissman, 1996), where 
language dependent voice tokenizers (VT) and lan-
guage models (LM) are deployed in the Parallel 
PRLM architecture, or P-PRLM. 
 
 
Figure 1.  L monolingual phoneme recognition 
front-ends are used in parallel to tokenize the input 
utterance, which is analyzed by LMs to predict the 
spoken language 
2.1 Voice Tokenization 
A voice tokenizer is a speech recognizer that
converts a spoken document into a sequence of
tokens. As illustrated in Figure 2, a token can be of
different sizes, ranging from a speech feature
frame, to a phoneme, to a lexical word. A token is
defined to describe a distinct acoustic/phonetic
activity. In early research, low level spectral
frames, which are assumed to be independent of
each other, were used as a set of prototypical
spectra for each language (Sugiyama, 1991). By
adopting hidden Markov models, people moved beyond
low-level spectral analysis towards modeling a
frame sequence into a larger unit such as a
phoneme and even a lexical word.

² http://www.nist.gov/speech/tests/
Since the lexical word is language specific, the 
phoneme becomes the natural choice when build-
ing a language-independent voice tokenization 
front-end. Previous studies show that parallel lan-
guage-dependent phoneme tokenizers effectively 
serve as the tokenization front-ends with P-PRLM 
being the typical example. However, a
language-independent phoneme set has not yet been
explored experimentally. In this paper, we explore
the potential of voice tokenization using a
unified phoneme set.
 
 
Figure 2.  Tokenization at different resolutions
2.2 n-gram Language Model 
With the sequence of tokens, we are able to es-
timate an n-gram language model (LM) from the 
statistics. It is generally agreed that phonotactics, 
i.e. the rules governing the phone/phoneme
sequences admissible in a language, carry more
language discriminative information than the
phonemes themselves. An n-gram LM over the
tokens describes the n-local phonotactics among
neighboring tokens well. While some systems model
the phonotactics at the frame level (Torres-
Carrasquillo et al., 2002), others have proposed P-
PRLM. The latter has become one of the most 
promising solutions so far (Zissman, 1996).  
  A variety of cues can be used by humans and
machines to distinguish one language from another.
These cues include phonology, prosody, morphology,
and syntax in the context of an utterance.
However, global phonotactic cues at the level of
utterance or spoken document remain unexplored
in previous work. In this paper, we pay special
attention to them. A spoken language always contains a
set of high frequency function words, prefixes, and
suffixes, which are realized as phonetic token
substrings in the spoken document. Individually, those
substrings may be shared across languages. However,
the pattern of their co-occurrences discriminates
one language from another.

[Figure 1 block labels: spoken utterance; VT-1: Chinese, VT-2: English, ..., VT-L: French, each followed by language models LM-1 … LM-L; language classifier; hypothesized language. Figure 2 levels: word, phoneme, frame.]
Perceptual experiments have shown (Muthusamy et al., 1994)
that with adequate training, human
listeners’ language identification ability increases 
when given longer excerpts of speech.  Experi-
ments have also shown that increased exposure to 
each language and longer training sessions im-
prove listeners’ language identification perform-
ance. Although it is not entirely clear how human 
listeners make use of the high-order phonotac-
tic/prosodic cues present in longer spans of a spo-
ken document, strong evidence shows that 
phonotactics over larger context provides valuable 
LID cues beyond n-gram, which will be further 
attested by our experiments in Section 4. 
2.3 Language Classifier 
The task of a language classifier is to make
good use of the LID cues that are encoded in the
model $\lambda_l$ to hypothesize $\hat{l}$ from among $L$
languages, $\Lambda$, as the one that is actually spoken in a
spoken document $O$. The LID model $\lambda_l$ in P-PRLM
refers to extracted information from acoustic
model and n-gram LM for language $l$. We have
$\lambda_l = \{\lambda_l^{AM}, \lambda_l^{LM}\}$ and $\lambda_l \in \Lambda$ $(l = 1, \ldots, L)$.
A maximum-likelihood classifier can be formulated as
follows:

$$\hat{l} = \arg\max_{l \in \Lambda} P(O \mid \lambda_l) \approx \arg\max_{l \in \Lambda} \sum_{T \in \Gamma} P(O \mid T, \lambda_l^{AM}) \, P(T \mid \lambda_l^{LM}) \qquad (1)$$

The exact computation in Eq.(1) involves summing
over all possible decodings of token sequences
$T \in \Gamma$ given $O$. In many implementations,
it is approximated by the maximum over all
sequences in the sum by finding the most likely
token sequence, $\hat{T}_l$, for each language $l$, using the
Viterbi algorithm:

$$\hat{l} \approx \arg\max_{l \in \Lambda} \left[ P(O \mid \hat{T}_l, \lambda_l^{AM}) \, P(\hat{T}_l \mid \lambda_l^{LM}) \right] \qquad (2)$$
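As an illustration of the language model term in Eq.(2), the following sketch scores one already-decoded token sequence against per-language bigram LMs and returns the best-scoring language. It is a simplified sketch under stated assumptions: the acoustic score is omitted, the add-one smoothing is our choice, and the names `train_bigram_lm`, `log_score` and `classify` are hypothetical helpers, not part of any cited system.

```python
import math
from collections import Counter

def train_bigram_lm(sequences, vocab):
    """Estimate add-one-smoothed bigram probabilities (smoothing is an assumption)."""
    bigrams, history = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq)          # "<s>" marks the sequence start
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            history[prev] += 1
    size = len(vocab)

    def prob(prev, cur):
        return (bigrams[(prev, cur)] + 1) / (history[prev] + size)

    return prob

def log_score(prob, seq):
    """Log of P(T | LM) for a token sequence T under a bigram model."""
    padded = ["<s>"] + list(seq)
    return sum(math.log(prob(p, c)) for p, c in zip(padded, padded[1:]))

def classify(models, seq):
    """Return the language whose LM gives the highest score (LM term of Eq.(2))."""
    return max(models, key=lambda lang: log_score(models[lang], seq))

# Toy data: three acoustic tokens and two hypothetical languages.
vocab = {"a", "b", "c"}
models = {
    "lang1": train_bigram_lm([["a", "b", "a", "b"], ["a", "b", "c"]], vocab),
    "lang2": train_bigram_lm([["c", "c", "a"], ["b", "c", "c"]], vocab),
}
hypothesis = classify(models, ["a", "b", "a"])
```

Working in the log domain avoids numerical underflow on long token sequences; the ranking of languages is unchanged.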
Intuitively, individual sounds are heavily shared
among different spoken languages due to the common
speech production mechanism of humans.
Thus, the acoustic score has little language
discriminative ability. Many experiments (Yan and
Barnard, 1995; Zissman, 1996) have further attested
that the n-gram LM score provides more
language discriminative information than its
acoustic counterpart. In Figure 1, the decoding of
voice tokenization is governed by the acoustic
model $\lambda_l^{AM}$ to arrive at an acoustic score
$P(O \mid \hat{T}_l, \lambda_l^{AM})$ and a token sequence $\hat{T}_l$. The
n-gram LM derives the n-local phonotactic score
$P(\hat{T}_l \mid \lambda_l^{LM})$ from the language model $\lambda_l^{LM}$.
Clearly, the n-gram LM suffers the major shortcoming
of not exploiting the global phonotactics
in the larger context of a spoken utterance.
Speech recognition researchers have so far chosen
to use only n-gram local statistics for primarily
pragmatic reasons, as these statistics are easier to obtain.
In this work, a language independent voice
tokenization front-end is proposed that uses a unified
acoustic model $\lambda^{AM}$ instead of multiple language
dependent acoustic models $\lambda_l^{AM}$. The n-gram
LM $\lambda_l^{LM}$ is generalized to model both local and
global phonotactics.
3 Bag-of-Sounds Paradigm 
The bag-of-sounds concept is analogous to the 
bag-of-words paradigm originally formulated in 
the context of information retrieval (IR) and text 
categorization (TC) (Salton, 1971; Berry et al.,
1995; Chu-Carroll and Carpenter, 1999). One focus
of IR is to extract informative features for docu-
ment representation. The bag-of-words paradigm 
represents a document as a vector of counts. It is 
believed that it is not just the words, but also the 
co-occurrence of words that distinguish semantic 
domains of text documents.   
Similarly, it is generally believed in LID that, al-
though the sounds of different spoken languages 
overlap considerably, the phonotactics differenti-
ates one language from another. Therefore, one can 
easily draw the analogy between an acoustic token 
in bag-of-sounds and a word in bag-of-words. 
Unlike words in a text document, the phonotactic 
information that distinguishes spoken languages is 
concealed in the sound waves of spoken languages. 
After transcribing a spoken document into a text-like
document of tokens, many IR or TC tech-
niques can then be readily applied. 
It is beyond the scope of this paper to discuss 
what would be a good voice tokenizer. We adopt 
phoneme size language-independent acoustic to-
kens to form a unified acoustic vocabulary in our 
voice tokenizer. Readers are referred to (Ma et al., 
2005) for details of acoustic modeling. 
3.1 Vector Space Modeling 
In human languages, some words invariably occur 
more frequently than others. One of the most 
common ways of expressing this idea is known as 
Zipf’s Law (Zipf, 1949). This law states that there 
is always a set of words which dominates most of 
the other words of the language in terms of their 
frequency of use. This is true both of written words 
and of spoken words. The short-term, or local pho-
notactics, is devised to describe Zipf’s Law.  
The local phonotactic constraints can be typi-
cally described by the token n-grams, or phoneme 
n-grams as in (Ng et al., 2000), which represents 
short-term statistics such as lexical constraints. 
Suppose that we have a token sequence, t1 t2 t3 t4. 
We derive the unigram statistics from the token 
sequence itself. We derive the bigram statistics 
from t1(t2) t2(t3) t3(t4) t4(#) where the token vo-
cabulary is expanded over the token’s right context. 
Similarly, we derive the trigram statistics from the 
t1(#,t2) t2(t1,t3) t3(t2,t4) t4(t3,#) to account for left 
and right contexts. The # sign is a place holder for 
free context. In the interest of manageability, we
propose to use up to token trigrams. In this way, for
an acoustic system of $Y$ tokens, we have potentially
$Y^2$ bigrams and $Y^3$ trigrams in the vocabulary.
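The context-expanded counting described above can be sketched as follows. This is a minimal illustration: the function names and the tuple-based n-gram keys are our own assumptions, and `#` is used as the free-context placeholder as in the text.

```python
from collections import Counter

def bag_of_sounds_counts(tokens, max_n=3, pad="#"):
    """Count unigrams, right-context bigrams and left/right-context trigrams."""
    counts = Counter()
    last = len(tokens) - 1
    for i, tok in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else pad
        right = tokens[i + 1] if i < last else pad
        counts[(tok,)] += 1                    # unigram t_i
        if max_n >= 2:
            counts[(tok, right)] += 1          # bigram t_i(t_{i+1})
        if max_n >= 3:
            counts[(left, tok, right)] += 1    # trigram t_i(t_{i-1}, t_{i+1})
    return counts

def vocabulary_size(num_tokens, max_n=3):
    """Potential vocabulary size Y + Y^2 + ... + Y^max_n (ignoring '#' contexts)."""
    return sum(num_tokens ** n for n in range(1, max_n + 1))

counts = bag_of_sounds_counts(["t1", "t2", "t3", "t4"])
```

For the sequence t1 t2 t3 t4 this yields four unigram, four bigram and four trigram entries. With Y = 32 tokens, `vocabulary_size` gives 32, 1,056 and 33,824 for `max_n` of 1, 2 and 3 respectively.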
Meanwhile, motivated by the ideas of having 
both short-term and long-term phonotactic statis-
tics, we propose to derive global phonotactics in-
formation to account for long-term phonotactics: 
The global phonotactic constraint is the high-
order statistics of n-grams. It represents document 
level long-term phonotactics such as co-
occurrences of n-grams. By representing a spoken 
document as a count vector of n-grams, also called 
bag-of-sounds vector, it is possible to explore the 
relations and higher-order statistics among the di-
verse n-grams through latent semantic analysis 
(LSA).  
It is often advantageous to weight the raw 
counts to refine the contribution of each n-gram to 
LID. We begin by normalizing the vectors repre-
senting the spoken document by making each vec-
tor of unit length. Our second weighting is based 
on the notion that an n-gram that only occurs in a 
few languages is more discriminative than an n-
gram that occurs in nearly every document. We use 
the inverse-document frequency (idf) weighting
scheme (Spark Jones, 1972), in which a word is
weighted inversely to the number of documents in
which it occurs, by means of
$idf(w) = \log(D / d(w))$, where $w$ is a word in the
vocabulary of $W$ token n-grams. $D$ is the total number
of documents in the training corpus from $L$ languages.
Since each language has at least one
document in the training corpus, we have $D \geq L$.
$d(w)$ is the number of documents containing the
word $w$. Letting $c_{w,d}$ be the count of word $w$ in
document $d$, we have the weighted count as

$$c'_{w,d} = c_{w,d} \times idf(w) \Big/ \Big( \sum_{1 \leq w' \leq W} c_{w',d}^2 \Big)^{1/2} \qquad (3)$$

and a vector $c_d = \{c'_{1,d}, c'_{2,d}, \ldots, c'_{W,d}\}^T$ to represent
document $d$. A corpus is then represented by a
term-document matrix $H = \{c_1, c_2, \ldots, c_D\}$ of $W \times D$.
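A minimal NumPy sketch of the weighting in Eq.(3) is given below. The function name is hypothetical, and two guards are our own assumptions rather than part of the paper: n-grams that never occur in training receive an idf of 0, and empty documents are left unscaled.

```python
import numpy as np

def weight_term_document_matrix(counts):
    """Apply Eq.(3): unit-length normalization times idf, column by column.

    counts: W x D array, counts[w, d] = count of n-gram w in document d.
    """
    counts = np.asarray(counts, dtype=float)
    num_docs = counts.shape[1]                        # D
    doc_freq = np.count_nonzero(counts, axis=1)       # d(w)
    # Assumption: n-grams never seen in training get idf 0 instead of log(D/0).
    idf = np.zeros(counts.shape[0])
    seen = doc_freq > 0
    idf[seen] = np.log(num_docs / doc_freq[seen])
    norms = np.sqrt((counts ** 2).sum(axis=0))        # per-document L2 norm
    norms[norms == 0] = 1.0                           # guard for empty documents
    return counts * idf[:, None] / norms[None, :]

# Toy matrix: W = 3 n-grams (rows), D = 3 documents (columns).
raw = np.array([[2, 0, 1],
                [1, 1, 1],
                [0, 3, 0]])
H = weight_term_document_matrix(raw)
```

Note that an n-gram occurring in every document (the second row here) receives zero weight, which reflects the intent of idf: such an n-gram carries no discriminative information.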
3.2 Latent Semantic Analysis 
The fundamental idea in LSA is to reduce the
dimension of a document vector, $W$ to $Q$, where
$Q \ll W$ and $Q \ll D$, by projecting the problem
into the space spanned by the rows of the closest
rank-$Q$ matrix to $H$ in the Frobenius norm
(Deerwester et al., 1990). Through singular value
decomposition (SVD) of $H$, we construct a modified
matrix $H_Q$ from the $Q$-largest singular values:

$$H_Q = U_Q S_Q V_Q^T \qquad (4)$$

$U_Q$ is a $W \times Q$ left singular matrix with rows
$u_w$, $1 \leq w \leq W$; $S_Q$ is a $Q \times Q$ diagonal matrix of
$Q$-largest singular values of $H$; $V_Q$ is a $D \times Q$ right
singular matrix with rows $v_d$, $1 \leq d \leq D$.
With the SVD, we project the $D$ document vectors
in $H$ into a reduced space $V_Q$, referred to as
Q-space in the rest of this paper. A test document
$c_p$ of unknown language ID is mapped to a
pseudo-document $v_p$ in the Q-space by matrix $U_Q$:

$$c_p \rightarrow v_p = c_p^T U_Q S_Q^{-1} \qquad (5)$$

After SVD, it is straightforward to arrive at a
natural metric for the closeness between two
spoken documents $v_i$ and $v_j$ in Q-space instead of
their original W-dimensional space $c_i$ and $c_j$:

$$g(c_i, c_j) \approx \cos(v_i, v_j) = \frac{v_i \cdot v_j^T}{\|v_i\| \cdot \|v_j\|} \qquad (6)$$

$g(c_i, c_j)$ indicates the similarity between two
vectors, which can be transformed to a distance
measure $k(c_i, c_j) = \cos^{-1} g(c_i, c_j)$.
In the forced-choice classification, a test docu-
ment, supposedly monolingual, is classified into 
one of the L languages. Note that the test document 
is unknown to the H matrix. We assume consis-
tency between the test document’s intrinsic phono-
tactic pattern and one of the D patterns, that is 
extracted from the training data and is presented in 
the H matrix, so that the SVD matrices still apply 
to the test document, and Eq.(5) still holds for di-
mension reduction. 
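The decomposition, the fold-in projection of Eq.(5) and the similarity and distance measures built on Eq.(6) can be sketched with NumPy as follows. The function names are hypothetical, the random matrix merely stands in for a real term-document matrix, and the clipping before `arccos` is our own numerical safeguard.

```python
import numpy as np

def lsa_fit(H, q):
    """Truncated SVD of the W x D matrix H, keeping the q largest singular values."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return U[:, :q], s[:q], Vt[:q, :].T   # U_Q (W x Q), diagonal of S_Q, V_Q (D x Q)

def fold_in(c, U_q, s_q):
    """Eq.(5): map a W-dimensional vector c to its Q-space pseudo-document."""
    return (c @ U_q) / s_q

def similarity(v_i, v_j):
    """Eq.(6): cosine of the angle between two Q-space vectors."""
    return float(v_i @ v_j / (np.linalg.norm(v_i) * np.linalg.norm(v_j)))

def distance(v_i, v_j):
    """Angular distance k = arccos(g), clipped for numerical safety."""
    return float(np.arccos(np.clip(similarity(v_i, v_j), -1.0, 1.0)))

rng = np.random.default_rng(0)
H = rng.random((6, 4))                    # toy matrix: W = 6 n-grams, D = 4 documents
U_q, s_q, V_q = lsa_fit(H, q=2)
v_first = fold_in(H[:, 0], U_q, s_q)      # fold in the first training document
```

Folding a training column of $H$ back in reproduces the corresponding row of $V_Q$, which is a convenient sanity check that the projection of unseen test documents is consistent with the training space.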
3.3 Bag-of-Sounds Language Classifier 
The bag-of-sounds phonotactic LM benefits from 
several properties of vector space modeling and 
LSA.  
1) It allows for representing a spoken document 
as a vector of n-gram features, such as unigram, 
bigram, trigram, and the mixture of them; 
2) It provides a well-defined distance metric for 
measurement of phonotactic distance between 
spoken documents;  
3) It processes spoken documents in a lower
dimensional Q-space, which makes the bag-of-sounds
phonotactic language modeling, $\lambda_l^{LM}$,
and classification computationally manageable.
Suppose we have only one prototypical vector $c_l$
and its projection in the Q-space $v_l$ to represent
language $l$. Applying LSA to the term-document
matrix $H: W \times L$, a minimum distance classifier is
formulated:

$$\hat{l} = \arg\min_{l \in \Lambda} k(v_p, v_l) \qquad (7)$$

In Eq.(7), $v_p$ is the Q-space projection of $c_p$, a test
document.
Clearly, it is very restrictive for each language
to have just one prototypical vector, also
referred to as a centroid. The pattern of language
distribution is inherently multi-modal, so it is
unlikely to be well fitted by a single vector. One solution
to this problem is to span the language space with
multiple vectors. Applying LSA to a term-document
matrix $H: W \times L'$, where $L' = L \times M$ assuming
each language $l$ is represented by a set of
$M$ vectors, $\Phi_l$, a new classifier, using the k-nearest
neighbor rule (Duda and Hart, 1973), is formulated,
named k-nearest classifier (KNC):

$$\hat{l} = \arg\min_{l \in \Lambda} \sum_{l' \in \phi_l} k(v_p, v_{l'}) \qquad (8)$$

where $\phi_l$ is the set of k-nearest neighbors to $v_p$ and
$\phi_l \subset \Phi_l$.
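The decision rule of Eq.(8) can be sketched as below. The data layout (a dictionary from language label to an array of centroids) and the function names are assumptions made for illustration; setting `k=1` with a single centroid per language reduces the rule to the minimum distance classifier of Eq.(7).

```python
import numpy as np

def angular_distance(a, b):
    """Distance k = arccos of the cosine similarity between two vectors."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def knc_classify(v_p, centroids, k=3):
    """Eq.(8): pick the language whose k nearest centroids are closest in total.

    centroids: dict mapping language -> (M x Q) array of Q-space vectors.
    """
    totals = {}
    for lang, vectors in centroids.items():
        dists = sorted(angular_distance(v_p, v) for v in vectors)
        totals[lang] = sum(dists[:k])     # sum over the k nearest neighbors
    return min(totals, key=totals.get), totals

# Toy Q-space with Q = 2 and M = 3 centroids per hypothetical language.
centroids = {
    "A": np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3]]),
    "B": np.array([[0.0, 1.0], [0.1, 0.9], [0.2, 1.0]]),
}
label, totals = knc_classify(np.array([1.0, 0.2]), centroids, k=3)
```

Each test trial touches all $L \times M$ centroids, which is the cost the text refers to when $M$ is large.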
Among many ways to derive the $M$ centroid vectors,
here is one option. Suppose that we have a set
of training documents $D_l$ for language $l$, as subset
of corpus $\Omega$, $D_l \subset \Omega$ and $\cup_{l=1}^{L} D_l = \Omega$. To derive
the $M$ vectors, we choose to carry out vector
quantization (VQ) to partition $D_l$ into $M$ cells $D_{l,m}$ in the
Q-space such that $\cup_{m=1}^{M} D_{l,m} = D_l$ using similarity
metric Eq.(6). All the documents in each cell
$D_{l,m}$ can then be merged to form a super-document,
which is further projected into a Q-space vector
$v_{l,m}$. This results in $M$ prototypical centroids
$v_{l,m} \in \Phi_l$ $(m = 1, \ldots, M)$. Using KNC, a test vector is
compared with $M$ vectors to arrive at the k-nearest
neighbors for each language, which can be
computationally expensive when $M$ is large.
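One possible realization of this procedure is sketched below, using a k-means style quantizer driven by cosine similarity. Several choices here are our own assumptions and not specified in the text: the evenly spaced initialization, the fixed number of iterations, the dropping of empty cells, and the merging of a cell by summing its already weighted vectors. The function name `vq_centroids` is hypothetical, and all document vectors are assumed to be non-zero.

```python
import numpy as np

def vq_centroids(C, U_q, s_q, m, iters=20):
    """Partition one language's documents into m cells and project merged cells.

    C: W x N weighted count vectors of the language's N training documents.
    Returns one Q-space vector per non-empty cell.
    """
    V = (C.T @ U_q) / s_q                                  # fold-in, Eq.(5)
    unit = V / np.linalg.norm(V, axis=1, keepdims=True)    # unit length for cosine
    # Assumption: initialize the codebook with evenly spaced documents.
    codebook = unit[np.linspace(0, len(unit) - 1, m).astype(int)].copy()
    for _ in range(iters):
        cells = np.argmax(unit @ codebook.T, axis=1)       # most similar code, Eq.(6)
        for j in range(m):
            members = unit[cells == j]
            if len(members):
                mean = members.mean(axis=0)
                codebook[j] = mean / np.linalg.norm(mean)
    cells = np.argmax(unit @ codebook.T, axis=1)
    # Merge each non-empty cell into a super-document and project it.
    merged = [C[:, cells == j].sum(axis=1) for j in range(m) if np.any(cells == j)]
    return np.array([(c @ U_q) / s_q for c in merged])

# Toy setup: W = 4, Q = 2, with an identity-like projection for readability.
U_q = np.eye(4)[:, :2]
s_q = np.array([1.0, 1.0])
C = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.1, 0.05, 1.0, 0.8],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
centroids = vq_centroids(C, U_q, s_q, m=2)
```

In this toy example the first two documents fall into one cell and the last two into the other, so the two centroids are the projections of the two merged super-documents.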
Alternatively, one can account for multi-modal 
distribution through finite mixture model. A mix-
ture model is to represent the M discrete compo-
nents with soft combination. To extend the KNC 
into a statistical framework, it is necessary to map 
our distance metric Eq.(6) into a probability meas-
ure. One way is for the distance measure to induce 
a family of exponential distributions with pertinent 
marginality constraints. In practice, what we need 
is a reasonable probability distribution, which 
sums to one, to act as a lookup table for the dis-
tance measure. We here choose to use the empiri-
cal multivariate distribution constructed by 
allocating the total probability mass in proportion 
to the distances observed with the training data. In 
short, this reduces the task to a histogram
normalization. In this way, we map the distance
$k(c_i, c_j)$ to a conditional probability distribution
$p(v_i \mid v_j)$ subject to $\sum_{i=1}^{|\Omega|} p(v_i \mid v_j) = 1$.
Now that we are in the probability domain,
techniques such as mixture smoothing can be readily
applied to model a language class with finer fitting.
Let’s re-visit the task of $L$ language forced-choice
classification. Similar to KNC, suppose we
have $M$ centroids $v_{l,m} \in \Phi_l$ $(m = 1, \ldots, M)$ in the
Q-space for each language $l$. Each centroid represents
a class. The class conditional probability can be
described as a linear combination of $p(v_i \mid v_{l,m})$:

$$p(v_i \mid \lambda_l^{LM}) = \sum_{m=1}^{M} p(v_{l,m}) \, p(v_i \mid v_{l,m}) \qquad (9)$$

The probability $p(v_{l,m})$ functionally serves as a
mixture weight of $p(v_i \mid v_{l,m})$. Together with a set
of centroids $v_{l,m} \in \Phi_l$ $(m = 1, \ldots, M)$, $p(v_i \mid v_{l,m})$
and $p(v_{l,m})$ define a mixture model $\lambda_l^{LM}$.
$p(v_i \mid v_{l,m})$ is estimated by histogram normalization and
$p(v_{l,m})$ is estimated under the maximum likelihood
criterion, $p(v_{l,m}) = C_{m,l} / C_l$, where $C_l$ is the total
number of documents in $D_l$, of which $C_{m,l}$ documents
fall into the cell $m$.
An Expectation-Maximization iterative process
can be devised for training of $\lambda_l^{LM}$ to maximize the
likelihood Eq.(9) over the entire training corpus:

$$p(\Omega \mid \Lambda) = \prod_{l=1}^{L} \prod_{d=1}^{|D_l|} p(v_d \mid \lambda_l^{LM}) \qquad (10)$$

Using the phonotactic LM score $P(\hat{T}_l \mid \lambda_l^{LM})$ for
classification, with $\hat{T}_l$ being represented by the
bag-of-sounds vector $v_p$, Eq.(2) can be reformulated
as Eq.(11), named mixture-model classifier
(MMC):

$$\hat{l} = \arg\max_{l \in \Lambda} p(v_p \mid \lambda_l^{LM}) = \arg\max_{l \in \Lambda} \sum_{m=1}^{M} p(v_{l,m}) \, p(v_p \mid v_{l,m}) \qquad (11)$$
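The decision rule of Eq.(11), together with the maximum likelihood mixture weights $C_{m,l} / C_l$, can be sketched as follows. The component likelihoods $p(v_p \mid v_{l,m})$ are taken as given inputs here, since their estimation by histogram normalization is not spelled out in enough detail to reproduce; the function names and the array layout are assumptions for illustration.

```python
import numpy as np

def mixture_weights(cell_counts):
    """Maximum likelihood weights p(v_{l,m}) = C_{m,l} / C_l, one row per language."""
    cell_counts = np.asarray(cell_counts, dtype=float)
    return cell_counts / cell_counts.sum(axis=1, keepdims=True)

def mmc_classify(weights, component_likelihoods):
    """Eq.(11): argmax over languages of sum_m p(v_{l,m}) * p(v_p | v_{l,m}).

    weights, component_likelihoods: L x M arrays.
    """
    scores = (np.asarray(weights) * np.asarray(component_likelihoods)).sum(axis=1)
    return int(np.argmax(scores)), scores

# Toy example with L = 2 languages and M = 2 mixture components.
weights = mixture_weights([[3, 1],      # language 0: 3 and 1 documents per cell
                           [2, 2]])     # language 1: 2 and 2 documents per cell
likelihoods = np.array([[0.1, 0.4],     # assumed values of p(v_p | v_{0,m})
                        [0.3, 0.2]])    # assumed values of p(v_p | v_{1,m})
best, scores = mmc_classify(weights, likelihoods)
```

Unlike KNC, every component contributes to the score in proportion to its weight, which is the soft combination that allows a smaller $M$ to be used.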
To establish fair comparison with P-PRLM, as
shown in Figure 3, we devise our bag-of-sounds
classifier to solely use the LM score
$P(\hat{T}_l \mid \lambda_l^{LM})$ for classification decision whereas the
acoustic score $P(O \mid \hat{T}_l, \lambda_l^{AM})$ may potentially help
as reported in (Singer et al., 2003).

 
 
Figure 3.  A bag-of-sounds classifier. A unified 
front-end followed by L parallel bag-of-sounds 
phonotactic LMs. 
4 Experiments 
This section will experimentally analyze the per-
formance of the proposed bag-of-sounds frame-
work using the 1996 NIST Language Recognition 
Evaluation (LRE) data. The database was intended 
to establish a baseline of performance capability 
for language recognition of conversational tele-
phone speech. The database contains recorded 
speech of 12 languages: Arabic, English, Farsi, 
French, German, Hindi, Japanese, Korean, Manda-
rin, Spanish, Tamil and Vietnamese. We use the 
training set and development set from the LDC CallFriend
corpus³ as the training data. Each conversation
is segmented into overlapping sessions of
about 30 seconds each, resulting in about 12,000 
sessions for each language. The evaluation set con-
sists of 1,492 30-sec sessions, each distributed 
among the various languages of interest. We treat a 
30-sec session as a spoken document in both train-
ing and testing. We report error rates (ER) of the 
1,492 test trials. 
4.1 Effect of Acoustic Vocabulary 
The choice of n-gram affects the performance of 
LID systems. Here we would like to see how a bet-
ter choice of acoustic vocabulary can help convert 
a spoken document into a phonotactically dis-
criminative space. There are two parameters that 
determine the acoustic vocabulary: the choice of 
acoustic token, and the choice of n-grams. In this 
paper, the former concerns the size of an acoustic 
system Y in the unified front-end. It is studied in 
more detail in (Ma et al., 2005). We set Y to 32 in
this experiment; the latter decides what features are to
be included in the vector space. The vector space
modeling allows for multiple heterogeneous
features in one vector. We introduce three types of
acoustic vocabulary (AV) with mixture of token
unigram, bigram, and trigram:
a) AV1: 32 broad class phonemes as unigram,
selected from 12 languages, also referred to as
P-ASM as detailed in (Ma et al., 2005)
b) AV2: AV1 augmented by 32×32 bigrams of
AV1, amounting to 1,056 tokens
c) AV3: AV2 augmented by 32×32×32 trigrams
of AV1, amounting to 33,824 tokens

³ See http://www.ldc.upenn.edu/. The overlap between 1996
NIST evaluation data and CallFriend database has been removed
from training data as suggested in the 2003 NIST LRE
website http://www.nist.gov/speech/tests/index.htm

[Figure 3 block labels: spoken utterance; Unified VT with acoustic model $\lambda^{AM}$; LM-1: Chinese ($\lambda_1^{LM}$), LM-2: English ($\lambda_2^{LM}$), ..., LM-L: French ($\lambda_L^{LM}$); language classifier; hypothesized language.]

        AV1     AV2     AV3
ER %    46.1    32.8    28.3
Table 1.  Effect of acoustic vocabulary (KNC) 
 
We carry out experiments with KNC classifier 
of 4,800 centroids. Applying k-nearest-neighboring 
rule, k is empirically set to 3. The error rates are 
reported in Table 1 for the experiments over the 
three AV types. It is found that high-order token n-
grams improve LID performance.   This reaffirms 
many previous findings that n-gram phonotactics 
serves as a valuable cue in LID. 
4.2 Effect of Model Size 
As discussed in KNC, one would expect to im-
prove the phonotactic model by using more cen-
troids. Let’s examine how the number of centroid 
vectors M affects the performance of KNC. We set 
the acoustic system size Y to 128, k-nearest to 3, 
and only use token bigrams in the bag-of-sounds 
vector. In Table 2, it is not surprising to find that 
the performance improves as M increases. However,
it is not practical to have large M because
$L' = L \times M$ comparisons need to take place in
each test trial.
 
#M      1,200   2,400   4,800   12,000
ER %    17.0    15.7    15.4    14.8
Table 2. Effect of number of centroids (KNC) 
 
To reduce computation, MMC attempts to use
a smaller number of mixtures M to represent the
phonotactic space. With the smoothing effect of the
mixture model, we expect to use less computation to
achieve performance similar to KNC. In the
experiment reported in Table 3, we find that MMC
(M=1,024) achieves 14.9% error rate, which nearly
matches the best result in the KNC experiment
(14.8% at M=12,000) with much less computation.
 
#M      4       16      64      256     1,024
ER %    29.6    26.4    19.7    16.0    14.9
Table 3. Effect of number of mixtures (MMC) 
4.3 Discussion 
The bag-of-sounds approach has achieved equal 
success in both 1996 and 2003 NIST LRE data-
bases. As more results are published on the 1996 
NIST LRE database, we choose it as the platform 
of comparison. In Table 4, we report the perform-
ance across different approaches in terms of error 
rate for a quick comparison. MMC presents a
12.4% relative ER reduction over the best reported
result⁴ (Torres-Carrasquillo et al., 2002).
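The 12.4% figure appears consistent with a relative reduction computed from the two best entries in Table 4 (17.0% for the strongest fused P-PRLM system and 14.9% for MMC). The short check below makes the arithmetic explicit; the function name is our own.

```python
def relative_error_reduction(baseline_er, new_er):
    """Relative reduction in error rate, in percent."""
    return 100.0 * (baseline_er - new_er) / baseline_er

# Error rates taken from Table 4: best previous result vs. MMC.
reduction = relative_error_reduction(17.0, 14.9)
```

The absolute difference is 2.1 percentage points, which corresponds to roughly 12.4% of the 17.0% baseline.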
It is interesting to note that the bag-of-sounds 
classifier outperforms its P-PRLM counterpart by a 
wide margin (14.9% vs 22.0%). This is attributed 
to the global phonotactic features in $\lambda_l^{LM}$. The
performance gain in (Torres-Carrasquillo et al., 
2002; Singer et al., 2003) was obtained mainly by 
fusing scores from several classifiers, namely 
GMM, P-PRLM and SVM, to benefit from both 
acoustic and language model scores. Noting that 
the bag-of-sounds classifier in this work solely re-
lies on the LM score, it is believed that fusing with 
scores from other classifiers will further boost the 
LID performance.  
 
 
Approach                                    ER %
P-PRLM⁵                                     22.0
P-PRLM + GMM acoustic⁵                      19.5
P-PRLM + GMM acoustic + GMM tokenizer⁵      17.0
Bag-of-sounds classifier (MMC)              14.9
Table 4. Benchmark of different approaches 
 
Besides the error rate reduction, the bag-of-sounds
approach also simplifies the on-line computing
procedure over its P-PRLM counterpart. It
would be interesting to estimate the on-line
computational need of MMC. The cost incurred has
two main components: 1) the construction of the
pseudo document vector, as done via Eq.(5); 2)
$L' = L \times M$ vector comparisons. The computing
cost is estimated to be $O(Q^2)$ per test trial
(Bellegarda, 2000). For typical values of Q, this
amounts to less than 0.05 Mflops. While this is
more expensive than the usual table look-up in
conventional n-gram LM, the performance improvement
is able to justify the relatively modest
computing overhead.

⁴ Previous results are also reported in DCF, DET, and equal
error rate (EER). Comprehensive benchmarking for
bag-of-sounds phonotactic LM will be reported soon.
⁵ Results extracted from (Torres-Carrasquillo et al., 2002)
5 Conclusion 
We have proposed a phonotactic LM approach to
the LID problem. The concept of bag-of-sounds is
introduced, for the first time, to model phonotactics
present in a spoken language over a larger context. 
With bag-of-sounds phonotactic LM, a spoken 
document can be treated as a text-like document of 
acoustic tokens. This way, the well-established 
LSA technique can be readily applied. This novel 
approach not only suggests a paradigm shift in LID, 
but also brings 12.4% error rate reduction over one 
of the best reported results on the 1996 NIST LRE 
data. It has proven to be very successful.  
We would like to extend this approach to other 
spoken document categorization tasks. In monolin-
gual spoken document categorization, we suggest 
that the semantic domain can be characterized by 
latent phonotactic features. Thus it is straightfor-
ward to extend the proposed bag-of-sounds frame-
work to spoken document categorization. 
Acknowledgement 
The authors are grateful to Dr. Alvin F. Martin of 
the NIST Speech Group for his advice when pre-
paring the 1996 NIST LRE experiments, to Dr G. 
M. White and Ms Y. Chen of Institute for Info-
comm Research for insightful discussions.  
References  
Jerome R. Bellegarda. 2000. Exploiting latent semantic 
information in statistical language modeling, In Proc. 
of the IEEE, 88(8):1279-1296. 
M. W. Berry, S.T. Dumais and G.W. O’Brien. 1995. 
Using Linear Algebra for intelligent information re-
trieval, SIAM Review, 37(4):573-595. 
William B. Cavnar, and John M. Trenkle. 1994. N-
Gram-Based Text Categorization, In Proc. of 3rd 
Annual Symposium on Document Analysis and In-
formation Retrieval, pp. 161-169. 
Jennifer Chu-Carroll, and Bob Carpenter. 1999. Vector-
based Natural Language Call Routing, Computa-
tional Linguistics, 25(3):361-388. 
S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and
R. Harshman. 1990. Indexing by latent semantic
analysis, Journal of the American Society for
Information Science, 41(6):391-407.
Richard O. Duda and Peter E. Hart. 1973. Pattern Clas-
sification and scene analysis. John Wiley & Sons 
James L. Hieronymus. 1994. ASCII Phonetic Symbols 
for the World’s Languages: Worldbet. Technical Re-
port AT&T Bell Labs. 
Spark Jones, K. 1972. A statistical interpretation of 
term specificity and its application in retrieval, Jour-
nal of Documentation, 28:11-20 
Bin Ma, Haizhou Li and Chin-Hui Lee, 2005. An Acous-
tic Segment Modeling Approach to Automatic Lan-
guage Identification, submitted to Interspeech 2005 
Yeshwant K. Muthusamy, Neena Jain, and Ronald A.  
Cole. 1994. Perceptual benchmarks for automatic 
language identification, In Proc. of ICASSP 
Corinna Ng, Ross Wilkinson, and Justin Zobel. 2000.
Experiments in spoken document retrieval using
phoneme n-grams, Speech Communication, 32(1-2):61-77.
G. Salton, 1971. The SMART Retrieval System, Pren-
tice-Hall, Englewood Cliffs, NJ, 1971 
E. Singer, P.A. Torres-Carrasquillo, T.P. Gleason, W.M. 
Campbell and D.A. Reynolds. 2003. Acoustic, Pho-
netic and Discriminative Approaches to Automatic 
language recognition, In Proc. of Eurospeech 
Masahide Sugiyama. 1991. Automatic language recog-
nition using acoustic features, In Proc. of ICASSP. 
Pedro A. Torres-Carrasquillo, Douglas A. Reynolds, 
and J.R. Deller. Jr. 2002. Language identification us-
ing Gaussian Mixture model tokenization, in Proc. of 
ICASSP. 
Yonghong Yan, and Etienne Barnard. 1995. An ap-
proach to automatic language identification based on 
language dependent phone recognition, In Proc. of 
ICASSP. 
George K. Zipf. 1949. Human Behavior and the Principle
of Least Effort, an introduction to human ecology.
Addison-Wesley, Reading, Mass.
Marc A. Zissman. 1996. Comparison of four ap-
proaches to automatic language identification of 
telephone speech, IEEE Trans. on Speech and Audio 
Processing, 4(1):31-44. 
