Cross-Lingual Lexical Triggers in Statistical Language Modeling*
Woosung Kim
The Johns Hopkins University
3400 N. Charles St., Baltimore, MD
woosung@cs.jhu.edu
Sanjeev Khudanpur
The Johns Hopkins University
3400 N. Charles St., Baltimore, MD
khudanpur@jhu.edu
Abstract
We propose new methods to take advan-
tage of text in resource-rich languages
to sharpen statistical language models in
resource-deficient languages. We achieve
this through an extension of the method
of lexical triggers to the cross-language
problem, and by developing a likelihood-
based adaptation scheme for combining
a trigger model with an $N$-gram model.
We describe the application of such lan-
guage models for automatic speech recog-
nition. By exploiting a side-corpus of con-
temporaneous English news articles for
adapting a static Chinese language model
to transcribe Mandarin news stories, we
demonstrate significant reductions in both
perplexity and recognition errors. We
also compare our cross-lingual adaptation
scheme to monolingual language model
adaptation, and to an alternate method for
exploiting cross-lingual cues, via cross-
lingual information retrieval and machine
translation, proposed elsewhere.
1 Data Sparseness in Language Modeling
Statistical techniques have been remarkably successful
in automatic speech recognition (ASR) and natural
language processing (NLP) over the last two decades.
This success, however, depends crucially on the
availability of large amounts of accurate, suitably
annotated training data, and it is difficult to build
a usable statistical model in their absence. Most of
the success, therefore, has been witnessed in the
so-called resource-rich languages. More recently,
there has been increasing interest in languages such
as Mandarin and Arabic for ASR and NLP, and data
resources are being created for them at considerable
cost. The data-resource bottleneck, however, is likely
to remain for a majority of the world's languages in
the foreseeable future.

* This research was supported by the National Science
Foundation (via Grant No. ITR-0225656 and IIS-9982329)
and the Office of Naval Research (via Contract No.
N00014-01-1-0685).

Methods have been proposed to bootstrap acous-
tic models for ASR in resource deficient languages
by reusing acoustic models from resource-rich lan-
guages (Schultz and Waibel, 1998; Byrne et al.,
2000). Morphological analyzers, noun-phrase chun-
kers, POS taggers, etc., have also been developed
for resource deficient languages by exploiting trans-
lated or parallel text (Yarowsky et al., 2001). Khu-
danpur and Kim (2002) recently proposed using
cross-lingual information retrieval (CLIR) and ma-
chine translation (MT) to improve a statistical lan-
guage model (LM) in a resource-deficient language
by exploiting copious amounts of text available in
resource-rich languages. When transcribing a news
story in a resource-deficient language, their core
idea is to use the first-pass output of a rudimentary
ASR system as a query for CLIR to identify a
contemporaneous English document on that news topic,
and then to use MT to provide a rough translation
which, even if not fluent, is adequate to update
estimates of word frequencies and the LM vocabulary.
They report up to a 28% reduction in perplexity on
Chinese text from the Hong Kong News corpus.
In spite of their considerable success, some short-
comings remain in the method used by Khudanpur
and Kim (2002). Specifically, stochastic translation
lexicons estimated using the IBM method (Brown
et al., 1993) from a fairly large sentence-aligned
Chinese-English parallel corpus are used in their ap-
proach — a considerable demand for a resource-
deficient language. It is suggested that an easier-
to-obtain document-aligned comparable corpus may
suffice, but no results are reported. Furthermore, for
each Mandarin news story, the single best match-
ing English article obtained via CLIR is translated
and used for priming the Chinese LM, no matter
how good the CLIR similarity, nor are other well-
matching English articles considered. This issue
clearly deserves further attention. Finally, ASR re-
sults are not reported in their work, though their pro-
posed solution is clearly motivated by an ASR task.
We address these three issues in this paper.
Section 2 begins, for the sake of completeness,
with a review of the cross-lingual story-specific LM
proposed by Khudanpur and Kim (2002). A notion
of cross-lingual lexical triggers is proposed in Sec-
tion 3, which overcomes the need for a sentence-
aligned parallel corpus for obtaining translation lex-
icons. After a brief detour to describe topic-
dependent LMs in Section 4, a description of the
ASR task is provided in Section 5, and ASR results
on Mandarin Broadcast News are presented in Sec-
tion 6. The issue of how many English articles to
retrieve and translate into Chinese is resolved by a
likelihood-based scheme proposed in Section 6.1.
2 Cross-Lingual Story-Specific LMs
For the sake of illustration, consider the task of
sharpening a Chinese language model for transcrib-
ing Mandarin news stories by using a large corpus
of contemporaneous English newswire text. Man-
darin Chinese is, of course, not resource-deficient
for language modeling — 100s of millions of words
are available on-line. However, we choose it for our
experiments partly because it is sufficiently different
from English to pose a real challenge, and because
the availability of large text corpora in fact permits
us to simulate controlled resource deficiency.
Let $d^C_1, \ldots, d^C_N$ denote the text of $N$ test
stories to be transcribed by an ASR system, and let
$d^E_1, \ldots, d^E_N$ denote their corresponding or
aligned English newswire articles. Correspondence here
does not imply that the English document $d^E_i$ needs
to be an exact translation of the Mandarin story $d^C_i$.
It is quite adequate, for instance, if the two stories
report the same news event. This approach is expected
to be helpful even when the English document is
merely on the same general topic as the Mandarin
story, although the closer the content of a pair of
articles, the better the proposed methods are likely to
work. Assume for the time being that a sufficiently
good Chinese-English story alignment is given.
Assume further that we have at our disposal a
stochastic translation dictionary, a probabilistic
model of the form $P_T(c|e)$, which provides the
Chinese translation $c \in \mathcal{C}$ of each English
word $e \in \mathcal{E}$, where $\mathcal{C}$ and
$\mathcal{E}$ respectively denote our Chinese and
English vocabularies.
2.1 Computing a Cross-Lingual Unigram LM
Let $\hat{P}(e|d^E_i)$ denote the relative frequency of a
word $e$ in the document $d^E_i$, $e \in \mathcal{E}$,
$1 \le i \le N$. It seems plausible that,
$\forall c \in \mathcal{C}$,
$$P_{\text{CL-unigram}}(c|d^E_i) = \sum_{e \in \mathcal{E}} P_T(c|e)\, \hat{P}(e|d^E_i), \tag{1}$$
would be a good unigram model for the $i$-th Mandarin
story $d^C_i$. We use this cross-lingual unigram
statistic to sharpen a statistical Chinese LM used for
processing the test story $d^C_i$. One way to do this is
via linear interpolation
$$P_{\text{CL-interpolated}}(c_k|c_{k-1}, c_{k-2}, d^E_i) = \lambda\, P_{\text{CL-unigram}}(c_k|d^E_i) + (1-\lambda)\, P(c_k|c_{k-1}, c_{k-2}) \tag{2}$$
of the cross-lingual unigram model (1) with a static
trigram model for Chinese, where the interpolation
weight $\lambda$ may be chosen off-line to maximize the
likelihood of some held-out Mandarin stories. The
improvement in (2) is expected from the fact that,
unlike the static text from which the Chinese trigram
LM is estimated, $d^E_i$ is semantically close to
$d^C_i$, and even the adjustment of unigram statistics,
based on a stochastic translation model, may help.
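As an illustration of (1) and (2), the following is a minimal Python sketch, not the authors' implementation. The function names, the nested-dict layout of the lexicon, and the renormalization over English words that are missing from the lexicon are all assumptions made for this example; the text does not say how such words are handled.

```python
from collections import Counter

def relative_freq(tokens):
    """Relative frequency P_hat(w | d) of each word in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def cl_unigram(english_tokens, p_t):
    """Equation (1): P(c | d_E) = sum_e P_T(c | e) * P_hat(e | d_E).

    p_t maps an English word e to a dict {c: P_T(c | e)} (assumed layout).
    English words absent from p_t contribute no mass, so the result is
    renormalized here; that step is an assumption of this sketch.
    """
    p_e = relative_freq(english_tokens)
    p_c = Counter()
    for e, pe in p_e.items():
        for c, pce in p_t.get(e, {}).items():
            p_c[c] += pce * pe
    z = sum(p_c.values())
    return {c: v / z for c, v in p_c.items()} if z > 0 else {}

def interpolate(p_unigram_c, p_trigram_c, lam):
    """Equation (2) for one word, given its two component probabilities."""
    return lam * p_unigram_c + (1.0 - lam) * p_trigram_c
```

For a toy lexicon in which "bank" has two equally likely translations and "river" has one, a four-word English document with one out-of-lexicon word yields a uniform distribution over the three Chinese words.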
Figure 1 shows the data flow in this cross-lingual
LM adaptation approach, where the output of the
first pass of an ASR system is used by a CLIR system
to find an English document $d^E_i$, an MT system
computes the statistic of (1), and the ASR system
uses the LM of (2) in a second pass.
[Figure 1: block diagram. Components: Mandarin Story;
ASR; Baseline Chinese Acoustic Model; Baseline Chinese
Language Model; Chinese Dictionary; Automatic
Transcription; Cross-Language Information Retrieval;
Contemporaneous English Articles; English Article
Aligned with Mandarin Story; Machine Translation;
Statistical Translation Lexicon; Cross-Language
Unigram Model.]
Figure 1: Story-Specific Cross-Lingual Adaptation
of a Chinese Language Model using English Text.
2.2 Obtaining Matching English Documents
To illustrate how one may obtain the English document
$d^E_i$ to match a Mandarin story $d^C_i$, let us
assume that we also have a stochastic reverse-translation
lexicon $P_T(e|c)$. One obtains from the first pass ASR
output, cf. Figure 1, the relative frequency estimate
$\hat{P}(c|d^C_i)$ of Chinese words $c$ in $d^C_i$,
$c \in \mathcal{C}$, and uses the translation lexicon
$P_T(e|c)$ to compute, $\forall e \in \mathcal{E}$,
$$P_{\text{CL-unigram}}(e|d^C_i) = \sum_{c \in \mathcal{C}} P_T(e|c)\, \hat{P}(c|d^C_i), \tag{3}$$
an English bag-of-words representation of the Mandarin
story $d^C_i$ as used in standard vector-based
information retrieval. The document with the highest
TF-IDF weighted cosine-similarity to $d^C_i$ is selected:
$$d^E_i = \arg\max_{d^E_j} \mathrm{sim}\big(P_{\text{CL-unigram}}(e|d^C_i),\ \hat{P}(e|d^E_j)\big).$$
Readers familiar with information retrieval literature
will recognize this to be the standard query-translation
approach to CLIR.
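The query-translation retrieval step can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, and the particular smoothed IDF formula is one common variant chosen for the example, since the text does not specify which TF-IDF weighting was used.

```python
import math
from collections import Counter

def translate_query(p_c, p_t_rev):
    """Equation (3): P(e | d_C) = sum_c P_T(e | c) * P_hat(c | d_C).

    p_t_rev maps a Chinese word c to a dict {e: P_T(e | c)} (assumed layout).
    """
    query = Counter()
    for c, pc in p_c.items():
        for e, pec in p_t_rev.get(c, {}).items():
            query[e] += pec * pc
    return dict(query)

def idf_weights(docs):
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1; an assumed variant."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    return {w: math.log((1 + n) / (1 + k)) + 1.0 for w, k in df.items()}

def cosine(u, v, idf):
    """Cosine similarity of two sparse vectors after IDF weighting."""
    uw = {w: x * idf.get(w, 0.0) for w, x in u.items()}
    vw = {w: x * idf.get(w, 0.0) for w, x in v.items()}
    dot = sum(x * vw.get(w, 0.0) for w, x in uw.items())
    nu = math.sqrt(sum(x * x for x in uw.values()))
    nv = math.sqrt(sum(x * x for x in vw.values()))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def best_match(query, docs, idf):
    """Index of the English document most similar to the translated query."""
    def rel(doc):
        counts = Counter(doc)
        return {w: k / len(doc) for w, k in counts.items()}
    sims = [cosine(query, rel(d), idf) for d in docs]
    return max(range(len(docs)), key=sims.__getitem__)
```

Here `docs` is a list of tokenized English articles and `p_c` is the relative-frequency vector of the first-pass Chinese transcript.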
2.3 Obtaining Stochastic Translation Lexicons
The translation lexicons $P_T(c|e)$ and $P_T(e|c)$ may
be created out of an available electronic translation
lexicon, with multiple translations of a word being
treated as equally likely. Stemming and other
morphological analyses may be applied to increase the
vocabulary coverage of the translation lexicons.
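A minimal sketch of the equally-likely construction described above, assuming the electronic lexicon is available as a list of (English word, Chinese word) pairs; the function name and input format are hypothetical.

```python
def uniform_lexicon(entries):
    """Build P_T(c | e) from dictionary pairs (e, c), treating all listed
    translations of an English word as equally likely."""
    translations = {}
    for e, c in entries:
        # A set removes duplicate dictionary entries for the same pair.
        translations.setdefault(e, set()).add(c)
    return {e: {c: 1.0 / len(cs) for c in cs}
            for e, cs in translations.items()}
```

Swapping the order of each pair before calling the function gives the reverse lexicon $P_T(e|c)$ in the same way.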
Alternately, they may also be obtained auto-
matically from a parallel corpus of translated and
sentence-aligned Chinese-English text using statisti-
cal machine translation techniques, such as the pub-
licly available GIZA++ tools (Och and Ney, 2000),
as done by Khudanpur and Kim (2002). Unlike stan-
dard MT systems, however, we apply the translation
models to entire articles, one word at a time, to get a
bag of translated words — cf. (1) and (3).
Finally, for truly resource deficient languages, one
may obtain a translation lexicon via optical character
recognition from a printed bilingual dictionary (cf.
Doerman et al (2002)). This task is arguably easier
than obtaining a large LM training corpus.
3 Cross-Lingual Lexical Triggers
It seems plausible that most of the information one
gets from the cross-lingual unigram LM of (1) is
in the form of the altered statistics of topic-specific
Chinese words conveyed by the statistics of content-
bearing English words in the matching story. The
translation lexicon used for obtaining the informa-
tion, however, is an expensive resource. Yet, if one
were only interested in the conditional distribution
of Chinese words given some English words, there
is no reason to require translation as an intermedi-
ate step. In a monolingual setting, the mutual infor-
mation between lexical pairs co-occurring anywhere
within a long “window” of each-other has been used
to capture statistical dependencies not covered by
$N$-gram LMs (Rosenfeld, 1996; Tillmann and Ney,
1997). We use this inspiration to propose the follow-
ing notion of cross-lingual lexical triggers.
In a monolingual setting, a pair of words $(a, b)$ is
considered a trigger-pair if, given a word-position in
a sentence, the occurrence of $a$ in any of the
preceding word-positions significantly alters the
(conditional) probability that the following word in the
sentence is $b$: $a$ is said to trigger $b$. E.g., the
occurrence of "either" significantly increases the
probability of "or" subsequently in the sentence. The set
of preceding word-positions is variably defined to
include all words from the beginning of the sentence,
paragraph or document, or is limited to a fixed number
of preceding words, limited of course by the beginning
of the sentence, paragraph or document.
In the cross-lingual setting, we consider a pair of
words $(e, c)$, $e \in \mathcal{E}$ and
$c \in \mathcal{C}$, to be a trigger-pair if, given an
English-Chinese pair of aligned documents, the
occurrence of $e$ in the English document significantly
alters the (conditional) probability that the word $c$
appears in the Chinese document: $e$ is said to trigger
$c$. It is plausible that translation-pairs will be
natural candidates for trigger-pairs. It is, however,
not necessary for a trigger-pair to also be a
translation-pair. E.g., the occurrence of Belgrade
in the English document may trigger the Chinese
transliterations of Serbia and Kosovo, and possibly
the translations of China, embassy and bomb. By
inferring trigger-pairs from a document-aligned corpus
of Chinese-English articles, we expect to be able to
discover semantically or topically related pairs in
addition to translation equivalences.
3.1 Identification of Cross-Lingual Triggers
Average mutual information, which measures how
much knowing the value of one random variable
reduces the uncertainty about another, has been
used to identify trigger-pairs. We compute the
average mutual information for every English-Chinese
word pair $(e, c)$ as follows.
Let $\{d^E_i, d^C_i\}$, $i = 1, \ldots, N$, now be a
document-aligned training corpus of English-Chinese
article pairs. Let $\#d(e, c)$ denote the document
frequency, i.e., the number of aligned article-pairs, in
which $e$ occurs in the English article and $c$ in the
Chinese. Let $\#d(e, \bar{c})$ denote the number of
aligned article-pairs in which $e$ occurs in the English
article but $c$ does not occur in the Chinese article. Let
$$P(e, c) = \frac{\#d(e, c)}{N} \quad\text{and}\quad P(e, \bar{c}) = \frac{\#d(e, \bar{c})}{N}.$$
The quantities $P(\bar{e}, c)$ and $P(\bar{e}, \bar{c})$
are similarly defined. Next let $\#d(e)$ denote the
number of English articles in which $e$ occurs, and define
$$P(e) = \frac{\#d(e)}{N} \quad\text{and}\quad P(c|e) = \frac{P(e, c)}{P(e)}.$$
Similarly define $P(\bar{e})$, $P(c|\bar{e})$ via the
document frequency $\#d(\bar{e}) = N - \#d(e)$; define
$P(c)$ via the document frequency $\#d(c)$, etc.
Finally, let
$$I(e; c) = P(e, c) \log\frac{P(c|e)}{P(c)} + P(e, \bar{c}) \log\frac{P(\bar{c}|e)}{P(\bar{c})} + P(\bar{e}, c) \log\frac{P(c|\bar{e})}{P(c)} + P(\bar{e}, \bar{c}) \log\frac{P(\bar{c}|\bar{e})}{P(\bar{c})}.$$
We propose to select word pairs with high mutual
information as cross-lingual lexical triggers.
There are $|\mathcal{E}| \times |\mathcal{C}|$ possible
English-Chinese word pairs, which may be prohibitively
many to search for the pairs with the highest mutual
information. We filter out infrequent words in each
language, say, words appearing fewer than 5 times, then
measure $I(e; c)$ for all possible pairs from the
remaining words, sort them by $I(e; c)$, and select,
say, the top 1 million pairs.
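The document-frequency counts and the average mutual information of Section 3.1 can be computed as in the sketch below. It is an illustration under stated assumptions: the function names are hypothetical, natural logarithms are used because the text does not fix a base, and terms with a zero joint count are taken to contribute zero.

```python
import math
from collections import Counter
from itertools import product

def document_frequencies(pairs):
    """pairs: list of (english_tokens, chinese_tokens) aligned articles.
    Returns the counts of articles containing e, containing c, and both."""
    df_e, df_c, df_ec = Counter(), Counter(), Counter()
    for english_tokens, chinese_tokens in pairs:
        es, cs = set(english_tokens), set(chinese_tokens)
        df_e.update(es)
        df_c.update(cs)
        df_ec.update(product(es, cs))
    return df_e, df_c, df_ec

def average_mi(n, df_e, df_c, df_ec):
    """Average mutual information I(e;c) from document frequencies.

    n: number of aligned article pairs; df_e, df_c, df_ec: counts for one
    specific word pair (e, c). Each summand is
    P(x, y) * log(P(y | x) / P(y)) = (n_xy / n) * log(n_xy * n / (n_x * n_y)).
    """
    total = 0.0
    for e_in, c_in in product((True, False), repeat=2):
        n_e = df_e if e_in else n - df_e
        n_c = df_c if c_in else n - df_c
        if e_in and c_in:
            n_ec = df_ec
        elif e_in:
            n_ec = df_e - df_ec
        elif c_in:
            n_ec = df_c - df_ec
        else:
            n_ec = n - df_e - df_c + df_ec
        if n_ec == 0:
            continue  # 0 * log(0) is taken as 0 (assumption of this sketch)
        total += (n_ec / n) * math.log(n_ec * n / (n_e * n_c))
    return total
```

A pair whose occurrences are statistically independent across articles scores zero, while a pair that always co-occurs in half of the articles scores $\log 2$. Note that enumerating all co-occurring pairs per article, as `document_frequencies` does, is only practical after the frequency filtering described above.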
3.2 Estimating Trigger LM Probabilities
Once we have chosen a set of trigger-pairs, the next
step is to estimate a probability $P_{\text{Trig}}(c|e)$
in lieu of the translation probability $P_T(c|e)$ in (1),
and a probability $P_{\text{Trig}}(e|c)$ in lieu of
$P_T(e|c)$ in (3).
Following the maximum likelihood approach proposed by
Tillmann and Ney (1997), one could choose the trigger
probability $P_{\text{Trig}}(c|e)$ to be based on the
unigram frequency of $c$ among Chinese word tokens
in that subset of aligned documents $d^C_i$ which have
$e$ in $d^E_i$, namely
$$P_{\text{Trig}}(c|e) = \frac{\sum_{i:\, d^E_i \ni e} N_{d^C_i}(c)}{\sum_{c' \in \mathcal{C}} \sum_{i:\, d^E_i \ni e} N_{d^C_i}(c')}, \tag{4}$$
where $N_{d^C_i}(c)$ is the number of occurrences of
$c$ in $d^C_i$.
As an ad hoc alternative to (4), we also use
$$P_{\text{Trig}}(c|e) = \frac{I(e; c)}{\sum_{c' \in \mathcal{C}} I(e; c')}, \tag{5}$$
where we set $I(e; c) = 0$ whenever $(e, c)$ is not a
trigger-pair, and find it to be somewhat more effective
(cf. Section 6.2). Thus (5) is used henceforth in
this paper. Analogous to (1), we set
$$P_{\text{Trig-unigram}}(c|d^E_i) = \sum_{e \in \mathcal{E}} P_{\text{Trig}}(c|e)\, \hat{P}(e|d^E_i), \tag{6}$$
and, again, we build the interpolated model
$$P_{\text{Trig-interpolated}}(c_k|c_{k-1}, c_{k-2}, d^E_i) = \lambda\, P_{\text{Trig-unigram}}(c_k|d^E_i) + (1-\lambda)\, P(c_k|c_{k-1}, c_{k-2}). \tag{7}$$
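The normalization in (5) amounts to the short sketch below. The function name and the dictionary-based data layout are assumptions made for illustration; the mutual-information scores are taken as given.

```python
def trigger_probs(mi, trigger_pairs):
    """Equation (5): P_Trig(c | e) = I(e;c) / sum_c' I(e;c'),
    with I(e;c) treated as 0 for every pair outside trigger_pairs.

    mi: dict {(e, c): I(e;c)}; trigger_pairs: set of selected (e, c) pairs.
    Returns a nested dict {e: {c: P_Trig(c | e)}}.
    """
    totals = {}
    for e, c in trigger_pairs:
        totals[e] = totals.get(e, 0.0) + mi[(e, c)]
    probs = {}
    for e, c in trigger_pairs:
        if totals[e] > 0:
            probs.setdefault(e, {})[c] = mi[(e, c)] / totals[e]
    return probs
```

Because non-trigger pairs are excluded from both numerator and denominator, a high-scoring pair that did not make the selection cut receives no probability mass, and the remaining probabilities for each English word still sum to one.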
4 Topic-Dependent Language Models
The linear interpolation of the story-dependent un-
igram models (1) and (6) with a story-independent
trigram model, as described above, is very reminis-
cent of monolingual topic-dependent language mod-
els (cf. e.g. (Iyer and Ostendorf, 1999)). This moti-
vates us to construct topic-dependent LMs and con-
trast their performance with these models.
To this end, we represent each Chinese article in
the training corpus by a bag-of-words vector, and
cluster the vectors using a standard K-means
algorithm. We use random initialization to seed the
algorithm, and a standard TF-IDF weighted
cosine-similarity as the "metric" for clustering. We
perform a few iterations of the K-means algorithm, and
deem the resulting clusters as representing different
topics. We then use a bag-of-words centroid
created from all the articles in a cluster to
represent each topic. Topic-dependent trigram LMs,
denoted $P_j(c_k|c_{k-1}, c_{k-2})$, are also computed
for each topic exclusively from the articles in the
$j$-th cluster, $1 \le j \le K$.
Each Mandarin test story is represented by a
bag-of-words vector $\hat{P}(c|d^C_i)$ generated from the
first-pass ASR output, and the topic-centroid $t_i$
having the highest TF-IDF weighted cosine-similarity to
it is chosen as the topic of $d^C_i$. Topic-dependent
LMs are then constructed for each story $d^C_i$ as
$$P_{\text{Topic-trigram}}(c_k|c_{k-1}, c_{k-2}, t_i) = \lambda\, P_{t_i}(c_k|c_{k-1}, c_{k-2}) + (1-\lambda)\, P(c_k|c_{k-1}, c_{k-2}) \tag{8}$$
and used in a second pass of recognition.
Alternatives to topic-dependent LMs for exploit-
ing long-range dependencies include cache LMs and
monolingual lexical triggers; both unlikely to be as
effective in the presence of significant ASR errors.
5 ASR Training and Test Corpora
We investigate the use of the techniques described
above for improving ASR performance on Man-
darin news broadcasts using English newswire texts.
We have chosen the experimental ASR setup created
in the 2000 Johns Hopkins Summer Workshop
to study Mandarin pronunciation modeling, extensive
details about which are available in Fung et
al. (2000). The acoustic training data ($\sim$10 hours)
for their ASR system was obtained from the 1997
Mandarin Broadcast News distribution, and
context-dependent state-clustered models were estimated
using initials and finals as subword units. Two Chinese
text corpora and an English corpus are used to
estimate LMs in our experiments. A vocabulary
$\mathcal{C}$ of 51K Chinese words, used in the ASR
system, is also used to segment the training text. This
vocabulary gives an OOV rate of 5% on the test data.
XINHUA: We use the Xinhua News corpus of
about 13 million words to represent the scenario
when the amount of available LM training text bor-
ders on adequate, and estimate a baseline trigram
LM for one set of experiments.
HUB-4NE: We also estimate a trigram model
from only the 96K words in the transcriptions used
for training acoustic models in our ASR system.
This corpus represents the scenario when little or no
additional text is available to train LMs.
NAB-TDT: English text contemporaneous with
the test data is often easily available. For our test set,
described below, we select (from the North Ameri-
can News Text corpus) articles published in 1997 in
The Los Angeles Times and The Washington Post,
and articles from 1998 in the New York Times and
the Associated Press news service (from TDT-2 cor-
pus). This amounts to a collection of roughly 45,000
articles containing about 30-million words of En-
glish text; a modest collection by CLIR standards.
Our ASR test set is a subset (Fung et al., 2000)
of the NIST 1997 and 1998 HUB-4NE benchmark
tests, containing Mandarin news broadcasts
from three sources for a total of about 9800 words.
We generate two sets of lattices using the baseline
acoustic models and bigram LMs estimated from
XINHUA and HUB-4NE. All our LMs are evaluated
by rescoring 300-best lists extracted from these two
sets of lattices. The 300-best lists from the XINHUA
bigram LM are used in all XINHUA experiments,
and those from the HUB-4NE bigram LM in all
HUB-4NE experiments. We report both word error
rates (WER) and character error rates (CER), the
latter being independent of any difference in
segmentation of the ASR output and reference
transcriptions.
6 ASR Performance of Cross-Lingual LMs
We begin by rescoring the 300-best lists from the
bigram lattices with trigram models. For each test
story $d^C_i$, we perform CLIR using the first pass ASR
output to choose the most similar English document
$d^E_i$ from NAB-TDT. Then we create the cross-lingual
unigram model of (1). We also find the interpolation
weight $\lambda$ which maximizes the likelihood
of the 1-best hypotheses of all test utterances from
the first ASR pass. Table 1 shows the perplexity and
WER for XINHUA and HUB-4NE.
Language model       Perp   WER     p-value
XINHUA trigram        426   49.9%   –
  CL-interpolated     375   49.5%   0.208
HUB-4NE trigram      1195   60.1%   –
  CL-interpolated     750   59.3%   < 0.001
Table 1: Word-Perplexity and ASR WER of LMs
based on a single English document and global $\lambda$.
All $p$-values reported in this paper are based on
the standard NIST MAPSSWE test (Pallett et al.,
1990), and indicate the statistical significance of a
WER improvement over the corresponding trigram
baseline, unless otherwise specified.
Evidently, the improvement brought by CL-
interpolated LM is not statistically significant on
XINHUA. On HUB-4NE however, where Chinese
text is scarce, the CL-interpolated LM delivers con-
siderable benefits via the large English corpus.
6.1 Likelihood-Based Story-Specific Selection
of Interpolation Weights and the Number
of English Documents per Mandarin Story
The experiments above naively used the single most
similar English document for each Mandarin story,
and a global $\lambda$ in (2), no matter how similar the
best matching English document is to a given Mandarin
news story. Rather than choosing one most similar
English document from NAB-TDT, it stands to
reason that choosing more than one English document
may be helpful if many have a high similarity
score, and perhaps not using even the best matching
document may be fruitful if the match is sufficiently
poor. It may also help to have a greater interpolation
weight $\lambda$ for stories with good matches, and a
smaller $\lambda$ for others. For experiments in this
subsection, we select a different $\lambda$ for each test
story, again based on maximizing the likelihood of the
1-best output given a CL-Unigram model. The other
issue then is the choice and the number of English
documents to translate.
$N$-best documents: One could choose a predetermined
number $N$ of the best matching English documents
for each Mandarin story. We experimented
with values of 1, 10, 30, 50, 80 and 100, and found
that $N = 30$ gave us the best LM performance,
but only marginally better than $N = 1$ as described
above. Details are omitted, as they are uninteresting.
All documents above a similarity threshold:
The argument against always taking a predetermined
number of the best matching documents may be that
it ignores the goodness of the match. An alternative
is to take all English documents whose similarity to
a Mandarin story exceeds a certain predetermined
threshold. As this threshold is lowered, starting from
a high value, the order in which English documents
are selected for a particular Mandarin story is the
same as the order when choosing the $N$-best documents,
but the number of documents selected now
varies from story to story. It is possible that for
some stories, even the best matching English doc-
ument falls below the threshold at which other sto-
ries have found more than one good match. We ex-
perimented with various thresholds, and found that
while a threshold of a165a78a9a149a42a136a184 gives us the lowest per-
plexity on the test set, the reduction is insignificant.
This points to the need for a story-specific strategy
for choosing the number of English documents, in-
stead of a global threshold.
Likelihood-based selection of the number of
English documents: Figure 2 shows the perplex-
ity of the reference transcriptions of one typical test
story under the LM (2) as a function of the number
of English documents chosen for creating (1). For
each choice of the number of English documents,
the interpolation weight $\lambda$ in (2) is chosen to
maximize the likelihood (also shown) of the first pass
output. This suggests that choosing the number of
English documents to maximize the likelihood of the
first pass ASR output is a good strategy.
[Figure 2: plot. x-axis: number of English documents
$d^E_i$ (ticks 0 to 150); y-axes: perplexity of the
reference transcription (ticks 300 to 600) and negative
log-likelihood of the 1-best list (ticks 550 to 580).]
Figure 2: Perplexity of the Reference Transcription
and the Likelihood of the ASR Output v/s Number
of $d^E_i$ for a Typical Test Story.
For each Mandarin test story, we choose the
1000-best-matching English documents and divide
the dynamic range of their similarity scores evenly
into 10 intervals. Next, we choose the documents
in the top $\frac{1}{10}$-th of the range of similarity
scores, not necessarily the top-100 documents, compute
$P_{\text{CL-unigram}}(c|d^E_i)$, determine the
$\lambda$ in (2) that maximizes the likelihood of the
first pass output of only the utterances in that story,
and record this likelihood. We repeat this with
documents in the top $\frac{2}{10}$-th of the range of
similarity scores, the top $\frac{3}{10}$-th, etc.,
and obtain the likelihood as a function of the
similarity threshold. We choose the threshold that
maximizes the likelihood of the first pass output. Thus
the number of English documents $d^E_i$ in (1), as well
as the interpolation weight $\lambda$ in (2), are chosen
dynamically for each Mandarin story to maximize the
likelihood of the ASR output. Table 2 shows ASR
results for this likelihood-based story-specific
adaptation scheme.
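The selection procedure above can be sketched as follows. This is an illustration under several assumptions rather than the authors' code: the function names are hypothetical, the interpolation weight is found by a simple grid search (the text does not say how the maximization was carried out), the static trigram probabilities of the first-pass words are assumed to be precomputed, and `build_unigram` stands in for whatever routine computes the cross-lingual unigram model (1) from a set of documents.

```python
import math

def log_likelihood(words, p_uni, p_tri, lam):
    """Log-likelihood of the first-pass words under the interpolated LM (2).
    p_tri[k] is the static trigram probability of words[k] given its history."""
    return sum(math.log(lam * p_uni.get(w, 0.0) + (1.0 - lam) * pt)
               for w, pt in zip(words, p_tri))

def best_lambda(words, p_uni, p_tri, grid):
    """Grid search for the weight that maximizes the likelihood.
    The grid should exclude 1.0 so that unseen words never get probability 0."""
    return max(grid, key=lambda lam: log_likelihood(words, p_uni, p_tri, lam))

def select_documents(scored_docs, words, p_tri, build_unigram, grid, n_bins=10):
    """scored_docs: list of (similarity, document) for the retrieved documents.
    Tries the top 1/n_bins, 2/n_bins, ... of the similarity range and returns
    (log-likelihood, lambda, documents) for the best cut-off."""
    hi = max(s for s, _ in scored_docs)
    lo = min(s for s, _ in scored_docs)
    best = None
    for j in range(1, n_bins + 1):
        # The last step uses lo itself to avoid floating-point drift.
        threshold = lo if j == n_bins else hi - j * (hi - lo) / n_bins
        docs = [d for s, d in scored_docs if s >= threshold]
        p_uni = build_unigram(docs)
        lam = best_lambda(words, p_uni, p_tri, grid)
        ll = log_likelihood(words, p_uni, p_tri, lam)
        if best is None or ll > best[0]:
            best = (ll, lam, docs)
    return best
```

Ties are resolved in favor of the higher threshold, so the smallest document set that attains the maximum likelihood is returned.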
Note that significant WER improvements are
obtained from the CL-interpolated LM using
likelihood-based story-specific adaptation even for
the case of the XINHUA LM. Furthermore, the per-
formance of the CL-interpolated LM is even better
than the topic-dependent LM. This is remarkable,
since the CL-interpolated LM is based on unigram
statistics from English documents, while the topic-
trigram LM is based on trigram statistics. We be-
lieve that the contemporaneous and story-specific
nature of the English document leads to its rela-
tively higher effectiveness. Our conjecture, that the
contemporaneous cross-lingual statistics and static
topic-trigram statistics are complementary, is sup-
ported by the significant further improvement in
WER obtained by the interpolation of the two LMs,
as shown on the last line for XINHUA.
The significant gains in ASR performance in the
resource-deficient HUB-4NE case are evident. The
small size of the HUB-4NE corpus makes topic
models ineffective.
6.2 Comparison of Cross-Lingual Triggers
with Stochastic Translation Dictionaries
Once we select cross-lingual trigger-pairs as
described in Section 3, $P_T(c|e)$ in (1) is replaced by
$P_{\text{Trig}}(c|e)$ of (5), and $P_T(e|c)$ in (3) by
$P_{\text{Trig}}(e|c)$. Therefore, given a set of
cross-lingual trigger-pairs, the trigger-based models do
not require a translation lexicon. Furthermore, a
document-aligned comparable corpus is all that is
required to construct the set of trigger-pairs. We
otherwise follow the same experimental procedure as above.
As Table 2 shows, the trigger-based model
(Trig-interpolated) performs only slightly worse than the
CL-interpolated model. One explanation for this
degradation is that the CL-interpolated model is
trained from a sentence-aligned corpus, while the
trigger-based model is trained from a document-aligned
corpus. There are two steps which could be affected
by this difference, one being CLIR and the other
being the translation of the $d^E_i$'s into Chinese. Some
errors in CLIR may however be masked by our
likelihood-based story-specific adaptation scheme,
since it finds optimal retrieval settings, dynamically
adjusting the number of English documents as well
as the interpolation weight, even if CLIR performs
somewhat suboptimally. Furthermore, a document-
aligned corpus is much easier to build. Thus a much
bigger and more reliable comparable corpus may be
used, and eventually more accurate trigger-pairs will
be acquired.
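The likelihood-based, story-specific selection mentioned above can be sketched as a small grid search. This is a minimal illustration, not the exact procedure used in the experiments: it assumes a linear interpolation of a static trigram probability with a cross-lingual unigram probability, and the function names, the candidate grid, and the toy probabilities are all hypothetical.

```python
import math

def log_likelihood(p_trigram, p_cross, lam):
    """Log-likelihood of a story under lam * p_cross + (1 - lam) * p_trigram."""
    return sum(math.log(lam * pc + (1.0 - lam) * pt)
               for pt, pc in zip(p_trigram, p_cross))

def select_adaptation(p_trigram, p_cross_by_n, lambdas):
    """Return the (N, lambda) pair that maximizes the story log-likelihood.

    p_trigram    : per-word static trigram probabilities for the story text
    p_cross_by_n : dict mapping N (number of English documents used) to the
                   per-word cross-lingual unigram probabilities obtained with N
    lambdas      : candidate interpolation weights in [0, 1)
    """
    best = None
    for n, p_cross in p_cross_by_n.items():
        for lam in lambdas:
            ll = log_likelihood(p_trigram, p_cross, lam)
            if best is None or ll > best[0]:
                best = (ll, n, lam)
    return best[1], best[2]

# Toy per-word probabilities for a four-word story (illustrative values only).
p_trigram = [0.10, 0.02, 0.05, 0.01]
p_cross_by_n = {
    1: [0.20, 0.01, 0.02, 0.05],
    5: [0.15, 0.04, 0.06, 0.04],
}
lambdas = [0.0, 0.25, 0.5, 0.75]

n_best, lam_best = select_adaptation(p_trigram, p_cross_by_n, lambdas)
print(n_best, lam_best)
```

Because both parameters are chosen per story, a story with poor retrieval results can fall back toward the static trigram model by selecting a small weight, which is one way the masking of CLIR errors described above could arise.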
We note with some satisfaction that even simple
trigger-pairs selected on the basis of mutual infor-
mation are able to achieve perplexity and WER re-
ductions comparable to a stochastic translation lex-
icon: the smallest p-value at which the difference
between the WERs of the CL-interpolated LM and
the Trig-interpolated LM in Table 2 would be signif-
icant is a165a78a9a60a187 for XINHUA and a165a78a9a189a188 for HUB-4NE.
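To make the selection of trigger-pairs by mutual information concrete, the following sketch scores English-Chinese word pairs from a document-aligned corpus. It assumes document-level binary co-occurrence events and a fixed threshold; the exact statistic and selection rule of Section 3 may differ, and the function names, the threshold, and the romanized toy words are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def average_mutual_information(n_ec, n_e, n_c, n_docs):
    """Average MI (in nats) between the binary events 'e occurs in the English
    document' and 'c occurs in the aligned Chinese document'."""
    mi = 0.0
    for e_on, c_on in product((True, False), repeat=2):
        # Joint count for this cell of the 2x2 contingency table.
        if e_on and c_on:
            joint = n_ec
        elif e_on:
            joint = n_e - n_ec
        elif c_on:
            joint = n_c - n_ec
        else:
            joint = n_docs - n_e - n_c + n_ec
        if joint == 0:
            continue  # 0 * log(0) is taken as 0
        p_joint = joint / n_docs
        p_e = (n_e if e_on else n_docs - n_e) / n_docs
        p_c = (n_c if c_on else n_docs - n_c) / n_docs
        mi += p_joint * math.log(p_joint / (p_e * p_c))
    return mi

def select_trigger_pairs(doc_pairs, threshold):
    """doc_pairs: list of (english_words, chinese_words), one per aligned pair.
    Returns {(e, c): MI} for co-occurring pairs whose MI exceeds the threshold."""
    n_docs = len(doc_pairs)
    n_e, n_c, n_ec = Counter(), Counter(), Counter()
    for en_words, zh_words in doc_pairs:
        en_set, zh_set = set(en_words), set(zh_words)
        n_e.update(en_set)
        n_c.update(zh_set)
        n_ec.update(product(en_set, zh_set))
    scored = {
        (e, c): average_mutual_information(k, n_e[e], n_c[c], n_docs)
        for (e, c), k in n_ec.items()
    }
    return {pair: mi for pair, mi in scored.items() if mi > threshold}

# Toy document-aligned corpus; Chinese words are romanized placeholders.
doc_pairs = [
    (["bank", "loan"], ["yinhang", "daikuan"]),
    (["bank", "rate"], ["yinhang", "lilv"]),
    (["rain"], ["yu"]),
    (["rain", "wind"], ["yu", "feng"]),
]
pairs = select_trigger_pairs(doc_pairs, threshold=0.6)
print(sorted(pairs))
```

Only document alignment is needed to collect these counts, which is why no sentence-aligned corpus or translation lexicon enters the computation.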
Triggers (4) vs (5): We compare the alternative
$P_{\mathrm{Trig}}(\cdot \mid \cdot)$ definitions (4) and (5) for replacing $P_T(\cdot \mid \cdot)$
in (1). The resulting CL-interpolated LM (2) yields a
perplexity of 370 on the XINHUA test set using (4),
compared to 367 using (5). Similarly, on the HUB-
4NE test set, using (4) yields 736, while (5) yields
727. Therefore, (5) has been used throughout.
            XINHUA                                                          HUB-4NE
Perp   WER     CER     p-value    Language model               Perp   WER     CER     p-value
426    49.9%   28.8%   –          Baseline Trigram             1195   60.1%   44.1%   –
381    49.1%   28.4%   0.003      Topic-trigram                1122   60.0%   44.1%   0.660
367    49.1%   28.6%   0.004      Trig-interpolated            727    58.8%   43.3%   < 0.001
346    48.8%   28.4%   < 0.001    CL-interpolated              630    58.8%   43.1%   < 0.001
340    48.7%   28.4%   < 0.001    Topic + Trig-interpolated    730    59.2%   43.5%   0.002
326    48.5%   28.2%   < 0.001    Topic + CL-interpolated      631    59.0%   43.3%   < 0.001
320    48.3%   28.1%   < 0.001    Topic + Trig- + CL-interp.   627    59.0%   43.3%   < 0.001
Table 2: Perplexity and ASR Performance with a Likelihood-Based Story-Specific Selection of the Number
of English Documents $d_i^E$'s and Interpolation Weight $\lambda$ for Each Mandarin Story.
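The relative and absolute reductions implied by Table 2 can be recomputed directly. The short check below uses the Baseline Trigram row against the Topic + CL-interpolated row for XINHUA and the CL-interpolated row for HUB-4NE; pairing these rows with the summary figures in Section 7 is an inference from the numbers, not something the table states.

```python
def relative_reduction(baseline, adapted):
    """Relative reduction of `adapted` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - adapted) / baseline

# (perplexity, WER in %) copied from Table 2.
xinhua_baseline, xinhua_best = (426, 49.9), (326, 48.5)    # Topic + CL-interpolated
hub4ne_baseline, hub4ne_best = (1195, 60.1), (630, 58.8)   # CL-interpolated

for name, base, best in [("XINHUA", xinhua_baseline, xinhua_best),
                         ("HUB-4NE", hub4ne_baseline, hub4ne_best)]:
    ppl_gain = relative_reduction(base[0], best[0])
    wer_gain = base[1] - best[1]
    print(f"{name}: perplexity -{ppl_gain:.1f}% relative, WER -{wer_gain:.1f}% absolute")
```

Perplexity reductions are relative to the baseline, whereas the WER reductions are absolute differences in percentage points.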
7 Conclusions and Future Work
We have demonstrated a statistically significant im-
provement in ASR WER (1.4% absolute) and in
perplexity (23%) by exploiting cross-lingual side-
information even when a nontrivial amount of train-
ing data is available, as seen on the XINHUA cor-
pus. Our methods are even more effective when LM
training text is hard to come by in the language of
interest: a 47% reduction in perplexity and a 1.3% ab-
solute reduction in WER, as seen on the HUB-4NE corpus.
Most of these gains come from the optimal choice of
adaptation parameters. The ASR test data we used
in our experiments is derived from a different source
than the corpus on which the translation and trigger
models are trained, and the techniques work even
when the bilingual corpus is only document-aligned,
which is a realistic reflection of the situation in a
resource-deficient language.
We are developing maximum entropy models to
more effectively combine the multiple information
sources we have used in our experiments, and expect
to report the results in the near future.

References
P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer.
1993. The mathematics of statistical machine trans-
lation: Parameter estimation. Computational Linguis-
tics, 19(2):269–311.
W. Byrne, P. Beyerlein, J. Huerta, S. Khudanpur,
B. Marthi, J. Morgan, N. Peterek, J. Picone, D. Ver-
gyri, and W. Wang. 2000. Towards language indepen-
dent acoustic modeling. In Proc. ICASSP, volume 2,
pages 1029–1032.
P. Fung et al. 2000. Pronunciation modeling of Mandarin
casual speech. 2000 Johns Hopkins Summer Work-
shop.
D. Doermann et al. 2002. Lexicon acquisition from
bilingual dictionaries. In Proc. SPIE Photonic West
Article Imaging Conference, pages 37–48, San Jose,
CA.
R. Iyer and M. Ostendorf. 1999. Modeling long-distance
dependence in language: topic-mixtures vs dynamic
cache models. IEEE Transactions on Speech and Au-
dio Processing, 7:30–39.
S. Khudanpur and W. Kim. 2002. Using cross-language
cues for story-specific language modeling. In Proc.
ICSLP, volume 1, pages 513–516, Denver, CO.
F. J. Och and H. Ney. 2000. Improved statistical align-
ment models. In ACL00, pages 440–447, Hong Kong,
China, October.
D. Pallett, W. Fisher, and J. Fiscus. 1990. Tools for
the analysis of benchmark speech recognition tests.
In Proc. ICASSP, volume 1, pages 97–100, Albu-
querque, NM.
R. Rosenfeld. 1996. A maximum entropy approach
to adaptive statistical language modeling. Computer
Speech and Language, 10:187–228.
T. Schultz and A. Waibel. 1998. Language independent
and language adaptive large vocabulary speech recog-
nition. In Proc. ICSLP, volume 5, pages 1819–1822,
Sydney, Australia.
C. Tillmann and H. Ney. 1997. Word triggers and the EM
algorithm. In Proceedings of the Workshop Computa-
tional Natural Language Learning (CoNLL 97), pages
117–124, Madrid, Spain.
D. Yarowsky, G. Ngai, and R. Wicentowski. 2001. In-
ducing multilingual text analysis tools via robust pro-
jection across aligned corpora. In Proc. HLT 2001,
pages 109–116, San Francisco, CA, USA.
