Proceedings of the ACL Student Research Workshop, pages 121–126,
Ann Arbor, Michigan, June 2005. ©2005 Association for Computational Linguistics
Speech Recognition of Czech - Inclusion of Rare Words Helps
Petr Podveský and Pavel Machek
Institute of Formal and Applied Linguistics
Charles University
Prague, Czech Republic
{podvesky,machek}@ufal.mff.cuni.cz
Abstract
Large vocabulary continuous speech
recognition of inflective languages, such
as Czech, Russian or Serbo-Croatian, is
severely degraded by an excessive
out-of-vocabulary (OOV) rate. In this paper,
we tackle the problem of vocabulary selection,
language modeling and pruning for inflective
languages. We show that by explicit
reduction of the OOV rate we
can achieve significant improvements
in recognition accuracy while almost
preserving the model size. Reported
results are on Czech speech corpora.
1 Introduction
Large vocabulary continuous speech recognition of
inflective languages is a challenging task for two main
reasons. Rich morphology generates a huge number of
forms which are not captured by limited-size
dictionaries, and therefore leads to worse recognition
results. Relatively free word order admits an enormous
number of word sequences and thus impoverishes
n-gram language models. In this paper we are
concerned with the former issue.
Previous work on excessive vocabulary growth follows
two main lines. Authors have either broken words into
sub-word units or adapted dictionaries in a multi-pass
scenario. On Czech data, Byrne et al. (2001) suggest
using linguistically motivated recognition units. Words
are broken down into stems and endings, which serve as
the recognition units in the first recognition phase. In
the second phase, stems and endings are concatenated.
On Serbo-Croatian, Geutner et al. (1998) also tested
morphemes as the recognition units. Both groups of
authors agreed that this approach is not beneficial for
speech recognition of inflective languages. Vocabulary
adaptation, however, brought considerable improvement.
Both Ircing and Psutka (2001) on Czech and Geutner et
al. (1998) on Serbo-Croatian reported a substantial
reduction of word error rate. Both groups followed the
same procedure. In the first pass, they used a dictionary
composed of the most frequent words. The generated
lattices were then processed to get a list of all words
which appeared in them. This list served as the basis for
a new adapted dictionary into which morphological
variants were added.
It can be concluded that large corpora contain a
host of words which are ignored during estimation
of language models used in the first pass, despite the fact
that these rare words can bring substantial improvement.
Therefore, it is desirable to explore how to incorporate
rare or even unseen words into a language
model which can be used in a first pass.
2 Language Model
Language models used in the first pass of current
speech recognition systems are usually built in the
following way. First, a text corpus is acquired.
In the case of broadcast news, a newspaper collection
or news transcriptions are a good source. Second,
the most frequent words are picked out to form a
dictionary. Dictionary size is typically in the tens of
thousands of words. For English, for example,
dictionaries of 60k words sufficiently cover common domains.
(Of course, for recognition of entries listed in the
Yellow Pages, such limited dictionaries are clearly
inappropriate.) Third, an n-gram language model is
estimated. In the case of a Katz back-off model, the
conditional bigram word probability is estimated as
$$
P_1(w_i \mid w_{i-1}) =
\begin{cases}
\tilde{P}(w_i \mid w_{i-1}) & \text{if } N(w_{i-1}, w_i) > 0 \\
\mathrm{BO}(w_{i-1}) \cdot \tilde{P}(w_i) & \text{otherwise}
\end{cases}
\tag{1}
$$

where $\tilde{P}$ represents a smoothed probability distribution,
$\mathrm{BO}(\cdot)$ stands for the back-off weight, and $N(\cdot)$
denotes the count of its argument. A back-off model
can also be viewed as a finite-state automaton,
as depicted in Figure 1.
[Figure 1 (diagram): states $w_1$, $w_2$ and the back-off state $bo$, with arcs
labelled $w_2 / \tilde{P}(w_2 \mid w_1)$, $\varepsilon / \mathrm{BO}(w_1)$ and $w_2 / \tilde{P}(w_2)$.]
Figure 1: A fragment of a bigram back-off model
represented as a finite-state automaton.
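
The case distinction of Equation (1) can be illustrated with a short Python sketch. The function and variable names, the toy probabilities and the convention that an unseen history gets a back-off weight of 1 are all assumptions made for this illustration; they are not taken from the system described in this paper.

```python
from typing import Dict, Tuple


def backoff_bigram_prob(
    word: str,
    prev: str,
    bigram_probs: Dict[Tuple[str, str], float],
    unigram_probs: Dict[str, float],
    backoff_weights: Dict[str, float],
) -> float:
    """Equation (1): smoothed bigram if the bigram was observed,
    otherwise back-off weight times the smoothed unigram."""
    if (prev, word) in bigram_probs:  # corresponds to N(prev, word) > 0
        return bigram_probs[(prev, word)]
    # Assumption: a history without a stored weight backs off with weight 1.
    return backoff_weights.get(prev, 1.0) * unigram_probs[word]


# Toy numbers, invented for illustration only.
unigram_probs = {"pes": 0.5, "psa": 0.3, "psovi": 0.2}
bigram_probs = {("velký", "pes"): 0.7}
backoff_weights = {"velký": 0.6}

seen = backoff_bigram_prob("pes", "velký", bigram_probs, unigram_probs, backoff_weights)
unseen = backoff_bigram_prob("psa", "velký", bigram_probs, unigram_probs, backoff_weights)
total = sum(
    backoff_bigram_prob(w, "velký", bigram_probs, unigram_probs, backoff_weights)
    for w in unigram_probs
)
```

The toy back-off weight was chosen so that the distribution for the history "velký" sums to one, which is the property a properly estimated back-off model has.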
To alleviate the problem of a high OOV rate, we suggest
gathering supplementary words and adding them
to the model in the following way:

$$
P(w_i \mid w_{i-1}) =
\begin{cases}
P_1(w_i \mid w_{i-1}) & w_i \in V \\
\mathrm{BO}(w_{i-1}) \cdot g(w_i) & w_i \in S
\end{cases}
\tag{2}
$$

Here $P_1(\cdot)$ refers to the regular back-off model, $V$ denotes
the regular dictionary from which the back-off
model was estimated, $S$ is the supplementary dictionary,
which does not overlap with $V$, and $g(w_i)$ is the score
assigned to a supplementary word $w_i$.
Several sources can be exploited to obtain supplementary
dictionaries. Morphology tools can derive words which
are close to those observed in the corpus. In such a case,
the score $g(w_i)$ of a supplementary word can be set to a
constant, estimated on held-out data to maximize
recognition accuracy:

$$
g(w_i) = \mathrm{const} \quad \text{for } w_i \text{ generated by morphology}
\tag{3}
$$
Given prior domain knowledge, new words which
are expected to appear in the audio recordings can be
collected and added to the supplementary dictionary $S$.
Consider the example of transcribing an ice-hockey
tournament: the names of new players should be in the
vocabulary. Another source of $S$ is the set of words which
fell below the selection threshold of the regular dictionary
$V$. In large corpora, there are hundreds of thousands of
words which are omitted from the estimated language model.
We suggest putting them into $S$. As it turned out, the
unigram probability of these words is very low, so it is
suitable to increase their score to make them competitive
with the words in $V$ during recognition. The injected score
$g(w_i)$ is then computed as

$$
g(w_i) = \mathit{shift} \cdot f(w_i)
\tag{4}
$$

where $f(w_i)$ refers to the relative frequency of $w_i$ in
a given corpus and $\mathit{shift}$ denotes a shifting factor which
should be tuned on held-out data.
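
A minimal Python sketch of Equations (2) to (4) follows. It writes V for the regular dictionary, S for the supplementary dictionary and g for the injected score. All identifiers, the stand-in regular model and the toy values are hypothetical and serve only to show how the two score types plug into the same lookup.

```python
from typing import Callable, Dict, Set

Score = Callable[[str], float]


def constant_score(const: float) -> Score:
    """Equation (3): every supplementary word receives the same score."""
    return lambda word: const


def shifted_unigram_score(rel_freq: Dict[str, float], shift: float) -> Score:
    """Equation (4): relative frequency multiplied by a tuned shift factor."""
    return lambda word: shift * rel_freq[word]


def injected_model_score(
    word: str,
    prev: str,
    regular_vocab: Set[str],
    supplementary_vocab: Set[str],
    regular_model: Callable[[str, str], float],
    backoff_weights: Dict[str, float],
    g: Score,
) -> float:
    """Equation (2): regular back-off model for words in V,
    back-off weight times injected score for words in S."""
    if word in regular_vocab:
        return regular_model(word, prev)
    if word in supplementary_vocab:
        # Assumption: histories without a stored weight back off with weight 1.
        return backoff_weights.get(prev, 1.0) * g(word)
    raise KeyError(f"{word!r} is in neither dictionary")


# Toy setup, invented for illustration only.
regular_vocab = {"pes", "psa"}
supplementary_vocab = {"psovi"}
regular_model = lambda word, prev: 0.25  # stand-in for the model of Equation (1)
backoff_weights = {"velký": 0.5}
g = shifted_unigram_score({"psovi": 1e-6}, shift=100.0)

in_vocab = injected_model_score(
    "pes", "velký", regular_vocab, supplementary_vocab, regular_model, backoff_weights, g
)
injected = injected_model_score(
    "psovi", "velký", regular_vocab, supplementary_vocab, regular_model, backoff_weights, g
)
```

Keeping g as a separate callable means the constant score of Equation (3) and the shifted unigram score of Equation (4) can be swapped without touching the lookup itself.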
[Figure 2 (diagram): the automaton of Figure 1, with states $w_1$, $w_2$ and the back-off state $bo$
and arcs labelled $w_2 / \tilde{P}(w_2 \mid w_1)$, $\varepsilon / \mathrm{BO}(w_1)$ and $w_2 / \tilde{P}(w_2)$,
extended by arcs $u_1 / g(u_1)$ and $u_2 / g(u_2)$ for supplementary words $u_1$ and $u_2$.]
Figure 2: A fragment of a bigram back-off model
injected with a supplementary dictionary.
Note that the probability of a word given its history
is no longer a proper probability: it does not add
up to one. We decided not to normalize the model
for two reasons. First, we used a decoder which
searches for the best path using the Viterbi criterion, so
there is no need for normalization. Second, normalization
would have involved recomputing all back-off
model weights and could also have required re-tuning
of the language model scaling factor. To rule out
any variation which the re-tuning of the scaling factor
could bring, we decided not to normalize the new
model.
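
The excess probability mass can be made explicit under the assumption that the regular back-off model is itself normalized over the regular dictionary. Writing $V$ for the regular dictionary, $S$ for the supplementary dictionary, $g$ for the injected score, $P_1$ for the regular back-off model and $\mathrm{BO}$ for the back-off weight (notation chosen for this illustration), summing the injected model over all words gives:

```latex
\sum_{w \in V \cup S} P(w \mid w_{i-1})
  = \sum_{w \in V} P_1(w \mid w_{i-1})
    + \mathrm{BO}(w_{i-1}) \sum_{w \in S} g(w)
  = 1 + \mathrm{BO}(w_{i-1}) \sum_{w \in S} g(w)
  \;\ge\; 1
```

The inequality is strict whenever the back-off weight and at least one injected score are positive, so the surplus grows with the size of the supplementary dictionary and with the shift factor.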
In the finite-state representation, injection of a new
dictionary was implemented as depicted in Figure
2. Supplementary words form a loop in the back-off
state.
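
The loop construction can be sketched on a toy arc list. The sketch reads "a loop in the back-off state" as one self-loop per supplementary word on that state, which is an interpretation; the `Arc` type, the state names and the weights are hypothetical, and weights are kept as plain probabilities for readability although a real decoder would typically store negative log scores.

```python
from typing import Dict, List, NamedTuple


class Arc(NamedTuple):
    src: str
    label: str
    weight: float
    dst: str


def inject_loop(
    arcs: List[Arc], scores: Dict[str, float], backoff_state: str = "bo"
) -> List[Arc]:
    """Return a new arc list with one self-loop on the back-off state
    for every supplementary word; the input list is left untouched."""
    loops = [Arc(backoff_state, word, score, backoff_state) for word, score in scores.items()]
    return arcs + loops


# Fragment corresponding to Figure 1 (toy weights).
base = [
    Arc("w1", "w2", 0.7, "w2"),     # bigram arc
    Arc("w1", "<eps>", 0.6, "bo"),  # back-off arc
    Arc("bo", "w2", 0.3, "w2"),     # unigram arc
]
injected = inject_loop(base, {"u1": 1e-4, "u2": 2e-4})
```

Because the new arcs touch only the back-off state, the number of added arcs equals the number of supplementary words, independently of how many histories the bigram model contains.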
3 Experiments
We have evaluated our approach on two corpora,
Czech Broadcast News and the Czech portion of
MALACH data.
3.1 Czech Broadcast News Data
The Czech Broadcast News (Radová et al., 2004) is
a collection of both radio and TV news in Czech.
Weather forecasts, traffic announcements and sports
news were excluded from this corpus. Our training
portion comprises 22 hours of speech. To tune
the language model scaling factor and additional LM
parameters, we set aside 100 sentences. The test set
consists of 2500 sentences.
We used the HTK toolkit (Young et al., 1999) to
extract acoustic features from the sampled signal and to
estimate acoustic models. As acoustic features we
used 12 Mel-Frequency Cepstral Coefficients plus
energy and delta and delta-delta features. We trained
a triphone acoustic model with tied mixtures of
continuous density Gaussians.
As the LM training corpus we used a collection
of newspaper articles from the Lidové Noviny (LN)
newspaper. This collection was published as a part
of the Prague Dependency Treebank by LDC (Hajič
et al., 2001). This corpus contains 33 million tokens.
Its vocabulary contains more than 650k word forms.
OOV rates are displayed in Table 1.
Dict. size OOV
60k 8.27%
80k 6.92%
124k 5.20%
371k 2.23%
658k 1.63%
Table 1: OOV rate of transcriptions of the test data.
Dictionaries contain the most frequent words.
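
The measurement behind Table 1 can be sketched as follows, assuming that the OOV rate is counted over running words (tokens) of the test transcriptions and that the dictionary holds the k most frequent training words. The function names and the toy data are illustrative only.

```python
from collections import Counter
from typing import Iterable, List, Set


def top_k_vocabulary(training_tokens: Iterable[str], k: int) -> Set[str]:
    """Dictionary made of the k most frequent training words."""
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(k)}


def oov_rate(test_tokens: List[str], vocabulary: Set[str]) -> float:
    """Fraction of running test words that are missing from the dictionary."""
    if not test_tokens:
        raise ValueError("test_tokens must be non-empty")
    misses = sum(1 for token in test_tokens if token not in vocabulary)
    return misses / len(test_tokens)


# Toy data, invented for illustration only.
train = "pes pes pes psa psa psovi".split()
test = "pes psa psovi psem".split()

vocab = top_k_vocabulary(train, k=2)  # keeps "pes" and "psa"
rate = oov_rate(test, vocab)          # "psovi" and "psem" are out of vocabulary
```

Even in this toy case the out-of-vocabulary items are inflected forms of a word that is already in the dictionary, which is the situation the supplementary dictionary is meant to address.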
As can be readily observed, moderate-size vocabularies
do not sufficiently cover the test data transcriptions.
Therefore they are one of the major
sources of poor recognition performance.
The baseline language model was estimated from
the 60k most frequent words. It was a bigram Katz
back-off model with Kneser-Ney smoothing, pruned
by the entropy-based method (Stolcke, 1998).
As the supplementary dictionary we took the rest
of words from the LN corpus. To learn the impact
of injection of infrequent words, we carried out two
experiments.
First, we built a uniform loop which was injected
into the back-off model. The uniform distribution
was tuned on the held-out data. Tuning of this con-
stant is displayed in Table 2.
Uniform scale WER
12 18.89%
11 18.68%
10 18.40%
9 21.00%
Table 2: Tuning of uniform distribution on the held-
out set. WER denotes the word error rate.
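
For reference, word error rate is conventionally the word-level edit distance between the reference and the recognized word sequence, divided by the reference length. The sketch below is a generic implementation of that convention, not the scoring tool used in these experiments; the function name and the toy sentences are illustrative.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # prev[j]: edit distance between the reference prefix seen so far and hypothesis[:j]
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word has no counterpart
            insertion = curr[j - 1] + 1  # hypothesis word has no counterpart
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)


# One substitution ("b" -> "x") and one deletion ("d") against four reference words.
wer = word_error_rate("a b c d".split(), "a x c".split())
```

An out-of-vocabulary reference word can never be matched, so each one contributes at least one error under this measure.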
Second, we took relative frequencies multiplied
by a shift coefficient as the injected model scores.
This shift coefficient was again tuned on held-out
data as shown in Table 3.
Unigram shift WER
no shift 19.52%
a77a54a78 18.54%
a77a48a79 17.91%
a77a54a80 18.75%
Table 3: Tuning of the shift coefficient of the unigram
model on the held-out set.
Then, we took the best parameters and used them
for recognition of the test data. Recognition results
are shown in Table 4. The injection of supplementary
words helped decrease both the recognition
word error rate and the oracle word error rate. By oracle
WER we mean the WER of the path stored in the generated
lattice which best matches the utterance, regardless
of the scores. In other words, oracle WER gives us
a bound on how well we can do by tuning scores in
a given lattice. Injection of the shifted unigram model
brought a relative improvement of 13.6% in terms of
WER over the 60k baseline model. Uniform injection
also brought a significant improvement despite its
simplicity: we observed more than 10% relative
improvement in terms of WER. In terms of oracle
WER, unigram injection brought more than 30%
relative improvement.
Model                      WER      OWER
Baseline 60k               29.17%   15.90%
Baseline 80k               27.44%   14.31%
60k + Uniform injection    26.12%   11.10%
60k + Unigram injection    25.21%   11.03%
Table 4: Evaluation on 2500 test sentences. OWER
stands for the oracle error rate.
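
The relative improvements quoted in the text follow from Table 4 by simple arithmetic, reproduced here as a small Python check. The helper name is illustrative; the input figures are those of Table 4.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative reduction of an error rate, in percent of the baseline."""
    return 100.0 * (baseline - improved) / baseline


# All improvements are measured against the 60k baseline of Table 4.
unigram_wer_gain = relative_improvement(29.17, 25.21)    # about 13.6
uniform_wer_gain = relative_improvement(29.17, 26.12)    # about 10.5
unigram_oracle_gain = relative_improvement(15.90, 11.03) # about 30.6
```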
It is worthwhile to mention the model size, since it
could be argued that the improvement was achieved
by an enormous increase of the model. We decided
to measure the model size using two factors:
the disk space occupied by the language model and
the disk space taken up by the so-called CLG. By
CLG we mean a transducer which maps triphones
to words, augmented with the model scores. This
transducer represents the search space investigated
during recognition. More details on transducers in
speech recognition can be found in (Mohri et al.,
2002). Table 5 summarizes the sizes of the evaluated
models.
Model            CLG size   G size
Baseline 60k     399MB      106MB
60k + Uniform    405MB      115MB
60k + Unigram    405MB      115MB
Baseline 80k     441MB      116MB
Table 5: Model size comparison measured in disk
space. G denotes a language model compiled as
a finite-state automaton. CLG denotes a transducer
mapping triphones to words, augmented with model
scores.
Injection of supplementary words increased the
model size only slightly. To see the difference in the
size of injected models and traditionally built ones,
we constructed a model of 80k most frequent words
and pruned with the same threshold as the 60k LM.
Not only did this 80k model give worse recognition
results, but it also proved to be bigger.
3.2 MALACH Data
The next data we tested our approach on was
the Czech portion of the MALACH corpus
(http://www.clsp.jhu.edu/research/malach).
MALACH is a multilingual audio-visual corpus.
It contains recordings of survivors of World War
II talking about war events. 600 people spoke in
Czech, but only 350 recordings had been digitized
by the end of 2003. The interviewer and the interviewee
had separate microphones, and were recorded on
separate stereo channels. Recordings were stored in
the MPEG-1 format. Average length of a testimony
is 1.9 hours.
30 minutes from each testimony were transcribed
and used as training data. 10 testimonies were tran-
scribed completely and used for testing. The acous-
tic model used 15-dimensional PLP cepstral fea-
tures, sampled at 10 msec. Modeling was done using
the HTK Toolkit.
The baseline language model was estimated from
transcriptions of the survivors’ testimonies. We
worked with the standardized version of the tran-
scriptions. More details regarding the Czech portion
of the MALACH data can be found in (Psutka et al.,
2004). Transcriptions are 610k words long and the
entire vocabulary comprises 41k words. We refer to
this corpus as TR41k.
To obtain a supplementary vocabulary, we used
Czech morphology tools (Hajič and Vidová-Hladká,
1998). Out of 41k words we generated 416k words
which were the inflected forms of the words observed
in the corpus. Note that we posed restrictions
on the generation procedure to avoid obsolete, archaic
and uncommon expressions. To do so, we ran
a Czech tagger on the transcriptions and thus obtained
a list of all morphological tags of observed
forms. The morphological generation was then confined
to this set of tags.
Since there is no corpus to train unigram scores
of generated words on, we set the LM score of the
generated forms to a constant.
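
The restricted generation step described above can be sketched as a filter over the output of a morphological generator. The generator interface (`generate`, returning form and tag pairs per lemma), the tag strings and the toy paradigm are hypothetical stand-ins; they do not reproduce the interface or tagset of the Czech morphology tools used here.

```python
from typing import Callable, Iterable, List, Set, Tuple


def restricted_generation(
    lemmas: Iterable[str],
    observed_tags: Set[str],
    generate: Callable[[str], List[Tuple[str, str]]],
    known_words: Set[str],
) -> Set[str]:
    """Keep only generated forms whose tag was observed in the tagged
    transcriptions and which are not already in the regular dictionary."""
    supplementary = set()
    for lemma in lemmas:
        for form, tag in generate(lemma):
            if tag in observed_tags and form not in known_words:
                supplementary.add(form)
    return supplementary


# Hypothetical toy paradigm with simplified, made-up tags.
toy_paradigms = {
    "pes": [
        ("pes", "NOM.SG"),
        ("psa", "GEN.SG"),
        ("psovi", "DAT.SG"),
        ("psové", "NOM.PL.RARE"),
    ]
}
generate = lambda lemma: toy_paradigms.get(lemma, [])

new_words = restricted_generation(
    lemmas=["pes"],
    observed_tags={"NOM.SG", "GEN.SG", "DAT.SG"},
    generate=generate,
    known_words={"pes", "psa"},
)
```

In this toy run only "psovi" survives: "psové" carries a tag that never occurred in the transcriptions, and the other two forms are already in the dictionary.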
The transcriptions are not the only source of text
data in the MALACH project. (Psutka et al., 2004)
searched the Czech National Corpus (CNC) for sen-
tences which are similar to the transcriptions. This
additional corpus contains almost 16 million words,
330k types. CNC vocabulary overlaps to a large ex-
tent with TR vocabulary. This fact is not surprising
since the selection criterion was based on a lemma
unigram probability. Table 6 summarizes OOV rates
of several dictionaries.
Dictionary name       Size   OOV
TR41k                 41k    5.07%
TR41k + Morph416k     416k   2.74%
TR41k + CNC60k        79k    3.04%
TR41k + CNC100k       114k   2.62%
TR41k + CNC160k       171k   2.25%
TR41k + CNC329k       337k   1.76%
All together          630k   1.46%
Table 6: OOV for several dictionaries. TR and CNC
denote the transcriptions and the Czech National Corpus,
respectively. Morph refers to the dictionary generated
by the morphology tools from TR. Numbers
in the dictionary names represent the dictionary
size.
We estimated several language models. The baseline
models are pruned bigram back-off models with
Kneser-Ney smoothing. The baseline word error
rate of the model built solely from transcriptions was
37.35%. We injected a constant loop of morphological
variants into this model. In terms of text coverage,
this action reduced the OOV rate from 5.07% to 2.74%.
In terms of recognition word error rate, we observed
a relative improvement of 3.5%.
In the next experiment we took as the baseline LM
a linear interpolation of the LM built from transcriptions
and a model estimated from the CNC corpus.
Into this model, we injected a unigram loop of all
the available words, that is, the remaining words from
the CNC corpus with unigram scores and the words
provided by morphology which were not already in the
model. Table 7 summarizes the achieved WER and
oracle WER. Given that the injection only
slightly reduced the OOV rate, a small relative
reduction of 2.3% matched our expectations.
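
The linear interpolation used for this baseline has the conventional form given below. The weight $\lambda$ and the subscripts TR and CNC are notation introduced for this illustration; the value of the interpolation weight is not stated in this paper.

```latex
P_{\mathrm{interp}}(w_i \mid w_{i-1})
  = \lambda \, P_{\mathrm{TR}}(w_i \mid w_{i-1})
  + (1 - \lambda) \, P_{\mathrm{CNC}}(w_i \mid w_{i-1}),
  \qquad 0 \le \lambda \le 1
```

Since both components are proper distributions and the weights sum to one, the interpolated model is again a proper distribution before any loop is injected.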
Model                       WER      OWER
TR41k                       37.35%   14.40%
TR41k + Uniform Morph       36.06%   12.48%
TR41k + CNC 100k            34.47%   11.95%
TR41k + CNC 100k + Inj      33.67%   10.79%
TR41k + CNC 160k            34.19%   11.65%
Table 7: Word error rate (WER) and oracle WER (OWER)
for baseline and injected models. Uniform Morph refers
to the constant uniform loop of the morphology-generated
words. Inj denotes the loop of the rest
of the words of the CNC corpus and the
morphology-generated words.
To learn how the injection affected model size, we
measured size of the language model automaton and
the optimized triphone-to-word transducer. As in the
case of the LN corpus, injection increased the model
size only moderately. Sizes of the models are shown
in Table 8.
Model                       CLG      G
TR41k                       38MB     5.6MB
TR41k + Morph               54MB     11MB
TR41k + CNC 100k            283MB    53MB
TR41k + CNC 100k + Inj      307MB    61MB
TR41k + CNC 160k            312MB    59MB
Table 8: Disk usage of tested models. G refers
to a language model compiled into an automaton,
CLG denotes the triphone-to-word transducer. TR and
CNC refer to LMs estimated from the transcriptions
and the Czech National Corpus, respectively. Morph
represents the loop of words generated by morphology.
Inj is the loop of all words from CNC which
were not included in the CNC language model, together
with the words generated by the morphology.
4 Conclusion
In this paper, we have proposed injecting a loop
of supplementary words into the back-off state of a
first-pass language model. As it turned out, the addition
of rare or morphology-generated words to a language
model can considerably decrease both the recognition
word error rate and the oracle WER in a single
recognition pass. In the recognition of Czech Broadcast
News, we achieved a 13.6% relative improvement
in terms of word error rate. In terms of oracle error
rate, we observed more than 30% relative improvement.
On the MALACH data, we attained only a
marginal word error rate reduction. Since the text
corpora already covered the transcribed speech relatively
well, a smaller OOV reduction translated into
a smaller word error rate reduction. In the near future,
we would like to test our approach on agglutinative
languages, where the problems with high
OOV rates are even more challenging. We would also like
to experiment with more complex language models.
5 Acknowledgements
We would like to thank our colleagues from the Uni-
versity of Western Bohemia for providing us with
acoustic models. This work has been done under the
support of the project of the Ministry of Education of
the Czech Republic No. MSM0021620838 and the
grant of the Grant Agency of the Charles University
(GAUK) No. 375/2005.
References
W. Byrne, J. Hajič, P. Ircing, F. Jelinek, S. Khudanpur,
P. Krbec, and J. Psutka. 2001. On large vocabulary
continuous speech recognition of highly inflectional
language - Czech. In Eurospeech 2001.
P. Geutner, M. Finke, and P. Scheytt. 1998. Adaptive
Vocabularies for Transcribing Multilingual Broadcast
News. In ICASSP, Seattle, Washington.
Jan Hajič and Barbora Vidová-Hladká. 1998. Tagging
inflective languages: Prediction of morphological
categories for a rich, structured tagset. In Proceedings
of the Conference COLING-ACL '98, pages 483-490,
Montreal, Canada.
Jan Hajič, Eva Hajičová, Petr Pajas, Jarmila Panevová,
Petr Sgall, and Barbora Vidová-Hladká. 2001. Prague
dependency treebank 1.0. Linguistic Data Consortium
(LDC), catalog number LDC2001T10.
P. Ircing and J. Psutka. 2001. Two-Pass Recognition of
Czech Speech Using Adaptive Vocabulary. In TSD,
Železná Ruda, Czech Republic.
M. Mohri, F. Pereira, and M. Riley. 2002. Weighted
finite-state transducers in speech recognition.
Computer Speech and Language, 16:69-88.
J. Psutka, P. Ircing, V. Radová, and J. V. Psutka. 2004.
Issues in annotation of the Czech spontaneous speech
corpus in the MALACH project. In Proceedings of the
4th International Conference on Language Resources
and Evaluation, Lisbon, Portugal.
Vlasta Radová, Josef Psutka, Luděk Müller, William
Byrne, J. V. Psutka, Pavel Ircing, and Jindřich
Matoušek. 2004. Czech broadcast news speech.
Linguistic Data Consortium (LDC), catalog number
LDC2004S01.
A. Stolcke. 1998. Entropy-based pruning of backoff
language models. In Proceedings of the ARPA
Workshop on Human Language Technology.
S. Young et al. 1999. The HTK Book. Entropic Inc.
