SPEECH RECOGNITION USING A STOCHASTIC LANGUAGE 
MODEL INTEGRATING LOCAL AND GLOBAL CONSTRAINTS 
Ryosuke Isotani, Shoichi Matsunaga 
ATR Interpreting Telecommunications Research Laboratories 
Seika-cho, Soraku-gun, Kyoto 619-02, Japan 
ABSTRACT 
In this paper, we propose a new stochastic language model that 
integrates local and global constraints effectively and describe a 
speech recognition system based on it. The proposed language model
uses the dependencies within adjacent words as local constraints in 
the same way as conventional word N-gram models. To capture 
the global constraints between non-contiguous words, we take into 
account the sequence of the function words and that of the content 
words which are expected to represent, respectively, the syntactic and 
semantic relationships between words. Furthermore, we show that by
assuming independence between the local and global constraints, the
number of parameters to be estimated and stored is greatly reduced.
The proposed language model is incorporated into a speech recog- 
nizer based on the time-synchronous Viterbi decoding algorithm, and 
compared with the word bigram model and trigram model. The pro- 
posed model gives a better recognition rate than the bigram model, 
though slightly worse than the trigram model, with only twice as 
many parameters as the bigram model. 
1. INTRODUCTION 
At present, word N-gram models [1], especially bigram (N = 2) or
trigram (N = 3) models, are recognized as effective and are widely
used as language models for speech recognition. Such models, how- 
ever, represent only local constraints within a few successive words 
and lack the ability to capture global or long distance dependencies 
between words. They might represent global constraints if N were
set to a larger value, but this is not only computationally impractical
but also inefficient, because dependencies between non-contiguous
words are often independent of the content and length of the word
string between them. In addition, estimating so many parameters
from a finite amount of text would result in data sparseness.
Recently, several papers have treated long-distance factors. In the
long-distance bigrams of Huang et al. [2], a linear combination of
distance-d bigrams is used: all the preceding words in a window of
fixed length are considered, and bigram probabilities are estimated
for each distance d between words. The extended bigram model of
Wright et al. [3] uses a single word, selected for each word according
to a statistical measure, as its "parent." The extended bigrams are
insensitive to the distance between a word and its parent, but this
model does not utilize multiple information sources. The trigger
pairs described in [4, 5] also represent relationships between
non-contiguous words; they too are extracted automatically and are
insensitive to distance. A way of combining the evidence from trigger
pairs with local constraints ("the static model" in their terms) is
also given, but this approach has the disadvantage of being
computationally expensive. Another approach is a tree-based model [6],
which automatically generates a binary decision tree from training
data. Although it could also extract similar dependencies with
appropriately chosen binary questions, it has the same disadvantage
as the trigger-based model.
We therefore proposed a new language model based on function
word N-grams 1 and content word N-grams [7]. Global constraints
are captured effectively, without a significant increase in
computational cost or number of parameters, by utilizing simple
linguistic tags. Function word N-grams are mainly intended to capture
syntactic constraints, while content word N-grams capture semantic
ones. We have already shown their effectiveness for Japanese speech
recognition by applying them to sentence candidate selection from
phrase lattices obtained by a phrase speech recognizer. We also gave
a method to combine these global constraints with local constraints
similar to conventional bigrams, and demonstrated that it improves
performance.
In this paper, we extend and modify this model so that it can be 
incorporated directly into the search process in continuous speech 
recognition based on the time-synchronous Viterbi decoding algo- 
rithm. The new model uses the conventional word N-grams for 
local constraints with N being a small value, and uses function- and 
content word N-grams as global constraints, where N can again be 
small. These constraints are treated statistically in a unified manner. 
A similar approach is found in [8], where, to compute a word proba-
bility, the headwords of the two phrases immediately preceding the
word are used as well as the last two words. Our model differs from
this method in that it also takes function words into consideration,
and treats function words and content words separately in computing
the probability, so as to extract more effective syntactic and
semantic information, respectively.
In the following sections, we first explain the proposed language
model, where we also show that the number of parameters can be
reduced by assuming independence between the local and global
constraints. Then we describe how the model is incorporated into the
time-synchronous Viterbi decoding algorithm. Finally, results of
speaker-dependent sentence recognition experiments are presented, in
which our model is compared with the word bigram and trigram models
in terms of number of parameters, perplexity, and recognition rate.
2. LANGUAGE MODELING 
Linguistic constraints between words in a sentence include syntactic 
ones and semantic ones. The syntactic constraints are often specified 
by the relationships between the cases of the words or phrases.
Consequently, they are expected to be reflected in the sequence of
the cases of the words or phrases. Since case information is mainly
conveyed by function words in Japanese, we consider function word
sequences, ignoring the content words, to capture the syntactic
constraints. Conversely, semantic information is mostly carried by
the content words, so the idea of content word sequences is also
introduced to extract the semantic constraints.

(a) Kaigi -wa / futsuka -kara / itsuka -made / Kyoto -de / kaisaisare -masu.
    the conference (CM) / the 2nd from / the 5th to / Kyoto in / be held (aux. v.)
    (The conference will be held in Kyoto from the 2nd to the 5th.)

(b) Soredewa / tourokuyoushi -o / ookuri -itashi -masu.
    then / the registration form (CM) / send (aux. v.) (aux. v.)
    (Then I will send you the registration form.)

Figure 1: Examples of Japanese sentences
(CM: case marker, aux. v.: auxiliary verb)

1 Previously referred to as "particle N-grams."
After briefly explaining the roles of the function words and content
words in Japanese sentences, we propose a new model, model I, as an
extension of the conventional N-gram model. In this model, the
relationships between function words and between content words are
taken into consideration only implicitly. Then, by making some
assumptions, model II is derived as an approximation of model I.
Model II uses the probabilities of function word N-grams and content
word N-grams directly and may be easier to grasp intuitively.
2.1. Function Words and Content Words in 
Japanese 
A common Japanese sentence consists of phrases ("bunsetsu"), each 
of which typically has one content word and optional function words. 
Figure 1 shows examples of Japanese sentences. In the figure, "/"
represents a phrase separator. Words after "-" in a phrase are func-
tion words and all others are content words 2. The corresponding
English words are given in the figure. Content words include nouns, 
verbs, adjectives, adverbs, etc. Function words are particles and 
auxiliary verbs. Japanese particles include case markers such as 
"ga" (subjective case marker), "o" (objective case marker) as well as 
words such as "kara (from)" or "de (in)." Every word in a sentence 
is classified either as a content word or as a function word. 
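The two-way classification described above can be sketched with a closed-class lookup; the function-word list below is a tiny hypothetical subset, not the paper's 348-word inventory.

```python
# Hypothetical subset of the closed class of Japanese function words
# (particles and auxiliary verbs); everything else is a content word.
FUNCTION_WORDS = {"wa", "ga", "o", "kara", "made", "de", "masu", "itashi"}

def word_category(word):
    """Classify a word as 'function' or 'content'."""
    return "function" if word in FUNCTION_WORDS else "content"

cats = [word_category(w) for w in ["kaigi", "wa", "futsuka", "kara"]]
# -> ["content", "function", "content", "function"]
```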
If we pay attention only to the function words in sentences, ignoring
the content words, "kara (from)" often comes before "made (to)," while
"ga"s (subjective case markers) rarely appear in succession in a sen-
tence. Thus, a sequence of function words is expected to reflect the
syntactic constraints of a sentence. If we consider the content word 
sequence instead, such words as "sanka (participate)" or "happyou 
(give a presentation)" appear more frequently than words such as 
"okuru (send)" after "kaigi (conference)." On the other hand, after 
"youshi (form)," "okuru (send)" comes more frequently. Like these 
examples, a sequence of content words in a sentence is expected to 
be constrained by semantic relationships between words. 
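The kind of content-word statistics described above can be gathered by counting successive words in the content-word-only sequences; the toy corpus below is hypothetical.

```python
from collections import Counter

# Hypothetical content-word sequences, obtained by stripping function
# words from sentences (e.g. "kaigi" = conference, "sanka" = participate).
content_sequences = [
    ["kaigi", "sanka"],
    ["kaigi", "happyou"],
    ["kaigi", "sanka"],
    ["youshi", "okuru"],
]

# Count adjacent pairs within each content-word sequence.
counts = Counter()
for seq in content_sequences:
    for a, b in zip(seq, seq[1:]):
        counts[(a, b)] += 1

# Relative frequency of "sanka" following "kaigi" (2 of 3 continuations).
p_sanka_given_kaigi = counts[("kaigi", "sanka")] / sum(
    v for (a, _), v in counts.items() if a == "kaigi")
```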
These kinds of constraints can be described statistically. To acquire 
these global constraints, the proposed language model makes use of
the N-gram probabilities of both function words and content words.

2 These marks are for explanation only and never appear in actual
Japanese text.
2.2. Proposed Language Model I 
Suppose a sentence S consists of a word string w_1, w_2, ..., w_n, and
denote a substring w_1, w_2, ..., w_i as w_1^i. Then the probability
of the sentence S is written as

P(S) = P(w_1, w_2, ..., w_n)                  (1)
     = \prod_{i=1}^{n} P(w_i | w_1^{i-1}).    (2)
In conventional word N-gram models, each term on the right-hand
side of expression (2) is approximated as the probability of a word
given only the N - 1 words preceding it. In the bigram model, for
example, the following approximation is adopted:

P(w_i | w_1^{i-1}) ≈ P(w_i | w_{i-1}).    (3)
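The chain-rule factorization (2) with the bigram approximation (3) can be sketched as follows; the probability table and the words in it are hypothetical toy values, not estimates from the paper's corpus.

```python
import math

# Hypothetical bigram probabilities P(w_i | w_{i-1}); "<s>" marks the
# sentence start. In practice these would be estimated from a corpus.
bigram = {
    ("<s>", "kaigi"): 0.10,
    ("kaigi", "-wa"): 0.40,
    ("-wa", "kaisaisare"): 0.05,
}

def sentence_log_prob(words, bigram_probs):
    """Log probability of a sentence under the bigram approximation (3)."""
    logp = 0.0
    prev = "<s>"
    for w in words:
        logp += math.log(bigram_probs[(prev, w)])
        prev = w
    return logp

lp = sentence_log_prob(["kaigi", "-wa", "kaisaisare"], bigram)
# lp = log(0.10 * 0.40 * 0.05)
```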
The proposed model is an extension of the N-gram model and also
utilizes the global constraints represented by function- and content
word N-grams. For simplicity, only a single preceding word is taken
into account, both for global and local relationships. Let f_i and
c_i denote the last function word and the last content word in the
substring w_1^i, respectively. The probability of a word w_i given
w_1^{i-1} is, taking f_{i-1} and c_{i-1} into consideration as well
as w_{i-1}, approximated as follows:

P(w_i | w_1^{i-1}) ≈ P(w_i | w_{i-1}, c_{i-1}, f_{i-1}).    (4)

As w_{i-1} is identical to either c_{i-1} or f_{i-1}, this is rewritten as

P(w_i | w_{i-1}, c_{i-1}, f_{i-1})
    = P(w_i | w_{i-1}, f_{i-1}),   w_{i-1}: content word
    = P(w_i | w_{i-1}, c_{i-1}),   w_{i-1}: function word.    (5)

We refer to the model based on equation (5) as "proposed model I."
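The case split in equation (5) amounts to pairing w_{i-1} with the most recent word of the opposite category. A minimal sketch, in which the "-" prefix convention and the word list are hypothetical illustrations rather than part of the model:

```python
# Hypothetical closed class of function words, marked with a "-" prefix
# as in Figure 1 (an explanatory convention only).
FUNCTION_WORDS = {"-wa", "-kara", "-made", "-de", "-o", "-masu"}

def is_function_word(w):
    return w in FUNCTION_WORDS

def model_i_context(history):
    """Conditioning pair for the next word, given w_1 .. w_{i-1}.

    Returns (w_{i-1}, f_{i-1}) when w_{i-1} is a content word, and
    (w_{i-1}, c_{i-1}) when it is a function word, per equation (5).
    """
    prev = history[-1]
    if is_function_word(prev):
        # pair with the most recent content word c_{i-1}
        other = next((w for w in reversed(history)
                      if not is_function_word(w)), "<s>")
    else:
        # pair with the most recent function word f_{i-1}
        other = next((w for w in reversed(history)
                      if is_function_word(w)), "<s>")
    return (prev, other)

ctx = model_i_context(["kaigi", "-wa", "futsuka"])
# w_{i-1} = "futsuka" (content), so the pair is ("futsuka", "-wa")
```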
Figure 2 shows how the word dependencies are taken into account in
this model. The probability of each word in a sentence is determined
by the preceding content- and function-word pair. If content words
and function words appear alternately, this model reduces to the
trigram model. But when, for example, a function word is preceded
by more than one content word, the most recent function word is
used to predict it instead of the second word before it (w_{i-2}).

Figure 2: Word dependency in model I
(c: content word, f: function word)
2.3. Proposed Model II
-- Reduction of the Number of Parameters
The following two assumptions are introduced as an approximation
to reduce the number of parameters:

1. The mutual information between w_i and w_{i-1} is independent of
   f_{i-1} if w_{i-1} is a content word, and independent of c_{i-1}
   if w_{i-1} is a function word, i.e., the following approximations hold:

   I(w_i; w_{i-1} | f_{i-1}) = I(w_i; w_{i-1})    (6)

   if w_{i-1} is a content word, and

   I(w_i; w_{i-1} | c_{i-1}) = I(w_i; w_{i-1})    (7)

   if w_{i-1} is a function word.

2. The appearance of a content word and that of a function word are
   mutually independent when they are located non-contiguously in a
   sentence, i.e.,

   P(w_i | f_{i-1}) = P(w_i)    (8)

   if w_{i-1} and w_i are content words, and

   P(w_i | c_{i-1}) = P(w_i)    (9)

   if w_{i-1} and w_i are function words.
From these approximations, expression (5) is rewritten as

P(w_i | w_{i-1}, c_{i-1}, f_{i-1})
    = P_L(w_i | w_{i-1}) · P_G(f_i | f_{i-1}) / P_G(f_i),
          w_{i-1}: content word, w_i: function word (= f_i)
    = P_L(w_i | w_{i-1}) · P_G(c_i | c_{i-1}) / P_G(c_i),
          w_{i-1}: function word, w_i: content word (= c_i)
    = P_L(w_i | w_{i-1}),   otherwise,    (10)

where P_L and P_G represent the probabilities of local and global
constraints between words. To be more exact, P_G(f_i) is the prob-
ability that the i-th word is f_i given that it is a function word,
and P_G(f_i | f_{i-1}) is the probability that the i-th word is f_i
given that the most recent function word is f_{i-1} and that the i-th
word is a function word. P_G(c_i) and P_G(c_i | c_{i-1}) are defined
in the same way. In other words, P_G(·) denotes a probability in the
function (or content) word sequences obtained by extracting only
function (or content) words from the sentences. Note that P_G(·) is
used only when two function (or content) words appear
non-contiguously. We refer to the model based on equation (10) as
"proposed model II."
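A minimal sketch of evaluating equation (10); the probability tables are passed in as functions and are purely hypothetical here.

```python
def model_ii_prob(w, w_prev, c_prev, f_prev,
                  p_local, p_global, p_unigram, is_function_word):
    """P(w | w_prev, c_prev, f_prev) per equation (10).

    p_local(w, prev)  -> P_L(w_i | w_{i-1})
    p_global(w, prev) -> P_G(f_i | f_{i-1}) or P_G(c_i | c_{i-1})
    p_unigram(w)      -> P_G(f_i) or P_G(c_i)
    """
    pl = p_local(w, w_prev)
    if not is_function_word(w_prev) and is_function_word(w):
        # w_{i-1}: content word, w_i: function word (= f_i)
        return pl * p_global(w, f_prev) / p_unigram(w)
    if is_function_word(w_prev) and not is_function_word(w):
        # w_{i-1}: function word, w_i: content word (= c_i)
        return pl * p_global(w, c_prev) / p_unigram(w)
    # adjacent words of the same category: local bigram only
    return pl

# Toy usage with constant dummy tables: P_L = 0.2, P_G(.|.) = 0.3,
# P_G(.) = 0.1, so the modulated case gives 0.2 * 0.3 / 0.1.
p = model_ii_prob("-wa", "kaigi", "kaigi", "<s>",
                  lambda w, prev: 0.2,
                  lambda w, prev: 0.3,
                  lambda w: 0.1,
                  lambda w: w.startswith("-"))
```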
This approximate equation shows that the probabilities of the words in
a sentence are expressed as the product of word bigram probabilities
and function word (or content word) bigram probabilities, which de-
scribe the local and global linguistic constraints, respectively. The
terms P_G(f_i) and P_G(c_i) in the denominators can be intuitively
interpreted as compensating for the probability of word w_i being
multiplied twice.

Figure 3: Word dependency in model II
(word bigram probabilities: local constraints;
function word / content word bigram probabilities: global constraints;
c: content word, f: function word)
Figure 3 shows how the word dependencies are taken into account
in this model. The probability of each word is determined from the
word immediately before it, and also from the preceding word of the
same category (function word or content word) if the category of
the word immediately before it differs from that of the current
word. The former corresponds to the word bigram probability and the
latter to the function word (or content word) bigram probability,
which are computed independently. It is easy to extend this model
to use a word trigram model or a function word (content word)
trigram model.
The decomposition of probabilities greatly reduces the number of pa-
rameters to be estimated. The number of parameters in each model
is summarized in Table 1, where V, V_c, and V_f are the vocabulary
size, the number of content words, and the number of function words,
respectively (V = V_c + V_f). The word trigram model and the proposed
model I have O(V^3) parameters, while the proposed model II has only
O(V^2) parameters, which is comparable to the word bigram model.
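The counts in Table 1 can be checked against the parameter-count ratios later reported in Table 4, using the vocabulary sizes from the experiments (V_c = 5041, V_f = 348); the closed-form count for model II is read from Table 1.

```python
# Vocabulary sizes from Section 4.1 of the paper.
V_c, V_f = 5041, 348
V = V_c + V_f                       # 5389 words in total

bigram   = V ** 2                   # word bigram model
trigram  = V ** 3                   # word trigram model
model_i  = 2 * V_c * V_f * V        # proposed model I (Table 1)
model_ii = V ** 2 + V_c ** 2 + V_f ** 2   # proposed model II (Table 1)

print(round(trigram / bigram))      # 5389  ~ 5.4 x 10^3 (cf. Table 4)
print(round(model_i / bigram))      # 651   ~ 6.5 x 10^2 (cf. Table 4)
print(round(model_ii / bigram, 1))  # 1.9   (cf. Table 4)
```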
3. APPLICATION TO SPEECH 
RECOGNITION 
Since, like N-gram models, the proposed language models are
Markov models, they can easily be incorporated into a speech recog-
nition system based on the time-synchronous Viterbi decoding algo-
rithm. They could also be used to rescore N-best hypotheses, but
that would bring some loss of information.
Figure 4 shows the network representation of the language model.
Symbols c^i, c^j, c^k represent content words, and f^l, f^m, f^n repre-
sent function words. Each node of the network is a Markov state
corresponding to a word pair of either (c_{i-1}, f_{i-1}) or
(f_{i-1}, c_{i-1}), and each arc is a transition corresponding to a
word w_i. In the case of the trigram model, each node would correspond
to a word pair (w_{i-2}, w_{i-1}). Each arc is assigned a probability
according to equation (5) (model I) or (10) (model II). The number of
nodes is 2 V_c V_f and the total number of arcs is 2 V_c V_f V for
both model I and model II. In the case of the trigram model, they
would be V^2 and V^3, respectively.

Ordinary time-synchronous Viterbi decoding controlled by this net-
work is possible. As the numbers of nodes and arcs are still huge,
although reduced compared with the trigram model, a beam search
is necessary in the decoding process.

Figure 4: Network representation of the proposed language model

Language Model   Number of Parameters
Bigram           V^2
Trigram          V^3
Proposed (I)     2 V_c V_f V
Proposed (II)    V^2 + V_c^2 + V_f^2

V: vocabulary size (= V_c + V_f)
V_c: number of content words
V_f: number of function words

Table 1: Number of parameters of each model

Task:            International conference registration
Vocabulary Size: 1,500 words
Speaker:         1 male speaker
Test Data:       261 sentences (7.0 words/sentence, on average)

Table 2: Experimental conditions for speech recognition

The proposed model was compared with the word bigram and tri-
gram models in their perplexities for test sentences and in their
sentence recognition rates. For the proposed model I, only the
perplexity was calculated. The ratios of the numbers of parameters
were also calculated based on Table 1.
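The per-node bookkeeping of the decoding network, in which each Markov state holds the most recent content word and function word, can be sketched as follows; the "-" prefix convention is a hypothetical stand-in for a real function-word lexicon.

```python
def next_state(state, word, is_function_word):
    """Advance a (last content word, last function word) state.

    Consuming a word moves to a new node by replacing the slot of the
    word's own category, mirroring the network's node structure.
    """
    last_c, last_f = state
    if is_function_word(word):
        return (last_c, word)
    return (word, last_f)

is_func = lambda w: w.startswith("-")   # hypothetical marking convention

s = ("<s>", "<s>")                      # dummy sentence-initial state
for w in ["kaigi", "-wa", "futsuka", "-kara"]:
    s = next_state(s, w, is_func)
# s is now ("futsuka", "-kara")
```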
4. EXPERIMENTS 
4.1. Estimation of Language Model 
Parameters 
An 11,000-sentence text database of Japanese conversations concern-
ing conference registration was used to train the language models.
This database is manually labeled with part-of-speech tags. Each
word was classified as a function word or a content word according
to its part of speech. The size of the vocabulary is 5,389 words
(5,041 content words and 348 function words), where words having
the same spelling but different pronunciations or different parts of
speech were counted as different words.
The probability values in the language models were estimated by the
maximum likelihood method. These values were then smoothed us-
ing the deleted interpolation method [9]. To cope with the unknown
word problem, 'zero-gram' probabilities (a uniform distribution) were
also used in the interpolation. In model II, this interpolation was
applied separately to the probabilities of the local constraints (P_L)
and those of the global constraints (P_G).
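The smoothing step described above can be sketched as a linear combination of bigram, unigram, and zero-gram (uniform) estimates; the weights here are hypothetical fixed values, whereas in deleted interpolation [9] they would be optimized on held-out data.

```python
def smoothed_bigram(w, w_prev, p_bigram, p_unigram, vocab_size,
                    lam=(0.6, 0.3, 0.1)):
    """Interpolate bigram, unigram, and uniform (zero-gram) estimates.

    lam holds the interpolation weights (hypothetical values here);
    unseen events fall back toward the uniform 1/V floor.
    """
    l2, l1, l0 = lam
    return (l2 * p_bigram.get((w_prev, w), 0.0)
            + l1 * p_unigram.get(w, 0.0)
            + l0 * 1.0 / vocab_size)

# An entirely unseen pair receives only the zero-gram contribution.
p = smoothed_bigram("kaigi", "<s>", {}, {}, 5389)
```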
In the calculation of perplexity for model II, use of the values
obtained by equation (10) does not give the correct perplexity, because

\sum_{w_i} P(w_i | w_{i-1}, c_{i-1}, f_{i-1}) = 1    (11)

does not hold due to the approximation. Therefore, the values of
P(w_i | w_{i-1}, c_{i-1}, f_{i-1}) were normalized to satisfy this
equation. The normalization was done by simply multiplying by a
constant found for each combination (w_{i-1}, c_{i-1}, f_{i-1}). It
was omitted in the recognition experiment for computational reasons.
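The per-context normalization used for the perplexity calculation can be sketched as:

```python
def normalize_context(unnormalized):
    """Rescale approximate probabilities for one context.

    unnormalized maps each candidate word to its equation-(10) value;
    dividing by the context's total makes the values sum to one, as
    equation (11) requires.
    """
    z = sum(unnormalized.values())
    return {w: p / z for w, p in unnormalized.items()}

# Toy context whose raw values sum to 0.4 rather than 1.
probs = normalize_context({"a": 0.2, "b": 0.1, "c": 0.1})
```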
4.2. Experimental Conditions

Speaker-dependent continuous speech recognition experiments were
carried out under the conditions shown in Table 2. The domain of
the recognition task is the same as that of the training data, but
the text of the test speech data was not included in the training
data. Context-independent continuous-mixture HMMs were used as
acoustic models. The details of the acoustic models are shown in
Table 3.

Number of Phonemes:   38
Topology:             4-state 3-loop, left-to-right model
Output Probabilities: Gaussian mixtures
Number of Mixtures:   max 14 (variable)
Training Data:        2,620 word utterances

Table 3: HMMs used as the acoustic models

The beam width for recognition was fixed at 6,000 in all cases.
Weighting values for the acoustic score and the linguistic score were
determined by preliminary experiments; common weighting values were
used for all models.

4.3. Results

The results are shown in Table 4. The proposed models give lower
perplexities than the bigram model, although not as low as the trigram
model, and this is reflected in the speech recognition accuracy. The
perplexity of model II is higher than that of model I, which we
attribute to the approximation used to derive model II; the smallness
of the increase, however, supports the validity of the assumptions
described in Section 2.3.

Although the perplexity and recognition rate are improved compared
with the bigram model, the gain is modest. This may be due to a lack
of training data or to a mismatch between the training and test data,
especially since the difference in performance between the bigram
model and the trigram model is also small.

However, the fact that the performance obtained by the proposed
model II lies almost halfway between the bigram and trigram models
shows that the proposed model can capture linguistic constraints
effectively with a comparatively small number of parameters. Its
performance could be improved by extending it to use trigram
probabilities for the local or global constraints.

Language Model   Perplexity   Sentence Recognition Rate   Ratio of Number of Parameters
Bigram           41.2         51.3%                       1.0
Trigram          36.3         54.0%                       5.4 x 10^3
Proposed (I)     36.9         --                          6.5 x 10^2
Proposed (II)    38.1         52.5%                       1.9

Table 4: Test set perplexity and sentence recognition rate
5. DISCUSSIONS 
In an attempt to capture the global constraints, we took note of 
the role of function words as case markers and used their N-gram 
probabilities to extract the syntactic constraints. We also used the 
N-gram probabilities of the content words to extract the semantic 
constraints. 
One advantage of the proposed model is its low computational cost
compared with previous works [3, 4, 5, 6]. Furthermore, as the
syntactic constraints are considered to be less domain-dependent than
the semantic ones, the function word N-grams could be trained on a
large task-independent database and combined with content word
N-grams trained on a smaller task-dependent database.
One disadvantage of our approach is that labels indicating whether
each word is a function word or a content word are necessary in the
training data. We think it would not be difficult to label them
automatically if we only have to classify the words into these two
categories, because the set of function words can be regarded as a
closed class.
Another problem is its generality, especially its applicability to
other languages. English, for example, has a different sentence
structure and a different way of specifying cases, although
relationships between the content words are still expected to exist.
We think a similar approach could also be useful for other languages,
but some modification may be needed.
6. CONCLUSIONS 
In this paper, a speech recognition system using a new stochastic 
language model that integrates local and global linguistic constraints 
was proposed. Function word bigrams and content word bigrams
were introduced to capture global syntactic and semantic constraints,
and were combined with a conventional word bigram model. The num-
ber of parameters was reduced by decomposing the local and global
dependencies.
A continuous speech recognition system based on the time-synchronous
Viterbi decoding algorithm with the proposed language model incor-
porated into it was presented, and speaker-dependent speech recog-
nition experiments were conducted. Although the improvements in
performance over the conventional bigram model are rather modest,
the results show that the proposed model can capture linguistic
constraints effectively.
The assumptions made to reduce the number of parameters do not
degrade perplexity much, but their validity needs to be verified from
a linguistic point of view. The number of parameters is reduced in
the proposed model, but the database we used is still not large
enough to estimate the statistics in the model; more data would be
necessary to evaluate the effectiveness of the proposed model. The
use of part-of-speech or automatically generated word equivalence
classes (for example, [10]) could help to increase the robustness of
the estimates obtained from corpora of limited size.
In the future, we plan to further investigate the effective utilization
of linguistic knowledge, as well as statistical approaches, to extract
more useful global constraints.
Acknowledgments 
We would like to thank Mr. Sagayama, NTT Interface Laboratories, 
for his valuable advice. We are also grateful to Dr. Sagisaka and the 
members of Department 1 for their useful comments and help. 
References 
1. Bahl, L. R., Jelinek, F., and Mercer, R. L., "A maximum
   likelihood approach to continuous speech recognition," IEEE
   Transactions on Pattern Analysis and Machine Intelligence,
   vol. PAMI-5, 1983, pp. 179-190.
2. Huang, X., Alleva, F., Hon, H.-W., Hwang, M.-Y., Lee, K.-F., and
   Rosenfeld, R., "The SPHINX-II speech recognition system:
   an overview," Computer Speech and Language, vol. 7, 1993,
   pp. 137-148.
3. Wright, J. H., Jones, G. J. F., and Lloyd-Thomas, H., "A con-
   solidated language model for speech recognition," Proc. Eu-
   rospeech 93, 1993, pp. 977-980.
4. Lau, R., Rosenfeld, R., and Roukos, S., "Adaptive language 
modeling using the maximum entropy principle," Proc. ARPA 
Human Language Technology Workshop, 1993. 
5. Lau, R., Rosenfeld, R., and Roukos, S., "Trigger-based lan-
   guage models: a maximum entropy approach," Proc. ICASSP
   93, 1993, pp. II-45-II-48.
6. Bahl, L. R., Brown, P. F., de Souza, P. V., and Mercer, R. L.,
   "A tree-based statistical language model for natural language
   speech recognition," IEEE Transactions on Acoustics, Speech,
   and Signal Processing, vol. 37, 1989, pp. 1001-1008.
7. Isotani, R. and Sagayama, S., "Speech recognition using par- 
ticle N-grams and content-word N-grams," Proc. Eurospeech 
93, 1993, pp. 1955-1958. 
8. Jelinek, F., "Self-organized language modeling for speech
   recognition," IBM research report, 1985. Also available in
   Readings in Speech Recognition, Waibel, A. and Lee, K.-F.,
   eds., 1990.
9. Jelinek, F. and Mercer, R. L., "Interpolated estimation of 
Markov source parameters from sparse data," in Pattern Recog- 
nition in Practice, Gelsema, E. S. and Kanal, L. N., eds., North- 
Holland Publishing Company, 1980. 
10. Kneser, R. and Ney, H., "Improved clustering techniques for 
class-based statistical language modelling," Proc. Eurospeech 
93, 1993, pp. 973-976. 
