Nonlocal Language Modeling 
based on Context Co-occurrence Vectors 
Sadao Kurohashi and Manabu Ori 
Graduate School of Informatics 
Kyoto University 
Yoshida-honmachi, Sakyo; Kyoto, 606-8501 Japan 
kuro@±, ky0to-u, ac. j p, oriOpine, kuee. kyoto-u, ac. j p 
Abstract 
This paper presents a novel nonlocal lmlguage 
model which utilizes contextual information. 
A reduced vector space model calculated from 
co-occurrences of word pairs provides word 
co-occurrence vectors. The sum of word co- 
occurrence vectors represents tile context of a 
document, and the cosine similarity between 
the context vector and the word co-occurrence 
vectors represents the \]ong-distmlce lexical de- 
pendencies. Experiments on the Mainichi 
Newspaper corpus show significant improve- 
ment in perplexity (5.070 overall and 27.2% 
on target vocabulary) 
1 Introduction 
Human pattern recognition rarely handles iso- 
lated or independent objects. We recog- 
nize objects in various spatiotemporal circum- 
stances such as an object in a scene, a word 
in an uttermlce. These circumstances work 
as conditions, eliminating ambiguities and en- 
abling robust recognition. The most challeng- 
ing topics in machine pattern recognition are 
in what representation and to what extent 
those circumstances are utilized. 
In laalguage processing, a context--that is; 
a portion of the utterance or the text before 
the object--is ml important circumstmlce. 
One way of representing a context is statis- 
tical  nmdels which provide a word 
sequence probability, P(w~), where w~ de- 
notes the sequence wi...wj. In other words, 
they provide the conditional probability of a 
word given with the previous word sequence, 
P( wilw~-l ), which shows the prediction of a 
word in a given context. 
The most conmmn laalguage models used 
nowadays are N-granl models based on a 
(N- 1)-th order Markov process: event pre- 
dictions depend on at most (N- 1) previous 
events. Therefore, they offer the following ap- 
proximation: 
P(w.ilw  -1)   wiJwi_N+l) (I) 
A common value for N is 2 (bigram  
model) or 3 (trigram  model); only 
a short local context of one or two words is 
considered. 
Even such a local context is effective in 
some cases. For example, in Japanese, after 
the word kokumu 'state affairs', words such as 
daijin 'minister' mad shou 'department' likely 
follow; kaijin 'monster' and shou 'priZe' do 
not. After dake de 'only at', you cml often 
find wa (topic-marker), but you hardly find 
ga (nominative-marker) or wo (accusative- 
marker). These examples show behaviors of 
compound nouns and function word sequences 
are well handled by bigram mad trigraan mod- 
els. These models are exploited in several ap- 
plications such as speech recognition, optical 
character recognition and nmrphological anal- 
ysis. 
Local  models, however, cannot 
predict nmch in some cases. For instance, the 
word probability distribution after de wa 'at 
(topic-marker)' is very flat. However, even if 
the probability distribution is flat in local lan- 
guage models, the probability of daijin 'min- 
ister' and kaijin 'monster' must be very differ- 
ent in documents concenfing politics. Bigram 
and trigram models are obviously powerless 
to such kind of nonlocal, long-distmlce lexical 
dependencies. 
This paper presents a nonlocal  
model. The important information concern- 
ing long-distance lexical dependencies is the 
word co-occurrence information. For example, 
words such as politics, govermnent, admin- 
istration, department, tend to co-occur with 
daijin 'minister'. It is easy to measure co- 
occurrences of word pairs from a training cor- 
pus, but utilizing them as a representation of 
context is the problem. We present a vector 
80 
Wl 
W2 
w3 
w4 
w5 
w6 
D1 D2 D3 D4 D~ D6 D7 Ds 
1 0 1 0 1 0 1 0 
1 O 1 1 0 0 0 0 
0 1 0 0 1 1 0 1 
1 1 1 0 0 0 0 0 
0 0 0 O 1 0 1 0 
0 0 0 0 1 0 0 1 
Wl 
w2 
w3 
w4 
w5 
w6 
Wl W2 w3 W4 w5 w 6 
4 2 1 2 2 1 
3 0 2 0 0 
4 1 1 2 
3 0 0 
2 1 
2 
Figure 1: V~rord-document co-occurrence ma- 
trix. 
representation of word co-occurrence informa- 
tion; and show that the context can be repre- 
sented as a sum of word co-occurrence vectors 
in a docmnent and it is incorporated in a non- 
local  model. 
2 Word Co-occurrence Vector 
2.1 Word-Document Co-occurrence 
Matrix 
Word co-occurrences are directly represented 
in a matrix whose rows correspond to words 
and whose columns correspond to documents 
(e.g. a newspaper article). The element of 
the matrix is 1 if the word of the row ap- 
pears in the document of the colunm (Figure 
1). Wre call such a matrix a word-document 
co-occurrence matrix. 
The row-vectors of a word-document co- 
occurrence matrix represent the co-occurrence 
information of words. If two words tend to ap- 
pear in the same documents, that is: tend to 
co-occur, their row-vectors are similar, that is, 
they point in sinfilar directions. 
The more document is considered, the more 
reliable and realistic the co-occurrence infor- 
mation will be. Then, the row size of a word- 
document co-occurrence matrix may become 
very large. Since enormous amounts of online 
text are available these days, row size can be- 
come more than a million documents. Then, 
it is not practical to use a word-docmnent co- 
occurrence matrix as it is. It is necessary to 
reduce row size and to simulate the tendency 
in the original matrix by a reduced matrix. 
2.2 Reduction of Word-Document 
Co-occurrence Matrix 
The aim of a word-document co-occurrence 
matrix is to measure co-occurrence of two 
words by the angle of the two row-vectors. 
In the reduction of a matrix, angles of two 
row-vectors in the original matrLx should be 
maintained in the reduced matrLx. 
Figure 2: ~Vord-word co-occurrence matrix. 
As such a matrix reduction, we utilized a 
learning method developed by HNC Software 
(Ilgen and Rushall, 1996). 1 
1. Not the word-docmnent co-occurrence 
matrix is constructed from tile learning 
corpus, but a word-word co-occurrence 
matrix. In this matrix: the rows and 
colunms correspond to words and the i- 
th diagonal element denotes the number 
of documents in which the word wl ap- 
pears, F(wi). The i:j-th element denotes 
the number of documents in which both 
words w,: and wj appear, F(wi, wj) (Fig- 
ure 2). 
The importmlt information in a word- 
document co-occurrence matrix is the co- 
sine of the angle of the row-vector of wi 
and that of wj, which can be calculated 
by the word-word co-occurrence matrix 
as follows: 
F(w,:, wj) (2) 
This is because x/F(wi) corresponds to 
the magnitude of the row-vector of wl, 
and F(wl, wi) corresponds to the dot 
product of the row-vector of wl and 
that of wj in the word-docmnent co- 
occurrence matrix. 
2. Given a reduced row size, a matrix is ini- 
tialized as follows: matrix elements are 
chosen from a normal distribution ran- 
domly, then each row-vector is normal- 
ized to magnitude 1.0. The random refit 
row-vector of the word wl is denoted as 
,WCi Rand. 
Random unit row-vectors in high di- 
mensional floating point spaces have a 
1The goal of HNC was the enhancement of text 
retrieval. The reduced word vectors were regarded as 
semantic representation of words and used to represent 
documents and queries. 
81 
sori wa kakugi de' kankyo mondai 
(Prime Minister) (Cabinet meeting) (environment) (issue) 
I wc I \] wc II wc 
\ 
ni tuite 
w (cc • wc) 2 Pc 
kaigi (conference) 0.237962 0.002702 
senkyo (election) 0.150773 0.001712 
yosan (budget) 0.128907 0.001463 
daijin (minister) 0.018549 0.000211 
yakyu (baseball) 0.004556 0.000052 
kaijin (monster) 0.000002 0.000000 
sugaku (mathematics) 0.000001 0.000000 
TOTAL 88.079230 1.000000 
Figure 3: An example of context co-occurrence probabilities. 
property that is referred to a "qnasi- 
orthogonality'. That is; the expected 
~¢alue of the dot product between an3" 
pair of random row-vectors, wci Rand and 
wet and, is approximately equal to zero 
(i.e. all vectors are approximately or- 
thogonal). 
3. The trained row-vector, wai is calculated 
as follows: 
WCi -~ ~13C~ and + "q ~ O'ij'T.ll4 and 
J (3) 
wc - (4) 
The procedure iterates the following calcu- 
lation: 
OJ 
wen e~' = wcl - q Owe/ 
= + rl (a j - we~. wcj)wc  
(6) 
new -- W C7 e~: 
ilwcF wl I (7) 
The learning method by HNC is a rather 
simple approximation of the procedure, doing 
just one step of it. Note that wci.wcj is 
approximately zero for the initialized random 
vectors. 
aij corresponds to the degree of the co- 
occurrence of two words. By adding 
wc~ and to wet a'd depending on aij, th.e 
learning formula (3) achieves that two 
words that, tend to co-occur will have 
trained vectors that point in shnilar di- 
rections, r/is a design parameter chosen 
to optimize performance. The formula 
(4) is to normalize vectors to magnitude 
1.0. 
We call the trained row-vector we/of the 
word wi a word co-occurrence vector. 
The background of the above method is a 
stochastic gradient descent procedure for min- 
imizing the cost function: 
1 J = ~ .~(aij -- we/" wcj) 2 (5) 
%3 
subject to the constraints \[\[we/I\[ = 1. 
3 Context Co-occurrence Vector 
The next question is how to represent the 
context of a document based on word co- 
occurrence vectors. We propose a simple 
model which represents the context as the sum 
of the word co-occurrence vectors associated 
with content words ill a document so far. It 
should be noted that the vector is normalized 
to unit length. V~re call the resulting vector a 
context co-occurrence vector. 
W'ord co-occurrence vectors have the prop- 
erty that words which tend to co-occur have 
vectors that. point in similar directions. Con- 
text co-occurrence vectors are expected to 
have the sinfilar property. That is, if a word 
tends to appear in a given context, the word 
co-occurrence vector of the word and the con- 
text co-occurrence vector of the context will 
point in similar directions ...... 
Such a context co-occurrence vector can be 
seen to predict the occurrence of words in a 
82 
where 
p(.wdwi_,) = ( P(C~lwi-' ) x P(wdw~-'Cc) P(Cflwj-') x P(wdw~-lc/) 
( 
if wl E C~ 
if wi E C/ 
P(C~Iw~ -1) 
P(wilw~-:C~) 
P(wi\[w -lc/) 
= A:P(Cc) + A2P(C~lwi_l ) + A3P(C~\[wi-2wi-1) 
= AclP(wiICc) + A~2P(wi\[wi-lC~) + A~3P(wi\[wi-2Wi-lCc) 
= 1- P(C~lwj-:) 
= a/:P(wdc/) + a/2P(wd,  -:ci) + 
with 
Figure 4: Context  model. 
given context, mad is utilized as a component 
of statistical  modeling, as shown in 
the next section. 
4 Language Modeling using 
Context Co-occurrence Vector 
4.1 Context Co-occurrence 
Probability 
The dot product of a context co-occurrence 
vector and a word co-occurrence vector shows 
the degree of affinity of the context m:d the 
word. The probability of a content word based 
on such dot products, called a context co- 
occurrence probability, can be calculated as 
follows: 
Pc(wilw~_lcc) = f(cc~ -1 "~cl) 
~wjEcc f(cc~ -1" ~vcj) 
(S) 
where cc~ -1 denotes the context co-occurrence 
vector of the left context, Wl ... wi-1, and Cc 
denotes a content word class. Pc(wilw~-lcc) 
metals the conditional probability of wi given 
that a content word follows wj-:. 
One choice for the function .f(x) is the iden- 
tity. However, a linear contribution of dot 
products to the probability results in poorer 
estimates, since the differences of dot prod- 
ucts of related words (tend to co-occur) and 
unrelated words are not so large. Experiments 
showed that x 2 or x 3 is a better estimate. 
An example of context co-occurrence prob- 
abilities is shown in Figure 3. 
4.2 Language Modeling using Context 
Co-occurrence Probability 
Context co-occurrence probabilities can ham 
dle long-distance lexical dependencies while a 
standard trigram model can handle local con- 
texts more clearly: in this way they comple- 
ment each other. Therefore,  model- 
ing of their linear interpolation is employed. 
Note that tile linear interpolation of unigram, 
bigram and trigram models is simply referred 
to 'trigxan: model' in this paper. 
The proposed  model, called a con- 
text  model, computes probabilities 
as shown in Figure 4. Since context co- 
occurrence probabilities are considered only 
for content words (Cc), probabilities are cal- 
culated separately for content words (Co) and 
function words (C/). 
P(Cc\[w~ -1) denotes the probability that a 
content word follows w~-:, which is approx- 
imated by a trigrmn nmdel. P(.wi\[w~-lcc) 
denotes the probability that wi follows w~-: 
given that a content word follows w~-:, which 
is a linear interpolation of a standard trigram 
model and the context co-occurrence proba- 
bilities. 
In the case of a function word, since the 
context co-occurrence probability is not con- 
sidered, P(wdw~-lCi) is just a standard tri- 
granl model. 
X's adapt using an EM re-estimation proce- 
dure on the held-out data. 
83 
Table 1: Perplexity results for the stmldard trigrazn model and the context  nmdel. 
Perplexity on Perplexity on 
Language Model the entire the target 
vocabulary vocabulary 
Standard Trigram Model 107.7 1930.2 
Context Language Model 
Vector size 0 f(x) 
500 0.5 x ~ 
1000 0.3 x ~ 
1000 0.5 x 
* 1000 0.5 x 2 
1000 0.5 x 3 
1000 1.0 x 2 
2000 0.5 x 2 
106.3 (-1.3%) ~o 
102.7 (-4., %) 
103.6 (-3.9%) 
102.4 (-5.0%) 
102.4 (-5.0%) 
102.5 (-4.8%) 
102.4 (-5.0%) 
1663.8 (-13.8%) 
1495.9 (-22.5%) 
1496.1 (-22.5%) 
1406.2 (-27.2%) 
1416.8 (-26.9%) 
1430.3 (-25.9%) 
1408.1 (-27.1%) 
Standard Bigram Model 130.28 2719.67 
Context Language Model 
125.06 (-4.0%) 
122.85 (-5.7%) 
1000 0.5 x 
1000 0.5 x 2 
2075.10 (-23.7%) 
1933.68 (-28.9%) 
shijyo no ~ wo ~ ni Wall-gai ga kakkyou wo teishi, bei kabushiki 
'US' 'stock' 'market' 'sudden rise' 'background' %Vall Street' 'activity' 'show' 
wagayonoharu wo ~a~ shire iru. \[shoukenl kaisha, ~h~ ginkou wa 1996 nen ni 
'prosperity' 'enjoy' 'do' 'stock' 'company' 'investment' 'bank' 'year' 
halite ka o saiko  l ko shi  \] '96 ne,  I k b shiki l so.ha '95 
'enter' 'past' 'maximum' 'profit' 'renew' 'year' 
ni I .tsuzuki\] kyushin . mata \] kab.uka\] kyushin wo 
'continue' 'rapid increase' 'stock price' 'rapidly increase' 
I shinkabul hakkou ga ~ saikou to natta. 
'new stock' 'issue' 'past' 'maximum' 'become' 
'stock' 'market' 'year' 
ni ~u~ no 
'background' 'corporation' 
Figure 5: Comparison of probabilities of content words by the trigraan model and the context 
model. (Note that wa, ga, wo, ni; to and no are Japanese postpositions.) 
4.3 Test Set Perplexity 
By using the Mainichi Newspaper corpus 
(from 1991 to 1997, 440,000 articles), test 
set perplexities of a standard trigrmn/bigram 
model and the proposed context  
model are compared. The articles of six 
years were used for the leanfing of word co- 
occurrence vectors, unigrams, bigrmns and 
trigrams; the articles of half a year were used 
as a held-out data for EM re-estimation of A's; 
the remaining articles (half a year) for com- 
puting test set perplexities. 
Word co-occurrence vectors were computed 
for the top 50,000 frequent content words (ex- 
cluding pronouns, numerals, temporal nouns, 
mad light verbs) in the corpus, and unigrmn: 
bigrmn and trigrmn were computed for the top 
60,000 frequent words. 
The upper part of Table 1 shows thecom- 
parison results of the stmldard trigram model 
and the context  model. For the best 
parameters (marked by *), the overall per- 
plexity decreased 5.0% and the perplexity on 
target vocabulary (50,000 content words) de- 
creased 27.270 relative to the standard trigram 
model. For the best parameters, A's were 
adapted as follows: 
A1 = 0.08, A2 = 0.50, A3 = 0.42 
Acl = 0.03, ~c2 = 0.50, Xc3 = 0.30, Xcc = 0.17 
Afl = 0.06, ~f2 = 0.57, A f3 = 0.37 
As for parazneter settings, note that per- 
formance is decreased by using shorter word 
co-occurrence vector size. The vaxiation of 
~/does not change the performance so much. 
84 
f(x) = x 2 and f(x) = x 3 are alnmst the same; 
better thaaa f(x) = x. 
The lower part of Table 1 shows the compar- 
ison results of the standard bigram model and 
the context  model. Here, the context 
 model is based on the bigrana model, 
that is, the terms concerning trigrmn in Fig- 
ure 4 were eliminated. The result was similar, 
but the perplexity decreased a bit more; 5.7% 
overall and 28.9% on target vocabulary. 
Figure 5 shows a test article in which the 
probabilities of content words by the trigram 
lnodel aald the context model are compared. If 
that by the context model is bigger (i.e. the 
context model predicts better), the word is 
boxed; if not, the word is underlined. 
The figure shows that the context model 
usually performs better after a function word, 
where the trigram model usually has little pre- 
diction. On the other hand, the trigram model 
performs better after a content word (i.e. in 
a compound noun) because a clear prediction 
by the trigram model is reduced by paying 
attention to the relatively vague context co- 
occurrence probability (Acc is 0.17). 
The proposed model is a constant interpo- 
lation of a trigram model and the context co- 
.0ccurrence probabilities. More adaptive inter- 
polation depending on the N-gram probabil- 
ity distribution may improve the performance. 
5 Related Work 
Cache  models (Kuhn mad de Mori, 
1990) boost the probability of the words al- 
ready seen in the history. 
Trigger models (Lau et al., 1993), even more 
general, try to capture the co-occurrences be- 
tween words. While the basic idea of our 
model is similar to trigger models, they handle 
co-occurrences of word pairs independently 
and do not use a representation of the whole 
context. This omission is also done in ap- 
plications such as word sense dismnbiguation 
(Yarowsky: 1994; FUNG et al., 1999). 
Our model is the most related to Coccaro 
mad Jurafsky (1998), in that a reduced vec- 
tor space approach was taken and context is 
represented by the accumulation of word co- 
occurrence vectors. Their model was reported 
to decrease the test set perplexity by 12%, 
compared to the bigram nmdel. The major 
differences are: 
1. SVD (Singular Value Decomposition) 
was used to reduce the matrix which is 
common in the Latent Semaaltic Analysis 
(Deerwester et ai.; 1990), and 
2. context co-occurrence probabilities were 
computed for all words, and the degree 
of combination of context co-occurrence 
probabilities and N-gram probabilities 
was computed for each word, depending 
on its distribution over the set of docu- 
lnents. 
As for the first point, we utilized the 
computationally-light, iteration-based proce- 
dure. One reason for this is that the com- 
putational cost of SVD is very high when 
millions or more documents are processed. 
Furthermore, considering an extension of our 
nmdel with a cognitive viewpoint, we believe 
an iteration-based model seems more reason- 
able than an algebraic model such as SVD. 
As for the second point, we doubt the ap- 
propriateness to use the word's distribution 
as a measure of combination of two models. 
What we need to do is to distinguish words 
to which semantics should be considered and 
other words. We judged the distinction of con- 
tent words and function words is good enough 
for that purpose, and developed their trigram- 
based distinction as shown in Figure 4. 
Several topic-based models have been pro- 
posed based on the observation that certain 
words tend to have different probability dis- 
tributions in different topics. For example, 
Florian and Yarowsky (1999) proposed the fol- 
lowing model: 
t (9) 
where t denotes a topic id. Topics are 
obtained by hierarchical clustering from a 
training corpus, and a topic-specific  
model, Pt, is learned from the clustered docu- 
ments. Reductions in perplexity relative to a 
bigrmn model were 10.5% for the entire text 
and 33.5% for the target vocabulary. 
Topic-based models capture long-distance 
lexical dependencies via intermediate topics. 
In other words, the estimated distribution of 
topics, P(t\]w~), is the representation of a con- 
text. Our model does not use such interme- 
diate topics, but accesses word cg-occurrence 
information directly aald represents a context 
as the accumulation of this information. 
85 
6 Conclusion 
In this paper we described a novel  
model of incorporating long-distance lexical 
dependencies based on context co-occurrence 
vectors. Reduced vector representation of 
word co-occurrences enables rather simple but 
effective representation of the context. Sig- 
nificant reductions in perplexity are obtained 
relative to a staaldard trigram model: both on 
the entire text. (5.0~) and on the target vo- 
cabulary (27.2%). 
Acknowledgments 
The research described in this paper was sup- 
ported in part. by JSPS-RFTF96P00502 (The 
Japan Society for the Promotion of Science, 
Research for the Future Program). 

References 
Noah Coccaxo and Daniel Jurafsky. 1998. To- 
wards better integration of semantic predictors 
in statistical  modeling. In Proceedings 
of ICSLP-98, volume 6, pages 2403-2406. 
Scott Deem, ester, Susan T. Dumais, George W. 
Furnas, Thomas K. Landauer, and Richard 
Harshmaa~. 1990. Indexing by latent semantic 
analysis. Journal of the American Society for 
Information Science, 41(6):391-407. 
Radu Florian and David Yarowsky. 1999. Dy- 
namic nonlocal  modefing via hierar- 
chical topic-based adaptation. In Proceedings off 
the 37rd Annual Meeting of ACL, pages 167- 
174. 
Pascale FUNG, LIU Xiaohu, mad CHEUNG Chi 
Shun. 1999. Mixed  query disambigua- 
tion. In Proceedings of the 37rd Annual Meeting 
of A CL, pages 333-340. 
Maa'd R. Ilgen and David A. Rushall. 1996. Re- 
cent advances in HNC's context vector informa- 
tion retrieval technology. In TIPSTER PRO- 
GRAM PHASE II, pages 149--158. 
R. Kuhn and IL de Mori. 1990. A cache-based 
natural  model for speech recognition. 
IEEE Transactions on Pattern Analysis and 
Machine Intelligence, 12(6):570-583. 
R. Lau, Ronald Rosenfeld, and Safim Roukos. 
1993. Trigger based  models: a max- 
imum entropy approach. In Proceedings of 
ICASSP, pages 45-48. 
David Yarowsky. 1994. Decision fists for lexical 
ambiguity resolution : Application to accent 
restoration in Spanish and French. In Proceed- 
ings o/the 32nd Annual Meeting of A CL, pages 
88-995. 
