HYPOTHESIZING WORD ASSOCIATIONS FROM UNTAGGED TEXT
Tomoyoshi Matsukawa 
BBN Systems and Technologies 
70 Fawcett St. 
Cambridge, MA 02138 
ABSTRACT 
This paper reports a new method for suggesting word 
associations, based on a greedy algorithm that employs Chi- 
square statistics on joint frequencies of pairs of word groups 
compared against chance co-occurrence. The benefits of this 
new approach are: 1) we can consider even low frequency 
words and word pairs, and 2) word groups and word 
associations can be automatically generated. The method 
provided 87% accuracy in hypothesizing word associations for 
unobserved combinations of words in Japanese text. 
1. INTRODUCTION 
Using mutual information for measuring word association 
has become popular since \[Church and Hanks, 1990\] 
defined word association ratio as mutual information 
between two words. Word association ratios are a 
promising tool for lexicography, but there seem to be at 
least two limitations to the method: 1) much data with low 
frequency words or word pairs cannot be used and 2) 
generalization of word usage still depends totally on 
lexicographers. 
In this paper, we propose an alternative (or extended) 
method for suggesting word associations using Chi-square 
statistics, which can be viewed as an approximation to 
mutual information. Rather than considering significance 
of joint frequencies of word pairs as \[Church and Hanks, 
1990\] did, our algorithm uses joint frequencies of pairs of 
word groups instead. The algorithm employs a hill- 
climbing search for a pair of word groups that occur 
significantly frequently. 
The benefits of this new approach are: 
1) that we can consider even low frequency 
words and word pairs, and 
2) that word groups and word associations can be
automatically generated, yielding automatic
hypotheses of word associations which can
later be reviewed by a lexicographer, and
3) word associations can be used in parsing and 
understanding natural language, as well as in 
natural language generation \[Smadja and 
McKeown, 1990\]. 
Our method proved to be 87% accurate in hypothesizing 
word associations for unobserved combinations of words in 
Japanese text, where accuracy was tested by human 
verification of a random sample of hypothesized word pairs. 
We extracted 14,407 observations of word co-occurrences, 
involving 3,195 nouns and 4,365 verb/argument pairs. Out 
of this we hypothesized 7,050 word associations. The 
corpus size was 280,000 words. We would like to apply 
the same approach to English. 
2. RELATED WORK 
Some previous work (e.g., \[Weischedel, et al., 1991\])
found verb-argument associations from bracketed text, such
as that in TREEBANK; this paper and related work,
however, hypothesize word associations from untagged
text.
\[Hindle 1990\] confirmed that word association ratios can 
be used for measuring similarity between nouns. For 
example, "ship", "plane", "bus", etc., were automatically 
ranked as similar to "boat". \[Resnik 1992\] reported using a
word association ratio to identify noun classes from a pre-
existing hierarchy as selectional constraints on the object
of a verb.
\[Brown et al. 1992\] prove that, under the assumption of a
bi-gram class model, the perplexity of a corpus is 
minimized when the average mutual information between 
word classes is maximized. Based on that fact, they cluster 
words via a greedy search algorithm which finds a local 
maximum in average mutual information. 
Our algorithm considers joint frequencies of pairs of word 
groups (as \[Brown et al. 1992\] do) in contrast to joint
frequencies of word pairs as in \[Church and Hanks, 1990\] 
and \[Hindle 1990\]. Here a word group means any subset of 
the whole set of words. For example, "ship," "plane," 
"boat" and "car" may be a word group. The algorithm will 
find pairs of such word groups. Another similarity to 
\[Brown et al. 1992\]'s clustering algorithm is the use of
greedy search for a pair of word groups that occur 
significantly frequently, using an evaluation function based 
on mutual information between classes. 
On the other hand, unlike \[Brown et al. 1992\], we assume
some automatic syntactic analysis of the corpus, namely 
part-of-speech analysis and at least finite-state 
approximations to syntactic dependencies. Moreover, the 
clustering is done depth first, not breadth first as in \[Brown et
al. 1992\], i.e., clusters are hypothesized one by one, not in
parallel.
3. OVERVIEW OF THE METHOD 
The method consists of three phases: 
1) Automatic part of speech tagging 
of text. First, texts are labeled by our 
probabilistic part of speech tagger (POST) 
which has been extended for Japanese 
morphological processing \[Matsukawa et al.
1993\]. This is fully automatic; human
review is not necessary under the assumption
that the tagger has previously been trained
on appropriate text \[Meteer et al. 1991\].1
2) Finite state pattern matching. 
Second, a finite-state pattern matcher with 
patterns representing possible grammatical 
relations, such as verb/argument pairs, 
nominal compounds, etc. is run over the 
sample text to suggest word pairs which will 
be considered candidates for word 
associations. As a result, we get a word co- 
occurrence matrix. Again, no human review 
of the pattern matching is assumed. 
3) Filtering/Generalization of word 
associations via Chi-square. Third, 
given the word co-occurrence matrix, the 
program starts from an initial pair of word 
groups (or a submatrix in the matrix), 
incrementally adding into the submatrix a 
word which locally gives the highest Chi- 
square score to the submatrix. Then, words
whose removal would increase the Chi-square
score are removed. By adding and
removing words until an appropriate
significance level is reached, we get a
submatrix as a hypothesis of word 
associations between the cluster of words 
represented as rows in the submatrix and the 
cluster of words represented as columns in 
the submatrix. 
4. WORD SEGMENTATION AND PART
OF SPEECH LABELING
1 In our experience thus far in three domains and in both 
Japanese and English, while retraining POST on domain- 
specific data would reduce the error rate, the effect on overall 
performance of the system in data extraction from text has 
been small enough to make retraining unnecessary. The effect 
of domain-specific lexical entries (e.g., DRAM is a noun in 
microelectronics) often mitigates the need to retrain. 
Since word separators such as spaces are not present in
Japanese, text must be segmented into words before part of
speech can be assigned. To do this, we use JUMAN from Kyoto
University to segment Japanese text into words, followed by
AMED, an example-based segmentation corrector, and a hidden
Markov model part-of-speech tagger (POST) \[Matsukawa, et al. 1993\]. For
example, POST processes a Japanese input sentence and
produces tagged text such as the following:2

[The Japanese input and tagged-output examples are not
reproduced here.]
5. FINITE STATE PATTERN 
MATCHING 
We use the following finite state patterns for extracting 
possible Japanese verb/argument word co-occurrences from 
automatically segmented and tagged Japanese text. 
Completely different patterns would be used for English. 
(CN | PN | SN) (CM | PT) ... (VB | SN)
where CN = common noun
PN = proper name
SN = Sa-inflection noun (nominal verb)
CM = case marker (-nom/-acc argument)
PT = particle (other arguments)
VB = verb
Here, the first part (CN, PN or SN) represents a noun. 
Since in Japanese the head noun of a noun phrase is always 
at the right end of the phrase, this part should always 
match a head noun. The second part (CM or PT) represents 
a postposition which identifies an argument of a verb. The 
final pattern element (VB or SN) represents a verb. Sa- 
inflection nouns (SN) are nominalized verbs which form a 
verb phrase with the morpheme "suru." 
2 CONJ = conjunction; TT = Japanese comma;
CN = common noun; TM = topic marker;
PN = proper name; etc.
Figure 1: Examples of Pattern Matches with Skipping over Words.
[The figure shows Japanese text matched at distances 0, 1, 2, and 4
between the argument and the verb; the Japanese examples are not
reproduced.]
Since argument structure in Japanese is marked by 
postpositions, i.e., case markers (i.e., "o," "ga") and 
particles (e.g., "ni," "kara," ...), word combinations
matched with the patterns will represent associations 
between a noun filling a particular argument type (e.g., 
"o") and a verb. Note that topic markers (TM; i.e., "wa") 
and toritate markers (TTM; e.g."mo", "sae", ...) are not 
included in the pattern since these do not uniquely identify 
the case of the argument. 
Just as in English, the arguments of a verb in Japanese 
may be quite distant from the verb; adverbial phrases and 
scrambling are two cases that may separate a verb from its 
argument(s). We approximate this in a finite state machine 
by allowing words to be skipped. In our experiment, up to 
four words could be skipped. As shown in Figure 1, 
matching an argument structure varies from distance 0 to 4. 
By limiting the algorithm to a maximum of four word 
gaps, and by not considering the ambiguous cases of topic 
markers and toritate markers, we have chosen to limit the
cases considered in favor of high accuracy in automatically 
hypothesizing word associations. \[Brent, 1991\] similarly 
limited what his algorithm could learn in favor of high 
accuracy. 
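The matching described above can be sketched as follows. This is a hypothetical illustration, not BBN's implementation; the tag names follow the pattern given in Section 5, and the romanized words in the comments and tests are invented examples.

```python
# Hypothetical sketch of the finite-state matching with word skipping.
# Tag names follow Section 5; the tagged words are invented examples.

NOUN_TAGS = {"CN", "PN", "SN"}   # head noun: common, proper, or sa-noun
POSTP_TAGS = {"CM", "PT"}        # case marker or particle
VERB_TAGS = {"VB", "SN"}         # verb, or sa-noun forming a verb with "suru"
MAX_SKIP = 4                     # up to four words may be skipped

def match_cooccurrences(tagged):
    """tagged: list of (word, tag) pairs; returns (noun, postp, verb) triples."""
    triples = []
    for i in range(len(tagged) - 2):
        noun, ntag = tagged[i]
        postp, ptag = tagged[i + 1]
        if ntag not in NOUN_TAGS or ptag not in POSTP_TAGS:
            continue
        # allow up to MAX_SKIP intervening words before the verb
        for dist in range(MAX_SKIP + 1):
            j = i + 2 + dist
            if j >= len(tagged):
                break
            verb, vtag = tagged[j]
            if vtag in VERB_TAGS:
                triples.append((noun, postp, verb))
                break
    return triples
```

For instance, the tagged sequence [("kaisha", "CN"), ("o", "CM"), ("kyou", "ADV"), ("setsuritsu", "SN")] would yield the triple ("kaisha", "o", "setsuritsu") at distance 1.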
6. FILTERING AND 
GENERALIZATION VIA CHI-SQUARE 
Word combinations found via the finite state patterns 
include a noun, postposition, and a verb. A two 
dimensional matrix (a word co-occurrence matrix) is 
formed, where the columns are nouns, and the rows are 
pairs of a verb plus postposition. The cells of the matrix
are the frequencies of the noun (column element) co-
occurring in the given case with that verb (row element).
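A minimal sketch of how such a co-occurrence matrix might be assembled from matched (noun, postposition, verb) triples; the data structure is illustrative only, as the paper does not specify one.

```python
from collections import defaultdict

def build_matrix(triples):
    """Build the word co-occurrence matrix: rows are (verb, postposition)
    pairs, columns are nouns, and cells are co-occurrence frequencies."""
    matrix = defaultdict(lambda: defaultdict(int))
    for noun, postp, verb in triples:
        matrix[(verb, postp)][noun] += 1
    return matrix
```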
Starting from a submatrix, the algorithm successively adds 
to the submatrix the word with the largest Chi-square score
among all words outside the submatrix. Words are added 
until a local maximum is reached. Finally, the 
appropriateness of the submatrix as a hypothesis of word 
associations is checked with heuristic criteria based on the 
sizes of the row and the column of the submatrix. 
Currently, we use the following criteria for appropriateness 
of a submatrix: 
LET l : size of row of submatrix
m : size of column of submatrix
C1, C2, C3 : parameters
IF l > C1, and
m > C1, and
(l > C2 or m/l < C3), and
(m > C2 or l/m < C3)
THEN the submatrix is appropriate.
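The criteria above can be written directly as a small predicate. This sketch uses the parameter values given later in this section (C1=2, C2=10, C3=10); l and m are the sizes of the submatrix's rows and columns.

```python
# Sketch of the appropriateness test from the text, with the paper's
# parameter values: C1 = 2, C2 = 10, C3 = 10.

C1, C2, C3 = 2, 10, 10

def is_appropriate(l, m):
    """l = number of rows, m = number of columns of the submatrix."""
    return (l > C1 and m > C1
            and (l > C2 or m / l < C3)
            and (m > C2 or l / m < C3))
```

The last two conditions reject long, thin submatrices: a small dimension is acceptable only if the other dimension is not disproportionately large.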
For any submatrix found, the co-occurrence observations 
for the clustered words are removed from the word co- 
occurrence matrix and treated as a single column of 
clustered nouns and a single row of clustered verb plus case 
pairs. Currently, we use the following values for the 
parameters: C1=2, C2=10, and C3=10. 
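As an illustration of the greedy search, one might grow a submatrix by repeatedly adding the word that most increases the scaled mutual information term I(X, Y) defined in the appendix. This is a hedged sketch, not the actual BBN program: the removal phase, the significance stopping test, and the choice of initial submatrix are omitted.

```python
import math

# Hedged sketch of the greedy clustering (not BBN's program): grow a
# submatrix by repeatedly adding the outside word that most increases
# the scaled mutual information term I(X, Y).

def score(counts, rows, cols):
    """Scaled MI term for the submatrix (rows, cols); counts maps
    (row_word, col_word) -> co-occurrence frequency."""
    n = sum(counts.values())
    f_rows = sum(v for (r, c), v in counts.items() if r in rows)
    f_cols = sum(v for (r, c), v in counts.items() if c in cols)
    f_joint = sum(v for (r, c), v in counts.items()
                  if r in rows and c in cols)
    if f_joint == 0 or f_rows == 0 or f_cols == 0:
        return 0.0
    p = f_joint / n
    return p * math.log(p / ((f_rows / n) * (f_cols / n)))

def grow(counts, rows, cols):
    """Greedily add row/column words until no addition improves I(X, Y)."""
    all_rows = {r for r, _ in counts}
    all_cols = {c for _, c in counts}
    rows, cols = set(rows), set(cols)
    improved = True
    while improved:
        improved = False
        best = score(counts, rows, cols)
        candidate = None
        for r in all_rows - rows:
            s = score(counts, rows | {r}, cols)
            if s > best:
                best, candidate, improved = s, ("row", r), True
        for c in all_cols - cols:
            s = score(counts, rows, cols | {c})
            if s > best:
                best, candidate, improved = s, ("col", c), True
        if improved:
            kind, w = candidate
            (rows if kind == "row" else cols).add(w)
    return rows, cols
```

Starting from a seed pair such as ({"v1"}, {"n1"}) on a matrix where v1 and v2 co-occur with n1 and n2, the loop absorbs v2 and n2 and then stops, since adding unrelated words would lower the score.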
Table 1 shows an example of clustering starting from the
initial submatrix shown in Figure 2. The words in Figure
2 were manually selected as words meaning "organization."
In Table 1, the first (leftmost) column indicates the word
which was added to the submatrix at each step. The second
column gives an English gloss of the word. The third
column reports f(X,Y), the frequency of co-occurrence
between the added word and the words already in the
submatrix. For example, the first line of the table shows
that the word glossed "establish/-acc" co-occurred with the
"organization" words 26 times. The rightmost column
specifies I(X,Y), the scaled mutual information between the 
rows and columns of the submatrix. As the clustering 
proceeds, I(X,Y) gets larger. 
company, headquarters, organization, corporation, both
companies, school, the company, child company, bank,
department store, agency, coop., business company, city
bank, stand, trust bank, branch, credit association, head
store, university, each company, department store,
agriculture cooperative, maker, book store, TV station,
agency, supermarket, joint-stock corporation, doctor's
office, all stores

Figure 2: The initial word group (submatrix) for the
clustering shown in Table 1 (English glosses only; the
Japanese word forms are not reproduced).
Word added (gloss)        Freq    I
establish/-acc              26    0.11
tie-up/with                 25    0.19
tie-up/-nom                 18    0.25
unite/with                  11    0.29
cooperate/-nom               7    0.32
possess/-nom                 8    0.35
unite/-nom                   7    0.38
advance/-nom                 6    0.40
in succession                5    0.43
proceed/-nom                 4    0.44
purchase/-acc                5    0.46
entrust/-acc                 6    0.47
produce/-nom                 6    0.49
develop/-nom                 7    0.51
invest/-nom                  6    0.52
expand/with                  3    0.54
develop/-nom                 3    0.55
publish/-nom                 4    0.56
agree/-nom                   3    0.58
demand/from                  3    0.59
invest/in                    5    0.60
sell/-nom                    7    0.61
purchase/-nom                3    0.63
open/-acc                    4    0.64
introduce/from               3    0.65
create/-nom                  3    0.66
utilize/at                   3    0.67
limit/to                     3    0.68
treat/-nom                   3    0.69
connect/-nom                 3    0.69
do/-nom                      5    0.70
exclude/-acc                 3    0.71
oppose/to                    3    0.71
sign/-copula                 3    0.72
sell/to                      4    0.72
participate/in               4    0.72
corporation                  9    0.74
major                        5    0.75
Japan                        5    0.77
Nisho-Iwai                   4    0.78
three parties                3    0.79
Drug Company                 4    0.80
Sony                         5    0.81
dealer                       5    0.81
Institution                  5    0.82
Honda                        4    0.83
Mitsubishi                   3    0.83
AT&T                         3    0.84
Air Line                     4    0.84
respectively                 3    0.85
Honda                        3    0.85
Bank                         7    0.85
Air Line                     6    0.85
Trust Company                4    0.85
Steel Company                4    0.85
Table 1: Example of Clustering (glosses only; the Japanese
word forms are not reproduced)
7. EVALUATION 
Using 280,000 words of Japanese source text from the 
TIPSTER joint ventures domain, we tried several 
variations of the initial submatrices (word groups) from 
which the search in step three of the method starts: 
a) complete bipartite subgraphs, 
b) pre-classified noun groups and 
c) significantly frequent word pairs. 
Based on the results of the experiments, we concluded that 
alternative (b) gives both the most accurate word 
associations and the highest coverage of word associations. 
This technique is practical because classification of nouns 
is generally much simpler than that of verbs. We don't 
propose any automatic algorithm to accomplish noun 
classification, but instead note that we were able to 
manually classify nouns in less than ten categories at about 
500 words/hour. That productivity was achieved using our 
new tool for manual word classification, which is partially 
inspired by EDR's way of classifying their semantic lexical 
data \[Matsukawa and Yokota, 1991\].
Based on a corpus of 280,000 words in the TIPSTER joint 
ventures domain, the most frequently occurring Japanese 
nouns, proper nouns, and verbs were automatically 
identified. Then, a student classified each frequently
occurring noun into one of the twelve categories in (1)
below, and each frequently occurring proper noun into one
of the four categories in (2) below, using a menu-based
tool; in this way we were able to categorize 3,195 lexical
entries in 12 person-hours. 3 These categories were then used as input to
the word co-occurrence algorithm. 
1. Common noun categories 
1a. Organization
CORPORATION 
GOVERNMENT 
UNDETERMINED-CORPORATION 
OTHER-ORGANIZATION 
1b. Location
CITY 
COUNTRY 
PROVINCE 
3 We divided the process of classifying common nouns into
two phases: classification into the four categories 1a, 1b, 1c
and 1d, and further classification into the twelve categories. As
a result, each word was checked twice. We found that using two 
phases generally improves both overall productivity and 
consistency. 
OTHER-LOCATION
1c. Person
ENTITY-OFFICER
TITLE
OTHER-PERSON
1d. Other
2. Proper noun categories 
ORGANIZATION 
LOCATION 
PERSON 
OTHER 
Using the 280,000 word joint venture corpus, we collected 
14,407 word co-occurrences, involving 3,195 nouns and 
4,365 verb/argument pairs, using the finite state patterns given
in Section 5. Sixteen submatrices were clustered, grouping 810
observed word co-occurrences and 6,240 unobserved (or
hypothesized) word co-occurrences. We evaluated the
accuracy of the system by manual review of a random
sample of 500 hypothesized word co-occurrences. Of these,
435, or 87%, were judged reasonable. This ratio compares
favorably with a random sample of 500 arbitrary word co-
occurrences between the 3,195 nouns and the 4,365
verb/argument pairs, of which only 153 (44%) were judged
reasonable. Table 2 below shows some examples judged
reasonable; questionable examples are marked by "?"; 
unreasonable hypotheses are marked with an asterisk. 
With a small corpus (280,000 words) such as ours, 
considering low frequency co-occurrences is critical.
Looking at Table 3 below, if we had to ignore co- 
occurrences with frequency less than five (as \[Church and 
Hanks 1990\] did), there would be very little data. With our 
method, as long as the frequency of co-occurrence of the 
word being considered with the set is greater than two, the 
statistic is stable. 
Frequency Number of 
Word Pairs 
0 6240 
1 631 
2 113 
3 36 
4 18 
5 4 
6 2 
7 3 
9 1 
10 1 
16 1 
Table 3: Pair Frequencies 
8. CONCLUSION 
Our method achieved fully automatic hypothesis of word 
associations, starting from untagged text and generalizing 
to unobserved word associations. As a result of human 
review, 87% of the hypotheses were judged to be
reasonable. Because the technique considers low frequency 
cases, most of the data was used in making generalizations. 
It remains to be determined how well this method will 
work for English, but with appropriate finite state patterns, 
similar results may be achieved. 
(owner) (take office/as)
(AT&T) (introduce/from)
(metropolitan) (build/at)
(personnel) (dispatch/-acc)
(Committee) (unite/with)
(library) (sell/-nom)
(Company) (organize/-acc)
(agency) (publish/-nom)
(post office) (tie-up/with)
(State) (develop/to)
(Canon) (enter/-acc)
(doctor's office) (limit/to)
(nations) (have/in)
(Nomura) (produce/-nom)
(station employee) (take office/-nom)
(DRAM) (unite/-nom)
(Switzerland) (see/-nom)
(director) (announce/to)
Table 2: Examples of reasonable hypothesized co-
occurrences (glosses only; the Japanese word forms are not
reproduced)
ACKNOWLEDGMENTS 
The author wishes to thank Madeleine Bates, Ralph 
Weischedel and Sean Boisen for significant contributions to 
this paper. 
REFERENCES
1. Brent, M.R. (1991) "Automatic Acquisition of
Subcategorization Frames from Untagged Text,"
Proceedings of the 29th Annual Meeting of the ACL,
pp. 209-214.
2. Brown, P.F., et al. (1992) "Class-based N-gram
Models of Natural Language," Computational
Linguistics Vol. 18 (4), pp. 467-479.
3. Church, K. and Hanks, P. (1990) "Word Association
Norms, Mutual Information, and Lexicography,"
Computational Linguistics Vol. 16 (1), pp. 22-29.
4. Hindle, D. (1990) "Noun Classification from
Predicate-Argument Structures," Proceedings of the
28th Annual Meeting of the ACL, pp. 268-275.
5. Hoel, P.G. (1971) Introduction to Mathematical
Statistics, Chapter 9.2.
6. Resnik, P. (1992) "A Class-based Approach to Lexical
Discovery," Proceedings of the 30th Annual Meeting
of the ACL, pp. 327-329.
7. Smadja, F.A. and McKeown, K.R. (1990)
"Automatically Extracting and Representing
Collocations for Language Generation," Proceedings
of the 28th Annual Meeting of the ACL, pp. 252-259.
8. Matsukawa, T., Miller, S. and Weischedel, R. (1993)
"Example-based Correction of Word Segmentation and
Part of Speech Labelling," Proceedings of the DARPA
Human Language Technologies Workshop.
9. Matsukawa, T. and Yokota, E. (1991) "Development
of the Concept Dictionary - Implementation of Lexical
Knowledge," Proceedings of the Pre-conference
Workshop sponsored by the Special Interest Group on
the Lexicon (SIGLEX) of the Association for
Computational Linguistics.
10. Weischedel, R., et al. (1991) "Partial Parsing: A
Report on Work in Progress," Proceedings of the
Workshop on Speech and Natural Language, pp. 204-210.
APPENDIX: JUSTIFICATION OF
CHI-SQUARE
The Chi-square score is given by the following formula:

I(X~, Y~) = SUM over X, Y of I(X, Y),
I(X, Y) = p(X, Y) log [ p(X, Y) / ( p(X) p(Y) ) ]        (0)

where X~, Y~ = the columns and rows of a word co-occurrence
matrix, and X, Y = subsets of X~, Y~, respectively
(i.e., word classes at the columns and the rows).

This can be justified as follows.
According to \[Hoel 1971\], the likelihood ratio LAMBDA
for a test of the hypothesis p(i) = po(i) (i = 1, 2, ..., k),
where p(i) is the probability of case i and po(i) is a
hypothesized probability of it, when observations are
independent of each other, satisfies:

-2 log LAMBDA = 2 SUM(i=1..k) n(i) log [ n(i) / e(i) ]        (1)

where n(i) is the number of observations of case i, and e(i)
is its expectation, i.e., e(i) = n po(i), where n is the total
number of observations.
The distribution is chi-square when n is large. If we 
assume two word classes, ci and cj, occur independently,
then the expected number of their co-occurrences will be
e(ci, cj)= n p(ci) p(cj) (2) 
where p(ci) and p(cj) are estimations of the probability of 
occurrence of ci and cj. The maximum likelihood estimate 
of p(ci) and p(cj) is f(ci)/n and f(cj)/n, where f(ci) and f(cj)
are the number of observations of words classified in ci and 
cj. The maximum likelihood estimate of p(ci, cj), the 
probability of the co-occurrences of words in ci and cj, is 
f(ci, cj)/n, where f(ci, cj) is the number of observations of 
the co-occurrences. Then the number of the co-occurrences 
n(ci, cj) (which is the same as f(ci, cj) ) can be represented 
as, 
n(ci, cj)= n p(ci, cj) (3) 
Therefore, given k classes, c1, c2, ..., ck, substituting (2)
and (3) into (1) gives:

2 SUM(i) SUM(j) n p(ci, cj) log [ p(ci, cj) / ( p(ci) p(cj) ) ]        (4)
If n is large, this will have a chi-square distribution; 
therefore, we can estimate how unlikely our assumption of 
independence among word classes is. Since formula (4) 
gives a scaled average mutual information among the word 
classes, searching for a partition of words that provides 
maximum average mutual information among word classes 
is equivalent to seeking classes where independence among 
word classes is minimally likely. The algorithm reported 
in this paper searches for pairs of word classes which 
provide a local maximum I(X, Y), a term in the 
summation of formula (0). 
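As an illustrative numeric check (not from the paper), the equality of the likelihood-ratio statistic in formula (1) and the scaled average mutual information in formula (4) can be verified on a small contingency table; the counts below are invented.

```python
import math

# Check that formula (1), the log-likelihood ratio statistic, equals
# formula (4), 2n times the average mutual information, on a 2x2 table
# of (invented) word-class co-occurrence counts.

f = [[30, 10],   # f[i][j] = co-occurrences of row class i, column class j
     [20, 40]]
n = sum(sum(row) for row in f)
row_tot = [sum(row) for row in f]
col_tot = [sum(f[i][j] for i in range(2)) for j in range(2)]

# Formula (1): -2 log LAMBDA = 2 sum n(i,j) log(n(i,j) / e(i,j)),
# with e(i,j) = n p(ci) p(cj) under the independence hypothesis.
g2 = 2 * sum(f[i][j] * math.log(f[i][j] * n / (row_tot[i] * col_tot[j]))
             for i in range(2) for j in range(2))

# Formula (4): 2 sum n p(ci,cj) log(p(ci,cj) / (p(ci) p(cj))) = 2 n MI
mi = sum((f[i][j] / n) * math.log((f[i][j] / n)
         / ((row_tot[i] / n) * (col_tot[j] / n)))
         for i in range(2) for j in range(2))

assert abs(g2 - 2 * n * mi) < 1e-9
```

The two expressions agree term by term, which is why maximizing the average mutual information is equivalent to finding the partition under which independence is least likely.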
