CATCHING THE CHESHIRE CAT 
Christer Johansson 
Dept. of Linguistics, Lund University, Sweden 
email: Christer.Johansson@ling.lu.se
ABSTRACT 
Finding useful phrases is important in applications like information
retrieval and text-to-speech systems. One of the most widely used
statistics at present is the mutual information ratio. This paper
compares the mutual information ratio with a measure that takes
temporal ordering into account. Using this modified measure, some
local syntactic constraints as well as phrases are captured.
INTRODUCTION 
In Alice's Adventures in Wonderland by Lewis Carroll, many of Alice's
friends have names that consist of two words, for example: the March
Hare, the Mock Turtle, and the Cheshire Cat. The individual words in
these combinations, if we ignore capitalisation, might be quite common.
Individual words usually mean different things when they are free. For
example, in "The March against Apartheid" and "The March Hare",
"march" means totally different things. There is obviously a strong
link between "the" and "march", but the link between "march" and
"hare" is definitely stronger, at least in Lewis Carroll's text.
The goal of this paper is to propose a statistic that measures the
strength of such glue between words in a sampled text. Finding the
names of Alice's friends can be done by searching for two adjacent
words with initial capital letters.
One use of statistical associations could be to find translatable
concepts and phrases that might be expressed with a different number
of words in another language. Another possibly interesting use of
statistical associations is to predict whether words constitute new or
given information in speech. It has been proposed (e.g. Horne &
Johansson, 1993) that the stress of words in speech is highly
dependent on the informational content of the word. Also, statistical
associations are not incompatible with the first stages of the
"hypothesis space" proposed by Processability Theory (personal
communication with Manfred Pienemann of Sydney University; see also
Meisel & al., 1981).
There are different methods of calculating statistical associations.
Yang & Chute (1992) showed that a linear least squares mapping of
natural language to canonical terms is both feasible and a way of
detecting synonyms. Their method does not, however, seem to detect
dependencies in the order of words. To do this we need a measure that
is sensitive to the order between words. In this paper we will use a
variant of mutual information that derives from Shannon's theory of
information (as discussed in e.g. Salton & McGill, 1983).
Definitions and assumptions 
Defining a word in a meaningful way is far from easy, but a working
definition, for technical purposes, is to assume that a word equals a
string of letters. These 'words' are separated by non-letters. The
case of letters is ignored, i.e. all letters are converted into lower
case. For example, "there's" is two 'words': "there" and "s".
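
The working definition above can be sketched in a few lines of Python. This is only an illustration, not the program used in the study: the function name `tokenize` is hypothetical, and "letters" is assumed here to mean the ASCII letters a-z.

```python
import re

def tokenize(text):
    """Split text into 'words': maximal runs of letters, lower-cased."""
    # Assumption: only ASCII letters count as letters; every other
    # character (apostrophes, digits, punctuation) separates words.
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("There's the March Hare!"))
# ['there', 's', 'the', 'march', 'hare']
```

Under this definition the apostrophe in "there's" acts as a separator, which is why "s" appears as a word of its own in the tables.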
A collocation consists of a word and the word that immediately
follows. Index 1 will refer to the first word and 2 to the second
word. Index 12 will refer to word 1 followed by word 2, and similarly
for 21.
Another assumption is that natural language is more predictive in the
(left-to-right) temporal order than in the reversed order. This is
motivated by the simple observation that speech comes into the system
through the ears serially. For example, consider the French phrase "un
bon vin blanc" (lit. "a good wine white"). "Bon" can (relatively
often) be followed by "vin", but "vin" is usually not followed by
"bon". The same kind of link exists between "vin" and "blanc", but not
between "blanc" and "vin". This linking affects the intonation of
French phrases, and intonation in turn supports these kinds of links.
Note that this is not an explanation of either intonation or syntax:
we most likely have to consider massive interaction between different
modalities of language.
Deriving the measure 
The mutual information ratio, μ, provides a rough estimation of the
glue between words. It measures, roughly, how much more common a
collocation is in a text than can be accounted for by chance. This
measure does not assume any ordering between the words making up a
collocation, in the sense that the μ-measures of [w1...w2] and
[w2...w1] are calculated as if they were unrelated collocations.
The mutual information ratio (in Steier & Belew, 1991) is expressed:
$$\mu = \log_2 \frac{p([w_1 \ldots w_2])}{p(w_1)\,p(w_2)}$$
Formula 1: The mutual information ratio
where 'p' denotes the probability function, and p([w1...w2]) is read
as "the probability of finding word w2 after word w1".
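
As a worked illustration, the ratio can be computed directly from counts. The sketch below assumes the usual form of the ratio, log2 of p([w1...w2]) over p(w1) p(w2), and assumes that every probability is estimated as a relative frequency over the number of words in the sample; the function and parameter names are hypothetical.

```python
import math

def mutual_information_ratio(f12, f1, f2, n):
    """log2( p([w1...w2]) / (p(w1) * p(w2)) ) from raw counts.

    f12: frequency of w1 immediately followed by w2
    f1, f2: frequencies of the individual words
    n: number of words in the sample (assumed normaliser for all three)
    """
    p12 = f12 / n
    p1 = f1 / n
    p2 = f2 / n
    return math.log2(p12 / (p1 * p2))

# 'to->to' in Table 2.2: F1 = F2 = 729, F12 = 1, in a text of 27332 words
print(round(mutual_information_ratio(1, 729, 729, 27332), 2))  # -4.28
```

With these assumptions the sketch reproduces the value listed for "to->to" in Table 2.2.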
Adjusting for order between words 
We have experimented with the difference in mutual information, Δμ,
between the two different orderings of two words making up a
collocation. The results indicate that Δμ captures some of the local
constraints in a sampled text. Δμ can be expressed:
$$\Delta\mu = \log_2 \frac{p([w_1 \ldots w_2])}{p(w_1)\,p(w_2)} - \log_2 \frac{p([w_2 \ldots w_1])}{p(w_2)\,p(w_1)} = \log_2 \frac{F([w_1 \ldots w_2])}{F([w_2 \ldots w_1])}$$
Formula 2: The difference in mutual information
where F([wx...wy]) denotes the frequency with which wx and wy
co-occur, in that order, in the sample. F(wx) is the frequency of word
wx. Note that the size of the sample cancels in this equation. Note
also that this measure is not sensitive to the individual
probabilities of the words.
A problem arises when F([w2...w1]) is zero, i.e. when the reversed
pair never occurs. In these cases, we have chosen to arbitrarily set
F([w2...w1]) to 0.1, with the justification that if the sample were
ten times larger we might have found at least one such pair.
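
A minimal sketch of this calculation, including the substitution of 0.1 for a reversed pair that never occurs, might look as follows. The function name `delta_mu` and the parameter name `floor` are hypothetical.

```python
import math

def delta_mu(f12, f21, floor=0.1):
    """Difference in mutual information: log2(F12 / F21).

    f12: frequency of w1 immediately followed by w2 (at least 1)
    f21: frequency of w2 immediately followed by w1
    floor: value used when the reversed pair was never observed
    """
    if f21 == 0:
        f21 = floor  # arbitrary choice described in the text
    return math.log2(f12 / f21)

print(round(delta_mu(4, 0), 2))   # 5.32
print(round(delta_mu(1, 35), 2))  # -5.13
```

A pair seen four times in one order and never in the other scores log2(4/0.1), while a pair seen once against 35 reversed occurrences scores log2(1/35). The individual word frequencies play no part.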
MATERIAL 
The material is Alice's Adventures in Wonderland by Lewis Carroll,
available in electronic format via email from the Gutenberg Project.
The text contains 27332 words, of which 2576 are unique, making up a
total of 14509 unique word pairs. Alice in Wonderland was chosen
because it is a well-known text, it contains some phrases that we know
are in there (e.g. March Hare), and it contains a sufficient number of
words, and variations of words, to be interesting for the experiment.
Studies could be done for other collections of texts, e.g. medical
abstracts. As more documents are available, comparisons between
documents can be done (Steier & Belew, 1991). This experiment only
contains within-text comparisons of phrases for one specific text.
METHOD 
For each of the unique words in the text, the frequencies of all
immediately following words were collected. In this study, no
filtering of the text was performed. Some initial experiments were
performed with a stoplist, to remove function words and some other
common words (see Fox, 1992, for details). Some simple stemming was
also tried, e.g. removing 's' and 'ed' from the end of words. Stemming
may lead to difficulties in distinguishing compounds from noun-verb
complexes. It is not clear whether the pros of using stemming outweigh
the cons; consequently, we decided to work with the raw text.
Stoplists and stemming might be more important when the ordinary
μ-measure is used.
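
The collection step can be sketched as below. This is an illustrative reading of the procedure, not the original program: the function name `follower_counts` is hypothetical, words are assumed to be runs of ASCII letters, and no stoplist or stemming is applied.

```python
import re
from collections import Counter, defaultdict

def follower_counts(text):
    """Count, for every word, how often each word immediately follows it."""
    # Assumption: a word is a run of ASCII letters, lower-cased.
    words = re.findall(r"[a-z]+", text.lower())
    word_freq = Counter(words)
    followers = defaultdict(Counter)
    for w1, w2 in zip(words, words[1:]):
        followers[w1][w2] += 1
    return word_freq, followers

sample = "The March Hare and the Mock Turtle met the March Hare"
word_freq, followers = follower_counts(sample)
print(word_freq["the"])             # 3
print(followers["the"]["march"])    # 2
print(followers["march"]["hare"])   # 2
```

`word_freq` corresponds to the F(w) counts and `followers[w1][w2]` to F([w1...w2]) in the formulas.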
RESULTS 
The collocations were ordered differently by the two measures. The
μ-measure was sensitive to individual frequencies, and favoured very
low frequency collocations. The Δμ-measure was sensitive to the
ordering of the words, and favoured high frequency collocations that
only occurred in one order. The quality of the different measures can
be seen by comparing the top ten and last ten collocations under each
measure. Tables 1.1 and 2.1 refer to Δμ, and Tables 1.2 and 2.2 refer
to μ. The N column gives the rank number of the collocation. Note that
the frequencies of the individual words, F1 and F2, are not used to
compute Δμ; they are only provided for comparison with the μ-measure.
Note that the numerical values of the μ-measure and the Δμ-measure
cannot be directly compared, since they measure slightly different
phenomena.
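
The different behaviour of the two rankings can be reproduced on a toy word list. The sketch below is an assumption-laden illustration: the function name `rank_pairs` is hypothetical, probabilities are taken as relative frequencies over the number of words, and the eleven-word sample is invented for the example.

```python
import math
from collections import Counter

def rank_pairs(words):
    """Return the adjacent word pairs ordered by each of the two measures."""
    n = len(words)
    word_freq = Counter(words)
    pair_freq = Counter(zip(words, words[1:]))
    mu, dmu = {}, {}
    for (w1, w2), f12 in pair_freq.items():
        # Mutual information ratio from relative frequencies.
        mu[(w1, w2)] = math.log2(
            (f12 / n) / ((word_freq[w1] / n) * (word_freq[w2] / n)))
        # Reversed pair frequency, replaced by 0.1 when never observed.
        f21 = pair_freq.get((w2, w1), 0) or 0.1
        dmu[(w1, w2)] = math.log2(f12 / f21)
    by_mu = sorted(mu, key=mu.get, reverse=True)
    by_dmu = sorted(dmu, key=dmu.get, reverse=True)
    return by_mu, by_dmu

words = "the cat saw the cat and the dog saw the cat".split()
by_mu, by_dmu = rank_pairs(words)
print(by_mu[0])    # ('dog', 'saw')
print(by_dmu[0])   # ('the', 'cat')
```

In the toy sample the first ordering puts a pair of rare words on top, while the second puts the most frequent pair that occurs in only one order on top.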
Table 1.1: The top ten collocations by Δμ
 N  word pair             Δμ    F1    F2   F12  F21
 1  said->the              ?   462  1642   210    0
 2  [illegible]->the       ?     ?     ?     ?    ?
 3  in->a               9.92   369   632    97    0
 4  and->the            9.70     ?     ?     ?    ?
 5  in->the             9.64   369  1642    80    0
 6  to->the             9.43     ?     ?     ?    ?
 7  don->t              9.25     ?     ?     ?    ?
 8  as->she             9.25     ?     ?     ?    ?
 9  a->little           9.20   632   128     ?    ?
10  she->had            9.20   552   178     ?    ?
Δμ gives a measure of local links between words. As can be seen from
Table 1.1, Δμ captures local constraints: that prepositions are
usually followed by a noun phrase, and that 'and' is usually used as a
noun co-ordinator (indicated by the high value for 'and->the').
Mitjushin (1992) has proposed similar links on a higher syntactic
level, using a rule-based approach. We have deliberately tried to
avoid talking about word-classes, since it is misleading at this level
of analysis. However, we get many examples of good representatives for
word-classes that form collocations.
Table 1.2: The top ten collocations by μ
 N  word pair               μ   F1   F2  F12
 1  wooden->spades       14.7    1    1    1
 2  various->pretexts    14.7    1    1    1
 3  uncommonly->fat      14.7    1    1    1
 4  [illegible]          14.7    1    1    1
 5  tittered->audibly    14.7    1    1    1
 6  [illegible]          14.7    1    1    1
 7  tide->rises          14.7    1    1    1
 8  tart->custard        14.7    1    1    1
 9  steam->engine        14.7    1    1    1
10  [illegible]          14.7    1    1    1
The flavour of the collocations that μ rates highly is different. As
can be seen from Table 1.2, low individual frequencies result in a
high μ-value, even if the collocation occurs only once. This gives an
illusion of a semantic relation, which is due to the fact that low
frequency words are usually high in content. The μ-measure is useful
when we are interested in the correlation between words within and
between documents (Steier & Belew, 1991). This notion could be
expanded upon to incorporate correlation between any two words in
general, and it seems to work well for the μ-measure (Wettler and
Rapp, 1989).
The last ten collocations. Δμ is sensitive to deviation from an
expected ordering in the sample. The negative valued link between
these words makes a phrase boundary between the two words probable.
Table 2.1: The last ten collocations by Δμ
    N  word pair             Δμ   F1    F2  F12  F21
14500  caterpillar->the   -4.70   28  1642    1   26
14501  mouse->the             ?    ?     ?    ?    ?
14502  s->it              -4.81  201     ?    ?   56
14503  s->that            -5.09  201   315    1   34
14504  dormouse->the      -5.13   40  1642    1   35
14505  queen->the         -5.17   75  1642    2   72
14506  she->and               ?    ?     ?    ?    ?
14507  was->she               ?    ?     ?    ?    ?
14508  m->i               -5.86   63   545    1   58
14509  was->it                ?    ?     ?    ?    ?
The μ-measure, in contrast, gives some collocations that are
intuitively unlikely phrases consisting of high frequency words. In
the case of "the->the" there exist 1641 pairs that speak against that
pairing, but it is hard to explain this in terms of local syntactic
constraints. The negative scores seem to capture possible typographic
errors.
Table 2.2: The last ten collocations by μ
    N  word pair       μ    F1    F2  F12
14500  she->of     -3.37   552   513    1
14501  to->and     -3.54   729   872    2
14502  a->i        -3.66   632   545    1
14503  and->of     -4.03   872   513    1
14504  i->and      -4.12   545   872    1
14505  she->and    -4.14   552   872    1
14506  to->to      -4.28   729   729    1
14507  and->and    -4.80   872   872    1
14508  i->the      -5.03   545  1642    1
14509  the->the    -6.62  1642  1642    1
Particle verbs 
Particle verbs are hard to rank high for the μ-measure, because the
individual frequencies of the particles are usually devastatingly
high, and the frequency of the main verb in particle verb
constructions is usually higher than average. The Δμ-measure is, in
general, good at finding such combinations if the order between the
two words is fixed (Table 3.1).
Table 3.1: Some verb + particle (or negation)
word pair      Nμ   NΔμ     μ    Δμ
did->not     3961    33  6.39  8.13
seemed->to   6818    54  4.80  7.64
must->be     4038    58  6.32  7.57
looked->at   5211    72  5.61  7.41
Finding thematic phrases 
But what about finding Alice's friends? Does Δμ find the phrases that
the text is about (i.e. thematic phrases)? To test this we chose some
of the names of Alice's friends (Table 3.2).
We found that Δμ ranks all the checked friends higher (i.e. gives them
a smaller rank number) than the μ-measure does. This is due to the
frequency effects discussed above.
Table 3.2: Alice's Friends
word pair          Nμ   NΔμ     μ
mock->turtle     1517    12  8.86
march->hare      1003    28  9.65
white->rabbit    1637    47  8.62
cheshire->cat    1360   473  9.04
the->queen       8519   831  4.00
the->dormouse    8841   832  3.86
[illegible]      8463  2954  4.03
What is lost 
There are obviously good phrases that μ rates higher than Δμ. These
usually consist of two words that are uncommon in the sample. Some
idioms are of this kind. The Δμ-measure needs to find more examples of
a collocation with the exact ordering between the constituents in
order to rate the collocation high (Table 3.3).
Table 3.3: Some collocations with Nμ < NΔμ
word pair          Nμ   NΔμ     μ    Δμ
yer->honour       172   705  12.7  5.32
young->lady       230  1073  12.4  4.91
guinea->pigs      398   645  11.6  5.32
rose->tree        459  1114  11.3  4.91
fast->asleep      460  1115  11.3  4.91
note->book        462  2501  11.3  4.32
raving->mad       597  2500     ?  4.32
cheshire->cats   1925  4468  8.23  3.32
Adding memory 
We have also done some experiments with adding memory to the method. A
'memory' could, for example, extend 10 words after each word. All
words following within a distance equal to the size of the memory were
collected. Adding a memory allowed the model to detect shared
information between words that were further apart (for example "pack
of cards" or "boots and shoes").
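
One way to read the 'memory' is as a sliding window over the words that follow each word. The sketch below is an illustration under that reading; the function name `windowed_pair_counts` and the parameter name `memory` are hypothetical, and a window of 3 is used only to keep the example small.

```python
from collections import Counter

def windowed_pair_counts(words, memory=10):
    """Count ordered pairs (w1, w2) where w2 follows w1 within `memory` words."""
    counts = Counter()
    for i, w1 in enumerate(words):
        # Every word inside the window after position i pairs with w1.
        for w2 in words[i + 1 : i + 1 + memory]:
            counts[(w1, w2)] += 1
    return counts

words = "a pack of cards and a pack of cards".split()
counts = windowed_pair_counts(words, memory=3)
print(counts[("pack", "cards")])  # 2
print(counts[("cards", "pack")])  # 1
```

With a memory of 1 the counts reduce to the immediately-following pairs used in the rest of the paper. Larger memories link "pack" to "cards" across the intervening "of", but they also create reversed pairs such as "cards" followed by "pack".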
The memory introduced false collocations, e.g. "grammar->mouse". The
context was: "Alice thought this must be the right way of speaking to
a mouse: she had never done such a thing before, but she remembered
having seen in her brother's Latin Grammar, 'A mouse--of a mouse--to a
mouse--a mouse--O mouse!'" This context gave up to 5 collocations for
"grammar" followed by "mouse", and therefore rated "grammar->mouse"
very high.
Otherwise, words that happened to be near a word without being
statistically related to it were usually rated low. With the 'memory'
model, μ gave clearly better results on finding related phrases than
Δμ.
With the memory, Δμ ordered the pairs closer to the original
raw-frequency ordering the more 'memory' was present. The experiment
with the memory was useful because it showed that adding memory was
not worth doing for Δμ, but likely worth doing for μ.
CONCLUSIONS 
Possible usefulness 
The higher sensitivity to local constraints in 
the temporal ordering could be used in a parser 
for finding local phrases. This might also have 
its implications for language acquisition. It could 
be tested if language learners make mistakes that 
could be explained by the statistical connectivity 
between words. Further research is needed on 
how the measure of connectivity behaves on 
phrase boundaries. 
Areas where phrase finding could be useful include: text-to-speech
(phrase intonation), machine translation (translation of compounds),
and information retrieval (phrase transformation of high frequency
terms into medium frequency terms with a better discrimination value;
Salton & McGill, 1983).
Characteristics 
The μ-measure is good at estimating global correlations in a document
or collection of documents (Wettler & Rapp, 1989). This could be used
for capturing contextual and pragmatic constraints in a text. Other
methods exist that are good, perhaps even better, at capturing, for
example, synonymy.
Linear least squares mapping (Yang & Chute, 1992) is one method that
has been shown to be promising at capturing very good mappings
between, in their case, symptoms and diagnosis. The same technique
could be used for mapping a text to its abstract. The drawback of
these methods is their inherent parallel structure, which makes it
hard to account for the ordering that natural language requires.
The Δμ-measure, on the other hand, is a local measure that seems to
capture dependencies in the temporal ordering of the language. It is
hard to draw any definite conclusions from the analysis of only one
text, but we have seen how the two measures react to the frequencies
of individual words, as well as the frequencies of word pairs. Taking
into account the ability of Δμ to find dependencies in the temporal
ordering, we think it is a more relevant measure than μ for several
aspects of natural language processing, but not all.
Acknowledgements 
Thanks to the people at my department: es- 
pecially Barbara Gawronska. 
REFERENCES
Belew, R. K., 1989, Adaptive information retrieval: Using a
connectionist representation to retrieve and learn about documents,
In: Proc. SIGIR 1989, pp. 11-20, Cambridge, MA.
Carroll, L., Alice's Adventures in Wonderland, The Millennium Fulcrum
Edition 2.9
Fox, C., 1992, Lexical Analysis and Stoplists, In: Frakes, W. B., &
Baeza-Yates, R., Information Retrieval, Prentice Hall, NJ.
Horne, M. & Johansson, C., 1991, Lexical Structure and accenting in
English and Swedish restricted texts. Working Papers (Dept. of Ling.,
U. of Lund, Sweden) 38: 97-114.
Horne, M. & Johansson, C., 1993, Computational tracking of 'new' vs.
'given' information: implications for synthesis of intonation, In:
Granström, B. & Nord, L. (Eds.) Nordic Prosody VI - papers from a
symposium, Almquist & Wiksell International, Stockholm, Sweden.
Meisel, J., Clahsen, H., and Pienemann, M., 1981, On determining
developmental stages in second language acquisition. Studies in Second
Language Acquisition 3, 2, pp. 109-135.
Mitjushin, L., 1992, High Probability Syntactic Links, Proceedings of
the fifteenth International Conference on Computational Linguistics,
pp. 930-934.
Project Gutenberg, Illinois Benedictine College, send the message:
"send gutenberg catalog" to 'almanac@oes.orst.edu' for more
information.
Salton, G., & McGill, M. J., 1983, Introduction to Modern Information
Retrieval, McGraw-Hill Computer Science Series
Shannon, C. E., 1951, Prediction and Entropy of Printed English, Bell
Systems Technical Journal, Vol. 30, No. 1, January 1951, pp. 50-65.
(quoted in Salton & McGill)
Steier, A. M. & Belew, R. K., 1991, Exporting phrases: A statistical
analysis of topical language. In Casey, R. & Croft, B., (Eds.), 2nd
Symposium on Document Analysis and Information Retrieval.
Wettler, M. & Rapp, R., 1989, A connectionist System to Simulate
Lexical Decisions in Information Retrieval, In: Pfeifer & al. (Eds.),
Connectionism in Perspective, North-Holland
Yang, Y. & Chute, C. G., 1992, A Linear Least Squares Fit Method for
Information Retrieval from Natural Language Texts, Proceedings of the
fifteenth International Conference on Computational Linguistics, pp.
447-453
Yarowsky, D., 1992, Word-Sense Disambiguation Using Statistical Models
of Roget's Categories Trained on Large Corpora, Proceedings of the
fifteenth International Conference on Computational Linguistics, pp.
454-460

