Using Suffix Arrays to Compute Term Frequency 
and Document Frequency for All Substrings in a Corpus 
Mikio Yamamoto 
University of Tsukuba 
1-1-1 Tennodai, 
Tsukuba 305-8573, JAPAN 
myama@is.tsukuba.ac.jp 
Kenneth W. Church 
AT&T Labs - Research 
180 Park Avenue 
Florham Park, NJ 07932, U.S.A 
kwc @research.att.com 
Abstract 
Mutual Information (MI) and similar 
measures are often used in corpus-based 
linguistics to find interesting ngrams. MI 
looks for bigrams whose term frequency (~ is 
larger than chance. Residual Inverse 
Document Frequency (RIDF) is similar, but it 
looks for ngrams whose document frequency 
(df) is larger than chance. Previous studies 
have tended to focus on relatively short 
ngrams, typically bigrams and trigrams. In 
this paper, we will show that this approach 
can be extended to arbitrarily long ngrams. 
Using suffix arrays, we were able to compute 
tf, df and RIDF for all ngrams in two large 
corpora, an English corpus of 50 million 
words of Wall Street Journal news articles and 
a Japanese corpus of 216 million characters of 
Mainichi Shimbun news articles. 
1 MI and RIDF 
Mutual Information (MI), l(x;y), compares the 
probability of observing word x and word y 
together (the joint probability) with the 
probabilities of observing x and y independently 
(chance). 
l(x;y) = log (P(x,y) / e(x)e(y) ) 
MI has been used to identify a variety of 
interesting linguistic phenomena, ranging from 
semantic relations of the doctor/nurse type to 
lexico-syntactic co-occurrence preferences of the 
save/from type (Church and Hanks, 1990). 
Church and Gale (1995) proposed Residual 
28 
Inverse Document Frequency (RIDF), the 
difference between the observed IDF and what 
would be expected under a Poisson model for a 
random word or phrase with comparable 
frequency. RIDF is a variant of IDF, a standard 
method for weighting keywords in Information 
Retrieval (IR). Let D be the number of documents, 
tf be the term frequency (what we call '~frequency" 
in our field) and dfbe the document frequency (the 
number of documents which contain the word or 
phrase at least once). RIDF is defined as: 
Residual IDF ~ observed IDF - predicted IDF 
= -log(df/D) +log(1-exp(- 8)) 
= -log(df/D) +log(1-exp(-tf/D)). 
RIDF is, in certain sense, like MI; both are the log 
of the ratio between an empirical observation and 
a chance-based estimate. Words or phrases with 
high RIDF or MI have distributions that cannot be 
attributed to chance. However, the two measures 
look for different kinds of deviations from chance. 
MI tends to pick out general vocabulary, the kind 
of words one would expect to find in a dictionary, 
whereas RIDF tends to pick out good keywords, 
the kind of words one would not expect to find in 
a dictionary. This distinction is not surprising 
given the history of the two measures; MI, as it is 
currently used in our field, came from 
lexicography whereas RIDF came from 
Information Retrieval. 
In addition, it is natural to compute RIDF 
for all substrings. This is generally not done for 
MI, though there are many ways that MI could be 
generalized to apply to longer ngrams. In the next 
section, we will show an algorithm based on suffix 
arrays for computing tf, df and RIDF for all 
substrings in a corpus in O(NlogN) time. 
In section 3, we will compute RIDF's for all 
substrings in a corpus and compare and contrast 
MI and RIDF experimentally for phrases in a 
English corpus and words/phrases in a Japanese 
corpus. We won't try to argue that one measure is 
better than the other; rather we prefer to view the 
two measures as mutually complementary. 
2 Computing tf and df for all substrings 
2.1 Suffix arrays 
A suffix array is a data structure designed to make 
it convenient to compute term frequencies for all 
substrings in a corpus. Figure 1 shows an example 
of a suffix array for a corpus of N=6 words. A 
suffix array, s, is an array of all N suffixes, 
pointers to substrings that start at position i and 
continue to the end of the corpus, sorted 
alphabetically. The following very simple C 
function, suffixarray, takes a corpus as input and 
returns a suffix array. 
int suffix_compare(char **a, char **b){ 
return strcmp(*a, *b); } 
/* The input is a string, terminated with a null */ 
char **suffix_array(char *corpus){ 
int i, N = strten(corpus); 
char **result=(char **)rnalloc(N*sizeof(char *)); 
/* initialize result\[i\] with the ith suffix */ 
for(i=0; i < N; i++) result\[il = corpus + i; 
ClSOr't(result, N, sizeof(char *), suffix_compare); 
return result; } 
Nagao and Mori (1994) describe this 
procedure, and report that it works well on their 
corpus, and that it requires O(NlogN)time, 
assuming that the sort step requires O(NlogN) 
comparisons, and that each comparison requires 
0(1) time. We tried this procedure on our two 
corpora, and it worked well for the Japanese one, 
but unfortunately, it can go quadratic for a corpus 
with long repeated substfings, where strcmp takes 
O(N) time rather than 0(1) time. For our English 
corpus, after 50 hours of cpu time, we gave up and 
turned to Doug Mcllroy's implementation 
( http : //cm. bell-labs, corrJcm/cs/ who/doug/ssort, c) 
of Manber and Myers' (1993) algorithm, which 
took only 2 hours. For a corpus that would 
29 
otherwise go quadratic, the Manber and Myers' 
algorithm is well worth the effort, but otherwise, 
the procedure described above is simpler, and 
often a bit faster. 
As mentioned above, suffix arrays were 
designed to make it easy to compute term 
frequencies (~. If you want the term frequency of 
"to be," you can do a binary search to find the first 
and last position in the suffix array that start with 
this phrase, i and j, and then tfl"to be") =j-i+l. In 
this case, i=5 and j=6, and consequently, tfl"to 
be")=6-5+1=2. Similarly, tfl"be")= 2-1+1 = 2, and 
~"to")=6-5+1=2. This straightforward method of 
computing tf requires O(logN) string comparisons, 
though as before, each string comparison could 
take O(N) time. There are more sophisticated 
algorithms that take O(logN) time, even for 
corpora with long repeated substrings. 
A closely related concept is lcp (longest 
common prefix). Lcp is a vector of length N, 
where lcp\[i\] indicates the length of the common 
prefix between the ith suffix and the/+/st suffix in 
the suffix array. Manber and Myers (1993) showed 
how to compute the lcp vector in O(NlogN) time, 
even for corpora with long repeated substrings, 
though for many corpora, the complications 
required to avoid quadratic behavior are 
unnecessary. 
Corpus: "to be or not to be" 
s\[i\] 
s\[z\] 
s\[3\] 
s\[4\] 
s\[s\] 
s\[s\] 
1 2 3 4 
Alphabet: \[to, be, or, not} 
... lcp 
 -Ior not to be I 
not to be I 
or not to 
to be 
to be or 
be\] 
not to be\] 
0 
0 
0 
2 
0 
Lcp's are denoted by bold vertical lines as well as the Icp table. 
Figure 1: An example of a Suffix Array with lcp's 
2.2 Classes of substrings 
Thus far we have seen how to compute tf for a 
single ngram, but how do we compute tfand dffor 
all ngrams? There are N(N+I)/2 substrings in a 
text of size N. If every substring has a different tf 
and df, the counting algorithm would require at 
least quadratic time and space. Fortunately many 
substrings have the same tf and the same df. We 
will cluster the N(N+I)/2 substrings into at most 
2N-1 classes and compute tf and df over the 
classes. There will be at most N distinct values of 
RIDF. 
Let <i,j> be an interval on the suffix array: 
{s\[i\], s\[i+l\] ..... s\[j\]}. We call the interval LCP- 
delimited if the lcp's are larger inside the interval 
than at its boundary: 
min(lcp\[i\], lcp\[i+ l\] ..... lcp\[j-1\]) 
> max(lcp\[i-1\], Icp\[l\]) (1) 
In Figure 1, for example, the interval <5,6> is 
LCP-delimited, and as a result, 0fCto") = tf("to be") 
= 2, and dfCto")=dfCto be"). 
The interval <5,6> is associated with a class 
of substrings: "to" and "to be." Classes will turn 
out to be important because all of the substrings in 
a class have the same tf(property l) and the same 
df (property 2). In addition, we will show that 
classes partition the set of substrings (property 3) 
so that we can compute tf and df on the classes, 
rather than substrings. Doing so is much more 
efficient because there many fewer classes than 
substfings (property 4). 
Classes of substrings are defined to be the 
(not necessarily least) common prefixes in an 
interval. In Figure 1, for example, both "to" and 
"to be" are common prefixes throughout the 
interval <5,6>. That is, every suffix in the interval 
<5,6> starts with "to," and every suffix also starts 
with "to be". More formally, we define 
class(<ij>) as: {s\[i\]ml LBL<rn_<SIL}, where s\[i\]rn 
is a substring (the first m characters of s\[i\]), 
LBL(longest boundary lcp) is the fight hand of (1) 
and SIL (shortest interior Icp) is the left hand side" 
of (1). In Figure 1, for example, SIL(<5,6>) = 
min(lcp\[5\]) = 2, LBL(<5,6>) = max(lcp\[4\], lcp\[6\]) 
=0, and class(<5,6>) = {s\[5\]m I 0<m_<2} = {"to", 
"to be"}. 
Figure 2 shows six LCP-delimited intervals 
and the LBL and SIL of <2,4>. For <2,4>, the 
bounding lcp's are lcp\[1 \] = 2 and lcp\[4\]=3 
(LBL=3), and the interior lcp's are lcp\[2\]=4 and 
lcp\[3\]=6 (SIL=4). The interval <2,4> is LCP- 
delimited, because L B L<SIL. Class(<2,4>)= 
{s\[2\]m13<m~<4} = {aacc}. The interval <3,3> is 
*) SIL(<i,i>) is defined to be infinity, and consequently, 
all intervals <i,i> are LCP-delimited, forall i. 
30 
Doc-id 
(..,. 382 s\[1\] 
~84987 stZ\] 
\6892 s\[3\] 
--382 s\[4\] 
2566 s\[5\] 
s\[6\] 
1 2 3 4 5 6 7... 
a alb b c c d... 
a aLc old d e... 
4, 
:: ..... 
Bounding Icps, LBL, SIL, Intedor Icps of <2, 4> 
Vertical lines denote lcps. Gray area denotes endpoints 
of substrings in class(<2,4>). 
LCP-delimited Class interval 
<2,4> {aacc} 
<3,4> {aacce, aaccee} <1,1> {aab, aabb, aabbc, ...} 
<2,2> {aaccd, aaccdd, ...} 
<3,3> {aacceef, ...} <4,4> {aacceeg, ...} 
LBL SIL tf 
2 4 3 
3 6 2 2 infinity 1 
4 infinity 1 6 infinity \] 
6 infinity 1 
Figure 2: Examples of intervals and classes 
LCP-delimited because SIL is infinite and LBL=6. 
The interval <2,3> is not LCP-delimited because 
SIL is 4 and LBL is 6 (LBL>SIL). 
By construction, the suffixes within the 
interval <i,j> all start with the substrings in 
class( <i,j> ), and no suffixes outside this interval 
start with these substfings. As a result, if sl and s2 
are two substfings in class(<ij>) then 
Property 1: tflsJ) = tfls2) =j-i+l 
• and 
Property 2: dr(s1) = df(s2). 
The calculation of dfis more complicated than tf, 
and will be discussed in section 2.4. 
It is not uncommon for an LCP-delimited 
interval to be nested within another. In Figure 2, 
for example, the in~rval <3,4> is nested within 
<2,4>. The computation of df in section 2.4 will 
take advantage of a very convenient nesting 
property. Given two LCP-delimited intervals, 
either one is nested within the other (e.g., <2,4> 
and <3,4>), or one precedes the other (e.g., <2,2> 
and <3,4>), but they cannot overlap. Thus, for 
example, the intervals <1,3> and <2,4> cannot 
both be LCP-delimited because they overlap. 
Because of this nesting property, it is possible to 
express the dfof an interval recursively in terms of 
its constituents or subintervals. 
As mentioned aboye, we will use the 
following partitioning property so that we can 
compute tfand dfon the classes rather than on the 
substrings. 
Property 3: the classes partition the set of 
all substrings in a text. 
There are two parts to this argument: every 
substfing belongs to at most one class (property 
3a), and every substring belongs to at least one 
class (property 3b). 
Demonstration of property 3a (proof by 
contradiction): Suppose there is a substfing, s, that 
is a member of two classes: class(<ij>) and 
class(<u,v>). There are three possibilities: one 
interval precedes the other, they are property 
nested or they overlap. The only interesting case is 
the nesting case. Suppose without loss of 
generality that <u,v> is nested within <ij> as in 
Figure 3. Because <u,v> is LCP-delimited, there 
must be a bounding lcp of <u,v> that is smaller 
than any lcp within <u,v>. This bounding Icp must 
be within <i j>, and as a result, class(<ij>) and 
class(<u,v>) must be disjoint. Therefore, s cannot 
be in both classes. 
t Suffix Array / SIL of <i,j> 
$\[i\]1 I 1 I s\[.\]/ ^ I , II 
/ I I s\[v\]/ 
I ,fl I sb\] ,.<..j"l 
?h~s is an interior lcp of <i,j> 
and the LBL of <u, v>. 
Figure 3: An example of nested intervals 
Demonstration of property 3b (constructive 
argument): Let s be an arbitrary substring in the 
corpus. There will be at least one suffix in the 
suffix array that starts with s. Let i be the first 
such suffix and let j be the last such suffix. By 
construction, the interval <i j> is LCP-delimited 
(LBL(<ij>) < Isl and S1L(<ij>) >_ Isl), and s is an 
element of class(<ij>). 
Finally, as mentioned above, computing 
over classes is much more efficient than 
computing over the substfings themselves because 
there are many fewer classes (at most 2N-l) than 
substrings (N(N+I)/2). 
31 
Property 4: There are N classes with tf=l 
and at most N-1 classes with ~'> 1. 
The first clause is relatively straightforward. 
There are N intervals <i,i>. These are all and only 
the intervals with tf=l. By construction, these 
intervals are LCP-delimited. 
To argue the second clause, we will make 
use of a uniqueness property: an LCP-delimited 
interval <ij> can be uniquely determined by its 
S1L and a representative element k (i.~.k<j). 
Suppose there were two distinct intervals, <id> 
and <u,v>, with the same SIL, SIL(<ij>)= 
SIL(<u,v>), and the same representative, i.~.k<j and 
u_<k<v. Since they share a common representative, 
k, the two intervals must overlap. But since they 
are distinct, there must be a distinguishing 
element, d, that is in one but not the other. One of 
these distinguishing elements, d, would have to be 
a bounding lcp in one and an interior lcp in the 
other. But then the two intervals couldn't both be 
LCP-delimited. 
Given this uniqueness property, we can 
determine the N-1 upper bound on the number of 
LCP-delimited intervals by considering the N-1 
elements in the Icp vector. Each of these elements, 
lcp\[k\], has the opportunity to become the SIL of an 
LCP-delimited interval <ij> with a representative 
k. Thus there could be as many as N-1 LCP- 
delimited intervals (though there could be fewer if 
some of the opportunities don't work out). 
Moreover, there couldn't be any more intervals 
with 0f>l, because if there were one, its SIL should 
have been in the lcp vector. (Note that this lcp 
counting argument excludes intervals with t~-I 
discussed above, because their SILs need not be in 
the lcp vector.) 
From property 4, it follows that there are at 
most N distinct values of RIDF. The N intervals 
<i,i> have just one RIDF value since 0~-'-df=l for 
these intervals. The other N-1 intervals could have 
another N-1 RIDF values. 
In summary, the four properties taken 
collectively make it practical to compute tf, df and 
RIDF over a relatively small number of classes; it 
would have been prohibitively expensive to 
compute these quantities directly over the 
N(N+ 1)/2 substrings. 
2.3 Calculating classes using Suffix Array 
This section will describe a single pass procedure 
for Computing classes. Since LCP-delimited 
intervals obey a convenient nesting property, the 
procedure is based on a push-down stack. The 
procedure outputs 4-tuples, <s\[i\],LBL,SIL,~>, one 
for each LCP-delimited interval. The stack 
elements are pairs (x,y), where x is an index, 
typically the left edge of a candidate LCP- 
delimited interval, and y is the SIL of this 
candidate interval. Typically, y=lcp\[x\], though not 
always, as we will see in Figure 5. 
The algorithm sweeps over the suffixes in 
suffix array s\[1..N\] and their lcp\[1..N\] (lcp\[N\]=O) 
successively. While Icp's of suffixes are 
monotonically increasing, indexes and lcp's of the 
suffixes are pushed into a stack. When it finds the 
i-th suffix whose lcp\[i\] is less than the lcp on the 
top of the stack, the index and Icp on the top are 
popped off the stack. Popping is repeated until the 
lcp on the top becomes less than the lcp\[i\]. 
A stack element popped out generates a 
class. Suppose that a stack element composed of 
an index i and lcp\[i\] is popped out by lcp\[1\]. Lcp\[i\] 
is used as the SIL. The LBL is the Icp on the next 
top element in the stack or lcp\[j\]. If the next top 
Icp will be popped out by lcp\[j\], then the algorithm 
uses the next top lop as the LBL, else it uses the 
lcp\[j\]. Tf is the offset between the indexes i and j, 
that is, j-i+1. 
Figure 4 shows the detailed algorithm for 
Create and clear stack. 
Push (-1, -1) (dummy). 
Repeat i = 1 ..... N do 
top (index1, Icpl). 
if Icp\[i\] > Icpl then 
push (i, Icp\[i\]). 
else 
while Icp\[i\] _< Icpl do 
pop(index1, Icpl) 
top (index2, Icp2) 
if Icp\[i\] _< Icp2 then 
output <s\[index 1\], Icp2, Icpl, i-index1 +1 > 
else 
output <s\[indexl\], Icp\[i\], Icpl, i-index1+1> 
push (indext, Icp\[i\]) 
Icpl = Icp2. 
Figure 4: An algorithm for computing all classes 
32 
computing all classes with tf > 1. If classes with tf 
= 1 are needed, we can easily add the line to 
output those into the algorithm. The expressions, 
push(x,y) and pop(x,y), operate on the stack in the 
obvious way, but note that x and y are inputs for 
push and outputs for pop. The expression, top(x,y), 
is equivalent to pop(x,y) followed by push(x,y); it 
reads the top of the stack without changing the 
stack pointer. 
As mentioned above, the stack elements are 
typically pairs (x,y) where y=lcp\[x\], but not 
always. Pairs are typically pushed onto the stack 
by line 6, push(i, Icp\[i\]), and consequently, 
y=lcp\[x\], in many cases, but some pairs are pushed 
on by line 15. Figure 5 (a) shows the typical case 
with the suffix array in Figure 2. At this point, 
i=3 and the stack contains 4 pairs, a dummy 
element (-1, -1), followed by three pairs generated 
by line 6: (1, Icp\[l\]), (2, lcp\[2\]), (3, lcp\[3\]). In 
contrast, Figure 5 (b) shows an atypical case. In 
between snapshot (a) and snapshot (b), two LCP- 
delimited intervals were generated, <s\[3\], 4, 6, 2> 
and <s\[2\], 3, 4, 3>, and then the pair (2, 3) was 
pushed onto the stack by line 15, push(indexl, 
lcp\[i\]), to capture the fact that there is a candidate 
LCP-delimited interval starting at indexl=2, 
spanning past the representative element i=4, with 
an SIL of lcp\[i=4\]. 
index lcp Note! 
(3, 6)\]\]Popped ilout ('2, 3) 
(2, 4)\[- s\[4\]. , 2) (1, 2) (1 
(-1,-1) I dummy (-1,-1) 
I I \] ushod 
(a) end of processing s\[3\] (b) end of processing s\[4\] 
Figure 5: Snapshots of the stack 
2.4 Computing df for all classes 
This section will extend the algorithm in Figure 4 
to include the calculation of dr. Straightforwardly 
computing dfindependently for each class would 
require at least quadratic time, because the 
program must check document id's for all 
substfings (N at most) in all classes (N-I at most). 
Instead of this, we will take advantage of the 
nesting property of intervals. The df for one 
interval can be computed recursively in terms of 
its constituents (nested subintervals), avoiding 
unnecessary recomputation. 
The stack elements in Figure 5 is augmented 
with two additional counters: (1) a df counter for 
summing the dfs over the nested subintervals and 
(2) a duplication counter for adjusting for 
overcounting documents that are referenced in 
multiple subintervals. The df for an interval is 
simply the difference of these two counters, that is, 
the sum of the dfs of the subintervals, minus the 
duplication. A C code implementation can be 
found at 
http://www.milab.is.tsukuba.ac.jp/-myama/oedf/tfdf c. 
The df counters are relatively 
straightforward to implement. The crux of the 
problem is the adjustment for duplication. The 
adjustment makes use of a document link table, as 
illustrated in Figure 6. The left two columns 
indicate that suffixes s\[101\], s\[104\] and s\[107\] are 
Suffix Document Document id link (index) 
s\[101\] 382 - ~ 66 j 
s\[102\] 84987 ~172 ~ silO31 -- 6892 21 
s\[104l 382 - 01 
s\[105\] 2566 / 112~) s\[106\] -- 6892 03 
s\[107\] 382 - ~,-I04 ',/ 
stl08\] l- 84987 .. \[102 "~-.,,~., 
Figure 6: An example of document link table 
s\[i\] 
sbq 
s\[k ~. 
s\[t 
Suffix Array 
characters (suffix) 
......... ....................... 
, Idf- 
df-counter \] h 
h h ,,tdf-counter I 
Adf-counterl I6 
-"----... _ ;,11;,I I ;, 4 df-cou.te,T----IL_ II 
.............. I,o 
dup-counter I-~ In 
document links ~.~ ~Interval 
endpoints of substrings in the class of the interval 
Figure 7: Dfrelations among an interval 
and its constituents 
33 
all in document 382, and that several other suffixes 
are also in the same documents. The third column 
links together suffixes that are in the same 
document. Note, for example, that there is a 
pointer from suffix 104 to 101, indicating that 
s\[104\] and s\[101\] are in the same document. The 
suffixes in one of these linked lists are kept sorted 
by their order in the suffix array. When the 
algorithm is processing s\[t\], the algorithm searches 
the stack to find the suffix, s\[k\], with the largest k 
such k_<i and s\[i\] and s\[k\] are in the same 
document. This search can be performed in 
O(logN) time. 
Figure 7 shows the LCP-delimited intervals 
in a suffix array and four suffixes included in the 
same document. I1 has four immediate constituents 
of intervals. S\[j\] is included in the same document 
of s\[i\]. Count for the document of s\[j\] will be 
duplicated at computing df of 11. At the point of 
processing sO'\], the algorithm will increment 
duplication-counter of I! to cancel dfcount of sO'\]. 
As the same way, df count of s\[k\] has to canceled 
at computing df of 11. 
Figure 8 shows a snapshot of the stack after 
processing s\[4\] in Figure 2. Each stack element is 
a 4-tuple of the index of suffix array, lcp, df- 
counter and duplication-counter, (i, lcp, df dc). 
Figure 2 shows s\[1\] and s\[4\] are in the same 
document. Looking up the document link table, 
the algorithm knows s\[1\] is the nearest suffix 
which is in the same document of s\[4\]. The 
duplication-counter of the element of s\[1\] is 
incremented. The duplication of counting s\[1\] and 
s\[4\] for the class generated by s\[1\] will be avoided 
using this duplication-counter. 
At some processing point, the algorithm 
uses only a part of the document link table. It 
duplication lcp counter 
(2, 3, 3,0) \[(1, 
2, 1,1) 
I(-1,-1,-,-) 
Figure 8: A snapshot of 
the stack in dfcomputing 
Nearest Doc-id 
index 
..° 
382 4 
oH 
6892 3 
°,° 
84987 2 
Figure 9: Nearest indexes 
of documents 
needs only the nearest index on the link, but not 
the whole of the link. So we can compress the link 
table to dynamic one in which an entry of each 
document holds the nearest index. Figure 9 shows 
the nearest index+ table of document after 
processing s\[4\]. 
The final algorithm to calculate all classes 
with tfand dftakes O(NlogN) time and O(N) space 
in the worst case. 
3 Experimental results 
3.1 RIDF and MI for English and Japanese 
We computed all RIDF's for all substrings of two 
corpora, Wall Street Journal of ACL/DCI in 
English (about 50M words and 113k articles) and 
Mainichi News Paper 1991-1995 (CD-Mainichi 
Shimbun 91-95) in Japanese (about 216M 
characters and 436k articles), using the algorithm 
in the previous section. In English, we tokenized 
the text into words, delimited by white .space, 
whereas in Japanese we tokenized the text into 
characters (usually 2-bytes) because Japanese text 
has no word delimiter such as white space. 
It took a few hours to compute all RIDF's 
using the suffix array. It takes much longer to 
compute the suffix array than to compute tfand df. 
We ignored substrings with tf< 10 to avoid noise, 
resulting in about 1.6M English phrases (#classes 
= 1.4M) and about 15M substrings of Japanese 
words/phrases (#classes = 10M). 
MI of the longest substring of each class was 
also computed by the following formula. 
, p(xyz) MI(xyz) 
= xog p(xy)p(z I y) 
Where xyz is a phrase or string, x and Z are a 
word or a character and y is a sub-phrase or sub- 
string. 
3.2 Little correlation between RIDF and MI 
We are interested in comparing and contrasting 
RIDF and MI. Figure 10 (a) plots RIDF vs MI for 
phrases in WSJ (length > 1), showing little, if any, 
correlation between RIDF and MI. Figure 10 (b) 
also plots RIDF vs MI but this time the corpus is in 
Japanese and the words were manually selected by 
the newspaper to be keywords. Both Figures I0 
(a) and 10 (b) suggest that RIDF and MI are 
34 
largely independent. There are many substrings 
with a large RIDF value and a small MI, and vice 
versa. 
MI is very different from RIDF. Both pick 
out interesting phrases, but phrases with large MI 
are interesting in different ways from phrases with 
large RIDF. Consider the phrases in Table 1, 
which all contain the word "having." These 
phrases have large MI values and small RIDF 
values. A lexicographer such as Patrick Hanks, 
who works on dictionaries for learners, might be 
interested in these phrases because these kinds of 
collocations tend to be difficult for non-native 
speakers of the language. On the other hand, these 
kinds of collocations are not very good keywords. 
Table 2 is a random sample of phrases 
containing the substring/Mr/, sorted by RIDF. The 
ones at the top of the list tend to be better 
keywords than the ones further down. 
Table 3.A and 3.B show a few phrases 
starting with/the/, sorted by MI (Table 3.A) and 
sorted by RIDF (Table 3.B). Most of the phrases 
are interesting in one way or another, but those at 
the top of Table 3.A tend to be somewhat 
.+ +• 
!$ •o •+ . • : .'.~- : .- 
• , 
.:'~Mi~-.t....'... • ,.~llt~.~z:L= : 
o MI lo 20 
(a) English phrases 
o"l~ 
-lo o MI lo 
(b) Japanese strings 
Figure 10: Scatter plot of RIDF and MI 
i,-Table5 
20 
Table 1: phrases with 'having' 
ff df RIDF MI Phrase 
18 18 -0.0001 10.4564 admits to having 
14 14 -0.0001 9.7154 admit to having 
25 23 0.1201 8.8551 diagnosed as having 
20 20 -0.0001 7.4444 suspected of having 
301 293 0.0369 7.2870 without having 
15 13 0.2064 6.9419 denies having 
59 59 -0.0004 6.7612 avoid having 
18 18 -0.0001 5.9760 without ever having 
12 12 -0.0001 5.9157 Besides having 
26 26 -0.0002 5.7678 denied having 
Table 2: phrases with 'Mr' 
tf df RIDF MT Phrase 
ii 3 1.8744 0.6486 . Mr. Hinz 
18 5 1.8479 6.5583 Mr. Bradbury 
51 16 1.6721 6.6880 Mr. Roemer 
67 25 1.4218 6.7856 Mr. Melamed 
54 27 0.9997 5.7704 Mr. Burnett 
16 9 0.8300 5.8364 Mrs. Brown 
Ii 8 0.4594 1.0931 Mr. Eiszner said 
53 40 0.4057 0.2855 Mr. Johnson . 
21 16 0.3922 0.1997 Mr. Nichols said . 
13 i0 0.3784 0.4197 . Mr. Shulman 
176 138 0.3498 0.4580 Mr. Bush has 
13 ii 0.2409 1.5295 to Mr. Trump's 
13 Ii 0.2409 -0.9301 Mr. Bowman , 
35 32 0.1291 1.1673 wrote Mr. 
12 ii 0.1255 1.7330 M r. Lee to 
22 21 0.0670 1.4293 facing Mr. 
ii ii -0.0001 0.7004 Mr. Poehl also 
13 13 -0.0001 1.4061 inadequate . " Mr. 
16 16 -0.0001 1.5771 The 41-year-old Mr. 
19 19 -0.0001 0.4738 14 . Mr. 
26 26 -0.0002 0.0126 in November . Mr. 
27 27 -0.0002 -0.0112 " For his part , Mr. 
38 38 -0.0002 1.3589 . AMR , 
39 39 -0.0002 -0.3260 for instance , Mr. 
tf df 
Table 3.A: Worse Keywords 
RIDF MI Phrase 
ii ii -0.0001 11.0968 the up side 
73 66 0.1450 9.3222 the will of 
16 16 -0.0001 8.5967 the sell side 
17 16 0.0874 8.5250 the Stock Exchange of 
16 15 0.0930 8.4617 the buy side 
20 20 -0.0001 8.4322 the down side 
55 54 0.0261 8.3287 the will to 
14 14 -0.0001 8.1208 the saying goes 
15 15 -0.0001 7.5643 the going gets 
tf df 
Table 3.B: Better Keywords 
RIDF MI Phrase 
37 3 3.6243 2.2561 the joint commission 
66 8 3.0440 3.5640 the SSC 
55 7 2.9737 2.0317 the Delaware & 
37 5 2.8873 3.6492 the NHS 
22 3 2.8743 3.3670 the kibbutz 
22 3 2.8743 4.1142 the NSA's 
29 4 2.8578 4.1502 the DeBartolos 
36 5 2.8478 2.3061 the Basic Law 
21 3 2.8072 2.2983 the national output 
Table 3.C: Concordance of the phrase "the Basic Law" 
The first col. is the token id and the last col. is the doc id (position of the start word in the corpus) 
2229521: line in the drafting of 
2229902: s policy as expressed in 
9746758: he U.S. Constitution and 
11824764: any changes must follow 
33007637: sts a tentative draft of 
33007720: the relationship between 
33007729: onstitution . Originally 
33007945: wer of interpretation of 
33007975: tation of a provision of 
33008031: interpret provisions of 
33008045: ration of a provision of 
33008115: etation of an article of 
33008205: nland representatives of 
33008398: e : Mainland drafters of 
33008488: pret all the articles of 
33008506: y and power to interpret 
33008521: pret those provisions of 
33008545: r the tentative draft of 
33008690: d of being guaranteed by 
33008712: uncilor , is a member of 
39020313: sts a tentative draft of 
39020396: the relationship between 
39020405: 
39020621: 
39020651: 
39020707: 
39020721: 
39020791: 
39020881: 
39021074: 
39021164: 
39021182: 
39021197: 
39021221: 
39021366: 
39021388: 
onstitution . Originally 
wet of interpretation of 
tation of a provision of 
interpret provisions of 
tation of a provision of 
etation of an article of 
nland representatives of 
e : Mainland drafters of 
pret all the articles of 
y and power to interpret 
pret those provisions of 
r the tentative draft of 
d of being guaranteed by 
uncilor , is a member of 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
the Basic 
Law that will determine how Hon 2228648 
Law -- as Gov. Wilson's debut s 2228648 
Law of the Federal Republic of 9746014 
Law , Hong Kong's miniconstitut 11824269 
Law , and although this may be 33007425 
Law and the Chinese Constitutio 33007425 
Law was to deal with this topic 33007425 
Law shall be vested in the NPC 33007425 
Law , the courts of the HKSAR { 33007425 
Law . If a case involves the in 33007425 
Law concerning defense , foreig 33007425 
Law regarding " defense , forei 33007425 
Law Drafting Committee fear tha 33007425 
Law simply do not appreciate th 33007425 
Law . While recognizing that th 33007425 
Law , it should irrevocably del 33007425 
Law within the scope of Hong Ko 33007425 
Law , I cannot help but conclud 33007425 
Law , are being redefined out o 33007425 
Law Drafting Committee . <EOA> 33007425 
Law , and although this may be 39020101 
Law and the Chinese Constitutio 39020101 
Law was to deal with this topic 39020101 
Law shall be vested in the NPC 39020101 
Law , the courts of the HKSAR { 39020101 
Law . If a case involves the in 39020101 
Law concerning defense , foreig 39020101 
Law regarding " defense , forei 39020101 
Law Drafting Committee fear tha 39020101 
Law simply do not appreciate th 39020101 
Law . While recognizing that th 39020101 
Law , it should irrevocably del 39020101 
Law within the scope of Hong Ko 39020101 
Law , I cannot help but conclud 39020101 
Law , are being redefined out o 39020101 
Law Drafting Committee . <EOA> 39020101 
35 
tf df RIDF MI 
Table 4: Phrases with prepositions 
Phrase with 'for' tf df RIDF 
14 14 
15 15 
12 12 
i0 5 
12 4 
13 13 
23 21 
I0 2 
i0 9 
19 16 
-0.0001 14.5587 
-0.0001 14.4294 
-0.0001 14.1123 
0.9999 13.7514 
1.5849 13.7514 
-0.0001 13.6803 
0.1311 13.6676 
2,3219 13.4009 
0.1519 13.3591 
0.2478 12.9440 
tf df RIDF MI 
feedlots for fattening ii 5 1.1374 
error for subgroups II i0 0.1374 
Voice for Food 13 12 0.1154 
Quest for Value 16 16 -0.0001 
Friends for Education 12 12 -0.0001 
Commissioner for Refugee~ 12 12 -0.0001 
meteorologist for Weathe\] 22 18 0.2894 
Just for Men ii ii -0.0001 
Witness for Peace 17 12 0.5024 
priced for reoffering 22 20 0.1374 
Phrase with'by' tf df RIDF 
Ii ii 
13 13 
13 13 
15 15 
16 16 
61 59 
17 17 
12 12 
ii ii 
20 20 
-0,0001 12.8665 
-0.0001 12.5731 
-0,0001 12.4577 
-0,0001 12.4349 
-0.0001 11.8276 
0,0477 11.5281 
-0,0001 11.4577 
-0.0001 11.3059 
-0.0001 10.8176 
-0.0001 10.6641 
piece by piece ii i0 0.1374 
guilt by association 12 5 1.2630 
step by step 16 16 -0.0001 
bit by bit 14 13 0.1068 
engineer by training 10 9 0.1519 
side by side ii II -0.0001 
each by Korea's i0 9 0.1519 
hermaed in by I0 8 0.3219 
dictated by formula 12 12 -0.0001 
70%-owned by Exxon 16 4 1.9999 
Table 5: Examples of keywords 
with interesting RIDF and MI 
RIDF MI Substrings Features 
~E(native last name) 
SUN (company name) High Low z,J-~'(foreign name) 
10% 10% ~Z~ b(brush) 
V 7 7 -- (sofa) 
~< \]'fl,~ (huge) 
Low High '~l~J (passive) 
I,~ 19 (determination) 
10% 10% /~jJ(native full name) 
~ii~l~'(native full name) 
Kanji character 
English character 
Katakana character 
Hiragana character 
Loan word, Katakana 
General vocabulary 
General vocab., Kanji 
General vocabulary 
Kanji character 
Kanji character 
idiomatic (in the WSJ domain) whereas those at 
the top of Table 3.B tend to pick out specific 
stories or events in the news. For example, the 
phrase, "the Basic Law," selects for stories about 
the British handover of Hong Kong to China, as 
illustrated in Table 3.C. 
Table 4 shows a number of phrases with 
high M! containing common prepositions. The 
high MI indicates an interesting association, but 
again most of them are not good keywords, though 
there are a few exceptions such as "Just for Men," 
a well-known brand name. 
RIDF and MI for Japanese substrings tend to 
be similar. Substrings with both high RIDF and MI 
tend to be good keywords such as ~ (merger), 
(stock certificate), ~,~ (dictionary), J~l~ (wireless) 
36 
MI Phrase with 'on' 
14.3393 Terrorist on Trial 
13.1068 War on Poverty 
12.6849 Institute on Drug 
12.5599 dead on arrival 
11.5885 from on high 
11.5694 knocking on doors 
11.3317 warnings on cigarette 
11.2137 Subcon~ittee on Oversight 
11.1847 Group on Health 
11.1421 free on bail 
MI Phrase with 'of 
16.7880 Joan of Arc 
16.2177 Ports of Call 
16.0725 Articles of Confederation 
16.0604 writ of mandamus 
15.8551 Oil of Olay 
15.8365 shortness of breath 
15.6210 Archbishop of Canterbur 
15.3454 Secret of My 
15.2030 Lukman of Nigeria 
15.1600 Days of Rage 
and so on. Substrings with both low RIDF and MI 
tend to be poor keywords such as "~" ~q~ 
(current regular-season game) and meaningless 
fragments such as *& ,_.~" (??). Table 5 shows 
examples where MI and RIDF point in opposite 
directions (rectangles in Figure 10 (b)). Words 
with low RIDF and high MI tend to be general 
vocabulary (often written in Kanji characters). In 
contrast, words with high RIDF and low MI tend 
• to be domain specific words such as loan words 
(often written in Katakana characters). MI is high 
for words in general vocabulary (words found in 
dictionary) and RIDF is high for good keywords 
for IR. 
3.3 Word extraction 
Sproat and Shih (1990) found MI to be useful for 
word extraction in Chinese. We performed the 
following experiment to see if both MI and RIDF 
are useful for word extraction in Japanese. 
We extracted four random samples of 100 
substrings each. The four samples cover all four 
combinations of high and low RIDF and high and 
low MI, where high is defined to be in the top 
decile and low is defined to be in the bottom 
decile. Then we manually scored each sample 
substring using our own judgment as a good (the 
substring is a word) or bad the substring is not a 
word) or gray (the judge is not sure). The results 
are presented in Table 6, which shows that 
Table 6: RIDF and MI are complementary 
I MI MI All MI ! (high 10%) (low 10%) 
All RIDF --- 20-44% 2-11% 
RIDF (high 10%) 29-51% 38-55% 11-35% 
RIDF (low 10%) 3-18% 4-13% 0-8% 
Each cell is computed over a sample of 100 
examples. The smaller values are counts of 'good' 
words and the larger values, 'not bad' words ('good' 
and 'gray' words). Good or 'not bad' word ratio of 
pairs of characters with high MI is 51-76%. 
substrings with high scores in both dimensions are 
more likely to be words than substrings that score 
high in just one dimension. Conversely, substrings 
with low scores in both dimensions are very 
unlikely to be words. 
3.4 Case study: Names 
We also compared RIDF and MI for people's 
names. We made a list of people's names from 
corpora using simple heuristics. A phrase or 
substring is accepted as a person's name if English 
phrase starts with the title 'Mr.' 'Ms.' or 'Dr.' and is 
followed by a series of capitalized words. For 
Japanese, we selected phrases in the keyword list 
ending with 'L~:' (-shi), which is roughly the 
equivalent of the English titles 'Mr.' and 'Ms.' 
Figure 11 plots RIDF and MI for names in 
English (a) and Japanese (b) with tf _> 10, 
respectively• Figure 11 (a) shows that MI has a 
more limited range than RIDF, suggesting that 
RIDF may be more effective with names than MI. 
The English name 'Mr. From' is a particularly 
L 
°% • 
-, .:. 
r" lr r.qp ° ° ° 
• "'" "?:",:i':."" 1 
"" " " " .1 5 • : .'~~,: 
I 1 
: " "" "" ~.-~1~,.'~':. I 
0 4 MI 8 12 -8 -4 MI4 8 12 
(a) English names (b) Japanese names 
Figure 11: MI and RIDF of people's names 
37 
interesting case, since both 'Mr.' and 'From' is a 
stop word. In this case, the RIDF was large and the 
MI was not. 
The Japanese names in Figure 11 (b) split 
naturally at RIDF = 0.5. Japanese names with 
RIDF below 0.5 are different from names after 0.5. 
The group whose RIDF is under 0.5 included first 
name and full name (first and last name) at rate of 
90% and another group whose RIDF is up to 0.5 
included only lastname at rate of 90%. The reason 
of this separation is that full name (and first name 
as a substring of full name) appears once in the 
beginning of the document, but last name is 
repeated as a reference in the article. Recall that 
RIDF tends to give higher value to substrings 
which appear many times in a few documents. In 
summary, RIDF can discriminate difference of 
some words which cannot be done by MI. 
5 Conclusion 
We showed that RIDF is efficiently and naturally 
scalable to long phrases or substrings. RIDF for all 
substrings in a corpus can be computed using the 
algorithm which computes tfs and dfs for all 
substrings based on Suffix Array. It remains an 
open question how to do this for MI. We found 
that RIDF is useful for finding good keywords, 
word extraction and so on. The combination of MI 
and RIDF is better than either by itself. RIDF is 
like MI, but different• 

References 
Church, K. and P. Hanks (1990)Word association 
norms, mutual information, and lexicography• 
Computational Linguistics, 16:1, pp. 22 - 29. 
Church, K. and W. Gale (1995) Poisson mixtures. 
Natural Language Engineering, 1:2, pp. 163 - 190. 
Manber, U. and G. Myers (1993) Suffix array: A new 
method for on-line string searches. SIAM Journal 
on Computing, 22:5, pp. 935 - 948. 
http://glimpse.cs.arizona, edu/udi.html 
Nagao, M. and S. Mori (1994) A new method of n-gram 
statistics for large number of n and automatic 
extraction of words and phrases from large text data 
of Japanese, Coling-94, pp.611-615. 
Sproat, R and C. Shih (1990) A statistical method for 
finding word boundaries in Chinese text. Computer 
Processing of Chinese and Oriental Languages, 
Vol.4, pp. 336 - 351. 
