Selecting the Most Highly Correlated Pairs within a Large Vocabulary
Kyoji Umemura
Department of Computer Science
Toyohashi University of Technology
umemura@tutics.tut.ac.jp
Abstract
Occurrence patterns of words in documents
can be expressed as binary vectors. When
two vectors are similar, the two words corresponding
to the vectors may have some
implicit relationship with each other. We
call these two words a correlated pair.
This report describes a method for obtaining
the most highly correlated pairs of a
given size. In practice, the method requires
O(N · log(N)) computation time
and O(N) memory space, where N is the
number of documents or records. Since
this does not depend on the size of the
vocabulary under analysis, it is possible
to compute correlations between all the
words in a corpus.
1 Introduction
In order to find relationships between words in a
large corpus or between labels in a large database,
we may use a distance measure between binary
vectors of N dimensions, where N is the number of
documents or records, and the i-th element is 1 if the
i-th document/record contains the word or the label,
and 0 otherwise.
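As a concrete illustration (ours, not part of the original paper), the occurrence pattern of a word can be built as follows; the corpus and names are hypothetical.

```python
# Sketch of the binary-vector view described above: element i of the vector
# for word w is 1 if the i-th document contains w, and 0 otherwise.

docs = [{"a", "b"}, {"a", "c"}, {"b"}]  # hypothetical corpus of N = 3 documents

def occurrence_vector(word, docs):
    """Return the N-dimensional binary occurrence vector of a word."""
    return [1 if word in d else 0 for d in docs]

print(occurrence_vector("a", docs))  # -> [1, 1, 0]
```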
There are several distance measures suitable
for this purpose, such as mutual information
(Church and Hanks, 1990), the dice coefficient
(Manning and Schuetze 8.5, 1999), the phi
coefficient (Manning and Schuetze 5.3.3, 1999), the
cosine measure (Manning and Schuetze 8.5, 1999)
and confidence (Agrawal and Srikant, 1995).
There are also special functions for certain applications,
such as the complementary similarity measure
(CSM) (Hagita and Sawaki, 1995), which is
known to be suitable for cases with noisy patterns.
All five of these measures can be obtained from a
simple contingency table. This table has four numbers
for each pair of words/labels x and y. The
first number is the number of documents/records
that have both x and y; we define this number as
df11(x, y). The second number is the number of
documents/records that have x but not y; we define this
number as df10(x, y). The third number is the number
of documents/records that do not have x but do
have y; we define this number as df01(x, y). The
fourth and last number is the number of documents/records
that have neither x nor y; we define
this number as df00(x, y).
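The four numbers can be computed directly by scanning the documents; the following sketch (our illustration, with hypothetical data) makes the definitions concrete.

```python
# Compute the contingency table (df11, df10, df01, df00) for a pair of
# words/labels x and y by a single scan over the documents.

docs = [{"a", "b"}, {"a", "c"}, {"x", "y"}]  # hypothetical corpus

def contingency(x, y, docs):
    df11 = sum(1 for d in docs if x in d and y in d)          # both x and y
    df10 = sum(1 for d in docs if x in d and y not in d)      # x but not y
    df01 = sum(1 for d in docs if x not in d and y in d)      # y but not x
    df00 = sum(1 for d in docs if x not in d and y not in d)  # neither
    return df11, df10, df01, df00

print(contingency("a", "b", docs))  # -> (1, 1, 0, 1)
```

By definition the four counts always sum to the number of documents N.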
An obvious method to obtain the most highly related
pairs is to calculate df11, df10, df01 and df00 for all
pairs of words/labels, compute the similarity for all
pairs, and then select the pairs with the highest values. Let
V be the number of possible words/labels, and N
be the total number of documents/records in a corpus/database.
This method requires O(V^2) memory
space and O(V^2 · N) computation time. However,
its use is only feasible if V is smaller than 10^4. When
V is larger than ten thousand, execution of this procedure
becomes difficult.
The method described here is based on the observation
that there is an upper bound on the number
of different words in one document. The assumption
of such a bound can be made even for a large-scale
corpus. For example, the cumulative corpus of a newspaper
may grow larger and larger, but the length
of each article is stable. It is not likely that one article
would contain thousands of different words.
In view of this observation and assumption,
this method is effective for obtaining the most highly
correlated pairs in a large corpus, and uses O(N)
memory space and O(N · log(N)) computation
time.
2 Notations
Several notations are introduced in this section to describe
the method. Assuming a corpus C, which is a
set of sets of words, values are assigned as follows.

- D: documents (elements of the corpus). D ∈ C.

- x, y, z: labels (elements of a document).
  x ∈ D, y ∈ D, z ∈ D.

- x < y: y is placed after x in the alphabetical
  order.

- N: the total number of documents. N = |C|.

- df(x): the number of documents that contain
  x. df(x) = |{D | x ∈ D}|.

- df11(x, y): the number of documents that contain
  x and contain y.
  df11(x, y) = |{D | x ∈ D and y ∈ D}|.

- df10(x, y): the number of documents that contain
  x but not y.
  df10(x, y) = |{D | x ∈ D and y ∉ D}|.

- df01(x, y): the number of documents that contain
  y but not x.
  df01(x, y) = |{D | x ∉ D and y ∈ D}|.

- df00(x, y): the number of documents that contain
  neither x nor y.
  df00(x, y) = |{D | x ∉ D and y ∉ D}|.
3 Problem Definition
When the corpus, a set of sets of labels, is provided,
and a function g(x, y) from a pair of labels to a
number of the following form is also provided, we
will obtain S: the set of pairs of a given size that
satisfies the following condition.

(x1, y1) ∉ S and (x2, y2) ∈ S implies g(x2, y2) ≥ g(x1, y1)

where

g(x, y) = f(df11(x, y), df10(x, y), df01(x, y), df00(x, y))
The following are examples of f(a, b, c, d).

- cosine measure:
  a / sqrt((a + b) · (a + c))

- dice coefficient:
  2 · a / ((a + b) + (a + c))

- confidence:
  a / (a + c)

- pairwise mutual information:
  (a / N) · log(a · N / ((a + b) · (a + c)))

- phi coefficient:
  (a · d − b · c) / sqrt((a + b) · (a + c) · (b + d) · (c + d))

- complementary similarity measure:
  (a · d − b · c) / sqrt((a + c) · (b + d))
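The five functions above can be sketched as follows; this is our reconstruction from the formulas in the text (in particular, the exact scaling of pairwise mutual information is reconstructed), not code from the paper.

```python
import math

# f(a, b, c, d): a = df11, b = df10, c = df01, d = df00, and N = a + b + c + d.

def cosine(a, b, c, d):
    return a / math.sqrt((a + b) * (a + c))

def dice(a, b, c, d):
    return 2 * a / ((a + b) + (a + c))

def confidence(a, b, c, d):
    return a / (a + c)

def pmi(a, b, c, d):
    n = a + b + c + d
    return (a / n) * math.log(a * n / ((a + b) * (a + c)))

def phi(a, b, c, d):
    return (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))

def csm(a, b, c, d):
    return (a * d - b * c) / math.sqrt((a + c) * (b + d))
```

Note that all of these are undefined when a row or column of the table is empty, so in practice a guard for zero denominators is needed.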
Implementation of a program that requires O(V^2)
memory space and O(V^2 · N) computation time
is easy. A program of this type could
calculate df11, df10, df01, and df00 for all pairs of x and
y, and could then provide the most highly correlated
pairs. However, computation with this method is not
feasible when V is large.
For example, in order to calculate the most highly
correlated words within a newspaper over several
years of publication, V becomes roughly 10^5, and N
becomes 10^6. The amount of computation time
then increases to 10^16.
4 Approach
In terms of actual data, the number of correlated
pairs is usually much smaller than the number of uncorrelated
pairs. Moreover, most of the uncorrelated
pairs usually satisfy the condition df11(x, y) = 0,
and are not of interest. This method takes this fact
into account. Moreover, it also uses the relationship
between {N, df11} and {df, df10, df01, df00} to make the
computation feasible.
5 Relationship between {N, df11} and {df, df10, df01, df00}
Proofs of the following equations are provided below.

df(x) = df11(x, x)
df10(x, y) = df11(x, x) − df11(x, y)
df01(x, y) = df11(y, y) − df11(x, y)
df00(x, y) = N − df11(x, x) − df11(y, y) + df11(x, y)
Proof:

1. "x and x" is equivalent to "x".

   df11(x, x) = |{D | x ∈ D and x ∈ D}|
              = |{D | x ∈ D}|
              = df(x)

2. By definition, the sum of df11, df10,
   df01, and df00 always represents the total
   number of documents.

   df11(x, y) + df10(x, y) + df01(x, y) + df00(x, y) = N

3. Similarly, the sum of df11(x, y) and
   df10(x, y) is the number of documents
   that contain x. This equals df(x).

   df11(x, y) + df10(x, y)
     = |{D | x ∈ D and y ∈ D}| + |{D | x ∈ D and y ∉ D}|
     = |{D | x ∈ D}|
     = df(x)

4. Similarly, the sum of df11(x, y) and
   df01(x, y) is the number of documents
   that contain y. This equals df(y).

   df11(x, y) + df01(x, y)
     = |{D | x ∈ D and y ∈ D}| + |{D | x ∉ D and y ∈ D}|
     = |{D | y ∈ D}|
     = df(y)

5. These four equations make it possible
   to express df, df10, df01 and df00 by
   df11 and N.
These formulas indicate that the number of required
two-dimensional tables is not four, but just
one. In other words, if we create a table of df11(x, y)
and one variable for N, we can obtain df(x),
df10(x, y), df01(x, y), and df00(x, y).
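The identities can be checked mechanically; the following sketch (our own, with a hypothetical corpus) recovers df10, df01 and df00 from the df11 table and N alone.

```python
# Verify that df(x), df10, df01 and df00 are all derivable from df11 and N.

docs = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a"}]
N = len(docs)

def df11(x, y):
    return sum(1 for d in docs if x in d and y in d)

x, y = "a", "b"
df10 = df11(x, x) - df11(x, y)                    # x but not y
df01 = df11(y, y) - df11(x, y)                    # y but not x
df00 = N - df11(x, x) - df11(y, y) + df11(x, y)   # neither

assert df11(x, x) == sum(1 for d in docs if x in d)  # df(x) = df11(x, x)
assert df11(x, y) + df10 + df01 + df00 == N
```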
6 The memory requirement for df11

Let L be the maximum number of different
words/labels in one document. The following property
holds for df11(x, y).

Σx Σy df11(x, y) ≤ L^2 · N

The left side of the formula equals the
total number of all pairs of words/labels.
This cannot exceed L^2 · N.
This relationship indicates that if the table
is stored using tuples (x, y, df11(x, y)) where
df11(x, y) > 0, the required memory space is
O(N).
Tuples where df11(x, y) = 0 are not necessary, because
we know that df11(x, y) = 0 when the tuple
for (x, y, df11(x, y)) does not exist in memory.
This estimate is pessimistic. The actual number of
tuples will be smaller than L^2 · N, since not all
documents will have L different words/labels.
7 Obtaining df11 and N

The algorithm to obtain df11(x, y) and N is straightforward.
First, the corpus must be transformed into a
set of sets of words/labels. Since this is a set form,
there are no duplications of the words/labels within one
document. In the following program, the hashtable
returns 0 for a non-existent item.

(01) Let DFA be an empty hashtable
(02) Let N be 0
(03) For each document, assign it to D
(04) |  N = N + 1
(05) |  For each word in D, assign the word to X
(06) |  |  For each word in D, assign the word to Y
(07) |  |  |  DFA(X, Y) = DFA(X, Y) + 1
(08) |  |  end of loop
(09) |  end of loop
(10) end of loop
The computation time for this program is less than
L^2 · N. Since L is independent of N, the computation
time is O(N). Again, L^2 · N is a pessimistic
estimate, since not all documents will have L different
words/labels.
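A runnable version of the pseudocode above may look as follows (our sketch; the paper's actual program is in the appendix).

```python
from collections import Counter

def count_df11(corpus):
    """One pass over the corpus fills the df11 table (DFA) and counts N."""
    dfa = Counter()      # (X, Y) -> df11(X, Y); a missing key reads as 0
    n = 0
    for doc in corpus:   # each document is a set of words/labels
        n += 1
        for x in doc:
            for y in doc:
                dfa[(x, y)] += 1
    return dfa, n

dfa, n = count_df11([{"a", "b"}, {"a", "c"}])
print(n, dfa[("a", "a")], dfa[("a", "b")])  # -> 2 2 1
```

Since only pairs that co-occur at least once are stored, the table holds at most L^2 · N entries.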
8 Selecting Pairs
Even though df11, df10, df01, and df00 can be obtained
in constant time after O(N) preprocessing, there are
V^2 values to consider in order to obtain the most highly correlated
pairs of a given size. Fortunately, many of the functions that are
usable as indicators of correlation, and at least all
five functions above, return a value lower than a known
threshold if df11(x, y) = 0.
The cosine measure, the dice coefficient, and pairwise
mutual information have property 1 and property
2 as defined below. This implies that the value
for (x, y) where df11(x, y) = 0 is actually the minimum
value over all (x, y). Therefore, the first part of
the totally ordered sequence of (x, y) is the sorted list
of (x, y) where df11(x, y) > 0. The rest is an arbitrary
order of pairs where df11(x, y) = 0.
Property 1: the value is not negative.
Property 2: when df11(x, y) = 0, the value is 0.
The phi coefficient and the complementary similarity
measure have the following properties 1, 2 and
3. Therefore, the first part of the totally ordered sequence
where the value is positive is equal to the first
part of the sorted list where df11(x, y) > 0 and the
value is positive. Moreover, this list contains all
pairs that have a positive correlation. This list is long
enough for the actual application.
Property 1: when df11(x, y) = 0, the value is negative.
Property 2: when x and y are not correlated, the
estimated value is 0.
Property 3: when x and y tend to appear at the
same time, the estimated value is positive.
It should be recalled that the number of pairs
where df11(x, y) > 0 is less than L^2 · N. The sorted
list is obtained in O(L^2 · N · log(L^2 · N)) computation
time, where L is the maximum number of
different words/labels in one document. Since L is
constant, this becomes O(N · log(N)), even if the
size of the vocabulary is very large.
It is true that for some given fixed vocabulary
of size V, L^2 · N might become larger than V^2 as we
increase the size of the corpus. Fortunately, the actual
memory consumption of this procedure also has
an upper bound of O(V^2), so we do not lose
any memory space. When V is not fixed and may
become very large compared to N, as is the case for
proper nouns, L^2 · N is smaller than V^2.
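The selection step can be sketched as follows (our illustration using the cosine measure; names are ours): only pairs present in the df11 table are scored, so at most L^2 · N values are sorted.

```python
import math
from collections import Counter

def top_pairs(corpus, k):
    """Return the k pairs with the highest cosine among pairs with df11 > 0."""
    dfa = Counter()
    for doc in corpus:
        for x in doc:
            for y in doc:
                if x <= y:               # df11 is symmetric; store one side
                    dfa[(x, y)] += 1
    def cosine(x, y):                    # df(x) = df11(x, x)
        return dfa[(x, y)] / math.sqrt(dfa[(x, x)] * dfa[(y, y)])
    scored = [(cosine(x, y), x, y) for (x, y) in dfa if x < y]
    return sorted(scored, reverse=True)[:k]

docs = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
print(top_pairs(docs, 1))  # the pair (a, b) ranks first
```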
9 Case study of a Newspaper Corpus
The computation time of the baseline system is
V^2 · N, where V is the number of distinct labels in the
corpus. When we analyzed labels of names of places
in a newspaper over the course of one year, this corpus
consisted of about 60,000 documents. The place
names totalled 1902 after morphological analysis.
The maximum number of names in one document
was 142, and the average in one document was 4.02.
In this case, the method described here was much
more efficient than the baseline system.

N       time (sec.)   speed (sec./doc)
1000        2.4         2.4 x 10^-3
3000        7.8         2.6 x 10^-3
10000      21.1         2.1 x 10^-3
30000      60.9         2.0 x 10^-3

Table 1: The actual execution time shows a linear
relationship to the size of the input data.
Table 1 shows the actual execution time of the
program in the appendix, changing the length of the
corpus. This program computes similarity values for
all pairs of words where df11 > 0. It indicates that
the execution time is linear.
Our observation shows that even if the corpus
were extended year by year, L, the maximum
number of different words in one document, would remain
stable, even though the total number of words would
increase with the ongoing addition of proper nouns
and new concepts.
10 For a large corpus
Although the program in the appendix cannot be applied
to a corpus larger than memory size, we can
obtain the table of df11 using sequential access to a file.
The program in the appendix stores every pair in
memory. The space requirement of L^2 · N may
seem too great to hold in memory. However, a sequential
file can be used to obtain the df11 table, as
follows. Although the computation time for df11 is then
O(N · log(N)) rather than O(N), the total computation
time remains the same, because a computation
of O(N · log(N)) is required to select pairs in both
cases.
Consider the following data. Each line corre-
sponds to one document.
a b
a c
x y
x
x y z
a b c
When the pairs of words in each document are
recorded, the following file is obtained. Note that
since df11(x, y) = df11(y, x), it is not necessary to
record pairs where x > y. This reduces the memory
requirement.
a a
a b
b b
a a
a c
c c
x x
x y
y y
x x
x x
x y
x z
y y
y z
z z
a a
a b
a c
b b
b c
c c
Using the merge sort algorithm, which can sort a
large file using sequential access only, the file can
be sorted in O(N · log(N)) computation time. After
sorting in alphabetical order, identical pairs come
together. The pairs can then be counted with sequential
access, thereby providing the df11 table. An
example of this table follows:
a a 3
a b 2
a c 2
b b 2
b c 1
c c 2
x x 3
x y 2
x z 1
y y 2
y z 1
z z 1
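The pipeline above can be reproduced in a few lines; the following sketch (ours) keeps the pair lines in memory for brevity, whereas the paper sorts a sequential file with merge sort.

```python
from itertools import groupby

# Emit one (x, y) pair per co-occurrence with x <= y, sort, and count runs
# of identical pairs to obtain the df11 table for the example documents.

docs = [{"a", "b"}, {"a", "c"}, {"x", "y"}, {"x"}, {"x", "y", "z"},
        {"a", "b", "c"}]

pairs = [(x, y) for d in docs for x in d for y in d if x <= y]
pairs.sort()  # stands in for an external merge sort over a sequential file

table = {pair: len(list(run)) for pair, run in groupby(pairs)}
print(table[("a", "a")], table[("x", "y")])  # -> 3 2
```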
It should be noted that the df table can easily be obtained
by extracting the lines in which the letter of the first
column and that of the second column are the same,
since df(x) = df11(x, x). The df table can usually be
stored in memory, since it is a one-dimensional array.
After storing df in memory, similarity can be computed
line by line. The following example uses the
phi coefficient. The first column is the coefficient,
followed by df11, df10, df01, df00, x and y. Since the phi
coefficient is symmetric, the (x, y) value where x > y
is not required. When the function is not symmetric,
(x, y) and (df10, df01) can be exchanged at the same
time.
0.544705 3 0 0 3 a a
0.384900 2 1 0 3 a b
0.384900 2 1 0 3 a c
0.624695 2 0 0 4 b b
0.156174 1 1 1 3 b c
0.624695 2 0 0 4 c c
0.544705 3 0 0 3 x x
0.384900 2 1 0 3 x y
0.242536 1 2 0 3 x z
0.624695 2 0 0 4 y y
0.392232 1 1 0 4 y z
0.674200 1 0 0 5 z z
The ordered list can be obtained by sorting this
table by the first column. This example shows that
pairs where df11(x, y) = 0, such as (a, x) or (a, y),
do not add any overhead to either memory or computation
time.
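The line-by-line pass can be sketched as follows. This is our illustration using the standard phi formula from the function list above; the program that produced the numbers in the example may scale the coefficient differently, but the four derived counts agree with the columns shown.

```python
import math

N = 6                                                   # documents in the example
df = {"a": 3, "b": 2, "c": 2, "x": 3, "y": 2, "z": 1}   # df(x) = df11(x, x)

def row(x, y, df11):
    """Turn one 'x y df11' line into (phi, df11, df10, df01, df00, x, y)."""
    df10 = df[x] - df11
    df01 = df[y] - df11
    df00 = N - df[x] - df[y] + df11
    num = df11 * df00 - df10 * df01
    den = math.sqrt((df11 + df10) * (df11 + df01) *
                    (df10 + df00) * (df01 + df00))
    return (num / den, df11, df10, df01, df00, x, y)

print(row("b", "c", 1))  # counts come out as 1 1 1 3, matching the example
```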
11 Comparison with Apriori
There is a well-known algorithm for forming a list of
related items, termed Apriori (Agrawal and Srikant,
1995). Apriori lists all relationships using confidence,
where df11(x, y) is larger than a specified
value. Using Apriori, the df11 threshold can be specified
in order to reduce computation, whereas with
the proposed method there is no way to adjust this
threshold. This implies that Apriori may be faster
than our algorithm in terms of confidence. However,
since Apriori uses the property of confidence to reduce
computation, it cannot be used for other functions,
unlike the proposed method, which can employ
many standard functions, at least the five measures
used here, including confidence.
12 Correlation of All Substrings
When computing correlations of all substrings in a
corpus, V can be as large as N · (N + 1)/2. Since the
memory space requirement and computation time
do not depend on V, this method can be used to
generate a list of the most highly correlated substrings
of any length. In fact, in some cases, L may
be too large to compute.
The Yamamoto-Church method (Yamamoto and
Church, 2001) allows for the creation of a df(x) table
using O(N) memory space and O(N · log(N))
computation time, where x ranges over all substrings
in a given corpus. Yamamoto's method shows that
although there may be N · (N + 1)/2 kinds of
substrings in a corpus, there are at most 2 · N occurrence
patterns (or sets of substrings which have the same occurrence
pattern). The computational cost
is greatly reduced if we deal with each pattern instead
of each substring. Although the order of computational
complexity does not depend on V, L differs
depending on whether the pattern is used or not. We have also
developed a system using the pattern, which actually
reduces the cost of computation. Although the size
of L is still problematic even using the Yamamoto-Church
method, and although the computation cost
is much larger than when using words, the program runs
much faster than the simple method.
13 Conclusion
This paper describes a method for selecting correlated
pairs in O(N) memory space and O(N ·
log(N)) computation time, where N is the number
of documents in a corpus, provided that there
is an upper bound on the number of different
words/labels in one document/record. We have observed
that a corpus usually has this kind of upper
bound, and have shown that we can use a sequential
file for most of our memory requirements.
This method is useful not only for confidence but
also for other functions whose values are determined by
df11, df10, df01 and df00. Examples of these functions are
mutual information, the dice coefficient, the confidence
measure, the phi coefficient and the complementary
similarity measure.

References
K. W. Church and P. Hanks. 1990. Word association
norms, mutual information and lexicography. Computational
Linguistics, 16(1):22-29.
R. Agrawal and R. Srikant. 1995. Mining of association
rules between sets of items in large databases. In Proceedings
of the ACM SIGMOD Conference on Management
of Data: 94-105.
N. Hagita and M. Sawaki. 1995. Robust recognition
of degraded machine-printed characters using complementary
similarity measure and error-correction
learning. Proceedings of the SPIE - The International
Society for Optical Engineering, 2442:234-244.
U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The
KDD Process for Extracting Useful Knowledge from
Volumes of Data. Communications of the ACM,
39(11):27-34.
Christopher D. Manning and Hinrich Schuetze. 1999.
Chapter 8.5, Semantic Similarity. Foundations of
Statistical Natural Language Processing: 294-303, The
MIT Press.
Christopher D. Manning and Hinrich Schuetze. 1999.
Chapter 5.3.3, Pearson's chi-square test. Foundations
of Statistical Natural Language Processing: 169-172,
The MIT Press.
Mikio Yamamoto and Kenneth W. Church. 2001. Using
Suffix Arrays to Compute Term Frequency and Document
Frequency for All Substrings in a Corpus. Computational
Linguistics, 27(1):1-30, The MIT Press.
