A Method of Measuring Term Representativeness 
- Baseline Method Using Co-occurrence Distribution - 
Toru Hisamitsu,* Yoshiki Niwa,* and Jun-ichi Tsujii * 
$ Central Research Laboratory, Hitachi, Ltd. 
Akanuma 2520, Hatoyama, Saitama 350-0395, Japan 
{hisamitu, yniwa} @harl.hitachi.co.jp 
Abstract 
This paper introduces a scheme, which we call 
the baseline method, to define a measure of term 
representativeness and measures defined by using 
the scheme. The representativeness of a term is 
measured by a normalized characteristic value 
defined for a set of all documents that contain the 
term. Normalization is done by comparing the 
original characteristic value with the 
characteristic value defined for a randomly 
chosen document set of the same size. The latter 
value is estimated by a baseline function obtained 
by random sampling and logarithmic linear 
approximation. We found that the distance 
between the word distribution in a document set 
and the word distribution in a whole corpus is an 
effective characteristic value to use for the 
baseline method. Measures defined by the 
baseline method have several advantages 
including that they can be used to compare the 
representativeness of two terms with very 
different frequencies, and that they have 
well-defined threshold values of being 
representative. In addition, the baseline function 
for a corpus is robust against differences in 
corpora; that is, it can be used for normalization 
in a different corpus that has a different size or is 
in a different domain. 
1 Introduction 
Measuring the representativeness (i.e., the 
informativeness or domain specificity) of a term ~ is 
essential to various tasks in natural  
processing (NLP) and information retrieval (IR). It 
is particularly crucial when applied to an IR 
interface to help a user find informative terms. For 
instance, when the number of retrieved documents is 
intractably large, an overview of representative 
words in the documents is needed to understand the 
contents. To enable this, an IR system, called 
DualNAVI, that has two navigation windows where 
one displays a graph of representative words in the 
retrieved documents, was developed (Nishioka et al. 
1997). This window helps users grasp the contents 
of retrieved documents, but it also exposes problems 
concerning existing representativeness measures. 
Figure l shows an example of a graph for the 
query '~Y-'~'~(-~ (electronic money), with Nihon 
A term is a word or a word sequence. 
{ Graduate School of Science, the University of Tokyo 
7-3-I Hongo, Bunkyo-ku, Tokyo 113-8654, Japan 
tsujii@is.s.u-tokyo.ac.j p 
Keizai Sl#mbun (a financial newspaper) 1996 as the 
corpus. Frequently appearing words are displayed in 
the upper part of the window, and words are selected 
by a tf-idf-like measure (Niwa et al. 1997). Typical 
non-representative words are filtered out by using a 
stop-word list. 
me n e y .~v~~~\]~------~ ~ 
electronic ---- /tmu,.~ \] ~ .//'f-~year 
~/\ ~L~I2~-- month 
read c pher 
Figure 1 
A topic word graph when the query is 
N~-e ~---(etectronic money). 
One problem is the difficulty of suppressing 
uninformative words such as ~V- (year), -- (one), 
and )\] (month) because classical measures, such as 
tf-idf are too sensitive to word frequency and no 
established method to automatically construct a 
stop-word list has bcen developed. 
Another problem is that the difference in the 
representativeness of words is not sufficiently 
\[In ~j indicated. In the exarnple above, highlighting " .... 
(cipher) over less representative words such as ~'U~ 
k_ 5 (read) would be useful. Most classical 
measures based on only term frequency and 
document frequency cannot overcome this problem. 
To define a more elaborate measure, atternpts 
to incorporate more precise co-occurrence 
information have been made. Caraballo et al. (1999) 
tried to define a measure for "specificity" of a noun 
by using co-occurrence intbrmation of a noun, but it 
was not very successful in the sense that the 
measure did not particularly outperformed the term 
frequency. 
Hisamitsu et al. (1999) developed a measure 
of the representativeness of a term by using 
co-occurrence information and a normalization 
320 
technique. Fhe measure is based on the distance 
between tile word distribution in the documenls 
containing a term and the word distribution ill the 
whole corpus. Their measure overcomes previously 
mentioned problems and preliminary experiments 
showed that this measure worked better than 
existing measures in picking out 
representative/non-representative terms. Since tile 
normalizatio11 technique plays a crucial part of 
constructing tile nleasure, issl_lcs related to the 
normalization need more study. 
In this paper we review Hisamitsu's measure 
and introduce a generic scheme - which we call the 
baseline method for convenience - that can be used 
to define wu'ious measures including the above. A 
characteristic value of all documents containing a 
term Y is normalized by using a baseline fimction 
that estimates the characteristic value of a randomly 
chosen document set of the slune size. Tile 
normalized value is then used to measure tile 
representativeness of the term 77. A measure defined 
by the baseline.-method has several advantages 
compared to classical measures. 
We compare four measures (two classical 
ones and two newly defined ones) from w.trious 
viewpoints, and show the superiority of the measure 
based on the normalized distance between two word 
distributions. Another important finding is that the 
baseline function is substantially portable, that is, 
one defined for a corpus can be used for a different 
corpus even it" the two corpora Mvc considerably 
different sizes or arc in different domains. 
2. E,isting measures of representative~kess 
2.1 Overview 
Various methods for mea:.;uring the inforlnativcness 
or domain specificity of a word have been proposed 
in the donmins of IR and term extraction in NLP 
(see the survey paper by Kageura 1996). Ill 
characterizing a term, Kagcura introduced the 
concepts of "unithood" and "termhood": unithood is 
"the degree of strength or stability of syntagnmtic 
combinations or collocations," and termhood is "tile 
degree to which a linguistic unit is related to (or 
more straightR)rwardly, represents) domain-specific 
concepts." Kageura's termhood is therefore what we 
call representativeness here. 
Representativeness lneasurcs were first 
introduced in till IR domain for determining 
indexing words. The simplest measure is calculated 
fi'om only word frequency within a document, For 
example, tile weight 1 o of word w~ in document d/is 
defined by 
./~./ 
/r.. ____ , 
~ >Z/< .I t-i 
wherc./ii is tilt: frequency of word wi in document (\]i 
(Sparck-Jolms 1973, Noreauh ct al. 1977). More 
elaborate measures for tcrmhood combine word 
frequency within a document and word occurrence 
over a whole corpus. For instance, (/:/4/; the most 
comlnonly used measure, was originally defined its 
N IoIcl\[ \[,i = ./;, x log(--), 
where iV, and N,,,,~ are, respectively, tile number of 
documents containing word wg and the total number 
of documents (Salton et al. 1973). There are a 
wlriety or" definitions of ¢idJl but its basic feature is 
that a word appearing more flequently in fewer 
documents is assigned a higher value. If documents 
are categorized beforehand, we can use a more 
sophisticated measure based on the X-' test of the 
hypothesis that an occurrence of" the target word is 
independent of categories (Nagao et al. 1976). 
Research on automatic term extraction in 
NLP domains has led to several measures for 
weighting terms mainly by considering the unithood 
of a word sequence. For instance, mutual 
information (Church ct al. 1990) and the 
log-likelihood (Dunning 1993) methods for 
extracting word bigrams have been widely used. 
Other measures for calculating the unithood of 
n-grains have also been proposed (Frantzi et al. 
1996, Nakagawa et al. 1998, Kita et al. 1994). 
2.2 Problems 
Existing measures suffer from at least one of the 
following problems: 
(1) Classical measures sucll as t/-idjare so sensitive 
to term frequencies that they fail to avoid very 
frequent non-informative words. 
(2) Methods using cross-category word distributions 
(such as the Z-' method) can be applied only if 
documents in a corpus are categorized. 
(3) Most lneasures in NLP domains cannot treat 
single word terms because they use the unithood 
strength of multiple words. 
The threshold wdue lbr being representative is 
dcfincd in all ad hoc manner. 
constructs 
(4) 
The scheme that we describe here 
measures that are free of these problems. 
3. Baseline method for defining 
representativeness measures 
3.1 Basic idea 
This subsection describes the method we developed 
for defining a measure of term representativeness. 
Our basic idea is smmnarized by tile lhmous quote 
(Firth 1957) : 
"You shall k~ow a wo~zt l~y the coml)alT); ir 
Iwup.v." 
We interpreted this as the following working 
hypolhesis: 
321 
For any term T, if the term is 
representative, D(T), the setofall 
documents containing T, should have 
some characteristic property 
compared to the "average" 
To apply this hypothesis, we need to specify a 
measure to obtain some "property" of a document 
set and the concept of "average". Thus, we 
converted this hypothesis into the following 
procedure: 
Choose a measure M characterizing 
adocumentset. FortermT, calculate 
M(D(T)), the value of the measure 
for D(T). Then compare M(D(T)) with 
B~(#D(T)), where #D(T)is the number 
of words contained in #D(T), and B,~ 
estimates the value of M(D) when D 
is a randomly chosen document set 
of size #D(T). 
Here, M measures the property and BM estinmtes the 
average. The size of a document set is defined as the 
number of words it contains. 
We tried two measures as M. One was the 
number of different words (referred to here as 
DIFFNUM) appearing in a document set. Teramoto 
conducted an experiment with a snmll corpus and 
reported that DIFFNUM was useful for flicking out 
important words (Teramoto et al. 1999) under the 
hypothesis that the number of different words 
co-occurring with a topical (representative) word is 
snmllcr than that with a generic word. The other 
measure was the distance between the word 
distribution in D(T) and the word distribution in the 
whole corpus Do. The distance between the two 
distributions can be measured in various ways, and 
we used the log-likelihood ratio as in Hisalnitsu et al. 
1999, and denote this rneasure as LLR. Figure 2 
plots (#D, M(D))s when M is DIFFNUM or LLR, 
where D varies over sets of randomly selected 
documents of various sizes from the articles in 
Nikkei-Shinbun 1996. 
For measure M, we define Rep(T, M), the 
representativeness of T, by normalizing M(D(T)) by 
BM(#D(T)). The next subsection describes the 
construction of By and the normalization. 
3.2 Baseline function and normalization 
Using the case of LLR as an example, this 
subsection explains why nornmlization is necessary 
and describes the construction of a baseline 
function. 
Figure 3 superimposes coordinates {(#D(7), 
LLR(D(T))} s onto the graph of LLR where T varies 
2 With Teramoto's method, eight paranaeters must be ttmed to 
normalize D1FFNUM( D( T) ), but the details of how this was 
done were not disclosed. 
I000000 
100000 
10000 
I000 
100 
10 
I 
100 100000 100000000 
#D: Size of randomly chosen documents 
Figure 2 
Values of DIFFNUM and LLR for 
randomly chosen document set. 
II ~,~'' i " " ~ .... over ~-ytc pner), qi(year), )J (month), i~cJ~-ll~7~ 
(read), -- (one), j- ~ (do), and ~}: i>~/(economy). 
Figure 3 shows that, for example, LLR(D(J-~)) is 
smaller than LLR(D( ~,~ }J5 )), which reflects our 
linguistic intuition that words co-occurring with 
"economy" are more biased than those with "do". 
However, LLR(DOI~-',3-)) is smaller than LLR(D(.£J/- 
I~6)) and smaller even than LLR(D@O-~)). This 
contradicts our linguistic intuition, and is why 
values of LLR are not dircctly used to compare the 
representativeness of terms. This phenomenon arises 
because LLR(D(~) generally increases as #\])(7) 
increases. We therefore need to use some form of 
normalization to offset this underlying tendency. 
We used a baseline function to normalize the 
values. In this case, Bu,(o) was designed so that it 
approximates the curve in Fig. 3. From the 
definition of the distance, it is obvious that Bu.t~(0) = 
Bu.R(#Do) = 0. At the limit when #1)(~--+ o% Bu.R(') 
becomes a monotonously increasing function. 
The curve could be approxinmted precisely 
through logarithmic linear approximation near (0, 0). 
~lb make an approximation, up to 300 documents are 
randomly sampled at a time. (Let each randomly 
chosen document set be denoted by D. The number 
of sampled documents are increased from one to 300, 
repeating each number up to five times.) Each (#D, 
LLR(D)) is converted to (log(#D), Iog(LLR(D))). 
The curve formulated by the (log(#D), log(LLR(D))) 
values, which is very close to a straight line, is 
further divided into nmltiple parts and is part-wise 
approximated by a linear function. For instance, in 
the interval I = {x \[ 10000 _<x < 15,000}, 
Iog(LLR(D)) could be approximated by 1.103 + 
1.023 x log(#D) with R e = 0.996. 
For LLR, we define Rep(T, LLR), the 
representativeness of T by normalizing LLR(D(7)) 
by Bu.R(#D(7)) as follows: 
Rep(r, LLR) = 100 x (Iog(LLR(D(T))) _ 1). 
"log(Bu, (# D(T))) 
322 
For instance, when we used Nihon Keizai 
Shimbun 1996, The average of I OOx(log(LLR(D)) 
~log(BLue (#D)) - 1), Avr, was -0.00423 and the 
standard deviation, cs, was about 0.465 when D 
varies over randomly selected doctuncnt sets. l';very 
observed wflue fell within Avs'4-4er and 99% ot' 
observed values fell within Avl±3cs. This hapfmlled 
in all corpora (7 orpora) we tested. Theretbrc, we 
can de:fine the threshold of being representative as, 
say, Aw" + 40. 
umoooo ~:}f'i(economy) .__ _ h... JJ~nionlh) 
! ;~i'~;i/.Jl).~) (read) i 
i 
., (cipher) \ !! ~,j~ ! & (do) 
)~ 10000 
1000 
1 O0 1000 10000 100000 1000000 10000000 I \[ = {}S 
#1) and lid (T) 
Figure 3 
Baseline and sample word distribution 
3.3 Treatment of very frequent terms 
So \['ar we have been unable to treat extremely 
frequent terms, such as -~-~ (do). We therefore 
used random sampling to calculalc tile 1@1)(77 LLR) 
of a very li'cquent lerm T. II' the munbcr ot' 
documents in D(7) is larger than a threshold wdue N, 
which was calculated froln the average number of 
words contained in a document, N docnmcnts arc 
randomly chosen from D(2) (we used N = 150). This 
subset is denoted D(T) and Re/)(7; LLR) is delined 
by 100 x (log(LLR(D(7))) /log(BL~,Se (#1)(7))) -- 1). 
This is effcctivc because wc can use a 
well-approximated part of the baseline curve; it also 
reduccs thc amount of calctflation required. 
By using Rel)(77 LLR) detSned above, wc 
obtained Rel)(-'F g), LLR) = -0.573, Rel)(a')&TJ, llk 7~), 
LLR) = 4.08, and , * .... Re\])(llil-o, LLR) = 6.80, which 
reflect our linguistic intuition. 
3.4 Features of Rep(T, M) 
Rep(T, M) has the t bllowing advantages by virtue of 
its definition: 
(1) Its definition is mathematically clear. 
(2) It can compare high-frequency terms with low- 
ficqucncy terms. 
(3) The threshold value of being representative can 
be defined systematically. 
(4) It can be applied to n-gram terms for any n. 
4. Experiments 
4.1 Ewfluation of monograms 
Taldng topic-word selection for a navigation 
window for IR (see Fig. 1) into account, we 
cxamined the relation bctwecn the value of Rel)(7, 
M) and a manual classification of words 
(monograms) extracted from 158,000 articles 
(excluding special-styled non-sentential articles such 
as company-personnel-aflhir articles) in the 1996 
issties of the Nildcei Shinbun. 
4.1.1 Preparation 
We randolnly chose 20,000 words from 86,000 
words having doculnent ficquencies larger than 2, 
thcn randomly chose 2,000 of them and classified 
these into thrce groups: class a (acceptable) words 
uscfill for the navigation window, class d (delete) 
words not usethl for the navigation window, ,and 
class u (uncertain) words whose usefulness in the 
navigation window was either neulral or difficult to 
judge. In the classification process, a judge used the 
DualNA VI system and examined the informativeness 
of each word as guidance. Classification into class d 
words was done conservatively because the 
consequences of removing informative words from 
lhc window are more serious than those of allowing 
useless words to appear. 
3hblc I shows part of the chtssification of thc 
2,000 words. Words marked "p" arc proper nouns. 
The difference between propcr nouns in class a and 
proper nouns in other classes is that the former arc 
wcllknown. Most words classified as "d" are very 
common verbs (such as-,J-~(do) and {J~s-~(have)), 
adverbs, demonstrative pronouns, conjunctions, and 
numbers. It is thereti)rc impossible to define a 
stop-word list by only using parts-of-spccch bccausc 
ahnost all parts-of speech appear in class d words. 
4.1.2 Measures used in tile experiments 
To evaluate the effectiveness of several lneasures, 
we compared the ability of each measure to gather 
(avoid) representative (non-representative) terms. 
We randomly sorted thc 20,000 words and then 
compared the results with the restllts of sorting by 
other criteria: Rep(., LLR), Rep(., DIFFNUM), (f 
(tern~ liequency), and tfid.fi The comparison was 
done by nsing the accunmlated number of words 
marked by a specified class that appeared in the first 
N (1 _< N_< 2,000) words. The definition we used for 
tj- idf was 
Nlota\[ .t/- ira= 4771775 × log N(r ' 
where T is a term, TF(7) is the term frequency of 7, 
Nt,,,<,l is the number of total documents, and N(7) is 
the number of documents that contain 7: 
4.1.3 Results 
Figure 4 compares, for all the sorting criteria, tile 
323 
accumulated number of words marked "a". The total 
number of class a words was 911. Rep( o, LLR) 
clearly outperformed the other measures. Although 
Rep(., DIFFNUM) outperformed .tf and tf-idf up to 
about the first 9,000 monograms, it otherwise 
under-performed them. If we use the threslaold value 
of Rep(., LLR), from the first word to the 1,511th 
word is considered representative. In this case, the 
recall and precision of the 1,511 words against all 
class a words were 85% and 50%, respectively. 
When using tf-idf the recall and precision of the 
first 1,511 words against all class a words were 79% 
and 47%, respectively (note that tJ'-idfdoes not have 
a clear threshold value, though). 
Although the degree of out-performance by 
Rep(., LLR) is not seemingly large, this is a 
promising result because it has been pointed out that, 
in the related domains of term extraction, existing 
measures hardly outperform even the use of 
frequency (for example, Daille et al. 1994, Caraballo 
et al. 1999) when we use this type of comparison 
based on the accumulated numbers. 
Figure 5 compares, for all the sorting criteria, 
the accumulated number of words marked by d (454 
in total), in this case, fewer the number of words is 
better. The difference is far clearer in this case: 
Rep(., LLR) obviously outperformed the other 
measures. In contrast, tfidJ and frequency barely 
outperformed random sorting. Rep(., DIFFNUM) 
outperformed tfand (f-idfuntil about the first 3,000 
monograms, but under-performed otherwise. 
Figure 6 compares, for all the sorting criteria, 
the accumulated number of words marked ap 
(acceptable proper nouns, 216 in total ). Comparing 
this figure with Fig. 4, we see that the 
out-performance ofRep(., LLR) is more pronounced. 
Also, Rep(., DIFFNUM) globally outperformed tf 
and tf-idf while the performance of(land tf-idfwcre 
nearly the same or even worse than with random 
sorting. 
IOOO 
900 
~00 
700 
600 
500 
400 
300 
10 
0 5000 10000 15000 20000 
Order 
• random • Rep(., LLR) a Rep(., DIFFNUM) ~ tfidf * tf 
Figure 4 
Sorting results on class a words 
350 
300 Z 
250 
200 
.< 
150 
100 
~g / 
L 
0 5000 10000 15000 20000 
Order 
• random ~ Rep(., LLR) a Rcp(., DIFFNUM) ~ tt:idf • tf 
Figure 5 
Sorting results on class d words 
p) 
a~ 150 Z 
100 
.< 
o j~,,-- 
o 5(1{)0 I0000 15000 20000 
Order 
• random ~ Rep(., LLR) z~ Rep(., I)IFFNUM) ~ tl=id\[" • tf 
Figure 6 
Sorting results on class ap words 
qhble 1 
Examples of the classified words 
chtss a class u class d 
~" 2 :L ~Y-2"~ 5/ 1..'<-- ~ O'/~s("g) (chilly) )kT'-I'i)J (83,000,000) 
(amusement park) ~'\['J?J2 (depressed) ~)<?2 (greatly) 
g)3~)~ (threlerfingletter) ;~'~'1 t (lshigami) p T-l'flJqM-/-: (1, t46) 
/'7"4)'OM--JP (fircwall) ~}5',;: (Shigeyuki) p ~J-~<~ (all) 
"\[~l'~t~', (antique) li~¢;i,'2:t, '¢£(misdirected) ~" L L (not...in the least) 
7" \]- ~ ;/// (Atlanta) p ~}J(~A~ (agility) 
In the experiments, proper nouns generally 
have a high Rep-value, and some have particularly 
high scores. Proper nouns having particularly high 
scores are, for instance, the names ofsumo wrestlers 
or horses. This is because they appear in articles 
with special formats such as sports reports. 
We attribute the difference of the performance 
between Rep(., LLR) and RED(., DIFFNUM) to the 
quantity of information used. Obviously information 
on the distribution of words in a document is more 
comprehensive than that on the number of different 
words. This encourages us to try other measures of 
document properties that incorporate even more 
precise information. 
324 
4.2 Picking out fl'equeni non-representative 
monograms 
When we concentrate on the nlost fi-equent terms, 
Re/)(., DIFFNUM) outperfomlcd Rep(., LLR) in the 
following sense. We marked "clearly 
non-representative terms" in the 2,000 most frequent 
monograms, then counted the number of marked 
terms that were assigned Rt7)-values smaller than 
the threshold value of a specified representativeness 
ulcasurc. 
The total number of checked terms was 563, 
and 409 of them are identified as non-representative 
by Rep(', LER). On the other hand, Rep( °, 
DIFFNUM) identified 453 terms as 
non--representative. 
4.3 Rank correlation between measures 
We investigated the rank-correlation of the sorting 
results for the 20,000 terms used in the experiments 
described in subsection 4.1. Rank correlation was 
measured by Spearman's method and Kendall's 
method (see Appendix) using 2,000 terms randomly 
selected from the 20,000 terms. Table 2 shows the 
correlation between Rep(,, LLR) and other measures. 
It is interesting that the ranking by Rep(., LLR) and 
that by Rep(., DIFFNUM) had a very low 
correlation, even lower than with (f or (fidf This 
indicates that a combination of Rep(., LLR) and 
Rep(,, DIFFNUM) should provide a strong 
discriminative ability in term classification; this 
possibility deserves further investigation. 
Table 2 
Two types of Rank correlation between 
term-rankings by Rep(., LLR) and other measures. 
Rep(., DIFFNUM) t/=ic(f tf 
Spearman -0.00792 0.202 0.198 
Kendall -0.0646 0.161 0.153 
4.4 Portability of baseline functions 
We examined the robustness of thc baseline 
fimctions; that is, whether a baseline function 
defined from a corpus can be used for normalization 
in a different corpus. This was investigated by using 
Re/)(., LLR) with seven different corpora. Seven 
baseline functions were defined from seven corpora, 
then were used for normalization for defining Rep(., 
LLR) in the corpus used in the experiments 
described in subesction 4.1. The per%rmance of the 
Re/)(,, LLR)s defined using the difl'erent baseline 
flmctions was compared in the same way as in the 
snbsection 4. l. The seven corpora used to construct 
baseline fhnctions were as follows: 
NK96-ORG: 15,8000 articles used in the experiments in 4.1 
NK96-50000:50,000 randomly selected articles from Ihe whole 
corpus N K96 (206,803 articles of Nikkei-shinhun 1996) 
N K96-100000: I 0(},000 randomly selected articles fn}m N K96 
NK96-200000: 2{}0,00(} randomly selcctcd articles fiom NK96 
NK98-1580{}0:158,0{}(} randomly selecled articles from articles in 
Nikkei-xhinhun 1998 
N('- 158000:158,{}00 randomly selected abstracts of academic papers 
I\]'Olll NACSIS corptl:.; (Kando ct al. 1999) 
NC-:\LI.: all abstracts (333,003 abstracts) in the NACSIS coq)us. 
Statistics on their content words are shown in Table 3. 
Table 3 
Corpora and statistics on their content words 
~~. NK96-OP, G NK96-soooo NKq6-1ooooo NK96-2ooooo 
fi o|'Iotal words 42,555,095 13,49S,244 26,934,068 53.816,407 
;: ofdillbrent words 210,572 127,852 172.914 233,668 
~~ NK98-158000 NC-158000 NC-AI.I. 
# ,af total v,'ords 39,762, 127 30,770,682 64,806,627 
# of difliarent words 196,261 231,769 350.991 
Figure 7 compares, for all the baseline functions, the 
accumulated number of words marked "a" (see 
subsection 4.1). The pertbrmancc decreased only 
slightly when the baseline defned from NC-ALL 
was used. In other cases, the difl'erences was so 
small that they were almost invisible ill Fig. 7. The 
same results were obtained when using class d 
words and class ap words. 
tuoo 
9OO 
700 
-j 
5OO 
0 2000 40011 (dRRI XOOH I UO(lO 12tRR} 14000 160110 I ROOt} 21111{11/ 
Order 
* random ~ NK96-OR(i A NK96-5t}000 - NK96-100000 
c\] NK96-20{}000 * NK98-158{}(1{} + NC-158000 x NC-ALL 
Figure 7 
Sorting results on class a words 
We also examined the rank correlations 
between the ranking that resulted from each 
representativeness measure in the same way as 
described in subsection 4.2 (see Table 4). They were 
close to 100% except when combining the Kendall's 
method and NACSIS corpus baselines. 
Table 4 
Rank correlation between the measure defined by an 
NK96-ORG baseline and ones defined by other baselines (%) 
NK96- NK96- NK96- NK9g- "~C- 1 5800C NC-AI.I. 
500{}0 I.OOOO 2000{}0 158000 
Spcarmann 0.997 0.997 0.996 0.999 0.912 0.900 
Kendall 0.970 0.956 0.951 0.979 0.789 0.780 
These resnhs suggest that a baseline function 
constructed from a corpus can be used to rank terms 
in considerably different corpora. This is particularly 
useful when we are dealing with a corpus silnilar to 
a known corpus but do not know the precise word 
distributions in the corpus. The same tdnd of 
robustness was observed when we used Re/)(", 
325 
DIFFNUM). This baseline thnction robustness is an 
important tbature of measures defined using the 
baseline based. 
5. Conclusion and future works 
We have developed a better method -- the baseline 
method -- for defining the representativeness of a 
term. A characteristic value of all docmnents 
containing a term T, D(T), is normalized by using a 
baseline function that estimates the characteristic 
value of a randomly chosen doculnent set of the 
same size as D(?). The normalized value is used to 
measure the representativeness of the term T, and a 
measure defined by the baseline method offers 
several advantages compared to classical measures: 
(1) its definition is mathematically simple and clean 
(2) it can compare high-frequency terms with 
low-frequency terms, (3) the threshold value for 
being representative can be defined systcmatically, 
and (4) it can be applied to n-gram terms for any n. 
We developed two measures: one based on 
the normalized distance between two word 
distributions (Rep(., LLR)) and another based on 
the number of different words in a document set 
(Rep( o, DIFFNUM)). We compared these measures 
with two classical measures from various viewpoints, 
and confirmed that Rep(,, LLR) was superior. 
Experiments showed that the newly developed 
measures were particularly eflizctive for discarding 
frequent but uninformative terms. We can expect 
that these measures can be used for automated 
construction of a stop-word list and improvement of 
similarity calculation of documents. 
An important finding was that the baseline 
function is portable; that is, one defined on a corpus 
can be used for laormalization in a diflbrent corpus 
even if the two corpora have considerably diftbrent 
sizes or are in different domains. Wc can therefore 
apply the measures in a practical application when 
dealing with multiple similar corpora whose word 
distribution information is not fully known but we 
have the inforlnation on one particular corpus. 
We plan to apply Rep(., LLR) and Rep(., 
DIFFNUM) to several tasks in IR domain, such as 
the construction of a stop-word list for indexing and 
term weighting in document-similarity calculation. 
It will also be interesting to theoretically 
estimate the baseline functions by using 
fundalnental parameters such as the total numbcr of 
words in a corpus or the total different number in the 
corpus. The natures of the baseline functions 
deserve further study. 
Acknowledgements 
This project is supported in part by the Advanced 
Software Technology Project under the auspices of 
Information-technology Promotion Agency, Japan 
(IPA). 
Appendix 
Asusume that items I1 ..... IN are ranked by measures A and B, 
and that the rank of item/: assigncd by A (B) is RiO" ) (R~(j)), 
where RA(i ) eRA(j) (Rl4(i) ¢Ri~(j)) if i ~j. Then, Spearman's rank 
correlation between the two rankings is given as 
t 6x~j(R4(i)-R"(i))2 
N(N ~ - 1) 
and Kendall's rank correlation between the two rankings is 
given as 
I × ({# {(i, j)I c~(&.,(i) - R A (j)) = cr(Rz,(i ) - RB(j))}- 
N C2 
#{(i,./) l cr(R.4(i) - R.I(J)) = -cr(R~(i) - Re(./))}) , 
where c~ (x)=l ifx > 0, clse ifx < 0, c~ (x) = -I. 

References 

Caraballo, S. A. and Charniak, E. (1999). Determining 
the specificity of nouns fronl text. Prec. of EMNLP'99, 
pp. 63-70. 

Church, K. W. and Itanks, P. (1990). Word Association 
Norms, Mutual Information, and Lexicography, 
Computational Linguistics 6(1), pp.22-29. 

Daille, B. and Gaussier, E., and Lange, J. (1994). Towards 
automatic extraction of monolingual and bilingual 
terminology. Proc. of COLING'94, pp. 515-521. 

Dunning, T. (1993). Accurate Method for the Statistics of 
Surprise and Coincidence, Computational Linguistics 
19(1), pp.61-74. 

Firth, J. A synopsis ot' linguistic theory 1930- 1955. (t 957). 
Studies in Linguistic Analysix, Philological Society, Oxford. 

Frantzi, K. T., Ananiadou, S., and Tsujii, J. (1996). 
Extracting Terminological Expressions, IPSJ Technical 
Report of SIG NL, NLl12-12, pp.83-88. 

Hisamitsu, 'I:, Niwa, Y., and "l'sttiii, J. (1999). Measuring 
Representativeness of Terms, Prec. oflRAL'99, pp.83-90. 

Kageura, K. and Umino, B. (1996). Methods of automatic term 
recognition: A review. Terminology 3(2), pp.259-289. 

Kando, N., I:,2uriyanaa, K., and Nozue, T. (1999). NACSIS test 
collection workshop (NTCIR-I), l'roc, of the 22nd Ammal 
hlternational A CM SIGIR Cot!\['. on Research and 
Development in 1R, pp.299-300. 

Kita, Y., Kate, Y., Otomo, 'E, and Yano, Y. (1994). 
Colnparativc Study of Automatic Extraction of Collocations 
fiOln Corpora: Mutual lnlbrmation vs. Cost Criteria, Journal 
of Natural Language Processing, 1 ( 1 ), 21-33. 

Nagao, M., Mizutani, M., and lkeda, H. (1976). An Automated 
Method of the Extraction of hnportant Words fiom Japanese 
Scientific l)ocuments, Trans. oJIPSJ, 17(2), pp. 110-117. 

Nakagawa, H. and Mori, T. (1998). Nested Collocation and 
Compound Noun For Term Extraction, Prec. c( 
Computernt '98, pp.64-70 

Nishioka, S., Niwa, Y., lwayama, M., and Takano, A. (1997). 
DualNA VI: An intbrmation retrieval interface. Prec. o j 
WISS'97, pp.43-48. (in Japanese) 

Niwa, Y., Nishioka, S., Iwayama, M., and Takano, A. (1997). 
'lbpic graph generalion lbr query navigation: UTse of 
fiequency classes lbr topic extraction. Prec. c?fNLPRS'97, 
pp.95-100. 

Norcault, q'., McGill, M., and Koll, M. B. (1977). A 
Pertbrmance Evaluation of Similarity Measure, Document 
Telill Weighting Schemes and Representation in a Boolean 
Environment. In Oddey, R. N. (ed.), Iq\[brmalion Retrieval 
Resemz:h. London: Butterworths, pp.57-76. 

Salton, G. and Yang, C. S. (1973). On the Specification of Term 
Values in Automatic Indexing. Journal of Documentation 
29(4), pp.351-372. 

Sparck-Jones, K. (1973). Index Term Weighting. h(/brmation 
Storage and Retrieval 9(11), pp.616-633. 

Tcramoto, Y., Miyahara, Y., and Matsumoto, S. (1999). 
Word weight calculation for document retrieval by analyzing 
the distribution of co-occurrence words, Prec. of the 59th 
Ammal Meeting of lPS.l, 1P-06. (in Japanese) 
