Empirical Term Weighting and Expansion Frequency 
Kyoji Umemura 
Toyohashi University of Technology 
Toyohashi Aichi 441-8580 Japan 
umemura@tutics.tut.ac.jp
Kenneth W. Church 
AT&T Labs-Research 
180 Park Ave., Florham Park, NJ. 
kwc@research.att.com
Abstract 
We propose an empirical method for estimating term weights directly from relevance judgements, avoiding various standard but potentially troublesome assumptions. It is common to assume, for example, that weights vary with term frequency (tf) and inverse document frequency (idf) in a particular way, e.g., tf·idf, but the fact that there are so many variants of this formula in the literature suggests that there remains considerable uncertainty about these assumptions. Our method is similar to the Berkeley regression method where labeled relevance judgements are fit as a linear combination of (transforms of) tf, idf, etc. Training methods not only improve performance, but also extend naturally to include additional factors such as burstiness and query expansion. The proposed histogram-based training method provides a simple way to model complicated interactions among factors such as tf, idf, burstiness and expansion frequency (a generalization of query expansion). Expanded terms are handled on the basis of statistical information. Expansion frequency dramatically improves performance from a level comparable to BKJJBIDS, Berkeley's entry in the Japanese NACSIS NTCIR-1 evaluation for short queries, to the level of JCB1, the top system in the evaluation. JCB1 uses sophisticated (and proprietary) natural language processing techniques developed by Just System, a leader in the Japanese word-processing industry. We are encouraged that the proposed method, which is simple to understand and replicate, can reach this level of performance.
1 Introduction 
An empirical method for estimating term weights directly from relevance judgements is proposed. The method is designed to make as few assumptions as possible. It is similar to Berkeley's use of regression (Cooper et al., 1994) (Chen et al., 1999) where labeled relevance judgements are fit as a linear combination of (transforms of) tf, idf, etc., but avoids potentially troublesome assumptions by introducing histogram methods. Terms are grouped into bins. Weights are computed based on the number of relevant and irrelevant documents associated with each bin. The resulting weights usually lie between 0 and idf, which is a surprise; standard formulas like tf·idf would assign values well outside this range.

• t: a term
• d: a document
• tf(t, d): term freq = # of instances of t in d
• df(t): doc freq = # of docs d with tf(t, d) ≥ 1
• N: # of documents in collection
• idf(t): inverse document freq: $idf(t) = -\log_2 (df(t)/N)$
• $df(t, rel, tf_0)$: # of relevant documents d with $tf(t, d) = tf_0$
• $df(t, \overline{rel}, tf_0)$: # of irrelevant documents d with $tf(t, d) = tf_0$
• ef(t): expansion frequency = # docs d in query expansion with tf(t, d) ≥ 1
• TF(t): standard notion of frequency in corpus-based NLP: $TF(t) = \sum_d tf(t, d)$
• B(t): burstiness: B(t) = 1 iff $TF(t)/df(t)$ is large
Table 1: Notation
The method extends naturally to include additional factors such as query expansion. Terms mentioned explicitly in the query receive much larger weights than terms brought in via query expansion. In addition, whether or not a term t is mentioned explicitly in the query, if t appears in documents brought in by query expansion (ef(t) ≥ 1) then t will receive a much larger weight than it would have otherwise (ef(t) = 0). The interactions among these factors, however, are complicated and collection dependent. It is safer to use histogram methods than to impose unnecessary and potentially troublesome assumptions such as normality and independence.
  idf    tf=0    tf=1    tf=2    tf=3    tf≥4
12.89   -0.37    9.73   11.69   12.45   13.59
10.87   -0.49    8.00    9.95   11.47   12.06
 9.79   -0.86    7.36    9.38   10.63   10.88
 8.96   -0.60    6.26    7.99    8.99    9.41
 7.75   -0.34    4.62    5.82    6.62    7.98
 6.82   -1.26    3.94    6.05    7.59    8.98
 5.78   -0.83    3.16    5.17    5.77    7.00
 4.74   -0.84    2.46    3.91    4.54    5.58
 3.85   -0.60    1.58    2.76    3.57    4.55
 2.85   -1.02    1.00    1.72    2.55    3.96
 1.78   -1.33   -0.06    1.05    2.46    4.50
 0.88   -0.16    0.17    0.19   -0.10   -0.37
Table 2: Empirical estimates of λ as a function of tf and idf. Terms are assigned to bins based on idf. The column labeled idf is the mean idf for the terms in each bin. λ is estimated separately for each bin and each tf value, based on the labeled relevance judgements.

Under the vector space model, the score for a document d and a query q is computed by summing a contribution for each term t over an appropriate set of terms, T. T is often limited to terms shared by both the document and the query (minus stop words), though not always (e.g., query expansion).

$$score_v(d, q) = \sum_{t \in T} tf(t, d) \cdot idf(t)$$

Under the probabilistic retrieval model, documents are scored by summing a similar contribution for each term t.

$$score_p(d, q) = \sum_{t \in T} \log \frac{P(t \mid rel)}{P(t \mid \overline{rel})}$$

In this work, we use λ to refer to term weights.

$$score(d, q) = \sum_{t \in T} \lambda(t, d, q)$$
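The scoring formulas above can be sketched in a few lines of Python. This is only an illustration: the function names, the toy collection and the counts are assumptions made for the example, not part of the system described in this paper.

```python
import math


def idf(doc_freq: int, n_docs: int) -> float:
    """Inverse document frequency: -log2(df(t) / N)."""
    return -math.log2(doc_freq / n_docs)


def score_tfidf(doc_tf: dict, df: dict, n_docs: int, query_terms: list) -> float:
    """Vector space score: sum of tf(t, d) * idf(t) over terms shared by d and q."""
    shared = [t for t in query_terms if doc_tf.get(t, 0) > 0]
    return sum(doc_tf[t] * idf(df[t], n_docs) for t in shared)


def score_lambda(doc_tf: dict, query_terms: list, weight) -> float:
    """Generic score: sum of a term weight lambda(t, tf) over the query terms."""
    return sum(weight(t, doc_tf.get(t, 0)) for t in query_terms)


# Toy collection (hypothetical numbers, chosen so that idf is a whole number).
N_DOCS = 1024
DF = {"kennedy": 4, "except": 256}
doc = {"kennedy": 3}                 # tf(t, d) for a single document
query = ["kennedy", "except"]

tfidf_score = score_tfidf(doc, DF, N_DOCS, query)   # 3 * idf("kennedy") = 3 * 8.0
# The same score, written as a sum of term weights with lambda = tf * idf.
same_score = score_lambda(doc, query, lambda t, tf: tf * idf(DF[t], N_DOCS))
```

Passing the weight as a function keeps the scoring loop fixed while the weighting scheme (tf·idf here, a trained λ later) is swapped in.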
This paper will start by showing how to estimate λ from relevance judgements. Three parameterizations will be considered: (1) fit-G, (2) fit-B, which introduces burstiness, and (3) fit-E, which introduces expansion frequency. The evaluation section shows that each model improves on the previous one. But in addition to performance, we are also interested in the interpretations of the parameters.
2 Supervised Training 
The statistical task is to compute $\hat{\lambda}$, our best estimate of λ, based on a training set. This paper will use supervised methods where the training materials not only include a large number of documents but also a few queries labeled with relevance judgements.
To make the training task more manageable, it is common practice to map the space of all terms into a lower dimensional feature space. In other words, instead of estimating a different λ for each term in the vocabulary, we can model λ as a function of tf and idf and various other features of terms. In this way, all of the terms in a bin are assigned the same weight, λ. The common practice, for example, of assigning tf·idf weights can be interpreted as grouping all terms with the same idf into a bin and assigning them all the same weight, namely tf·idf. Cooper and his colleagues at Berkeley (Cooper et al., 1994) (Chen et al., 1999) have been using regression methods to fit λ as a linear combination of idf, log(tf) and various other features. This method is also grouping terms into bins based on their features and assigning similar weights to terms with similar features.

In general, term weighting methods that are fit to data are more flexible than weighting methods that are not fit to data. We believe this additional flexibility improves precision and recall (table 8).

[Figure 1 plot: two panels, Train (top) and Test (bottom); x-axis: IDF.]
Figure 1: Empirical weights, λ. Top panel shows the values in table 2. Most points fall between the dashed lines (lower limit of λ = 0 and upper limit of λ = idf). The plotting character denotes tf. Note that the line with tf = 4 is above the line with tf = 3, which is above the line with tf = 2, and so on. The higher lines have larger intercepts and larger slopes than the lower lines. That is, when we fit λ ≈ a(tf) + b(tf)·idf, with separate regression coefficients, a(tf) and b(tf), for each value of tf, we find that both a(tf) and b(tf) increase with tf.
Instead of multiple regression, though, we choose a more empirical approach. Parametric assumptions, when appropriate, can be very powerful (better estimates from less training data), but errors resulting from inappropriate assumptions can outweigh the benefits. In this empirical investigation of term weighting we decided to use conservative non-parametric histogram methods to hedge against the risk of inappropriate parametric assumptions.

Terms are assigned to bins based on features such as idf, as illustrated in table 2. (Later we will also use B and/or ef in the binning process.) λ is computed separately for each bin, based on the use of terms in relevant and irrelevant documents, according to the labeled training material.

The estimation method starts with a training file which indicates, among other things, the number of relevant and irrelevant documents for each term t in each training query, q. That is, for each t and q, we are given $df(t, rel, tf_0)$ and $df(t, \overline{rel}, tf_0)$, where $df(t, rel, tf_0)$ is the number of relevant documents d with $tf(t, d) = tf_0$, and $df(t, \overline{rel}, tf_0)$ is the number of irrelevant documents d with $tf(t, d) = tf_0$. The schema for the training file is described in table 3. From these training observations we wish to obtain a mapping from bins to λs that can be applied to unseen test material.

Field  Description (function of term t)
1      $df(t, rel, 0)$ ≡ # rel docs d with tf(t, d) = 0
2      $df(t, rel, 1)$ ≡ # rel docs d with tf(t, d) = 1
3      $df(t, rel, 2)$ ≡ # rel docs d with tf(t, d) = 2
4      $df(t, rel, 3)$ ≡ # rel docs d with tf(t, d) = 3
5      $df(t, rel, 4+)$ ≡ # rel docs d with tf(t, d) ≥ 4
6      $df(t, \overline{rel}, 0)$ ≡ # irrelevant docs d with tf(t, d) = 0
7      $df(t, \overline{rel}, 1)$ ≡ # irrelevant docs d with tf(t, d) = 1
8      $df(t, \overline{rel}, 2)$ ≡ # irrelevant docs d with tf(t, d) = 2
9      $df(t, \overline{rel}, 3)$ ≡ # irrelevant docs d with tf(t, d) = 3
10     $df(t, \overline{rel}, 4+)$ ≡ # irrelevant docs d with tf(t, d) ≥ 4
11     # rel docs d
12     # irrelevant docs d
13     freq of term in corpus: $TF(t) = \sum_d tf(t, d)$
14     # docs d in collection = N
15     df = # docs d with tf(t, d) ≥ 1
20     ef = # docs d in query exp. with tf(t, d) ≥ 1
21     where: D (description), E (query expansion)
25     burstiness: B
Table 3: Training file schema: a record of 25 fields is computed for each term (ngram) in each query in the training set.

We interpret λ as a log likelihood ratio:

$$\lambda(bin, tf) = \log_2 \frac{P(bin, tf \mid rel)}{P(bin, tf \mid \overline{rel})}$$

where the numerator can be approximated as:

$$P(bin, tf \mid rel) \approx \frac{df(bin, rel, tf)}{\hat{N}_{rel}}$$

where $df(bin, rel, tf)$ is

$$df(bin, rel, tf) \approx \frac{1}{|bin|} \sum_{t \in bin} df(t, rel, tf)$$

Similarly, the denominator can be approximated as:

$$P(bin, tf \mid \overline{rel}) \approx \frac{df(bin, \overline{rel}, tf)}{\hat{N}_{\overline{rel}}}$$

where $df(bin, \overline{rel}, tf)$ is

$$df(bin, \overline{rel}, tf) \approx \frac{1}{|bin|} \sum_{t \in bin} df(t, \overline{rel}, tf)$$

$\hat{N}_{rel}$ is an estimate of the total number of relevant documents. Since some queries have more relevant documents than others, $\hat{N}_{rel}$ is computed by averaging:

$$\hat{N}_{rel} = \frac{1}{|bin|} \sum_{t \in bin} N_{rel}$$

To ensure that $\hat{N}_{rel} + \hat{N}_{\overline{rel}} = N$, where N is the number of documents in the collection, we define

$$\hat{N}_{\overline{rel}} = N - \hat{N}_{rel}$$
This estimation procedure is implemented with the simple awk program in figure 2. The awk program reads each line of the training file, which contains a line for each term in each training query. As described in table 3, each training line contains 25 fields. The first five fields contain $df(t, rel, tf)$ for five values of tf, and the next five fields contain $df(t, \overline{rel}, tf)$ for the same five values of tf. The next two fields contain $N_{rel}$ and $N_{\overline{rel}}$. As the awk program reads each of these lines from the training file, it assigns each term in each training query to a bin (based on ⌊log2(df)⌋, except when df < 100), and maintains running sums of the first dozen fields which are used for computing $df(bin, rel, tf)$, $df(bin, \overline{rel}, tf)$, $\hat{N}_{rel}$ and $\hat{N}_{\overline{rel}}$ for five values of tf. Finally, after reading all the training material, the program outputs the table of λs shown in table 2. The table contains a column for each of the five tf values and a row for each of the dozen idf bins. Later, we will consider more interesting binning rules that make use of additional statistics such as burstiness and query expansion.
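For readers who prefer Python to awk, the histogram estimate described above can be sketched as follows. This is a minimal illustration, not the program used in the experiments: the record layout (a dict with hypothetical keys `rel`, `irrel`, `n_rel` and `df`) and the toy counts are assumptions made for the example.

```python
import math
from collections import defaultdict

N_TF = 5  # tf values 0, 1, 2, 3 and 4+


def bin_of(df: int) -> int:
    """Binning rule from the text: floor(log2(df)), except bin 0 when df < 100."""
    return 0 if df < 100 else int(math.log2(df))


def estimate_lambda(records, n_docs):
    """Return {bin: {"idf": mean-df idf, "lambda": [5 values or None]}}."""
    sums = defaultdict(lambda: {"rel": [0.0] * N_TF, "irrel": [0.0] * N_TF,
                                "n_rel": 0.0, "df": 0.0, "count": 0})
    for r in records:  # one record per (term, training query) pair
        s = sums[bin_of(r["df"])]
        for i in range(N_TF):
            s["rel"][i] += r["rel"][i]
            s["irrel"][i] += r["irrel"][i]
        s["n_rel"] += r["n_rel"]
        s["df"] += r["df"]
        s["count"] += 1

    table = {}
    for b, s in sums.items():
        k = s["count"]
        n_rel = s["n_rel"] / k        # average number of relevant documents
        n_irrel = n_docs - n_rel      # so that the two estimates sum to N
        row = []
        for i in range(N_TF):
            p_rel = (s["rel"][i] / k) / n_rel if n_rel > 0 else 0.0
            p_irrel = (s["irrel"][i] / k) / n_irrel if n_irrel > 0 else 0.0
            ok = p_rel > 0 and p_irrel > 0
            row.append(math.log2(p_rel / p_irrel) if ok else None)
        table[b] = {"idf": -math.log2((s["df"] / k) / n_docs), "lambda": row}
    return table


# One hypothetical training record: a term with df = 200 in a 1000-document collection.
toy = [{"rel": [2, 4, 2, 1, 1], "irrel": [798, 100, 50, 30, 12], "n_rel": 10, "df": 200}]
table = estimate_lambda(toy, n_docs=1000)
```

With this single record, the term falls in bin 7, and the weight for tf = 0 comes out negative while the weight for tf = 1 is positive, which mirrors the sign pattern of table 2.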
2.1 Interpolating Between Bins 
Recall that the task is to apply the λs to new unseen test data. One could simply use the λs in table 2 as is. That is, when we see a new term in the test material, we find the closest bin in table 2 and report the corresponding λ value. But since the idf of a term in the test set could easily fall between two bins, it seems preferable to find the two closest bins and interpolate between them.
awk 'function log2(x) {
  return log(x)/log(2) }
# use only terms from the description part of the query
$21 ~ /^D/ { N = $14; df = $15;
  # binning rule
  if (df < 100) {bin = 0}
  else {bin = int(log2(df))};
  docfreq[bin] += df;
  Nbin[bin]++;
  # average df(t,rel,tf), df(t,irrel,tf)
  for (i=1; i<=12; i++) n[i,bin] += $i }
END {for (bin in Nbin) {
  nbin = Nbin[bin]
  Nrel = n[11,bin]/nbin
  Nirrel = N-Nrel
  idf = -log2((docfreq[bin]/nbin)/N)
  printf("%6.2f ", idf)
  for (i=1; i<=5; i++) {
    if (Nrel == 0) prel = 0
    else prel = (n[i,bin]/nbin)/Nrel
    if (Nirrel == 0) pirrel = 0
    else pirrel = (n[i+5,bin]/nbin)/Nirrel
    if (prel <= 0 || pirrel <= 0) {
      printf "%6s ", "NA" }
    else {
      printf "%6.2f ", log2(prel/pirrel)} }
  print ""}}'
Figure 2: awk program for computing λs.
We use linear regression to interpolate along the idf dimension, as illustrated in table 4. Table 4 is a smoothed version of table 2 where λ ≈ a + b·idf. There are five pairs of coefficients, a and b, one for each value of tf.

Note that interpolation is generally not necessary on the tf dimension because tf is highly quantized. As long as tf < 4, which it usually is, the closest bin is an exact match. Even when tf ≥ 4, there is very little room for adjustments if we accept the upper limit of λ ≤ idf.
Although we interpolate along the idf dimen- 
sion, interpolation is not all that important along 
that dimension either. Figure 1 shows that the 
differences between the test data and the train- 
ing data dominate the issues that interpolation is 
attempting to deal with. The main advantage of regression is computational convenience; it is easier to compute a + b·idf than to perform a binary search to find the closest bin.
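As an illustration of interpolating along the idf dimension, the sketch below fits an ordinary least-squares line to the tf = 1 column of table 2. The helper `fit_line` is a hypothetical name, and the unweighted fit to bin means is an assumption of this example; it is not claimed to reproduce the coefficients reported in table 4, which may have been fit differently.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Mean idf of each bin and the empirical weight for tf = 1, copied from table 2.
IDF = [12.89, 10.87, 9.79, 8.96, 7.75, 6.82, 5.78, 4.74, 3.85, 2.85, 1.78, 0.88]
LAMBDA_TF1 = [9.73, 8.00, 7.36, 6.26, 4.62, 3.94, 3.16, 2.46, 1.58, 1.00, -0.06, 0.17]

a, b = fit_line(IDF, LAMBDA_TF1)


def interpolate(idf_value):
    """Smoothed weight for a term with tf = 1 and the given idf."""
    return a + b * idf_value
```

Once a and b are known, a weight for any idf is a single multiply-add, which is the computational convenience mentioned above.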
Previous work (Cooper et al., 1994) used multiple regression techniques. Although our performance is similar (until we include query expansion) we believe that it is safer and easier to treat each value of tf as a separate regression for reasons discussed in table 5. In so doing, we are basically restricting the regression analysis to such an extent that it is unlikely to do much harm (or much good). Imposing the limits of 0 ≤ λ ≤ idf also serves the purpose of preventing the regression from wandering too far astray.
tf       a      b
0    -0.95   0.05
1    -0.98   0.69
2    -0.15   0.78
3     0.53   0.81
4+    1.32   0.77
Table 4: Regression coefficients for method fit-G. This table approximates the data in table 2 with λ ≈ a(tf) + b(tf)·idf. Note that both the intercepts, a(tf), and the slopes, b(tf), increase with tf (with a minor exception for b(4+)).
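The fit-G weight, including the limits 0 ≤ λ ≤ idf, can be sketched as below using the coefficients of table 4. The function name `fit_g_weight` is an assumption made for this illustration.

```python
# (a, b) for tf = 0, 1, 2, 3 and 4+, copied from table 4.
FIT_G = {
    0: (-0.95, 0.05),
    1: (-0.98, 0.69),
    2: (-0.15, 0.78),
    3: (0.53, 0.81),
    4: (1.32, 0.77),
}


def fit_g_weight(tf: int, idf: float) -> float:
    """lambda = a(tf) + b(tf) * idf, restricted to the range [0, idf]."""
    a, b = FIT_G[min(tf, 4)]          # tf values of 4 or more share one bin
    raw = a + b * idf
    return max(0.0, min(raw, idf))    # lower limit 0, upper limit idf


w_typical = fit_g_weight(1, 10.0)   # -0.98 + 0.69 * 10 = 5.92, inside the limits
w_floor = fit_g_weight(0, 10.0)     # raw value -0.45 is raised to 0
w_ceiling = fit_g_weight(4, 2.0)    # raw value 2.86 is lowered to idf = 2
```

The limits rarely bind for mid-range idf values; they matter mostly for absent terms (tf = 0) and for very common terms with high tf.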
tf    a(tf)   b(tf)    a2 + c2·log(1 + tf)     b2
0     -0.95    0.05          -4.1             0.66
1     -0.98    0.69          -1.4             0.66
2     -0.15    0.78           0.18            0.66
3      0.53    0.81           1.3             0.66
4      1.32    0.77           2.2             0.66
5      1.32    0.77           2.9             0.66
Table 5: A comparison of the regression coefficients for method fit-G with comparable coefficients from the multiple regression: λ = a2 + b2·idf + c2·log(1 + tf) where a2 = -4.1, b2 = 0.66 and c2 = 3.9. The differences in the two fits are particularly large when tf = 0; note that b(0) is negligible (0.05) and b2 is quite large (0.66). Reducing the number of parameters from 10 to 3 in this way increases the sum of square errors, which may or may not result in a large degradation in precision and recall. Why take the chance?
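The right-hand columns of table 5 can be checked with a short calculation. The sketch below assumes that log denotes the natural logarithm, which is consistent with the tabulated values; the function names are hypothetical.

```python
import math

# Per-tf coefficients (a, b) from the left-hand columns of table 5.
PER_TF = {0: (-0.95, 0.05), 1: (-0.98, 0.69), 2: (-0.15, 0.78),
          3: (0.53, 0.81), 4: (1.32, 0.77)}

# Coefficients of the multiple regression quoted in the caption of table 5.
A2, B2, C2 = -4.1, 0.66, 3.9


def per_tf_weight(tf: int, idf: float) -> float:
    """Separate regression for each tf value (fit-G), without limits."""
    a, b = PER_TF[min(tf, 4)]
    return a + b * idf


def multiple_regression_weight(tf: int, idf: float) -> float:
    """Single regression on idf and log(1 + tf), natural log assumed."""
    return A2 + B2 * idf + C2 * math.log(1 + tf)


# Effective intercepts a2 + c2 * log(1 + tf) for tf = 0..5.
intercepts = [round(A2 + C2 * math.log(1 + tf), 2) for tf in range(6)]

# The two fits disagree most for absent terms (tf = 0), here at idf = 10.
gap_at_tf0 = multiple_regression_weight(0, 10.0) - per_tf_weight(0, 10.0)
```

At tf = 0 and idf = 10 the multiple regression assigns 2.5 while the per-tf fit assigns -0.45, which illustrates why the caption singles out tf = 0.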
3 Burstiness 
Table 6 is like table 4 but the binning rule not only uses idf, but also burstiness (B). Burstiness (Church and Gale, 1995) (Katz, 1996) (Church, 2000) is intended to account for the fact that some very good keywords such as "Kennedy" tend to be mentioned quite a few times in a document or not at all, whereas less good keywords such as "except" tend to be mentioned about the same number of times no matter what the document is about. Since "Kennedy" and "except" have similar idf values, they would normally receive similar term weights, which doesn't seem right. Kwok (1996) suggested average term frequency, avtf = TF(t)/df(t), be used as a tie-breaker for cases like this, where $TF(t) = \sum_d tf(t, d)$ is the standard notion of frequency in corpus-based NLP. Table 6 shows how Kwok's suggestion can be reformulated in our empirical framework. The table shows the slopes and intercepts for ten regressions, one for each combination of tf and B (B = 1 iff avtf is large. That is, B = 1 iff TF(t)/df(t) > 1.83 - 0.048·idf).

          B = 0            B = 1
tf       a      b         a      b
0    -0.05  -0.00     -0.61   0.02
1    -1.23   0.63     -0.80   0.79
2    -0.76   0.71     -0.05   0.79
3     0.00   0.69      0.23   0.82
4+    0.68   0.71      0.75   0.83
Table 6: Regression coefficients for method fit-B. Note that the slopes and intercepts are larger when B = 1 than when B = 0 (except when tf = 0). Even though λ usually lies between 0 and idf, we restrict λ to 0 ≤ λ ≤ idf, just to make sure.

            where = D         where = E
tf   ef      a      b          a      b
1    0    -1.57   0.37
2    0    -3.41   0.82
3    0    -1.30   0.11
4+   0     0.40   0.06
1    1    -1.84   0.87      -2.64   0.68
2    1    -2.12   1.10      -2.70   0.71
3    1    -0.66   0.95      -2.98   0.74
4+   1     0.84   0.98      -3.35   0.78
1    2    -1.87   0.92      -3.00   0.86
2    2    -1.77   1.12      -2.78   0.85
3    2    -1.72   1.10      -3.07   0.93
4+   2    -3.06   1.71*     -3.25   0.79
1    3    -2.52   0.95      -2.71   0.91
2    3    -1.81   1.02      -2.28   0.88
3    3     0.45   0.85      -2.63   0.97
4+   3     0.38   1.22      -3.66   1.14
Table 7: Many of the regression coefficients for method fit-E. (The coefficients marked with an asterisk are worrisome because the bins are too small and/or the slopes fall well outside the normal range of 0 to 1.) The slopes rarely exceeded .8 in previous models (fit-G and fit-B), whereas fit-E has more slopes closer to 1. The larger slopes are associated with robust conditions, e.g., terms appearing in the query (where = D), the document (tf ≥ 1) and the expansion (ef ≥ 1). If a term appears in several documents brought in by query expansion (ef ≥ 2), then the slope can be large even if the term is not explicitly mentioned in the query (where = E). The interactions among tf, idf, ef and where are complicated and not easily captured with a straightforward multiple regression.
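The burstiness rule of this section, B = 1 iff TF(t)/df(t) > 1.83 - 0.048·idf, can be sketched as follows. The threshold constants are the ones quoted in the text and are presumably specific to the collection used here; the function name and the toy counts are assumptions made for the example.

```python
import math


def is_bursty(total_freq: int, doc_freq: int, n_docs: int) -> bool:
    """B(t) = 1 iff average term frequency exceeds an idf-dependent threshold."""
    idf = -math.log2(doc_freq / n_docs)
    avtf = total_freq / doc_freq              # Kwok's average term frequency
    return avtf > 1.83 - 0.048 * idf


# Two hypothetical terms with the same df (hence the same idf = 8) but
# different corpus frequencies; the threshold is 1.83 - 0.048 * 8 = 1.446.
N_DOCS = 1024
keyword_like = is_bursty(total_freq=12, doc_freq=4, n_docs=N_DOCS)   # avtf = 3.0
function_like = is_bursty(total_freq=5, doc_freq=4, n_docs=N_DOCS)   # avtf = 1.25
```

Two terms with identical idf thus land in different bins, which is exactly the tie-breaking role suggested for average term frequency.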
4 Query Expansion 
We applied query expansion (Buckley et al., 1995) 
to generate an expanded part of the query. The 
original query is referred to as the description (D) 
and the new part is referred to as the expansion 
(E). (Queries also contain a narrative (N) part that 
is not used in the experiments below so that our 
results could be compared to previously published 
results.) 
The expansion is formed by applying a baseline query engine (fit-B model) to the description part of the query. Terms that appear in the top k = 10 retrieved documents are assigned to the E portion of the query (where(t) = E), unless they were previously assigned to some other portion of the query (e.g., where(t) = D). All terms, t, no matter where they appear in the query, also receive an expansion frequency ef, an integer from 0 to k = 10 indicating how many of the top k documents contain t.
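The bookkeeping for ef and where described above can be sketched as below. The sketch assumes the baseline engine has already returned a ranked list of documents, each represented as a dict from term to tf; that representation and the function names are assumptions of this example.

```python
def expansion_frequency(ranked_docs, k=10):
    """ef(t) = number of the top k retrieved documents that contain t."""
    ef = {}
    for doc_tf in ranked_docs[:k]:
        for term, tf in doc_tf.items():
            if tf >= 1:
                ef[term] = ef.get(term, 0) + 1
    return ef


def assign_where(description_terms, ef):
    """Terms from the description keep where = D; new terms get where = E."""
    where = {t: "D" for t in description_terms}
    for t in ef:
        where.setdefault(t, "E")   # do not overwrite an earlier assignment
    return where


# Toy example with three retrieved documents (hypothetical terms and counts).
top_docs = [{"a": 2, "b": 1}, {"a": 1}, {"c": 3}]
description = ["a", "d"]

ef = expansion_frequency(top_docs, k=10)
where = assign_where(description, ef)
ef_of_d = ef.get("d", 0)   # a query term absent from the top documents has ef = 0
```

Note that a term with where = E necessarily has ef ≥ 1 under this construction, whereas a description term can have ef = 0.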
The fit-E model is: λ = a(tf, where, ef) + b(tf, where, ef)·idf, where the regression coefficients, a and b, not only depend on tf as in fit-G, but also depend on where the term appears in the query and expansion frequency ef. We consider 5 values of tf, 2 values of where (D and E) and 6 values of ef (0, 1, 2, 3, 4 or more). 32 of these 60 pairs of coefficients are shown in table 7. As before, most of the slopes are between 0 and 1. λ is usually between 0 and idf, but we restrict λ to 0 ≤ λ ≤ idf, just to make sure.
In tables 4-7, the slopes usually lie between 0 and 1. In the previous models, fit-B and fit-G, the largest slopes were about 0.8, whereas in fit-E, the slope can be much closer to 1. The larger slopes are associated with very robust conditions, e.g., terms mentioned explicitly in all three areas of interest: (1) the query (where = D), (2) the document (tf ≥ 1) and (3) the expansion (ef ≥ 1). Under such robust conditions, we would expect to find very little shrinking (downweighting to compensate for uncertainty).

On the other hand, when the term is not mentioned in one of these areas, there can be quite a bit of shrinking. Table 7 shows that the slopes are generally much smaller when the term is not in the query (where = E) or when the term is not in the expansion (ef = 0). However, there are some exceptions. The bottom right corner of table 7 contains some large slopes even though these terms are not mentioned explicitly in the query (where = E). The mitigating factor in this case is the large ef. If a term is mentioned in several documents in the expansion (ef ≥ 2), then it is not as essential that it be mentioned explicitly in the query.

With this model, as with fit-G and fit-B, λ tends to increase monotonically with tf and idf, though there are some interesting exceptions. When the term appears in the query (where = D) but not in the expansion (ef = 0), the slopes are quite small (e.g., b(3, D, 0) = 0.11), and the slopes actually decrease as tf increases (b(2, D, 0) = 0.83 > b(3, D, 0) = 0.11). We normally expect to see slopes of .7 or more when tf ≥ 3, but in this case (b(3, D, 0) = 0.11), there is considerable shrinking because we very much expected to see the term in the expansion and we didn't.
As we have seen, the interactions among tf, idf, ef and where are complicated and probably depend on many factors such as language, collection, typical query patterns and so on. To cope with such complications, we believe that it is safer to use histogram methods than to try to account for all of these interactions at once in a single multiple regression. The next section will show that fit-E has very encouraging performance.

5 Experiments
Two measures of performance are reported: (1) 11 point average precision and (2) R, precision after retrieving $N_{rel}$ documents, where $N_{rel}$ is the number of relevant documents. We used the "short query" condition of the NACSIS NTCIR-1 Test Collection (Kando et al., 1999) which consists of about 300,000 documents in Japanese, plus about 30 queries with labeled relevance judgements for training and 53 queries with relevance judgements for testing. The results for the "short query" condition are shown on page 25 of (Kando et al., 1999), which shows that "short query" is hard for statistical methods.

Two previously published systems are included in the tables below: JCB1 and BKJJBIDS. JCB1, submitted by Just System, a company with a commercially successful product for Japanese word-processing, produced the best results using sophisticated (and proprietary) natural language processing techniques (Fujita, 1999). BKJJBIDS used Berkeley's logistic regression methods (with about half a dozen variables) to fit term weights to the labeled training material.

Table 8 shows that training often helps. The methods above the line (with the possible exception of JCB1) use training; the methods below the line do not. Fit-E has very respectable performance, nearly up to the level of JCB1, not bad for a purely statistical method.

filter    trained on       sys.               11      R
NA        ?                JCB1             .360   .351
2+, E1    tf, where, ef    fit-E            .354   .363
2         B, tf            fit-B            .283   .293
2, K      tf + ...         BKJJBIDS         .272   .282
2, K      B, tf            fit-B            .264   .282
2, K      tf               fit-G            .257   .267
--------------------------------------------------------
2, K      none             log(1 + tf)·idf  .249   .262
2, K      none             tf·idf           .112   .138
Table 8: Training helps: methods above the line use training (with the possible exception of JCB1); methods below the line do not.

The performance of fit-B is close to that of BKJJBIDS. For comparison's sake, fit-B is shown both with and without the K filter. The K filter restricts terms to sequences of Katakana and Kanji characters. BKJJBIDS uses a similar heuristic to eliminate Japanese function words. Although the K filter does not change performance very much, the use of this filter changes the relative order of fit-B and BKJJBIDS. These results suggest that the K filter is slightly unhelpful.

A number of filters have been considered (table 9). Results vary somewhat depending on these choices, though not too much, which is fortunate, since we don't understand stop lists very well. To the extent that there is a pattern, we suspect that words are slightly better than bigrams, and that the E filter is slightly better than the B filter which is slightly better than the K filter. Table 10 shows that the best filters (Ek) improve the performance of the best method (fit-E) to nearly the level of JCB1.

• 2: restrict terms to bigrams explicitly mentioned in query (where = D)
• 2+: restrict terms to bigrams, but include where = E as well as where = D
• W: restrict terms to words, as identified by Chasen (Matsumoto et al., 1997)
• K: restrict terms to sequences of Katakana and/or Kanji characters
• B: restrict terms to bursty (B = 1) terms
• Ek: require terms to appear in more than k docs brought in by query expansion (ef(t) > k)
Table 9: Filters: results vary somewhat depending on these choices, though not too much, which is fortunate, since we don't understand stop lists very well.

filter    trained on       sys.      11      R
2+, E1    tf, where, ef    fit-E   .354   .363
2+, E2    tf, where, ef    fit-E   .350   .359
2+, E4    tf, where, ef    fit-E   .333   .341
2+        tf, where, ef    fit-E   .332   .366
NA        NA               JCB1    .360   .351
Table 10: The best filters (Ek) improve the performance of the best method (fit-E) to nearly the level of JCB1.
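The two performance measures reported in this section can be sketched as follows. The paper does not spell out its exact definitions, so this example assumes the commonly used interpolated form of 11 point average precision (at each recall level 0.0, 0.1, ..., 1.0, take the best precision achieved at that recall or higher); the function names and the toy ranking are likewise assumptions.

```python
def r_precision(ranked, relevant):
    """Precision after retrieving as many documents as there are relevant ones."""
    n_rel = len(relevant)
    if n_rel == 0:
        return 0.0
    hits = sum(1 for d in ranked[:n_rel] if d in relevant)
    return hits / n_rel


def eleven_point_average_precision(ranked, relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    if not relevant:
        return 0.0
    points = []   # (recall, precision) measured at each relevant document
    hits = 0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolation: best precision at any recall at or above this level.
        reachable = [p for r, p in points if r >= level - 1e-12]
        total += max(reachable, default=0.0)
    return total / 11


# Toy ranking: relevant documents are found at ranks 1 and 4.
ranking = ["d1", "d2", "d3", "d4", "d5"]
judged_relevant = {"d1", "d4"}

r_prec = r_precision(ranking, judged_relevant)
avg_prec = eleven_point_average_precision(ranking, judged_relevant)
```

In the toy ranking, one of the top two documents is relevant, so R is 0.5; the interpolated precision is 1.0 for recall levels up to 0.5 and 0.5 above that, giving an 11 point average of 8.5/11.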
filter   sys.    UL   LL     11      R
2        fit-B   +    +    .283   .293
2        fit-B   +    -    .280   .296
2        fit-B   -    +    .280   .296
2        fit-B   -    -    .275   .288
2        fit-G   +    +    .266   .279
2        fit-G   -    +    .251   .268
2        fit-G   +    -    .248   .259
2        fit-G   -    -    .232   .249
Table 11: Limits do no harm: two limits are slightly better than one, and one is slightly better than none. (UL = upper limit of λ ≤ idf; LL = lower limit of 0 ≤ λ)
The final experiment (table 11) shows that restricting λ to 0 ≤ λ ≤ idf improves performance slightly. The combination of both the upper limit and the lower limit is slightly better than just one limit which is better than none. We view limits as a robustness device. Hopefully, they won't have to do much but every once in a while they prevent the system from wandering far astray.
6 Conclusions 
This paper introduced an empirical histogram-based supervised learning method for estimating term weights, λ. Terms are assigned to bins based on features such as inverse document frequency, burstiness and expansion frequency. A different λ is estimated for each bin and each tf by counting the number of relevant and irrelevant documents associated with the bin and tf value. Regression techniques are used to interpolate between bins, but care is taken so that the regression cannot do too much harm (or too much good). Three variations were considered: fit-G, fit-B and fit-E. The performance of query expansion (fit-E) is particularly encouraging. Using simple purely statistical methods, fit-E is nearly comparable to JCB1, a sophisticated natural language processing system developed by Just System, a leader in the Japanese word processing industry.

In addition to performance, we are also interested in the interpretation of the weights. Empirical weights tend to lie between 0 and idf. We find these limits to be a surprise given that standard term weighting formulas such as tf·idf generally do not conform to these limits. In addition, we find that λ generally grows linearly with idf, and that the slope is between 0 and 1. We interpret the slope as a statistical shrink. The larger slopes are associated with very robust conditions, e.g., terms mentioned explicitly in all three areas of interest: (1) the query (where = D), (2) the document (tf ≥ 1) and (3) the expansion (ef ≥ 1). There is generally more shrinking for terms brought in by query expansion (where = E), but if a term is mentioned in several documents in the expansion (ef ≥ 2), then it is not as essential that the term be mentioned explicitly in the query. The interactions among tf, idf, where, B, ef, etc., are complicated, and therefore, we have found it safer and easier to use histogram methods than to try to account for all of the interactions at once in a single multiple regression.
Acknowledgement
The authors thank Prof. Mitchell P. Marcus of the University of Pennsylvania for the valuable discussion about noise reduction in the context of information retrieval. This research is supported by Sumitomo Electric.

References 
Chris Buckley, Gerard Salton, James Allan, and Amit Singhal. 1995. Automatic query expansion using SMART: TREC 3. In The Third Text REtrieval Conference (TREC-3), pages 69-80.
Aitao Chen, Fredric C. Gey, Kazuaki Kishida, Hailing 
Jiang, and Qun Liang. 1999. Comparing multiple 
methods for japanese and japanese-english text re- 
trieval. In NTCIR Workshop 1, pages 49-58, Tokyo 
Japan, Sep. 
Kenneth W. Church and William A. Gale. 1995. 
Poisson mixture. Natural Language Engineering, 
1(2):163-190. 
Kenneth W. Church. 2000. Empirical estimates of 
adaptation: The chance of two noriegas is closer 
to p/2 than p2. In Coling-2000, pages 180-186. 
William S. Cooper, Aitao Chen, and Fredric C. Gey. 1994. Full text retrieval based on probabilistic equation with coefficients fitted by logistic regressions. In The Second Text REtrieval Conference (TREC-2), pages 57-66.
Sumio Fujita. 1999. Notes on phrasal indexing: Jscb evaluation experiments at ntcir ad hoc. In NTCIR Workshop 1, pages 101-108, http://www.rd.nacsis.ac.jp/~ntcadm/, Sep.
Noriko Kando, Kazuko Kuriyama, Toshihiko Nozue, Koji Eguchi, Hiroyuki Kato, and Souichiro Hidaka. 1999. Overview of ir tasks at the first ntcir workshop. In NTCIR Workshop 1, pages 11-44, http://www.rd.nacsis.ac.jp/~ntcadm/, Sep.
Slava M. Katz. 1996. Distribution of content words and phrases in text and language modelling. Natural Language Engineering, 2(1):15-59.
K. L. Kwok. 1996. A new method of weighting query 
terms for ad-hoc retrieval. In SIGIR96, pages 187- 
195, Zurich, Switzerland. 
Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, 
Yoshitaka Hirano, Osamu Imaichi, and Tomoaki 
Imamura. 1997. Japanese morphological analysis 
system chasen manual. Technical Report NAIST- 
IS-TR97007, NAIST, Nara, Japan, Feb. 
