<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1315">
  <Title>Empirical Term Weighting and Expansion Frequency</Title>
  <Section position="4" start_page="117" end_page="119" type="metho">
    <SectionTitle>
2 Supervised Training
</SectionTitle>
    <Paragraph position="0"> The statistical task is to compute A, our best estimate of A, based on a training set. This paper will use supervised methods where the training materials not only include a large number of documents but also a few queries labeled with relevance judgements. null To make the training task more manageable, it is common practice to map the space of all terms into a lower dimensional feature space. In other words, instead of estimating a different A for each term in the vocabulary, we can model A as a function of tf and idf and various other features of  values in previous table. Most points fall between the dashed lines (lower limit of A = 0 and upper limit of A = idf). The plotting character denotes tf. Note that the line with tf = 4 is above the line with tf = 3, which is above the line with tf = 2, and so on. The higher lines have larger intercepts and larger slopes than the lower lines. That is, when we fit A ,~, a(tf) + b(tf), idf, with separate regression coefficients, a(tf) and b(tf), for each value of t f, we find that both a(tf) and b(tf) increase with t\].</Paragraph>
    <Paragraph position="1"> terms. In this way, all of the terms in a bin are assigned the weight, A. The common practice, for example, of assigning tf * idf weights can be interpreted as grouping all terms with the same idf into a bin and assigning them all the same weight, namely tf. idf. Cooper and his colleagues at Berkeley (Cooper et al., 1994) (Chen et al., 1999) have been using regression methods to fit as a linear combination of idf , log(t f) and various other features. This method is also grouping terms into bins based on their features and assigning similar weights to terms with similar features. In general, term weighting methods that are fit to data are more flexible than weighting methods that are not fit to data. We believe this additional flexibility improves precision and recall (table 8). Instead of multiple regression, though, we choose a more empirical approach. Parametric as- null Description (function of term t) df(t, rel,O) _-- # tel does d with tf(t,d) = 0 dr(t, tel, 1) _= # rel does d with tf(t, d) = 1 dr(t, rel, 2) _= # rel does d with tf(t, d) = 2 df(t, rel,3) ~ # rel does d with tf(t,d) = 3 df(t, rel,4+) ~ # tel does d with tf(t,d) _&gt; dr(t, tel, O) ~ # tel does d with tf(t, d) = 0 dr(t, rel, 1) ~_ # tel does d with tf(t, d) = 1 dr(t, tel, 2) ~ # rel does d with tf(t, d) = 2 where dr(bin, rel, t f) is 1 dr(bin, tel, t f) ~ Ib/=l ~ df(t, rel,tf) tEbin Similarly, the denominator can be approximated as: dr(bin, tel, t \]) P(bin, tfl~) ~ log2</Paragraph>
    <Paragraph position="3"> freq of term in corpus: TF(t) = ~a tf(t, d) # does d in collection = N dff = # does d with tf(t, d) _&gt; 1 where dr(bin, tel, t f) is  is computed for each term (ngram) in each query in training set.</Paragraph>
    <Paragraph position="4"> sumptions, when appropriate, can be very powerful (better estimates from less training data), but errors resulting from inappropriate assumptions can outweigh the benefits. In this empirical investigation of term weighting we decided to use conservative non-parametric histogram methods to hedge against the risk of inappropriate parametric assumptions.</Paragraph>
    <Paragraph position="5"> Terms are assigned to bins based on features such as idf, as illustrated in table 2. (Later we will also use B and/or ef in the binning process.) is computed separately for each bin, based on the use of terms in relevant and irrelevant documents, according to the labeled training material.</Paragraph>
    <Paragraph position="6"> The estimation method starts with a training file which indicates, among other things, the number of relevant and irrelevant documents for each term t in each training query, q. That is, for each t and q, we are are given dr(t, rel, tfo) and dr(t, tel, tfo), where dr(t, tel, tfo) is the number of relevant documents d with tf(t, d) = tfo, and df(t, rel, tfo) is the number of irrelevant documents d with tf(t, d) = tfo. The schema for the training file is described in table 3. From these training observations we wish to obtain a mapping from bins to As that can be applied to unseen test material. We interpret )~ as a log likelihood ratio: , P(bin, tflrel) ~(bin, t /) = ~og2-z-::-~'\[bin, t/IN) where the numerator can be approximated as: ,.~ _ dr(bin, rel, t f) P(bin, triter) ~ togs Nrel vant documents than others, N~t is computed by averaging:  tEbin To ensure that Nr~l + ~&amp;quot;~/= N, where N is the number of documents in the collection, we define This estimation procedure is implemented with the simple awk program in figure 2. The awk program reads each line of the training file, which contains a line for each term in each training query. As described in table 3, each training line contains 25 fields. The first five fields contain dr(t, tel, t f) for five values of tf, and the next five fields contain df(t, rel, tf) for the same five values of tf. The next two fields contain N,a and N;-~. As the awk program reads each of these lines from the training file, it assigns each term in each training query to a bin (based on \[log2(df)\], except when df &lt; 100), and maintains running sums of the first dozen fields which are used for computing dr(bin, rel, t f), df(bin, re'---l, tf), l~rret and I~--~ for five values of tf. Finally, after reading all the training material, the program outputs the table of ks shown in table 2. The table contains a column for each of the five tf values and a row for each of the dozen idf bins. Later, we will consider more interesting binning rules that make use of additional statistics such as burstiness and query expansion.</Paragraph>
    <Section position="1" start_page="118" end_page="119" type="sub_section">
      <SectionTitle>
2.1 Interpolating Between Bins
</SectionTitle>
      <Paragraph position="0"> Recall that the task is to apply the ks to new unseen test data. One could simply use the ks in table 2 as is. That is, when we see a new term in the test material, we find the closest bin in table 2 and report the corresponding ~ value. But since the idf of a term in the test set could easily fall between two bins, it seems preferable to find the two closest bins and interpolate between them.</Paragraph>
      <Paragraph position="2"> We use linear regression to interpolate along the idf dimension, as illustrated in table 4. Table 4 is a smoothed version of table 2 where A ~ a + b.idf.</Paragraph>
      <Paragraph position="3"> There are five pairs of coefficients, a and b, one for each value of tf.</Paragraph>
      <Paragraph position="4"> Note that interpolation is generally not necessary on the tf dimension because tf is highly quantized. As long as tf &lt; 4, which it usually is, the closest bin is an exact match. Even when tff &gt; 4, there is very little room for adjustments if we accept the upper limit of A &lt; idf.</Paragraph>
      <Paragraph position="5"> Although we interpolate along the idf dimension, interpolation is not all that important along that dimension either. Figure 1 shows that the differences between the test data and the training data dominate the issues that interpolation is attempting to deal with. The main advantage of regression is computational convenience; it is easier to compute a + b. idf than to perform a binary search to find the closest bin.</Paragraph>
      <Paragraph position="6"> Previous work (Cooper et al., 1994) used multiple regression techniques. Although our performance is similar (until we include query expansion) we believe that it is safer and easier to treat each value of tf as a separate regression for reasons discussed in table 5. In so doing, we are basically restricting the regression analysis to such an extent that it is unlikely to do much harm (or much good). Imposing the limits of 0 &lt; A _&lt; idf also serves the purpose of preventing the regression from wandering too far astray.</Paragraph>
      <Paragraph position="7">  This table approximates the data in table 1 with ~ a(tf) + b(tf), idf. Note that both the intercepts, a(tf), and the slopes, b(tf), increase with tf (with a minor exception for b(4+)).</Paragraph>
      <Paragraph position="8">  cients for method fit-G with comparable coefficients from the multiple regression: A = a2 + b2 * idf + c2 * log(1 + t f) where a2 ---- -4.1, b2 = 0.66 and c2 = 3.9. The differences in the two fits are particularly large when tf = 0; note that b(0) is negligible (0.05) and b2 is quite large (0.66). Reducing the number of parameters from 10 to 3 in this way increases the sum of square errors, which may or may not result in a large degradation in precision and recall. Why take the chance?</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="119" end_page="119" type="metho">
    <SectionTitle>
3 Burstiness
</SectionTitle>
    <Paragraph position="0"> Table 6 is like tables 4 but the binning rule not only uses idf, but also burstiness (B). Burstiness (Church and Gale, 1995)(Katz, 1996)(Church, 2000) is intended to account for the fact that some very good keywords such as &amp;quot;Kennedy&amp;quot; tend to be mentioned quite a few times in a document or not at all, whereas less good keywords such as &amp;quot;except&amp;quot; tend to be mentioned about the same number of times no matter what the document  asterisk are worrisome because the bins are too small and/or the slopes fall well outside the normal range of 0 to 1.) The slopes rarely exceeded .8 is previous models (fit-G and fit-B), whereas fit-E has more slopes closer to 1. The larger slopes are associated with robust conditions, e.g., terms appearing in the query (where = D), the document (tf &gt; 1) and the expansion (el &gt; 1). If a term appears in several documents brought in by query * expansion (el &gt; 2), then the slope can be large even if the term is not explicitly mentioned in the query (where = E). The interactions among tf , idf, ef and where are complicated and not easily captured with a straightforward multiple regression. null is about. Since &amp;quot;Kennedy&amp;quot; and &amp;quot;except&amp;quot; have similar idf values, they would normally receive similar term weights, which doesn't seem right.</Paragraph>
    <Paragraph position="1"> Kwok (1996) suggested average term frequency, avtf = TF(t)/df(t), be used as a tie-breaker for cases like this, where TF(t) = ~a if(t, d) is the standard notion of frequency in the corpus-based NLP. Table 6 shows how Kwok's suggestion can be reformulated in our empirical framework. The table shows the slopes and intercepts for ten regressions, one for each combination of tf and B</Paragraph>
    <Paragraph position="3"/>
  </Section>
  <Section position="6" start_page="119" end_page="121" type="metho">
    <SectionTitle>
4 Query Expansion
</SectionTitle>
    <Paragraph position="0"> We applied query expansion (Buckley et al., 1995) to generate an expanded part of the query. The original query is referred to as the description (D) and the new part is referred to as the expansion (E). (Queries also contain a narrative (N) part that is not used in the experiments below so that our results could be compared to previously published results.) The expansion is formed by applying a base-line query engine (fit-B model) to the description part of the query. Terms that appear in the top k = 10 retrieved documents are assigned to the E portion of the query (where(t) = E), unless they were previously assigned to some other portion of the query (e.g., where(t) = D). All terms, t, no matter where they appear in the query, also receive an expansion frequency el, an integer from 0 to k = 10 indicating how many of the top k documents contain t.</Paragraph>
    <Paragraph position="1"> The fit-E model is: A = a(tf, where, ef) + b( t f , where, el) * i df , where the regression coefficients, a and b, not only depend on tf as in fit-G, but also depend on where the term appears in the query and expansion frequency el. We consider 5 values of t f, 2 values of where (D and E) and 6 values of ef (0, 1, 2, 3, 4 or more). 32 of these 60 pairs of coefficients are shown in table 7. As before, most of the slopes are between 0 and 1.</Paragraph>
    <Paragraph position="2"> is usually between 0 and idf, but we restrict A to 0 &lt; A &lt; idf, just to make sure.</Paragraph>
    <Paragraph position="3"> In tables 4-7, the slopes usually lie between 0 and 1. In the previous models, fit-B and fit-G, the largest slopes were about 0.8, whereas in fit-E, the slope can be much closer to 1. The larger slopes are associated with very robust conditions, e.g., terms mentioned explicitly in all three areas of interest: (1) the query (where = D), (2) the document (tf &gt; 1) and (3) the expansion (el &gt; 1).</Paragraph>
    <Paragraph position="4"> Under such robust conditions, we would expect to find very little shrinking (downweighting to compensate for uncertainty).</Paragraph>
    <Paragraph position="5"> On the other hand, when the term is not mentioned in one of these areas, there can be quite a bit of shrinking. Table 7 shows that the slopes are generally much smaller when the term is not in the query (where = E) or when the term is not in the expansion (el = 0). However, there are some exceptions. The bottom right corner of table 7 contains some large slopes even though these terms are not mentioned explicitly in the query (where = E). The mitigating factor in this case is the large el. If a term is mentioned in several documents in the expansion (el _&gt; 2), then it is not as essential that it be mentioned explicitly in the query.</Paragraph>
    <Paragraph position="6"> With this model, as with fit-G and fit-B, ~ tends to increase monotonically with tf and idf, though there are some interesting exceptions. When the term appears in the query (where = D) but not in the expansion (el = 0), the slopes are quite small (e.g., b(3,D,0) = 0.11), and the slopes actually decrease as tf increases (b(2, D, 0) = 0.83 &gt; b(3,D,0) = 0.11). We normally expect to see slopes of .7 or more when t.f &gt; 3, but in this case (b(3, D, 0) = 0.11), there is a considerable shrinking because we very much expected to see the term in the expansion and we didn't. ....</Paragraph>
    <Paragraph position="7"> As we have seen, the interactions among t f, idf, ef and where are complicated and probably de- null use training (with the possible exception of JCB1); methods below the line do not.</Paragraph>
    <Paragraph position="8"> pend on many factors such as language, collection, typical query patterns and so on. To cope with such complications, we believe that it is safer to use histogram methods than to try to account for all of these interactions at once in a single multiple regression. The next section will show that fit-E has very encouraging performance.</Paragraph>
  </Section>
class="xml-element"></Paper>