<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1315">
  <Title>Empirical Term Weighting and Expansion Frequency</Title>
  <Section position="3" start_page="0" end_page="117" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> An empirical method for estimating term weights directly from relevance judgements is proposed.</Paragraph>
    <Paragraph position="1"> The method is designed to make as few assumptions as possible. It is similar to Berkeley's use of regression (Cooper et al., 1994; Chen et al., 1999), where labeled relevance judgements are fit as a linear combination of (transforms of) tf, idf, etc., but it avoids potentially troublesome assumptions by introducing histogram methods. Terms are grouped into bins. Weights are computed based on the number of relevant and irrelevant documents associated with each bin.
Notation:
* t: a term
* d: a document
* tf(t, d): term frequency = # of instances of t in d
* df(t): document frequency = # of docs d with tf(t, d) &gt;= 1
* N: # of documents in the collection
* idf(t): inverse document frequency = -log2(df(t)/N)
* df(t, rel, tf0): # of relevant documents d with tf(t, d) = tf0
* df(t, irrel, tf0): # of irrelevant documents d with tf(t, d) = tf0
* ef(t): expansion frequency = # of docs d in query expansion with tf(t, d) &gt;= 1
* TF(t): standard notion of frequency in corpus-based NLP: TF(t) = Σ_d tf(t, d)
* B(t): burstiness: B(t) = 1 iff TF(t)/df(t) is large
The resulting weights usually lie between 0 and idf, which is a surprise; standard formulas like tf · idf would assign values well outside this range.</Paragraph>
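The binning procedure above can be sketched as follows. The equal-width idf bins, the add-one smoothing, and the log-likelihood-ratio estimator for λ are illustrative assumptions; the paper's exact estimator is not reproduced in this excerpt.

```python
import math
from collections import defaultdict

def estimate_lambdas(idf, rel_counts, irrel_counts, n_bins=10):
    """Histogram-style estimate of term weights (lambda).

    idf:          dict term -> idf(t)
    rel_counts:   dict (term, tf0) -> # relevant docs d with tf(t, d) = tf0
    irrel_counts: dict (term, tf0) -> # irrelevant docs d with tf(t, d) = tf0

    Terms are grouped into n_bins equal-width idf bins; lambda is then
    estimated separately for each (bin, tf) cell from the pooled
    relevant/irrelevant counts (a smoothed log ratio, by assumption).
    """
    lo, hi = min(idf.values()), max(idf.values())
    width = (hi - lo) / n_bins or 1.0
    bin_of = {t: min(int((v - lo) / width), n_bins - 1) for t, v in idf.items()}

    rel, irrel = defaultdict(float), defaultdict(float)
    for (t, tf0), n in rel_counts.items():
        rel[bin_of[t], tf0] += n
    for (t, tf0), n in irrel_counts.items():
        irrel[bin_of[t], tf0] += n

    # add-one smoothing keeps empty cells from blowing up the log ratio
    return {cell: math.log2((rel[cell] + 1.0) / (irrel[cell] + 1.0))
            for cell in set(rel) | set(irrel)}
```

Because each cell's weight comes straight from observed counts, no assumption about the functional form of λ(tf, idf) is imposed.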
    <Paragraph position="2"> The method extends naturally to include additional factors such as query expansion. Terms mentioned explicitly in the query receive much larger weights than terms brought in via query expansion. In addition, whether or not a term t is mentioned explicitly in the query, if t appears in documents brought in by query expansion (ef(t) &gt;= 1), then t will receive a much larger weight than it would have otherwise (ef(t) = 0). The interactions among these factors, however, are complicated and collection dependent. It is safer to use histogram methods than to impose unnecessary and potentially troublesome assumptions such as normality and independence.</Paragraph>
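Computing ef(t) per its definition is straightforward; this sketch assumes documents are represented simply as sets of terms (a hypothetical representation for illustration).

```python
def expansion_frequency(term, expansion_docs):
    """ef(t): number of documents brought in by query expansion
    that contain the term at least once (tf(t, d) >= 1).

    expansion_docs: iterable of documents, each a set of terms.
    """
    return sum(1 for doc in expansion_docs if term in doc)
```

The binary distinction that matters for weighting is then just `expansion_frequency(t, docs) >= 1` versus `== 0`.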
    <Paragraph position="3"> Under the vector space model, the score for a document d and a query q is computed by summing a contribution for each term t over an appropriate set of terms, T. T is often limited to terms shared by both the document and the query (minus stop words), though not always (e.g., query expansion).</Paragraph>
    <Paragraph position="4"> λ as a function of tf and idf: terms are assigned to bins based on idf. The column labeled idf is the mean idf for the terms in each bin. λ is estimated separately for each bin and each tf value, based on the labeled relevance judgements.</Paragraph>
    <Paragraph position="5"> score_tf·idf(d, q) = Σ_{t ∈ T} tf(t, d) · idf(t)
Under the probabilistic retrieval model, documents are scored by summing a similar contribution for each term t.</Paragraph>
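The vector-space score can be sketched directly from the formula above; the function name, the token-list document representation, and the restriction of T to shared terms are illustrative assumptions.

```python
import math
from collections import Counter

def tf_idf_score(doc_tokens, query_tokens, df, n_docs):
    """Vector-space score: sum over t in T of tf(t, d) * idf(t),
    with idf(t) = -log2(df(t) / N) and T the terms shared by
    the document and the query.

    df: dict term -> document frequency; n_docs: N, the collection size.
    """
    tf = Counter(doc_tokens)                 # tf(t, d) for this document
    shared = set(query_tokens) & set(tf)     # T, ignoring query expansion
    return sum(tf[t] * math.log2(n_docs / df[t]) for t in shared)
```

The probabilistic model replaces the tf · idf contribution with a per-term weight λ, which is what the empirical estimation below targets.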
    <Paragraph position="7"> In this work, we use λ to refer to term weights.</Paragraph>
    <Paragraph position="9"> This paper will start by showing how to estimate λ from relevance judgements. Three parameterizations will be considered: (1) fit-G, (2) fit-B, which introduces burstiness, and (3) fit-E, which introduces expansion frequency. The evaluation section shows that each model improves on the previous one. But in addition to performance, we are also interested in the interpretations of the parameters.</Paragraph>
  </Section>
</Paper>