<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1070"> <Title>The Importance of Proper Weighting Methods</Title> <Section position="3" start_page="349" end_page="350" type="metho"> <SectionTitle> 2. AD-HOC WEIGHTS </SectionTitle> <Paragraph position="0"> Document or query weights can be based on any number of factors; two would be statistical occurrence information and a history of how well this feature (or other similar features) has performed in the past. In many situations it is impossible to obtain history information, and thus initial weights are often based purely on statistical information. A major class of statistical weighting schemes is examined below, showing that there is an enormous performance range within the class. Then the process of adding additional features to a document or query representative is examined in the context of these weighting schemes. These issues are somewhat subtle and are often overlooked.</Paragraph> <Paragraph position="1"> 2.1. Tf * Idf Weights Over the past 25 years, one class of term weights has proven itself to be useful over a wide variety of collections. This is the class of tf*idf (term frequency times inverse document frequency) weights [1, 6, 7], which assigns weight w_ik to term T_k in document D_i in proportion to the frequency of occurrence of the term in D_i, and in inverse proportion to the number of documents to which the term is assigned. The weights in the document are then normalized by the length of the document, so that long documents are not automatically favored over short documents. While there have been some post-facto theoretical justifications for some of the tf*idf weight variants, the fact remains that they are used because they work well, rather than for any theoretical reason.</Paragraph> <Paragraph position="2"> Table 1 presents the evaluation results of running a number of tf*idf variants for query weighting against a number of variants for document weighting (the runs presented here are only a small subset of the variants actually run). All of these runs use the same set of features (single terms); the only differences are in the term weights. The exact variants used are not important; what is important is the range of results. Disregarding one extremely poor document weighting, the range of results is from 0.1057 to 0.2249. Thus a good choice of weights may gain a system over 100%. As points of comparison, the best official TREC run was 0.2171 (a system incorporating a very large amount of user knowledge to determine features) and the median TREC run in this category was 0.1595. The best run (DOCWT = lnc, QWT = ltc) is about 24% better than the most generally used tf*idf run (DOCWT = QWT = ntc).</Paragraph> <Paragraph position="3"> 24% is a substantial difference in performance, in a field where historically an improvement of 10% is considered quite good. The magnitude of performance improvement due to considering additional features such as syntactic phrases, titles, and parts of speech is generally quite small (0-10%). Adding features and using good weights can of course be done at the same time; but the fact that somewhat subtle differences in weighting strategy can overwhelm the effect due to additional features is worrisome. This means the experimenter must be very careful that adding features does not change the appropriateness of the weighting strategy.</Paragraph>
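To make the weighting class above concrete, the following is a minimal sketch of cosine-normalized tf*idf document and query vectors for a toy two-document collection, assuming the conventional SMART reading of the three-letter codes used in Table 1 (first letter = tf component, second = idf component, third = normalization). The collection, function names, and exact formulas are illustrative assumptions, not the paper's implementation.

```python
# Sketch of tf*idf weight variants of the kind compared in Table 1 (ntc, lnc, ltc),
# assuming the usual SMART reading of the three-letter codes:
#   1st letter: tf component   (n = raw tf, l = 1 + log(tf))
#   2nd letter: idf component  (t = log(N/df), n = none)
#   3rd letter: normalization  (c = cosine)
import math
from collections import Counter

def weight_vector(term_counts, df, N, scheme):
    tf_code, idf_code, norm_code = scheme
    weights = {}
    for term, tf in term_counts.items():
        w = float(tf) if tf_code == "n" else 1.0 + math.log(tf)
        if idf_code == "t":
            w *= math.log(N / df[term])
        weights[term] = w
    if norm_code == "c":                      # cosine normalization by vector length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        weights = {t: w / norm for t, w in weights.items()}
    return weights

def similarity(qvec, dvec):
    # inner product of the normalized query and document vectors
    return sum(w * dvec.get(t, 0.0) for t, w in qvec.items())

# Toy collection: two short documents and one query (illustrative only).
docs = [Counter("weighting methods for document retrieval".split()),
        Counter("retrieval of long and short document texts".split())]
N = len(docs)
df = Counter(t for d in docs for t in d)      # document frequency of each term

query = Counter("document weighting".split())
qvec = weight_vector(query, df, N, "ltc")     # QWT = ltc
for i, d in enumerate(docs):
    dvec = weight_vector(d, df, N, "lnc")     # DOCWT = lnc
    print(i, round(similarity(qvec, dvec), 3))
```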
<Paragraph position="4"> 2.2. Adding New Features Suppose an experimenter has determined a good weighting strategy for a basic set of features used to describe a query or document and now wishes to extend the set of features. In the standard tf*idf, cosine-normalized class of weights, this is not as simple as it may first appear. The obvious first step, making sure the pre-normalization weights of the new set of features and the old set are commensurate, is normally straightforward. But then problems occur because of the cosine normalization. For example, suppose there were two documents in a collection, one of them much longer than the other:</Paragraph> <Paragraph position="6"> Now suppose the new approach adds a reasonably constant five features onto each document representative.</Paragraph> <Paragraph position="7"> (Examples of such features might be title words, or categories the document is in.) If the new features are simply added on to the list of old features, and the weights of the features are then normalized by the total length of the document, there are definite problems. Not only does the weight of the added features vary according to the length of the document (which could very well be what is wanted), but the weights of the old features have changed. A query that does not take advantage of the new features will suddenly find it much more difficult to retrieve short documents like D1. D1 is now much longer than it was, and therefore the values of w_1,k have all decreased because of normalization.</Paragraph> <Paragraph position="8"> Similarly, if the number of new added features tends to be much larger for long documents than for short ones (for example, with a very loose definition of phrase), a query composed of only old features will tend to favor short documents more than long ones (at least, more than it did originally). Since the original weighting scheme was a supposedly good one, these added features will hurt performance on the original-feature portion of the similarity. The similarity on the added-feature portion might help, but it will be difficult to judge how much.</Paragraph> <Paragraph position="9"> These normalization effects can be very major.</Paragraph> <Paragraph position="10"> Using a loose definition of phrase on CACM (a small test collection), adding phrases in the natural fashion above will hurt performance by 12%. However, if the phrases are added in such a way that the weights of the original single terms are not affected by normalization, then the addition of phrases improves performance by 9%.</Paragraph> <Paragraph position="11"> One standard approach when investigating the usefulness of adding features is to ensure that the weights of the old features remain unchanged throughout the investigation. In this way, the contribution of the new features can be isolated and studied separately at the similarity level. [Note that if this is done, the addition of new features may mean the re-addition of old features, if the weights of some old features are supposed to be modified.] This is the approach we have taken, for instance, with the weighting of phrases in TREC. The single-term information and the phrase information are kept separate within a document vector. Each of the separate subvectors is normalized by the length of the single-term subvector. In this way, the weights of all terms are kept commensurate with each other, and the similarity due to the original single terms is kept unchanged.</Paragraph>
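A minimal sketch of the two normalization choices just described: naive renormalization over the combined vector (which shrinks the original single-term weights) versus keeping separate subvectors, both normalized by the length of the single-term subvector (which leaves the single-term similarity unchanged). Function names and feature weights are illustrative assumptions, not the paper's system.

```python
# Contrast between normalizing a combined feature vector and normalizing
# separate subvectors by the single-term subvector length only.
import math

def l2_norm(weights):
    return math.sqrt(sum(w * w for w in weights.values()))

def naive_combined(single_terms, phrases):
    # Phrases are appended to the term list and everything is normalized by
    # the length of the combined vector: the original single-term weights
    # shrink, so a query using only single terms now scores this document lower.
    combined = {**single_terms, **phrases}
    norm = l2_norm(combined)
    return {t: w / norm for t, w in combined.items()}

def separate_subvectors(single_terms, phrases):
    # Two subvectors, both normalized by the length of the single-term
    # subvector only: the single-term weights (and hence the single-term
    # part of the similarity) are unaffected by adding phrases.
    norm = l2_norm(single_terms)
    return ({t: w / norm for t, w in single_terms.items()},
            {p: w / norm for p, w in phrases.items()})

single = {"weighting": 2.0, "methods": 1.0}      # pre-normalization single-term weights (made up)
phrases = {"weighting methods": 1.5}             # hypothetical phrase feature

print(naive_combined(single, phrases))
print(separate_subvectors(single, phrases))
```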
<Paragraph position="12"> The investigation of weighting strategies for additional features is not a simple task, even if the separation of old features and new features is done. For example, Joel Fagan, in his excellent study of syntactic and statistical phrases [2], spent over 8 months looking at weighting strategies. But if this investigation is not designed into the experiment from the beginning, it will be almost impossible.</Paragraph> <Section position="1" start_page="350" end_page="350" type="sub_section"> <SectionTitle> 2.3. Relevance Feedback </SectionTitle> <Paragraph position="0"> One opportunity for good term weighting occurs in the routing environment. Here, a query is assumed to represent a continuing information need, and a number of documents have already been seen for each query, some subset of which has been judged relevant. With this wealth of document features and information available, the official TREC routing run that proved to be the most effective was one that took the original query terms and assigned weights based on probability of occurrence in relevant and non-relevant documents [3, 5]. Once again, weighting, rather than feature selection, worked very well. (However, in this case the feature selection process did not directly adversely affect the weighting process.</Paragraph> <Paragraph position="1"> Instead, it was mostly the case that the additional features from relevant documents were simply not chosen or weighted optimally.) In this run, using the RPI feedback model developed by Fuhr [3], relevance feedback information was used to compute the feedback query term weight q_i of a term as q_i = p_i(1 - r_i) / [r_i(1 - p_i)] - 1. Here p_i is the average document term weight for relevant documents, and r_i is the corresponding factor for nonrelevant items. Only the terms occurring in the query were considered here, so no query expansion took place. Having derived these query term weights, the query was run against the document set. Let d_i denote the document term weight; then the similarity of a query to a document is computed by S(q, d) = Σ_i log(q_i * d_i + 1).</Paragraph> </Section> </Section>
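A minimal sketch of the feedback weighting just described: the RPI-style query term weight and the log-sum similarity, applied to made-up term statistics. The helper names and the toy values of p_i, r_i, and the document weights are illustrative assumptions, not the TREC system itself.

```python
# RPI-style relevance feedback weight q_i = p_i(1 - r_i) / (r_i(1 - p_i)) - 1
# and similarity S(q, d) = sum_i log(q_i * d_i + 1), as given in Section 2.3.
import math

def rpi_query_weight(p_i, r_i):
    # p_i: average term weight in judged-relevant documents
    # r_i: average term weight in judged-nonrelevant documents
    return p_i * (1.0 - r_i) / (r_i * (1.0 - p_i)) - 1.0

def similarity(query_weights, doc_weights):
    # Only terms occurring in the (unexpanded) query contribute.
    return sum(math.log(q * doc_weights.get(term, 0.0) + 1.0)
               for term, q in query_weights.items())

# Toy averages for two query terms (values are made up for illustration).
p = {"weighting": 0.60, "retrieval": 0.40}
r = {"weighting": 0.20, "retrieval": 0.30}
q = {term: rpi_query_weight(p[term], r[term]) for term in p}

doc = {"weighting": 0.5, "retrieval": 0.1, "methods": 0.3}
print(q, round(similarity(q, doc), 3))
```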
<Section position="4" start_page="350" end_page="351" type="metho"> <SectionTitle> 3. LEARNING WEIGHTS BY TERM FEATURES </SectionTitle> <Paragraph position="0"> The ad-hoc tf*idf weights above use only collection statistics to determine weights. However, if previous queries have been run on this collection, the results from those queries can be used to determine which term weighting factors are important for this collection. The final term weight is set to a linear combination of term weight factors, where the coefficient of each factor is set to minimize the squared error for the previous queries [4, 5]. The official TREC runs using this approach were nearly the top results, which was somewhat surprising given the very limited and inaccurate training information that was available.</Paragraph> <Paragraph position="1"> This approach to learning solves the major problem of learning in an ad-hoc environment: the fact that there is insufficient information about individual terms to learn reasonable weights. Most document terms have not occurred in previous queries, and therefore there is no evidence that can be directly applied. Instead, the known relevance information determines the importance of features of each term. The particular features used in TREC 1 were combinations of the following term factors:
tf: within-document frequency of the term
logidf: log((N+1)/n), where N is the number of documents in the collection and n is the number of documents containing the term
lognumterms: log(number of different terms in the document)
imaxtf: 1 / (maximum within-document frequency of a term in the document)
After using the relevance information, the final weight for a term in a TREC 1 document was</Paragraph> <Paragraph position="3"> There is no reason to believe the choice of factors used in TREC 1 is optimal; slight variations had been used for an earlier experiment. Experimentation is progressing on the choice of factors, especially when dealing with both single terms and phrases. However, even so, the TREC 1 evaluation results were very good. If the minimal learning information used by this approach is available, the results suggest it should be preferred to the ad-hoc weighting schemes discussed earlier.</Paragraph>
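A minimal sketch of the learning step described in this section: fitting the coefficients of a linear combination of the four term factors by least squares against target weights derived from past relevance judgments. The training rows, target values, and use of an intercept are illustrative assumptions; the actual TREC 1 factor combinations and training data differ.

```python
# Learn a term weight as a linear combination of the factors listed above
# (tf, logidf, lognumterms, imaxtf), choosing coefficients that minimize
# squared error against relevance-derived targets from previous queries.
import numpy as np

# One row per (query term, document) pair from past queries:
# [tf, logidf, lognumterms, imaxtf] -- values are made up for illustration.
X = np.array([
    [3.0, 2.1, 5.0, 0.33],
    [1.0, 0.7, 4.6, 0.50],
    [5.0, 3.0, 6.2, 0.20],
    [2.0, 1.2, 5.5, 0.25],
])
# Target "ideal" weights derived from relevance judgments (illustrative).
y = np.array([0.9, 0.1, 1.0, 0.4])

# Least-squares fit of the coefficients, with an added intercept column.
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)

def term_weight(tf, logidf, lognumterms, imaxtf):
    # Final weight = learned linear combination of the term factors.
    return float(np.dot(coef, [tf, logidf, lognumterms, imaxtf, 1.0]))

print(coef)
print(term_weight(tf=4.0, logidf=2.5, lognumterms=5.8, imaxtf=0.22))
```
</Section> </Paper>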