<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1018"> <Title>Orthogonal Negation in Vector Spaces for Modelling Word-Meanings and Document Retrieval</Title> <Section position="4" start_page="0" end_page="0" type="relat"> <SectionTitle> 4 Evaluation and Results </SectionTitle> <Paragraph position="0"> This section describes experiments which compare the three methods of negation described above (post-retrieval filtering, constant subtraction and vector negation) with the baseline alternative of no negation at all. The experiments were carried out using the vector space model described in Section 2.1.</Paragraph> <Paragraph position="1"> To judge the effectiveness of different methods at removing unwanted meanings over a large number of queries, we made the following assumptions. A document which is relevant to the meaning of 'term a NOT term b' should contain as many references to term a and as few references to term b as possible.</Paragraph> <Paragraph position="2"> Close neighbours and synonyms of term b are undesirable as well, since if they occur the document in question is likely to be related to the negated term even if the negated term itself does not appear.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Queries and results for negating single and multiple terms </SectionTitle> <Paragraph position="0"> 1200 queries of the form 'term a NOT term b' were generated for 3 different document collections. The terms chosen were the 100 most frequently occurring (non-stop) words in the collection, 100 mid-frequency words (the 1001st to 1100th most frequent), and 100 low-frequency words (the 5001st to 5100th most frequent). The nearest neighbour (word with highest cosine similarity) to each positive term was taken to be the negated term. 
(This assumes that a user is most likely to want to remove a meaning closely related to the positive term: there is no point in removing unrelated information which would not be retrieved anyway.) In addition, for the 100 most frequent words, an extra retrieval task was performed with the roles of the positive term and the negated term reversed, so that in this case the system was being asked to remove the most common words in the collection from a query generated by their nearest neighbour. We anticipated that this would be an especially difficult task, and a particularly realistic one, simulating a user who is swamped with information about a 'popular topic' in which they are not interested.1 The document collections used were from the British National Corpus (published by Oxford University, the textual data consisting of ca 90M words, 85K documents), the New York Times News Syndicate (1994-96, from the North American News Text Corpus published by the Linguistic Data Consortium, ca 143M words, 370K documents) and the Ohsumed corpus of medical documents (Hersh et al., 1994) (ca 40M words, 230K documents).</Paragraph> <Paragraph position="1"> The 20 documents most relevant to each query were obtained using each of the following four techniques. * No negation. The query was just the positive term and the negated term was ignored.</Paragraph> <Paragraph position="2"> * Post-retrieval filtering. After vector retrieval using only the positive term as the query term, documents containing the negated term were eliminated.</Paragraph> <Paragraph position="3"> * Constant subtraction. Experiments were performed with a variety of subtraction constants.</Paragraph> <Paragraph position="4"> The query a NOT b was thus given the vector a − λb for some λ ∈ [0,1]. 
The results recorded in this paper were obtained using λ = 0.75, which gives a direct comparison with vector negation.</Paragraph> <Paragraph position="5"> * Vector negation, as described in this paper.</Paragraph> <Paragraph position="6"> For each set of retrieved documents, the following results were counted.</Paragraph> <Paragraph position="7"> [...] performance on query terms of different frequencies in this paper, though more detailed results are available from the author on request.</Paragraph> <Paragraph position="8"> [...] neighbour of the negated term: to avoid inconsistency, we took as 'negative neighbours' only those which were closer to the negated term than to the positive term.</Paragraph> <Paragraph position="9"> * The relative frequency of the synonyms of the negated term, as given by the WordNet database (Fellbaum, 1998). As above, words which were also synonyms of the positive term were discounted. On the whole fewer such synonyms were found in the Ohsumed and NYT documents, which have many medical terms and proper names which are not in WordNet.</Paragraph> <Paragraph position="10"> Additional experiments were carried out to compare the effectiveness of different forms of negation at removing several unwanted terms. The same 1200 queries were used as above, and the next nearest neighbour was added as a further negative argument. For two negated terms, the post-retrieval filtering process worked by discarding documents containing either of the negative terms. Constant subtraction worked by subtracting a constant multiple of each of the negated terms from the query. Vector negation worked by making the query vector orthogonal to the plane generated by the two negated terms, as in Equation 2.</Paragraph> <Paragraph position="11"> Results were collected in much the same way as the results for single-argument negation. 
Occurrences of each of the negated terms were added together, as were occurrences of the neighbours and WordNet synonyms of either of the negated words.</Paragraph> <Paragraph position="12"> The results of our experiments are collected in Table 2 and summarised in Figure 1. The results for a single negated term demonstrate the following points.</Paragraph> <Paragraph position="13"> * All forms of negation proved extremely good at removing the unwanted words. This is trivially true for post-retrieval filtering, which works by discarding any documents that contain the negated term. It is more interesting that constant subtraction and vector negation performed so well, cutting occurrences of the negated word by 82% and 85% respectively compared with the baseline of no negation.</Paragraph> <Paragraph position="14"> * On average, using no negation at all retrieved the most occurrences of the positive term, though not in every case. While this upholds the claim that any form of negation is likely to remove relevant as well as irrelevant results, the damage done was only around 3% for post-retrieval filtering and 25% for constant subtraction and vector negation.</Paragraph> <Paragraph position="15"> * These observations alone would suggest that post-retrieval filtering is the best method for the simple goal of maximising occurrences of the positive term while minimising the occurrences of the negated term. However, vector negation and constant subtraction dramatically outperformed post-retrieval filtering at removing neighbours of the negated terms, and were reliably better at removing WordNet synonyms as well. We believe this to be good evidence that, while post-retrieval filtering is by definition better at removing unwanted strings, the vector methods (either orthogonal or constant subtraction) are much better at removing unwanted meanings. 
Preliminary observations suggest that in the cases where vector negation retrieves fewer occurrences of the positive term than other methods, the other methods are often retrieving documents that are still related in meaning to the negated term.</Paragraph> <Paragraph position="16"> * Constant subtraction can give similar results to vector negation on these queries (though the vector negation results are slightly better). This holds for queries where the negated term is the closest neighbour of the positive term, for which the assumption that the similarity between these pairs is around 0.75 is a reasonable approximation. However, further experiments with a variety of negated arguments chosen at random from a list of neighbours demonstrated that in this more general setting, the flexibility provided by vector negation produced conclusively better results than constant subtraction for any single fixed constant.</Paragraph> <Paragraph position="17"> In addition, the results for removing multiple negated terms demonstrate the following points.</Paragraph> <Paragraph position="18"> * Removing another negated term further reduces the retrieval of the positive term for all forms of negation. Constant subtraction is the worst affected, performing noticeably worse than vector negation.</Paragraph> <Paragraph position="19"> * All three forms of negation still remove many occurrences of the negated term. Vector negation and (trivially) post-retrieval filtering perform as well as they do with a single negated term.</Paragraph> <Paragraph position="20"> However, constant subtraction performs much worse, retrieving more than twice as many unwanted terms as vector negation.</Paragraph> <Paragraph position="21"> * Post-retrieval filtering was even less effective at removing neighbours of the negated term than with a single negated term. Constant subtraction also performed much less well. Vector negation was by far the best method for removing negative neighbours. 
The same observation holds for WordNet synonyms, though the results are less pronounced.</Paragraph> <Paragraph position="22"> This shows that vector negation is capable of removing unwanted terms and their related words from retrieval results, while retaining more occurrences of the original query term than constant subtraction. Vector negation does much better than other methods at removing neighbours and synonyms, and we therefore expect that it is better at removing documents referring to unwanted meanings of ambiguous words. Experiments with sense-tagged data are planned to test this hypothesis.</Paragraph> <Paragraph position="23"> The goal of these experiments was to evaluate the extent to which the different methods could remove unwanted meanings, which we measured by counting the frequency of unwanted terms and concepts in retrieved documents. This leaves the problems of determining the optimal scope for the negation quantifier for an IR system, and of developing a natural user interface for this process for complex queries. These important challenges are beyond the scope of this paper, but would need to be addressed to incorporate vector negation into a state-of-the-art IR system.</Paragraph> </Section> </Section> </Paper>