<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1051"> <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 403-410, Vancouver, October 2005. (c) 2005 Association for Computational Linguistics. Differentiating Homonymy and Polysemy in Information Retrieval</Title> <Section position="4" start_page="404" end_page="404" type="metho"> <SectionTitle> 3 Experimental Setup </SectionTitle> <Paragraph position="0"> The experiments in this study use the WT10G corpus (Hawking and Craswell, 2002), an IR web test collection consisting of 1.69 million documents.</Paragraph> <Paragraph position="1"> Two query/relevance-judgment sets are available, each consisting of 50 queries. This study uses the TREC 10 Web Track Ad-Hoc query set (NIST topics 501 - 550). The relevance judgments for these queries were produced by pooling the top 100 ranked documents retrieved by each of the systems that participated in the TREC 10 Web Track.</Paragraph> <Paragraph position="2"> Initially the author produced an index of the WT10G and performed retrieval on this unmodified collection in order to measure baseline retrieval effectiveness. The ranking algorithm was length-normalized TF.IDF (Salton and McGill, 1983), which is comparable to the studies in section 2. Next, two modified versions of the collection were produced in which additional ambiguity, in the form of pseudowords, had been added. The first used pseudowords created by selecting constituent pseudosenses that are unrelated, thus introducing additional homonymy. The second used a new method of generating pseudowords that exhibit polysemy (the methodology is described in section 4.1). Contrasting retrieval performance over these three indexes quantifies the relative impact of both homonymy and polysemy on retrieval effectiveness.
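The baseline ranking just described can be sketched as follows. This is an illustrative cosine-style length-normalized TF.IDF, a minimal sketch rather than the exact weighting formula the paper used:

```python
import math
from collections import Counter

def tfidf_scores(query_terms, docs):
    """Score each document for the query with length-normalized TF.IDF.
    docs is a list of token lists; returns {doc_index: score}."""
    N = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] += 1
    scores = {}
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        # TF.IDF weights; terms occurring in every document get zero IDF
        weights = {t: tf[t] * math.log(N / df[t]) for t in tf if df[t] < N}
        # length normalization: divide by the document's vector norm
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        scores[i] = sum(weights.get(t, 0.0) for t in query_terms) / norm
    return scores
```

A document sharing no terms with the query scores zero, and longer documents are penalized by the norm, matching the length-normalization intent.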
The final step was to measure the effect of attempting to resolve the additional ambiguity that had been added to the collection. To do this, the author simulated disambiguation at varying degrees of accuracy and measured the impact on retrieval effectiveness.</Paragraph> </Section> <Section position="5" start_page="404" end_page="406" type="metho"> <SectionTitle> 4 Methodology </SectionTitle> <Paragraph position="0"> To date, only Nakov and Hearst (2003) have looked into creating more plausible pseudowords. Working with medical abstracts (MEDLINE) and the controlled vocabulary contained in the MeSH hierarchy, they created pseudosense pairings that are related. By identifying pairs of MeSH subject categories that frequently co-occurred and selecting the constituents of their pseudowords from these pairings, they produced a disambiguation test collection. Their results showed that category-based pseudowords provide a more realistic test data set for disambiguation, in that evaluation using them more closely resembles evaluation on real words. The challenge in this study lay in adapting these ideas to open-domain text.</Paragraph> <Section position="1" start_page="404" end_page="406" type="sub_section"> <SectionTitle> 4.1 Pseudoword Generation </SectionTitle> <Paragraph position="0"> This study used WordNet (Miller et al., 1990) to inform the production of pseudowords. WordNet (2.0) is a hierarchical semantic network developed at Princeton University. Concepts in WordNet are represented by synsets, and links between synsets represent hypernymy (subsuming) and hyponymy (subsumed) relationships, forming a hierarchical structure. A unique word sense consists of a lemma and the particular synset in which that lemma occurs.
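The synset/lemma model can be illustrated with a toy stand-in for WordNet's graph; the synset identifiers and links below are illustrative, not real WordNet 2.0 data:

```python
# Toy stand-in for WordNet's synset graph (illustrative data only).
SYNSETS = {
    "tornado.n.01": {"lemmas": ["tornado", "twister"], "hypernyms": ["cyclone.n.01"]},
    "tornado.n.02": {"lemmas": ["crack", "tornado"], "hypernyms": ["cocaine.n.01"]},
    "cyclone.n.01": {"lemmas": ["cyclone"], "hypernyms": ["windstorm.n.01"]},
    "cocaine.n.01": {"lemmas": ["cocaine"], "hypernyms": ["hard_drug.n.01"]},
    "windstorm.n.01": {"lemmas": ["windstorm"], "hypernyms": []},
    "hard_drug.n.01": {"lemmas": ["hard_drug"], "hypernyms": []},
}

def senses(lemma):
    """A unique word sense is a (lemma, synset) pair, so enumerating a
    word's senses means finding every synset that contains its lemma."""
    return [sid for sid, s in SYNSETS.items() if lemma in s["lemmas"]]

def hypernym_chain(synset_id):
    """Walk up the (first) hypernym link from a synset toward the root."""
    chain = []
    while SYNSETS[synset_id]["hypernyms"]:
        synset_id = SYNSETS[synset_id]["hypernyms"][0]
        chain.append(synset_id)
    return chain
```

Here `senses("tornado")` yields two synsets, mirroring how a word's polysemy is read off the number of synsets it belongs to.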
WordNet is a fine-grained lexical resource, and polysemy can be derived at varying degrees of granularity by traversing the link structure between synsets (figure 1).</Paragraph> <Paragraph position="1"> An important feature of pseudowords is the number of constituents, as this controls the amount of additional ambiguity created. A feature of all previous studies is that they generate pseudowords with a uniform number of constituents, e.g. size 2, size 5 or size 10, thus introducing uniform levels of additional ambiguity. Such an approach clearly does not reflect real words, which do not exhibit uniform levels of ambiguity. The approach taken in this study was to generate pseudowords in which the number of constituents was variable. As each pseudoword in this study contains one query word from the IR collection, the number of constituents was linked directly to the number of senses of that word contained in WordNet. This effectively doubles the level of ambiguity expressed by the original query word. If a query word was not contained in WordNet, it was taken to be a proper name and exempted from the process of adding ambiguity. It was felt that destroying any unambiguous proper names, which might act to anchor a query, would dramatically overstate the effects of ambiguity in the IR simulation. The average size of the pseudowords produced in these experiments was 6.4 pseudosenses.</Paragraph> <Paragraph position="2"> When producing the traditional pseudoword-based collection, the only modification to Sanderson's (1994) approach (described in section 2), other than the variable size, involved formalizing his observation that the constituent words were unlikely to be related. Given access to WordNet, it was possible to guarantee that this was the case by rejecting constituents that could be linked through its inheritance hierarchy. This ensures that the pseudowords produced display only homonymy.
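The rejection test for homonymous pseudowords can be sketched over a toy hypernym graph. One plausible reading of "linked through the inheritance hierarchy" is that the two words share any ancestor (or one subsumes the other); the paper's exact criterion may differ, and the graph data below is illustrative:

```python
# Toy hypernym graph standing in for WordNet's inheritance hierarchy
# (illustrative data, not real WordNet entries).
HYPERNYMS = {
    "tornado": ["cyclone"], "hurricane": ["cyclone"],
    "cyclone": ["windstorm"], "windstorm": [],
    "heroin": ["hard_drug"], "hard_drug": [],
    "sofa": ["furniture"], "furniture": [],
}

def ancestors(word):
    """All concepts reachable by following hypernym links upward."""
    seen, stack = set(), [word]
    while stack:
        w = stack.pop()
        for h in HYPERNYMS.get(w, []):
            if h not in seen:
                seen.add(h)
                stack.append(h)
    return seen

def related(w1, w2):
    """True if one word subsumes the other or they share an ancestor;
    rejecting such pairs leaves only homonymous constituents."""
    a1 = ancestors(w1) | {w1}
    a2 = ancestors(w2) | {w2}
    return bool(a1 & a2)
```

Under this check 'tornado' and 'hurricane' are related (both fall under 'cyclone'), so they would be rejected as constituents of a homonymous pseudoword, whereas 'tornado' and 'sofa' would be accepted.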
In order to produce pseudowords that model polysemy, it was essential to devise a method for selecting constituents that have the property of relatedness. The approach taken was to deliberately select constituent words that could be linked to a sense of the original query word through WordNet. Thus the additional ambiguity added to the collection models any underlying relatedness expressed by the original senses of the query word. Pseudowords produced in this way will be referred to as root pseudowords, reflecting that the ambiguity introduced is modeled around one root constituent. Consider the following worked example for the query &quot;How are tornadoes formed?&quot; After the removal of stopwords we are left with 'tornadoes' and 'formed', each of which is then transformed into a root pseudoword. The first step involves identifying any potential senses of the target word from WordNet. The word 'tornado' appears in two synsets: 1. tornado, twister -- (a localized and violently destructive windstorm occurring over land characterized by a funnel-shaped cloud extending toward the ground) 2. crack, tornado -- (a purified and potent form of cocaine that is smoked rather than snorted) For each sense of the target word the system expands WordNet's inheritance hierarchy to produce a directed graph of its hypernyms. Figure 2 shows an example of this graph for the first sense of the word 'tornado'. In order to ensure a related sense pair, the system builds a pool of words that are subsumed by concepts contained in this graph.</Paragraph> <Paragraph position="3"> This pool is generated by recursively moving up the hierarchy until it contains at least one viable candidate.
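The recursive pool construction might be sketched as follows, with the viability criteria abstracted into a caller-supplied predicate and the hierarchy data a toy stand-in for WordNet mirroring the 'tornado' example:

```python
# Toy WordNet-like hierarchy (illustrative data for the worked example).
HYPERNYMS = {"tornado": ["cyclone"], "hurricane": ["cyclone"],
             "cyclone": ["windstorm"], "windstorm": []}
HYPONYMS = {"windstorm": ["cyclone"], "cyclone": ["tornado", "hurricane"]}

def subsumed(concept):
    """All words below `concept` in the hierarchy (hyponym closure)."""
    out = []
    for child in HYPONYMS.get(concept, []):
        out.append(child)
        out.extend(subsumed(child))
    return out

def candidate_pool(sense, is_viable):
    """Move up the hypernym hierarchy one level at a time, collecting
    the words subsumed by each concept reached, and stop as soon as the
    pool contains at least one viable candidate."""
    frontier = [sense]
    while frontier:
        pool, next_frontier = set(), []
        for concept in frontier:
            for h in HYPERNYMS.get(concept, []):
                pool.update(subsumed(h))
                next_frontier.append(h)
        pool.discard(sense)          # a sense cannot pair with itself
        viable = [w for w in pool if is_viable(w)]
        if viable:
            return viable
        frontier = next_frontier     # climb one level higher and retry
    return []
```

With the toy data, climbing from 'tornado' to 'cyclone' already yields 'hurricane' as a viable sibling, so the search stops at the first level.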
For a candidate to be viable it must meet the following criteria: 1) It must exist in the IR collection.</Paragraph> <Paragraph position="4"> 2) It must not be part of another pseudoword.</Paragraph> <Paragraph position="5"> 3) It cannot be linked (through WordNet) to another constituent of the pseudoword.</Paragraph> <Paragraph position="6"> The pool for sense 1 of 'tornado' consists of [hurricane]. This process is repeated for each noun and verb sense of the query word. In this example there is one remaining sense of the word 'tornado': a slang term for the drug crack cocaine. For this sense the system produced a pool consisting of [diacetylmorphine|heroin]. Once all senses of the query word have been expanded, the resulting pseudoword, 'tornadoes/hurricane/heroin', is used to replace all occurrences of its constituents within the collection. Through this process the system produces pseudowords whose pseudosense pairings have subsuming relationships, e.g. 'tornadoes/hurricane' are subsumed by the higher category of 'cyclone', whilst 'tornadoes/heroin' are subsumed by the higher semantic category of 'hard_drug'.</Paragraph> </Section> <Section position="2" start_page="406" end_page="406" type="sub_section"> <SectionTitle> 4.2 Pseudo-disambiguation </SectionTitle> <Paragraph position="0"> In order to perform pseudo-disambiguation, the unmodified collection acts as a gold-standard model answer. Reducing each instance of a pseudoword back to one of its constituent components models the selection process made by a disambiguation system. The correct pseudosense for a given instance is, of course, the original word that appeared at that point in the collection.</Paragraph> <Paragraph position="1"> Variable levels of accuracy are introduced using a weighted probability model in which the correct pseudosense for a given test instance is selected with a fixed probability equal to the desired accuracy being simulated.
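A minimal sketch of this weighted selection, assuming errors fall uniformly on the incorrect pseudosenses:

```python
import random

def simulate_disambiguation(correct_sense, all_senses, accuracy, rng=random):
    """Return the correct pseudosense with probability `accuracy`;
    otherwise pick one of the incorrect pseudosenses at random."""
    if rng.random() < accuracy:
        return correct_sense
    wrong = [s for s in all_senses if s != correct_sense]
    return rng.choice(wrong) if wrong else correct_sense
```

Running this over every pseudoword instance in the collection, with `correct_sense` taken from the unmodified gold-standard text, simulates a disambiguation system of the chosen accuracy.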
When a disambiguation error is simulated, one of the incorrect pseudosenses is selected at random.</Paragraph> </Section> </Section> </Paper>