<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1010">
  <Title>Homonymy and Polysemy in Information Retrieval</Title>
  <Section position="5" start_page="73" end_page="77" type="evalu">
    <SectionTitle>
3 Experiments on Word-Sense Disambiguation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="73" end_page="73" type="sub_section">
      <SectionTitle>
3.1 Preliminary Experiments
</SectionTitle>
      <Paragraph position="0"> Our initial experiments were designed to investigate the following two hypotheses: Hypothesis 2 Word senses provide an effective separation between relevant and non-relevant documents. null As we saw earlier in the paper, it is possible for a query about 'AIDS' the disease to retrieve documents about 'hearing aids'. But to what extent are such inappropriate matches associated with relevance judgments? This hypothesis predicts that sense mismatches will be more likely to appear in documents that are not relevant than in those that are relevant. Hypothesis 3 Even a small domain-specific collection of documents exhibits a significant degree of lexical ambiguity.</Paragraph>
      <Paragraph position="1"> Little quantitative data is available about lexical ambiguity, and such data as is available is often confined to only a small number of words. In addition, it is generally assumed that lexical ambiguity does not occur very often in domain-specific text. This hypothesis was tested by quantifying the ambiguity for a large number of words in such a collection, and challenging the assumption that ambiguity does not occur very often.</Paragraph>
      <Paragraph position="2"> To investigate these hypotheses we conducted experiments with two standard test collections, one consisting of titles and abstracts in Computer Science, and the other consisting of short articles from Time magazine.</Paragraph>
      <Paragraph position="3"> The first experiment was concerned with determining how often sense mismatches occur between a query and a document, and whether these mismatches indicate that the document is not relevant. To test this hypothesis we manually identified the senses of the words in the queries for two collections (Computer Science and Time). These words were then manually checked against the words they matched in the top ten ranked documents for each query (the ranking was produced using a probabilistic retrieval system). The number of sense mismatches was then computed, and the mismatches in the relevant documents were identified.</Paragraph>
      <Paragraph position="4"> The second experiment involved quantifying the degree of ambiguity found in the test collections. We manually examined the word tokens in the corpus for each query word, and estimated the distribution of the senses. The number of word types with more than one meaning was determined. Because of the volume of data analysis, only one collection was examined (Computer Science), and the distribution of senses was only coarsely estimated; there were approximately 300 unique query words, and they constituted 35,000 tokens in the corpus.</Paragraph>
      <Paragraph position="5"> These experiments provided strong support for Hypotheses 2 and 3. Word meanings are highly correlated with relevance judgements, and the corpus study showed that there is a high degree of lexical ambiguity even in a small collection of scientific text (over 40% of the query words were found to be ambiguous in the corpus). These experiments provided a clear indication of the potential of word meanings to improve the performance of a retrieval system. The experiments are described in more detail in (Krovetz and Croft 92).</Paragraph>
    </Section>
    <Section position="2" start_page="73" end_page="76" type="sub_section">
      <SectionTitle>
3.2 Experiments with different sources of evidence
</SectionTitle>
      <Paragraph position="0"> evidence The next set of experiments were concerned with determining the effectiveness of different sources of evidence for distinguishing word senses. We were also interested in the extent with which a difference in form corresponded to a difference in meaning. For example, words can differ in morphology (authorize/authorized), or part-of-speech (diabetic \[noun\]/diabetic \[adj\]), or in their ability to appear in a phrase (database/data base). They can also exhibit such differences, but represent different concepts, such as author/authorize. sink\[noun\]/sink\[verb\], or stone wall/stonewall. Our default assumption was that a difference in form is associated with a difference in meaning unless we could establish that the different word forms were related.</Paragraph>
      <Paragraph position="1">  We investigated two approaches for relating senses with respect to morphology and part of speech: 1) exploiting the presence of a variant of a term within its dictionary definition, and 2) using the overlap of the words in the definitions of suspected variants.  For example, liable appears within the definition of liability, and this is used as evidence that those words are related. Similarly, flat as a noun is defined as &amp;quot;a flat tire', and the presence of the word in its own definition, but with a different part of speech, is taken as evidence that the noun and adjective meanings are related. We can also compute the overlap between the definitions of liable and liability, and if they have a significant number of words in common then that is evidence that those meanings are related. These two strategies could potentially be used for phrases as well, but phrases are one of the areas where dictionaries are incomplete, and other methods are needed for determining when phrases are related. We will discuss this in Section 3.2.4.</Paragraph>
      <Paragraph position="2"> We conducted experiments to determine the effectiveness of the two methods for linking word senses. In the first experiment we investigated the performance of a part-of-speech tagger for identifying the related forms. These related forms (e.g., fiat as a noun and an adjective) are referred to as instances of zero-affix morphology, or functional shift (Marchand 63). We first tagged all definitions in the dictionary for words that began with the letter 'W'. This produced a list of 209 words that appeared in their own definitions with a different part of speech. However, we found that only 51 (24%) were actual cases of related meanings. This low success rate was almost entirely due to tagging error. That is, we had a false positive rate of 76% because the tagger indicated the wrong part of speech. We conducted a failure analysis and it indicated that 91% the errors occurred in idiomatic expressions (45 instances) or example sentences associated with the definitions (98 instances). We therefore omitted idiomatic senses and example sentences from further processing and tagged the rest of the dictionary. 2 The result of this experiment is that the dictionary contains at least 1726 senses in which the headword was mentioned, but with a different part of speech, of which 1566 were in fact related (90.7%). We analyzed the distribution of the connections, and this is given in Table 1 (n = 1566).</Paragraph>
      <Paragraph position="3"> However, Table 1 does not include cases in which the word appears in its definition, but in an inflected form. For example, 'cook' as a noun is defined as 'a person who prepares and cooks food'. Unless we recognize the inflected form, we will not capture all of the instances. We therefore repeated the procedure, but allowing for inflectional variants. The result is given in Table 2 (n = 1054).</Paragraph>
      <Paragraph position="4"> We also conducted an experiment to determine ~Idiomatic senses were identified by the use of font codes.</Paragraph>
      <Paragraph position="5"> the effectiveness of capturing related senses via word overlap. The result is that if the definitions for the root and variant had two or more words in common ,3 93% of the pairs were semantically related. However, of the sense-pairs that were actually related, two-thirds had only one word in common. We found that 65% of the sense-pairs with one word in common were related. Having only one word in common between senses is very weak evidence that the senses are related, and it is not surprising that there is a greater degree of error.</Paragraph>
      <Paragraph position="6"> Tile two experiments, tagging and word overlap, were found to be to be highly effective once the common causes of error were removed. In the case of tagging the error was due to idiomatic senses and example sentences, and in the case of word overlap the error was links due to a single word in common. Both methods have approximately a 90% success rate in pairing the senses of morphological variants if those problems are removed. The next section will discuss our experiments with morphology.</Paragraph>
      <Paragraph position="7">  We conducted several experiments to determine the impact of grouping morphological variants on retrieval performance. These experiments are described in detail in (Krovetz 93), so we will only summarize them here.</Paragraph>
      <Paragraph position="8"> Our experiments compared a baseline (no stemming) against several different morphology routines: 1) a routine that grouped only inflectional variants (plurals and tensed verb forms), 2) a routine that grouped inflectional as well as derivational variants (e.g.,-ize,-ity), and 3) the Porter stemmer (Porter 80). These experiments were done with four different test collections which varied in both size and subject area. We found that there was a significant improvement over the baseline performance from grouping morphological variants.</Paragraph>
      <Paragraph position="9"> Earlier experiments with morphology in IR did not report improvements in performance (Harman 91).</Paragraph>
      <Paragraph position="10"> We attribute these differences to the use of different test collections, and in part to the use of different retrieval systems. We found that the improvement varies depending on the test collection, and that collections that were made up of shorter documents were more likely to improve. This is because morphological variants can occur within the same document, but they are less likely to do so in documents that are short. By grouping morphological variants, we are helping to improve access to the shorter documents. However, we also found improvements even aExcluding closed class words, such as of and for.</Paragraph>
      <Paragraph position="11">  in a collection of legal documents which had an average length of more than 3000 words.</Paragraph>
      <Paragraph position="12"> We also found it was very difficult to improve retrieval performance over the performance of the Porter stemmer, which does not use a lexicon. The absence of a lexicon causes the Porter stemmer to make errors by grouping morphological &amp;quot;false friends&amp;quot; (e.g.. author/authority, or police/policy). We found that there were three reasons why the Porter stemmer improves performance despite such groupings. The first two reasons are associated with the heuristics used by the stemmer: 1) some word forms will be grouped when one of the forms has a combination of endings (e.g., -ization and -ize).</Paragraph>
      <Paragraph position="13"> We empirically found that the word forms in these groups are almost always related in meaning. 2) the stemmer uses a constraint on the form of the resulting stem based on a sequence of consonants and vowels; we found that this constraint is surprisingly effective at separating unrelated variants. The third reason has to do with the nature of morphological variants. We found that when a word form appears to be a variant, it often is a variant. For example, consider the grouping of police and policy. We examined all words in the dictionary in which a word ended in 'y', and in which the 'y' could be replaced by 'e' and still yield a word in the dictionary. There were 175 such words, but only 39 were clearly unrelated in meaning to the presumed root (i.e., cases like policy/police). Of the 39 unrelated word pairs, only 14 were grouped by the Porter stemmer because of the consonant/vowel constraints. We also identified the morphological &amp;quot;'false friends&amp;quot; for the 10 most frequent suffixes. We found that out of 911 incorrect word pairs, only 303 were grouped by the Porter stemmer.</Paragraph>
      <Paragraph position="14"> Finally, we found that conflating inflectional variants harmed the performance of about a third of the queries. This is partially a result of the interaction between morphology and part-of-speech (e.g., a query that contains work in the sense of theoretical work will be grouped with all of the variants associated with the the verb- worked, working, works); we note that some instances of works can be related to the singular form work (although not necessarily the right meaning of work), and some can be related to the untensed verb form. Grouping inflectional variants also harms retrieval performance because of an overlap between inflected forms and uninflected forms (e.g., arms can occur as a reference to weapons, or as an inflected form of arm). Conflating these forms has the effect of grouping unrelated concepts, and thus increases the net ambiguity.</Paragraph>
      <Paragraph position="15"> Our experiments with morphology support our atgument about distinguishing homonymy and polysemy. Grouping related morphological variants makes a significant improvement in retrieval performance. Morphological false friends (policy/police) often provide a strong separation between relevant and non-relevant documents (see (Krovetz and Croft 92)). There are no morphology routines that can currently handle the problems we encountered with inflectional variants, and it is likely that separating related from unrelated forms will make further improvements in performance.</Paragraph>
      <Paragraph position="16">  Relatively little attention has been paid in IR to the differences in a word's part of speech. These differences have been used to help identify phrases (Dillon and Gray 83), and as a means of filtering for word sense disambiguation (to only consider the meanings of nouns (Voorhees 93)). To the best of our knowledge the differences have never been examined for distinguishing meanings within the context of IR.</Paragraph>
      <Paragraph position="17"> The aim of our experiments was to determine how well part of speech differences correlate with differences in word meanings, and to what extent the use of meanings determined by these differences will affect the performance of a retrieval system. We conducted two sets of experiments, one concerned with homonymy, and one concerned with polysemy. In the first experiment the Church tagger was used to identify part-of-speech of the words in documents and queries. The collections were then indexed by the word tagged with the part of speech (i.e., instead of indexing 'book', we indexed 'book/noun' and 'book/verb'). 4 A baseline was established in which all variants of a word were present in the query, regardless of part of speech variation; the baseline did not include any morphological variants of the query words because we wanted to test the interaction between morphology and part-of-speech in a separate experiment. The baseline was then compared against a version of the query in which all variations were eliminated except for the part of speech that was correct (i.e., if the word was used as a noun ill the original query, all other variants were eliminated). This constituted the experiment that tested homonymy. We then identified words that were related in spite of a difference in part of speech; this was based on the data that was produced by tagging the dictionary (see Section 3.2.1). Another version of the queries was constructed in which part of speech variants were retained if the meaning was related, 4in actuality, we indexed it with whatever tags were used by the tagger; we are just using 'noun' and 'verb' for purposes of illustration.</Paragraph>
      <Paragraph position="18">  and this was compared to the previous version.</Paragraph>
      <Paragraph position="19"> When we ran the experiments, we found that performance decreased compared with the baseline.</Paragraph>
      <Paragraph position="20"> However, we found many cases where the tagger was incorrect. 5 We were unable to determine whether the results of the experiment were due to the incorrectness of the hypothesis being tested (that distinctions in part of speech can lead to an improvement in performance), or to the errors made by the tagger.</Paragraph>
      <Paragraph position="21"> We also assumed that a difference in part-of-speech would correspond to a difference in meaning. The data in Table 1 and Table 2 shows that many words are related in meaning despite a difference in partof-speech. Not all errors made by the tagger cause decreases in retrieval performance, and we are in the process of determining the error rate of the tagger on those words in which part-of-speech differences are also associated with a difference in concepts (e.g., novel as a noun and as an adjective). 6  Phrases are an important and poorly understood area of IR. They generally improve retrieval performance, but the improvements are not consistent. Most research to date has focused on syntactic phrases, in which words are grouped together because they are in a specific syntactic relationship (Fagan 87), (Smeaton and Van Rijsbergen 88). The research in this section is concerned with a subset of these phrases, namely those that are lexical. A lexical phrase is a phrase that might be defined in a dictionary, such as hot line or back end. Lexical phrases can be distinguished from a phrases such as sanctions against South Africa in that the meaning of a lexical phrase cannot necessarily be determined from the meaning of its parts.</Paragraph>
      <Paragraph position="22"> Lexical phrases are generally made up of only two or three words (overwhelmingly just two), and they usually occur in a fixed order. The literature mentions examples such as blind venetians vs. venetian blinds, or science library vs. library science, but these are primarily just cute examples. It is very rare that the order could be reversed to produce a different concept.</Paragraph>
      <Paragraph position="23"> Although dictionaries contain a large number of phrasal entries, there are many lexical phrases that are missing. These are typically proper nouns  ~There are approximately 4000 words in the Longman dictionary which have more than one part-of-speech. Less than half of those words will be like novel, and we are examining them by hand.</Paragraph>
      <Paragraph position="24"> due process, strict liability). We manually identified the lexical phrases in four different test collections (the phrases were based on our judgement), and we found that 92 out of 120 phrases (77%) were not found in the Longman dictionary. A breakdown of the phrases is given in (h:rovetz 95).</Paragraph>
      <Paragraph position="25"> For the phrase experiment we not only had to identify the lexical phrases, we also had to identiL' any related forms, such as database~data base. This was done via brute force -- a program simply concatenated every adjacent word in the database, and if it was also a single word in the collection it prim ted out the pair. We tested this with the Computer Science and Time collections, and used those results to develop an exception list for filtering the pairs (e.g., do not consider &amp;quot;special ties/specialties'). We represented the phrases using a proximity operator: and tried several experiments to include the related form when it was found in the corpus.</Paragraph>
      <Paragraph position="26"> We found that retrieval performance decreased for 118 out of 120 phrases. A failure analysis indicated that this was due to the need to assign partial credit to individual words of a phrase. The component words were always related to the meaning of the compound as a whole (e.g., Britain and Great Britain).</Paragraph>
      <Paragraph position="27"> We also found that most of the instances of open/closed compounds (e.g., database~data base) were related. Cases like &amp;quot;stone wall/stonewall' or 'bottle neck/bottleneck' are infrequent. The effect oll performance of grouping the compounds is related to the relative distribution of the open and closed forms. Database~data base occurred in about a 50/50 distribution, and the queries in which they occurred were significantly improved when the related form was included. null</Paragraph>
    </Section>
    <Section position="3" start_page="76" end_page="77" type="sub_section">
      <SectionTitle>
3.2.5 Interactions between Sources of
Evidence
</SectionTitle>
      <Paragraph position="0"> We found many interactions between the different sources of evidence. The most striking is the interaction between phrases and morphology. We found that the use of phrases acts as a filter for the grouping of morphological variants. Errors in morphology generally do not hurt performance within the restricted context. For example, the Porter stemmer will reduce department to depart, but this has no effect in the context of the phrase 'Justice department'.</Paragraph>
      <Paragraph position="1"> ~The proximity operator specifies that the query words must be adjacent and in order, or occur within a specific number of words of each other.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>