<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2021"> <Title>Using WordNet to Automatically Deduce Relations between Words in Noun-Noun Compounds</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> We present an algorithm for automatically disambiguating noun-noun compounds by deducing the correct semantic relation between their constituent words. This algorithm uses a corpus of 2,500 compounds annotated with WordNet senses and covering 139 different semantic relations (we make this corpus available online for researchers interested in the semantics of noun-noun compounds). The algorithm takes as input the WordNet senses for the nouns in a compound, finds all parent senses (hypernyms) of those senses, and searches the corpus for other compounds containing any pair of those senses. The relation with the highest proportional co-occurrence with any sense pair is returned as the correct relation for the compound.</Paragraph> <Paragraph position="1"> This algorithm was tested using a 'leave-one-out' procedure on the corpus of compounds. The algorithm identified the correct relations for compounds with high precision: in 92% of cases where a relation was found with a proportional co-occurrence of 1.0, it was the correct relation for the compound being disambiguated.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> Keywords: Noun-Noun Compounds, Conceptual </SectionTitle> <Paragraph position="0"/> </Section> <Section position="5" start_page="0" end_page="160" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Noun-noun compounds are short phrases made up of two or more nouns. These compounds are common in everyday language and are especially frequent, and important, in technical documents (Justeson & Katz, 1995, report that such phrases form the majority of the technical content of the scientific and technical documents they surveyed). Understanding these compounds requires the listener or reader to infer the correct semantic relationship between the words making up the compound, inferring, for example, that the phrase 'flu virus' refers to a virus that causes flu, while 'skin virus' describes a virus that affects the skin, and 'marsh virus' a virus contracted in marshes. In this paper we describe a novel algorithm for disambiguating noun-noun compounds by automatically deducing the correct semantic relationship between their constituent words.</Paragraph> <Paragraph position="1"> Our approach to compound disambiguation combines statistical and ontological information about words and relations in compounds. Ontological information is derived from WordNet (Miller, 1995), a hierarchical machine-readable dictionary, which is introduced in Section 2. Section 3 describes the construction of an annotated corpus of 2,500 noun-noun compounds covering 139 different semantic relations, with each noun and each relation annotated with its correct WordNet sense. Section 4 describes our algorithm for finding the correct relation between nouns in a compound, which makes use of this annotated corpus.
Our general approach is that the correct relation between two words in a compound can be deduced by finding other compounds containing words from the same semantic categories as the words in the compound to be disambiguated: if a particular relation occurs frequently in those other compounds, that relation is probably also the correct relation for the compound in question. Our algorithm implements this approach by taking as input the correct WordNet senses for the constituent words in a compound (both base senses and the parent senses, or hypernyms, of those senses), and searching the corpus for other compounds containing any pair of those base or hypernym senses. Relations are given a score equal to their proportional occurrence with those sense pairs, and the relation with the highest proportional occurrence score across all sense-pairs is returned as the correct relation for the compound. Section 5 describes two different leave-one-out tests of this 'Proportional Relation Occurrence' (PRO) algorithm, in which each compound is consecutively removed from the corpus and the algorithm is used to deduce the correct sense for that compound using the set of compounds left behind. These tests show that the PRO algorithm can identify the correct relations for compounds, and the correct senses of those relations, with high precision. Section 6 compares our algorithm for compound disambiguation with one recently presented alternative, Rosario et al.'s (2002) rule-based system for the disambiguation of noun-noun compounds. The paper concludes with a discussion of future developments of the PRO algorithm.</Paragraph> <Paragraph position="2"> Table 1: Thematic relations proposed by Gagné & Shoben (1997), with examples.
relation                      example
head causes modifier          flu virus
modifier causes head          college headache
head has modifier             picture book
modifier has head             lemon peel
head makes modifier           milk cow
head made of modifier         chocolate bird
head for modifier             cooking toy
modifier is head              dessert food
head uses modifier            gas antiques
head about modifier           travel magazine
head located modifier         mountain cabin
head used by modifier         servant language
modifier located head         murder town
head derived from modifier    oil money
</Paragraph> </Section> <Section position="6" start_page="160" end_page="160" type="metho"> <SectionTitle> 2 Introduction to WordNet </SectionTitle> <Paragraph position="0"> In both our annotated corpus of 2,500 noun-noun compounds and our proportional relation selection algorithm we use WordNet (Miller, 1995). The basic unit of WordNet is the sense. Each word in WordNet is linked to a set of senses, with each sense identifying one particular meaning of that word. For example, the noun 'skin' has senses representing (i) the cutis or skin of human beings, (ii) the rind or peel of vegetables or fruit, (iii) the hide or pelt of an animal, (iv) a skin or bag used as a container for liquids, and so on. Each sense contains an identifying number and a 'gloss' (explaining what that sense means). Each sense is linked to its parent sense, which subsumes that sense as part of its meaning. For example, sense (i) of the word 'skin' (the cutis or skin of human beings) has a parent sense 'connective tissue' which contains that sense of skin and also contains the relevant senses of 'bone', 'muscle', and so on. Each parent sense has its own parents, which in turn have their own parent senses, and so on up to the (notional) root node of the WordNet hierarchy.
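As a concrete illustration, the following sketch lists the senses of 'skin' and walks one sense's hypernym chain. It uses NLTK's WordNet interface purely for illustration (the work described here used WordNet 2.0 directly, and sense inventories differ between WordNet versions, so the senses returned may not match those listed above).

```python
# Illustrative sketch: word senses and hypernym chains in WordNet,
# via NLTK (an assumed tool choice, not the one used in this paper).
from nltk.corpus import wordnet as wn

# Each word maps to a set of senses, each with an identifier and a gloss.
for sense in wn.synsets('skin', pos=wn.NOUN):
    print(sense.name(), '-', sense.definition())

# Each sense links to a parent sense (hypernym); following these links
# climbs the hierarchy towards the (notional) root node.
sense = wn.synset('skin.n.01')
while sense.hypernyms():
    sense = sense.hypernyms()[0]
    print(sense.name())
```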
This hierarchical structure allows computer programs to analyse the semantics of natural language expressions, by finding the senses of the words in a given expression and traversing the WordNet graph to make generalisations about the meanings of those words.</Paragraph> </Section> <Section position="7" start_page="160" end_page="162" type="metho"> <SectionTitle> 3 Corpus of Annotated Compounds </SectionTitle> <Paragraph position="0"> In this section we describe the construction of a corpus of noun-noun compounds annotated with the correct WordNet noun senses for the constituent words, the correct semantic relation between those words, and the correct WordNet verb sense for that relation. In addition to providing a set of compounds to use as input for our compound disambiguation algorithm, one aim in constructing this corpus was to examine the relations that exist in naturally occurring noun-noun compounds. This follows from existing research on the relations that occur in noun-noun compounds (e.g. Gagné & Shoben, 1997). Gagné and her colleagues provide a set of 'thematic relations' (derived from relations proposed by, for example, Levi, 1978) which, they argue, cover the majority of semantic relations between modifier (first word) and head (second word) in noun-noun compounds. Table 1 shows the set of thematic relations proposed in Gagné & Shoben (1997). A side-effect of the construction of our corpus of noun-noun compounds was an assessment of the coverage and usefulness of this set of relations.</Paragraph> <Section position="1" start_page="160" end_page="161" type="sub_section"> <SectionTitle> 3.1 Procedure </SectionTitle> <Paragraph position="0"> The first step in constructing a corpus of annotated noun-noun compounds involved selecting a set of noun-noun compounds to classify. The source used was the set of noun-noun compounds defined in WordNet. Compounds from WordNet were used for two reasons. First, each compound had an associated gloss or definition, written by the lexicographer who entered that compound into WordNet, which explains the relation between the two words in that compound. Sets of compounds from other sources would not have such associated definitions. Second, by using compounds from WordNet, we could guarantee that all constituent words of those compounds would also have entries in WordNet, ensuring their acceptability to our compound disambiguation algorithm. An initial list of over 40,000 two-word noun-noun compounds was extracted from WordNet version 2.0. From this list we selected a random subset of compounds and went through that set excluding all compounds using scientific Latin (e.g. ocimum basilicum), idiomatic compounds (e.g. zero hour, ugli fruit), compounds containing proper nouns (e.g. Yangtze river), non-English compounds (e.g. faux pas), and chemical terminology (e.g. carbon dioxide).</Paragraph> <Paragraph position="2"> The remaining compounds were placed in random order, and the third author annotated each compound with the WordNet noun senses of the constituent words, the semantic relation between those words, and the WordNet verb sense of that relation (again, with senses extracted from WordNet version 2.0). A web page was created for this annotation task, showing the annotator the compound to be annotated and the WordNet gloss (meaning) for that compound (see Figure 1).
This page also showed the annotator the list of possible WordNet senses for the modifier noun and head noun in the compound, allowing the annotator to select the correct WordNet sense for each word.</Paragraph> <Paragraph position="3"> After the correct senses for the words in the compound had been selected, another page was presented (Figure 2) allowing the annotator to identify the correct semantic relation for that compound, and then to select the correct WordNet sense for the verb in that relation.</Paragraph> <Paragraph position="4"> We began by assuming that Gagné & Shoben's (1997) set of 14 relations was complete and could account for all compounds being annotated. However, a preliminary test revealed some common relations (e.g. eats, lives in, contains, and resembles) that were not in Gagné & Shoben's set. These relations were therefore added to the list of relations we used. Various other, less commonly occurring relations were also observed. To allow for these other relations, a function was added to the web page allowing the annotator to enter the appropriate relation, expressed in the form 'noun (insert relation) modifier' or 'modifier (insert relation) noun'. The annotator would then be shown the set of verb senses for that relation and asked to select the correct sense.</Paragraph> </Section> <Section position="2" start_page="161" end_page="162" type="sub_section"> <SectionTitle> 3.2 Results </SectionTitle> <Paragraph position="0"> Word sense, relation, and relation sense information was gathered for 2,500 compounds. Relation occurrence was well distributed across these compounds: 139 different relations were used in the corpus. The frequency of these relations varied widely: 86 relations occurred for just one compound in the corpus, and 53 relations occurred more than once. For the relations that occurred more than once in the corpus, the average number of occurrences was 46. Table 2 shows the 5 most frequent relations in the corpus: these 5 relations account for 54% of compounds. Note that 2 of the 5 relations in Table 2 (head contains modifier and head resembles modifier) are not listed in Gagné's set of thematic relations. This suggests that the taxonomy needs to be extended by the addition of further relations.

Table 2: The 5 most frequent relations in the corpus.
relation                   frequency   number of relation senses
head used for modifier     382         3
head about modifier        360         1
head located modifier      226         3
head contains modifier     217         3
head resembles modifier    169         1

In addition to identifying the relations used in compounds in our corpus, we also identified the WordNet verb sense of each relation. In total 146 different relation senses occurred in the corpus. Most relations in the corpus were associated with just 1 relation sense. However, a significant minority of relations (29 relations, or 21% of all relations) had more than one relation sense; on average, these relations had three different senses each. Relations with more than one sense in the corpus tended to be the more frequently occurring relations: as Table 2 shows, of the 5 most frequent relations in the corpus, 3 were identified as having more than one relation sense. The two relations with the largest number of different relation senses were carries (9 senses) and makes (8 senses). Table 3 shows the 3 most frequent senses for both relations.

Table 3: Senses for the relations makes and carries.
relation   relation sense gloss                                  example
Makes      bring forth or yield                                  spice tree
Makes      cause to occur or exist                               smoke bomb
Makes      create or manufacture a man-made product              cider mill
Carries    contain or hold, have within                          pocket watch
Carries    move while supporting, in a vehicle or one's hands    passenger van
Carries    transmit or serve as the medium for transmission      radio wave
This diversity of relation senses suggests that Gagné's set of thematic relations may be too coarse-grained to capture distinctions between relations.</Paragraph> </Section> </Section> <Section position="8" start_page="162" end_page="165" type="metho"> <SectionTitle> 4 Compound Disambiguation Algorithm </SectionTitle> <Paragraph position="0"> The previous section described the development of a corpus associating word-sense and relation data for a large set of noun-noun compounds. This section presents the 'Proportional Relation Occurrence' (PRO) algorithm, which makes use of this information to deduce the correct relation for a given compound.</Paragraph> <Paragraph position="1"> Our approach to compound disambiguation works by finding other compounds containing words from the same semantic categories as the words in the compound to be disambiguated: if a particular relation occurs frequently in those other compounds, that relation is probably also the correct relation for the compound in question. We take WordNet senses to represent semantic categories. Once the correct WordNet sense for a word has been identified, that word can be placed in a set of nested semantic categories: the category represented by that WordNet sense, by the parent sense (or hypernym) of that sense, the parent of that parent, and so on up to the (notional) root sense of WordNet (the semantic category which subsumes every other category in WordNet). Our algorithm uses the set of semantic categories for the words in a compound, and searches for other compounds containing words from any pair of those categories.</Paragraph> <Paragraph position="2"> Figure 3 shows the algorithm in pseudocode.</Paragraph> <Paragraph position="3"> The algorithm uses a corpus of annotated noun-noun compounds and, to disambiguate a given compound, takes as input the correct WordNet sense for the modifier and head words of that compound, plus all hypernyms of those senses. The algorithm pairs each modifier sense with each head sense (lines 1 & 2 in Figure 3). For each sense-pair, the algorithm goes through the corpus of noun-noun compounds and extracts every compound whose modifier sense (or a hypernym of that sense) is equal to the modifier sense in the current sense-pair, and whose head sense (or a hypernym of that sense) is equal to the head sense in that pair (lines 5 to 8). The algorithm counts the number of times each relation occurs in that set of compounds, and assigns each relation a Proportional Relation Occurrence (PRO) score for that sense-pair (lines 10 to 12). The PRO score for a given relation R in a sense-pair S is a tuple with two components, as in Equation 1:</Paragraph> <Paragraph position="4"> PRO(R, S) = ( count(R, S) / count(S), count(R, S) / |D| )    (1)
where count(R, S) is the number of compounds in the corpus D that contain sense-pair S with relation R, and count(S) is the total number of compounds that contain sense-pair S.</Paragraph> <Paragraph position="5"> The first term of this tuple is the proportion of times relation R occurs with sense-pair S (in other words, the conditional probability of relation R given sense-pair S); the second term is simply the proportion of times the relation co-occurs with the sense pair in the database of compounds D (in other words, the joint probability of relation R and sense-pair S).
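Figure 3 itself does not survive in this version of the paper, but the search-and-score loop it describes can be sketched as follows. This is a minimal sketch, assuming the corpus is held as (modifier senses, head senses, relation) triples in which each sense set already includes all hypernyms; the names are ours rather than those of Figure 3, and the authors' own implementation (Section 5) was written in Perl rather than Python.

```python
from collections import defaultdict

def pro_candidates(mod_senses, head_senses, corpus):
    """Sketch of the PRO algorithm (cf. Figure 3).

    mod_senses / head_senses: the correct WordNet sense of the
    modifier/head, plus all hypernyms of that sense.
    corpus: list of (mod_sense_set, head_sense_set, relation) triples,
    one per annotated compound, hypernyms included in each sense set.
    Returns candidate relations sorted by PRO score, best first.
    """
    n = len(corpus)                    # |D|, the size of the corpus
    best = {}                          # relation -> maximum PRO tuple
    for m in mod_senses:               # pair every modifier sense...
        for h in head_senses:          # ...with every head sense
            rel_counts = defaultdict(int)
            pair_total = 0             # count(S): compounds matching S
            for mods, heads, rel in corpus:
                if m in mods and h in heads:
                    rel_counts[rel] += 1
                    pair_total += 1
            for rel, c in rel_counts.items():
                score = (c / pair_total, c / n)   # Equation (1)
                # Python's tuple comparison uses the first term as the
                # main key and the second as the tie-breaker, matching
                # the comparison described in the text below.
                if rel not in best or score > best[rel]:
                    best[rel] = score
    return sorted(best, key=best.get, reverse=True)
```

Replacing the relation field of each corpus triple with its annotated relation sense would give the relation-sense variant of the algorithm described at the end of this section.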
The algorithm compares the PRO score obtained for each relation R from the current sense-pair with the score obtained for that relation from any other sense-pair, using the first term of the score tuple as the main key for comparison (lines 14 and 15), and using the second term as a tie-breaker (lines 16 to 18). If the PRO score for relation R in the current sense-pair is greater than the PRO score obtained for that relation with some other sense-pair (or if no previous score for the relation has been entered), the current PRO tuple is recorded for relation R. In this way the algorithm finds the maximum PRO score for each relation R across all possible sense-pairs for the compound in question. The algorithm returns a list of candidate relations for the compound, sorted by PRO score (lines 19 and 20). The relations at the front of that list (those with the highest PRO scores) are those most likely to be the correct relation for that compound.</Paragraph> <Paragraph position="6"> Tests of this algorithm suggest that, in many cases, candidate relations for a given compound will be tied on the first term of their PRO score tuple. The use of the second score-tuple term is therefore an important part of the algorithm. For example, suppose that two competing relations for some compound each have a proportional occurrence of 1.0 (each relation occurs in every occurrence of its sense-pair in the compound corpus). If the first relation occurs 20 times with its selected sense-pair (i.e. there are 20 occurrences of the sense-pair in the corpus, and the relation occurs in each of those 20 occurrences), but the second relation occurs only 2 times with its selected sense-pair (i.e. there are 2 occurrences of that sense-pair in the corpus, and the relation occurs in each of those 2 occurrences), the first relation will be preferred over the second, because there is more evidence for that relation being the correct relation for the compound in question.</Paragraph> <Paragraph position="7"> The algorithm in Figure 3 returns a list of candidate semantic relations for a given compound (returning relations such as 'head carries modifier' for the compound vegetable truck or 'modifier causes head' for the compound storm damage, for example). This algorithm can also return a list of relation senses for a given compound (returning the WordNet verb sense 'carries: moves while supporting, in a vehicle or one's hands' for the relation for the compound vegetable truck but the verb sense 'carries: transmits or serves as the medium for transmission' for the compound radio wave, for example). To return a list of relation senses rather than relations, we replace Crel with CrelSense throughout the algorithm in Figure 3. Section 5 describes a test of both versions of the algorithm.</Paragraph> </Section> <Section position="9" start_page="165" end_page="165" type="metho"> <SectionTitle> 5 Testing the Algorithm </SectionTitle> <Paragraph position="0"> To test the PRO algorithm it was implemented in a Perl program and applied to the corpus of compounds described in Section 3. We applied the program to two tasks: computing the correct relation for a given compound, and computing the correct relation sense for that compound.
We used a 'leave-one-out' cross-validation approach, in which we consecutively removed each compound from the corpus (making it the 'query compound'), recorded the correct relation or relation sense for that compound, and then passed the head and modifier senses of that query compound (plus their hypernyms), together with the corpus of remaining compounds (excluding the query compound), to the Perl program. We carried out this process for each compound in the corpus. The result of this procedure was a list, for each compound, of candidate relations or relation senses sorted by PRO score.</Paragraph> <Paragraph position="1"> We assessed the performance of the algorithm in two ways. We first considered the rank of the correct relation or relation sense for a given compound in the sorted list of candidate relations/relation senses returned by the algorithm. The algorithm always returned a large list of candidate relations or relation senses for each compound (over 100 different candidates were returned for every compound). In the relation selection task, the correct relation for a compound occurred in the first position in this list for 41% of all compounds (1,026 out of 2,500 compounds), and occurred in one of the first 5 positions (in the top 5% of returned candidates) for 72% of all compounds (1,780 compounds). In the relation-sense selection task, the correct relation sense for a compound occurred in the first position in this list for 43% of all compounds, and occurred in one of the first 5 positions for 74% of all compounds. This performance suggests that the algorithm is doing well in both tasks, given the large number of possible relations and relation senses available.</Paragraph> <Paragraph position="2"> Our second assessment considered the precision and the recall of the relations/relation senses returned by the algorithm at different proportional occurrence levels (different levels for the first term in PRO score tuples, as described in Equation 1). For each proportional occurrence level between 0 and 1, we assumed that the algorithm would only return a relation or relation sense when the first relation in the list of candidate relations returned had a score at or above that level. We then counted the total number of compounds for which a response was returned at that level, and the total number of compounds for which a correct response was returned. The precision of the algorithm at a given PRO level was equal to the number of correct responses returned by the algorithm at that PRO level, divided by the total number of responses returned by the algorithm at that level. The recall of the algorithm at a given PRO level was equal to the number of correct responses returned by the algorithm at that level, divided by the total number of compounds in the database (the total number of compounds for which the algorithm could have returned a correct response).</Paragraph>
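This thresholded evaluation is straightforward to express in code. Below is a minimal sketch, assuming each test item has been reduced to the first term of its top candidate's PRO score, its predicted relation (or relation sense), and the gold-standard annotation; the function and variable names are illustrative, not taken from the authors' Perl program.

```python
def precision_recall_at(level, results, n_compounds):
    """Precision and recall when responses are only returned for
    compounds whose top candidate scores at or above `level`.

    results: one (top_score, predicted, gold) triple per compound,
    where top_score is the first term of the best candidate's PRO tuple.
    n_compounds: total number of compounds in the corpus.
    """
    returned = [(p, g) for score, p, g in results if score >= level]
    correct = sum(1 for p, g in returned if p == g)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / n_compounds
    return precision, recall

# For example, precision_recall_at(1.0, results, 2500) should give
# roughly (0.92, 0.35) on the relation selection task, per the
# figures reported below.
```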
<Paragraph position="3"> Figure 4 shows the total number of responses, and the total number of correct responses, returned at each PRO level for the relation selection task. Figure 5 shows the same data for the relation-sense selection task. As both graphs show, as the PRO level increases, the total number of responses returned by the algorithm declines, but the total number of correct responses does not fall significantly. For example, in the relation selection task, at a PRO level of 0 the algorithm returns a response (selects a relation) for all 2,500 compounds, and approximately 1,000 of those responses are correct (the algorithm's precision at this level is 0.41). At a PRO level of 1, the algorithm returns a response (selects a relation) for just over 900 compounds, and approximately 850 of those responses are correct (the algorithm's precision at this level is 0.92). A similar pattern is seen for the relation sense responses returned by the algorithm. These graphs show that with a PRO level around 1, the algorithm makes a relatively small number of errors when selecting the correct relation or relation sense for a given compound (an error rate of less than 10%).</Paragraph> <Paragraph position="4"> The PRO algorithm thus has a high degree of precision in selecting relations for compounds.</Paragraph> <Paragraph position="5"> As Figures 4 and 5 show, the number of correct responses returned by the PRO algorithm did not vary greatly across PRO levels. This means that the recall of the algorithm remained relatively constant across PRO levels: in the relation selection task, for example, recall ranged from 0.41 (at a PRO level of 0) to 0.35 (at a PRO level of 1). A similar pattern occurred in the relation-sense selection task.</Paragraph> </Section> </Paper>