<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1177">
  <Title>Automatic Identification of Infrequent Word Senses</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Method
</SectionTitle>
    <Paragraph position="0"> McCarthy et al. (2004a) describe a method to produce a ranking over senses and find the predominant sense of a word just using raw text. We summarise the method below, and describe how we use it for identifying candidate senses for filtering.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Ranking the Senses
</SectionTitle>
      <Paragraph position="0"> In order to rank the senses of a target word (e.g.</Paragraph>
      <Paragraph position="1"> plant) we use a thesaurus acquired from automatically parsed text (section 2.2 below). This provides the k nearest neighbours to each target word (e.g.</Paragraph>
      <Paragraph position="2"> factory, refinery, tree, etc.) along with the distributional similarity score between the target word and its neighbour. We then use the WordNet similarity package (Patwardhan and Pedersen, 2003) (see section 2.3) to give us a semantic similarity measure (hereafter referred to as the WordNet similarity measure) to weight the contribution that each neighbour (e.g. factory) makes to the various senses of the target word (e.g. the flora, industrial plant and actor senses).</Paragraph>
      <Paragraph position="3"> We take each sense of the target word (w) in turn and obtain a score reflecting the prevalence which is used for ranking. Let N_w = {n_1, n_2, ..., n_k} be the ordered set of the top scoring k neighbours of w from the thesaurus with associated distributional similarity scores {dss(w, n_1), dss(w, n_2), ..., dss(w, n_k)}. Let senses(w) be the set of senses of w. For each sense of w (ws_i ∈ senses(w)) we obtain a ranking score by summing over the dss(w, n_j) of each neighbour (n_j ∈ N_w) multiplied by a weight. This weight is the WordNet similarity score (wnss) between the target sense (ws_i) and the sense of n_j (ns_x ∈ senses(n_j)) that maximises this score, divided by the sum of all such WordNet similarity scores for senses(w) and n_j.</Paragraph>
      <Paragraph position="4"> Thus we rank each sense ws_i ∈ senses(w) using:</Paragraph>
      <Paragraph position="6"> PrevalenceScore(ws_i) = Σ_{n_j ∈ N_w} dss(w, n_j) × wnss(ws_i, n_j) / Σ_{ws_i' ∈ senses(w)} wnss(ws_i', n_j)   (1)</Paragraph>
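The prevalence ranking described above can be sketched in Python as follows. This is our own illustrative sketch, not the authors' implementation; the helper names dss, wnss and senses_of are ours, standing in for the distributional similarity scores, the WordNet similarity measure, and the sense inventory.

```python
def prevalence_scores(word, neighbours, senses_of, dss, wnss):
    """Rank the senses of `word` in the style of McCarthy et al. (2004a).

    neighbours:  the top-k distributionally similar words n_j to `word`
    senses_of:   maps a word to its list of senses
    dss(w, n):   distributional similarity between two words
    wnss(ws, n): max WordNet similarity between sense ws and any sense of n
    """
    scores = {}
    for ws in senses_of(word):
        total = 0.0
        for n in neighbours:
            # Normalise by the wnss mass over all senses of the target word,
            # so each neighbour spreads its dss weight across the senses.
            norm = sum(wnss(ws2, n) for ws2 in senses_of(word))
            if norm > 0:
                total += dss(word, n) * wnss(ws, n) / norm
        scores[ws] = total
    # Highest prevalence score first; the top entry is the predominant sense.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With toy neighbours such as tree and refinery for plant, a neighbour strongly similar to the flora sense pushes that sense up the ranking.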
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Acquiring the Automatic Thesaurus
</SectionTitle>
      <Paragraph position="0"> There are many alternative distributional similarity measures proposed in the literature; for this work we used the measure and thesaurus construction method described by Lin (1998). For input we used grammatical relation data extracted using an automatic parser (Briscoe and Carroll, 2002). For each noun we considered the co-occurring verbs in the direct object and subject relations, the modifying nouns in noun-noun relations and the modifying adjectives in adjective-noun relations. We could easily extend the set of relations in the future. A noun, w, is thus described by a set of co-occurrence triples ⟨w, r, x⟩ and associated frequencies, where r is a grammatical relation and x is a possible co-occurrence with w in that relation. For every pair of nouns, we computed their distributional similarity. If T(w) is the set of co-occurrence types (r, x) such that I(w, r, x) is positive, then the similarity between two nouns, w and n, can be computed as:</Paragraph>
      <Paragraph position="1"> sim(w, n) = Σ_{(r,x) ∈ T(w) ∩ T(n)} (I(w, r, x) + I(n, r, x)) / (Σ_{(r,x) ∈ T(w)} I(w, r, x) + Σ_{(r,x) ∈ T(n)} I(n, r, x))</Paragraph>
      <Paragraph position="2"> A thesaurus entry of size k for a target noun w is then defined as the k most similar nouns to w.</Paragraph>
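Lin's similarity measure and the thesaurus-entry construction can be sketched as follows. This is an illustrative reimplementation under our own variable names, with the mutual-information weights I(w, r, x) supplied as precomputed dictionaries:

```python
def lin_similarity(I_w, I_n):
    """Lin (1998) distributional similarity between two nouns.

    I_w, I_n: dicts mapping co-occurrence types (r, x) -- a grammatical
    relation and a co-occurring word -- to positive mutual-information
    weights I(w, r, x).  T(w) is simply the key set of I_w.
    """
    shared = set(I_w) & set(I_n)                 # T(w) intersect T(n)
    num = sum(I_w[t] + I_n[t] for t in shared)
    den = sum(I_w.values()) + sum(I_n.values())
    return num / den if den else 0.0

def thesaurus_entry(target, pair_sims, k):
    """The k nouns most similar to `target`, given pairwise similarities
    as a dict mapping (word, neighbour) pairs to similarity scores."""
    sims = [(n, s) for (w, n), s in pair_sims.items() if w == target]
    return sorted(sims, key=lambda ns: ns[1], reverse=True)[:k]
```

Two nouns sharing highly weighted co-occurrence types (e.g. both frequently appearing as the direct object of the same verbs) receive a similarity close to 1.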
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 The WordNet Similarity Package
</SectionTitle>
      <Paragraph position="0"> We use the WordNet Similarity Package 0.05 and WordNet version 1.6. (We use this version of WordNet since it would in principle allow us to map information to WordNets of other languages more accurately; the method can, however, be applied to other versions of WordNet.)</Paragraph>
      <Paragraph position="1"> The WordNet Similarity package supports a range of WordNet similarity scores. We used the jcn measure to give results for the wnss function in equation 1 above, since this has given us good results for other experiments, and is efficient given the precompilation of the required frequency (information content) files. We discuss the merits of investigating other semantic similarity scores in section 6.</Paragraph>
      <Paragraph position="2"> The jcn (Jiang and Conrath, 1997) measure provides a similarity score between two WordNet senses (s1 and s2), these being synsets within WordNet. The measure uses corpus data to populate classes (synsets) in the WordNet hierarchy with frequency counts. Each synset is incremented with the frequency counts from the corpus of all words belonging to that synset, directly or via the hyponymy relation. The frequency data is used to calculate the "information content" (IC) of a class: IC(s) = -log(p(s)). Jiang and Conrath specify a distance measure: D_jcn(s1, s2) = IC(s1) + IC(s2) - 2 × IC(s3), where the third class, s3, is the most informative, or most specific, superordinate synset of the two senses s1 and s2. This is transformed from a distance measure in the WN-Similarity package by taking the reciprocal: jcn(s1, s2) = 1 / D_jcn(s1, s2).</Paragraph>
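The IC and jcn computations can be illustrated with a toy sketch over hand-built class frequency counts (this is not the WordNet-Similarity package itself; the counts and function names are ours):

```python
import math

def information_content(freq, total):
    """IC(s) = -log(p(s)), with p(s) estimated from corpus frequency
    counts accumulated over a synset and its hyponyms."""
    return -math.log(freq / total)

def jcn_similarity(ic1, ic2, ic_lcs):
    """jcn similarity: reciprocal of the Jiang-Conrath distance
    D(s1, s2) = IC(s1) + IC(s2) - 2 * IC(s3), where s3 is the most
    informative common superordinate synset of the two senses."""
    dist = ic1 + ic2 - 2.0 * ic_lcs
    return 1.0 / dist if dist > 0 else float("inf")
```

Note that two very specific senses under a very general common superordinate yield a large distance and hence a small jcn similarity.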
      <Paragraph position="4"> The jcn measure uses corpus data for the calculation of IC. The experimental results reported here are obtained using IC counts from the BNC corpus with the resnik count option available in the WordNet similarity package. We did not use the default IC counts provided with the package since these are derived from the hand-tagged data in SemCor. All the results shown here are those with the size of thesaurus entries (k) set to 50.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Filtering
</SectionTitle>
      <Paragraph position="0"> We use equation 1 above to produce ranking scores for the senses senses(w) of a target word w. We then use a threshold T which is a constant percentage (T%) of the ranking score of the first-ranked sense. Any senses with scores lower than T are identified for filtering. This threshold permits the filtering to be sensitive to the ranking scores of the word in question.</Paragraph>
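The filtering step itself is straightforward and can be sketched as follows (our own sketch; `ranked_scores` stands for the output of the sense ranking of section 2.1):

```python
def filter_senses(ranked_scores, pct):
    """Identify sense types for filtering.

    ranked_scores: dict mapping each sense of a word to its ranking score
    pct: the constant percentage T% expressed as a fraction (0.5 for 50%)
    Returns the senses whose score falls below the threshold
    T = pct * (score of the first-ranked sense).
    """
    top = max(ranked_scores.values())
    threshold = pct * top
    # Because T is relative to the top-ranked sense, the cut-off adapts
    # to the overall score level of the word in question.
    return [s for s, sc in ranked_scores.items() if threshold > sc]
```

Raising pct filters more senses; at pct near 1 everything but near-top-ranked senses is removed.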
      <Paragraph position="1"> Using other values of k gave only minimal changes to the results.</Paragraph>
      <Paragraph position="2"> We generated a thesaurus entry for all polysemous nouns which occurred in SemCor with a frequency > 2, and in the BNC with a frequency ≥ 10. We experiment with T% ∈ {10%, 20%, ..., 90%}. For these experiments we evaluate using the gold-standard sense-tagged data available in i) SemCor and ii) the English SENSEVAL-2 all-words task. For each value of T% we compute the number of sense types filtered (filtypes), and the percentage of these that are correctly filtered (filtypesacc) in that they do not occur at all in our gold-standard. We also compute, for those types that do occur, filtokerr_I, the percentage of sense tokens that would be filtered incorrectly from the gold-standard by their removal from WordNet. filtokerr_II is the percentage of sense tokens that would be filtered incorrectly for the subset of words for which there are tokens filtered.</Paragraph>
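These evaluation quantities can be computed roughly as follows (our own sketch, with our variable names; `gold` maps each sense type to its token frequency in the gold-standard, sense types that never occur being simply absent):

```python
def filtering_stats(filtered, gold):
    """Compute filtering evaluation quantities.

    filtered: list of sense types identified for filtering
    gold:     dict mapping sense types to gold-standard token frequencies
    Returns (filtypes, filtypesacc, filtokerr): the number of types
    filtered, the percentage of those that never occur in the gold
    standard (correctly filtered), and the percentage of gold-standard
    tokens that their removal would delete in error.
    """
    filtypes = len(filtered)
    correct = sum(1 for s in filtered if gold.get(s, 0) == 0)
    filtypesacc = 100.0 * correct / filtypes if filtypes else 0.0
    total_tokens = sum(gold.values())
    err_tokens = sum(gold.get(s, 0) for s in filtered)
    filtokerr = 100.0 * err_tokens / total_tokens if total_tokens else 0.0
    return filtypes, filtypesacc, filtokerr
```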
      <Paragraph position="3"> The results when using the ranking scores derived from the BNC thesaurus for filtering the senses in SemCor are shown in table 1 for different values of T%. For polysemous nouns in SemCor, the percentage of sense types that do not occur is 38%, so if we filtered randomly we could expect to get 38% accuracy. filtypesacc is well above this baseline for all values of T%. Whilst there are sense types in SemCor that are filtered erroneously, these are senses which occur less frequently than the non-filtered types. Furthermore, they account for a relatively small percentage of tokens for the filtered words, as shown by filtokerr_II. Table 2 shows that filtokerr_I is lower than would be expected if the sense types which are filtered had average frequency. There are 10687 sense types for the polysemous nouns in SemCor, of which 6573 actually occur. The number of sense types filtered in error for each value of T% is shown by filtypeserr. The proportion of tokens expected for the given filtypeserr, if the filtered types were of average frequency, is given by tokexp = filtypeserr / 6573. For the highest value of T% = 90%, 3099 types are identified for filtering; this comprises 47% of the types occurring in SemCor. However, filtokerr_I shows that only 39% of tokens are filtered. As the value of T% decreases, we filter fewer sense types and fewer tokens in error, and the ratio between tokexp and filtokerr_I increases.
The compromise between the number of sense types filtered and the removal of tokens in error will depend on the needs of the application, and can be altered with the threshold percentage T%.</Paragraph>
      <Paragraph position="5"> et al., 2001) is a much smaller sample of hand-tagged text compared to SemCor, comprising three documents from the Wall Street Journal section of  nouns occurring in this corpus, there are 77% sense types which do not occur. The results in table 3 show much higher values for a165a53a166a32a167a21a134a168a40a21a169a33a137a133a137 because of this higher baseline (77%). The filtering results nevertheless show superior performance to this base-line at all levels of a98a99a147 . This time there are no sense types filtered for a98a41a147 a7a184a143a117a162 . The frequencies of the types filtered in error are close to the values of a166a36a67a12a2 a83a133a55 , as shown in table 4. This is because the corpus is very small. Many types do not occur and many types have a low frequency, regardless of whether they are filtered or not.</Paragraph>
      <Paragraph position="6"> In this section we demonstrated that the ranking scores can be used alongside a threshold to remove senses which are considered rare for the corpus data at hand, that the majority of sense types filtered in this way do not occur in our test data, and that those that do typically have a low or average frequency.</Paragraph>
      <Paragraph position="7"> There are of course differences between the BNC corpus that we used to create our sense ranking and the test corpora; however, since the BNC is a balanced corpus, we feel that this is a feasible means of evaluation, and the results bear this out. A main advantage of our approach is that it enables us to tailor a resource such as WordNet to domain-specific text, and it is to this that we now turn.</Paragraph>
    </Section>
  </Section>
</Paper>