<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-2018">
  <Title>Exploring the Sense Distributions of Homographs</Title>
  <Section position="3" start_page="0" end_page="156" type="intro">
    <SectionTitle>
2 Methodology
</SectionTitle>
    <Paragraph position="0"> Our starting point is a list of 288 ambiguous words (homographs) where each comes together with two associated words that are typical of one sense and a third associated word that is typical of another sense. Table 1 shows the first ten entries in the list. It has been derived from the University of South Florida homograph norms (Nelson et al., 1980) and is based on a combination of native speakers' intuition and the expertise of specialists.</Paragraph>
    <Paragraph position="1"> The University of South Florida homograph norms comprise 320 words which were all selected from Roget's International Thesaurus (1962). Each word has at least two distinct meanings that were judged as likely to be understood by everyone. As described in detail in Nelson et al. (1980), the compilation of the norms was conducted as follows: 46 subjects wrote down the first word that came to mind for each of the 320 homographs. In the next step, for each homograph semantic categories were chosen to reflect  its meanings. All associative responses given by the subjects were assigned to one of these categories. This was first done by four judges individually, and then, before final categorization, each response was discussed until a consensus was achieved.</Paragraph>
    <Paragraph position="2"> The data used in our study (first ten items shown in Table 1) was extracted from these norms by selecting for each homograph the first two words relating to its first meaning and the first word relating to its second meaning.</Paragraph>
    <Paragraph position="3"> Thereby we had to abandon those homographs where all of the subjects' responses had been assigned to a single category, so that only one category appeared in the homograph norms. This was the case for 32 words, which is the reason that our list comprises only 288 instead of 320 items. Another resource that we use is the British National Corpus (BNC), which is a balanced sample of written and spoken English that comprises about 100 million words (Burnard &amp; Aston, 1998). This corpus was used without special preprocessing, i.e. stop words were not removed and no stemming was conducted. From the corpus we extracted concordances comprising text windows of a certain width (e.g. plus and minus 20 words around the given word) for each of the 288 homographs. For each concordance we computed two counts: The first is the number of concordance lines where the two words associated with sense 1 occur together. The second is the number of concordance lines where the first word associated with sense 1 and the word associated with sense 2 co-occur. The expectation is that the first count should be higher as words associated to the same sense should co-occur more often than words associated to different senses.</Paragraph>
    <Paragraph position="4"> sense 1 sense 2homo- null graph firstasso-ciation(w1) secondasso-ciation(w2) firstasso-ciation(w3) arm leg hand war ball game base dance bar drink beer crow bark dog loud tree base ball line bottom bass fish trout drum bat ball boy fly bay Tampa water hound bear animal woods weight beam wood ceiling light  associations to their first and second senses. However, as absolute word frequencies can vary over several orders of magnitude and as this effect could influence our co-occurrence counts in an undesired way, we decided to take this into account by dividing the co-occurrence counts by the concordance frequency of the second words in our pairs. We did not normalize for the frequency of the first word as it is identical for both pairs and therefore represents a constant factor. Note that we normalized for the observed frequency within the concordance and not within the entire corpus.</Paragraph>
    <Paragraph position="5"> If we denote the first word associated to sense 1 with w1, the second word associated with sense 1 with w2, and the word associated with sense 2 with w3, the two scores s1 and s2 that we compute can be described as follows: In cases where the denominator was zero we assigned a score of zero to the whole expression. For all 288 homographs we compared s1 to s2. If it turns out that in the vast majority of cases s1 is higher than s2, then this result would be an indicator that it is promising to use such co-occurrence statistics for the assignment of context words to senses. On the other hand, should this not be the case, the conclusion would be that this approach does not have the potential to work and should be discarded.</Paragraph>
    <Paragraph position="6"> As in statistics the results are often not as clear cut as would be desirable, for comparison we conducted another experiment to help us with the interpretation. This time the question was whether our results were caused by properties of the homographs or if we had only measured properties of the context words w1, w2 and w3.</Paragraph>
    <Paragraph position="7"> The idea was to conduct the same experiment again, but this time not based on concordances but on the entire corpus. However, considering the entire corpus would make it necessary to use a different kind of text window for counting the co-occurrences as there would be no given word to center the text window around, which could lead to artefacts and make the comparison problematic. We therefore decided to use concordances again, but this time not the concordances of the homographs (first column in Table 1) but the concordances of all 288 instances of w1 (second column in Table 1). This way we had exactly number of lines where w1 and w2 co-occurs1 = occurrence count of w2 within concordance number of lines where w1 and w3 co-occurs2 = occurrence count of w3 within concordance  the same window type as in the first experiment, but this time the entire corpus was taken into account as all co-occurrences of w2 or w3 with w1 must necessarily appear within the concordance of w1.</Paragraph>
    <Paragraph position="8"> We name the scores resulting from this experiment s3 and s4, where s3 corresponds to s1 and s4 corresponds to s2, with the only difference being that the concordances of the homographs are replaced by the concordances of the instances of w1. Regarding the interpretation of the results, if the ratio between s3 and s4 should turn out to be similar to the ratio between s1 and s2, then the influence of the homographs would be marginally or non existent. If there should be a major difference, then this would give evidence that, as desired, a property of the homograph has been measured.</Paragraph>
  </Section>
class="xml-element"></Paper>