<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0808"> <Title>Sense Discrimination with Parallel Corpora</Title> <Section position="3" start_page="0" end_page="7" type="intro"> <SectionTitle> 2 Methodology </SectionTitle>
<Paragraph position="0"> We conducted a study using parallel, aligned versions of George Orwell's Nineteen Eighty-Four (Erjavec and Ide, 1998) in seven languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, and Hungarian. The study involves languages from four language families (Germanic, Romance, Slavic, and Finno-Ugric), three languages from the same family (Czech, Slovene, and Bulgarian), as well as two non-Indo-European languages (Estonian and Hungarian). Although Nineteen Eighty-Four (ca. 100,000 words) is a work of fiction, Orwell's prose is not highly stylized and, as such, it provides a reasonable sample of modern, ordinary language that is not tied to a given topic or sub-domain (as is the case for newspapers, technical reports, etc.).</Paragraph>
<Paragraph position="1"> Furthermore, the translations of the text seem to be relatively faithful to the original: over 95% of the sentence alignments in the full parallel corpus of seven languages are one-to-one (Priest-Dorman et al., 1997).</Paragraph>
<Section position="1" start_page="0" end_page="4" type="sub_section"> <SectionTitle> 2.1 Preliminary Experiment </SectionTitle>
<Paragraph position="0"> We constructed a multilingual lexicon based on the Orwell corpus, using a method outlined in Tufis and Barbu (2001, 2002). The complete English Orwell contains 7,069 different lemmas, while the computed lexicon comprises 1,233 entries, of which 845 have (possibly multiple) translation equivalents in all languages. We then conducted a preliminary study using a subset of 33 nouns covering a range of frequencies and degrees of ambiguity (Ide et al., 2001).</Paragraph>
<Paragraph position="1"> For each noun in the sample, we extracted all sentences from the English Nineteen Eighty-Four containing the lemma in question, together with the parallel sentences from each of the six translations. The aligned sentences were automatically scanned to extract translation equivalents; sentences in which more than one translation equivalent appeared were eliminated (ca. 5% of the translations).</Paragraph>
<Paragraph position="2"> A vector was then created for each occurrence, representing all possible lexical translations in the six parallel versions: if a given word is used to translate that occurrence, the vector contains a 1 in the corresponding position, and a 0 otherwise. The vectors for each ambiguous word were fed to an agglomerative clustering algorithm (Stolcke, 1996), and the resulting clusters are taken to represent different senses and sub-senses of the word in question.</Paragraph>
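To make the representation concrete, the following minimal sketch builds binary translation-equivalent vectors for a few occurrences of one English noun and clusters them hierarchically. It is not the authors' implementation (they used the agglomerative clusterer of Stolcke, 1996); it uses SciPy instead, and the translations, distance threshold, and helper names are illustrative only.

# Minimal sketch (not the authors' code): build binary translation-equivalent
# vectors for each occurrence of an ambiguous English noun and cluster them
# agglomeratively. The occurrence data below is hypothetical.
from scipy.cluster.hierarchy import linkage, fcluster
import numpy as np

# One record per occurrence: the translation used in each non-English language.
occurrences = [
    {"ro": "pahar",  "cs": "sklenice", "hu": "pohar"},  # drinking vessel
    {"ro": "sticla", "cs": "sklo",     "hu": "uveg"},   # solid material
    {"ro": "pahar",  "cs": "sklenice", "hu": "pohar"},
]

# Every (language, translation) pair seen for this noun becomes one vector position.
dimensions = sorted({(lang, w) for occ in occurrences for lang, w in occ.items()})

def to_vector(occ):
    """1 if that translation is used for this occurrence, 0 otherwise."""
    return [1 if occ.get(lang) == w else 0 for lang, w in dimensions]

X = np.array([to_vector(occ) for occ in occurrences], dtype=float)

# Agglomerative (bottom-up) clustering of the occurrence vectors.
tree = linkage(X, method="average", metric="hamming")

# The paper flattens the hierarchy by merging clusters below a minimum distance
# (1.7 in the first experiment, on the scale of the clusterer it used); the
# threshold here is illustrative for the hamming scale.
senses = fcluster(tree, t=0.5, criterion="distance")
print(senses)  # e.g. [1 2 1]: occurrences 1 and 3 fall into the same sense cluster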
<Paragraph position="3"> The clusters produced by the algorithm were compared with sense assignments made by two human annotators on the basis of WordNet 1.6. (Originally, the annotators attempted to group occurrences without reference to an externally defined sense set, but this proved inordinately difficult, produced highly variable results, and was eventually abandoned.)</Paragraph>
<Paragraph position="4"> In order to compare the algorithm's results with the annotators' sense assignments, we normalized the data as follows: for each annotator and for the algorithm, each of the 33 words was represented as a vector of length n(n-1)/2, where n is the number of occurrences of the word in the corpus. The positions in the vector represent a "yes-no" assignment for each pair of occurrences, indicating whether or not they were judged to have the same sense (the same WordNet sense for the annotators, and the same cluster for the algorithm).</Paragraph>
<Paragraph position="5"> Representing the clustering algorithm's results in this form required some means to "flatten" the cluster hierarchies, which typically extend to 5 or 6 levels, to conform more closely to the completely flat WordNet-based data. Therefore, clusters with a minimum distance value (as assigned by the clustering algorithm) at or below 1.7 were combined, and each leaf of the resulting collapsed tree was treated as a different sense. This yielded a set of sense distinctions for each word roughly similar in number to those assigned by the annotators. (We used the number of senses the annotators assigned, rather than the number of WordNet senses, as a guide to determine the minimum distance cutoff, because many WordNet senses are not represented in the corpus.)</Paragraph>
<Paragraph position="6"> The cluster output for glass in Figure 1 is an example of the results obtained from the clustering algorithm. For clarity, the occurrences have been manually labeled with WordNet 1.6 senses (Figure 2). The tree shows that the algorithm correctly grouped occurrences corresponding to WordNet sense 1 (a solid material) in one of the two main branches, and those corresponding to sense 2 (drinking vessel) in the other. The top group is further divided into two sub-clusters, which refer to a looking glass and a magnifying glass, respectively. While this is a particularly clear example of good results from the clustering algorithm, results for other words are, for the most part, similarly reasonable.</Paragraph>
<Paragraph position="9"> Figure 1: Output of the clustering algorithm for glass.
Figure 2: WordNet 1.6 senses of glass:
1. a brittle transparent solid with irregular atomic structure
2. a glass container for holding liquids while drinking
3. the quantity a glass will hold
4. a small refracting telescope
5. a mirror; usually a ladies' dressing mirror
6. glassware collectively; "She collected old glass"
The results of the first experiment are summarized in Table 1, which shows the percentage of agreement between the clustering algorithm and each annotator, between the two annotators, and for the algorithm and both annotators taken together. The percentages are similar to those reported in earlier work; for example, Ng et al. (1999) achieved a raw percentage score of 58% agreement among annotators tagging nouns with WordNet 1.6 senses.</Paragraph>
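The pairwise normalization described above, and the raw agreement percentages reported in Table 1, can be sketched as follows. This is an illustrative reconstruction, not the code used in the study; the labellings below are hypothetical.

# Sketch (hypothetical data): convert a sense labelling of n occurrences into a
# vector of n(n-1)/2 pairwise "same sense?" judgments, then compute raw agreement.
from itertools import combinations

def pairwise_vector(labels):
    """1 if occurrences i and j received the same label, else 0 (for all i < j)."""
    return [int(labels[i] == labels[j]) for i, j in combinations(range(len(labels)), 2)]

def raw_agreement(labels_a, labels_b):
    """Fraction of occurrence pairs on which two labellings agree."""
    va, vb = pairwise_vector(labels_a), pairwise_vector(labels_b)
    return sum(a == b for a, b in zip(va, vb)) / len(va)

# Hypothetical example: WordNet senses from an annotator vs. cluster ids from
# the algorithm for the same seven occurrences of one word.
annotator = [1, 2, 1, 1, 2, 1, 4]
algorithm = [7, 3, 7, 7, 3, 7, 3]   # cluster ids; the numbering itself is irrelevant
print(round(raw_agreement(annotator, algorithm), 2))  # 0.9 (19 of 21 pairs agree)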
</Section> <Section position="2" start_page="4" end_page="7" type="sub_section"> <SectionTitle> 2.2 Second experiment </SectionTitle>
<Paragraph position="0"> Comparison of the sense differentiation achieved using translation equivalents, as determined by the clustering algorithm, with that assigned by human annotators suggests that the use of translation equivalents for word sense tagging and disambiguation is worth pursuing. Agreement levels are comparable to (and in some cases higher than) those obtained in earlier studies tagging with WordNet senses. Furthermore, the difference between inter-annotator agreement and the agreement between each annotator and the clustering algorithm is only 10-13%, which is also similar to scores obtained in other studies. (We computed raw percentages only; common measures of annotator agreement such as the Kappa statistic (Carletta, 1996) proved to be inappropriate for our two-category, "yes-no" classification scheme.)</Paragraph>
<Paragraph position="1"> In the second phase, the experiment was broadened to include 76 nouns from the multilingual lexicon, including words with varying degrees of ambiguity (the number of WordNet senses ranges from 2 to 29, with an average of 7.09) and varying semantic characteristics (e.g., abstract vs. concrete: "thought", "stuff", "meaning", "feeling" vs. "hand", "boot", "glass", "girl", etc.). We chose nouns that occur a minimum of 10 times in the corpus, have no undetermined translations and at least five different translations in the six non-English languages, and have a log-likelihood (LL) score of at least 18, where the LL score is computed from the number of times the English word is aligned with a candidate translation equivalent, relative to the number of potential translation equivalents in the parallel corpus. The LL threshold is set at a high value to ensure high precision for the extracted translation equivalents, which minimizes sense clustering errors due to incorrect word alignment.</Paragraph>
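As a point of reference only (the exact formulation in the original paper may differ), the log-likelihood association score conventionally used for extracting translation equivalents from aligned corpora is computed over the 2x2 contingency table of aligned-sentence co-occurrence counts for an English word w and a candidate translation t:

% Standard log-likelihood (G^2) association score; notation is illustrative,
% not necessarily that of the paper.
\[
  LL(w, t) \;=\; 2 \sum_{i \in \{w,\,\neg w\}} \; \sum_{j \in \{t,\,\neg t\}}
      n_{ij} \, \log \frac{n_{ij} \, N}{n_{i+} \, n_{+j}}
\]

Here n_{ij} counts the aligned sentence pairs in which the English side does or does not contain w and the target side does or does not contain t, n_{i+} and n_{+j} are the corresponding marginal counts, and N is the total number of aligned pairs considered; higher scores indicate a more reliable translation pairing.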
<Paragraph position="6"> Table 2 summarizes the data.</Paragraph>
<Paragraph position="7"> In this second experiment, we increased the number of annotators to four. The results of the clustering algorithm and the sense assignments made by the human annotators were normalized differently than in the earlier experiment: sense numbers were ignored, and the annotators' sense assignments were interpreted as clusters only. To see why this was necessary, consider the set of sense assignments made by two annotators for the seven occurrences of one of the words: raw agreement is 43%; however, both annotators classify occurrences 1, 4, and 6 as having the same sense, although each assigned a different sense number to the group. If we ignore sense numbers and consider only the annotators' "clusters", the agreement rate is much higher, and the data is more comparable to that obtained from the clustering algorithm. (In fact, the only remaining disagreement is that Annotator 1 assigns occurrences 3 and 5 together, whereas Annotator 2 assigns a different sense to occurrence 3; in effect, Annotator 2 makes a finer distinction than Annotator 1 between occurrences 3 and 5.)</Paragraph>
<Paragraph position="10"> We also addressed the issue of the appropriate point at which to cut off the clustering produced by the algorithm.</Paragraph>
<Paragraph position="11"> Our use of a pre-defined minimum distance value to determine the number of clusters (senses) in the earlier experiment yielded varying results for different words (especially words with significantly different numbers of translation equivalents), and we sought a more principled means to determine the cut-off value. The clustering algorithm was therefore modified to compute the correct number of clusters automatically, by halting the clustering process when the number of clusters reached a value similar to the average number obtained by the annotators. (In principle, the upper limit for the number of senses for a word is the number of senses in WordNet 1.6; however, there was no case in which all WordNet senses appeared in the text.)</Paragraph>
<Paragraph position="12"> As the criterion, we used the minimum distance between existing clusters at each iteration, which determines the two clusters to be joined. Best results were obtained when the clustering was stopped at the point where (dist(k) - dist(k+1)) / dist(k+1) < 0.12, where dist(k) is the minimal distance between two clusters at the k-th iteration step.</Paragraph>
<Paragraph position="13"> We defined a "gold standard" annotation by taking the majority vote of the four annotators (in case of ties, the annotator closest to the majority vote in the greatest number of cases was considered to be right). Using this heuristic, the clustering algorithm assigned the same number of senses as the gold standard for 41 words. However, overall agreement was much worse (67.9%) than when the number of clusters was pre-specified. The vast majority of clustering errors occurred when sense distributions were skewed; we therefore added a post-processing phase in which the smallest clusters are eliminated and their members are folded into the largest cluster when the number of occurrences in the largest cluster is at least ten times that of any other cluster. (The factor of 10 is a conservative threshold; additional experiments might yield evidence for a lower value.)</Paragraph>
<Paragraph position="14"> With this new heuristic, the algorithm produced the same number of clusters as the gold standard for only 15 words, but overall agreement reached 74.6%. Mismatching clusters typically included only one element, and there were only five words for which a difference in the number of clusters assigned by the gold standard vs. the algorithm contributed significantly to the 2.7% loss in agreement.</Paragraph>
<Paragraph position="15"> We also experimented with eliminating the data for "non-contributing" languages (i.e., languages for which there is only one translation for the target word); this was ultimately abandoned because it worsened results by amplifying the effect of synonymous translations in other languages.</Paragraph>
<Paragraph position="16"> Finally, we compared the use of weighted vs. unweighted clustering algorithms (see, e.g., Yarowsky and Florian, 1999) and determined that results were improved using weighted clustering. The clusters produced by each pair of classifiers (human or machine) were mapped for maximum overlap; differences were considered to be divergences. The agreement between two different classifications was computed as the number of common occurrences in the corresponding clusters of the two classifications, divided by the total number of occurrences of the target word. For example, the word movement occurs 40 times in the corpus; both the gold standard and the algorithm identified four clusters, but the distribution of the 40 occurrences across them was substantially different, as summarized in Table 3. Thirty-four of the 40 occurrences appear in the clusters common to the two classifications; therefore, the agreement rate is 85%.</Paragraph>
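A minimal sketch of this overlap-based agreement score follows. It is an illustrative reconstruction, not the authors' code: clusters are mapped one-to-one by exhaustively maximizing overlap, and the distributions below are made up (chosen so the score matches the 85% figure reported for movement).

# Sketch (hypothetical data): map the clusters of two classifications onto each
# other for maximum overlap, then score agreement as shared occurrences / total.
from itertools import permutations

def agreement(clusters_a, clusters_b):
    """clusters_a, clusters_b: lists of sets of occurrence ids for one word,
    each list covering the same occurrences."""
    total = sum(len(c) for c in clusters_a)
    # Pad the shorter side with empty clusters so a one-to-one mapping exists.
    k = max(len(clusters_a), len(clusters_b))
    a = clusters_a + [set()] * (k - len(clusters_a))
    b = clusters_b + [set()] * (k - len(clusters_b))
    # Exhaustively try every one-to-one mapping (fine for a handful of senses).
    best = max(sum(len(a[i] & b[j]) for i, j in enumerate(perm))
               for perm in permutations(range(k)))
    return best / total

# Made-up distributions for a word with 40 occurrences and 4 clusters per side.
gold = [set(range(0, 20)), set(range(20, 30)), set(range(30, 36)), set(range(36, 40))]
algo = [set(range(0, 16)), set(range(16, 32)), set(range(32, 36)), set(range(36, 40))]
print(agreement(gold, algo))  # 0.85: 34 of 40 occurrences fall in corresponding clusters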
</Section> <Section position="3" start_page="7" end_page="7" type="sub_section"> <SectionTitle> 2.3 Results </SectionTitle>
<Paragraph position="0"> The results of our second experiment are summarized in Table 4, which gives the agreement rates among a baseline clustering (B), in which all occurrences are assumed to be labeled with the same sense; each pair of human annotators (1-4); the gold standard (G); and the clustering algorithm (A). The table shows that the agreement rates between the algorithm and all but one of the annotators are not significantly different from the agreement rates among the human annotators themselves, and that the algorithm's highest level of agreement is with the baseline. The latter is not surprising, given the second heuristic used. However, the algorithm's second-best agreement rate is with the gold standard, which suggests that sense distinctions determined using the algorithm are almost as reliable as sense distinctions determined manually.</Paragraph>
<Paragraph position="1"> The agreement of the algorithm with the gold standard falls slightly below that of the human annotators, but it is still well within the range of acceptability. Also, given that the gold standard was computed on the basis of the human annotations, it is understandable that these annotations do better than the algorithm.</Paragraph> </Section> </Section> </Paper>