<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1039">
  <Title>Relieving The Data Acquisition Bottleneck In Word Sense Disambiguation</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Empirical Layout
</SectionTitle>
    <Paragraph position="0"> Similar to Mihalcea's approach, we compare results obtained by a supervised WSD system for English using manually sense annotated training examples against results obtained by the same WSD system trained on SALAAM sense tagged examples. The test data is the same, namely, the SENSEVAL 2 English Lexical Sample test set. The supervised WSD system chosen here is the University of Maryland System for SENSEVAL 2 Tagging (a0a2a1a4a3a5a3a7a6 ) (Cabezas et al. , 2002).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 UMSST
</SectionTitle>
      <Paragraph position="0"> The learning approach adopted by a0a2a1a4a3a11a3a10a6 is based on Support Vector Machines (SVM). a0a8a1a9a3a5a3a10a6 uses SVM-a12a14a13a16a15a18a17a20a19a22a21a24a23 by Joachims (Joachims, 1998).1 For each target word, where a target word is a test item, a family of classifiers is constructed, one for each of the target word senses. All the positive examples for a sense a25 a3a27a26a16a28 are considered the negative examples of a25 a3a30a29a31a28 , where a13a33a32a34a36a35 .(Allwein et al., 2000) In a0a8a1a9a3a5a3a10a6 , each target word is considered an independent classification problem.</Paragraph>
      <Paragraph position="1"> The features used for a0a2a1a4a3a11a3a10a6 are mainly contextual features with weight values associated with each feature. The features are space delimited units,  tokens, extracted from the immediate context of the target word. Three types of features are extracted: a37 Wide Context Features: All the tokens in the paragraph where the target word occurs.</Paragraph>
      <Paragraph position="2"> a37 Narrow Context features: The tokens that collocate in the surrounding context, to the left and right, with the target word within a fixed window size of a38 .</Paragraph>
      <Paragraph position="3"> a37 Grammatical Features: Syntactic tuples such as verb-obj, subj-verb, etc. extracted from the context of the target word using a dependency parser, MINIPAR(Lin, 1998).</Paragraph>
      <Paragraph position="4"> Each feature extracted is associated with a weight value. The weight calculation is a variant on the Inverse Document Frequency (IDF) measure in Information Retrieval. The weighting, in this case, is an Inverse Category Frequency (ICF) measure where each token is weighted by the inverse of its frequency of occurrence in the specified context of the target word.</Paragraph>
      <Paragraph position="5">  The manually-annotated training data is the SENSEVAL2 Lexical Sample training data for the English task, (SV2LS Train).2 This training data corpus comprises 44856 lines and 917740 tokens.</Paragraph>
      <Paragraph position="6"> There is a close affinity between the test data and the manually annotated training data. The Pearson a25a40a39 a28 correlation between the sense distributions for the test data and the manually annotated training data, per test item, ranges between a41a43a42a45a44a47a46a49a48 .3</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 SALAAM
</SectionTitle>
      <Paragraph position="0"> SALAAM exploits parallel corpora for sense annotation. The key intuition behind SALAAMis that when words in one language, L1, are translated into the same word in a second language, L2, then those L1 words are semantically similar. For example, when the English -- L1 -- words bank, brokerage, mortgage-lender translate into the French -- L2 -word banque in a parallel corpus, where bank is polysemous, SALAAMdiscovers that the intended sense for bank is the financial institution sense, not the geological formation sense, based on the fact that it is grouped with brokerage and mortgage-lender.</Paragraph>
      <Paragraph position="1"> SALAAM's algorithm is as follows: a37 SALAAM expects a word aligned parallel corpus as input;  butions. Throughout this paper, we opt for using the parametric Pearson a50 correlation rather than KL distance in order to test statistical significance.</Paragraph>
      <Paragraph position="2"> a37 L1 words that translate into the same L2 word are grouped into clusters; a37 SALAAM identifies the appropriate senses for the words in those clusters based on the words senses' proximity in WordNet. The word sense proximity is measured in information theoretic terms based on an algorithm by Resnik (Resnik, 1999); a37 A sense selection criterion is applied to choose the appropriate sense label or set of sense labels for each word in the cluster; a37 The chosen sense tags for the words in the cluster are propagated back to their respective contexts in the parallel text. Simultaneously, SALAAMprojects the propagated sense tags for L1 words onto their L2 corresponding translations. null  Sample trial and training corpora with no manual annotations. It comprises 61879 lines and 1084064 tokens.</Paragraph>
      <Paragraph position="3"> a37 MT: The English Brown Corpus, SENSE-VAL1 (trial, training and test corpora), Wall Street Journal corpus, and SENSEVAL 2 All Words corpus. All of which comprise 151762 lines and 37945517 tokens.</Paragraph>
      <Paragraph position="4"> a37 HT: UN English corpus which comprises 71672 lines of 1734001 tokens  The SALAAM-tagged corpora are rendered in a format similar to that of the manually annotated training data. The automatic sense tagging for MT and SV2LS TR training data is based on using SALAAM with machine translated parallel corpora. The HT training corpus is automatically sense tagged based on using SALAAM with the English-Spanish UN naturally occurring parallel corpus.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Experimental Conditions
</SectionTitle>
      <Paragraph position="0"> Experimental conditions are created based on three  of SALAAM's tagging factors, Corpus, Language and Threshold: a37 Corpus: There are 4 different combinations for the training corpora: MT+SV2LS TR; MT+HT+SV2LS TR; HT+SV2LS TR; or SV2LS TR alone.</Paragraph>
      <Paragraph position="1"> a37 Language: The context language of the paral null lel corpus used by SALAAMto obtain the sense tags for the English training corpus. There are three options: French (FR), Spanish (SP), or, Merged languages (ML), where the results are obtained by merging the English output of FR and SP.</Paragraph>
      <Paragraph position="2"> a37 Threshold: Sense selection criterion, in SALAAM, is set to either MAX (M) or THRESH (T).</Paragraph>
      <Paragraph position="3"> These factors result in 39 conditions.4</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Test Data
</SectionTitle>
      <Paragraph position="0"> The test data are the 29 noun test items for the SENSEVAL 2 English Lexical Sample task, (SV2LS-Test). The data is tagged with the WordNet 1.7pre (Fellbaum, 1998; Cotton et al. , 2001). The average perplexity for the test items is 3.47 (see Section 5.3), the average number of senses is 7.93, and the total number of contexts for all senses of all test items is 1773.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> In this evaluation, a0a8a1a9a3a5a3a10a6 a3 is the a0a8a1a9a3a5a3a10a6 system trained with SALAAM-tagged data and a0a8a1a9a3a5a3a10a6 a0 is the a0a8a1a9a3a5a3a10a6 system trained with manually annotated data. Since we don't expect a0a8a1a9a3a5a3a10a6 a3 to outperform human tagging, the results yielded by a0a2a1a4a3a5a3a7a6 a0 , are the upper bound for the purposes of this study. It is important to note that a0a2a1a4a3a5a3a7a6 a3 is always trained with SV2LS TR as part of the training set in order to guarantee genre congruence between the training and test sets.The scores are calculated using scorer2.5 The average precision score over all the items for a0a8a1a9a3a5a3a10a6 a0 is 65.3% at 100% Coverage.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Metrics
</SectionTitle>
      <Paragraph position="0"> We report the results using two metrics, the harmonic mean of precision and recall, (a1a3a2a5a4a7a6 ) score, and the Performance Ratio (PR), which we define as the ratio between two precision scores on the same test data where precision is rendered using scorer2. PR is measured as follows:</Paragraph>
      <Paragraph position="2"> 4Originally, there are 48 conditions, 9 of which are excluded due to extreme sparseness in training contexts.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4.2 Results
</SectionTitle>
    <Paragraph position="0"> Table 1 shows the a1 a2a5a4a7a6 scores for the upper bound a0a8a1a9a3a5a3a10a6 a0 . a0a2a1a4a3a11a3a10a6 a3 a1a3a2a5a4a7a6 is the condition in a0a8a1a9a3a5a3a10a6 a3 that yields the highest overall a1 a2a5a4a7a6 score over all noun items. a0a2a1a4a3a5a3a7a6 a3a9a8a11a10a13a12 the maximum a1a7a2a5a4a7a6 score achievable, if we know which condition yields the best performance per test item, therefore it is an oracle condition.6 Since our approach is unsupervised, we also report the results of other unsupervised systems on this test set. Accordingly, the last seven row entries in Table 1 present</Paragraph>
    <Paragraph position="2"> and state-of-the-art unsupervised systems participating in the SENSEVAL2 English Lexical Sample task.</Paragraph>
    <Paragraph position="3"> All of the unsupervised methods including</Paragraph>
    <Paragraph position="5"> cantly below the supervised method, a0a2a1a4a3a11a3a10a6 a0 .</Paragraph>
    <Paragraph position="6"> a0a8a1a9a3a5a3a10a6 a3 a1a3a2a5a4a42a6 is the third in the unsupervised methods. It is worth noting that the average a1a7a2 a4a7a6 score across the 39 conditions is a38 a38a18a42a46a45a48a47 , and the lowest is a38a43a48 a42 a48a49a45 . The five best conditions for a0a2a1a4a3a11a3a10a6 a3 , that yield the highest average a1a7a2 a4a7a6 across all test items, use the HT corpus in the training data, four of which are the result of merged languages in SALAAM indicating that evidence from different languages simultaneously is desirable. a0a2a1a4a3a11a3a10a6 a3a27a8a11a10a13a12 is the maximum potential among all unsupervised approaches if the best of all the conditions are combined. One of our goals is to automatically determine which condition or set of conditions yield the best results for each test item.</Paragraph>
    <Paragraph position="7"> Of central interest in this paper is the performance ratio (PR) for the individual nouns. Table  top 12 test items listed in Table 2. Our algorithm does as well as supervised algorithm, a0a2a1a4a3a11a3a10a6 a0 , on 41.6% of this test set. In a0a8a1a9a3a5a3a10a6 a3 a1a3a2a5a4a7a6 , 31% of the test items, (9 nouns yield PR scores a50 a41a43a42a45a44a52a51 ), do as well as a0a2a1a4a3a5a3a7a6 a0 . This is an improvement of 11% absolute over state-of-the-art bootstrapping WSD algorithm yielded by Mihalcea (Mihalcea, 2002). Mihalcea reports high PR scores for six test items only: art, chair, channel, church, detention, nation. It is worth highlighting that her bootstrapping approach is partially supervised since it depends mainly on hand labelled data as a seed for the training data.</Paragraph>
    <Paragraph position="8"> Interestingly, two nouns, detention and chair, yield better performance than a0a2a1a4a3a5a3a7a6 a0 , as indicated by the PRs a48 a42a41a1a0 and a48 a42a41 a51 , respectively. This is attributed to the fact that SALAAM produces a lot more correctly annotated training data for these two words than that provided in the manually annotated training data for a0a8a1a9a3a5a3a10a6 a0 .</Paragraph>
    <Paragraph position="9"> Some nouns yield very poor PR values mainly due to the lack of training contexts, which is the case for mouth in a0a2a1a4a3a11a3a10a6 a3 a1a3a2a41a4a42a6 , for example. Or lack of coverage of all the senses in the test data such as for bar and day, or simply errors in the annotation of the SALAAM-tagged training data.</Paragraph>
    <Paragraph position="10"> If we were to include only nouns that achieve acceptable PR scores of a2 a41a43a42a46a45a3a0 -- the first 16 nouns in Table 2 for a0a2a1a4a3a5a3a7a6 a3 a8a11a10a44a12 -- the overall potential precision of a0a2a1a4a3a11a3a10a6 a3 is significantly increased to 63.8% and the overall precision of a0a8a1a9a3a5a3a10a6 a0 is increased to 68.4%.8 These results support the idea that we could replace hand tagging with SALAAM's unsupervised tagging if we did so for those items that yield an acceptable PR score. But the question remains: How do we predict which training/test items will yield acceptable PR scores?</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Factors Affecting Performance Ratio
</SectionTitle>
    <Paragraph position="0"> In an attempt to address this question, we analyze several different factors for their impact on the performance of a0a8a1a9a3a5a3a10a6 a3 quanitified as PR. In order to effectively alleviate the sense annotation acquisition bottleneck, it is crucial to predict which items would be reliably annotated automatically using a0a8a1a9a3a5a3a10a6 a3 . Accordingly, in the rest of this paper, we explore 7 different factors by examining the yielded PR values in a0a2a1a4a3a5a3a7a6 a3 a8a43a10a44a12 .</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Number of Senses
</SectionTitle>
      <Paragraph position="0"> The test items that possess many senses, such as art (17 senses), material (16 senses), mouth (10 senses) and post (12 senses), exhibit PRs of 0.98, 0.92, 0.73 and 0.66, respectively. Overall, the correlation between number of senses per noun and its PR score is an insignificant a39 a34a5a4 a41a43a42a45a38a43a48 , a25 a1a47a25 a48a7a6a13a51a1a8 a28 a34 a51a18a42a45a44a9a6a11a10 a50 a41a43a42 a48 a28 . Though it is a weak negative correlation, it does suggest that when the number of senses increases, PR tends to decrease.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Number of Training Examples
</SectionTitle>
      <Paragraph position="0"> This is a characteristic of the training data. We examine the correlation between the PR and the num-</Paragraph>
      <Paragraph position="2"> achieves an overall a26a9a27a29a28a31a30 score of a15a17a16a32a13a33 in the WSD task.</Paragraph>
      <Paragraph position="3"> ber of training examples available to a0a2a1a4a3a5a3a7a6 a3 for each noun in the training data. The correlation between the number of training examples and PR is insignificant at a39 a34a34a4 a41a43a42 a48a32a0 , a25 a1a47a25 a48a7a6a13a51a1a8 a28 a34 a41a43a42a46a45 a38a1a8a35a6a11a10 a50 a41a43a42 a47 a28 . More interestingly, however, spade, with only 5 training examples, yields a PR score of a48 a42a41 . This contrasts with nation, which has more than 4200 training examples, but yields a low PR score of a41a43a42a36a0 a44 . Accordingly, the number of training examples alone does not seem to have a direct impact on PR.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Sense Perplexity
</SectionTitle>
      <Paragraph position="0"> This factor is a characteristic of the training data.</Paragraph>
      <Paragraph position="1"> Perplexity is a51a3a37a39a38 a6a41a40a43a42a11a44a29a45 . Entropy is measured as follows: null</Paragraph>
      <Paragraph position="3"> where a53 is a sense for a polysemous noun and a46 is the set of all its senses.</Paragraph>
      <Paragraph position="4"> Entropy is a measure of confusability in the senses' contexts distributions; when the distribution is relatively uniform, entropy is high. A skew in the senses' contexts distributions indicates low entropy, and accordingly, low perplexity. The lowest possible perplexity is a48 , corresponding to a41 entropy. A low sense perplexity is desirable since it facilitates the discrimination of senses by the learner, therefore leading to better classification. In the SALAAM-tagged training data, for example, bar has the highest perplexity value of a44a18a42a36a56a3a0 over its 19 senses, while day, with 16 senses, has a much lower perplexity of a48 a42a45a38 .</Paragraph>
      <Paragraph position="5"> Surprisingly, we observe nouns with high perplexity such as bum (sense perplexity value of a38a18a42a41 a38 ) achieving PR scores of a48 a42a41 . While nouns with relatively low perplexity values such as grip (sense perplexity of a41a43a42a36a0 a38 ) yields a low PR score of a41a43a42a46a51a52a45 . Moreover, nouns with the same perplexity and similar number of senses yield very different PR scores.</Paragraph>
      <Paragraph position="6"> For example, examining holiday and child, both have the same perplexity of a51a18a42 a48 a47a52a47 and the number of senses is close, with 6 and 7 senses, respectively, however, the PR scores are very different; holiday yields a PR of a41a43a42a41a1a56 , and child achieves a PR of a41a43a42a45a44a1a8 . Furthermore, nature and art have the same perplexity of a51a18a42a46a51 a44 ; art has 17 senses while nature has 7 senses only, nonetheless, art yields a much higher PR score of (a41a43a42a45a44a3a56 ) compared to a PR of a41a43a42 a47a52a47 for nature.</Paragraph>
      <Paragraph position="7"> These observations are further solidified by the insignificant correlation of a39 a34a5a4 a41a43a42 a48a49a51 , a25 a1a47a25 a48a7a6a13a51a1a8 a28 a34 a41a43a42 a47a57a0a9a6a11a10 a50 a41a43a42a36a0 a28 between sense perplexity and PR.</Paragraph>
      <Paragraph position="8"> At first blush, one is inclined to hypothesize that, the combination of low perplexity associated with a large number of senses -- as an indication of high skew in the distribution -- is a good indicator of high PR, but reviewing the data, this hypothesis is dispelled by day which has 16 senses and a sense perplexity of a48 a42a45a38 , yet yields a low PR score of a41a43a42a41a1a56 .</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Semantic Translation Entropy
</SectionTitle>
      <Paragraph position="0"> Semantic translation entropy (STE) (Melamed, 1997) is a special characteristic of the SALAAM-tagged training data, since the source of evidence for SALAAM tagging is multilingual translations.</Paragraph>
      <Paragraph position="1"> STE measures the amount of translational variation for an L1 word in L2, in a parallel corpus. STE is a variant on the entropy measure. STE is expressed as follows:  where a19 is a translation in the set of possible translations a6 in L2; and a21 is L1 word.</Paragraph>
      <Paragraph position="2"> The probability of a translation a19 is calculated directly from the alignments of the test nouns and their corresponding translations via the maximum likelihood estimate.</Paragraph>
      <Paragraph position="3"> Variation in translation is beneficial for SALAAM tagging, therefore, high STE is a desirable feature. Correlation between the automatic tagging precision and STE is expected to be high if SALAAM has good quality translations and good quality alignments. However, this correlation is a low a39 a34 a41a43a42a45a38 a38 . Consequently, we observe a low correlation between STE and PR, a39 a34 a41a43a42a46a51a52a51 , a25 a1a47a25 a48a7a6a13a51a1a8 a28 a34</Paragraph>
      <Paragraph position="5"> Examining the data, the nouns bum, detention, dyke, stress, and yew exhibit both high STE and high PR; Moreover, there are several nouns that exhibit low STE and low PR. But the intriguing items are those that are inconsistent. For instance, child and holiday: child has an STE of a41a43a42a41a1a56 and comprises 7 senses at a low sense perplexity of a48 a42a46a45 a44 , yet yields a high PR of a41a43a42a45a44a1a8 . As mentioned earlier, low STE indicates lack of translational variation. In this specific experimental condition, child is translated as a2 enfant, enfantile, ni~no, ni~no-peque~no a3 , which are words that preserve ambiguity in both French and Spanish. On the other hand, holiday has a relatively high STE value of a41a43a42a46a45a52a45 , yet results in the lowest PR of a41a43a42a41a1a56 . Consequently, we conclude that STE alone is not a good direct indicator of PR.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.5 Perplexity Difference
</SectionTitle>
      <Paragraph position="0"> Perplexity difference (PerpDiff) is a measure of the absolute difference in sense perplexity between the test data items and the training data items. For the manually annotated training data items, the overall correlation between the perplexity measures is a significant a39 a34 a41a43a42a45a44a52a45 which contrasts to a low over-all correlation of a39  a41a43a42 a47 a38 between the SALAAM-tagged training data items and the test data items. Across the nouns in this study, the correlation between PerpDiff and PR is a39 a34 a4 a41a43a42 a47 . It is advantageous to be as similar as possible to the training data to guarantee good classification results within a supervised framework, therefore a low PerpDiff is desirable. We observe cases with a low PerpDiff such as holiday (PerpDiff of a41a43a42a41a1a0 ), yet the PR is a low a41a43a42a41a1a56 . On the other hand, items such as art have a relatively high PerpDiff of a51a18a42a46a45a52a51 , but achieves a high PR of a41a43a42a45a44a1a8 . Accordingly, PerpDiff alone is not a good indicator of PR.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.6 Sense Distributional Correlation
</SectionTitle>
      <Paragraph position="0"> Sense Distributional Correlation (SDC) results from comparing the sense distributions of the test data items with those of SALAAM-tagged training data items. It is worth noting that the correlation between the SDC of manually annotated training data and that of the test data ranges from a39 a34 a41a43a42a45a44 a4 a48 a42a41 . A strong significant correlation of a39  test data. Overall, nouns that yield high PR have high SDC values. However, there are some instances where this strong correlation is not exhibited. For example, circuit and post have relatively high SDC values, a41a43a42a55a8 a44a48a47 and a41a43a42a36a56a3a0 a44 , respectively, in a0a8a1a9a3a5a3a10a6 a3 a8a43a10a44a12 , but they score lower PR values than detention which has a comparatively lower SDC value of a41a43a42a55a8a3a8 a45 . The fact that both circuit and post have many senses, 13 and 12, respectively, while detention has 4 senses only is noteworthy. detention has a higher STE and lower sense perplexity than either of them however. Overall, the data suggests that SDC is a very good direct indicator of PR.</Paragraph>
    </Section>
    <Section position="7" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.7 Sense Context Confusability
</SectionTitle>
      <Paragraph position="0"> A situation of sense context confusability (SCC) arises when two senses of a noun are very similar and are highly uniformly represented in the training examples. This is an artifact of the fine granularity of senses in WordNet 1.7pre. Highly similar senses typically lead to similar usages, therefore similar contexts, which in a learning framework detract from the learning algorithm's discriminatory power.</Paragraph>
      <Paragraph position="1"> Upon examining the 29 polysemous nouns in the training and test sets, we observe that a significant number of the words have similar senses according to a manual grouping provided by Palmer, in 2002.9 For example, senses 2 and 3 of nature, meaning trait and quality, respectively, are considered similar by the manual grouping. The manual grouping does not provide total coverage of all the noun senses in this test set. For instance, it only considers the homonymic senses 1, 2 and 3 of spade, yet, in the current test set, spade has 6 senses, due to the existence of sub senses.</Paragraph>
      <Paragraph position="2"> 26 of the 29 test items exhibit multiple groupings based on the manual grouping. Only three nouns, detention, dyke, spade do not have any sense groupings. They all, in turn, achieve high PR scores of a48 a42a41 .</Paragraph>
      <Paragraph position="3"> There are several nouns that have relatively high SDC values yet their performance ratios are low such as post, nation, channel and circuit. For instance, nation has a very high SDC value of a41a43a42a45a44a52a45a52a51 , a low sense perplexity of a48 a42a45a38 -- relatively close to the a48 a42a46a45 sense perplexity of the test data -- a sufficient number of contexts (4350), yet it yields a PR of a41a43a42a36a0 a44 . According to the manual sense grouping, senses 1 and 3 are similar, and indeed, upon inspection of the context distributions, we find the bulk of the senses' instance examples in the SALAAM-tagged training data for the condition that yields this PR in a0a2a1a4a3a5a3a7a6 a3 a8a43a10a44a12 are annotated with either sense 1 or sense 3, thereby creating confusable contexts for the learning algorithm. All the cases of nouns that achieve high PR and possess sense groups do not have any SCC in the training data which strongly suggests that SCC is an important factor to consider when predicting the PR of a system. null</Paragraph>
    </Section>
    <Section position="8" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.8 Discussion
</SectionTitle>
      <Paragraph position="0"> We conclude from the above exploration that SDC and SCC affect PR scores directly. PerpDiff, STE, and Sense Perplexity, number of senses and number of contexts seem to have no noticeable direct impact on the PR.</Paragraph>
      <Paragraph position="1"> Based on this observation, we calculate the SDC values for all the training data used in our experimental conditions for the 29 test items.</Paragraph>
      <Paragraph position="2"> Table 3 illustrates the items with the highest SDC values, in descending order, as yielded from any of the SALAAM conditions. We use an empirical cut-off value of a41a43a42a55a8a7a0 for SDC. The SCC values are reported as a boolean Y/N value, where a Y indicates the presence of a sense confusable context. As shown a high SDC can serve as a means of auto9http://www.senseval.org/sense-groups. The manual sense grouping comprises 400 polysemous nouns including the 29 nouns in this evaluation.</Paragraph>
      <Paragraph position="3">  sociated with their respective SCC and PR values.11 matically predicting a high PR, but it is not sufficient. If we eliminate the items where an SCC exists, namely, mouth, post, and authority, we are still left with nation and circuit, where both yield very low PR scores. nation has the desirable low PerpDiff of a41a43a42a46a51a52a51 . The sense annotation tagging precision of the a3a1a0 a51a3a2 a3 a6 a9 in this condition which yields the highest SDC -- Spanish UN data with the a3a4a0 a51a3a2 a3 a6 a9 for training -- is a low a38 a41a43a42 a47a6a5 and a low STE value of a41a43a42 a48a49a51 a44 . This is due to the fact that both French and Spanish preserve ambiguity in similar ways to English which does not make it a good target word for disambiguation within the SALAAM framework, given these two languages as sources of evidence. Accordingly, in this case, STE coupled with the noisy tagging could have resulted in the low PR. However, for circuit, the STE value for its respective condition is a high a41a43a42a46a51 a44a43a48 , but we observe a relatively high PerpDiff of a48 a42a36a0 a38 compared to the PerpDiff of a41 for the manually annotated data.</Paragraph>
      <Paragraph position="4"> Therefore, a combination of high SDC and nonexistent SCC can reliably predict good PR. But the other factors still have a role to play in order to achieve accurate prediction.</Paragraph>
      <Paragraph position="5"> It is worth emphasizing that two of the identified factors are dependent on the test data in this study, SDC and PerpDiff. One solution to this problem is to estimate SDC and PerpDiff using a held out data set that is hand tagged. Such a held out data set would be considerably smaller than the required size of a manually tagged training data for a classical supervised WSD system. Hence, SALAAM-tagged training data offers a viable solution to the annotation acquisition bottleneck.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>