<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1051"> <Title>Error Driven Word Sense Disambiguation</Title> <Section position="5" start_page="320" end_page="320" type="metho"> <SectionTitle>
HASOBJ do/stative_social_motion_creation_body something/top
HASSBJ do/stative_social_motion_creation_body you/person
PREPMOD do/stative_social_motion_creation_body TO economy/group_cognition_attribute_act
</SectionTitle>
<Section position="1" start_page="320" end_page="320" type="sub_section"> <SectionTitle> 2.3 Preparing the input </SectionTitle> <Paragraph position="0"> As a result of adding lexical semantics we get a triple <functional relation, word_i/tagset_i, word_j/tagset_j>, but in its current formulation the unsupervised learning algorithm is only able to learn relations holding among bigrams. Thus it can learn either relations between a functional relation name (e.g. "HASOBJ") and a tagset, or relations between two tagsets without considering the functional relation holding between them. In both cases there is a loss of information which is fatal for the learning of proper rules for semantic disambiguation. There is an intuitive solution to this problem: most of the relations we are interested in are dyadic in nature. For example, adjectival modification is a relation holding between two heads (MOD(h1,h2)). Relations concerning verbal arguments can likewise be split, in a neo-Davidsonian perspective, into more atomic relations such as SUBJ(h1,h2) and OBJ(h1,h2). These relations can be translated into a "bigram format" by assuming that the relation itself is incorporated among the properties of the involved words (e.g. w1/IS-OBJ w2/IS-HEAD).</Paragraph> <Paragraph position="1"> Learnable properties of words are standardly expressed through tags. Thus, we can merge functional and semantic tags into a single tag (e.g. w1/IS-OBJ_tagset_1 w2/IS-HEAD_tagset_2). The learner acquires constraints which relate functional and semantic information, as planned in this experiment. We obtain the following format, where every line of the input text represents what we label an FS-pair (Functional-Semantic pair):
do/HASOBJ something/HASOBJ-1 42_41_38_36_29
do/HASSBJ you/HASSBJ-1 42_41_38_36_29
where relations labelled with -1 are just inverse relations (e.g. HASSBJ-1 corresponds to IS-SUBJ-OF).</Paragraph> <Paragraph position="4"> Functional relations involving modification through a prepositional phrase are ternary, as they involve the preposition, the governing head and the governed head. Crucially, however, only substantive heads receive semantic tags, which allows us to condense the preposition form into the FS tags as well. The representation of the modification structure of the phrase do to the economy becomes:
do/MOD-TO economy/MOD-TO-1 42_41_38_36_29 14_9_7_4
A short sketch of this conversion is given below.</Paragraph>
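To make the conversion concrete, here is a minimal Python sketch of the triple-to-FS-pair mapping described in this section. It is our own illustration, not code from the system: the function name and argument layout are invented, and the output format simply mirrors the examples above.

def make_fs_pair(relation, head, head_tags, dep, dep_tags, prep=None):
    """Turn a triple <relation, word_i/tagset_i, word_j/tagset_j> into
    a one-line FS-pair: the functional relation is folded into the tag
    of each word, the dependent side gets the inverse relation (-1),
    and a preposition, if present, is condensed into the relation name."""
    rel = "MOD-" + prep.upper() if prep else relation
    return "{}/{} {}/{}-1 {} {}".format(head, rel, dep, rel,
                                        "_".join(head_tags), "_".join(dep_tags))

# 'do to the economy': verb tags 42_41_38_36_29, noun tags 14_9_7_4
print(make_fs_pair("PREPMOD", "do", ["42", "41", "38", "36", "29"],
                   "economy", ["14", "9", "7", "4"], prep="to"))
# -> do/MOD-TO economy/MOD-TO-1 42_41_38_36_29 14_9_7_4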
</Section> </Section> <Section position="6" start_page="320" end_page="321" type="metho"> <SectionTitle> 3 Unsupervised Learning for WSD </SectionTitle> <Paragraph position="0"> Sufficiently large texts should contain good cues for learning WSD rules in terms of selectional preferences.2 [Footnote 2: By selectional preferences we mean both the selection of the semantic features of a dependent given a certain head and its inverse (i.e. the selection of a head's semantic features by a dependent constituent).] The crucial assumption in using functional relations for WSD is that, when compositionality holds, selectional preferences can be checked through an intersection operation between the semantic features of the syntactically related lexical items. By looking at functional relations that contain at least one non-ambiguously tagged word, we can learn evidence for disambiguating ambiguous words appearing in the same context. So, if we know that in the sentence John went to Milan the word Milan is unambiguously tagged as place, we learn that in a structure GO to X, where GO is a verb of the same semantic class as the word go and X is a word containing place among its possible senses, X is disambiguated as place.</Paragraph> <Paragraph position="2"> The Brill algorithm3 is based on rule patterns which describe the rules that can be learned, as well as on a lexicon where words are associated with ambiguity classes. [Footnote 3: For the sake of clarity, we present here only the general lines of Brill's algorithm. For a detailed version of the algorithm see Brill's original paper (Brill, 1997).] The learning algorithm is recursively applied to an ambiguously tagged corpus, producing a set of rules. The set of learnable rules includes the rules for which there is corpus evidence in terms of unambiguous configurations. In other words, the learning algorithm relies extensively on bigrams where one of the words is unambiguously tagged. The preferred rules, the ones with the highest score, are those that best minimize the entropy of the untagged corpus. For instance, a rule which resolves the ambiguity of 1000 occurrences of a given ambiguity class is preferred to one which resolves the same ambiguity only 100 times.</Paragraph> <Paragraph position="3"> Consider the following rule pattern: Change tagset (X1, X2 ... Xn) into tag Xi if the left context is associated with the tagset (Y1, Y2 ... Ym). This pattern generates rules such as:4
b18_b4 b18 LEFT b42_b32 1209.64
[Footnote 4: The learner sets two thresholds governing the possibility for a tag or a word to appear in a rule: i) the minimal frequency of a tag; ii) the minimal frequency of a word in the corpus. We set the first parameter to 400 (that is, we asked the learner to consider only the 400 most frequent tagsets) and we ignored the second one (that is, we asked the learner to consider all the words in the corpus).]
This rule is paraphrased as: if a noun is ambiguous between person and act and it appears as the subject of a verb which is ambiguous between stative and communication, then disambiguate it as person. This instantiation relies on the fact that the untagged corpus contains a significant number of cases where a noun unambiguously tagged as person appears as the subject of a verb ambiguous between stative and communication. The rule is then applied to the corpus in order to further reduce its ambiguity, the new corpus is passed again as input to the learner, and the next most preferred rule is learned.</Paragraph> <Paragraph position="4"> Three different scoring methods have been used as criteria to select the best rule. They are referred to in the program documentation, and in Dini et al. (1998a), as "paper", "original" and "goodlog". Here we will describe only "original" and "goodlog", because "paper" differs from "original" only in some implementation details.</Paragraph> <Paragraph position="5"> In the method called "original", at every iteration step the best-scored disambiguation rule is learned, and the score of a rule is computed, following Brill, in the following way. Assume that Change the tag of a word from Σ to Y in context C is a rule (Y ∈ Σ). Call R the tag Z which maximizes the following function, where Z ranges over all the tags in Σ except Y, freq(Y) is the number of occurrences of words unambiguously tagged with Y, freq(Z) is the number of occurrences of words unambiguously tagged with Z, and incontext(Z, C) is the number of times a word unambiguously tagged with Z occurs in context C:
R = argmax_Z (freq(Y) / freq(Z)) * incontext(Z, C)
The score assigned to the rule is then:
S = incontext(Y, C) - (freq(Y) / freq(R)) * incontext(R, C)
In short, a good transformation from Σ to Y is one for which the alternative tags in Σ either have very low frequency in the corpus or seldom appear in context C. At every iteration cycle, the algorithm simply computes the best-scoring transformation.</Paragraph> <Paragraph position="8"> The method "goodlog" uses a probabilistic measure which minimizes the effects of absolute tag frequency. Under this method, the score of the rule that selects the best tag Y in a context C is given by the following formula, where Y and Z belong to the ambiguous tagset:
S = argmax_Y sum_Z abs( log( (incontext(Y, C) * freq(Z)) / (freq(Y) * incontext(Z, C)) ) )
The differences in results between the scoring methods are reported and commented on in table 1 in section 4.</Paragraph>
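The two scores can be made concrete with a short sketch. The following Python rendering of the formulas above is ours, not the original implementation; it assumes precomputed count tables freq (unambiguous occurrences of a tag in the corpus) and incontext (unambiguous occurrences of a tag in a context), and the default counts of 0 and 1 used to avoid division by zero are our own smoothing choice.

import math

def score_original(sigma, y, c, freq, incontext):
    """Brill's 'original' score for the rule 'change tagset sigma to y
    in context c'; sigma must contain at least one tag besides y."""
    # R is the competitor tag in sigma that is best attested in context c,
    # weighted by its corpus frequency relative to y.
    r = max((z for z in sigma if z != y),
            key=lambda z: freq[y] / freq[z] * incontext.get((z, c), 0))
    return incontext.get((y, c), 0) - freq[y] / freq[r] * incontext.get((r, c), 0)

def score_goodlog(sigma, y, c, freq, incontext):
    """'goodlog' score: a log-ratio measure that damps the influence of
    absolute tag frequency; the learner keeps the rule whose y maximizes it."""
    return sum(abs(math.log((incontext.get((y, c), 1) * freq[z])
                            / (freq[y] * incontext.get((z, c), 1))))
               for z in sigma if z != y)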
</Section> <Section position="7" start_page="321" end_page="322" type="metho"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> For the evaluation we used as test corpus the subset of the Brown corpus manually tagged with the 45 top-level WordNet tags. We started with the Penn Tree Bank representation and went through all the necessary steps to build the FS-pairs used by the applier. These FS-pairs were then labelled according to the manual codification and used as a standard for evaluation. We also produced, from the same source, a randomly tagged corpus for measuring the improvement of our system with respect to random choice.</Paragraph> <Paragraph position="1"> The results of comparing the randomly tagged corpus and the corpus tagged by our system using the methods "original" and "goodlog" are shown in table 1. As usual, Precision is the number of correctly tagged words divided by the total number of tagged words; Recall is the number of correctly tagged words divided by the number of words in the test corpus (about 40000); F-measure is (2*Precision*Recall)/(Precision+Recall). The column labelled "Adjusted" reports the Precision taking into account non-ambiguous words; the adjusted precision is computed as (Correct - unambiguous words) / ((Correct + Incorrect) - unambiguous words). On an absolute basis, our results improve on those of Resnik (1997), who used an information-theoretic model of selectional preference strength rather than an error-driven learning algorithm. Indeed, if we compare the "Adjusted" measure we obtained with a set of about 500 rules (50% precision) with the average reported by Resnik (1997) (41% precision), we obtain an advantage of almost 10 points, which, for a task such as WSD, is noteworthy. For comparison with other experiments, refer to Resnik (1997).</Paragraph>
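As a concrete reference, the evaluation measures just defined can be written out as a small helper. This is a minimal sketch of the definitions above with illustrative names, not code from the system.

def evaluate(correct, incorrect, tagged, corpus_size, unambiguous):
    """Compute the measures reported in table 1 from raw counts:
    correct/incorrect tagged words, total tagged words, test corpus
    size (about 40000 here), and the number of unambiguous words."""
    precision = correct / tagged
    recall = correct / corpus_size
    f_measure = 2 * precision * recall / (precision + recall)
    # 'Adjusted' discounts the words the system never had to disambiguate.
    adjusted = (correct - unambiguous) / ((correct + incorrect) - unambiguous)
    return precision, recall, f_measure, adjusted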
<Paragraph position="2"> It is interesting to compare the figures provided by "goodlog" and "original". Since "goodlog" smooths the influence of absolute tag frequency, the learned rules achieve much higher precision, even though they are less efficient in terms of the number of words they can disambiguate. This is due to the fact that the most frequent words also tend to be the most ambiguous ones, and thus the ones for which the task of WSD is most difficult (cf. Dini et al. (1998a)).</Paragraph> </Section> <Section position="8" start_page="322" end_page="323" type="metho"> <SectionTitle> 5 Towards SENSEVAL </SectionTitle> <Paragraph position="0"> As mentioned above, the present system will be adopted in the context of the SENSEVAL project, where we will adopt the Xerox Incremental Finite State Parser, which is completely based on finite-state technology. Thus, in the present pilot experiment, we are only interested in relations which could reasonably be captured by a shallow parser, and complex informative relations present in the Penn Tree Bank are simply disregarded during the parsing step described in section 2.1. Also, structures which are traditionally difficult to parse with finite-state automata, such as incidental and parenthetical clauses or coordinate structures, are discarded from the learning corpus. This might have caused a slight decrease in the performance of the system.</Paragraph> <Paragraph position="1"> Some additional decrease might have been caused by noise introduced by the incorrect assignment of senses in context during the learning phase (see Schuetze et al. (1995)). In particular, the system has to face the problem of sense assignment to named entities such as person or industry names. Since we did not use any text preprocessor, we simply made the assumption that any word having no semantic tag in WordNet, and which is not a pronoun, is assigned the label human. This assumption is certainly questionable, and we adopted it only as a working hypothesis. In the following rounds of this experiment we will plug in a module for named entity recognition in order to improve the performance of the system.</Paragraph> <Paragraph position="2"> Another issue that will be tackled in the SENSEVAL project concerns word independence. In this experiment we duplicated lexical heads when they were in a functional relation with different items. This permitted an easy adaptation to the input specification of the Brill learner, but it has drawbacks in both the learning and the application phase. During the learning phase, the inability to capture the identity of the same lexical head subtracts evidence for the learning of new rules. For instance, assume that at iteration cycle n the algorithm has learned that verbal information is enough to disambiguate the word cat as animal in the wild cat mewed. Since the FS-pairs cat/mew and wild/cat are autonomous, at cycle n+1 the learner will have no evidence to learn that the adjective wild tends to associate with nouns of type animal; on the contrary, cat, as appearing in wild cat, will still be ambiguous. The sketch below illustrates the problem.</Paragraph>
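The toy fragment below, with invented data structures, illustrates the point: each FS-pair carries its own copy of a word together with its ambiguity class, so resolving one copy says nothing about the others.

# Two FS-pairs extracted from 'the wild cat mewed'; nothing records
# that the two copies of 'cat' are the same token.
pairs = ["mew/HASSBJ cat/HASSBJ-1", "wild/MOD cat/MOD-1"]
tagsets = [{"animal", "person"}, {"animal", "person"}]  # one tagset per copy

# A rule keyed on the verb 'mew' disambiguates only the first copy:
tagsets[0] = {"animal"}

# The copy inside wild/cat is untouched: the learner gains no evidence
# that 'wild' selects animal nouns, and at application time this
# occurrence may stay ambiguous (recall loss) or be tagged differently
# (precision loss).
print(tagsets[1])  # -> {'animal', 'person'}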
<Paragraph position="3"> The consequences of assuming the independence of lexical heads are even worse in the rule application phase. First, certain words are disambiguated only in some of the instances in which they appear, thus producing a decrease in recall. Second, the same word may be tagged differently according to the relations into which it enters, thus causing a decrease in precision. Both problems will be overcome by the new Java-based versions of the Brill learner and applier which have been developed at CELI.</Paragraph> <Paragraph position="4"> When considering the particular WSD task, it is evident that the information conveyed by adjectives and pre-nominal modifiers is at least as important as that conveyed by verbs, and it is statistically more prominent: in the corpus obtained by parsing the PTB, pre-nominal modification accounts for a large share of the FS-pairs, roughly as many as the subject-verb FS-pairs and more than the object-verb pairs. But adjectives receive very poor lexical-semantic information from WordNet, which forced us to exclude them from both the training and the test corpora. This situation will again improve in the SENSEVAL experiment with the adoption of a different semantic lexicon.</Paragraph> </Section> </Paper>