<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0801"> <Title>The Basque lexical-sample task</Title>
<Section position="3" start_page="0" end_page="1" type="metho"> <SectionTitle> 2 Setting of the exercise </SectionTitle>
<Paragraph position="0"> In this section we present the setting of the Basque lexical-sample exercise.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Basque </SectionTitle>
<Paragraph position="0"> As Basque is an agglutinative language, the elements needed to mark the different syntactic functions are attached to the dictionary entry. More specifically, the affixes corresponding to the determiner, number and declension case are attached in this order and independently of each other (deep morphological structure). For instance, 'etxekoari emaiozu' can be roughly translated as '[to the one in the house] [give it]', where the sequence of suffixes attached to 'etxe' (house) corresponds to 'to the one in the'.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Sense inventory </SectionTitle>
<Paragraph position="0"> We chose the Basque WordNet, linked to WordNet 1.6, as the sense inventory. This way, the hand tagging enabled us to check the sense coverage and overall quality of the Basque WordNet, which is under construction.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Corpora used </SectionTitle>
<Paragraph position="0"> Since Basque is a minority language, it is not easy to find the required number of occurrences for each word. We wanted to have both balanced-corpus and newspaper examples, but we also had to include texts extracted from the web, especially for the untagged corpus. The procedure to obtain examples from the web was the following: for each target word, all possible morphological declensions were generated automatically and submitted to a search engine; the retrieved documents were automatically lemmatized (Aduriz et al., 2000), filtered using some heuristics to ensure the quality of the context, and finally filtered for PoS mismatches. Table 1 shows the number of examples from each source.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 2.4 Words chosen </SectionTitle>
<Paragraph position="0"> Basically, the words employed in this task are the same words used in Senseval-2 (40 words: 15 nouns, 15 verbs and 10 adjectives); only the sense inventory changed. In addition, for Senseval-3 we replaced 5 verbs with new ones. The reason is that in the context of the MEANING project we are exploring multilingual lexical acquisition, and there are ongoing experiments that focus on those verbs (Agirre et al., 2004; Atserias et al., 2004). In fact, 10 words in the English lexical-sample task have translations in the Basque, Catalan, Italian, Romanian and Spanish lexical-sample tasks: channel, crown, letter, program, party (nouns), simple (adjective), play, win, lose, decide (verbs).</Paragraph> </Section>
<Section position="5" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.5 Selection of examples from corpora </SectionTitle>
<Paragraph position="0"> The minimum number of examples for each word according to the task specifications was calculated as follows: N = 75 + 15*senses + 7*multiwords. As the number of senses in WordNet is very high, we decided to first estimate the number of senses and multiwords that really occur in the corpus. The taggers were provided with a sufficient number of examples, but they did not have to tag all of them. After they had tagged around 100 examples, they counted the number of senses and multiwords that had occurred and computed N according to those counts.</Paragraph>
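The threshold above can be made concrete with a short sketch. The following is a minimal illustration (not the organizers' code), where the per-word counts of senses and multiwords are hypothetical, standing in for what the taggers observed in the first ~100 examples:

    # Minimal sketch of the minimum-examples threshold N = 75 + 15*senses + 7*multiwords.
    # The per-word counts below are hypothetical.

    def min_examples(senses_observed: int, multiwords_observed: int) -> int:
        return 75 + 15 * senses_observed + 7 * multiwords_observed

    observed = {"herri": (6, 1), "kanal": (3, 2)}  # hypothetical (senses, multiwords) counts
    for word, (senses, multiwords) in observed.items():
        print(word, min_examples(senses, multiwords))
    # herri 172, kanal 134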
<Paragraph position="1"> The context consists of 5 sentences, with the sentence containing the target word in the middle. Links were kept to the source corpus and document, and to the newspaper section when applicable.</Paragraph>
<Paragraph position="2"> The occurrences were split at random into a training set (2/3 of all occurrences) and a test set (1/3). The labels in Table 1 correspond to the sources of the examples: newspaper, balanced corpus and the Internet.</Paragraph> </Section> </Section>
<Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 Hand tagging </SectionTitle>
<Paragraph position="0"> Three people, all graduate linguistics students, took part in the tagging. They are familiar with word senses, as they are involved in the development of the Basque WordNet. The following procedure was defined for tagging each word.</Paragraph>
<Paragraph position="1"> * Before tagging, one of the linguists (the editor) revised the 40 words in the Basque WordNet.</Paragraph>
<Paragraph position="2"> She had to delete and add senses to the words, especially for adjectives and verbs, and was allowed to check the examples in the corpus.</Paragraph>
<Paragraph position="3"> * The three taggers would meet, read the glosses and examples given in the Basque WordNet and discuss the meaning of each synset. They tried to agree on and clarify the meaning differences among the synsets. For each word, two hand-taggers and a referee were assigned at random.</Paragraph>
<Paragraph position="4"> * The number of senses of a word in the Basque WordNet might change during this meeting; that is, the linguists could agree that one of the word's senses was missing, or that a synset did not fit the word. This was done prior to looking at the corpus. Then, the editor would update the Basque WordNet according to those decisions before giving the taggers the final synset list. Overall (including the first bullet above), 143 senses were deleted and 92 senses added, leaving a total of 316 senses. This reflects the current situation of the Basque WordNet, which is still under construction.</Paragraph>
<Paragraph position="5"> * Two taggers independently tagged all examples for the word. No communication was allowed while tagging the word.</Paragraph>
<Paragraph position="6"> * Multiple synset tags were allowed, as well as the following tags: the lemma (in the case of multiword terms), U (unassignable), P (proper noun), and X (incorrectly lemmatized). Occurrences tagged with X were removed from the final release. In the case of proper nouns and multiword terms no synset tag was assigned. Sometimes the U tag was used for word senses that are not in the Basque WordNet. For instance, the sense of kanal corresponding to TV channel, which is the most frequent sense in the examples, is not present in the Basque WordNet (it was not included in WordNet 1.6).</Paragraph>
<Paragraph position="7"> * A program was used to compute agreement rates and to output the occurrences where there was disagreement. Those occurrences were grouped by the senses assigned (a minimal sketch of such a computation is given after this list).</Paragraph>
<Paragraph position="8"> * A third tagger, the referee, reviewed the disagreements and decided which was the correct sense (or senses).</Paragraph>
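The agreement program itself is not described in detail; the following is a minimal sketch under the assumption that each tagger assigns a set of sense tags per occurrence and that two taggers agree when their tag sets share at least one tag (consistent with the agreement figure reported below). Occurrence identifiers and tag values are hypothetical.

    from collections import defaultdict

    # Minimal sketch (an assumption, not the organizers' program): each tagger maps
    # an occurrence id to a set of sense tags; taggers agree on an occurrence when
    # the two sets intersect. Disagreements are grouped by the pair of tag sets.
    def agreement(tags_a, tags_b):
        agree = 0
        disagreements = defaultdict(list)
        for occ in tags_a:
            a, b = tags_a[occ], tags_b[occ]
            if a & b:
                agree += 1
            else:
                disagreements[(frozenset(a), frozenset(b))].append(occ)
        return agree / len(tags_a), disagreements

    # Hypothetical tags for two occurrences of 'herri'
    tagger1 = {"herri.001": {"herri_1"}, "herri.002": {"herri_2", "herri_3"}}
    tagger2 = {"herri.001": {"herri_2"}, "herri.002": {"herri_3"}}
    rate, disagreed = agreement(tagger1, tagger2)
    print(rate)  # 0.5 in this toy example
    for (a, b), occurrences in disagreed.items():
        print(sorted(a), sorted(b), occurrences)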
<Paragraph position="9"> The taggers were allowed to return more than one sense, and they returned 9887 tags (1.34 per occurrence). Overall, the two taggers agreed on at least one tag 78.2% of the time. Some words attained an agreement rate above 95% (e.g. the nouns kanal and tentsio), but others, like herri (town/people/nation), attained only 52% agreement. On average, the tagging took 54 seconds per occurrence for a tagger, and 20 seconds for the referee. However, this average does not include the time the taggers and the referee spent in the meetings held to understand the meaning of each synset. The comprehension of a word with all its synsets required 45.5 minutes on average.</Paragraph> </Section>
<Section position="5" start_page="1" end_page="1" type="metho"> <SectionTitle> 4 Final release </SectionTitle>
<Paragraph position="0"> Table 1 includes the total number of hand-tagged and untagged examples that were released. In addition to the usual release, the training and testing data were also provided in a lemmatized version (Aduriz et al., 2000) which included lemma, PoS and case information. The motivation was twofold: * to make participation of the teams easier, considering the deep inflection of Basque;</Paragraph>
<Paragraph position="1"> * to factor out the impact of different lemmatizers and PoS taggers in the system comparison.</Paragraph> </Section>
<Section position="6" start_page="1" end_page="1" type="metho"> <SectionTitle> 5 Participants and Results </SectionTitle>
<Paragraph position="0"> Five teams took part in this task: Swarthmore College (swat), Basque Country University (BCU), Istituto per la Ricerca Scientifica e Tecnologica (IRST), University of Minnesota Duluth (Duluth) and University of Maryland (UMD). All the teams presented supervised systems which used only the tagged training data and no other external resources. In particular, no system used the pointers to the full texts, or the additional untagged texts. All the systems used the lemma, PoS and case information provided, except the BCU team, which had additional access to number, determiner and ellipsis information directly from the analyzer. This extra information was not provided publicly because of representation issues.</Paragraph>
<Paragraph position="1"> In Table 2, the systems are ordered according to recall.</Paragraph>
<Paragraph position="2"> We want to note that, due to a bug, a few examples were provided without lemmas.</Paragraph>
<Paragraph position="3"> The results for the fine-grained scoring are shown in Table 2, including the Most Frequent Sense baseline (MFS). We will briefly describe the systems presented by each team, in order of best recall.</Paragraph>
<Paragraph position="4"> * Swat presented three systems based on the same set of features: the best one was based on Adaboost, the second on a combination of five learners (Adaboost, maximum entropy, a clustering system based on cosine similarity, decision lists, and naive Bayes, combined by majority voting), and the third on a combination of the last three of those learners. A generic sketch of majority voting is given after this list. * BCU presented two systems: the first based on Support Vector Machines (SVM) and the second on a majority-voting combination of SVM, cosine-based vectors and naive Bayes.</Paragraph>
<Paragraph position="5"> * IRST participated with a kernel-based method. * Duluth participated with a system that votes among three bagged decision trees.</Paragraph>
<Paragraph position="6"> * UMD presented a system based on SVM.</Paragraph>
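Two of the submissions above combine base classifiers by majority voting (the Swat five-learner combination and the BCU combination of SVM, cosine-based vectors and naive Bayes). The systems themselves are only summarized here; the following is a generic, minimal sketch of majority voting with hypothetical classifier outputs, not a description of any participant's implementation:

    from collections import Counter

    # Generic majority-voting sketch (hypothetical; the participating systems are
    # only summarized in the text). Ties are broken by deferring to one designated
    # base classifier.
    def majority_vote(predictions, tie_breaker=0):
        counts = Counter(predictions)
        top_label, top_count = counts.most_common(1)[0]
        tied = [label for label, count in counts.items() if count == top_count]
        return top_label if len(tied) == 1 else predictions[tie_breaker]

    # Hypothetical sense predictions of three base classifiers for two occurrences
    print(majority_vote(["kanal_1", "kanal_1", "kanal_3"]))  # clear majority -> kanal_1
    print(majority_vote(["kanal_1", "kanal_2", "kanal_3"]))  # tie -> first classifier's vote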
<Paragraph position="7"> The winning system was Swat's Adaboost-based system, followed closely by the BCU system based on SVM.</Paragraph> </Section>
<Section position="7" start_page="1" end_page="1" type="metho"> <SectionTitle> 6 Discussion </SectionTitle>
<Paragraph position="0"> These are the main issues we think are interesting for further discussion.</Paragraph>
<Paragraph position="1"> Sense inventory. Using the Basque WordNet presented some difficulties for the taggers. The Basque WordNet has been built using the translation approach, that is, the English synsets have been 'translated' into Basque. The taggers had some difficulty comprehending the synsets, and especially in determining what makes one synset different from another. In some cases the taggers decided to group some of the senses; for instance, for herri (town/people/nation) they grouped 6 senses. This explains the relatively high number of tags per occurrence (1.34). The taggers think that the tagging would have been much more satisfactory if they had defined the word senses directly from the corpus.</Paragraph>
<Paragraph position="2"> Basque WordNet quality. There was a mismatch between the Basque WordNet and the corpus: most of the examples were linked to a specific genre, and this resulted in i) a handful of senses in the Basque WordNet that did not appear in our corpus and ii) some senses in the corpus that were not included in the Basque WordNet. Fortunately, we anticipated this and included a preparation phase in which the editor enriched the Basque WordNet accordingly. Most of the deletions in the preliminary phase were due to the semi-automatic method used to construct the Basque WordNet. All in all, we think that tagging corpora is the best way to ensure the quality of wordnets, and we plan to pursue this extensively for the improvement of the Basque WordNet.</Paragraph> </Section> </Paper>