<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0808"> <Title>An Evaluation Exercise for Romanian Word Sense Disambiguation</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Sense inventory </SectionTitle> <Paragraph position="0"> For the Romanian WSD task, we have chosen a set of words from three parts of speech: nouns, verbs, and adjectives. Table 1 presents the number of words under each part of speech, and the average number of senses for each class.</Paragraph> <Paragraph position="1"> The senses were (manually) extracted from a Romanian dictionary (Dicționarul EXplicativ al limbii române, DEX), and their dictionary definitions were incorporated in the Open Mind Word Expert. For each annotation task, the contributors could choose from this list of 39 words. For each chosen word, the system displays the associated senses, together with their definitions, and a short (1-4 words) description of each sense. After the user becomes familiar with these senses, the system displays each example sentence, and the list of senses together with their short descriptions, to facilitate the tagging process.</Paragraph> <Paragraph position="2"> For the coarse-grained WSD task, we had the option of using the grouping provided by the dictionary. A manual analysis, however, showed that some of the senses in the same group are quite distinguishable, while others that were separated were very similar.</Paragraph> <Paragraph position="3"> For example, for the word circulație (roughly, circulation), the following two senses are grouped in the dictionary: 2a. movement, travel along a communication line/way; 2b. movement of the sap in plants or the cytoplasm inside cells. Sense 2a fits better with sense 1 of circulation (1. the event of moving about), while sense 2b fits better with sense 3: 3. movement or flow of a liquid, gas, etc. 
within a circuit or pipe.</Paragraph> <Paragraph position="4"> To obtain a better grouping, a linguist clustered the similar senses for each word in our list of forty. As a result, the average number of senses for each class is almost halved.</Paragraph> <Paragraph position="5"> Notice that Romanian is a language that uses diacritics, and the presence of diacritics may be crucial for distinguishing between words. For example, peste without diacritics may mean fish or over. In choosing the list of words for the Romanian WSD task, we have tried to avoid such situations. Although some of the words in the list do have diacritics, omitting them does not introduce new ambiguities.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Corpus </SectionTitle> <Paragraph position="0"> Examples are extracted from the ROCO corpus, a 400-million-word corpus consisting of a collection of Romanian newspapers collected on the Web over a three-year period (1999-2002).</Paragraph> <Paragraph position="1"> The corpus was tokenized and part-of-speech tagged using RACAI's tools (Tufis, 1999). The tokenizer recognizes and adequately segments various constructs: clitics, dates, abbreviations, multiword expressions, proper nouns, etc. The tagging followed the tiered tagging approach, with the hidden layer of tagging handled by Thorsten Brants' TnT (Brants, 2000). The upper level of the tiered tagger removed from the assigned tags all the attributes irrelevant for this WSD exercise. 
The estimated accuracy of the part-of-speech tagging is around 98%.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Sense Tagged Data </SectionTitle> <Paragraph position="0"> While several sense annotation schemes have been previously proposed, including single or dual annotations, or the &quot;tag until two agree&quot; scheme used during SENSEVAL-2, we decided to use a new scheme and collect four tags per item, which allowed us to conduct and compare inter-annotator agreement evaluations for two-, three-, and four-way agreement. The agreement rates are listed in Table 3.</Paragraph> <Paragraph position="1"> The two-way agreement is very high (above 90%), and these are the items that we used to build the annotated data set. Not surprisingly, four-way agreement is reached for a significantly smaller number of cases. While the items with four-way agreement were not explicitly used in the current evaluation, we believe that they represent a &quot;platinum standard&quot; data set with no precedent in the WSD research community, which may turn out to be useful for a range of future experiments (for bootstrapping, in particular).</Paragraph> <Paragraph position="2"> In addition to sense-annotated examples, participants were also provided with a large number of unlabeled examples. However, among all participating systems, only one system, described in (Serban and Tătar, 2004), attempted to integrate this additional unlabeled data set into the learning process.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Participating Systems </SectionTitle> <Paragraph position="0"> Five teams participated in this word sense disambiguation task. 
Table 4 lists the names of the participating systems, the corresponding institutions, and references to papers in this volume that provide detailed descriptions of the systems and additional analysis of their results.</Paragraph> <Paragraph position="1"> There were no restrictions placed on the number of submissions each team could make. A total of seven submissions were received for this task. Table 5 shows all the submissions for each team, and gives a brief description of their approaches.</Paragraph> </Section> </Paper>