<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2312">
<Title>Resolution of Lexical Ambiguities in Spoken Dialogue Systems</Title>
<Section position="6" start_page="0" end_page="0" type="evalu">
<SectionTitle>5 Evaluation</SectionTitle>
<Paragraph position="0">The percentage of correctly disambiguated lexemes from both systems is calculated by the following formula: $r = \frac{m + \frac{n}{2}}{t} \cdot 100$, where $r$ is the result in percent, $m$ the number of lexemes that match the gold standard, $n$ the number of not-decidable ones, and $t$ the total number of lexemes. As opposed to the human annotators, both systems always select a specific reading and never assign the value not-decidable. For this evaluation, therefore, we treat any concept occurring in a not-decidable set of the gold standard as a match.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle>5.1 Evaluation Knowledge</SectionTitle>
<Paragraph position="0">For this evaluation, ONTOSCORE transformed the SRHs from our corpus into concept representations as described above. To perform the WSD task, ONTOSCORE calculates a coherence score for each of these concept sets. The concepts in the highest-ranked set are considered to be the ones representing the correct word meanings in this context. ONTOSCORE has two modes: in the first, the relations between two concepts are weighted 0 for taxonomic relations and 1 for all others. The second mode allows each relation to be assigned an individual weight, as described in Section 4.1. For this purpose, the relations have been weighted according to their level of generalization: more specific relations should indicate a higher degree of semantic coherence and are therefore weighted more cheaply, making the readings they connect more likely to be selected as the correct meaning. Compared to the gold standard, the original method of Gurevych et al. (2003a) reached a precision of 63.76% (f-measure = .78)[8], as compared to 64.75% (f-measure = .79) for the new method described herein (baseline 52.48%).</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle>5.2 Evaluation Supervised</SectionTitle>
<Paragraph position="0">For the purpose of evaluating a supervised learning approach on our data, we used the efficient and general statistical TnT tagger, short for Trigrams'n'Tags (Brants, 2000). With this tagger it is possible to train a new statistical model with any tagset; in our case, the tagset consisted of part-of-speech-specific concepts of the SmartKom ontology. The data used for preparing the model consisted of a gold-standard annotation of the training data set. Compared to the gold standard made for the test corpus, the method achieved a precision of 75.07% (baseline 52.48%).</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle>5.3 Evaluation Comparison</SectionTitle>
<Paragraph position="0">For a direct comparison, we computed f-measures for human reliability, the majority-class baseline, and the knowledge-based and data-driven methods, shown in Table 3.</Paragraph>
<Paragraph position="1">[8] F-measures are computed following van Rijsbergen (1979) with $\alpha = 0.5$ by regarding the accuracy as precision and recall as 100%.</Paragraph>
</Section>
</Section>
</Paper>
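
To make the scoring formula in Section 5 concrete, here is a minimal Python sketch of the metric as reconstructed above, $r = (m + n/2)/t \cdot 100$. The function name is invented for illustration; the half-credit for not-decidable cases follows the variable definitions in the text, and since both systems always commit to a reading, $n = 0$ when scoring them.

```python
def disambiguation_score(matches: int, not_decidable: int, total: int) -> float:
    """Percentage of correctly disambiguated lexemes:
    r = (m + n/2) / t * 100.

    Not-decidable cases count half; this only matters for the
    human annotators, since both systems always select a specific
    reading (n = 0 for them).
    """
    return (matches + not_decidable / 2) / total * 100
```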
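As a rough illustration of the knowledge-based step in Section 5.1 (a sketch of the general technique, not ONTOSCORE's actual implementation), the following code scores each candidate concept set by its average pairwise path cost in an ontology graph, weighting taxonomic edges 0 and all other edges 1 as in the first mode, and selects the cheapest, i.e. most coherent, set. The graph representation and relation labels are assumptions.

```python
import heapq
import itertools

# Hypothetical ontology graph: concept -> [(neighbor, relation_kind)].
def edge_weight(kind: str) -> int:
    # First mode of the paper: taxonomic relations cost 0, all others 1.
    return 0 if kind == "taxonomic" else 1

def path_cost(graph, src, dst):
    """Cheapest path between two concepts (Dijkstra)."""
    dist, frontier = {src: 0}, [(0, src)]
    while frontier:
        d, node = heapq.heappop(frontier)
        if node == dst:
            return d
        if d > dist.get(node, float("inf")):
            continue
        for nbr, kind in graph.get(node, []):
            nd = d + edge_weight(kind)
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(frontier, (nd, nbr))
    return float("inf")

def coherence_cost(graph, concept_set):
    """Lower average pairwise path cost = higher semantic coherence."""
    pairs = list(itertools.combinations(concept_set, 2))
    if not pairs:
        return 0.0
    return sum(path_cost(graph, a, b) for a, b in pairs) / len(pairs)

def disambiguate(graph, candidate_sets):
    """Pick the concept set whose reading is most coherent."""
    return min(candidate_sets, key=lambda s: coherence_cost(graph, s))
```

In the second mode, `edge_weight` would instead look up an individual weight per relation, with more specific relations assigned cheaper weights as described in Section 4.1.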
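For the supervised approach in Section 5.2, the paper used Brants' original TnT tagger. As a stand-in, NLTK ships a TnT re-implementation with the same train/tag workflow; the sketch below trains it on tokens tagged with ontology concepts rather than parts of speech. The concept labels and the tiny training set are invented for illustration.

```python
from nltk.tag import tnt

# Gold-standard training data: tokens paired with ontology concepts
# as tags (concept names here are invented, not SmartKom's actual ones).
train_sents = [
    [("ich", "Person"), ("moechte", "Desire"), ("zum", "Direction"),
     ("Kino", "Entertainment_Building")],
    [("zeige", "Presentation"), ("mir", "Person"), ("den", "DET"),
     ("Film", "Broadcast")],
]

tagger = tnt.TnT()
tagger.train(train_sents)

# Tag a new utterance with concepts from the learned model.
print(tagger.tag(["ich", "moechte", "zum", "Film"]))
```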
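The f-measures in Sections 5.1 and 5.3 follow directly from footnote 8: with accuracy as precision, recall fixed at 100%, and alpha = 0.5, the weighted harmonic mean reduces to 2PR/(P+R). A quick arithmetic check reproduces the two reported values:

```python
def f_measure(accuracy_percent: float) -> float:
    """Footnote 8: accuracy as precision, recall = 100%, alpha = 0.5."""
    p, r = accuracy_percent, 100.0
    return 2 * p * r / (p + r) / 100  # scaled to the paper's 0-1 notation

for acc in (52.48, 63.76, 64.75, 75.07):
    print(f"accuracy {acc:.2f}% -> f = {f_measure(acc):.2f}")
# 63.76% -> 0.78 and 64.75% -> 0.79, matching the reported values.
```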