<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1012"> <Title>Semantic Coherence Scoring Using an Ontology</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Context </SectionTitle> <Paragraph position="0"> The ONTOSCORE software runs as a module in SMARTKOM (Wahlster et al., 2001), a multi-modal and multi-domain spoken dialogue system. The system combines speech and gesture as its input and output modalities. Its domains include cinema and TV program information, home electronic device control, and mobile services for tourists, e.g. tour planning and sights information.</Paragraph> <Paragraph position="1"> ONTOSCORE operates on n-best lists of SRHs produced by the language interpretation module from the ASR word graphs. It computes a numerical ranking of alternative SRHs and thus provides an important aid to the understanding component of the system in determining the best SRH. The ONTOSCORE software employs two knowledge sources covering the respective domains of the system: an ontology (about 730 concepts and 200 relations) and a word/concept lexicon (ca. 3,600 words).</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5.2 Results </SectionTitle> <Paragraph position="0"> The evaluation of ONTOSCORE was carried out on a dataset of 2,284 SRHs. We reformulated the problem of measuring semantic coherence as classifying the SRHs into two classes: coherent and incoherent. To our knowledge, no comparable semantic coherence scoring software exists that could serve as a point of comparison in this evaluation. Therefore, we decided to use the results of the human annotation (see Section 2.2) as the baseline.</Paragraph> <Paragraph position="1"> A gold standard for the evaluation of ONTOSCORE was derived by having the annotators agree on the correct solution in cases of disagreement. In this way, we obtained 1,246 SRHs (54.55%) classified as coherent by humans; this proportion also serves as the baseline for the evaluation. Additionally, we performed an inverse linear transformation of the scores (which range from 1 to a fixed maximum), so that ONTOSCORE outputs a score on a scale from 0 to 1, where higher scores indicate greater coherence. To obtain a binary classification of SRHs into coherent versus incoherent with respect to the knowledge base, we set a cutoff threshold. Figure 1 shows the results of the program (in %) as a function of the threshold value.</Paragraph> <Paragraph position="2"> The best results are achieved with a threshold of 0.29.</Paragraph> <Paragraph position="3"> With this threshold, ONTOSCORE correctly classifies 1,487 SRHs, i.e. 65.11% of the evaluation dataset (the word/concept relation is not taken into account at this point).</Paragraph> <Paragraph position="4"> Figure 3 shows the results of ONTOSCORE as a function of the threshold for the word/concept relation, given the best cutoff threshold for the classification (i.e. 0.29) derived in the previous experiment.</Paragraph> <Paragraph position="5"> The best results are achieved when the proportion of concepts vs. words is no less than 1 to 3. Under these settings, ONTOSCORE correctly classifies 1,672 SRHs, i.e. 73.2% of the evaluation dataset. The technique thus brings an additional improvement of 8.09 percentage points over the initial results.</Paragraph> </Section> </Paper>
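<!--
The thresholding procedure of the Results section can be sketched in Python as follows. This is a minimal, hypothetical illustration and not the ONTOSCORE implementation. The function names, the upper bound `s_max`, the toy data, and the rule that an SRH must pass both the score cutoff and the concepts-per-word check are all assumptions. The sketch also assumes that a raw score of 1 is the most coherent case. It shows the inverse linear transformation onto [0, 1] and the two-step sweep: first the cutoff is swept with the word/concept relation ignored, then the word/concept threshold is swept with the cutoff fixed. Finally, it recomputes the reported percentages from the counts of correctly classified SRHs.

```python
from typing import Callable, Sequence, Tuple


def normalize(raw: float, s_max: float) -> float:
    """Inverse linear map of a raw score in [1, s_max] onto [0, 1].

    Assumption: a raw score of 1 is the most coherent case (maps to 1.0),
    and the hypothetical upper bound s_max (> 1) maps to 0.0.
    """
    return (s_max - raw) / (s_max - 1.0)


def classify(score: float, concept_ratio: float,
             threshold: float = 0.29, min_ratio: float = 1 / 3) -> bool:
    """Coherent iff the normalized score reaches the cutoff and (assumed)
    the concepts-per-word proportion is at least min_ratio."""
    return score >= threshold and concept_ratio >= min_ratio


def accuracy(predicted: Sequence[bool], gold: Sequence[bool]) -> float:
    """Percentage of SRHs whose predicted label matches the gold label."""
    hits = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * hits / len(gold)


def best_setting(candidates: Sequence[float],
                 evaluate: Callable[[float], float]) -> Tuple[float, float]:
    """Return (candidate, accuracy) for the best candidate; ties keep the earliest."""
    best = max(candidates, key=evaluate)  # max() returns the first maximal item
    return best, evaluate(best)


if __name__ == "__main__":
    # Toy data invented for illustration: normalized scores, concepts-per-word
    # proportions, and gold labels (True = coherent).
    scores = [0.9, 0.5, 0.3, 0.2, 0.1]
    ratios = [0.5, 0.2, 0.4, 0.6, 0.5]
    gold = [True, False, True, False, False]

    def run(threshold: float, min_ratio: float) -> float:
        preds = [classify(s, r, threshold, min_ratio) for s, r in zip(scores, ratios)]
        return accuracy(preds, gold)

    # Step 1: sweep the cutoff while ignoring the word/concept relation.
    t, acc = best_setting([0.1, 0.29, 0.6], lambda c: run(c, 0.0))
    print(f"cutoff {t}: {acc:.1f}%")
    # Step 2: fix the cutoff and sweep the minimum concepts-per-word proportion.
    r, acc = best_setting([0.0, 1 / 3, 0.5], lambda c: run(t, c))
    print(f"min ratio {r:.2f}: {acc:.1f}%")

    # Reported counts from the evaluation (2,284 SRHs in total).
    for label, correct in [("baseline", 1246), ("cutoff", 1487), ("cutoff + ratio", 1672)]:
        print(f"{label}: {100 * correct / 2284:.2f}%")
```

On the toy data, the first sweep picks the cutoff 0.29 and the second picks the 1/3 proportion. The last loop reproduces 54.55%, 65.11% and 73.20% from the reported counts. Using `max()` keeps the first of several equally good candidates, which is one simple tie-breaking choice among several.
-->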