<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3012"> <Title>Word level confidence measurement using semantic features. In Proceedings of ICASSP, Hong Kong, April.</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Data and Annotation Experiments </SectionTitle> <Paragraph position="0"> We performed a number of annotation experiments.</Paragraph> <Paragraph position="1"> The purpose of these experiments was to: investigate the reliability of the annotations; create a domain model based on human annotations; produce a training dataset for statistical classifiers; and set a Gold Standard as a test dataset for the evaluation.</Paragraph> <Paragraph position="2"> All annotation experiments were conducted on data collected in hidden-operator tests following the paradigm described in Rapp and Strube (2002). Subjects were asked to verbalize a predefined intention in each of their turns; the system's reaction was simulated by a human operator. We collected utterances from 29 subjects, each of whom conducted 8 dialogues with the system. All user turns were recorded in separate audio files. These audio files were processed by two versions of our dialogue system with different speech recognition modules. Data describing our corpora are given in Table 1. The first and second runs of the system are referred to as Dataset 1 and Dataset 2, respectively. The corpora obtained from these experiments were further transformed into a set of annotation files, which can be read into GUI-based annotation tools, e.g., MMAX (Müller and Strube, 2003). This tool can be adapted for annotating different levels of information, e.g., the semantic coherence and domains of utterances, the best speech recognition hypothesis in the N-best list, as well as the domains of individual concepts. The two annotators were trained with the help of an annotation manual. A reconciled version of both annotations resulted in the Gold Standard. In the following, we present the results of our annotation experiments.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Coherence and domains of SRHs in Dataset 1 </SectionTitle> <Paragraph position="0"> The first experiment was aimed at annotating the speech recognition hypotheses (SRHs) from Dataset 1 w.r.t. their semantic coherence and domains. This process was two-staged.</Paragraph> <Paragraph position="1"> In the first stage, the annotators labeled randomly mixed SRHs, i.e., SRHs without discourse context, for their semantic coherence as coherent or incoherent. In the second stage, coherent SRHs were labeled for their domains, resulting in a corpus of 1511 hypotheses labeled with at least one domain category.</Paragraph> <Paragraph position="2"> The numbers for ambiguous domain attributions can be found in Table 2. The class distribution is given in Table 3.</Paragraph> <Paragraph position="3"> Inter-annotator agreement was measured with the Kappa statistic, computed for individual categories. P(A) is the percentage of agreement between annotators; P(E) is the percentage of agreement we expect them to reach by chance.</Paragraph> <Paragraph position="4"> Annotations are generally considered to be reliable if K > 0.8. This is true for all classes except those which occur very rarely in our data.</Paragraph>
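<Paragraph> For reference, the agreement statistic assumed here is the standard Kappa formulation; with P(A) and P(E) as defined above, it is computed as $K = \frac{P(A) - P(E)}{1 - P(E)}$, so that K = 1 indicates perfect agreement and K = 0 indicates agreement at chance level.</Paragraph>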
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Domains of ontological concepts </SectionTitle> <Paragraph position="0"> In the second experiment, ontological concepts were annotated with zero or more domain categories. (Top-level concepts like Event are typically not domain-specific; therefore, they are not assigned any domains.) We extracted 231 concepts from the lexicon, i.e., the subset of ontological concepts relevant for our corpus of SRHs. The annotators were given the textual descriptions of all concepts; these definitions are supplied with the ontology. We computed two kinds of inter-annotator agreement. In the first case, we calculated the percentage of concepts for which the annotators agreed on all domain categories, resulting in ca. 47.62% (CONCabs, see Figure 1). In the second case, the agreement on individual domain decisions (1848 overall) was computed, resulting in ca. 86.85% (CONCindiv, see Figure 1).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Best conceptual representation and domains of SRHs in Dataset 2 </SectionTitle> <Paragraph position="0"> As will be evident from Section 4.1, each SRH can be mapped to a set of possible interpretations, which are called conceptual representations (CR). In this experiment, the best conceptual representation and the domains of coherent SRHs from Dataset 2 were annotated. As our system operates on the basis of CRs, it is necessary to disambiguate them in a pre-processing step.</Paragraph> <Paragraph position="1"> The 867 SRHs used in this experiment map to 2853 CRs, i.e., on average each SRH maps to 3.29 CRs. The annotators' agreement on the task of determining the best CR reached ca. 88.93%.</Paragraph> <Paragraph position="2"> For the task of domain annotation, we again computed the absolute agreement, i.e., the cases in which the annotators agreed on all domains for a given SRH. This resulted in ca. 92.5% (SRHabs, see Figure 1). The agreement on individual domain decisions (6936 overall) yielded ca. 98.92% (SRHindiv, see Figure 1). As Figure 1 suggests, annotating utterances with domains is an easier task for humans than annotating ontological concepts with the same information. One possible reason for this is that even for an isolated SRH of an utterance there is at least some local context available, which clarifies its high-level meaning to some extent; an isolated concept has no defining context whatsoever.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Domain Classification </SectionTitle> <Paragraph position="0"> In this section, we present the algorithms employed for assigning domains to speech recognition hypotheses. The system, called DOMSCORE, performs several processing steps, each of which will be described separately in the respective subsections.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 From SRHs to conceptual representations </SectionTitle> <Paragraph position="0"> An SRH is a set of words $W = \{w_1, \ldots, w_n\}$. DOMSCORE operates on high-level representations of SRHs, called conceptual representations (CR). A CR is a set of ontological concepts $CR = \{c_1, \ldots, c_n\}$.</Paragraph> <Paragraph position="1"> Conceptual representations are obtained from W through a process called word-to-concept mapping, illustrated in the sketch below. In this process, all possible ontological senses of the individual words found in the lexicon are combined, resulting in a set of possible interpretations $I = \{CR_1, \ldots, CR_n\}$ for each speech recognition hypothesis.</Paragraph>
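<Paragraph> The following is a minimal sketch of word-to-concept mapping, assuming a toy lexicon that maps each word to its candidate ontological senses; the lexicon entries are invented for illustration, and words without lexicon entries (e.g., function words) contribute no concepts:

from itertools import product

# Toy lexicon: each word maps to its candidate ontological senses.
# The entries below are illustrative stand-ins, not the system's actual lexicon.
LEXICON = {
    "movie": ["Broadcast"],
    "come": ["MotionProcess", "WatchProcess"],
}

def word_to_concept_mapping(words):
    """Map an SRH (a list of words) to its set I of conceptual representations.

    Every combination of the words' ontological senses yields one CR, so an
    SRH containing ambiguous words produces several interpretations.
    """
    sense_lists = [LEXICON[w] for w in words if w in LEXICON]
    return [frozenset(cr) for cr in product(*sense_lists)]

# The ambiguous word "come" yields two CRs for this hypothesis:
# {Broadcast, MotionProcess} and {Broadcast, WatchProcess}.
for cr in word_to_concept_mapping(["which", "movie", "come", "tonight"]):
    print(sorted(cr))
</Paragraph>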
<Paragraph position="2"> For example, in our data a user formulated a query concerning the TV program (Example 1). The two resulting speech recognition hypotheses, SRH1 and SRH2, have two conceptual representations each. This is due to the lexical ambiguity of the German verb corresponding to come, which can be mapped to either MotionProcess or WatchProcess. Movie in SRH1 is mapped to Broadcast. As a consequence, the</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Domain classification of CR </SectionTitle> <Paragraph position="0"> The domain specificity score S of a conceptual representation CR for the domain d is then defined as the average score of all concepts in CR for this domain. For a given domain model DM, this formally means:</Paragraph> <Paragraph position="1"> $$S(CR, d) = \frac{1}{n} \sum_{i=1}^{n} DM(c_i, d)$$ </Paragraph> <Paragraph position="2"> where n is the number of concepts in the respective CR. As each CR is scored for all domains d, the output of DOMSCORE is a set of domain scores:</Paragraph> <Paragraph position="3"> $$DOMSCORE(CR) = \{S(CR, d_1), \ldots, S(CR, d_{\#d})\}$$ </Paragraph> <Paragraph position="4"> where #d is the number of domain categories.</Paragraph> <Paragraph position="5"> Tables 7 and 8 display the results of the domain scoring algorithm for the conceptual representations discussed above. In the Gold Standard evaluation data, SRH1 was annotated as the best SRH and attributed the domain Electronic Program Guide; CR1b was selected as its best conceptual representation. As can be seen in these tables, CR1b receives the highest domain score for Electronic Program Guide on the basis of both DM_anno and DM_tfidf. Consequently, both domain models attribute this domain to SRH1.</Paragraph> <Paragraph position="6"> SRH2 was not labeled with any domains in the Gold Standard, as this hypothesis is incoherent and hence cannot be considered to belong to any domain at all. According to DM_anno, its representation CR2a receives a single score of 1 for the domain Route Planning, while CR2b receives multiple equal scores. DOMSCORE interprets a single score as a more reliable indicator of a specific domain than multiple equal scores and assigns the domain Route Planning to SRH2. On the basis of DM_tfidf, the highest overall score for CR2a and CR2b is the one for the domain Electronic Program Guide; therefore, this model assigns that domain to SRH2.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Word2Concept ratio </SectionTitle> <Paragraph position="0"> In previous experiments (Gurevych et al., 2003a), we found that when operating on sets of concepts as representations of speech recognition hypotheses, the ratio of the number of ontological concepts n in a given CR to the total number of words w in the respective SRH must be accounted for. This relation is defined by the ratio $R = n/w$.</Paragraph> <Paragraph position="1"> The idea is to prevent an incoherent SRH containing many function words with zero concept mappings, represented by a single concept in the extreme case, from being classified as coherent. Experimental results indicate that the optimal threshold for R is 0.33. This means that if, on average, more than three words correspond to a single concept, the SRH is likely to be incoherent and should be excluded from processing; a sketch of the scoring and filtering steps follows below.</Paragraph>
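<Paragraph> A minimal sketch of the scoring and ratio-filtering steps described in Sections 4.2 and 4.3; the domain model below is an invented stand-in for DM_anno or DM_tfidf, and the sketch omits DOMSCORE's preference for a single high score over multiple equal scores:

# Illustrative domain model: DM[concept][domain] -> domain specificity score.
# Concepts, domains, and score values are invented stand-ins.
DM = {
    "Broadcast":     {"ElectronicProgramGuide": 1.0, "RoutePlanning": 0.0},
    "WatchProcess":  {"ElectronicProgramGuide": 0.8, "RoutePlanning": 0.0},
    "MotionProcess": {"ElectronicProgramGuide": 0.1, "RoutePlanning": 0.9},
}
DOMAINS = ["ElectronicProgramGuide", "RoutePlanning"]

def domain_scores(cr):
    """S(CR, d): average of the concepts' scores for each domain (Section 4.2)."""
    return {d: sum(DM[c][d] for c in cr) / len(cr) for d in DOMAINS}

def passes_ratio_test(cr, num_words, threshold=0.33):
    """Word2Concept ratio filter (Section 4.3): R = n/w must reach the threshold."""
    return len(cr) / num_words >= threshold

def domscore(cr, num_words):
    """Score a CR, dropping the scores if its SRH fails the ratio test."""
    if not passes_ratio_test(cr, num_words):
        return None  # SRH treated as incoherent; no domain is assigned
    return domain_scores(cr)

# Two concepts from a six-word SRH: R = 1/3, passes -> scores are returned.
print(domscore({"Broadcast", "WatchProcess"}, num_words=6))
# One concept from a five-word SRH: R = 1/5, fails -> None.
print(domscore({"MotionProcess"}, num_words=5))
</Paragraph>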
<Paragraph position="2"> DOMSCORE implements this ratio test as a post-processing step. For both conceptual representations of SRH1 the ratio is $R = 1/3$, whereas for those of SRH2 we find $R = 1/5$. The latter value is below the threshold, which means that SRH2 is considered incoherent and its domain scores are dropped. Finally, this results in both models assigning the single domain Electronic Program Guide as the best one to the utterance in Example 1.</Paragraph> </Section> </Section> </Paper>