<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2808"> <Title>Making Relative Sense: From Word-graphs to Semantic Frames</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Ontology-based Scoring and Tagging </SectionTitle> <Paragraph position="0"> The Ontology Used: The ontology used in the experiments described herein was initially designed as a general-purpose component for knowledge-based NLP. It includes a top-level ontology developed following the procedure outlined by Russell and Norvig (1995) and originally covered the tourism domain, encoding knowledge about sights, historical persons and buildings. Then, the existing ontology was adopted in the SMARTKOM project (Wahlster et al., 2001) and modified to cover a number of new domains, e.g., new media and program guides, pedestrian and car navigation and more (Gurevych et al., 2003b). The top-level ontology was re-used with some slight extensions. Further developments were motivated by the need for a process hierarchy.</Paragraph> <Paragraph position="1"> This hierarchy models processes which are domain-independent in the sense that they can be relevant for many domains, e.g., InformationSearchProcess. The modeling of Process as a kind of event that is continuous and homogeneous in nature follows the frame-semantic analysis used in the FRAMENET project (Baker et al., 1998).</Paragraph> <Paragraph position="2"> The role structure also reflects the general intention to keep abstract and concrete elements apart. A set of most general properties has been defined with regard to the role an object can play in a process: agent, theme, experiencer, instrument (or means), location, source, target, path. These general roles, applied to concrete processes, may also have subroles: thus the agent in a process of buying (TransactionProcess) is a buyer, and the one in a process of cognition is a cognizer. In this way, roles can also build hierarchical trees. The property theme in the process of information search is a required piece-of-information; in PresentationProcess it is a presentable-object, i.e., the entity that is to be presented. The OntoScore System: The ONTOSCORE software runs as a module in the SMARTKOM multi-modal and multi-domain spoken dialogue system (Wahlster, 2003).</Paragraph> <Paragraph position="3"> The system features the combination of speech and gesture as its input and output modalities. The domains of the system include cinema and TV program information, home electronic device control as well as mobile services for tourists, e.g. tour planning and sights information.</Paragraph> <Paragraph position="4"> ONTOSCORE operates on n-best lists of SRHs produced by the language interpretation module out of the ASR word graphs. It computes a numerical ranking of alternative SRHs and thus provides an important aid to the spoken language understanding component. More precisely, the task of ONTOSCORE in the system is to identify the best SRH suitable for further processing and to evaluate it in terms of its contextual coherence against the domain and discourse knowledge.</Paragraph> <Paragraph position="5"> ONTOSCORE performs a number of processing steps.</Paragraph> <Paragraph position="6"> First, each SRH is converted into a concept representation (CR). For that purpose we augmented the system's lexicon with specific concept mappings, i.e. for each entry in the lexicon either zero, one or many corresponding concepts were added.
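To make the word-to-concept mapping step concrete, here is a minimal sketch in Python. The lexicon fragment, the concept names and the German forms are invented for illustration (loosely modeled on the paper's later example) and are not the actual SMARTKOM lexicon entries or ontology concepts; the cross-product expansion anticipates the set of candidate CRs described next.

```python
from itertools import product

# Hypothetical lexicon fragment: each surface form maps to zero, one or many
# ontology concepts; zero-mapping entries (e.g. articles) are ignored.
LEXICON = {
    "ich":     ["Person"],
    "komme":   ["MotionDirectedProcess", "ScheduledEventProcess"],  # ambiguous verb
    "zum":     [],                                                  # no concept mapping
    "schloss": ["Sight", "LockDevice"],                             # ambiguous noun
}

def concept_representations(srh_tokens):
    """Map an SRH (a list of word forms) to all candidate concept
    representations (CRs) licensed by the word-to-concept mappings."""
    per_word = [LEXICON.get(tok.lower(), []) for tok in srh_tokens]
    per_word = [concepts for concepts in per_word if concepts]  # drop empty mappings
    # Lexical ambiguity yields one candidate CR per combination of word senses.
    return [list(cr) for cr in product(*per_word)]

if __name__ == "__main__":
    for cr in concept_representations(["ich", "komme", "zum", "Schloss"]):
        print(cr)   # four candidate CRs for this toy SRH
```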
A simple vector of concepts, corresponding to the words in the SRH for which entries in the lexicon exist, constitutes each resulting CR. All other words, i.e. those with empty concept mappings such as articles and aspectual markers, are ignored in the conversion. Due to lexical ambiguity, i.e. one-to-many word-concept mappings, this processing step yields a set $\{CR_1, CR_2, \ldots, CR_n\}$ of possible interpretations for each SRH.</Paragraph> <Paragraph position="7"> Next, ONTOSCORE converts the domain model, i.e. an ontology, into a directed graph with concepts as nodes and relations as edges. In order to find the shortest path between two concepts, ONTOSCORE employs the single-source shortest path algorithm of Dijkstra (Cormen et al., 1990). Thus, the minimal paths connecting a given concept $c_i$ with every other concept in the CR (excluding $c_i$ itself) are selected, resulting in an $n \times n$ matrix of the respective paths.</Paragraph> <Paragraph position="8"> To score the minimal paths connecting all concepts with each other in a given CR, Gurevych et al. (2003a) adopted a method proposed by Demetriou and Atwell (1994) to score the semantic coherence of alternative sentence interpretations against graphs based on the Longman Dictionary of Contemporary English (LDOCE). As defined by Demetriou and Atwell (1994), $R = \{r_1, r_2, \ldots, r_n\}$ is the set of direct relations (both isa and semantic relations) that can connect two nodes (concepts), and $W = \{w_1, w_2, \ldots, w_n\}$ is the set of corresponding weights, where each isa relation is assigned one fixed weight and each other relation another. The algorithm selects from the set of all paths between two concepts the one with the smallest weight, i.e. the cheapest. The distances between all concept pairs in the CR are summed up to a total score. The set of concepts with the lowest aggregate score represents the combination with the highest semantic relatedness. The ensuing distance between two concepts, e.g. $D(c_i, c_j)$, is then defined as the minimum score derived between $c_i$ and $c_j$. So far, a number of additional normalization steps, contextual extensions and relation-specific weighted scores have been proposed and evaluated (Gurevych et al., 2003a; Porzel et al., 2003a; Loos and Porzel, 2004). The ONTOSCORE module currently employs two knowledge sources: an ontology (about 800 concepts and 200 relations) and a lexicon (ca. 3,600 words) with word-to-concept mappings, covering the respective domains of the system.</Paragraph>
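The path-based coherence scoring can be pictured with the following sketch. The concept graph, its edges and the two weight constants are entirely hypothetical, and the normalization steps and contextual extensions mentioned above are omitted; the sketch only illustrates the core idea of summing cheapest-path costs between all concept pairs of a CR and preferring the interpretation with the lowest total.

```python
import heapq
from itertools import combinations

# Hypothetical weight constants: isa edges are assumed cheaper than other relations.
W_ISA, W_OTHER = 1.0, 2.0

# Hypothetical concept graph: node -> list of (neighbour, edge weight).
GRAPH = {
    "MotionDirectedProcess": [("Process", W_ISA), ("Sight", W_OTHER)],
    "Process":               [("Event", W_ISA)],
    "Person":                [("PhysicalObject", W_ISA), ("MotionDirectedProcess", W_OTHER)],
    "Sight":                 [("PhysicalObject", W_ISA)],
    "PhysicalObject":        [("Event", W_ISA)],
    "Event":                 [],
}

def dijkstra(graph, source):
    """Single-source shortest paths over the weighted concept graph."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue
        for neighbour, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(heap, (nd, neighbour))
    return dist

def coherence_score(graph, cr):
    """Sum of cheapest-path costs between all concept pairs of one CR;
    a lower total indicates higher semantic relatedness."""
    return sum(dijkstra(graph, c_i).get(c_j, float("inf"))
               for c_i, c_j in combinations(cr, 2))

def best_interpretation(graph, candidate_crs):
    """Select the candidate CR with the lowest aggregate score."""
    return min(candidate_crs, key=lambda cr: coherence_score(graph, cr))
```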
<Paragraph position="10"> A Motivating Example: Given the utterance shown in its transcribed form in example (1), we get as input the set of recognition hypotheses shown in examples (1a) to (1e), extracted from the word graph produced by the ASR system.</Paragraph> <Paragraph position="11"> The corresponding evaluation tasks are defined as follows:
- The task of hypothesis verification is considered to be solved successfully if the SRHs 1a to 1e are ranked in such a way that hypothesis 1e achieves the best score.
- The task of sense disambiguation is considered to be solved successfully if all ambiguous lexical items, such as the verb kommen in 1e, are tagged with their contextually adequate senses as given in the lexicon's word-to-concept mappings.
It is important to point out that there are at least two essential differences between semantic tagging of spontaneous speech and of its textual correlates, i.e.,
- a smaller size of processable context, as well as
- imperfections, hesitations, disfluencies and speech recognition errors.</Paragraph> For our evaluations we employ the ONTOSCORE system to select the best hypothesis, best sense and best relation, and compare its answers to the keys contained in corresponding gold standards produced by specific annotation experiments.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Hypotheses Disambiguation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Data and Annotation </SectionTitle> <Paragraph position="0"> The corresponding data collection is described in detail by Gurevych and Porzel (2004). In the first experiment 552 utterances were annotated within the discourse context, i.e. the SRHs were presented in their original dialogue order. In this experiment, the annotators saw the SRHs together with the transcribed user utterances. The task of the annotators was to determine the best SRH from the n-best list of SRHs corresponding to a single user utterance. The decision had to be made on the basis of several criteria. The most important criterion was how well the SRH captured the intentional content of the user's utterance. If none of the SRHs captured the user's intent adequately, the decision had to be made by looking at the actual word error rate. In this experiment the inter-annotator agreement was 90.69%, i.e. 1,247 markables out of 1,375. In a second experiment annotators had to label each SRH as being semantically coherent or incoherent, reaching an agreement of 79.91% (1,096 out of 1,375). Each corpus was then transformed into an evaluation gold standard by means of the annotators agreeing on a single solution for the cases of disagreement.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Evaluation Results </SectionTitle> <Paragraph position="0"> The evaluation of ONTOSCORE was carried out on a set of 95 dialogues. The resulting dataset contained 552 utterances resulting in 1,375 SRHs, corresponding to an average of 2.49 SRHs per user utterance. The corpus had been annotated by human subjects according to the specific annotation schemata described above.</Paragraph> <Paragraph position="1"> Identifying the Best SRH: The task of ONTOSCORE in our multimodal dialogue system is to determine the best SRH from the n-best list of SRHs corresponding to a given user utterance.
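Building on the two sketches above, selecting the best SRH from an n-best list can be pictured as scoring every candidate CR of every hypothesis and keeping the hypothesis whose cheapest CR has the lowest score. This is only a schematic composition of the earlier hypothetical helpers; the actual module additionally normalizes scores and takes discourse context into account.

```python
def best_srh(graph, nbest_srhs):
    """Pick the SRH (a list of word forms) whose best concept representation
    receives the lowest coherence score. Reuses concept_representations() and
    coherence_score() from the sketches above."""
    def srh_score(srh):
        crs = concept_representations(srh)
        return min((coherence_score(graph, cr) for cr in crs),
                   default=float("inf"))   # an SRH without any CR scores worst
    return min(nbest_srhs, key=srh_score)
```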
The baseline for this evaluation was computed by adding up the individual per-utterance ratios, i.e. one over the number of SRHs for a given utterance, which corresponds to the likelihood of guessing the best one in each individual case, and dividing the sum by the number of utterances, yielding an overall likelihood of guessing the best SRH of 63.91%. The accuracy of ONTOSCORE on this task amounts to 86.76%. This means that in 86.76% of all cases the best SRH as defined by the human gold standard is among the best scored by the ONTOSCORE module.</Paragraph> <Paragraph position="2"> Classifying the SRHs as Semantically Coherent versus Incoherent: For this evaluation we used the same corpus, in which each SRH was labeled as being either semantically coherent or incoherent with respect to the previous discourse context. We defined a baseline based on the majority class in the corpus, i.e. coherent, at 63.05%.</Paragraph> <Paragraph position="3"> In order to obtain a binary classification into semantically coherent and incoherent SRHs, a cutoff threshold must be set. Employing a cutoff threshold of 0.44, we find that the contextually enhanced ONTOSCORE system correctly classifies 70.98% of the SRHs in the corpus.</Paragraph> <Paragraph position="4"> From these results we can conclude that the task of an absolute classification into coherent versus incoherent is substantially more difficult than that of determining the best SRH, both for human annotators and for ONTOSCORE. Both human and system reliability are lower in the coherent-versus-incoherent classification task, which allows zero, one or multiple SRHs from one utterance to be classified as coherent or incoherent. In both tasks, however, ONTOSCORE's performance mirrors and approaches human performance.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Sense Disambiguation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Data and Annotation </SectionTitle> <Paragraph position="0"> The second data set was produced by means of Wizard-of-Oz experiments (Francony et al., 1992). In this type of setting a full-blown multimodal dialogue system is simulated by a team of hidden human operators. A test person communicates with the supposed system and the dialogues are recorded and filmed digitally. Here, over 224 subjects produced 448 dialogues (Schiel et al., 2002), employing the same domains and tasks as in the first data collection. In this annotation task annotators were given the recognition hypotheses together with a corresponding list of ambiguous lexemes automatically retrieved from the system's lexicon and their possible senses, from which they had to pick one or select not-decidable for cases where no coherent meaning was detectable.</Paragraph> <Paragraph position="1"> Firstly, we examined whether humans are able to annotate the data reliably. Again, this was the case, as shown by the resulting inter-annotator agreement of 78.89%. Secondly, a gold standard is needed to evaluate the system's performance. For that purpose, the annotators reached an agreement on those annotated items of the test data which had differed in the first place.
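The guessing baseline and the threshold-based coherence classification described above can be sketched as follows. The helper names are hypothetical, and the direction of the threshold comparison (together with the assumption that the score has been normalized so that higher means more coherent) is an assumption of this sketch, not something stated in the paper.

```python
def guessing_baseline(srh_counts):
    """Average, over all utterances, of 1 / (number of SRHs for that utterance),
    i.e. the chance of picking the best SRH at random in each case."""
    return sum(1.0 / n for n in srh_counts) / len(srh_counts)

def classify_coherence(normalized_score, threshold=0.44):
    """Binary coherent/incoherent decision via a fixed cutoff threshold."""
    return "coherent" if normalized_score >= threshold else "incoherent"

# Toy usage: three utterances with 1, 2 and 4 SRHs respectively.
print(guessing_baseline([1, 2, 4]))   # 0.583..., i.e. about 58.3 %
print(classify_coherence(0.61))       # 'coherent'
```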
The ensuing gold standard was altogether annotated with 2,225 markables of ambiguous tokens, stemming from 70 ambiguous words occurring in the test corpus.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Evaluation Results </SectionTitle> <Paragraph position="0"> For calculating the majority-class baselines, all markables in the gold standards were counted. Based on the frequency of each concept of each ambiguous lexeme, the percentage of concepts that would be chosen correctly by always selecting the most frequent meaning, i.e. without the help of a system, was calculated by means of the formula given by Porzel and Malaka (2004). This resulted in a baseline of 52.48% for the test data set.</Paragraph> <Paragraph position="1"> For this evaluation, ONTOSCORE transformed the SRHs from our corpus into concept representations as described in Section 2. To perform the WSD task, ONTOSCORE calculates a coherence score for each of these concept sets. The concepts in the highest-ranked set are considered to be the ones representing the correct word meanings in this context. In this experiment we used ONTOSCORE in two variations: in the first variation, the relations between two concepts are weighted with one fixed value for taxonomic relations and another for all others. The second mode allows each non-taxonomic relation to be assigned an individual weight depending on its position in the relation hierarchy. That means the relations are weighted according to their level of generalization: more specific relations should indicate a higher degree of semantic coherence and are therefore weighted more cheaply, which means that they more likely assign the correct meaning. Compared to the gold standard, the original method of Gurevych et al. (2003a) reached a precision of 63.76%, as compared to 64.75% for the new method described herein.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Relation Tagging </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Data and Annotation </SectionTitle> <Paragraph position="0"> For this annotation we employed a subset of the second data set, i.e. we looked only at the hypotheses identified as being the best one (see above). For these utterance representations the semantic relations that hold between the predicate (in our case concepts that are part of the ontology's Process hierarchy) and the entities (in our case concepts that are part of the ontology's Physical Object hierarchy) had to be identified. The inter-annotator agreement on this task amounted to 79.54%.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Evaluation Results </SectionTitle> <Paragraph position="0"> For evaluating the performance of the ONTOSCORE system we defined a match as accurate if the correct semantic relation (role) was chosen by the system for the corresponding concepts contained therein [1]. As inaccurate we counted, in analogy to the word error rates in speech recognition, the following cases (see the sketch after this list):
- deletions, i.e. missing relations in places where one ought to have been identified;
- insertions, i.e. postulating any relation to hold where none ought to have been; and
- substitutions, i.e. postulating a specific relation to hold where some other ought to have been.</Paragraph>
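The error bookkeeping for relation tagging can be pictured with the following minimal sketch. The pair-keyed data layout, the relation labels and the helper name are assumptions made for illustration, and the special case of relation chains discussed below is not modeled; the toy usage mirrors the substitution in Example 2.

```python
def relation_errors(gold, system):
    """Compare gold and system relation taggings for one utterance representation.
    Both arguments map a (process_concept, entity_concept) pair to the relation
    (role) tagged for it, or to None if no relation was tagged for that pair.
    Returns counts of accurate matches, substitutions, deletions and insertions."""
    counts = {"accurate": 0, "substitution": 0, "deletion": 0, "insertion": 0}
    for pair in set(gold) | set(system):
        gold_rel, sys_rel = gold.get(pair), system.get(pair)
        if gold_rel is None and sys_rel is None:
            continue
        elif gold_rel is None:          # relation postulated where none ought to hold
            counts["insertion"] += 1
        elif sys_rel is None:           # relation missing where one ought to have been
            counts["deletion"] += 1
        elif sys_rel == gold_rel:       # correct role chosen
            counts["accurate"] += 1
        else:                           # some other role chosen
            counts["substitution"] += 1
    return counts

# Toy usage, modeled on the substitution discussed below:
gold = {("MDT", "Agent"): "has-agent", ("MDT", "Sight"): "has-goal"}
system = {("MDT", "Agent"): "has-agent", ("MDT", "Sight"): "has-source"}
print(relation_errors(gold, system))    # 1 accurate, 1 substitution
```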
<Paragraph position="1"> An example of a substitution in this task is given by the SRH shown in Example 2.</Paragraph> <Paragraph position="2"> (2) [German SRH and gloss omitted; the gloss ends in "castle"]</Paragraph> <Paragraph position="3"> In this case the sense disambiguation was accurate, so that the two ambiguous entities, i.e. kommen and Schloss, were correctly mapped onto a MotionDirectedTransliterated (MDT) process and a Sight object; the concept Person resulted from an unambiguous word-to-concept mapping from the form I. The error in this case was the substitution of [has-goal] with the relation [has-source], as shown below:
[MDT] [has-agent] [Agent]
[MDT] [has-source] [Sight]
As a special case of substitution we also counted as inaccurate those cases where a relation chain was selected by the algorithm; while such chains, e.g. metonymic chains, are possible in principle and in some domains not infrequent, they are not to be expected in the still relatively simple and short dialogues that constitute our data [2]. Therefore cases such as the connection between WatchPerceptualProcess (WPP) and Sight shown in Example 3 were counted as substitutions, because simpler ones should have been found or modeled [3].</Paragraph> <Paragraph position="4"> [1] Regardless of whether they were the correct senses or not as defined in the sense disambiguation task.</Paragraph> <Paragraph position="5"> [2] This, in turn, also sheds light on the paucity of the capabilities that current state-of-the-art systems exhibit. [3] We are quite aware that such an evaluation is as much a test of the knowledge store as of the processing algorithms. We will discuss this in Section 7.</Paragraph> <Paragraph position="6"> As deletions we counted those cases where the annotators (more specifically the ensuing gold standard) had tagged a specific relation, such as [WPP] [has-watchable-object] [Sight], that was not tagged at all by the system. As insertions we counted the opposite case, i.e. cases where a relation, e.g. between [Agent] and [Sight] in Example (2), was tagged by the system although none ought to hold.</Paragraph> <Paragraph position="7"> Compared to the human gold standard we obtained an accuracy of 76.31% and inaccuracies of 15.32% substitutions, 7.11% deletions and 1.26% insertions.</Paragraph> </Section> </Section> </Paper>