<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2312">
  <Title>Resolution of Lexical Ambiguities in Spoken Dialogue Systems</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 State of the Art
</SectionTitle>
    <Paragraph position="0"> After work on WSD had overcome so-called early doubts (Ide and Veronis, 1998) in the 1960s, it was applied to various NLP tasks, such as machine translation, information retrieval, content and grammatical analysis and text processing. Yarowsky (1995) used both supervised and unsupervised WSD for correct phonetization of words in speech synthesis. However, there is no recorded work on processing speech recognition hypotheses resulting from speech utterances, as is done in our research.</Paragraph>
    <Paragraph position="1"> In general, following Ide and Veronis (1998), the various WSD approaches of the past can be divided into two types, i.e., data- and knowledge-based approaches.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Data-based Methods
</SectionTitle>
      <Paragraph position="0"> Data-based approaches extract their information directly from texts and are divided into supervised and unsupervised methods (Yarowsky, 1995; Stevenson, 2003).</Paragraph>
      <Paragraph position="1"> Supervised methods work with a given (and therefore limited) set of potential classes in the learning process.</Paragraph>
      <Paragraph position="2"> For example, Yarowsky (1992) used a thesaurus to generate 1042 statistical models of the most general categories. Weiss (1973) already showed that disambiguation rules can successfully be learned from hand-tagged corpora.</Paragraph>
      <Paragraph position="3"> Despite the small size of his training and test corpus, an accuracy of 90% was achieved. Even better results were obtained by Kelly and Stone (1975), who included collocational, syntactic and part-of-speech information to yield an accuracy of 93% on a larger corpus. As always, supervised methods require a manually annotated learning corpus.</Paragraph>
      <Paragraph position="4"> Unsupervised methods do not determine the set of classes before the learning process, but through analysis of the given data by identifying clusters of similar cases. One example is the algorithm for clustering by committee described by Pantel and Lin (2003), which automatically discovers word senses from text. Generally, unsupervised methods require large amounts of data. In the case of spoken dialogue and speech recognition output, sufficient amounts of data will hopefully become available once multi-domain spoken dialogue systems are deployed in real-world applications.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Knowledge-based Methods
</SectionTitle>
      <Paragraph position="0"> Knowledge-based approaches work with lexica and/or ontologies. The kind of knowledge varies widely and machine-readable as well as computer lexica are employed. The knowledge-based approach employed herein (Gurevych et al., 2003a) operates on an ontology partially derived from FrameNet data (Baker et al., 1998) and described by Gurevych et al. (2003b).</Paragraph>
      <Paragraph position="1"> In a comparable approach, Sussna (1993) worked with the lexical reference system WordNet and used a similar metric for calculating the semantic distance of a number of input lexemes. Depending on the type of semantic relation (hyperonymy, synonymy etc.), different weights are given, and his metric takes account of the number of arcs of the same type leaving a node and the depth of a given edge in the overall tree. The disambiguation results on textual data reported by Sussna (1993) turned out to be significantly better than chance. In contrast to much other work on WSD with WordNet, he took into account not only the isa hierarchy, but other relational links as well. The method is, therefore, similar to the one used in this evaluation, with the difference that this one uses a Semantic Web-conformant ontology instead of WordNet and is applied to speech recognition hypotheses (SRHs). The fact that our WSD work is done on SRHs makes it difficult to compare the results with methods evaluated on textual data, such as in the past SENSEVAL studies (Edmonds, 2002).</Paragraph>
      <Paragraph position="2"> The ontology-based system has been successfully used for a set of tasks such as finding the best speech recognition hypotheses from sets of competing SRHs, labeling SRHs as correct or incorrect representations of the user's intention and scoring their degree of contextual coherence (Gurevych et al., 2003a; Porzel and Gurevych, 2003; Porzel et al., 2003). In general, the system offers an additional way of employing ontologies, i.e. to use the knowledge modeled therein as the basis for evaluating the semantic coherence of sets of concepts. It can be employed independently of the specific ontology language used, as the underlying algorithm operates only on the nodes and named edges of the directed graph represented by the ontology. The specific knowledge base, e.g. written in OIL-RDFS, DAML+OIL or OWL,1 is converted into a graph consisting of the class hierarchy, with each class corresponding to a concept representing either an entity or a process, and their slots, i.e. the named edges of the graph corresponding to the class properties, constraints and restrictions.</Paragraph>
      <Paragraph position="3"> 1OIL-RDFS, DAML+OIL and OWL are frequently used knowledge modeling languages originating in W3C and Semantic Web projects. For more details, see www.w3c.org/RDF, www.w3c.org/OWL and www.daml.org.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Data and Annotation Experiment
</SectionTitle>
    <Paragraph position="0"> In this section we describe the data collection and annotation experiments performed in order to obtain independent data sets for training and evaluation.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Data Collection
</SectionTitle>
      <Paragraph position="0"> The first data set, which was used for training the supervised model, is described in Gurevych et al. (2002b) and was collected using the so-called Hidden Operator Test (Rapp and Strube, 2002). This procedure represents a simplification of classical end-to-end experiments and Wizard-of-Oz experiments (Francony et al., 1992), as it can be conducted without the technically very complex use of a real or a seemingly real conversational system. The subjects are prompted to ask for specific information and the system response is pre-manufactured. We prompted 29 subjects to say certain inputs in 8 dialogues; 1479 turns were recorded. In our experimental setup each user turn in the dialogue corresponded to a single illocution, e.g.</Paragraph>
      <Paragraph position="1"> route request or sights information request as described by Gurevych et al. (2002a).</Paragraph>
      <Paragraph position="2"> The second data set was used for testing the data- and ontology-based systems and will thus be called the test corpus. It was produced by means of Wizard-of-Oz experiments (Francony et al., 1992). In this type of setting, a full-blown multimodal dialogue system is simulated by a team of hidden human operators. A test person communicates with the supposed system, and the dialogues are recorded and filmed digitally. Here over 224 subjects produced 448 dialogues (Schiel et al., 2002), employing the same domains and tasks as in the first data collection.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Data Pre-Processing
</SectionTitle>
      <Paragraph position="0"> After manual segmentation of the data into single utterances, the resulting audio files were manually transcribed. The segmented audio files were handed to the speech recognition engine integrated in the SMARTKOM dialogue system (Wahlster, 2003). Employing the semantic parsing system described by Engel (2002), the corresponding speech recognition word lattices (Oerder and Ney, 1993) were first transformed into n-best lists of so-called hypothesis sequences. These were mapped onto conceptual representations, which contain the multiple semantic interpretations of the individual hypothesis sequences that arise due to lexical ambiguities.</Paragraph>
      <Paragraph position="1"> For obtaining the training data, we used only the best, correct and perfectly disambiguated speech recognition hypotheses, as described by Porzel et al. (2003), from the first data set of 552 utterances. For obtaining the test data, we took a random sample of 3100 utterances from the second data set. This seeming discrepancy between training and test data is due to the fact that only a part of the test data set actually contains ambiguous lexical items and many of the utterances are quite similar to each other.</Paragraph>
      <Paragraph position="2"> For example, given the utterance shown in its transcribed form in example (1), we then obtained the sequence of recognition hypotheses shown in examples (1a) - (1e).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Annotation
</SectionTitle>
      <Paragraph position="0"> We employed VISTAE2 (Müller, 2002) for annotating the data and for creating the corresponding gold standards for the training and test corpora. The annotation of the data was done by two persons specially trained for the annotation tasks, with two purposes: • First of all, if humans are able to annotate the data reliably, it is more plausible that machines are able to do so as well. This was the case, as shown by the resulting inter-annotator agreement of 78.89%.</Paragraph>
      <Paragraph position="1"> • Secondly, a gold standard is needed to evaluate the systems' performances. For that purpose, the annotators reached an agreement on those annotated items of the test data on which they had initially differed.</Paragraph>
      <Paragraph position="2"> The resulting gold-standard represents the highest degree of correctly disambiguated data and is used for comparison with the tagged data produced by the disambiguation systems.</Paragraph>
      <Paragraph position="3"> The class-based kappa statistic (Cohen, 1960; Carletta, 1996) cannot be applied here, as the classes vary depending on the number of ambiguities per entry in the lexicon. In addition, a further class, i.e., not-decidable, was allowed for cases such as SRH (1c), where it is impossible to assign sensible meanings. Altogether, the test data set was annotated with 2219 markables of ambiguous tokens, stemming from 70 ambiguous words occurring in the test corpus.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Calculating the Baselines
</SectionTitle>
      <Paragraph position="0"> For calculating the majority class baseline, which in our case corresponds to the performance of a unigram tagger, we applied the method described in Porzel and Malaka (2004). To this end, all markables in the gold standard were counted and, based on the frequency of each concept of each ambiguous lexeme, the percentage of concepts correctly chosen by selecting the most frequent meaning was calculated. This resulted in a baseline of 52.48% for the test data set.</Paragraph>
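      <Paragraph> The baseline computation described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the procedure of Porzel and Malaka (2004) itself: the function name majority_baseline, the (lexeme, concept) input format and the toy markables are hypothetical and are not taken from the corpus.</Paragraph>
      <Paragraph>
```python
from collections import Counter, defaultdict


def majority_baseline(markables):
    """Accuracy of always choosing the most frequent concept of each lexeme.

    `markables` is a list of (lexeme, gold_concept) pairs; this input
    format is an assumption made for the sketch.
    """
    counts = defaultdict(Counter)
    for lexeme, concept in markables:
        counts[lexeme][concept] += 1
    # The most frequent concept of a lexeme is correct exactly as often
    # as it occurs in the gold standard.
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(markables)


# Toy gold standard, invented for illustration only.
toy_markables = [("w1", "A"), ("w1", "A"), ("w1", "B"),
                 ("w2", "C"), ("w2", "D")]
print(majority_baseline(toy_markables))
```
      </Paragraph>
      <Paragraph> On the toy data, three of the five markables are covered by the most frequent concept of their lexeme. The sketch assumes a non-empty list of markables.</Paragraph>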
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Word Sense Disambiguation Systems
</SectionTitle>
    <Paragraph position="0"> Both word sense disambiguation systems described herein were developed and tested within the SMARTKOM research framework. As one of the most advanced current systems, SMARTKOM (Wahlster, 2003) comprises a large set of input and output modalities together with an efficient fusion and fission pipeline. SMARTKOM features speech input with prosodic analysis, gesture input via infrared camera, and recognition of facial expressions and their emotional states. On the output side, the system features a gesturing and speaking life-like character together with displayed generated text and multimedia graphical output. It currently comprises nearly 50 modules running on a parallel virtual machine-based integration software called Multiplatform3 described in Herzog et al. (2003).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 The Knowledge-driven System
</SectionTitle>
      <Paragraph position="0"> The ontology employed for the evaluation has about 800 concepts and 200 relations (apart from the isa relations defining the general taxonomy) and is described by Gurevych et al. (2003b). It includes a generic top-level ontology whose purpose is to provide a basic structure of the world, i.e. abstract classes to divide the universe into distinct parts as resulting from the ontological analysis.4 The modeling of Processes and Physical Objects as a kind of event that is continuous and homogeneous in nature follows the frame semantic analysis used for generating the FRAMENET data (Baker et al., 1998).</Paragraph>
      <Paragraph position="1">  lined in Russell and Norvig (1995).</Paragraph>
      <Paragraph position="2"> The hierarchy of Processes is connected to the hierarchy of Physical Objects via slot-constraint definitions herein referred to as relations.</Paragraph>
      <Paragraph position="3"> The system performs a number of processing steps. A first preprocessing step is to convert each SRH into a concept representation (CR). For that purpose the system's lexicon is used, which contains zero, one or many corresponding concepts for each entry. A simple vector of concepts, corresponding to the words in the SRH for which entries in the lexicon exist, constitutes each resulting CR. All other words with empty concept mappings, e.g. articles, are ignored in the conversion. Due to lexical ambiguity, i.e. the one-to-many word-concept mappings, this processing step yields a set $I = \{CR_1, CR_2, \ldots, CR_n\}$ of possible interpretations for each SRH.</Paragraph>
      <Paragraph position="4"> For example, the words occurring in an SRH such as (2) have the corresponding entries in the lexicon that are shown below.</Paragraph>
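      <Paragraph> The conversion step described above amounts to taking the Cartesian product of the candidate concepts of all words that have a lexicon entry. The following Python sketch illustrates this under stated assumptions: the function name concept_representations, the dictionary-based lexicon and the toy words and concepts are hypothetical, and the special treatment of the concept none is not modeled.</Paragraph>
      <Paragraph>
```python
from itertools import product


def concept_representations(srh_words, lexicon):
    """Expand one SRH into the list of all concept representations (CRs).

    `lexicon` maps a word to a list of candidate concepts. Words that are
    missing from the lexicon or map to an empty list are ignored.
    """
    candidates = [lexicon[w] for w in srh_words if lexicon.get(w)]
    # One CR per combination of candidate concepts, in word order.
    return [list(cr) for cr in product(*candidates)]


# Toy lexicon, invented for illustration only.
toy_lexicon = {"the": [], "w1": ["A", "B"], "w2": ["C"]}
for cr in concept_representations(["the", "w1", "w2"], toy_lexicon):
    print(cr)
```
      </Paragraph>
      <Paragraph> In the toy example the ambiguous word w1 yields two CRs, one per candidate concept, while the word without a concept mapping contributes nothing.</Paragraph>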
      <Paragraph position="6"> Since we have multiple concept entries for individual words, i.e. lexical ambiguities, we get a resulting set $I$ of concept representations.</Paragraph>
      <Paragraph position="8"> The concept representations consist of a different number of concepts, because the concept none is not represented in the CRs. The concept none is assigned to lexemes which have one (or more than one) meaning outside the SmartKom domains or constitute functional grammatical markers.</Paragraph>
      <Paragraph position="9"> The system then converts the domain model, i.e. an ontology, into a directed graph with concepts as nodes and relations as edges. In order to find the shortest path between two concepts, the ONTOSCORE system employs the single-source shortest-path algorithm of Dijkstra (Cormen et al., 1990). Thus, the minimal paths connecting a given concept $c_i$ with every other concept in CR (excluding $c_i$ itself) are selected, resulting in an $n \times n$ matrix of the respective paths. To score the minimal paths connecting all concepts with each other in a given CR, a method proposed by Demetriou and Atwell (1994) to score the semantic coherence of alternative sentence interpretations against graphs based on the Longman Dictionary of Contemporary English (LDOCE) was used in the original system.5 The new addition made for this evaluation was to assign different weights to the individual relations found by the algorithm, depending on their level of granularity within the relation hierarchy. For example, a broad-level relation such as has-theme, which is found in the class statement of Process, is weighted with negative 1 as it has only one super-relation, i.e. has-role, whereas a more specific relation such as has-actor is weighted with negative 4 because it has four super-relations, i.e. has-artist, has-associated-person(s), has-attribute and has-role.</Paragraph>
      <Paragraph position="10"> As before, the algorithm selects from the set of all paths between two concepts the one with the smallest weight, i.e. the cheapest. The distances between all concept pairs in CR are summed up to a total score.6 The set of concepts with the lowest aggregate score represents the combination with the highest semantic relatedness.</Paragraph>
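      <Paragraph> The path-based scoring described above can be illustrated with the following Python sketch. It is not the ONTOSCORE implementation: the function names, the adjacency-list graph format and the toy graph are hypothetical. The sketch further assumes non-negative edge costs, as Dijkstra's algorithm requires, so the relation-specific weighting described above is not reproduced, and it sums over unordered concept pairs.</Paragraph>
      <Paragraph>
```python
import heapq
from itertools import combinations


def shortest_costs(graph, source):
    """Dijkstra: cheapest cost from `source` to every reachable node.

    `graph` maps a node to a list of (neighbour, cost) pairs, where all
    costs are assumed to be non-negative.
    """
    dist = {}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node in dist:
            continue  # node was already settled with a cheaper or equal cost
        dist[node] = cost
        for neighbour, weight in graph.get(node, []):
            if neighbour not in dist:
                heapq.heappush(queue, (cost + weight, neighbour))
    return dist


def cr_score(graph, cr):
    """Sum of the cheapest path costs over all concept pairs in one CR."""
    total = 0
    for a, b in combinations(cr, 2):
        total += shortest_costs(graph, a).get(b, float("inf"))
    return total


def best_cr(graph, crs):
    """Select the CR with the lowest aggregate score."""
    return min(crs, key=lambda cr: cr_score(graph, cr))


# Toy graph, invented for illustration; every edge is listed in both
# directions so that the graph can be traversed either way.
toy_graph = {
    "A": [("C", 1)],
    "B": [("D", 1)],
    "C": [("A", 1), ("D", 2)],
    "D": [("C", 2), ("B", 1)],
}
print(best_cr(toy_graph, [["B", "C"], ["A", "C"]]))
```
      </Paragraph>
      <Paragraph> In the toy graph the concepts A and C are directly connected, whereas B and C are only connected via D, so the CR containing A and C receives the lower aggregate score and is selected. Unreachable concept pairs receive an infinite cost in this sketch.</Paragraph>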
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 The Data-driven System
</SectionTitle>
      <Paragraph position="0"> In this section we describe the implementation of the statistical learning techniques employed for the task of performing WSD on our corpus of spoken dialogue data.</Paragraph>
      <Paragraph position="1"> 5 [...] $\{w_1, w_2, \ldots, w_n\}$ is the set of corresponding weights, where the weight of each isa relation is set to [...] and that of each other relation to [...].</Paragraph>
      <Paragraph position="2"> 6Note that more specific relations subtract more than less specific ones from the aggregate score.</Paragraph>
      <Paragraph position="3"> For our experiments we took the general purpose statistical tagger TnT (Brants, 2000), which is generally used for part-of-speech tagging. It employs a Viterbi algorithm for second order Markov models (Rabiner, 1989), linear interpolation for smoothing and deleted interpolation for determining the weights. According to Edmonds (2002), WSD is in many ways similar to part-of-speech tagging, as it involves labeling every word in a text with a tag from a pre-specified set of tag possibilities for each word by using features of the context and other information.</Paragraph>
      <Paragraph position="4"> This, together with the fact that we do not find cross-paradigmatic ambiguities in our data, led to the idea of using a part-of-speech tagger as a concept tagger.</Paragraph>
      <Paragraph position="5"> In our case the tagset consisted of part-of-speech specific concepts of the SmartKom Ontology. The data we used for preparing the model consisted of a combination of three gold-standard annotations, namely the best SRHs, the correct SRHs and the correctly disambiguated SRHs as described in Section 3.3. These were listed lexeme by lexeme with their corresponding concepts in a file in the format expected by TnT. TnT used the file to produce a new model, consisting of a trigram model and a lexicon with lexemes and the concepts which corresponded to them as shown in Figure 1.</Paragraph>
      <Paragraph position="6"> As one can see in Table 1, in our corpus the concept Greeting occurred 38 times and was followed 20 times by Person, which itself was followed 13 times by EmotionExperiencerSubjectProcess. This is equivalent to an utterance beginning with &quot;Hello, I want ...&quot;.</Paragraph>
      <Paragraph position="7"> The lexicon (see Table 2) shows how often a certain lexeme was tagged with which concept. For example, the German TV channel ARD was tagged in all occurrences with the concept Channel. The German preposition am (at) occurred 17 times; in 12 cases it was tagged as TwoPointRelation, in one case as TemporalTwoPointRelation and in 4 cases with none. In cases in which the tagger cannot decide between different concepts because of missing context, it chooses the concept that occurred most frequently in the model according to the lexicon.</Paragraph>
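      <Paragraph> The lexicon fallback described above can be sketched in Python as follows, using the counts reported for the preposition am. This is an illustration only and not the implementation of TnT: the function name fallback_concept and the nested-dictionary lexicon format are assumptions made for the sketch.</Paragraph>
      <Paragraph>
```python
# Concept counts for the lexeme "am" as reported in the text; the
# nested-dictionary format is an assumption made for this sketch.
am_lexicon = {
    "am": {"TwoPointRelation": 12, "TemporalTwoPointRelation": 1, "none": 4},
}


def fallback_concept(lexicon, lexeme):
    """Return the concept with which `lexeme` was tagged most often."""
    counts = lexicon[lexeme]
    return max(counts, key=counts.get)


print(fallback_concept(am_lexicon, "am"))
```
      </Paragraph>
      <Paragraph> Without disambiguating context, the sketch selects TwoPointRelation for am, since that concept accounts for 12 of its 17 occurrences.</Paragraph>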
    </Section>
  </Section>
</Paper>