<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0207">
  <Title>Annotating Semantic Consistency of Speech Recognition Hypotheses</Title>
  <Section position="3" start_page="3" end_page="3" type="metho">
    <SectionTitle>
2 Domain Modeling in SmartKom
</SectionTitle>
    <Paragraph position="0"> The SmartKom research project (a consortium of twelve academic and industrial partners) aims at developing a multi-modal and multi-domain information system. Domains include cinema information, home electronic device control, etc. A central goal is the development of new computational methods for disambiguating different modalities on semantic and pragmatic levels.</Paragraph>
    <Paragraph position="1"> The information flow in SmartKom is organized as follows: On the input side the parser picks an N-best list of hypotheses out of the speech recognizer's word lattice (Oerder and Ney, 1993). This list is sent to the media fusion component and then handed over to the intention recognition component.</Paragraph>
    <Paragraph position="2"> The main task of intention recognition in SmartKom is to select the best hypothesis from the N-best list produced by the parser.</Paragraph>
    <Paragraph position="3"> This is then sent to the dialogue management component for computing an appropriate action. In order to find the best hypothesis, the intention recognition module consults a number of other components involved in language, discourse and domain analysis and requests confidence scores to make an appropriate decision (s. Fig. 1).</Paragraph>
    <Paragraph position="4"> Tasks of the domain modeling component are: Philadelphia, July 2002, pp. 46-49. Association for Computational Linguistics. Proceedings of the Third SIGdial Workshop on Discourse and Dialogue,  * to supply a confidence score on the consistency of SRH with respect to the domain model; * to detect the domain currently in focus.</Paragraph>
    <Paragraph position="5">  These tasks are inherently related to each other: It is possible to assign SRH to certain domains only if they are consistent with the domain model. On the other hand, a consistency score can only be useful when it is given with respect to certain domains.</Paragraph>
  </Section>
  <Section position="4" start_page="3" end_page="3" type="metho">
    <SectionTitle>
3 Data
</SectionTitle>
    <Paragraph position="0"> We consider semantic consistency scoring and domain detection a classification task. The question is whether it is feasible to solve this task automatically. As a first step towards an answer we reformulate the problem: automatic classification of SRH is possible only if humans are able to do that reliably.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.1 Data Collection
</SectionTitle>
      <Paragraph position="0"> In order to test the reliability of such annotations we collected a corpus of SRH. The data collection was conducted by means of a hidden operator test (Rapp and Strube, 2002).</Paragraph>
      <Paragraph position="1"> In the test the SmartKom system was simulated. We had 29 subjects prompted to say certain inputs in 8 dialogues. 1479 turns were recorded. Each user-turn in the dialogue corresponded to a single intention, e.g. route request or sights information request.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.2 Data Preprocessing
</SectionTitle>
      <Paragraph position="0"> The data obtained from the hidden operator tests had to be prepared for our study to compose a corpus with N-best SRH. For this purpose we sent the audio files to the speech recognizer. The input for the domain modeling component, i.e. N-best lists of SRH were recorded in log-files and then processed with a couple of Perl scripts. The final corpus consisted of ca. 2300 SRH. This corresponds to ca.</Paragraph>
      <Paragraph position="1"> 1.55 speech recognition hypotheses per user's turn.</Paragraph>
      <Paragraph position="2"> The SRH corpus was then transformed into a set of annotation files which could be read into MMAX, the annotation tool adopted for this task (Mueller and Strube, 2001).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="3" end_page="7" type="metho">
    <SectionTitle>
4 Annotation Scheme
</SectionTitle>
    <Paragraph position="0"> For our study, a markable, i.e. an expression to be annotated, is a single SRH. The annotators as well as the domain modeling component in SmartKom currently do not take the dialogue context into account and do not perform context-dependent analysis. Hence, we presented the markables completely out of dialogue order and thus prevented the annotators from interpreting SRH contextdependently. null</Paragraph>
    <Section position="1" start_page="3" end_page="6" type="sub_section">
      <SectionTitle>
4.1 Semantic Consistency
</SectionTitle>
      <Paragraph position="0"> In the first step, the annotators had to classify markables with respect to semantic consistency. Semantic consistency is defined as well-formedness of an SRH on an abstract semantic level. We differentiate three classes of semantic consistency: consistent, semi-consistent, or inconsistent. First, all nouns and verbs contained in the hypothesis are extracted and corresponding concepts are retrieved from a lemma-concept dictionary (lexicon) supplied for the annotators. The decision regarding consistency, semi-consistency and inconsistency has to be done on the basis of evaluating the set of concepts corresponding to the individual hypothesis.</Paragraph>
      <Paragraph position="1"> * Consistent means that all concepts are semantically related to each other, e.g.</Paragraph>
      <Paragraph position="2"> &amp;quot;ich moechte die kuerzeste Route&amp;quot;  is mapped to the concepts &amp;quot;self&amp;quot;, &amp;quot;wish&amp;quot;, &amp;quot;route&amp;quot; all of which are related to each other. Therefore the hypothesis is considered consistent.</Paragraph>
      <Paragraph position="3"> * The label semi-consistent is used if at least a fragment of the hypothesis is  meaningful. For example, the hypothesis &amp;quot;ich moechte das Video sind&amp;quot;  is considered semi-consistent as the fragment &amp;quot;ich moechte das Video&amp;quot;, i.e. a set of corresponding concepts &amp;quot;self&amp;quot;, &amp;quot;want&amp;quot;, &amp;quot;video&amp;quot; is semantically wellformed. null * Inconsistent hypotheses are those whose conceptual mappings are not semantically related within the domain model. E.g. &amp;quot;ich wuerde die Karte ja Wiedersehen&amp;quot;  is conceptualized as &amp;quot;self&amp;quot;, &amp;quot;map&amp;quot;, &amp;quot;parting&amp;quot;. This set of concepts does not semantically make sense and the hypothesis should be rejected. null</Paragraph>
    </Section>
    <Section position="2" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
4.2 Domain Detection
</SectionTitle>
      <Paragraph position="0"> One of our considerations was that it is principally not always feasible to detect domains from an SRH. This is because the output of speech recognition is often corrupt, which may, in many cases, lead to false domain assignments. We argue that domain detection is dependent on the semantic consistency score.</Paragraph>
      <Paragraph position="1"> Therefore, according to our annotation scheme no domain analysis should be given to the semantically inconsistent SRH.</Paragraph>
      <Paragraph position="2"> If the hypothesis is considered either consistent or semi-consistent, certain domains will be assigned to it. The list of SmartKom domains for this study is finite and includes the following: route planning, sights information, cinema information, electronic program guide, home electronic device control, personal assistance, interaction management, small-talk and off-talk.</Paragraph>
      <Paragraph position="3"> In some cases multiple domains can be assigned to a single markable. The reason is that some domains are inherently so close to each other, e.g. cinema information and electronic program guide, that the distinction can only be made when the context is taken into account. As this is not the case for our study we allow for the specification of multiple domains per SRH.</Paragraph>
    </Section>
    <Section position="3" start_page="6" end_page="7" type="sub_section">
      <SectionTitle>
5.1 The Kappa Statistic
</SectionTitle>
      <Paragraph position="0"> To measure the reliability of annotations we used the Kappa statistic (Carletta, 1996).</Paragraph>
      <Paragraph position="1"> The value of Kappa statistic (K) for semantic consistency in our experiment was 0.58, which shows that there was not a high level of agreement between annotators  . In the field of content analysis, where the Kappa statistic originated, K&gt;0.8 is usually taken to indicate good reliability, 0.68&lt;K&lt;0.8 allows to draw tentative conclusions.</Paragraph>
      <Paragraph position="2"> The distribution of semantic consistency</Paragraph>
    </Section>
    <Section position="4" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
5.2 Discussion of the results
</SectionTitle>
      <Paragraph position="0"> One reason for the relatively low coefficient of agreement between annotators could be a small number of annotators (two) as compared to rather fine distinction between the classes inconsistent vs. semi-consistent and semi-consistent vs. consistent respectively.</Paragraph>
      <Paragraph position="1"> Another reason arises from the analysis of disagreements among annotators. We find many annotation errors caused by the fact that the annotators were not able to interpret the conceptualized SRH correctly. In spite of the fact that we emphasized the necessity of care- null Results on the reliability of domain assignments are not the subject of the present paper and will be published elsewhere.</Paragraph>
      <Paragraph position="2">  ful examination for high-quality annotations, the annotators tended to take functional words like prepositions into account. According to our annotation scheme, however, they had to be ignored during the analysis.</Paragraph>
      <Paragraph position="3"> 5.3 Revisions to the annotation scheme As already noted, one possible reason for disagreements among annotators is a rather fine distinction between the classes inconsistent vs. semi-consistent and semi-consistent vs. consistent. We had difficulties in defining strict criteria for separating semi-consistent as a class on its own. The percentage of its use is rather low as compared to the other two and amounts to 10.3% on average.</Paragraph>
      <Paragraph position="4"> A possible solution to this problem might be to merge the class semi-consistent with either consistent or inconsistent. We conducted a corresponding experiment with the available annotations.</Paragraph>
      <Paragraph position="5"> In the first case we merged the classes inconsistent and semi-consistent. We then ran the Kappa statistic over the data and obtained K=0.7. We found this to be a considerable improvement as compared to earlier K=0.58.</Paragraph>
      <Paragraph position="6"> In the second case we merged the classes consistent and semi-consistent. The Kappa statistic with this data amounted to 0.59, which could not be considered an improvement. null</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML