<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3012">
  <Title>Word level confidence measurement using semantic features. In Proceedings of ICASSP, Hong Kong, April.</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Evaluation metrics
</SectionTitle>
      <Paragraph position="0"> The evaluation of the algorithms and domain models presented herein poses a methodological problem. As stated in Section 3.3, the annotators were allowed to assign one or more domains to an SRH, so the number of domain categories varies in the Gold Standard data. The output of DOMSCORE, however, is a set of confidence values for all domains, each ranging from 0 to 1. To the best of our knowledge, there exists no evaluation method that allows the straightforward evaluation of these confidence sets against a varying number of binary domain decisions.</Paragraph>
      <Paragraph position="1"> As a consequence, we restricted the evaluation to the subset of 758 SRHs in Dataset 2 that were unambiguously annotated with a single domain. For each SRH we compared the recognized domain of its best CR, i.e., the domain to which DOMSCORE assigned the highest confidence, with the annotated domain. In this way we measured the precision of recognizing the best domain of an SRH. The best conceptual representation of an SRH had been previously disambiguated by humans, as reported in Section 3.3. Alternatively, this kind of disambiguation can be performed automatically, e.g., with the help of the system presented in Gurevych et al. (2003a).</Paragraph>
      <Paragraph position="2"> The system scores semantic coherence of SRHs, where the best CR is the one with the highest semantic coherence.</Paragraph>
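The precision measure described above can be sketched as follows: the recognized domain of an SRH is the argmax over its confidence set, and precision is the fraction of SRHs whose recognized domain matches the single annotated gold domain. All function and variable names here are illustrative assumptions, not the paper's implementation.

```python
def recognized_domain(confidences):
    """Pick the domain with the highest DOMSCORE-style confidence value."""
    return max(confidences, key=confidences.get)

def best_domain_precision(scored_srhs, gold_domains):
    """scored_srhs: one {domain: confidence} dict per SRH.
    gold_domains: the single annotated domain per SRH.
    Returns the fraction of SRHs whose top-scored domain matches the gold label."""
    hits = sum(
        recognized_domain(conf) == gold
        for conf, gold in zip(scored_srhs, gold_domains)
    )
    return hits / len(gold_domains)
```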
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Results
</SectionTitle>
      <Paragraph position="0"> We included two baselines in this evaluation. As assigning domains to speech recognition hypotheses is a classification task, the majority class frequency can serve as a first baseline. For a second baseline, we trained a statistical classifier employing the k-nearest neighbour method on Dataset 1. This dataset had also been employed to create the tf*idf model. The statistical classifier treated each SRH as a bag of words or a bag of concepts labeled with domains. The results of DOMSCORE employing the hand-annotated and tf*idf domain models, as well as the baseline systems' performances, are displayed in Figure 2. The diagram shows that all systems clearly outperform the majority class baseline. The hand-annotated domain model (precision 88.39%) outperforms the tf*idf domain model (precision 82.59%); the model created by humans thus turns out to be of higher quality than the automatically computed one. However, the k-nearest neighbour baseline with words as features (precision 93.14%) performs better than the methods employing ontological concepts as representations.</Paragraph>
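A minimal sketch of a bag-of-words k-nearest-neighbour baseline of this kind: each SRH is a term-count vector, similarity is cosine over those counts, and the predicted domain is the majority label among the k most similar training SRHs. The similarity choice and all names are assumptions for illustration, not the paper's actual classifier.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(train, srh_tokens, k=3):
    """train: list of (token_list, domain) pairs; srh_tokens: tokenized SRH.
    Returns the majority domain among the k nearest training examples."""
    query = Counter(srh_tokens)
    ranked = sorted(train, key=lambda ex: cosine(Counter(ex[0]), query),
                    reverse=True)
    votes = Counter(domain for _, domain in ranked[:k])
    return votes.most_common(1)[0][0]
```

The same skeleton covers the bag-of-concepts variant: tokens are simply replaced by the concepts they map to before counting.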
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Discussion
</SectionTitle>
      <Paragraph position="0"> We believe that this finding can be explained in terms of our experimental setup, which favours the statistical model. Table 9 gives the absolute frequencies of all domain categories in the evaluation data. As the table shows, three of the possible categories are missing from the data.</Paragraph>
      <Paragraph position="1">  The main reason for our results, however, lies in the controlled experimental setup of the data collection. Subjects had to verbalize pre-defined intentions in 8 scenarios, e.g., record a specific program on TV or ask for information regarding a given historical sight. Naturally, this leads to restricted man-machine interactions using a controlled vocabulary. As a result, there is rather limited lexical variation in the data, which is unfortunate for illustrating the strengths of high-level ontological representations. In our opinion, the power of ontological representations lies precisely in their ability to reduce multiple lexical surface realizations of the same concept to a single unit, thus representing the meaning of multiple words in a compact way. This effect could not be adequately exploited given the test corpora in these experiments. We expect a better performance of concept-based methods as compared to word-based ones in broader domains.</Paragraph>
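The compaction effect described above can be made concrete with a toy lexicon (the mapping below is invented for illustration and is not the system's ontology): several surface words collapse onto one concept, shrinking the feature space a domain model must cover.

```python
# Hypothetical word-to-concept lexicon; unknown words pass through unchanged.
LEXICON = {
    "movie": "Broadcast", "film": "Broadcast", "show": "Broadcast",
    "castle": "Sight", "cathedral": "Sight",
}

def to_concepts(tokens):
    """Map surface words onto ontological concepts where the lexicon knows them."""
    return [LEXICON.get(t, t) for t in tokens]
```

With rich lexical variation, three distinct word features become a single concept feature; in a controlled-vocabulary corpus the variants rarely occur, so the reduction buys little.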
      <Paragraph position="2"> An additional important point to consider is the portability of the domain recognition approach. Statistical models, e.g., tf*idf and k-nearest neighbour, rely on substantial amounts of annotated data when moving to new domains. Such data is difficult to obtain, and its annotation requires expensive human effort. When the manually created domain model is employed for the domain classification task, extending the knowledge sources to a new domain boils down to adding some concepts to the list and annotating them for domains. These new concepts are part of the extension of the system's general ontology, which is not created specifically for domain classification but is employed for many purposes in the system.</Paragraph>
    </Section>
  </Section>
</Paper>