<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0614"> <Title>Grammatical analysis in the OVIS spoken-dialogue system</Title> <Section position="6" start_page="68" end_page="71" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> This section evaluates the NLP component with respect to efficiency and accuracy.</Paragraph> <Section position="1" start_page="68" end_page="68" type="sub_section"> <SectionTitle> 5.1 Test set </SectionTitle> <Paragraph position="0"> We present a number of results to indicate how well the NLP component currently performs. We used a corpus of more than 20K word-graphs, output of a preliminary version of the speech recognizer, and typical of the intended application. The first 3800 word-graphs of this set are semantically annotated.</Paragraph> <Paragraph position="1"> This annotated set is used in the experiments below. Some characteristics of this test set are given in Table 1.</Paragraph>
[Table 1 caption (fragment): ... the number of words of the actual utterances, the average number of transitions per word, and the average number of words per utterance.]
<Paragraph position="2"> As can be seen from this table, this test set is considerably easier than the rest of the corpus. For this reason, we also present results (where applicable) for a set of 5000 arbitrarily selected word-graphs. At the time of the experiment, no further annotated corpus material was available to us.</Paragraph> </Section> <Section position="2" start_page="68" end_page="69" type="sub_section"> <SectionTitle> 5.2 Efficiency </SectionTitle> <Paragraph position="0"> We report on two different experiments. In the first experiment, the parser is given the utterance as it was actually spoken (to simulate a situation in which speech recognition is perfect). In the second experiment, the parser takes the full word-graph as its input. The results are then passed on to the robustness component. We report on a version of the robustness component which incorporates bigram scores (other versions are substantially faster). All experiments were performed on an HP-UX 9000/780 machine with more than enough core memory. Timings measure CPU-time and should be independent of the load on the machine. The timings include all phases of the NLP component (including lexical lookup, syntactic and semantic analysis, robustness, and the compilation of semantic representations into updates). The parser is a head-corner parser implemented (in SICStus Prolog) with selective memoization and goal-weakening, as described in (van Noord, 1997). Table 2 summarizes the results of these two experiments.</Paragraph>
[Table 2 caption (fragment): ... word-graphs, the average number of milliseconds per word-graph, and the maximum number of milliseconds for a word-graph. The final column lists the maximum space requirements (per word-graph, in Kbytes). For word-graphs the average CPU-times are quite misleading, because CPU-times vary enormously for different word-graphs. For this reason, the second table lists the proportion of word-graphs that can be treated by the NLP component within a given amount of CPU-time (in milliseconds).]
<Paragraph position="1"> From the experiments we can conclude that almost all input word-graphs can be treated fast enough for practical applications. In fact, we have found that the few word-graphs which cannot be treated efficiently almost exclusively represent cases where speech recognition fails completely and no useful combinations of edges can be found in the word-graph. As a result, ignoring these few cases does not seem to degrade practical system performance.</Paragraph> </Section>
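To make the shape of this input concrete, the sketch below shows one way such a word-graph could be represented and its best path extracted by summed acoustic cost. It is a minimal illustration under assumptions of our own (a simple DAG with per-transition costs; names such as WordGraph and best_path are ours), not the actual OVIS data structures or the system's best-first search procedure, which also takes bigram scores into account.

import heapq
from collections import defaultdict

# Hypothetical word-graph: states are integers, transitions carry a word
# hypothesis and an acoustic cost (lower is better). This only mirrors the
# kind of recognizer output described in the text, not the OVIS format.
class WordGraph:
    def __init__(self, start, final):
        self.start = start
        self.final = final
        self.edges = defaultdict(list)  # state -> [(next_state, word, acoustic_cost)]

    def add_transition(self, frm, to, word, cost):
        self.edges[frm].append((to, word, cost))

def best_path(graph):
    """Return (total_cost, word_sequence) for the cheapest path from the
    start state to the final state, using acoustic costs only
    (a Dijkstra-style best-first search)."""
    queue = [(0.0, graph.start, [])]
    visited = set()
    while queue:
        cost, state, words = heapq.heappop(queue)
        if state == graph.final:
            return cost, words
        if state in visited:
            continue
        visited.add(state)
        for nxt, word, c in graph.edges[state]:
            heapq.heappush(queue, (cost + c, nxt, words + [word]))
    return float("inf"), []

# Example: two competing hypotheses for a short utterance.
g = WordGraph(start=0, final=2)
g.add_transition(0, 1, "naar", 1.2)
g.add_transition(0, 1, "na", 2.5)
g.add_transition(1, 2, "leiden", 0.8)
print(best_path(g))  # -> (2.0, ['naar', 'leiden'])

The accuracy figures below are computed for best paths of this kind, either with acoustic scores only or with bigram scores added.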
<Section position="3" start_page="69" end_page="69" type="sub_section"> <SectionTitle> 5.3 Accuracy </SectionTitle> <Paragraph position="0"> In order to evaluate the accuracy of the NLP component, we used the same test set of 3800 word-graphs.</Paragraph> <Paragraph position="1"> For each of these graphs we know the corresponding actual utterance and the update assigned by the annotators. We report on word and sentence accuracy, which indicate how well we are able to choose the best path from the given word-graph, and on concept accuracy, which indicates how often the analyses are correct.</Paragraph> <Paragraph position="2"> The string comparison on which sentence accuracy and word accuracy are based is defined by the minimal number of substitutions, deletions and insertions required to turn the first string into the second (Levenshtein distance). The string that is compared with the actual utterance is the best path through the word-graph, given the best-first search procedure defined in the previous section. Word accuracy is defined as 1 - d/n, where n is the length of the actual utterance and d is the distance defined above.</Paragraph>
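As an illustration of this measure, the sketch below computes the Levenshtein distance over word tokens and derives word accuracy from it. It is a plain dynamic-programming implementation under names of our own choosing (levenshtein, word_accuracy), not code from the OVIS system; sentence accuracy is then simply the proportion of utterances for which this distance is zero.

def levenshtein(reference, hypothesis):
    """Minimal number of substitutions, deletions and insertions needed to
    turn the hypothesis into the reference (computed over word tokens)."""
    m, n = len(reference), len(hypothesis)
    # dist[i][j]: distance between the first i reference words and the
    # first j hypothesis words.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution or match
    return dist[m][n]

def word_accuracy(reference, hypothesis):
    """WA = 1 - d/n, with d the edit distance and n the reference length."""
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)

ref = "ik wil naar leiden".split()
hyp = "ik wil na leiden".split()
print(levenshtein(ref, hyp), word_accuracy(ref, hyp))  # 1 0.75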
<Paragraph position="3"> In order to characterize the test sets somewhat further, Table 3 lists the word and sentence accuracy of the best path through the word-graph (using acoustic scores only), of the best possible path through the word-graph, and of a combination of the acoustic score and a bigram language model. The first two of these can be seen as natural upper and lower boundaries.</Paragraph>
[Table 3 caption (fragment): ... possible path through the word-graph, based on acoustic scores only (Possible); a combination of acoustic score and bigram score (Acoustic + Bigram), as reported by the current version of the system.]
</Section> <Section position="4" start_page="69" end_page="71" type="sub_section"> <SectionTitle> 5.4 Concept Accuracy </SectionTitle> <Paragraph position="0"> Word accuracy provides a measure for the extent to which linguistic processing contributes to speech recognition. However, since the main task of the linguistic component is to analyze utterances semantically, an equally important measure is concept accuracy, i.e. the extent to which the semantic analysis corresponds with the meaning of the utterance that was actually produced by the user.</Paragraph> <Paragraph position="1"> For determining concept accuracy, we have used a semantically annotated corpus of 3800 user responses. Each user response was annotated with an update representing the meaning of the utterance that was actually spoken. The annotations were made by our project partners in Amsterdam, in accordance with the guidelines given in (Veldhuijzen van Zanten, 1996).</Paragraph> <Paragraph position="2"> Updates take the form described in Section 3. An update is a logical formula which can be evaluated against an information state and which gives rise to a new, updated information state. The most straightforward method for evaluating concept accuracy in this setting is to compare (the normal form of) the update produced by the grammar with (the normal form of) the annotated update. A major obstacle for this approach, however, is the fact that very fine-grained semantic distinctions can be made in the update language. While these distinctions are relevant semantically (i.e. in certain cases they may lead to slightly different updates of an information state), they can often be ignored by a dialogue manager. For instance, the update below is semantically not equivalent to the one given in Section 3, as the ground-focus distinction is slightly different.</Paragraph> <Paragraph position="3"> userwants.travel.destination.place([# town.leiden];[! town.abcoude]) However, the dialogue manager will decide in both cases that this is a correction of the destination town.</Paragraph> <Paragraph position="4"> Since semantic analysis is the input for the dialogue manager, we have therefore measured concept accuracy in terms of a simplified version of the update language. Following the proposal in (Boros and others, 1996), we translate each update into a set of semantic units, where a unit in our case is a triple (CommunicativeFunction, Slot, Value). For instance, the example above, as well as the example in Section 3, translates as (denial, destination_town, leiden) and (correction, destination_town, abcoude). Both the updates in the annotated corpus and the updates produced by the system were translated into semantic units of this form.</Paragraph> <Paragraph position="5"> Semantic accuracy is given in the following tables according to four different definitions. Firstly, we list the proportion of utterances for which the corresponding semantic units exactly match the semantic units of the annotation (match). Furthermore, we calculate precision (the number of correct semantic units divided by the number of semantic units which were produced) and recall (the number of correct semantic units divided by the number of semantic units of the annotation). Finally, following (Boros and others, 1996), we also present concept accuracy as</Paragraph> <Paragraph position="6"> CA = (SU - (SU_S + SU_I + SU_D)) / SU x 100% </Paragraph> <Paragraph position="7"> where SU is the total number of semantic units in the translated corpus annotation, and SU_S, SU_I, and SU_D are the number of substitutions, insertions, and deletions that are necessary to make the translated grammar update equivalent to the translation of the corpus update.</Paragraph>
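A small sketch of how these four measures could be computed from the semantic-unit triples is given below. It relies on simplifying assumptions of our own: units are treated as plain (function, slot, value) tuples, and a substitution is counted whenever a (function, slot) pair occurs on both sides with different values, which only approximates the alignment-based counts of (Boros and others, 1996).

def semantic_scores(reference, hypothesis):
    """Compare two collections of semantic units, each a
    (communicative_function, slot, value) triple, and return
    (match, precision, recall, concept accuracy)."""
    ref, hyp = set(reference), set(hypothesis)
    correct = len(ref & hyp)
    precision = correct / len(hyp) if hyp else 1.0
    recall = correct / len(ref) if ref else 1.0
    match = ref == hyp

    # Simplification: index units by (function, slot); a differing value on
    # both sides counts as a substitution, a key missing from the hypothesis
    # as a deletion, and an extra key in the hypothesis as an insertion.
    ref_by_key = {(f, s): v for f, s, v in ref}
    hyp_by_key = {(f, s): v for f, s, v in hyp}
    subs = sum(1 for k in ref_by_key if k in hyp_by_key and ref_by_key[k] != hyp_by_key[k])
    dels = sum(1 for k in ref_by_key if k not in hyp_by_key)
    ins = sum(1 for k in hyp_by_key if k not in ref_by_key)
    su = len(ref)
    ca = 100.0 * (su - (subs + ins + dels)) / su if su else 100.0
    return match, precision, recall, ca

annotation = {("denial", "destination_town", "leiden"),
              ("correction", "destination_town", "abcoude")}
system_output = {("denial", "destination_town", "leiden"),
                 ("correction", "destination_town", "amsterdam")}
print(semantic_scores(annotation, system_output))  # (False, 0.5, 0.5, 50.0)

At the corpus level, such counts would be aggregated over all annotated word-graphs; the per-utterance version above is only meant to make the definitions concrete.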
<Paragraph position="8"> We obtained the results given in Table 4.</Paragraph>
[Table 4 caption (fragment): ... accuracy. Semantic accuracy consists of the percentage of graphs which receive a fully correct analysis (match), percentages for precision and recall of semantic slots, and concept accuracy. The first row presents the results if the parser is given the actual user utterance (obviously WA and SA are meaningless in this case). The second and third rows present the results for word-graphs. In the third row, bigram information is incorporated in the robustness component.]
<Paragraph position="9"> The following reservations should be made with respect to the numbers given above.</Paragraph> <Paragraph position="10"> * The test set is not fully representative of the task, because the word-graphs are relatively simple.</Paragraph> <Paragraph position="11"> * The test set was also used during the design of the grammar. Therefore, the experiment is methodologically unsound, since no clear separation exists between training and test material.</Paragraph> <Paragraph position="12"> * Errors in the annotated corpus were corrected by us.</Paragraph> <Paragraph position="13"> * Irrelevant differences between annotation and analysis were ignored (for example in the case of the station names cuijk and cuyk).</Paragraph> <Paragraph position="14"> Even if we take these reservations into account, it seems that we can conclude that the robustness component adequately extracts useful information even in cases where no full parse is possible: concept accuracy is (luckily) much higher than sentence accuracy.</Paragraph> </Section> <SectionTitle> Conclusion </SectionTitle> <Paragraph position="0"> We have argued in this paper that sophisticated grammatical analysis, in combination with a robust parser, can be applied successfully as an ingredient of a spoken dialogue system. Grammatical analysis is thereby shown to be a viable alternative to techniques such as concept spotting. We showed that for a state-of-the-art application (a public transport information system) grammatical analysis can be applied efficiently and effectively. It is expected that the use of sophisticated grammatical analysis allows for easier construction of linguistically more complex spoken dialogue systems.</Paragraph> </Section> </Paper>