<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1021"> <Title>Spoken Dialogue Interpretation with the DOP Model</Title> <Section position="4" start_page="142" end_page="143" type="evalu"> <SectionTitle> 6. Evaluation </SectionTitle> <Paragraph position="0"> In our experimental evaluation of DOP we were interested in the following questions: (1) Is DOP fast enough for practical spoken dialogue understanding? (2) Can we constrain the OVIS subtrees without losing accuracy? (3) What is the impact of dialogue context on the accuracy? For all experiments, we used a random split of the 10,000 OVIS trees into a 90% training set and a 10% test set. The training set was divided up into the four subcorpora described in section 4, which served to create the corresponding DOP parsers. The 1000 word-graphs for the test set utterances were used as input. For each word-graph, the previous system question was known and used to select the particular DOP parser, while the user utterances were kept apart. As to the complexity of the word-graphs: the average number of transitions per word is 4.2, and the average number of words per word-graph path is 4.6. All experiments were run on an SGI Indigo with a MIPS R10000 processor and 640 Mbyte of core memory. To establish the semantic accuracy of the system, the best meanings produced by the DOP parser were compared with the meanings in the test set. Besides an exact match metric, we also used a more fine-grained evaluation of the semantic accuracy. Following the proposals in Boros et al. (1996) and van Noord et al. (1997), we translated each update meaning into a set of semantic units, where a unit is a triple <CommunicativeFunction, Slot, Value>. For instance, an update on the slot user.wants.travel.destination that denies the town almere and corrects it to alkmaar</Paragraph> <Paragraph position="2"> translates as: <denial, destination_town, almere> <correction, destination_town, alkmaar> Both the updates in the OVIS test set and the updates produced by the DOP parser were translated into semantic units of the form given above. The semantic accuracy was then evaluated in three different ways: (1) match, the percentage of updates which were exactly correct (i.e. which exactly matched the updates in the test set); (2) precision, the number of correct semantic units divided by the number of semantic units which were produced; (3) recall, the number of correct semantic units divided by the number of semantic units in the test set.</Paragraph> <Paragraph position="3"> As to question (1), we already suspect that it is not efficient to use all OVIS subtrees. We therefore performed experiments with versions of DOP where the subtree collection is restricted to subtrees of a certain maximum depth. The following table shows, for four different maximum depths (with the maximum number of frontier words limited to 3), the number of subtree types in the training set, the semantic accuracy in terms of match, precision and recall (as percentages), and the average CPU time per word-graph in seconds.</Paragraph> <Paragraph position="4"> [Table 1: for subtree-depths 1 to 4, the number of subtrees, the semantic accuracy (match, precision, recall) and the CPU time per word-graph.] The experiments show that the highest accuracy is achieved at subtree-depth 4, but that only for subtree-depths 1 and 2 are the processing times fast enough for practical applications. Thus there is a trade-off between efficiency and accuracy: the efficiency deteriorates as the accuracy improves. We believe that a match of 78.5% and a corresponding precision and recall of 83.0% and 84.3%, respectively (for the fast processing times at depth 2), is promising enough for further research. Moreover, by testing DOP directly on the word strings (without the word-graphs), a match of 97.8% was achieved.
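The three-way semantic evaluation used above can be sketched as follows. This is a minimal illustration assuming each update is represented as a Python set of (function, slot, value) triples; the representation and all names are ours, not the paper's implementation.

```python
def semantic_accuracy(produced, reference):
    """Match, precision and recall over semantic units.

    produced, reference: lists of sets of (function, slot, value)
    triples, one set per test utterance. The set representation is
    an illustrative assumption.
    """
    assert len(produced) == len(reference)
    # match: fraction of utterances whose unit sets are exactly equal
    match = sum(p == r for p, r in zip(produced, reference)) / len(reference)
    # correct units: intersection of produced and reference sets
    correct = sum(len(p & r) for p, r in zip(produced, reference))
    precision = correct / sum(len(p) for p in produced)
    recall = correct / sum(len(r) for r in reference)
    return match, precision, recall

# The denial/correction example from the text: producing only one of
# the two reference units gives full precision but half recall.
ref = [{("denial", "destination_town", "almere"),
        ("correction", "destination_town", "alkmaar")}]
out = [{("denial", "destination_town", "almere")}]
print(semantic_accuracy(out, ref))  # (0.0, 1.0, 0.5)
```

Note that match is the most stringent of the three metrics: an update counts only if every one of its units is produced and no spurious units are added.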
This shows that linguistic ambiguities do not play a significant role in this domain. The actual problem is the ambiguity in the word-graphs (i.e. the multiple paths).</Paragraph> <Paragraph position="5"> Secondly, we are concerned with the question whether we can impose constraints on the subtrees other than their depth, in such a way that the accuracy does not deteriorate and perhaps even improves. To answer this question, we kept the maximum subtree-depth constant at 3, and employed the following constraints: * Eliminating once-occurring subtrees: this led to a considerable decrease on all metrics; e.g. match decreased from 79.8% to 75.5%.</Paragraph> <Paragraph position="6"> * Restricting subtree lexicalization: restricting the maximum number of words in the subtree frontiers to 3, 2 and 1, respectively, showed a consistent decrease in semantic accuracy, similar to the restriction of the subtree depth in table 1. The match dropped from 79.8% to 76.9% if each subtree was lexicalized with only one word.</Paragraph> <Paragraph position="7"> * Eliminating subtrees with only non-head words: this also led to a decrease in accuracy; the most stringent metric decreased from 79.8% to 77.1%.</Paragraph> <Paragraph position="8"> Evidently, there can be important relations in OVIS that involve non-head words.</Paragraph> <Paragraph position="9"> Finally, we are interested in the impact of dialogue context on semantic accuracy. To test this, we neglected the previous system questions and created one DOP parser for the whole training set. The semantic accuracy metric match dropped from 79.8% to 77.4% (for depth 3). Moreover, the CPU time per sentence deteriorated by a factor of 4, mainly because larger training sets yield slower DOP parsers.</Paragraph> <Paragraph position="10"> The following result nicely illustrates how the dialogue context can contribute to better predictions for the correct meaning of an utterance.
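The question-dependent parser selection evaluated above might be sketched as follows. This is a hypothetical dispatch: only the date-subcorpus is named in the text, so the other subcorpus names and the keyword matching are illustrative assumptions, not the system's actual logic.

```python
# One DOP parser per question-type subcorpus; subcorpus names other
# than "date" and the keyword lists are illustrative assumptions.
SUBCORPUS_KEYWORDS = {
    "date": ("when", "day", "date"),
    "origin": ("from where", "depart from"),
    "destination": ("where", "travel to"),
}

def select_parser(parsers, system_question, fallback="all"):
    """Return the subcorpus-specific parser triggered by the previous
    system question, or a parser trained on the whole corpus if no
    subcorpus matches (as in the context-free experiment above)."""
    q = system_question.lower()
    for name, keywords in SUBCORPUS_KEYWORDS.items():
        if any(k in q for k in keywords):
            return parsers[name]
    return parsers[fallback]

parsers = {"date": "date-parser", "origin": "origin-parser",
           "destination": "destination-parser", "all": "full-corpus-parser"}
print(select_parser(parsers, "When do you want to leave?"))  # date-parser
```

Restricting the subtree collection to a single question-type subcorpus both shrinks the model (hence the factor-of-4 speed difference reported above) and sharpens the probability estimates, as the Donderdag/Dordrecht example below shows.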
In parsing the word-graph corresponding to the acoustic utterance Donderdag acht februari (&quot;Thursday eight February&quot;), the DOP model without dialogue context assigned the highest probability to a derivation yielding the word string Dordrecht acht februari and its meaning. The uttered word Donderdag was thus misinterpreted as the town Dordrecht, which was indeed among the other hypothesized words in the word-graph. When the DOP model took the dialogue context into account, the previous system question When do you want to leave? was known and thus triggered the subtrees from the date-subcorpus only, which correctly assigned the highest probability to Donderdag acht februari and its meaning, rather than to Dordrecht acht februari.</Paragraph> </Section> </Paper>