<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2802"> <Title>Towards Measuring Scalability in Natural Language Understanding Tasks</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Evaluating Dialogue-, Speech- and Discourse Understanding Systems </SectionTitle> <Paragraph position="0"> In this section we briefly sketch the most frequently used metrics for evaluating the performance of the relevant components and systems at hand.</Paragraph> <Paragraph position="1"> Evaluation of the Dialogue System Performance: For evaluating the overall performance of a dialogue system as a whole, frameworks such as PARADISE (Walker et al., 2000) for unimodal and PROMISE (Beringer et al., 2002) for multimodal systems have set a de facto standard. These frameworks differentiate between:
- dialogue efficiency metrics, i.e. elapsed time and system and user turns;
- dialogue quality metrics, i.e. mean recognition score and absolute numbers as well as percentages of timeouts, rejections, helps, cancels, and barge-ins;
- task success metrics, i.e. task completion (per survey);
- user satisfaction metrics (per survey).
These metrics are crucial for evaluating the aggregate performance of the individual components; they cannot, however, determine the amount of understanding versus misunderstanding or the system-specific a priori difficulty of the understanding task. Their importance will nevertheless remain undiminished, as ways of determining such global parameters are vital for determining the aggregate usefulness and felicity of a system as a whole. At the same time, individual components and ensembles thereof - such as the uni- or multimodal input understanding system - need to be evaluated as well to determine bottlenecks and weak links in the discourse understanding processing chain.</Paragraph> <Paragraph position="2"> Evaluation of the Automatic Speech Recognition Performance: The commonly used word error rate (WER) can be calculated by aligning any two word sequences and adding up the number of substitutions S, deletions D and insertions I. The WER is then given by the following formula, where N is the total number of words in the test set: WER = (S + D + I) / N.</Paragraph> <Paragraph position="3"> Another frequently used measure of accuracy is the so-called out-of-vocabulary (OOV) rate, which represents the percentage of words that were not recognized despite being covered by the lexicon. WER and OOV are commonly combined with the acoustic- and language-model confidence scores, which are constituted by the posterior probabilities of the hidden Markov chains and the n-gram frequencies. Together these scores enable evaluators to measure the absolute performance of a given speech recognition system. In order to arrive at a measure that is relative to the given task difficulty, this difficulty must also be calculated, which can be done by means of measuring the perplexity of the task (see Section 3).</Paragraph>
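To make the WER computation above concrete, the following is a minimal illustrative sketch (not taken from any of the cited systems): it aligns a reference and a hypothesis word sequence via Levenshtein distance, so that the total number of substitutions, deletions and insertions divided by the reference length N yields the WER.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """WER = (S + D + I) / N, obtained from a Levenshtein alignment."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j]: minimal number of edits turning the first i reference words
    # into the first j hypothesis words
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                      # i deletions
    for j in range(m + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[n][m] / n

# One substitution and one deletion against a six-word reference: WER of 2/6
print(word_error_rate("the cat sat on the mat".split(),
                      "the cat sit on mat".split()))
```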
<Paragraph position="4"> Evaluation of the Natural Language Understanding Performance: A measure for understanding rates called concept error rate has been proposed, for example, by Chotimongcol and Rudnicky (2001); it is designed in analogy to the word error rates employed in automatic speech recognition systems that are combined with keyword spotting. Chotimongcol and Rudnicky (2001) propose to differentiate whether the erroneous concept occurs in a non-concept slot that contains information that is captured in the grammar but not considered relevant for selecting a system action (e.g., politeness markers such as please), in a value-insensitive slot whose identity suffices to produce a system action (e.g., affirmatives such as yes), or in a value-sensitive slot for which both the occurrence and the value of the slot are important (e.g., a goal object such as Heidelberg). An alternative proposal for concept error rates is embedded in the speech recognition and intention spotting system by Lumenvox, wherein two types of errors and two types of non-errors for concept transcriptions are proposed. The two non-errors are a match, when the application returned the correct concept, and an out-of-grammar match, when the application returned no concepts, or discarded the returned concepts, because the user failed to say any concept covered by the grammar.</Paragraph> <Paragraph position="5"> The two errors are a grammar mismatch, when the application returned the incorrect concept but the user said a concept covered by the grammar, and an out-of-grammar mismatch, when the application returned a concept and chose that concept as a correct interpretation, but the user did not say a concept covered by the grammar.</Paragraph> <Paragraph position="6"> Neither of these measures is suitable for our purposes, as they are known to be feasible only for context-insensitive applications that do not include discourse models, implicit domain-specific information and other contextual knowledge, as discussed in (Porzel et al., 2004). For single-utterance systems this measure has therefore also been called keyword recognition rate. In our view, another crucial shortcoming is the lack of comparability, as these measures do not take the general difficulty of the understanding task into account. Again, this has been realized in the automatic speech recognition community and has led to the so-called perplexity measurements for a given speech recognition task. We will, therefore, sketch out the commonly employed perplexity measurements in Section 3.</Paragraph> <Paragraph position="7"> The most detailed evaluation scheme for discourse comprehension, introduced by Higashinaka et al. (2002) and extended by Higashinaka et al. (2003), features the following metrics:
1. slot accuracy
2. insertion error rate
3. deletion error rate
4. substitution error rate
5. slot error rate
6. update precision
7. update insertion error rate
8. update deletion error rate
9. update substitution error rate
10. speech understanding rate
11. slot accuracy for filled slots
12. deletion error rate for filled slots
13. substitution error rate for filled slots</Paragraph> <Paragraph position="8"> These metrics are combined by merging the results of an M5 multiple linear regression algorithm and a support vector regression approach. The resulting weighted sum is compared to human intuitions and to PARADISE-like metrics concerning task completion rates and times.</Paragraph>
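The four concept-transcription outcomes above can be made concrete with a minimal sketch of how a single utterance might be scored; the function name and arguments are illustrative and not taken from the Lumenvox system, and treating a missed in-grammar concept as a grammar mismatch is an assumption of this sketch.

```python
from typing import Optional

def classify_concept_outcome(returned: Optional[str], said: Optional[str]) -> str:
    """Score one concept transcription with the four categories described above.

    `returned` is the concept the application accepted (None if it returned or
    discarded everything); `said` is the concept the user actually uttered
    according to the grammar (None if the utterance was out of grammar).
    """
    if said is not None:                   # user said an in-grammar concept
        if returned == said:
            return "match"
        return "grammar mismatch"          # wrong concept (or, by assumption, none) returned
    if returned is None:                   # out-of-grammar utterance ...
        return "out-of-grammar match"      # ... and nothing was returned or kept
    return "out-of-grammar mismatch"       # ... but a concept was accepted anyway

print(classify_concept_outcome("Heidelberg", "Heidelberg"))  # match
print(classify_concept_outcome("Hamburg", None))             # out-of-grammar mismatch
```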
<Paragraph position="9"> While this promising approach manages to combine factors related to speech recognition, interpretation and discourse modeling, it has some shortcomings that stem from the fact that the schema was developed for single-domain systems that employ frame-based attribute-value pairs for representing the user's intent. Recent advances in dialogue management and multi-domain systems enable approaches that are more flexible than slot-filling, e.g. using discourse pegs, dialogue games and overlay operations for handling multiple tasks and cross-modal references (LuperFoy, 1992; Löckelt et al., 2002; Pfleger et al., 2002; Alexandersson and Becker, 2003). More importantly - for the topic of this paper - no means of measuring the a priori discourse understanding difficulty is given.</Paragraph> <Paragraph position="10"> Measuring Precision, Recall and F-Measures: In the realm of semantic analyses, the task of word sense disambiguation is usually regarded as being among the most difficult ones, in the sense that it can only be solved after all other problems involved in language understanding have been resolved as well. The hierarchical nature and interdependencies of the various tasks are mirrored in the results of the corresponding competitive evaluation tracks, e.g. the Message Understanding Conference (MUC) or the SENSEVAL competition. The ungraceful degradation of f-measure scores (shown in Table 2) is due to the fact that each higher-level task inherits the imprecisions and omissions of the previous ones, e.g. errors in the named entity recognition (NE) task cause recall and precision declines in the template element (TE) task, which, in turn, thwart successful template relation (TR) task performance as well as the most difficult scenario template (ST) and co-reference (CO) tasks. This decline can be seen in Table 2 (Marsh and Perzanowski, 1999).</Paragraph> <Paragraph position="11"> Table 2: F-measure scores of the best performing systems of the 7th Message Understanding Conference: NE f = .94, CO f = .62, TE f = .87, TR f = .76, ST f = .51.</Paragraph> <Paragraph position="12"> Despite several problems stemming from the prerequisite of crafting costly gold standards, e.g. tree banks or annotated test corpora, precision and recall and their weighable combinations in the corresponding f-measures (such as those given in Table 2) have become a de facto standard for measuring the performance of classification and retrieval tasks (Van Rijsbergen, 1979). Precision p states the percentage of correctly tagged (or classified) entities among all tagged/classified entities, whereas recall r states the percentage of correctly tagged/classified entities as compared to the normative amount, i.e. those that ought to have been tagged or classified. Together these are combinable into an overall f-measure score, defined as F = 1 / (α/p + (1 − α)/r). Herein α can be set to reflect the respective importance of p versus r; if α = 0.5 then both are weighted equally.</Paragraph> <Paragraph position="13"> These measures are commonly employed for evaluating part-of-speech tagging, shallow parsing, reference resolution and information retrieval tasks and sub-tasks. An additional problem with this method is that most natural language understanding systems that perform deeper semantic analyses produce representations based on individual grammar formalisms and mark-up languages for which no gold standards exist. For evaluating discourse understanding systems, however, such gold standards and annotated training corpora will continue to be needed.</Paragraph>
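The following minimal sketch computes precision, recall and the α-weighted f-measure just defined from raw counts of true positives, false positives and false negatives; it illustrates the formula and is not code from any of the cited evaluations.

```python
def f_measure(tp: int, fp: int, fn: int, alpha: float = 0.5) -> float:
    """Effectiveness-based f-measure: F = 1 / (alpha/p + (1 - alpha)/r)."""
    p = tp / (tp + fp)   # precision: correct tags among all tags produced
    r = tp / (tp + fn)   # recall: correct tags among all tags that ought to exist
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

# 80 correctly tagged entities, 20 spurious ones, 20 missed ones: p = r = 0.8.
# With alpha = 0.5 the result coincides with the harmonic mean 2pr / (p + r) = 0.8.
print(f_measure(80, 20, 20))
```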
</Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Measuring Perplexity and Baselines </SectionTitle> <Paragraph position="0"> In this section we describe the most frequently used metrics for estimating the complexity of the tasks performed by the relevant components and systems at hand.</Paragraph> <Paragraph position="1"> Measuring Perplexity in Automatic Speech Recognition: Perplexity is a measure of the probability-weighted average number of words that may follow after a given word (Hirschman and Thompson, 1997). In order to calculate the perplexity B, the entropy H needs to be given, i.e., the probability of the word sequences in the specific language of the system W. The perplexity is then defined as B = 2^H(W).</Paragraph> <Paragraph position="2"> Improvements of specific ASR systems can consequently be measured by keeping the perplexity constant and measuring WER and OOV performance for recognition quality, and confidence scores for hypothesis verification and selection.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Measuring Task-specific Baselines </SectionTitle> <Paragraph position="0"> Baselines for classification or tagging tasks are commonly defined based on chance performance, on an a posteriori computed majority class performance, or on the performance of an established baseline classification method such as naive Bayes, tf*idf or k-means. That means:
- for chance performance metrics: what is the corresponding f-measure if the evaluated component guesses randomly;
- for majority class performance metrics: what is the corresponding f-measure if the evaluated component always chooses the most frequent solution;
- what is the corresponding f-measure of the established baseline classification method.</Paragraph> <Paragraph position="1"> Much like the kappa statistics proposed by Carletta (1996), existing employments of majority class baselines assume an equal set of identical potential mark-ups, i.e. attributes and their values, for all markables. Therefore, they cannot be used in a straightforward manner for many tasks that involve disjunct sets of attributes and values in terms of the type and number of attributes and their values involved in the classification task. This, however, is exactly what we find in natural language understanding tasks, such as semantic tagging or word sense disambiguation (Stevenson, 2003). Additionally, baselines computed with other methods cannot serve as a means for measuring scalability because of the circularity involved: one would need a way of measuring the baseline method's scalability factor in the first place. Table 3 provides an overview of the existing ways of measuring performance and task difficulty in automatic speech recognition and natural language understanding.</Paragraph> </Section> </Section>
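As a small illustration of the entropy/perplexity relation used in this section, the sketch below computes H and B = 2^H for a toy unigram distribution; real ASR perplexities are computed over n-gram language models on a test set, so this is only meant to make the definition tangible.

```python
import math

def entropy(word_probs: dict[str, float]) -> float:
    """Entropy H in bits of a distribution over the system's language."""
    return -sum(p * math.log2(p) for p in word_probs.values() if p > 0)

def perplexity(word_probs: dict[str, float]) -> float:
    """Perplexity B = 2 ** H: the probability-weighted average branching factor."""
    return 2 ** entropy(word_probs)

# A uniform 64-word command language: H = 6 bits, perplexity 64, i.e. on
# average 64 equally likely continuations after each word.
uniform = {f"w{i}": 1 / 64 for i in range(64)}
print(entropy(uniform), perplexity(uniform))
```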
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Measuring Task Difficulty </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Proportional Baseline Rates </SectionTitle> <Paragraph position="0"> As a precursor step we need a clear definition of a natural language understanding task. For this we propose to assume a MATE-like annotation point of view, which provides a set of disjunct levels of annotation for the individual discriminatory decisions that can be performed on spoken dialogue data, ranging from annotating referring expressions, e.g. named entities and their relations, anaphora and their antecedents, to word senses and dialogue acts. Each task must, therefore, have a clearly defined set of markables, attributes and values for each corpus of spoken dialogue data.</Paragraph> <Paragraph position="1"> As a first step we propose a uniform and generic method for computing task-specific majority class baselines for a given task Tw from the entire set of tasks, i.e. T = {T1, ..., Tz} and Tw ∈ T.</Paragraph> <Paragraph position="2"> A gold standard annotation of a task Tw features a finite set of markable tokens C = {c1, ..., cn}, e.g. n = 2 in a corpus containing only the two ambiguous lexemes bank and run as markables, i.e. c1 and c2 respectively. For a member ci of the set C we can now define the set of values for the tagging attribute of sense as Ai = {bi1, ..., bini}. For example, for three senses of the markable bank as c1 we get the corresponding value set A1 = {building, institution, shore}, and for run as c2 the value set A2 = {motion, storm}.</Paragraph> <Paragraph position="3"> Note that the value sets have markable-dependent sizes; for our toy example containing the two markables c1 for bank and c2 for run they are |A1| = 3 and |A2| = 2.</Paragraph> <Paragraph position="4"> For computing the proportional majority classes we need to count the occurrences of a value j for a markable i in a given gold standard test data set. We call this Vij. Now we can determine the most frequently given value and its count for each markable ci as Vmax_i = max_{j in {1,...,ni}} Vij. For example, consider the marked-up toy corpus below, given as task T1: The run[storm] on the bank[building] on Monday caused the bank[institution] to collapse early this week. Its employees can therefore now enjoy a leisurely run[motion] on the bank[shore] of the Hudson river. It is uncertain if the bank[institution] can be saved so that they can run[motion] back to their desks and resume their work.</Paragraph> <Paragraph position="5"> This results in the following value occurrences, with Vmax_i marked by an asterisk: for bank (c1): building 1, institution 2*, shore 1; for run (c2): storm 1, motion 2*.</Paragraph> <Paragraph position="6"> We define the total number of values for a markable ci as VSi = sum_j Vij and the markable-specific majority class baseline as Bi = Vmax_i / VSi: if we always choose the most frequent value for markable ci, the percentage of correct guesses corresponds to Bi. We can now calculate the total number of values over all markables as VS = sum_i VSi. Based on this we can compute the task-specific proportional baseline for task Tw, i.e. BTw, over the entire test set as BTw = (sum_i VSi * Bi) / VS = (sum_i Vmax_i) / VS.</Paragraph> <Paragraph position="7"> Thus, BTw calculates the average rate of correct guesses of the majority baseline. Returning to our toy example, for c1 we get VS1 = 4 and for c2 we get VS2 = 3. Additionally, we also get different individual majority class baselines for each markable, i.e. for c1 we get B1 = 1/2 and for c2 we get B2 = 2/3. We also get the total number of values given for C (c1 and c2), i.e. VS = 7. Now we can compute the overall baseline BT1 as BT1 = (2 + 2) / 7 = 4/7 ≈ 0.57.</Paragraph>
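A minimal sketch of the proportional baseline computation just described: given gold-standard (markable, value) annotations it derives the counts Vij, VSi and Vmax_i and returns BTw; on the toy task T1 above it reproduces 4/7 ≈ 0.57. The data structures are illustrative, not the authors' implementation.

```python
from collections import Counter

def proportional_baseline(annotations: list[tuple[str, str]]) -> float:
    """B_Tw = (sum over markables of V_i^max) / VS."""
    counts: dict[str, Counter] = {}
    for markable, value in annotations:
        counts.setdefault(markable, Counter())[value] += 1
    vs = sum(sum(c.values()) for c in counts.values())           # VS
    vmax_total = sum(max(c.values()) for c in counts.values())   # sum of V_i^max
    return vmax_total / vs

# Toy task T1: four annotated tokens of "bank", three of "run".
t1 = [("bank", "building"), ("bank", "institution"), ("bank", "institution"),
      ("bank", "shore"), ("run", "storm"), ("run", "motion"), ("run", "motion")]
print(proportional_baseline(t1))   # 4/7, roughly 0.57
```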
<Paragraph position="8"> If we extend the corpus by an additional ambiguity, i.e. that of the spatial and temporal readings of the lexeme on, we obtain an annotated corpus such as the one given below as task T2: The run[storm] on the bank[building] on[temporal] Monday caused the bank[institution] to collapse early this week. Its employees can therefore now enjoy a leisurely run[motion] on[spatial] the bank[shore] of the Hudson river. It is uncertain if the bank[institution] can be saved so that they can run[motion] back to their desks and resume their work.</Paragraph> <Paragraph position="9"> For the new markable c3 (on) we get the value occurrences temporal 1 and spatial 1, so that VS3 = 2, B3 = 1/2, VS = 9 and BT2 = (2 + 2 + 1) / 9 = 5/9 ≈ 0.56. The reduction by .02 points, in this case, indicates that a method that for each markable always chooses the most frequently occurring value would perform slightly worse on the second corpus as compared to the first. Note that this proportional baseline measure is able to compute the performance of such a majority class-based method on any data set for any task. As such it provides a picture of a problem's or task's inherent difficulty, but only if the distribution of values for the markables at hand is fairly homogeneous. However, if we assume distributions of markable values such as c3: (b31 = 16, b32 = 16) and c4: (b41 = 16, b42 = 4, b43 = 4, b44 = 4, b45 = 4), we get identical values for BT3 and BT4.</Paragraph> <Paragraph position="10"> In this case the proportional baseline is 0.5 for both task baselines BT3 and BT4, with T3 featuring the task distribution depicted as c3 and T4 that of c4, despite the fact that task T4 was undoubtedly the more difficult one. To create a more applicable measure for task difficulty - i.e. one that also applies to cases of heterogeneous value distributions - we need to calculate an entropy metric that takes the individual value distributions into account.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Measuring Markable-specific Entropy </SectionTitle> <Paragraph position="0"> As a means of illustrating such a markable-specific entropy metric, we can look at the value space for each markable, define the minimal number of binary decisions that are on average necessary for solving the problem, and compute what part of the problem is solved by them. For example, looking at the markable c4 from above, we find that the problem can be solved by means of the following decisions: with one decision we can partition the space between b41 and the rest (b42 through b45), thereby assigning the value b41 to c4 16 times. This decision already solves 50% of the problem. Next we need a second decision for partitioning the value space between {b42, b43} and {b44, b45}, and a third for cutting between b42 and b43 as well as between b44 and b45, respectively. Therefore, three decisions are needed for assigning the 4 instances of the value b42, solving 12.5% of the problem. In the case of c4 the same holds for b43, b44 and b45, giving us the following decisions and solution shares, with d(bij) standing for the average number of binary decisions necessary for solving bij: d(b41) = 1 (solving 50%) and d(b42) = d(b43) = d(b44) = d(b45) = 3 (solving 12.5% each). As an illustrative approximation of a task's entropy we can now compute the aggregate number of decisions weighted by their contribution to the overall solution (given as its probability, i.e. 50% = .5). For c4 this yields 1 * .5 + 3 * .125 + 3 * .125 + 3 * .125 + 3 * .125 = 2, whereas the analogous computation for c3 yields 1 * .5 + 1 * .5 = 1.</Paragraph> <Paragraph position="1"> In these cases we can now say that solving the markable-specific value distribution of c4 is more difficult than solving that of c3, indicated by the increase of 1 point in this quasi-entropy measure.</Paragraph>
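The quasi-entropy above can be reproduced by building the binary partitions bottom-up, always merging the two rarest value groups (a Huffman-style construction); for the distributions of c3 and c4 this yields exactly the 1 and 2 average decisions per token computed in the text. This is an illustrative approximation and not necessarily the authors' exact procedure.

```python
import heapq

def average_decisions(value_counts: list[int]) -> float:
    """Probability-weighted average number of binary decisions per annotated token."""
    total = sum(value_counts)
    heap = list(value_counts)
    heapq.heapify(heap)
    weighted_depth = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        weighted_depth += a + b   # every merge adds one decision above the merged group
        heapq.heappush(heap, a + b)
    return weighted_depth / total

print(average_decisions([16, 16]))           # c3: 1.0
print(average_decisions([16, 4, 4, 4, 4]))   # c4: 2.0
```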
<Paragraph position="2"> Note that if we had a binary decision procedure that solves a fraction f of the cases correctly at each decision, we would get an average success rate for T4 of f^2, e.g. 0.9 * 0.9 = 0.81, whereas for T3 it would remain 0.9.</Paragraph> <Paragraph position="3"> After this approximate illustration of measuring task difficulty via the notion of its entropy, we can now compute a corresponding markable-specific entropy measure Hci based on the standard formula Hci = - sum_j (Vij / VSi) * log2(Vij / VSi).</Paragraph> <Paragraph position="4"> This computation yields Hc3 = 1 and Hc4 = 2, which also reflects the difference in difficulty of T3 (consisting of the sole markable c3) versus T4 (consisting of the sole markable c4).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Combining Markable-specific Entropies </SectionTitle> <Paragraph position="0"> We propose to combine the individual markable-specific entropies in an analogous way, by a weighted average, whereby the markable-specific weights are determined by VSi and the averaging is based on VS: HTw = (sum_i VSi * Hci) / VS. As an example we return to our sample tasks T1 and T2.</Paragraph> <Paragraph position="1"> Correspondingly, we get for task T1, consisting of markables c1 and c2, a value HT1 ≈ 1.25, and for task T2, consisting of markables c1, c2 and c3, a value HT2 ≈ 1.35.</Paragraph> <Paragraph position="2"> In much the same way as the proportional baseline rate - only more generally applicable - this increase of 0.1 points in task entropy reflects the increase in task difficulty from T1 to T2. Now that we have a clearly defined way of measuring task-specific difficulties - based on their markable-specific entropies - we can evaluate our approach by means of the larger experiment described below.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Evaluating the Metrics </SectionTitle> <Paragraph position="0"> In the following we report on the results of a corpus study to evaluate the task-specific entropy measurement proposed above. In our minds such a study can be performed in the following way: given a marked-up corpus as an evaluation gold standard, we can alter the corpus' difficulty in three ways:
- eliminate parts of the corpus so that the number of values of the individual markables is decreased; we will call this vertical pruning;
- eliminate parts of the corpus so that the number of individual markables is decreased; we will call this horizontal pruning;
- eliminate parts of the corpus so that both the number of markables and their respective values are reduced; we will call this diagonal pruning.</Paragraph> <Paragraph position="1"> Since each of these procedures can increase or reduce the overall task difficulty, we can use them to test whether our proposed task entropy measure is able to reflect this in a non-toy-world example. For our study we employ the SMARTKOM (Wahlster, 2003) sense-tagged corpus used in the word sense disambiguation study reported by Loos and Porzel (2004). An overview of the markables and their value distributions is given in Appendix 1.</Paragraph> <Paragraph position="2"> We can now compute the entropy for the whole task as HTwhole = 1966.18083 / 2100 ≈ 0.94.</Paragraph>
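Before turning to the pruning experiments, the markable-specific entropy Hci and its VSi-weighted combination can be sketched as follows; run on the toy task T1 from Section 4.1 it reproduces HT1 ≈ 1.25. This is an illustrative re-implementation of the formulas above, not the code used for the corpus study.

```python
import math
from collections import Counter

def markable_entropy(value_counts: Counter) -> float:
    """H_ci = - sum_j (V_ij / VS_i) * log2(V_ij / VS_i)."""
    vs_i = sum(value_counts.values())
    return -sum((v / vs_i) * math.log2(v / vs_i) for v in value_counts.values())

def task_entropy(annotations: list[tuple[str, str]]) -> float:
    """H_Tw = (sum_i VS_i * H_ci) / VS, the VS_i-weighted average entropy."""
    counts: dict[str, Counter] = {}
    for markable, value in annotations:
        counts.setdefault(markable, Counter())[value] += 1
    vs = sum(sum(c.values()) for c in counts.values())
    return sum(sum(c.values()) * markable_entropy(c) for c in counts.values()) / vs

t1 = [("bank", "building"), ("bank", "institution"), ("bank", "institution"),
      ("bank", "shore"), ("run", "storm"), ("run", "motion"), ("run", "motion")]
print(round(task_entropy(t1), 2))   # approximately 1.25
```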
<Paragraph position="3"> For the horizontal pruning we removed all markables where a single value assumed more than 90% of that markable's occurrences. Intuitively this makes the task harder, because we took out the easy cases, which amounted to about 20% of the entire corpus. We can now compute the entropy for the horizontally pruned task as HThorizontal = 1718.07959 / 1548 ≈ 1.11.</Paragraph> <Paragraph position="4"> For the vertical pruning we removed all values bi3 from the entire set. Intuitively this makes the task easier, because fewer decisions are necessary to solve those markables that had values in bi3. This yields HTvertical = 1894.89386 / 2073 ≈ 0.91.</Paragraph> <Paragraph position="5"> For the diagonal pruning we again removed all values bi3 from the entire set, making the task easier, and additionally removed horizontally all markables where the majority class was under 60%, i.e. the hardest cases. This yields HTdiagonal = 1729.4254 / 1936 ≈ 0.89.</Paragraph> </Section> </Paper>