<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1412">
<Title>Noun Phrase Generation for Situated Dialogs</Title>
<Section position="7" start_page="85" end_page="87" type="evalu">
<SectionTitle> 5 Evaluation </SectionTitle>
<Paragraph position="0"> We report several methods of evaluating the NP frames produced by the decision trees. First, we report the results of a strict evaluation in which the system's output must exactly match the expressions produced by the human subjects. We also compare this result with a hand-crafted Centering-style generation algorithm. Requiring the algorithm to exactly match human performance is an overly strict criterion, since several referring expression forms may be equally felicitous in a given context, so we also conducted a human judgment study. The 5 test dialogs contain 295 target expressions.</Paragraph>
<Section position="1" start_page="85" end_page="85" type="sub_section">
<SectionTitle> 5.1 Exact Match Evaluation </SectionTitle>
<Paragraph position="0"> The output of the decision tree classifier was compared to the expressions observed in the test dialogs. Table 4 reports the results of this evaluation.</Paragraph>
<Paragraph position="1"> The accuracy obtained was 31.2%. The most frequent tag gives a 20.0% baseline performance under this strict match criterion.</Paragraph>
</Section>
<Section position="2" start_page="85" end_page="85" type="sub_section">
<SectionTitle> 5.2 Comparison to Centering </SectionTitle>
<Paragraph position="0"> To compare the performance of our generation algorithm with existing work on NP generation, we performed a manual evaluation of the centering-style generation algorithm described in (Kibble and Power, 2000) against our dialog corpus. Algorithms developed within the centering framework use discourse coherence to make pronominalization decisions (Grosz et al., 1995), where coherence is measured in terms of topical continuity from one sentence to the next.</Paragraph>
<Paragraph position="1"> Centering designates the backward-looking center (Cb) as the item in the current sentence that was most topical in the previous sentence. Therefore, to perform a centering-style evaluation, the dialogs must be broken into sentence-like units, and a ranking procedure must be devised for the items mentioned in each unit.</Paragraph>
<Paragraph position="2"> Because the evaluation corpus is spoken dialog, it was not parsed to determine syntactic or dependency structure automatically; instead, it was manually segmented into utterance units, each containing a main predicate and its satellites. The items mentioned in each unit were ranked according to thematic roles, using the ranking {AGENT > PATIENT > COMP > ADJUNCT} and excluding references to the speakers themselves, which often appear in AGENT position (Byron and Stent, 1998). The Cb in each unit, if there is one, is the highest-ranked item from the prior unit's list that is repeated in the current unit's list. Following a procedure similar to that reported by Kibble and Power, our decision procedure recommends pronominalizing an item if it is the Cb of its unit and is in Subject position; otherwise a description is generated. Based on this rule, all items mentioned for the first time in the discourse are predicted to require a description.
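As a rough illustration, this decision procedure can be sketched as follows. This is a minimal Python sketch under assumed data structures; the unit representation, field names, and function names are ours for exposition, not the actual implementation:

# Sketch of the centering-style pronominalization decision described above.
# Each utterance unit is assumed to be a list of item dicts carrying an
# identifier, a thematic role, and a subject flag; references to the
# speakers themselves are assumed to have been excluded already.

ROLE_RANK = {"AGENT": 0, "PATIENT": 1, "COMP": 2, "ADJUNCT": 3}

def backward_looking_center(prev_unit, curr_unit):
    """Cb = highest-ranked item of the previous unit that is repeated
    in the current unit, or None if no item is repeated."""
    curr_ids = {item["id"] for item in curr_unit}
    repeated = [item for item in prev_unit if item["id"] in curr_ids]
    if not repeated:
        return None
    return min(repeated, key=lambda item: ROLE_RANK[item["role"]])["id"]

def recommend_form(item, prev_unit, curr_unit, already_mentioned):
    """Return 'pronoun' (personal or demonstrative) or 'description'."""
    if item["id"] not in already_mentioned:
        return "description"      # first mentions always get a description
    cb = backward_looking_center(prev_unit, curr_unit)
    if item["id"] == cb and item["is_subject"]:
        return "pronoun"          # the Cb realized in subject position
    return "description"

For each target expression in the test dialogs, the recommendation produced by this procedure is then compared against the form the speaker actually produced.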
Although most prior studies take the recommendation to pronominalize to mean that a personal pronoun (e.g. it) should be generated, the demonstrative nature of our domain means that a decision to produce a pronoun can be realized as either a demonstrative or a personal pronoun. Therefore, we considered the algorithm's output to match human production when the target expression in the human corpus was either a personal or a demonstrative pronoun and the algorithm generated either category of pronoun. Table 5 shows the comparison of our system's output with the output of the centering algorithm on anaphoric mentions. The 5 dialogs used for testing in this study contained 145 such items. Both algorithms obtained similar accuracy (64.8% for our system vs. 64.1% for centering), and both over-generated pronouns. Although our algorithm does not outperform centering, it assumes less structural analysis of the input text.</Paragraph>
</Section>
<Section position="3" start_page="85" end_page="87" type="sub_section">
<SectionTitle> 5.3 Human Judgment Evaluation </SectionTitle>
<Paragraph position="0"> Evaluating generation studies by calculating their similarity to human spontaneous speech may not be the ideal performance metric, since several different realizations may be equally felicitous in a given context. Therefore, we also performed a human judgment evaluation. In this evaluation, judges compared the NPs generated by our algorithm to the NPs produced by the human subjects and to NPs with randomly generated feature assignments. Judges viewed the test NPs in the context of the original test corpus.</Paragraph>
<Paragraph position="1"> To re-create the context in which the original expression was produced, the video, audio, and dialog transcript were played for the judges using the Anvil annotation tool (Kipp, 2004). The judges could play or pause the video as they wished. Using the word alignments established during the data annotation phase, the audio of each test NP was replaced by silence, and its words were removed from the transcript shown in the time-line viewer. For each test item, the judges were presented with a selection box showing two possible referring expressions, which they were asked to compare using a qualitative ranking (option 1 is better, option 2 is better, or they are equal), given a particular target ID and the context. Figure 4 shows a screenshot of the judges' annotation tool.</Paragraph>
<Paragraph position="2"> The judges did not know the source of the expressions they evaluated (system, human production, or random). The 10 judges were volunteers from the university community who were self-identified native speakers of English. They were not compensated for their time.</Paragraph>
<Paragraph position="3"> The decision tree selected NP-frame slot values, which were then converted into realized NPs. The Det and Head choices were translated directly into surface forms (for Head=noun we chose a consistent common noun for each semantic class: button, door, or cabinet). If the system's selection for the Mod feature matched the value from the corpus, we used the expression produced by the original speaker. If the original expression did not include a modifier but the system selected Mod:+, we lexicalized this feature as a simple but correct spatial description such as on the right, on the left, or in front. Table 6 shows the results of human judging.</Paragraph>
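The realization step for noun-headed frames can be sketched roughly as follows; the slot names, calling convention, and the particular spatial phrase are assumptions made for exposition rather than the system's exact lexicalization:

# Illustrative sketch of turning an NP-frame's slot values (Det, Head, Mod)
# into a surface NP for the judging study.  Field names and word choices
# are assumptions for exposition, not the exact mapping used in the system.

HEAD_NOUNS = {"BUTTON": "button", "DOOR": "door", "CABINET": "cabinet"}

def realize_noun_np(frame, semantic_class, corpus_modifier):
    """Realize a frame whose Head slot is a common noun.
    frame: dict with 'det' (e.g. 'the') and 'mod' ('+' or '-');
    corpus_modifier: the modifier the original speaker used, or None."""
    words = []
    if frame["det"]:
        words.append(frame["det"])
    words.append(HEAD_NOUNS[semantic_class])
    if frame["mod"] == "+":
        if corpus_modifier is not None:
            # the Mod choice matches the corpus, so reuse the speaker's wording
            words.append(corpus_modifier)
        else:
            # the speaker used no modifier; fall back to a simple but correct
            # spatial description ("on the right", "on the left", "in front")
            words.append("on the right")
    return " ".join(words)

Under these assumptions, for example, realize_noun_np({"det": "the", "mod": "+"}, "BUTTON", None) would yield "the button on the right".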
<Paragraph position="4"> The system's output was judged equal or preferable to the original spontaneous language in 62.6% of the cases where these two choices were compared directly. Interestingly, the randomly generated choice was preferred over the original spontaneous language in 13.0% of trials, and over the system's output in 22.5% of trials. Aggregating over all judges, the system's performance was judged to be much better than random, but not as good as the original human language.</Paragraph>
<Paragraph position="5"> Trials were balanced among judges so that each target item was seen by four judges: two comparing the system's output to the original human language, one comparing the system to random, and one comparing the human production to random.</Paragraph>
<Paragraph position="6"> There were 282 trials for which two judges saw the identical pair of choices. Of these, the two judges' responses agreed in 197 cases, producing an inter-annotator reliability (kappa score) of 0.51, with raw agreement of 69% and expected agreement of 37%. Although this is a relatively low kappa value, we believe that the aggregate judgments of all of the judges over all of the test items are still informative, since the scores of the items for which we have two judgments follow a pattern very similar to the overall distribution of responses. The low inter-annotator agreement may be due to the substitutability of the expressions.</Paragraph>
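For reference, the reported kappa follows directly from the agreement figures quoted above; a minimal check using the rounded percentages given in the text:

# Quick check of the reported kappa from the agreement figures above
# (raw agreement 69%, chance agreement 37%); inputs are the rounded
# values quoted in the text, so minor rounding differences are possible.
p_o = 0.69                       # observed (raw) agreement
p_e = 0.37                       # agreement expected by chance
kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))           # 0.51

</Section>
</Section>
</Paper>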