<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1073">
  <Title>Designing a Task-Based Evaluation Methodology for a Spoken Machine Translation System</Title>
  <Section position="3" start_page="569" end_page="569" type="metho">
    <SectionTitle>
2 Goals of a Task-Based Evaluation Methodology for an MT System
</SectionTitle>
    <Paragraph position="0"> The goal of a task-based evaluation for an MT system is to assess whether speakers' goals were translated correctly. An advantage of focusing on goal translation is that it allows us to compare dialogues in which the speakers employ different dialogue strategies. In our project, we focus on three issues in goal communication: (1) distinction of goals based on subgoal complexity, (2) distinction of goals based on the speaker's prioritization, and (3) distinction of goals based on domain.</Paragraph>
  </Section>
  <Section position="4" start_page="569" end_page="571" type="metho">
    <SectionTitle>
3 Prioritization of Goals
</SectionTitle>
    <Paragraph position="0"> While we want to evaluate whether speakers' important goals are translated correctly, this is sometimes difficult to ascertain, since not only must the speaker's goals be concisely describable and circumscribable, but also they must not change while she is attempting to achieve her task. Speakers usually have a prioritization of goals that cannot be predicted in advance and which differs between speakers; for example, if one client wants to book a trip to Tokyo, it may be imperative for him to book the flight tickets at the least, while reserving rooms in a hotel might be of secondary importance, and finding out about sights in Tokyo might be of lowest priority. However, his goals could be prioritized in the opposite order, or could change if he finds one goal too difficult to communicate and abandons it in frustration.</Paragraph>
    <Paragraph position="1"> If we insist on eschewing unreliability issues inherent in asking the client about the priority of his goals after the dialogue has terminated (and he has perhaps forgotten his earlier priority assignment), we cannot rely on an invariant prioritization of goals across speakers or across a dialogue. The only way we can predict the speaker's goals at the time he is trying to communicate them is in cases where his goals are not communicated and he attempts to repair them.</Paragraph>
    <Paragraph position="2"> We can distinguish between cases in which goal communication succeeds or fails, and we can count the number of repair attempts in both cases. The insight is that speakers will attempt to repair higher priority goals more than lower priority goals, which they will abandon sooner.</Paragraph>
    <Paragraph position="3"> The number of repair attempts per goal quantifies the speaker's priority per goal to some degree. We can capture this information in a simple metric that distinguishes between goals that eventually succeed or fail with at least one repair attempt. Goals that eventually succeed with tg repair attempts can be given a score of 1/tg, which has a maximum score of 1 when there is only one repair attempt, and decays to 0 as the number of repair attempts goes to infinity. Similarly, we can give a score of -(1 - 1/tg) to goals that are eventually abandoned with tg repair attempts; this has a maximum of 0 when there is only a single repair attempt and goes to -1 as tg goes to infinity. The overall dialogue score then becomes the average of these per-goal scores over all goals, with a maximum score of 1 and a minimum score of -1. Another factor to be considered is goal complexity; clearly we want to distinguish between dialogues with the same main goals but in which some have many subgoals while others have few subgoals with little elaboration. For instance, one traveller going to Tokyo may be satisfied with simply specifying his departure and arrival times for the outgoing and return legs of his flight, while another may have the additional subgoals of wanting a two-day stopover in London, vegetarian meals, and aisle seating in the non-smoking section. In the metric above, both goals and subgoals are treated in the same way (i.e., the sum over goals includes subgoals), and we are not weighting their scores any differently. While many subgoals require that the main goal they fall under be communicated for them to be communicated, it is also true that for some speakers, communicating just the main goal and not the subgoal may be a communication failure. 
For example, if it is crucial for a speaker to get a stopover in London, even if his main goal (requesting a return flight from New York to Tokyo) is successfully communicated, he will view the communication attempt as a failure unless the system communicates the stopover successfully as well. On the other hand, communicating the subgoal (e.g., a stopover in London) without communicating the main goal is nonsensical - the travel agent will not know what to make of &quot;a stopover in London&quot; without the accompanying main goal requesting the flight to Tokyo.</Paragraph>
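The repair-based goal and dialogue scores defined above can be sketched in a few lines of code. This is a minimal illustration of our reading of the metric; the function names are ours, not the paper's:

```python
def goal_score(succeeded, repair_attempts):
    """Score one goal that required at least one repair attempt.

    Succeeding goals score 1/tg (1 for a single repair, decaying toward 0);
    abandoned goals score -(1 - 1/tg) (0 for a single repair, going to -1).
    """
    t = repair_attempts
    if t < 1:
        raise ValueError("metric is defined only for goals with >= 1 repair attempt")
    return 1.0 / t if succeeded else -(1.0 - 1.0 / t)


def dialogue_score(goals):
    """Average goal_score over (succeeded, repair_attempts) pairs; range [-1, 1]."""
    return sum(goal_score(s, t) for s, t in goals) / len(goals)
```

For example, a dialogue with one goal communicated on the first repair and one abandoned after two repairs scores (1.0 + (-0.5)) / 2 = 0.25.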
    <Paragraph position="4"> However, even if two dialogues have the same goals and subgoals, the complexity of the translation task may differ; for example, if in one dialogue (A) the speaker communicates a single goal or subgoal per speaker turn, while in the other (B) the speaker communicates the goal and all its subgoals in the same speaker turn, it is clear that the dialogue in which the entire goal structure is conveyed in the same speaker turn will be the more difficult translation task.</Paragraph>
    <Paragraph position="5"> We need to be able to account for the average goal complexity per speaker turn in a dialogue and scale the above metric accordingly; if dialogues A and B have the same score according to the given metric, we should boost the score of B to reflect that it has required a more rigorous translation effort. A first attempt would be to simply multiply the score of the dialogue by the average subgoal complexity per main goal per speaker turn in the dialogue, where Nmg is the number of main goals in a speaker turn and Nsg is the number of subgoals. In the metric below, the average subgoal complexity is 1 for speaker turns in which there are no subgoals, and increases as the number of subgoals in the speaker turn increases.</Paragraph>
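The equation paragraph that originally followed here (position 6 in the source XML) did not survive extraction. Given the stated properties (the complexity is 1 for speaker turns with no subgoals and increases with the number of subgoals), one plausible reconstruction, offered only as a hypothesis, is:

```latex
% Hypothetical reconstruction of the lost equation: equals 1 when N_sg = 0,
% and grows as the number of subgoals in the speaker turn grows.
\mathit{complexity} = \frac{N_{mg} + N_{sg}}{N_{mg}} = 1 + \frac{N_{sg}}{N_{mg}}
```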
    <Paragraph position="7"> Scoring a dialogue is a coding task; scorers will need to be able to distinguish goals and subgoals in the domain. We want to minimize training for scorers while maximizing agreement between them. To do so, we list a predefined set of main goals (e.g., making flight arrangements or hotel bookings) and group together all subgoals that pertain to these main goals in a two-level tree. Although this formalization sacrifices subgoal complexity, we cannot represent that complexity without predefining a subgoal hierarchy, and we want to avoid predefining subgoal priorities, which assigning such a hierarchy would implicitly do.</Paragraph>
    <Paragraph position="8"> After familiarizing themselves with the set of main goals and their accompanying subgoals, scorers code a dialogue by marking, in each speaker turn, the main goals and subgoals, whether each is successfully communicated or not, and the number of repair attempts in successive speaker turns. Scorers must also indicate which domain each goal falls under; we distinguish goals as in-domain (i.e., referring to the travel-reservation domain), out-of-domain (i.e., unrelated to the task in any way), and cross-domain (i.e., discussing the weather, common polite phrases, accepting, negating, opening or closing the dialogue, or asking for repeats). The distinction between domains is important in that we can separate in-domain goals from cross-domain goals; cross-domain goals often serve a meta-level purpose in the dialogue.</Paragraph>
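The coding scheme just described can be sketched as a small record type. All identifiers below are our own illustration of the scheme, not an artifact of the paper:

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    """Domain labels a scorer assigns to each goal."""
    IN_DOMAIN = "in-domain"          # travel-reservation goals
    OUT_OF_DOMAIN = "out-of-domain"  # unrelated to the task in any way
    CROSS_DOMAIN = "cross-domain"    # weather, politeness, openings/closings, repeats


@dataclass
class CodedGoal:
    """One goal or subgoal as coded by a scorer for a speaker turn."""
    name: str              # e.g. "book-flight" (main goal) or "aisle-seat" (subgoal)
    is_subgoal: bool       # position in the predefined two-level goal tree
    communicated: bool     # did communication eventually succeed?
    repair_attempts: int   # repairs counted across successive speaker turns
    domain: Domain
```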
    <Paragraph position="9"> We can thus evaluate performance over all goals while maintaining a clear performance measure for in-domain goals. Scores should be calculated separately based on domain, since this will indicate system performance more specifically, and provide a useful metric for grammar developers to compare subsequent and current domain scores for dialogues from a given scenario.</Paragraph>
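Computing scores separately per domain, as proposed here, amounts to grouping coded goals by their domain label before averaging. A sketch, assuming goals are plain dictionaries and the per-goal scoring function is passed in:

```python
from collections import defaultdict


def domain_scores(coded_goals, score_fn):
    """Average a per-goal score separately within each domain label.

    coded_goals: iterable of dicts with keys "domain", "communicated",
    and "repair_attempts" (our assumed representation).
    score_fn: maps (communicated, repair_attempts) to a float in [-1, 1].
    """
    buckets = defaultdict(list)
    for g in coded_goals:
        buckets[g["domain"]].append(score_fn(g["communicated"], g["repair_attempts"]))
    return {domain: sum(scores) / len(scores) for domain, scores in buckets.items()}
```

Grammar developers can then compare, say, the in-domain average for a scenario across system versions without cross-domain goals diluting the figure.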
    <Paragraph position="10"> In a large-scale evaluation, multiple pairs of speakers will be given the same scenario (i.e., a specific task to accomplish; e.g., flying to Frankfurt, arranging a stay there for 2 nights, sightseeing at the museums, then flying on to Tokyo); domain scores will then be calculated and averaged over all speakers.</Paragraph>
    <Paragraph position="11"> Actual evaluation is performed on transcripts of dialogues labelled with information from system logs; this enables us to see the original utterance (the human transcription) and evaluate the correctness of the target output. If we wish, log-file evaluations also permit us to evaluate the system in a glass-box approach, assessing individual system components separately (Simpson and Fraser, 1993).</Paragraph>
  </Section>
</Paper>