<?xml version="1.0" standalone="yes"?> <Paper uid="J97-1002"> <Title>The Reliability of a Dialogue Structure Coding Scheme</Title> <Section position="5" start_page="14" end_page="14" type="metho"> <SectionTitle> Figure 1 Conversational move categories. </SectionTitle> <Paragraph position="0"> [Figure 1 is a decision tree of questions used to classify conversational moves. Commands are INSTRUCT moves and statements are EXPLAIN moves. For questions: if the speaker is asking only for evidence that an information transfer was successful, so they can move on, the move is an ALIGN; if the question asks for confirmation of material which the speaker believes might be inferred, given the dialogue context, it is a CHECK; otherwise it is a QUERY-YN if it asks for a yes-no answer and a QUERY-W if it asks for something more complex. For responses: a response that only shows evidence that communication has been successful is an ACKNOWLEDGE; a response that contributes task/domain information is a CLARIFY if it amplifies the information requested, and otherwise a REPLY-Y, REPLY-N, or REPLY-W according to whether it means yes, no, or something more complex. A READY move prepares the conversation for a new game.]</Paragraph> </Section>
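Purely as an illustration (this is not part of the published coding materials), the Figure 1 questions can be read as a small decision procedure. The boolean parameters below are hypothetical names standing in for a coder's answers to the questions in the figure.

```python
# A minimal sketch, not the authors' tool: the Figure 1 decision tree as code.
# The parameters are hypothetical stand-ins for a coder's answers.

def classify_question_move(seeks_evidence_of_transfer: bool,
                           confirms_inferable_material: bool,
                           takes_yes_no_answer: bool) -> str:
    """Classify an initiating question move using the Figure 1 question subtree."""
    if seeks_evidence_of_transfer:
        return "ALIGN"        # asking only whether the information transfer succeeded
    if confirms_inferable_material:
        return "CHECK"        # confirming material believed to be inferable from context
    return "QUERY-YN" if takes_yes_no_answer else "QUERY-W"

def classify_response_move(contributes_task_information: bool,
                           amplifies_requested_information: bool,
                           answer_meaning: str) -> str:
    """Classify a response move; answer_meaning is 'yes', 'no', or 'complex'."""
    if not contributes_task_information:
        return "ACKNOWLEDGE"  # only shows that communication has been successful
    if amplifies_requested_information:
        return "CLARIFY"      # more than the information strictly requested
    return {"yes": "REPLY-Y", "no": "REPLY-N"}.get(answer_meaning, "REPLY-W")
```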
<Section position="12" start_page="14" end_page="15" type="metho"> <Section position="1" start_page="14" end_page="15" type="sub_section"> <SectionTitle> 3.1 The Move Coding Scheme </SectionTitle> <Paragraph position="0"> The move coding analysis is the most substantial level. It was developed by extending the moves that make up Houghton's (1986) interaction frames to fit the kinds of interactions found in the Map Task dialogues. In any categorization, there is a trade-off between usefulness and ease or consistency of coding. Too many semantic distinctions make coding difficult. These categories were chosen to be useful for a range of purposes but still be reliable. The distinctions used to classify moves are summarized in Figure 1. 3.1.1 The INSTRUCT Move. An INSTRUCT move commands the partner to carry out an action. The instruction can be quite indirect, as in example 3 below, as long as there is a specific action that the instructor intends to elicit (in this case, focusing on the start point). In the Map Task, this usually involves the route giver telling the route follower how to navigate part of the route. Participants can also give other INSTRUCT moves, such as telling the partner to go through something again but more slowly. In these and later examples, G denotes the instruction giver, the participant who knows the route, and F, the instruction follower, the one who is being told the route. Editorial comments that help to establish the dialogue context are given in square brackets.</Paragraph> <Paragraph position="1"> Example 1 G: Go right round, ehm, until you get to just above them.</Paragraph> <Paragraph position="2"> Example 2 G: If you come in a wee bit so that you're about an inch away from both edges.</Paragraph> <Paragraph position="3"> Example 3 G: We're going to start above th ... directly above the telephone kiosk. Example 4 F: Say it ... start again.</Paragraph> <Paragraph position="4"> Example 5 F: Go. [as first move of dialogue; poor quality but still an instruction] 3.1.2 The EXPLAIN Move. An EXPLAIN move states information that has not been elicited by the partner. (If the information were elicited, the move would be a response, such as a reply to a question.) The information can be some fact about either the domain or the state of the plan or task, including facts that help establish what is mutually known.</Paragraph> <Paragraph position="5"> Example 6 G: Where the dead tree is on the other side of the stream there's farmed land.</Paragraph> </Section> </Section> <Section position="13" start_page="15" end_page="28" type="metho"> <SectionTitle> 3.1.3 The CHECK Move. A CHECK move requests the partner to confirm information </SectionTitle> <Paragraph position="0"> that the speaker has some reason to believe, but is not entirely sure about. Typically the information to be confirmed is something the partner has tried to convey explicitly or something the speaker believes was meant to be inferred from what the partner has said. In principle, CHECK moves could cover past dialogue events (e.g., "I told you about the land mine, didn't I?") or any other information that the partner is in a position to confirm. However, CHECK moves are almost always about some information that the speaker has been told. One exception in the Map Task occurs when a participant is explaining a route for the second time to a different route follower, and asks for confirmation that a feature occurs on the partner's map even though it has not yet been mentioned in the current dialogue.</Paragraph> <Paragraph position="1"> Example 11 G: ... you go up to the top left-hand corner of the stile, but you're only, say about a centimetre from the edge, so that's your line.</Paragraph> <Paragraph position="2"> F: OK, up to the top of the stile? G: Right, em, go to your right towards the carpenter's house.</Paragraph> <Paragraph position="3"> F: Alright well I'll need to go below, I've got a blacksmith marked. G: Right, well you do that.</Paragraph> <Paragraph position="4"> F: Do you want it to go below the carpenter? [*] G: No, I want you to go up the left hand side of it towards green bay and make it a slightly diagonal line, towards, em sloping to the right. F: So you want me to go above the carpenter? [**] G: Uh-huh.</Paragraph> <Paragraph position="5"> F: Right.</Paragraph> <Paragraph position="6"> Note that in example 13, the move marked * is not a CHECK because it asks for new information (F has only stated that he'll have to go below the blacksmith), but the move marked ** is a CHECK because F has inferred this information from G's prior contributions and wishes to have confirmation.</Paragraph> <Paragraph position="7"> 3.1.4 The ALIGN Move. An ALIGN move checks the partner's attention, agreement, or readiness for the next move. At most points in task-oriented dialogue, there is some piece of information that one of the participants is trying to transfer to the other participant.
The purpose of the most common type of ALIGN move is for the transferer to know that the information has been successfully transferred, so that they can close that part of the dialogue and move on. If the transferee has acknowledged the information clearly enough, an ALIGN move may not be necessary. If the transferer needs more evidence of success, then alignment can be achieved in two ways. If the transferer is sufficiently confident that the transfer has been successful, a question such as "OK?" suffices. Some participants ask for this kind of confirmation immediately after issuing an instruction, probably to force more explicit responses to what they say. Less-confident transferers can ask for confirmation of some fact that the transferee should be able to infer from the transferred information, since this provides stronger evidence of success. Although ALIGN moves usually occur in the context of an unconfirmed information transfer, participants also use them at hiatuses in the dialogue to check that "everything is OK" (i.e., that the partner is ready to move on) without asking about anything in particular.</Paragraph> <Paragraph position="8"> Example 14 G: OK? [after an instruction and an acknowledgment] Example 15 G: You should be skipping the edge of the page by about half an inch, OK? Example 16 G: Then move that point up half an inch so you've got a kind of diagonal line again.</Paragraph> <Paragraph position="9"> F: Right.</Paragraph> <Paragraph position="10"> G: This is the left-hand edge of the page, yeah? [where the query is asked very generally about a large stretch of dialogue, "just in case"] 3.1.5 The QUERY-YN Move. A QUERY-YN asks the partner any question that takes a yes or no answer and does not count as a CHECK or an ALIGN. In the Map Task, these questions are most often about what the partner has on the map. They are also quite often questions that serve to focus the attention of the partner on a particular part of the map or that ask for domain or task information where the speaker does not think that information can be inferred from the dialogue context.</Paragraph> <Paragraph position="11"> G: Right just move straight down from there, then. F: Past the blacksmith? [with no previous mention of blacksmith or any distance straight down, so that F can't guess the answer] 3.1.6 The QUERY-W Move. A QUERY-W is any query not covered by the other categories. Although most moves classified as QUERY-W are wh-questions, otherwise unclassifiable queries also go in this category. This includes questions that ask the partner to choose one alternative from a set, as long as the set is not yes and no. Although technically the tree of coding distinctions allows for a CHECK or an ALIGN to take the form of a wh-question, this is unusual in English. In both ALIGN and CHECK moves, the speaker tends to have an answer in mind, and it is more natural to formulate them as yes-no questions. Therefore, in English all wh-questions tend to be categorized as QUERY-W. It might be possible to subdivide QUERY-W into theoretically interesting categories rather than using it as a "wastebasket," but in the Map Task such queries are rare enough that subdivision is not worthwhile.</Paragraph> <Paragraph position="12"> Example 21 G: Towards the chapel and then you've ... F: Towards what? Example 22 G: Right, okay. Just move round the crashed spaceship so that you've ...
you reach the finish, which should be left ... just left of the ... the chestnut tree.</Paragraph> <Paragraph position="13"> F: Left of the bottom or left of the top of the chestnut tree? Example 23 F: No I've got a ... I've got a trout farm over to the right underneath Indian Country here.</Paragraph> <Paragraph position="14"> G: Mmhmm.</Paragraph> <Paragraph position="15"> G: I want you to go three inches past that going south, in other words just to the level of that, I mean, not the trout farm.</Paragraph> <Paragraph position="16"> F: To the level of what?</Paragraph> <Section position="1" start_page="18" end_page="21" type="sub_section"> <SectionTitle> 3.2 Response moves </SectionTitle> <Paragraph position="0"> The following moves are used within games after an initiation, and serve to fulfill the expectations set up within the game.</Paragraph> <Paragraph position="1"> 3.2.1 The ACKNOWLEDGE Move. An ACKNOWLEDGE move is a verbal response that minimally shows that the speaker has heard the move to which it responds, and often also demonstrates that the move was understood and accepted. Verbal acknowledgments do not have to appear even after substantial explanations and instructions, since acknowledgment can be given nonverbally, especially in face-to-face settings, and because the partner may not wait for one to occur. Clark and Schaefer (1989) give five kinds of evidence that an utterance has been accepted: continued attention, initiating a relevant utterance, verbally acknowledging the utterance, demonstrating an understanding of the utterance by paraphrasing it, and repeating part or all of the utterance verbatim. Of these kinds of evidence, only the last three count as ACKNOWLEDGE moves in this coding scheme; the first kind leaves no trace in a dialogue transcript to be coded, and the second involves making some other, more substantial dialogue move.</Paragraph> <Paragraph position="2"> 3.2.2 The REPLY-Y Move. A REPLY-Y move is a reply to a query with a yes-no surface form that means "yes", however that is expressed. Since REPLY-Y moves are elicited responses, they normally only appear after QUERY-YN, ALIGN, and CHECK moves.</Paragraph> <Paragraph position="3"> G: Do you want me to run by that one again? F: Yeah, if you could.</Paragraph> <Paragraph position="4"> 3.2.3 The REPLY-N Move. Similar to REPLY-Y, a reply to a query with a yes-no surface form that means "no" is a REPLY-N.</Paragraph> <Paragraph position="5"> Example 30 G: Do you have the west lake, down to your left? F: No.</Paragraph> <Paragraph position="6"> Example 31 G: So you're at a point that's probably two or three inches away from both the top edge, and the left-hand side edge. Is that correct? F: No, not at the moment.</Paragraph> <Paragraph position="7"> One caveat about the meaning of the difference between REPLY-Y and REPLY-N: rarely, queries include negation (e.g., "You don't have a swamp?"; "You're not anywhere near the coast?"). As for the other replies, whether the answer is coded as a REPLY-Y or a REPLY-N depends on the surface form of the answer, even though in this case "yes" and "no" can mean the same thing.</Paragraph> <Paragraph position="8"> 3.2.4 The REPLY-W Move. A REPLY-W is any reply to any type of query that doesn't simply mean "yes" or "no." Example 32 G: And then below that, what've you got? F: A forest stream.</Paragraph> <Paragraph position="9"> Example 33 G: No, but right, first, before you come to the bakery do another wee lump. F: Why?
G: Because I say.</Paragraph> <Paragraph position="10"> Example 34 F: Is this before or after the backwards S? G: This is before it.</Paragraph> <Paragraph position="11"> 3.2.5 The CLARIFY Move. A CLARIFY move is a reply to some kind of question in which the speaker tells the partner something over and above what was strictly asked. If the information is substantial enough, then the utterance is coded as a reply followed by an EXPLAIN, but in many cases, the actual change in meaning is so small that coders are reluctant to mark the addition as truly informative. Route givers tend to make CLARIFY moves when the route follower seems unsure of what to do, but there isn't a specific problem on the agenda (such as a landmark now known not to be shared). Example 35 G: And then, have you got the pirate ship? F: Mmhmm.</Paragraph> <Paragraph position="12"> G: Just curve from the point, go right ... go down and curve into the right til you reach the tip of the pirate ship. F: So across the bay? G: Yeah, through the water.</Paragraph> <Paragraph position="13"> F: So I just go straight down? G: Straight down, and curve to the right, til you're in line with the pirate ship.</Paragraph> <Paragraph position="14"> Example 36 [... instructions that keep them on land ...] F: So I'm going over the bay? G: Mm, no, you're still on land.</Paragraph> <Paragraph position="15"> 3.2.6 Other Possible Responses. All of these response moves help to fulfill the goals proposed by the initiating moves that they follow. It is also theoretically possible at any point in the dialogue to refuse to take on the proposed goal, either because the responder feels that there are better ways to serve some shared higher-level dialogue goal or because the responder does not share the same goals as the initiator. Often refusal takes the form of ignoring the initiation and simply initiating some other move. However, it is also possible to make such refusals explicit; for instance, a participant could rebuff a question with "No, let's talk about ...", an initiation with "What do you mean--that won't work!", or an explanation about the location of a landmark with "Is it?" said with an appropriately unbelieving intonation. One might consider these cases akin to ACKNOWLEDGE moves, but with a negative slant. These cases were sufficiently rare in the corpora used to develop the coding scheme that it was impractical to include a category for them. However, it is possible that in other languages or communicative settings, this behavior will be more prevalent. Grice and Savino (1995) found that such a category was necessary when coding Italian Map Task dialogues where speakers were very familiar with each other. They called the category OBJECT.</Paragraph> </Section> <Section position="2" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 3.3 The READY Move </SectionTitle> <Paragraph position="0"> In addition to the initiation and response moves, the coding scheme identifies READY moves as moves that occur after the close of a dialogue game and prepare the conversation for a new game to be initiated. Speakers often use utterances such as "OK" and "right" to serve this purpose. It is a moot point whether READY moves should form a distinct move class or should be treated as discourse markers attached to the subsequent moves, but the distinction is not a critical one, since either interpretation can be placed on the coding.
It is sometimes appropriate to consider READY moves as distinct, complete moves in order to emphasize the comparison with ACKNOWLEDGE moves, which are often just as short and even contain the same words as READY moves.</Paragraph> </Section> <Section position="3" start_page="21" end_page="22" type="sub_section"> <SectionTitle> 3.4 The Game Coding Scheme </SectionTitle> <Paragraph position="0"> Moves are the building blocks for conversational game structure, which reflects the goal structure of the dialogue. In the move coding, a set of initiating moves are differentiated, all of which signal some kind of purpose in the dialogue. For instance, instructions signal that the speaker intends the hearer to follow the command, queries signal that the speaker intends to acquire the information requested, and statements signal that the speaker intends the hearer to acquire the information given. A conversational game is a sequence of moves starting with an initiation and encompassing all moves up until that initiation's purpose is either fulfilled or abandoned.</Paragraph> <Paragraph position="1"> There are two important components of any game coding scheme. The first is an identification of the game's purpose; in this case, the purpose is identified simply by the name of the game's initiating move. The second is some explanation of how games are related to each other. The simplest, paradigmatic relationships are implemented in computer-computer dialogue simulations, such as those of Power (1979) and Houghton (1986). In these simulations, once a game has been opened, the participants work on the goal of the game until they both believe that it has been achieved or that it should be abandoned. This may involve embedding new games with purposes subservient to the top-level one being played (for instance, clarification subdialogues about some crucial missing information), but the embedding structure is always clear and mutually understood. Although some natural dialogue is this orderly, much of it is not; participants are free to initiate new games at any time (even while the partner is speaking), and these new games can introduce new purposes rather than serving some purpose already present in the dialogue. In addition, natural dialogue participants often fail to make clear to their partners what their goals are. This makes it very difficult to develop a reliable coding scheme for complete game structure.</Paragraph> <Paragraph position="2"> The game coding scheme simply records those aspects of embedded structure that are of the most interest. First, the beginning of new games is coded, naming the game's purpose according to the game's initiating move. Although all games begin with an initiating move (possibly with a READY move prepended to it), not all initiating moves begin games, since some of the initiating moves serve to continue existing games or remind the partner of the main purpose of the current game. Second, the place where games end or are abandoned is marked. Finally, games are marked as either occurring at top level or being embedded (at some unspecified depth) in the game structure, and thus being subservient to some top-level purpose. The goal of these definitions is to give enough information to study relationships between game structure and other aspects of dialogue while keeping those relationships simple enough to code.</Paragraph>
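As an illustrative data structure only (the published scheme prescribes no record format), the game coding just described can be captured in a small per-game record: its type (the name of the initiating move), where it begins, where it ends or is abandoned, and whether it is embedded. All names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# A minimal sketch, assuming games are indexed by the moves they span;
# the class and field names are hypothetical, not part of the published scheme.

@dataclass
class GameAnnotation:
    game_type: str            # name of the initiating move, e.g. "INSTRUCT"
    start_move: int           # index of the initiating move (a READY may precede it)
    end_move: Optional[int]   # index of the last move, or None if abandoned
    embedded: bool            # True if nested inside another game
    abandoned: bool = False   # marked when the game's purpose is given up

example = GameAnnotation(game_type="QUERY-YN", start_move=12,
                         end_move=14, embedded=True)
```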
</Section> <Section position="4" start_page="22" end_page="23" type="sub_section"> <SectionTitle> 3.5 The Transaction Coding Scheme </SectionTitle> <Paragraph position="0"> Transaction coding gives the subdialogue structure of complete task-oriented dialogues, with each transaction being built up of several dialogue games and corresponding to one step of the task. In most Map Task dialogues, the participants break the route into manageable segments and deal with them one by one. Because transaction structure for Map Task dialogues is so closely linked to what the participants do with the maps, the maps are included in the analysis. The coding system has two components: (1) how route givers divide conveying the route into subtasks and what parts of the dialogue serve each of the subtasks, and (2) what actions the route follower takes and when.</Paragraph> <Paragraph position="1"> The basic route giver coding identifies the start and end of each segment and the subdialogue that conveys that route segment. However, Map Task participants do not always proceed along the route in an orderly fashion; as confusions arise, they often have to return to parts of the route that have already been discussed and that one or both of them thought had been successfully completed. In addition, participants occasionally overview an upcoming segment in order to provide a basic context for their partners, without the expectation that their partners will be able to act upon their descriptions (for instance, describing the complete route as "a bit like a diamond shape ... but ... a lot more wavy than that ..."). They also sometimes engage in subdialogues not relevant to any segment of the route, sometimes about the experimental setup but often nothing at all to do with the task. This gives four transaction types: NORMAL, REVIEW, OVERVIEW, and IRRELEVANT.</Paragraph> <Paragraph position="2"> Other types of subdialogues are possible (such as checking the placement of all map landmarks before describing any of the route, or concluding the dialogue by reviewing the entire route), but are not included in the coding scheme because of their rarity.</Paragraph> <Paragraph position="3"> Coding involves marking where in the dialogue transcripts a transaction starts and which of the four types it is, and for all but IRRELEVANT transactions, indicating the start and end point of the relevant route section using numbered crosses on a copy of the route giver's map. The ends of transactions are not explicitly coded because, generally speaking, transactions do not appear to nest; for instance, if a transaction is interrupted to review a previous route segment, participants by and large restart the goal of the interrupted transaction afterwards. It is possible that transactions are simply too large for the participants to remember how to pick up where they left off. Note that it is possible for several transactions (even of the same type) to have the same starting point on the route.</Paragraph>
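Purely as a sketch (no file format is prescribed by the scheme), a transaction annotation of the kind just described can be represented as a small record holding the transcript position where the transaction starts, its type, and, for all but IRRELEVANT transactions, the numbered crosses marking the route segment. The names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record for one transaction, following the description above.

TRANSACTION_TYPES = {"NORMAL", "REVIEW", "OVERVIEW", "IRRELEVANT"}

@dataclass
class TransactionAnnotation:
    start_site: int                            # move boundary where the transaction starts
    transaction_type: str                      # one of TRANSACTION_TYPES
    route_start_cross: Optional[int] = None    # numbered cross on the route giver's map
    route_end_cross: Optional[int] = None

    def __post_init__(self) -> None:
        if self.transaction_type not in TRANSACTION_TYPES:
            raise ValueError(f"unknown transaction type: {self.transaction_type}")
        # Route reference points are required except for IRRELEVANT transactions.
        if self.transaction_type != "IRRELEVANT" and (
                self.route_start_cross is None or self.route_end_cross is None):
            raise ValueError("route start and end crosses must be given")

example = TransactionAnnotation(start_site=37, transaction_type="REVIEW",
                                route_start_cross=4, route_end_cross=5)
```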
<Paragraph position="4"> The basic route follower coding identifies whether the follower action was drawing a segment of the route or crossing out a previously drawn segment, and the start and end points of the relevant segment, indexed using numbered crosses on a copy of the route follower's map.</Paragraph> <Paragraph position="5"> 4. Reliability of Coding Schemes It is important to show that subjective coding distinctions can be understood and applied by people other than the coding developers, both to make the coding credible in its own right and to establish that it is suitable for testing empirical hypotheses. Krippendorff (1980), working within the field of content analysis, describes a way of establishing reliability which applies here.</Paragraph> </Section> <Section position="5" start_page="23" end_page="23" type="sub_section"> <SectionTitle> 4.1 Tests of reliability </SectionTitle> <Paragraph position="0"> Krippendorff argues that there are three different tests of reliability with increasing strength. The first is stability, also sometimes called test-retest reliability, or intertest variance; a coder's judgments should not change over time. The second is reproducibility, or intercoder variance, which requires different coders to code in the same way. The third is accuracy, which requires coders to code in the same way as some known standard. Stability can be tested by having a single coder code the same data at different times. Reproducibility can be tested by training several coders and comparing their results. Accuracy can be tested by comparing the codings produced by these same coders to the standard, if such a standard exists. Where the standard is the coding of the scheme's "expert" developer, the test simply shows how well the coding instructions fit the developer's intention.</Paragraph> <Paragraph position="1"> Whichever type of reliability is being assessed, most coding schemes involve placing units into one of n mutually exclusive categories. This is clearly true for the dialogue structure coding schemes described here, once the dialogues have been segmented into appropriately sized units. Less obviously, segmentation also often fits this description. If there is a natural set of possible segment boundaries that can be treated as units, one can recast segmentation as classifying possible segment boundaries as either actual segment boundaries or nonboundaries. Thus for both classification and segmentation, the basic question is what level of agreement coders reach under the reliability tests.</Paragraph> </Section> <Section position="6" start_page="23" end_page="24" type="sub_section"> <SectionTitle> 4.2 Interpreting reliability results </SectionTitle> <Paragraph position="0"> It has been argued elsewhere (Carletta 1996) that since the amount of agreement one would expect by chance depends on the number and relative frequencies of the categories under test, reliability for category classifications should be measured using the kappa coefficient. 1 Even with a good yardstick, however, care is needed to determine from such figures whether or not the exhibited agreement is acceptable, as Krippendorff (1980) explains.</Paragraph> <Paragraph position="2"> [Footnote 1: The kappa coefficient (K) measures pairwise agreement among a set of coders making category judgments, correcting for expected chance agreement: K = (P(A) - P(E)) / (1 - P(E)), where P(A) is the proportion of times that the coders agree and P(E) is the proportion of times that one would expect them to agree by chance.]</Paragraph>
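To make the statistic concrete, the following is a small sketch (not the authors' implementation) of the kappa computation from the footnote's definition, in one standard multi-coder formulation in which chance agreement is estimated from the pooled category frequencies. The function name and toy data are illustrative only.

```python
from collections import Counter
from itertools import combinations

# A minimal sketch, not the authors' code: kappa for k coders who each assign
# every unit to one category; P(E) is estimated from pooled category frequencies.

def kappa(assignments):
    """assignments: list of units, each a sequence of the k coders' category labels."""
    k = len(assignments[0])
    pairs_per_unit = k * (k - 1) / 2

    # P(A): observed agreement, averaged over units and coder pairs.
    agreeing = 0
    for unit in assignments:
        agreeing += sum(a == b for a, b in combinations(unit, 2))
    p_a = agreeing / (len(assignments) * pairs_per_unit)

    # P(E): chance agreement from the overall proportion of each category.
    counts = Counter(label for unit in assignments for label in unit)
    total = sum(counts.values())
    p_e = sum((c / total) ** 2 for c in counts.values())

    return (p_a - p_e) / (1 - p_e)

# Toy illustration with three coders and four units (invented labels).
print(kappa([["CHECK", "CHECK", "QUERY-YN"],
             ["ALIGN", "ALIGN", "ALIGN"],
             ["CHECK", "CHECK", "CHECK"],
             ["QUERY-YN", "QUERY-YN", "QUERY-YN"]]))
```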
<Paragraph position="3"> Reliability in essence measures the amount of noise in the data; whether or not that will interfere with results depends on where the noise is and the strength of the relationship being measured. As a result, Krippendorff warns against taking overall reliability figures too seriously, in favor of always calculating reliability with respect to the particular hypothesis under test. Using α, a generalized version of kappa that also works for ordinal, interval, and ratio-scaled data, he remarks that a reasonable rule of thumb for associations between two variables that both rely on subjective distinctions is to require α > .8, with .67 < α < .8 allowing tentative conclusions to be drawn. Krippendorff also describes an experiment by Brouwer in which English-speaking coders reached α = .44 on the task of assigning television characters to categories with complicated Dutch names that did not resemble English words! It is interesting to note that medical researchers have agreed on much less strict guidelines, first drawn up by Landis and Koch (1977), who call K < 0 "poor" agreement, 0 to .2 "slight", .21 to .40 "fair", .41 to .60 "moderate", .61 to .80 "substantial", and .81 to 1 "near perfect". Landis and Koch describe these ratings as "clearly arbitrary, but useful 'benchmarks'" (p. 165).</Paragraph> <Paragraph position="4"> Krippendorff also points out that where one coding distinction relies on the results of another, the second distinction cannot be reasonable unless the first also is. For instance, it would be odd to consider a classification scheme acceptable if coders were unable to agree on how to identify units in the first place. In addition, when assessing segmentation, it is important to choose the class of possible boundaries sensibly. Although kappa corrects for chance expected agreement, it is still susceptible to order of magnitude differences in the number of units being classified, when the absolute number of units placed in one of the categories remains the same. For instance, one would obtain different values for kappa on agreement for move segment boundaries using transcribed word boundaries and transcribed letter boundaries, simply because there are so many extra agreed nonboundaries in the transcribed letter case. Despite these warnings, kappa has clear advantages over simpler metrics and can be interpreted as long as appropriate care is used.</Paragraph> </Section>
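Purely as an illustration of how the rules of thumb quoted above can be applied to a computed value, a small helper follows; the thresholds are those given in the text, but the functions themselves are not part of any cited work.

```python
# Illustrative only: Krippendorff's rule of thumb and the Landis and Koch (1977)
# benchmarks, with thresholds taken directly from the text above.

def krippendorff_verdict(value: float) -> str:
    if value > 0.8:
        return "acceptable"
    if value > 0.67:
        return "tentative conclusions only"
    return "unreliable"

def landis_koch_label(kappa: float) -> str:
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "near perfect")]:
        if kappa <= upper:
            return label
    return "near perfect"

print(krippendorff_verdict(0.83), landis_koch_label(0.83))  # acceptable, near perfect
```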
<Section position="7" start_page="24" end_page="26" type="sub_section"> <SectionTitle> 4.3 Reliability of Move Coding </SectionTitle> <Paragraph position="0"> The main move and game cross-coding study involved four coders, all of whom had already coded substantial portions of the Map Task Corpus. For this study, they simply segmented and coded four dialogues using their normal working procedures, which included access to the speech as well as the transcripts. All of the coders interacted verbally with the coding developers, making it harder to say what they agreed upon than if they had worked solely from written instructions. On the other hand, this is a common failing of coding schemes, and in some circumstances it can be more important to get the ideas of the coding scheme across than to tightly control how it is done.</Paragraph> <Paragraph position="1"> The first question is how well the coders were able to segment a dialogue into moves. Two different measures of agreement are useful. In the first, kappa is used to assess agreement on whether or not transcribed word boundaries are also move segment boundaries. On average, the coders marked move boundaries roughly every 5.7 words, so that there were roughly 4.7 times as many word boundaries that were not marked as move boundaries as word boundaries that were. The second measure, similar to information retrieval metrics, is the actual agreement reached, measured pairwise over all locations where any coder marked a boundary. That is, the measure considers each place where any coder marked a boundary and averages the ratio of the number of pairs of coders who agreed about that location over the total number of coder pairs. Note that it would not be possible to define "unit" in the same way for use in kappa, because then it would not be possible for the coders to agree on a nonboundary classification. Pairwise percent agreement is the best measure to use in assessing segmentation tasks when there is no reasonable independent definition of units to use as the basis of kappa. It is provided for readers who are skeptical about our use of transcribed word boundaries.</Paragraph> <Paragraph position="2"> The move coders reached K = .92 using word boundaries as units (N = 4,079 [the number of units], k = 4 [the number of coders]); pairwise percent agreement on locations where any coder had marked a move boundary was 89% (N = 796).</Paragraph> <Paragraph position="3"> Most of the disagreement fell into one of two categories. First, some coders marked a READY move but the others included the same material in the move that followed. One coder in particular was more likely to mark READY moves, indicating either greater vigilance or a less restrictive definition. Second, some coders marked a reply, while others split the reply into a reply plus some sort of move conveying further information not strictly elicited by the opening question (i.e., an EXPLAIN, CLARIFY, or INSTRUCT). This confusion was general, suggesting that it might be useful to think more carefully about the difference between answering a question and providing further information. It also suggests possible problems with the CLARIFY category, since unlike EXPLAIN and INSTRUCT moves, most CLARIFY moves follow replies, and since CLARIFY moves are intended to contain unelicited information. However, in general the agreement on segmentation reached was very good and certainly provides a solid enough foundation for move classification.</Paragraph>
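The two segmentation measures described above can be made concrete with a short sketch (illustrative only, not the authors' code): move boundaries are recast as boundary/nonboundary judgments at each word boundary, which can then be fed to a kappa routine such as the one sketched in Section 4.2, and pairwise percent agreement is averaged over the sites where any coder marked a boundary. The data and function names are invented.

```python
from itertools import combinations

# Illustrative only. Each coder's segmentation is the set of word-boundary
# indices marked as move boundaries; n_sites is the number of word boundaries.

def boundary_units(coder_boundaries, n_sites):
    """Recast segmentation as one unit per word boundary, one label per coder,
    suitable for a multi-coder kappa computation."""
    return [tuple("boundary" if i in marked else "nonboundary"
                  for marked in coder_boundaries)
            for i in range(n_sites)]

def pairwise_percent_agreement(coder_boundaries):
    """Average, over every site that any coder marked, the proportion of coder
    pairs making the same judgment about that site."""
    sites = set().union(*coder_boundaries)
    pairs = list(combinations(coder_boundaries, 2))
    total = sum(
        sum((site in a) == (site in b) for a, b in pairs) / len(pairs)
        for site in sites)
    return 100 * total / len(sites)

# Toy example with three coders (invented data).
coders = [{3, 9, 15}, {3, 9, 16}, {3, 10, 15}]
units = boundary_units(coders, n_sites=20)   # feed to a kappa() routine
print(round(pairwise_percent_agreement(coders), 1))
```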
<Paragraph position="4"> The analysis of move classification also uses the kappa coefficient; units in this case are moves for which all move coders agreed on the boundaries surrounding the move. Note that it is only possible to measure reliability of move classification over move segments where the boundaries were agreed. The more unreliable the segmentation, the more data must be omitted. Classification results can only be interpreted if the underlying segmentation is reasonably robust.</Paragraph> <Paragraph position="5"> Overall agreement on the entire coding scheme was good (K = .83, N = 563, k = 4), with the largest confusions between (1) CHECK and QUERY-YN, (2) INSTRUCT and CLARIFY, and (3) ACKNOWLEDGE, READY, and REPLY-Y. Combining categories, agreement was also very good (K = .89) for whether a move was an initiation type or a response or ready type. For agreed initiations themselves, agreement was very high (K = .95, N = 243, k = 4) on whether the initiation was a command (the INSTRUCT move), a statement (the EXPLAIN move), or one of the question types (QUERY-YN, QUERY-W, CHECK, or ALIGN). Coders were also able to agree on the subclass of question (K = .82, N = 98, k = 4). Coders could also reliably classify agreed responses as ACKNOWLEDGE, CLARIFY, or one of the reply categories (K = .86, N = 236, k = 4). However, coders had a little more difficulty (K = .75, N = 132, k = 4) distinguishing between different types of moves that all contribute new, unelicited information (INSTRUCT, EXPLAIN, and CLARIFY).</Paragraph> <Paragraph position="6"> In a separate coding exercise sponsored by the University of Pennsylvania, three non-HCRC computational linguists and one of the original coding developers, who had not done much coding, move coded a Map Task dialogue from written instructions only, using just the transcript and not the speech source. Agreement on move classification was K = .69 (N = 139, k = 4). Leaving the coding developer out of the coder pool did not change the results (K = .67, k = 3), suggesting that the instructions conveyed his intentions fairly well. The coding developer matched the official Map Task coding almost entirely. One coder never used the CHECK move; when that coder was removed from the pool, K = .73 (k = 3). When CHECK and QUERY-YN were conflated, agreement was K = .77 (k = 4). Agreement on whether a move was an initiation, response, or ready type was good (K = .84). Surprisingly, non-HCRC coders appeared to be able to distinguish the CLARIFY move better than in-house coders. This amount of agreement seems acceptable given that this was a first coding attempt for most of these coders and was probably done quickly. Coders generally become more consistent with experience.</Paragraph> <Paragraph position="7"> Move coding is probably the level of coding most useful for work in other domains. To test how well the scheme would transfer, it was applied by two of the coders from the main move reliability study to a transcribed conversation between a hi-fi sales assistant and a married couple intending to purchase an amplifier. Dialogue openings and closings were omitted since they are well understood but do not correspond to categories in the classification scheme. The coders reached K = .95 (N = 819, k = 2) on the move segmentation task, using word boundaries as possible move boundaries, and K = .81 (N = 80, k = 2) for move classification. These results are in line with those from the main trial. The coders recommended adding a new move category specifically for when one conversant completes or echoes an utterance begun by another conversant. Neither of the coders used INSTRUCT, READY, or CHECK moves for this dialogue.</Paragraph> </Section> <Section position="8" start_page="26" end_page="27" type="sub_section"> <SectionTitle> 4.4 Reliability of Game Coding </SectionTitle> <Paragraph position="0"> The game coding results come from the same study as the expert move cross-coding results. Since games nest, it is not possible to analyze game segmentation in the same way as was done for moves.
Moreover, it is possible for a set of coders to agree on where the game begins and not where it ends, but still believe that the game has the same goal, since the game's goal is largely defined by its initiating utterance.</Paragraph> <Paragraph position="1"> Therefore, the best analysis considers how well coders agree on where games start and, for agreed starts, where they end. Since game beginnings are rare compared to word boundaries, pairwise percent agreement is used.</Paragraph> <Paragraph position="2"> Calculating as described, coders reached promising but not entirely reassuring agreement on where games began (70%, N = 203). Although one coder tended to have longer games (and therefore fewer beginnings) than the others, there was no striking pattern of disagreement. Where the coders managed to agree on the beginning of a game (i.e., for the most orderly parts of the dialogues), they also tended to agree on what type of game it was (INSTRUCT, EXPLAIN, QUERY-W, QUERY-YN, ALIGN, or CHECK) (K = .86, N = 154, k = 4). Although this is not the same as agreeing on the category of an initiating move because not all initiating moves begin games, disagreement stems from the same move naming confusions (notably, the distinction between QUERY-YN and CHECK). There was also confusion about whether a game with an agreed beginning was embedded or not (K = .46). The question of where a game ends is related to the embedding subcode, since games end after other games that are embedded within them. Using just the games for which all four coders agreed on the beginning, the coders reached 65% pairwise percent agreement on where the game ended. The abandoned game subcode turned out to be so scarce in the cross-coding study that it was not possible to calculate agreement for it, but agreement is probably poor. Some coders have commented that the coding practice was unstructured enough that it was easy to forget to use the subcode.</Paragraph> <Paragraph position="3"> To determine stability, the most experienced coder completed the same dialogue twice, two months and many dialogues apart. She reached better agreement (90%; N = 49) on where games began, suggesting that one way to improve the coding would be to formalize more clearly the distinctions that she believes herself to use. When she agreed with herself on where a game began, she also agreed well with herself about what game it was (K = .88, N = 44, the only disagreements being confusions between CHECK and QUERY-YN), whether or not games were embedded (K = .95), and where the games ended (89%). There were not enough instances of abandoned games marked to test formally, but she did not appear to use the coding consistently.</Paragraph> <Paragraph position="4"> In general, the results of the game cross-coding show that the coders usually agree, especially on what game category to use, but when the dialogue participants begin to overlap their utterances or fail to address each other's concerns clearly, the game coders have some difficulty agreeing on where to place game boundaries.
However, individual coders can develop a stable sense of game structure, and therefore, if necessary, it should be possible to improve the coding scheme.</Paragraph> </Section> <Section position="9" start_page="27" end_page="28" type="sub_section"> <SectionTitle> 4.5 Reliability of Transaction Coding </SectionTitle> <Paragraph position="0"> Unlike the other coding schemes, transaction coding was designed from the beginning to be done solely from written instructions. Since it is possible to tell uncontroversially from the video what the route follower drew and when they drew it, reliability has only been tested for the other parts of the transaction coding scheme.</Paragraph> <Paragraph position="1"> The replication involved four naive coders and the "expert" developer of the coding instructions. All four coders were postgraduate students at the University of Edinburgh; none of them had prior experience of the Map Task or of dialogue or discourse analysis. All four dialogues used different maps and differently shaped routes.</Paragraph> <Paragraph position="2"> To simplify the task, coders worked from maps and transcripts. Since intonational cues can be necessary for disambiguating whether some phrases such as "OK" and "right" close a transaction or open a new one, coders were instructed to place boundaries only at particular sites in the transcripts, which were marked with blank lines. These sites were all conversational move boundaries except those between READY moves and the moves following them. Note that such move boundaries form a set of independently derived units, which can be used to calculate agreement on transaction segmentation. The transcripts did not name the moves or indicate why the potential transaction boundaries were placed where they were.</Paragraph> <Paragraph position="3"> Each subject was given the coding instructions and a sample dialogue extract and pair of maps to take away and examine at leisure. The coders were asked to return with the dialogue extract coded. When they returned they were given a chance to ask questions. They were then given the four complete dialogues and maps to take away and code in their own time. The four coders did not speak to each other about the exercise. Three of the four coders asked for clarification of the OVERVIEW distinction, which turned out to be a major source of unreliability; there were no other queries.</Paragraph> <Paragraph position="4"> 4.5.1 Measures. Overall, each coder marked roughly a tenth of move boundaries as transaction boundaries. When all coders were taken together as a group, the agreement reached on whether or not conversational move boundaries are transaction boundaries was K = .59 (N = 657, k = 5). The same level of agreement (K = .59) was reached when the expert was left out of the pool. This suggests the disagreement is general rather than arising from problems with the written instructions. Kappa values for different pairings of naive coders with the expert were .68, .65, .53, and .43, showing considerable variation from subject to subject. Note that the expert interacted minimally with the coders, and therefore differences were not due to training.</Paragraph>
<Paragraph position="5"> Agreement on the placement of map reference points was good; where the coders agreed that a boundary existed, they almost invariably placed the begin and end points of their segments within the same four-centimetre segment of the route, and often much closer, as measured on the original A3 (297 x 420 mm) maps. In contrast, the closest points that did not refer to the same boundary were usually five centimetres apart, and often much further. The study was too small for formal results about transaction category. For 64 out of 78 boundaries marked by at least two coders, the category was agreed.</Paragraph> <Paragraph position="6"> 4.5.2 Diagnostics. Because this study was relatively small, problems were diagnosed by looking at coding mismatches directly rather than by using statistical techniques. Coders disagreed on where to place boundaries with respect to introductory questions about a route segment (such as "Do you have the swamp?", when the route giver intends to describe the route using the swamp) and attempts by the route follower to move on (such as "Where do I go now?"). Both of these confusions can be corrected by clarifying the instructions. In addition, there were a few cases where coders were allowed to place a boundary on either side of a discourse marker, but the coders did not agree. Using the speech would probably help, but most uses of transaction coding would not require boundary placement this precise. OVERVIEW transactions were too rare to be reliable or useful and should be dropped from future coding systems.</Paragraph> <Paragraph position="7"> Finally, coders had a problem with "grain size"; one coder had many fewer transactions than the other coders, with each transaction covering a segment of the route which other coders split into two or more transactions, indicating that he thought the route givers were planning ahead much further than the other coders did. This is a general problem for discourse and dialogue segmentation. Greene and Cappella (1986) show very good reliability for a monologue segmentation task based on the "idea" structure of the monologue, but they explicitly tell the coders that most segments are made up of two or three clauses. Describing a typical size may improve agreement, but might also weaken the influence of the real segmentation criteria. In addition, higher-level segments such as transactions vary in size considerably. More discussion between the expert and the novices might also improve agreement on segmentation, but would make it more difficult for others to apply the coding systems.</Paragraph> </Section> </Section> </Paper>