<?xml version="1.0" standalone="yes"?>
<Paper uid="J05-3001">
<Title>Squibs and Discussions: Evaluating Discourse and Dialogue Coding Schemes</Title>
<Section position="2" start_page="0" end_page="292" type="abstr">
<SectionTitle> 2. Agreement Measures </SectionTitle>
<Paragraph position="0"> There are many ways in which the level of agreement between coders can be evaluated, and the choice of which to apply in order to assess reliability is the source of much confusion. An appropriate statistic for this purpose must measure agreement as a function of the coding process and not of the coders, data, or categories. Only if the results of a test depend solely on the degree to which there is a shared understanding of how the phenomena to be described are mapped to the given categories can we infer the reliability of the resulting data. Some agreement measures do not behave in this manner and are therefore unsuitable for evaluating reliability.</Paragraph>
<Paragraph position="1"> A great deal of importance is placed on domain specificity in discourse and dialogue studies, and as such researchers are often encouraged to evaluate schemes using corpora from more than one domain. Where agreement is concerned, this encouragement is misplaced.</Paragraph>
<Paragraph position="2"> Since an appropriate agreement measure is a function of only the coding process, if the original agreement test is performed in a scientifically sound manner, little more can be proved by applying it again to different data. Any differences in the results between corpora are a function of the variance between samples and not of the reliability of the coding scheme.</Paragraph>
<Paragraph position="3"> Di Eugenio and Glass (2004) identify three general classes of agreement statistics and suggest that all three should be used in conjunction in order to evaluate coding schemes accurately. However, this suggestion is founded on some misunderstandings of the role of agreement measures in reliability studies. We shall now rectify these and conclude that only one class of agreement measure is suitable.</Paragraph>
<Section position="1" start_page="290" end_page="291" type="sub_section">
<SectionTitle> 2.1 Percentage Agreement </SectionTitle>
<Paragraph position="0"> The first of the recommended agreement tests, percentage agreement, measures the proportion of agreements between coders. This is an unsuitable measure for inferring reliability, and it was the use of this measure that prompted Carletta (1996) to recommend chance-corrected measures.</Paragraph>
<Paragraph position="1"> Percentage agreement is inappropriate for inferring reliability because it excludes any notion of the level of agreement that we could expect to achieve by chance. Reliability should be inferred by locating the achieved level of agreement on a scale between the best possible (coders agree perfectly) and the worst possible (coders do not understand or cannot perform the mapping and behave randomly). Without any indication of the agreement that coders would achieve by behaving randomly, any deviation from perfect agreement is uninterpretable (Krippendorff 2004b).</Paragraph>
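The chance-corrected measures discussed in the remainder of this section make that scale explicit. As a sketch in our own notation (the formula below is the standard general form and is not reproduced from the original text), writing A_o for the observed agreement and A_e for the agreement expected if the coders behaved randomly:

\[
  \text{coefficient} = \frac{A_o - A_e}{1 - A_e}
\]

A value of 1 indicates perfect agreement and 0 indicates chance-level behaviour; without an estimate of A_e, an observed agreement of, say, 0.9 cannot be located anywhere on this scale.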
<Paragraph position="2"> The justification given for using percentage agreement is that it does not suffer from what Di Eugenio and Glass (2004) refer to as the "prevalence problem." Prevalence refers to the unequal distribution of label use by coders. For example, Table 1, taken from Di Eugenio and Glass (2004), shows the classification of the utterance Okay as an acceptance or an acknowledgment. It is a confusion matrix describing the number of occasions on which the coders used each pair of labels for a given turn. The table shows that the two coders strongly favored accept over acknowledge. Di Eugenio and Glass correctly state that this skew in the distribution of categories increases the expected chance agreement, thus lowering the overall agreement in chance-corrected tests. The reason for this is that when one category is more popular than the others, the likelihood that the coders will agree by chance by choosing that category increases; we therefore require a comparable increase in observed agreement to accommodate this.</Paragraph>
<Paragraph position="3"> Di Eugenio and Glass (2004) perceive this as an "unpleasant behavior" of chance-corrected tests, one that prevents us from concluding that the example given in Table 1 shows satisfactory levels of agreement. Instead they use percentage agreement to arrive at this conclusion. By examining the data, it is clear that this conclusion would be false.</Paragraph>
<Paragraph position="4"> In Table 1, the coders agree 90 out of 100 times, but all agreements occur when both coders choose accept. There is not a single case in which they agree on Okay being used as an acknowledgment. The only conclusion one may justifiably draw is that the coders cannot distinguish the use of Okay as an acceptance from its use as an acknowledgment. Rather than being an unpleasant behavior, accounting for prevalence in the data is an important part of accurately reporting the level of agreement. It helps us avoid incorrect conclusions, such as believing that the data shown in Table 1 suggest reliable coding.</Paragraph>
</Section>
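To make the contrast concrete, the following Python sketch recomputes both measures for the scenario just described. Table 1's exact off-diagonal counts are not reproduced above, so the two coder lists are a hypothetical reconstruction: the 90 all-on-accept agreements come from the text, while the even split of the 10 disagreements is assumed purely for illustration.

from collections import Counter

# Hypothetical reconstruction of the Table 1 codings of "Okay":
# 90 agreements, all on "accept"; the 10 disagreements are assumed,
# for illustration only, to split evenly between the two coders.
coder_a = ["accept"] * 95 + ["acknowledge"] * 5
coder_b = ["accept"] * 90 + ["acknowledge"] * 5 + ["accept"] * 5
n = len(coder_a)

# Percentage agreement: the proportion of items labeled identically.
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

# A chance-corrected measure in the style of Scott's pi estimates the
# agreement expected by chance from the pooled category frequencies.
pooled = Counter(coder_a) + Counter(coder_b)
expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
pi = (observed - expected) / (1 - expected)

print(f"percentage agreement: {observed:.2f}")   # 0.90
print(f"expected by chance:   {expected:.3f}")   # 0.905
print(f"chance-corrected:     {pi:.3f}")         # about -0.05

Percentage agreement reports a comfortable 0.90, while the chance-corrected value falls just below zero, matching the conclusion above that these data give no evidence that the coders can distinguish the two uses of Okay.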
<Section position="2" start_page="291" end_page="291" type="sub_section">
<SectionTitle> 2.2 Chance-Corrected Agreement: Unequal Coder Category Distribution </SectionTitle>
<Paragraph position="0"> The second class of agreement measure recommended by Di Eugenio and Glass (2004) is that of chance-corrected tests that do not assume an equal distribution of categories between coders. Chance-corrected tests compute agreement as the ratio of observed (dis)agreement to that which we could expect by chance, estimated from the data. The measures differ in the way in which this expected (dis)agreement is estimated.</Paragraph>
<Paragraph position="1"> Those that do not assume an equal distribution between coders calculate expected (dis)agreement based on the individual distribution of each coder.</Paragraph>
<Paragraph position="2"> The concern that, in discourse and dialogue coding, coders will differ in the frequency with which they apply labels leads Di Eugenio and Glass to conclude that Cohen's (1960) kappa is the best chance-corrected test to apply. To clarify, by unequal distribution of categories we do not refer to the disparity in the frequency with which categories occur (e.g., verbs are more common than pronouns) but rather to the difference in proclivity between coders (e.g., coder A is more likely to label something a noun than coder B).</Paragraph>
<Paragraph position="3"> Cohen's kappa calculates expected chance agreement, based on the individual coders' distributions, in a manner similar to association measures such as chi-square. This means that its results depend on the preferences of the individual coders taking part in the tests. This violates the condition set out at the beginning of this section whereby agreement must be a function of the coding process, with coders viewed as interchangeable. The purpose of assessing the reliability of coding schemes is not to judge the performance of the small number of individuals participating in the trial, but rather to predict the performance of the schemes in general. The proposal that, in most discourse and dialogue studies, the assumption of equal distribution between coders does not hold is in fact an argument against the use of Cohen's kappa. Assessing the agreement between coders while accounting for their idiosyncratic proclivity toward or against certain labels tells us little about how the coding scheme will perform when applied by others. The solution is not to apply a test that panders to individual differences, but rather to increase the number of coders so that the influence of any individual on the final result becomes less pronounced.</Paragraph>
<Paragraph position="4"> Another reason provided for using Cohen's kappa is that its sensitivity to bias (differences in coders' category distributions) can be exploited to improve coding schemes. However, there is no need to calculate kappa in order to observe bias, since it will be evident in a contingency table of the data in question. Even if it were necessary to compute kappa for this purpose, this would not justify its use as a reliability test.</Paragraph>
</Section>
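The difference between the two ways of estimating chance agreement, from each coder's own label distribution (as Cohen's kappa does) versus from a single pooled distribution (as the measures in the next subsection do), can be seen in a small sketch. The data below are invented for illustration: coder A applies one label far more readily than coder B.

from collections import Counter

def expected_individual(coder_a, coder_b):
    # Chance agreement estimated from each coder's own label distribution,
    # as in Cohen's kappa.
    n = len(coder_a)
    dist_a, dist_b = Counter(coder_a), Counter(coder_b)
    return sum((dist_a[c] / n) * (dist_b[c] / n)
               for c in set(dist_a) | set(dist_b))

def expected_pooled(coder_a, coder_b):
    # Chance agreement estimated from the pooled distribution of both coders,
    # as in Scott's pi / Siegel and Castellan's kappa.
    n = len(coder_a)
    pooled = Counter(coder_a) + Counter(coder_b)
    return sum((count / (2 * n)) ** 2 for count in pooled.values())

def chance_corrected(observed, expected):
    return (observed - expected) / (1 - expected)

# Invented data: coder A labels "noun" far more readily than coder B.
coder_a = ["noun"] * 80 + ["other"] * 20
coder_b = ["noun"] * 60 + ["other"] * 40
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)

print(chance_corrected(observed, expected_individual(coder_a, coder_b)))  # ~0.55
print(chance_corrected(observed, expected_pooled(coder_a, coder_b)))      # ~0.52

Because the two coders' marginal distributions differ, the individual-distribution estimate of chance agreement comes out lower and the resulting coefficient higher: the systematic difference in the coders' preferences is absorbed into the correction rather than reported as disagreement, which is exactly why the result depends on who the coders happen to be.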
<Section position="3" start_page="291" end_page="292" type="sub_section">
<SectionTitle> 2.3 Chance-Corrected Agreement: Assumed Equal Coder Category Distribution </SectionTitle>
<Paragraph position="0"> The remaining class of agreement measure assumes an equal distribution of categories for all coders. Once we have accepted that this assumption is necessary in order to predict the performance of the scheme in general, there appears to be no objection to using this type of statistical test for assessing agreement in discourse and dialogue work. Tests that fall into this class include Siegel and Castellan's (1988) extension of Scott's (1955) pi, confusingly called kappa, and Krippendorff's (2004a) alpha. Both of these measures calculate expected (dis)agreement based on the frequency with which each category is used, estimated from the overall usage by the coders.</Paragraph>
<Paragraph position="1"> [Footnote 1] When there is a single correct label that should be used, such as part-of-speech tags used to describe the syntactic function of a word or group of words, training coders may mitigate coder preference.</Paragraph>
<Paragraph position="2"> Kappa is more frequently described in statistics textbooks and more commonly implemented in statistical software. In circumstances in which mechanisms other than nominal labels are used to annotate data, alpha has the benefit of being able to deal with different degrees of disagreement between pairs of interval, ordinal, and ratio values, among others.</Paragraph>
<Paragraph position="3"> Di Eugenio and Glass (2004) conclude with the proposal that these three forms of agreement measure collectively provide a better means with which to judge agreement than any individual test. We would argue, to the contrary, that applying three different metrics to measure the same property suggests a lack of confidence in any of them.</Paragraph>
<Paragraph position="4"> Percentage agreement and Cohen's kappa do not provide insight into a scheme's reliability, so reporting their results is potentially misleading.</Paragraph>
</Section>
</Section>
</Paper>