<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4020">
  <Title>Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Corpus Description
</SectionTitle>
    <Paragraph position="0"> The corpus used is a collection of 380 email messages marked by two annotators with either one or two of the following labels: question, answer, broadcast, attachment transmission, planning-meeting scheduling, planning scheduling, planning, action item, technical discussion, and social chat. If two labels are used, one is designated primary and the other secondary.</Paragraph>
    <Paragraph position="1"> These ten categories were selected in order to direct the automatic summarization of email messages.</Paragraph>
    <Paragraph position="2"> This corpus is a subset of a larger corpus of approximately 1000 messages exchanged between members of the Columbia University chapter of the Association for Computing Machinery (ACM) in 2001.</Paragraph>
    <Paragraph position="3"> The annotation of the rest of corpus is in progress.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Standard Kappa Shortcomings
</SectionTitle>
    <Paragraph position="0"> Commonly, the kappa statistic is used to measure inter-annotator agreement. It determines how strongly two annotators agree by comparing the probability of the two agreeing by chance with the observed agreement. If the observed agreement is significantly greater than that expected by chance, then it is safe to say that the two annotators agree in their judgments. Mathematically,</Paragraph>
    <Paragraph position="2"> [?]= where K is the kappa value, p(A) is the probability of the actual outcome and p(E) is the probability of the expected outcome as predicted by chance.</Paragraph>
    <Paragraph position="3"> When each data point in a corpus is assigned a single label, calculating p(A) is straightforward: simply count up the number of times the two annotators agree and divide by the total number of annotations. However, in labeling this email corpus, labelers were allowed to select either a single label or two labels designating one as primary and one as secondary.</Paragraph>
    <Paragraph position="4"> The option of a secondary label increases the possible labeling combinations between two annotators fivefold. In the format &amp;quot;{&lt;A's labels&gt;, &lt;B's labels&gt;}&amp;quot; the possibilities are as follows: {a,a}, {a,b}, {ab,a}, {ab,b}, {ab,c}, {ab,ab}, {ab,ba}, {ab,ac}, {ab,bc}, {ab,cd}. The algorithm initially used to calculate the kappa statistic simply discarded the optional secondary label. This solution is unacceptable for two reasons. 1) It makes the reliability metric inconsistent with the annotation instructions. Why offer the option of a secondary label, if it is to be categorically ignored? 2) It discards useful information regarding partial agreement by treating situations corresponding to {ab,ba}, {ab,bc} and {ab, b} as simple disagreements.</Paragraph>
    <Paragraph position="5"> Despite this complication, the objective in computing p(A) remains the same, count the agreements and divide by the number of annotations. But how should the partial agreement cases ({ab, a}, {ab, b}, {ab,ba}, {ab,ac}, and {ab,bc}) be counted? For example, when considering a message that clearly contained both a question and an answer, one annotator had labeled the message as primarily question and secondarily answer, with another primarily answer and secondarily question. Should such an annotation be considered an agreement, as the two concur on the content of the message? Or disagreement, as they differ in their employ of primary and secondary? To what degree do two annotators agree if one labels a message primarily a and secondarily b and the other labels it simply a or simply b? What if there is agreement on the primary label and discrepancy on the secondary? Or vice versa? In the traditional Boolean assignment, each combination would have to be counted as either agreement or disagreement. Instead, in order to compute a useful value of p(A), we propose to assign a degree of agreement to each. This is similar in concept to Krippendorff's (1980) alpha measure for multiple observers.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Kappa Algorithm Augmentation
</SectionTitle>
    <Paragraph position="0"> To augment the computation of the kappa statistic, we consider annotations marked with primary and secondary labels not as two distinct selections, but as one divided selection.1 When an annotator selects a single label for a message, that label-message pair is assigned a score of 1.0. When an annotator selects a primary and secondary label, a weight p is assigned to the primary label and (1-p) to the secondary label for the corresponding label-message pair. Before computing the kappa score for the corpus, a single value p where 0.5 p 1.0 must be selected. If p = 1.0 the secondary labels are completely ignored, while if p = 0.5, secondary and primary labels are given equal weight. By examining the resulting kappa score at different values of p, insight into how the annotators are employing the optional secondary label can be gained. Moreover, single messages can be trivially isolated in order to reveal how each data point has been annotated with respect to primary and secondary labels. Landis and Koch (1977) present a method for calculating a weighted kappa measure. This method is useful for single annotations where the categories have an obvious relationship to each other, but does not extend to multiply labeled data points where relationships between categories are unknown.</Paragraph>
    <Paragraph position="1"> 1 Before settling on this approach, we considered counting each annotation equivalently whether primary or secondary. This made computation of p(A) and p(E) more complex, and by ignoring the primary/secondary distinction offered less insight into the use of the labels.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Compute p(A)
</SectionTitle>
      <Paragraph position="0"> To compute p(A), the observed probability, two annotation matrices are created, one for each annotator. These annotation matrices, Mannotator, have N rows and M columns, where n is the number of messages and m is the number of labels. These annotation matrices are propagated as follows.</Paragraph>
      <Paragraph position="1">  1],[ =yxM A , if A marked only label y for message x.</Paragraph>
      <Paragraph position="2"> pyxM A =],[ , if A marked label y as the primary label for message x.</Paragraph>
      <Paragraph position="3"> pyxM A [?]= 1],[ , if A marked label y as the secondary label for message x.</Paragraph>
      <Paragraph position="4"> 0],[ =yxM A , otherwise.</Paragraph>
      <Paragraph position="5">  Table 1 shows a sample set of annotations on 5 messages by annotator A. Table 2 shows the resulting MA based on the annotation data in Table 1 where p=0.6.  With the two annotation matrices, MA and MB, an agreement matrix, Ag, is constructed where ],[*],[],[ yxMyxMyxAg BA= . A total, ., is set to the sum of all cells of Ag. Finally,</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Compute p(E)
</SectionTitle>
      <Paragraph position="0"> Instead of assuming an even distribution of labels, we compute p(E), the expected probability, using the relative frequencies of each annotator's labeling preference.</Paragraph>
      <Paragraph position="1"> Using the above annotation matrices, relative frequency vectors, Freqannotator, are generated. Table 3 shows</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Results
</SectionTitle>
    <Paragraph position="0"> This technique is not meant to inflate the kappa scores, but rather to provide further insight into how the annotators are using the two labels. Execution of this augmented kappa algorithm on this corpus suggests that the annotation guidelines need revision before the superset corpus is completely annotated. (Only 150 of 380 messages present a label for use in a machine learning experiment with .&gt;0.6.) The exact nature of the adjustments is yet undetermined. However, both a strict specification of when the secondary label ought to be used, and reconsideration of the ten available labels would likely improve the annotation effort.</Paragraph>
    <Paragraph position="1"> When we examine our labeled data, we find the average kappa statistic across the three annotators did not increase through examination of the secondary labels. If we ignore the secondary labels (p=1.0), the average .=0.299. When primary and secondary labels are given equal weight (p=0.5), the average .=0.281.</Paragraph>
    <Paragraph position="2"> By examining the average kappa statistic for each message individually at different p values, messages can be quickly categorized into four classes: those that demonstrate greatest agreement at p = 1.0; those with greatest agreement at p = 0.5; those that yield a nearly constant low kappa value and those that yield a nearly constant high kappa value. These classes suggest certain characteristics about the component messages, and can be employed to improve the ongoing annotation process. Class 1) Those messages that show a constant, high kappa score are those that are consistently categorized with a single label. (92/380 messages.) Class 2) Those messages with a constant, low kappa are those messages that are least consistently annotated regardless of whether a secondary label is used or not. (183/380 messages.) Class 3) Messages that show greater agreement at p = 1.0 than at p = 0.5 demonstrate greater inconsistency when the annotators opt to use the secondary labels but are in (greater) agreement regarding the primary label. Whether the primary label is more general or more specific depends on, hopefully, annotation standards, but in the absence of rigorous instructions, individual annotator preference. (58/380 messages.) Class 4) Messages that show greater agreement at p = 0.5 than at p = 1.0 are those messages where the primary and secondary labels are switched by some annotators, the above {ab,ba} case. From inspection, this most often occurs when the two features are not in a general/specific relationship (e.g., planning and question being selected for a message that contains a question about planning), but are rather concurrent features (e.g., question and answer being labeled on a message that obviously includes both a question and an answer).</Paragraph>
    <Paragraph position="3"> (47/380 messages.) Each of the four categories of messages can be utilized to a distinct end towards improvement of annotation instructions and/or annotation standards. Class 1 messages are clear examples of the labels. Class 2 messages are problematic. These messages can be used to redirect the annotators, revise the annotation manual or reconsider the annotation standards. Class 3 messages are those in which annotators use the optional secondary label, but not consistently.</Paragraph>
    <Paragraph position="4"> These messages can be employed to reinstruct the annotators as to the expected use of the secondary label.</Paragraph>
    <Paragraph position="5"> Class 4 messages pose a real dilemma. When these messages in fact do contain two concurrent features, they are not going to be good examples for machine learning experiments. While representative of both categories, they will (most likely) at feature analysis (the critical component of machine learning algorithms) be poor exemplars of each. While the fate of Class 4 messages is uncertain2, identification of these awkward examples is an important first step in handling their automatic classification.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Conclusion
</SectionTitle>
    <Paragraph position="0"> Calculating a useful metric for interannotator reliability when each data point is marked with optionally one or two labels proved to be a complicated task. Multiple labels raise the possibility of partial agreement between two annotators. In order to compute the observed probability (p(A)) component of the kappa statistic a constant weight, p, between 0.5 and 1.0 is selected. Each singleton annotation is then assigned a weight of 1, while the primary label of a doubleton annotation is assigned a weight of p, the secondary 1-p. These weights are then used to determine the partial agreement in the calculation of p(A). This augmentation to the algorithm for computing kappa is not meant to inflate the reliability metric, but rather to allow for a more thorough view of annotated data. By examining how 2 One potential solution would be to create a new annotation category for each commonly occurring pair. While each Class 4 message would remain a poor exemplar of each component category, it would be a good exemplar of this new &amp;quot;mixed&amp;quot; type.</Paragraph>
    <Paragraph position="1"> the annotated components of a corpus demonstrate agreement at varying levels of p, insight is gained into how the annotators are viewing these data and how they employ the optional secondary label.</Paragraph>
  </Section>
class="xml-element"></Paper>