<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3254">
  <Title>Evaluating information content by factoid analysis: human annotation and stability</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Agreement
</SectionTitle>
    <Paragraph position="0"> In our previous work, a \deflnitive&amp;quot; list of factoids was given (created by one author), and we were interested in whether annotators could consistently mark the text with the factoids contained in this list. In the new annotation cycle reported on here, we study the process of factoid lists creation, which is more time-consuming.</Paragraph>
    <Paragraph position="1"> We will discuss agreement in factoid annotation flrst, as it is a more straightforward concept, even though procedurally, factoids are flrst deflned (cf. section 3.2) and then annotated (cf.</Paragraph>
    <Paragraph position="2"> section 3.1).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Agreement of factoid annotation
</SectionTitle>
      <Paragraph position="0"> Assuming that we already have the right list of factoids available, factoid annotation of a 100 word summary takes roughly 10 minutes, and measuring agreement on the decision of assigning factoids to sentences is relatively straightforward. We calculate agreement in terms of Kappa, where the set of items to be classifled are all factoid{summary combinations (e.g. in the case of Phase 1, N=153 factoids times 20 sentences = 2920), and where there are two categories, either 'factoid is present in summary (1)' or 'factoid is not present in summary (0)'. P(E), probability of error, is calculated on the basis of the distribution of the categories, whereas P(A), probability of agreement, is calculated as the average of observed to possible pairwise agreements per item. Kappa is calculated as</Paragraph>
      <Paragraph position="2"> given in Figure 1.</Paragraph>
      <Paragraph position="3"> We measure agreement at two stages in the process: entirely independent annotation (Phase 1), and corrected annotation (Phase 2). In Phase 2, annotators see an automatically generated list of discrepancies with the other annotator, so that slips of attention can be corrected. Crucially, Phase 2 was conducted without any discussion. After Phase 2 measurement, discussion on the open points took place and a consensus was reached (which is used for the experiments in the rest of the paper).</Paragraph>
      <Paragraph position="4"> Figure 1 includes results for the Fortuyn text as we have factoid{summary annotations by both annotators for both texts. The Kappa flgures indicate high agreement, even in Phase 1 (K=.87 and K=.86); in Phase 2, Kappas are as high as .89 and .95. Note that there is a difference between the annotation of the Fortuyn and the Kuwait text: in the Fortuyn case, there was no discussion or disclosure of any kind in Phase 1; one author created the factoids, and both used this list to annotate. The agreement of K=.86 was thus measured on entirely independent annotations, with no prior communication whatsoever. In the case of the Kuwait text, the prior step of flnding a consensus factoid list had already taken place, including some discussion. null</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Agreement of factoid definition
</SectionTitle>
      <Paragraph position="0"> We realised during our previous work, where only one author created the factoids, that the task of deflning factoids is a complicated process and that we should measure agreement on this task too (using the Kuwait text). Thus, we do not have this information for the Fortuyn text.</Paragraph>
      <Paragraph position="1"> But how should the measurement of agreement on factoid creation proceed? It is di-cult to flnd a fair measure of agreement over set operations like factoid splitting, particularly as the sets can contain a difierent set of summaries marked for each factoid. For instance, consider the following two sentences: (1) M01-004 Saddam Hussein said ... that they will leave the country when the situation stabilizes. and (2) M06-004 Iraq claims it ... would withdraw soon.</Paragraph>
      <Paragraph position="2"> One annotator created a factoid \(P30) Saddam H/Iraq will leave the country soon/when situation stabilises&amp;quot; whereas the other annotator split this into two factoids (F9.21 and F9.22). Note that the annotators use their own, independently chosen factoid names.</Paragraph>
      <Paragraph position="3"> Our procedure for annotation measurement is as follows. We create a list of identity and subsumption relations between factoids by the two annotators. In the example above, P30 would be listed as subsuming F9.21 and F9.22. It is time-consuming but necessary to create such a list, as we want to measure agreement only amongst those factoids which are semantically related. We use a program which maximises shared factoids between two summary sentences</Paragraph>
      <Paragraph position="5"> to suggest such identities and subsumption relations. null We then calculate Kappa at Phases 1 and 2. It is not trivial to deflne what an 'item' in the Kappa calculation should be. Possibly the use of Krippendorfi's alpha will provide a better approach (cf. Nenkova and Passonneau (2004)), but for now we measure using the better-known kappa, in the following way: For each equivalence between factoids A and C, create items f A { C { s j s 2 S g (where S is the set of all summaries). For each factoid A subsumed by a set B of factoids, create items f A ^ b { s j b 2 B, s 2 Sg. For example, given 5 summaries a, b, c, d, e, Annotator A1 assigns P30 to summaries a, c and e. Annotator A2 (who has split P30 into F9.21 and F9.22), assigns a to F9.21 and c and e to F9.22. This creates the 10 items for Kappa calculation given in Figure 2.</Paragraph>
      <Paragraph position="6"> Results for our data set are given in Figure 3. For Phase 1 of factoid deflnition, K=.7 indicates relatively good agreement (but lower than for the task of factoid annotation). Many of the disagreements can be reduced to slips of attention, as the increased Kappa of .81 for Phase 2 shows.</Paragraph>
      <Paragraph position="7"> Overall, we can observe that this high agreement for both tasks points to the fact that the task can be robustly performed in naturally occurring text, without any copy-editing. Still, from our observations, it seems that the task of factoid annotation is easier than the task of factoid deflnition.</Paragraph>
      <Paragraph position="8">  One of us then used the Kuwait consensus agreement to annotate the 16 machine summaries for that text which were created by different participants in DUC-2002, an annotation which could be done rather quickly. However, a small number of missing factoids were detected, for instance the (incorrect) factoid that Saudi Arabia was invaded, that the invasion happened on a Monday night, and that Kuwait City is Kuwait's only sizable town. Overall, the set of factoids available was considered adequate for the annotation of these new texts.</Paragraph>
      <Paragraph position="9">  function of number of underlying summaries.</Paragraph>
      <Paragraph position="10"> 4 Growth of the factoid inventory The more summaries we include in the analysis, the more factoids we identify. This growth of the factoid set stems from two factors. Different summarisers select difierent information and hence completely new factoids are introduced to account for information not yet seen in previous summaries. This factor also implies that the factoid inventory can never be complete as summarisers sometimes include information which is not actually in the original text. The second factor comes from splitting: when a new summary is examined, it often becomes necessary to split a single factoid into multiple factoids because only a certain part of it is included in the new summary. After the very flrst summary, each factoid is a full sentence, and these are gradually subdivided.</Paragraph>
      <Paragraph position="11"> In order to determine how many factoids exist in a given set of N summaries, we simulate earlier stages of the factoid set by automatically re{ merging those factoids which never occur apart within the given set of summaries.</Paragraph>
      <Paragraph position="12"> Figure 4 shows the average number of factoids over 100 drawings of N difierent summaries from the whole set, which grows from 1.0 to about 4.5 for the Kuwait text (long curve) and about 4.1 for the Fortuyn text (short curve). The Kuwait curve shows a steeper incline, possibly due to the fact that the sentences in the Kuwait text are longer. Given the overall growth for the total number of factoids and the number of factoids per sentence, it would seem that the splitting factors and the new information factor are equally productive.</Paragraph>
      <Paragraph position="13"> Neither curve in Figure 4 shows signs that it might be approaching an assymptote. This conflrms our earlier conclusion (van Halteren and Teufel, 2003) that many more summaries than 10 or 20 are needed for a full factoid inventory.2</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Weighted factoid scores and stability
</SectionTitle>
    <Paragraph position="0"> The main reason to do factoid analysis is to measure the quality of summaries, including machine summaries. In our previous work, we did this with a consensus summary. We are now investigating different weighting factors for the importance of factoids. Previously, the weighting factors we suggested were information content, position in the summaries and frequency. We investigated the latter two.</Paragraph>
    <Paragraph position="1"> Each factoid we flnd in a summary to be evaluated contributes to the score of the summary, by an amount which re ects the perceived value of the factoid, what we will call the \weighted factoid score (WFS)&amp;quot;. The main component in this value is frequency, i.e., the number of model summaries in which the factoid is observed.</Paragraph>
    <Paragraph position="2"> When frequency weighting is used by itself, each factoid occurrence is worth one.3 We could also assume that more important factoids are placed earlier in a summary, and that the frequency weight is adjusted on the basis of position. Experimentation is not complete, but the adjustments appear to in uence the ranking only slightly. The results we present here are those using pure frequency weights.</Paragraph>
    <Paragraph position="3"> We noted in our earlier paper that a good quality measure should demonstrate at least the following properties: a) it should reward inclusion in a summary of the information deemed 2It should be noted that the estimation in Figure 4 improves upon the original estimation in that paper, as the determination of number of factoids for that flgure did not consider the splitting factor, but just counted the number of factoids as taken from the inventory at its highest granularity.</Paragraph>
    <Paragraph position="4"> 3This is similar to the relative utility measure introduced by Radev and Tam (2003), which however operates on sentences rather than factoids. It also corresponds to the pyramid measure proposed by Nenkova and Passonneau (2004), which also considers an estimation of the maximum value reachable. Here, we use no such maximum estimation as our comparisons will all be relative.</Paragraph>
    <Paragraph position="5">  summary rankings on the basis of two difierent sets of N summaries, for N between 1 and 50.</Paragraph>
    <Paragraph position="6"> most important in the document and b) measures based on two factoid analyses constructed along the same lines should lead to the same, or at least very similar, ranking of a set of summaries which are evaluated. Since our measure rewards inclusion of factoids which are mentioned often and early, demand a) ought to be satisfled by construction.</Paragraph>
    <Paragraph position="7"> For demand b), some experimentation is in order. For various numbers of summaries N, we take two samples of N summaries from the whole set (allowing repeats so that we can use N larger than the number of available summaries; a statistical method called 'bootstrap'). For each sample in a pair, we use the weighted factoid score with regard to that sample of N summaries to rank the summaries, and then determine the ranking correlation (Spearman's %0) between the two rankings. The summaries that we rank here are the 20 human summaries of the Kuwait text, plus 16 machine summaries submitted for DUC-2002.</Paragraph>
    <Paragraph position="8"> Figure 5 shows how the ranking correlation increases with N for the Kuwait text. Its mean value surpasses 0.8 at N=11 and 0.9 at N=19.</Paragraph>
    <Paragraph position="9"> At N=50, it is 0.98. What this means for the scores of individual summaries is shown in Figure 6, which contains a box plot for the scores for each summary as observed in the 200 drawings for N=50. The high ranking correlation and the reasonable stability of the scores shows that our measure fulfllls demand b), at least at a high enough N. What could be worrying is the fact that the machine summaries (right of the dotted line) do not seem to be performing signiflcantly worse than the human ones (left  uations based on 200 difierent sets of 10 model summaries.</Paragraph>
    <Paragraph position="10"> of the line). However, an examination of the better scoring machine summaries show that in this particular case, their information content is indeed good. The very low human scores appear to be cases of especially short summaries (including one DUC summariser) and/or summaries with a deviating angle on the story.</Paragraph>
    <Paragraph position="11"> It has been suggested in DUC circles that a lower N should su-ce. That even a value as high as 10 is insu-cient is already indicated by the ranking correlation of only 0.76. It becomes even clearer with Figure 7, which mirrors Figure 6 but uses N=10. The scores for the summaries vary wildly, which means that ranking is almost random.</Paragraph>
    <Paragraph position="12"> Of course, the suggestion might be made that the system ranking will most likely also be stabilised by scoring summaries for more texts, even with such a low (or even lower) N per text. However, in that case, the measure only yields information at the macro level: it merely gives an ordering between systems. A factoid-based measure with a high N also yields feedback on a micro level: it can show system builders which vital information they are missing and which super uous information they are including. We expect this feedback only to be reliable at the same order of N at which single-text-based scoring starts to stabilise, i.e. around 20 to 30. As the average ranking correlation between two weighted factoid score rankings based on 20 summaries is 0.91, we could assume that the ranking based on our full set of 20 difierent summaries should be an accurate ranking. If we compare it to the DUC information overlap rankings for this text, we flnd that the individual rankings for D086, D108 and D110 have correlations with our ranking of 0.50, 0.64 and 0.79. When we average over the three, this goes up to 0.83.</Paragraph>
    <Paragraph position="13"> In van Halteren and Teufel (2003), we compared a consensus summary based on the top-scoring factoids with unigram scores. For the 50 Fortuyn summaries, we calculate the F-measure for the included factoids with regard to the consensus summary. In a similar fashion, we build a consensus unigram list, containing the 103 unigrams that occur in at least 11 summaries, and calculate the F-measure for unigrams. The correlation between those two scores was low (Spearman's %0 = 0.45). We concluded from this experiment that unigrams, though much cheaper, are not a viable substitute for factoids.</Paragraph>
  </Section>
class="xml-element"></Paper>