<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0508"> <Title>Examining the consensus between human summaries: initial experiments with factoid analysis</Title> <Section position="4" start_page="0" end_page="0" type="relat"> <SectionTitle> 2 Data and factoid annotation </SectionTitle> <Paragraph position="0"> Our goal is to compare the information content of different summaries of the same text. In this initial investigation we decided to focus on a single text. The text used for the experiment is a BBC report on the killing of the Dutch politician Pim Fortuyn. It is about 600 words long, and contains a mix of factual information and personal reactions.</Paragraph> <Paragraph position="1"> Our guidelines asked the human subjects to write generic summaries of roughly 100 words. We asked them to formulate the summary in their own words, so that we could also see which different textual forms are produced for the same information.</Paragraph> <Paragraph position="2"> Knowledge about the variability of expression is important both for evaluation and for system building, and particularly so in multi-document summarisation, where redundant information is likely to occur in different textual forms.</Paragraph> <Paragraph position="3"> We used two types of human summarisers. The largest group consisted of Dutch students of English and of Business Communications (with English as a chosen second language). Of the 60 summaries we received, we had to remove 20. Summaries were removed if it was obvious from the summary that the student had insufficient skill in English, or if the word count was too high (above 130 words). A second group consisted of 10 researchers, who are either native or near-native English speakers. With this group there were no problems with language, format or length, and we could use all 10 summaries.</Paragraph> <Paragraph position="4"> Our total number of summaries was thus 50.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 The factoid as atomic information unit </SectionTitle> <Paragraph position="0"> We use atomic semantic units called factoids to represent the meaning of a sentence. For instance, we represent the sentence "The police have arrested a white Dutch man."</Paragraph> <Paragraph position="1"> by the union of the following factoids: Note that in this case, factoids correspond to expressions in a FOPL-style semantics, which are compositionally interpreted. However, we define atomicity as a concept which depends on the set of summaries we work with. If a certain set of potential factoids always occurs together, this set of factoids is treated as one factoid, because differentiation within this set would not help us in distinguishing the summaries. If we had found, e.g., that there is no summary which mentions only one of FP25 and FP26, those factoids would be combined into one new factoid "FP27 The suspect is a Dutch man".</Paragraph> <Paragraph position="2"> Our definition of atomicity means that the "amount" of information associated with one factoid can vary from a single word to an entire sentence.</Paragraph> <Paragraph position="3"> An example of a large chunk of information that occurred atomically in our texts is the fact that the victim wanted to become PM (FV71), a factoid which covers an entire sentence. On the other hand, a single word may contain several factoids. The word "gunman" leads to two factoids: "FP24 The perpetrator is male" and "FA20 A gun was used in the attack".</Paragraph>
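For readers who want to experiment with this definition, the following minimal Python sketch (with hypothetical summary data and factoid IDs) shows how summary-set-dependent atomicity could be operationalised: candidate factoids that occur in exactly the same summaries are merged into a single factoid.

```python
# Sketch of summary-set-dependent atomicity (hypothetical data).
# Candidate factoids occurring in exactly the same subset of summaries
# carry no distinguishing information and are merged into one factoid.

from collections import defaultdict

# Factoid presence per summary: summary id -> set of candidate factoid ids.
summaries = {
    "S001": {"FP20", "FP25", "FP26"},
    "S002": {"FP20", "FP24"},
    "S003": {"FP25", "FP26"},
}

def merge_co_occurring(summaries):
    """Group candidate factoids by their presence pattern across summaries."""
    by_pattern = defaultdict(list)
    for f in sorted(set().union(*summaries.values())):
        presence = frozenset(s for s, fs in summaries.items() if f in fs)
        by_pattern[presence].append(f)
    # Each group whose members always co-occur becomes one merged factoid.
    return list(by_pattern.values())

print(merge_co_occurring(summaries))
# [['FP20'], ['FP24'], ['FP25', 'FP26']]  (order may vary)
# FP25 and FP26 never occur apart, so they merge, analogous to the
# hypothetical "FP27 The suspect is a Dutch man" above.
```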
<Paragraph position="4"> The advantage of our functional, summary-set-dependent definition of atomicity is that what counts as a factoid is determined more objectively than if factoids had to be invented by intuition, which is hard. One possible disadvantage of our definition of atomicity (which is dependent on a given set of summaries) is that the set of factoids used may have to be adjusted if further summaries are added to the collection. In practice, for a fixed set of summaries used in experiments, this is less of an issue. We decompose meanings into separate (compositionally interpreted) factoids if there are mentions in our texts which imply information overlap. If one summary contains "was murdered" and another "was shot dead", we can identify the factoids: The first summary contains only the first two factoids, whereas the second contains all three. That way, the semantic similarity between related words can be expressed.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Compositionality, generalisation and factuality </SectionTitle> <Paragraph position="0"> The guidelines for manual annotation of summaries with factoids stated that only factoids which are explicitly expressed in the text should be marked. When we identified factoids in our actual summary collection, most factoids turned out to be independent of each other, i.e. the union of the factoids can be compositionally interpreted. However, there are relations between factoids which are not as straightforward. For instance, in the case of "FA21 Multiple shots were fired" and "FA22 Six shots were fired", FA22 implies FA21; any attempt to express the relationship between these factoids in a compositional way would result in awkward factoids. We accept that there are factoids which are most naturally expressed as generalisations of other factoids, and record for each factoid a list of factoids that are more general than it, so that we can include these related factoids as well. In one view of our data, if a summary states FA22, FA21 is automatically added.</Paragraph> <Paragraph position="1"> In addition to generality, there are two further complicated phenomena we had to deal with. The first is real inference, rather than generalisation, as in the following cases: which in turn implies FL50. We again record inference relations and automatically compute the transitive closure of all inferences, but we do not currently formally distinguish them from the simpler generalisation relations.</Paragraph>
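To make this bookkeeping concrete, here is a minimal Python sketch of how recorded generalisation and inference links might be expanded transitively when reading off a summary's factoid set; the link table is illustrative, and only the FA22-to-FA21 link comes from the text above.

```python
# Sketch: expanding a summary's factoid set along recorded
# generalisation/inference links. Only FA22 -> FA21 ("six shots" implies
# "multiple shots") is from the text; the FL5x chain is hypothetical.

more_general = {
    "FA22": {"FA21"},  # six shots were fired -> multiple shots were fired
    "FL52": {"FL51"},  # hypothetical inference chain ...
    "FL51": {"FL50"},  # ... whose transitive closure also yields FL50
}

def expand(factoids, links):
    """Return the factoid set closed under the given implication links."""
    closed = set(factoids)
    frontier = list(factoids)
    while frontier:
        for implied in links.get(frontier.pop(), ()):
            if implied not in closed:
                closed.add(implied)
                frontier.append(implied)
    return closed

print(expand({"FA22"}, more_general))  # {'FA22', 'FA21'}
print(expand({"FL52"}, more_general))  # {'FL52', 'FL51', 'FL50'}
```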
<Paragraph position="2"> The second phenomenon is the description of people's opinions. In our source document, quotations of the reactions of several politicians were given. In the summaries, our subjects often generalised these reactions and produced statements such as "Dutch as well as international politicians have expressed their grief and disbelief."</Paragraph> <Paragraph position="3"> As more than one entity can be reported as saying the same thing, straightforward factoid union is not powerful enough to accurately represent the attribution of opinions, as our notation does not contain variables for discourse referents and quoted statements. We therefore resort to a separate set of multiplied-out factoids, which combine the statement (what is being said) with a description of who said it. Elements of the description can be interpreted in a compositional manner. For instance, the above sentence is expressed in our notation as:</Paragraph> <Paragraph position="4"> Another problem with the attribution of opinions is that there is not always a clear distinction between fact and opinion. For instance, the following sentence is presented as opinion in the original: "Geraldine Coughlan in the Hague says it would have been difficult to gain access to the media park." Nevertheless, our summarisers often decided to represent such opinions as facts, i.e. as "The media park was difficult to gain entry to." In fact, in our data, every summary containing this factoid presents it as fact. For now, we have taken the pragmatic approach that the classification of factoids into factual and opinion factoids is determined by the actual representation of the information in the summaries (cf. FL51 above, where a first letter "F" stands for factual and a first letter "O" for opinion).</Paragraph> <Paragraph position="5"> The factoid approach can capture much finer shades of meaning than DUC-style information overlap does. In an example from Lin and Hovy (2002), an assessor judged some content overlap between "Thousands of people are feared dead" and "3,000 and perhaps ... 5,000 people have been killed." In our factoid representation, a distinction between "killed" and "feared dead" would be made, and the different numbers of people mentioned would be differentiated.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Factoid annotation </SectionTitle> <Paragraph position="0"> The authors have independently marked the presence of factoids in all summaries in the collection.</Paragraph> <Paragraph position="1"> Factoid annotation of a 100-word summary takes roughly half an hour. Even with only short guidelines, agreement on which factoids are present in a summary appears to be high: the recall of an individual annotator with regard to the consensus annotation is about 96%, and precision about 97%.</Paragraph> <Paragraph position="2"> This means that we can work with the current factoid presence table with reasonable confidence.</Paragraph> <Paragraph position="3"> Whereas single summaries contain between 32 and 55 factoids, the collection as a whole contains 256 different factoids. Figure 1 shows the growth of the number of factoids with the size of the collection (1 to 40 summaries). We assume that the curve is Zipfian. This implies that larger numbers of summaries are necessary if we are looking for a definitive factoid list for a document.</Paragraph>
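The growth curve underlying Figure 1 is straightforward to compute. The sketch below uses randomly generated stand-in data (uniform rather than Zipfian, so it flattens faster than the real curve) in place of the actual factoid presence table.

```python
# Sketch of the computation behind Figure 1: growth of the number of
# distinct factoids with collection size. Stand-in data: each summary
# samples 32-55 of 256 factoids uniformly; the real factoid distribution
# is skewed, so the real curve keeps growing for longer.

import random

random.seed(0)
factoid_ids = [f"F{i:03d}" for i in range(256)]
collection = [set(random.sample(factoid_ids, random.randint(32, 55)))
              for _ in range(40)]

seen = set()
for n, summary in enumerate(collection, start=1):
    seen |= summary
    print(n, len(seen))  # collection size vs. distinct factoids observed
```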
<Paragraph position="4"> The maximum number of possible factoids is not bounded by the number of factoids occurring in the document itself. As we explained above, factoids come into existence because they are observed in the collection of summaries, and summaries sometimes contain factoids which are not actually present in the document. Examples of such factoids are "FP31 The suspect has made no statement", which is true but not stated in the source text, and "FP23 The suspect was arrested on the scene", which is not even true. The reasons for such "creative" factoids vary from the expression of the summarisers' personal knowledge or opinion to misinterpretation of the source text. In total we find 87 such factoids: 51 factual ones and 36 incorrect generalisations of attribution.</Paragraph> <Paragraph position="5"> Of the remaining 169 "correct" factoids, most (125) are factual. Within these factoids, we find 74 generalisation links. The rest of the factoids concern opinions and their attribution. There are 18 descriptions of opinion, with 11 generalisation links, and 26 descriptions of attribution, with 16 generalisation links. For all types, we see that most facts are represented at differing levels of generalisation. Some of the generalisation links are part of 3- or 4-link hierarchies, e.g. "FV40 Victim outspoken about/campaigning on immigration issues" (26 mentions) to "FV41 Victim was anti-immigration" (23) to "FV42 Victim wanted to close borders to immigrants". It is not surprising that more specific factoids are less frequent than their generalisations, but we expect interesting correlations between a factoid's importance and the degree and shape of the decline of its generalisation hierarchy, especially where factoids about the attribution of opinion are concerned. This is an issue for further research.</Paragraph> <SectionTitle> 3 Human summaries as benchmark for evaluation </SectionTitle> <Paragraph position="6"> If we plan to use human summaries as a reference point for the evaluation of machine-made summaries, we are assuming that there is some consensus between the human summarisers as to which information is important enough to include in a summary. Whether such consensus actually exists is uncertain.</Paragraph> <Paragraph position="7"> In very broad terms, we can distinguish four possible scenarios: 1. There is a good consensus between all human summarisers. A large percentage of the factoids present in the summaries is in fact present in a large percentage of the summaries. We can determine whether this is so by measuring factoid overlap.</Paragraph> <Paragraph position="8"> 2. There is no such overall consensus between all summarisers, but there are subsets of summarisers between whom consensus exists. Each of these subsets has summarised from a particular point of view, even though a generic summary was requested, and that point of view has led to group consensus. We can determine whether this is so by performing a cluster analysis on the factoid presence vectors; we should find clusters if and only if group consensus exists.</Paragraph> <Paragraph position="9"> 3. There is no overall consensus, but there is a difference in perceived importance between the various factoids. We can determine whether this is the case by examining how often each factoid is used in the summaries; factoids that are more important ought to be included more often. In that case, it is still possible to create a consensus-like reference summary for any desired summary size.</Paragraph> <Paragraph position="10"> 4. There is no difference in perceived importance of the various factoids at all: inclusion of factoids in summaries appears to be random.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Factoid frequency and consensus </SectionTitle> <Paragraph position="0"> We will start by examining whether an importance hierarchy exists, as this can help us decide between scenarios 1, 3 and 4. If still necessary, we can check for group consensus later.</Paragraph>
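Before turning to the data, here is a minimal sketch of the frequency analysis that follows, with a tiny hypothetical presence table: count in how many summaries each factoid occurs and keep those above a given threshold.

```python
# Sketch: factoid frequency counts and threshold-based consensus selection.
# 'collection' is a list of factoid sets, one per summary (stand-in data).

from collections import Counter

collection = [
    {"FA10", "FV60", "FP24"},
    {"FA10", "FV60"},
    {"FA10", "FV60", "FA20"},
    {"FA10", "FP24"},
]

counts = Counter(f for summary in collection for f in summary)

def consensus_factoids(counts, n_summaries, threshold):
    """Factoids occurring in at least `threshold` (fraction) of summaries."""
    return {f for f, c in counts.items() if c / n_summaries >= threshold}

# The thresholds mirror those used in the text (100%, 90%, 75%, 50%, 30%).
for t in (1.0, 0.9, 0.75, 0.5, 0.3):
    print(t, sorted(consensus_factoids(counts, len(collection), t)))
```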
<Paragraph position="1"> If we count how often each factoid is used, it quickly becomes clear that we do not have to worry about worst-case scenario 4: there are clear differences in the frequency of use of the factoids. On the other hand, scenario 1 does not appear to be very likely either. There is full consensus on the inclusion of only a meagre 3 factoids, which can be summarised in 3 words: Fortuyn was murdered.</Paragraph> <Paragraph position="2"> If we accept some disagreement, and take the factoids which occur in at least 90% of the summaries, the consensus summary grows to 5 factoids and 6 words: Fortuyn, a politician, was shot dead.</Paragraph> <Paragraph position="3"> Setting our aims lower still, 75% of the summaries include 6 further factoids, and the summary goes up to 20 words: Pim Fortuyn, a Dutch right-wing politician, was shot dead before the election. A suspect was arrested. Fortuyn had received threats.</Paragraph> <Paragraph position="4"> A 50% threshold yields 8 more factoids and the following 47-word summary: Pim Fortuyn, a Dutch right-wing politician, was shot dead at a radio station in Hilversum. Fortuyn was campaigning on immigration issues and was expected to do well in the election. He had received threats. There were shocked reactions. Political campaigning was halted. The police arrested a man.</Paragraph> <Paragraph position="5"> If we want to arrive at a 100-word summary (actually 104 words), we need to include 26 more factoids, and we need to allow all factoids which occur in at least 30% of the summaries: Pim Fortuyn was shot six times and died shortly afterwards. He was attacked when leaving a radio station in the (well-secured) media park in Hilversum. The Dutch far-right politician was campaigning on an anti-immigration ticket and was outspoken about Islam. He was expected to do well in the upcoming election, getting at least 15% of the votes. Fortuyn had received threats. He expected an attack and used bodyguards. Dutch and international politicians were shocked and condemned the attack. The Dutch government called a halt to political campaigning. The gunman was chased. The police later arrested a white Dutch man. The motive is unknown.</Paragraph> <Paragraph position="6"> We conclude that the extreme scenarios, full consensus and full absence of consensus, can be rejected for this text. This leaves the question whether the partial consensus takes the form of clusters of consenting summarisers.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Summariser clusters </SectionTitle> <Paragraph position="0"> In order to determine whether the summarisers can be assigned to groups within which a large amount of consensus can be found, we turn to statistical techniques. We first form 256-dimensional binary vectors recording the presence of each of the factoids in each summariser's summary. We also added a vector for the 104-word consensus summary above ("Cons"). We then calculate the distances between the various vectors and use these as input for classical multi-dimensional scaling. The result of scaling into two dimensions is shown in Figure 2.</Paragraph>
[Figure 2: Scaling of the distances between factoid vectors into two dimensions]
<Paragraph position="1"> Only a few small clusters appear to emerge. Although we certainly cannot conclude that there are no clusters, we would have expected more clearly delimited groups of summarisers, i.e. different points of view, if scenario 2 described the actual situation. For now we will assume that, for this document, scenario 3 is the most likely.</Paragraph> </Section>
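For replication purposes, a sketch of the scaling step follows. It uses scikit-learn's metric (SMACOF-based) MDS rather than the classical scaling used above, and random stand-in vectors, since the real presence table is not reproduced here; both choices are assumptions.

```python
# Sketch: multi-dimensional scaling of binary factoid-presence vectors
# into two dimensions (cf. Figure 2). Random stand-in vectors; the paper
# used classical scaling, scikit-learn offers the closely related
# SMACOF-based metric MDS.

import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
vectors = rng.integers(0, 2, size=(51, 256))  # 50 summaries + "Cons"

# Pairwise Euclidean distances between the presence vectors.
diffs = vectors[:, None, :] - vectors[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=-1))

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dists)
print(coords[:5])  # 2-D points, to be plotted and inspected for clusters
```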
<Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 The consensus summary as an evaluation tool </SectionTitle> <Paragraph position="0"> Two of the main demands on a gold-standard generic summary for evaluation are: a) that it contains the information deemed most important in the document, and b) that two gold-standard summaries constructed along the same lines lead to the same, or at least very similar, rankings of a set of summaries which are evaluated.</Paragraph> <Paragraph position="1"> If we decide to use a single human summary as a gold standard, we in fact assume that this human's choice of important material is acceptable to all other summary users, which is the wrong assumption, as the lack of consensus between the various human summaries shows. We propose instead the use of a reference summary based on the factoid importance hierarchy described above, as it uses a less subjective indication of the relative importance of the information units in the text across a population of summary writers. The reference summary then takes the form of a consensus summary, in our case the 100-word compound summary built from the factoids over the 30% threshold.</Paragraph> <Paragraph position="2"> The way the consensus summary is constructed indicates that demand a) will be catered for, but we still have to check demand b). We can do this by computing rankings based on the F-measure for included factoids, and measuring the correlation coefficient ρ between them.</Paragraph> <Paragraph position="3"> As we do not have a large number of automatic summaries of our text available, we use our 50 human summaries as data, pretending that they are summaries we wish to rank (evaluate).</Paragraph> <Paragraph position="4"> If we compare the rankings produced with single human summaries as gold standards, it turns out that the ranking correlation ρ between two such "gold" standards is indeed very low, at an average of 0.20 (varying between -0.51 and 0.85). For the consensus summary, we can compare rankings for various numbers of base summaries. After all, the consensus summary should improve with the number of contributing base summaries and ought to approach an ideal consensus summary, which would be demonstrated by a stabilizing derived ranking.</Paragraph> <Paragraph position="5"> We investigate whether this assumption is correct by creating pairs of samples of N=5 to 200 base summaries, drawn (in a way similar to bootstrapping) from our original sample of 50. For each pair of samples, we automatically create a pair of consensus summaries and then determine how well these two agree in their ranking. Figure 3 shows how ρ increases with N (based on 1000 trials per N). At N=5 and N=10, ρ has clearly unacceptable averages of 0.40 and 0.53, respectively. The average reaches 0.80 at 45 base summaries, 0.90 at 95, and 0.95 at a staggering 180.</Paragraph> <Paragraph position="6"> We must note, however, that we have to be careful with these measurements, since 40 of our 50 starting summaries were made by less experienced non-natives. In fact, if we bootstrap pairs of N=10 base summary samples (100 trials) on just the 10 higher-quality summaries (created by natives and near-natives), we get an average ρ of 0.74. The same experiment on 10 different summaries from the other 40 (100 trials for choosing the 10, and for each of these 100 trials to estimate the average ρ) yields average ρ's ranging from 0.55 to 0.63, so the difference in experience clearly has an effect. Even so, the "better" summaries lead to a ranking correlation of ρ=0.74 at N=10, which is still much lower than we would like to see. We estimate that with this type of summary an acceptably stable ranking (ρ around 0.90) would be reached somewhere between 30 and 40 base summaries.</Paragraph>
[Figure 3: Ranking correlation ρ for 50 summaries on the basis of two consensus summaries, each based on a size-N base summary collection, for N between 5 and 200]
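The bootstrap experiment can be summarised in code as follows; everything below (the data, the 30% threshold, the sample sizes) is a stand-in reconstruction of the procedure described above, not the original scripts.

```python
# Sketch of the ranking-stability experiment (Figure 3), on stand-in data:
# two bootstrap samples of N base summaries yield two consensus factoid
# sets; all 50 summaries are scored against each by factoid F-measure,
# and the two resulting rankings are compared with Spearman's rho.

import random
from scipy.stats import spearmanr

random.seed(0)
factoid_ids = [f"F{i:03d}" for i in range(256)]
summaries = [set(random.sample(factoid_ids, random.randint(32, 55)))
             for _ in range(50)]

def consensus(sample, threshold=0.3):
    """Factoids occurring in at least `threshold` of the sampled summaries."""
    return {f for f in factoid_ids
            if sum(f in s for s in sample) / len(sample) >= threshold}

def f_measure(summary, gold):
    tp = len(summary & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(summary), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def ranking_agreement(n_base):
    # Two bootstrap samples (drawn with replacement) -> two "gold" standards.
    g1 = consensus(random.choices(summaries, k=n_base))
    g2 = consensus(random.choices(summaries, k=n_base))
    s1 = [f_measure(s, g1) for s in summaries]
    s2 = [f_measure(s, g2) for s in summaries]
    return spearmanr(s1, s2).correlation

for n in (5, 10, 45, 95):
    print(n, round(ranking_agreement(n), 2))
```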
<Paragraph position="7"> The factoid measure does come at a considerable cost, however, viz. the need for human interpretation when mapping summaries to factoid lists. The question is whether simpler measures might not be equally informative. We investigate this using unigram overlap, following Papineni et al. (2001) in their suggestion that unigrams best represent contents, while longer n-grams best represent fluency.</Paragraph> <Paragraph position="8"> Again, we reuse our 50 summaries as summaries to be evaluated. For each of these summaries, we calculate the F-measure for the included factoids with regard to the consensus summary shown above. In a similar fashion, we build a consensus unigram list, containing the 103 unigrams that occur in at least 11 summaries, and calculate the F-measure for unigrams. The two measures are plotted against each other in Figure 4.</Paragraph> <Paragraph position="9"> Some correlation is present (r = 0.48 and Spearman's ranking correlation ρ = 0.45), but there are clearly profound differences. If we look at the rankings produced from these two F-measures, S054, at position 16 on the basis of factoids, drops to position 37 on the basis of unigrams. S046, on the other hand, climbs from 42nd to 4th place when ranked by unigrams instead of factoids. Apart from these extreme cases, there are also clear differences in the top five for the two measurements: S030, S028, R001, S003 and S023 are the top five when measuring with factoids, whereas S032, R002, S030, S046 and S028 are the top five when measuring with unigrams.</Paragraph> <Paragraph position="10"> It would seem that unigrams, though much cheaper, are not a viable substitute for factoids.</Paragraph> </Section> </Section> </Paper>