<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0509">
  <Title>A Survey for Multi-Document Summarization</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Task and annotator
</SectionTitle>
    <Paragraph position="0"> We have four tasks and three annotators (indicated by a number). Annotator 1 and 2 did the same task, but annotator 3 did only a part of it. All of them have college degrees, in particular annotators 1 and 2 are Japanese native speakers and have majors in linguistics at US universities.</Paragraph>
    <Paragraph position="1"> Some examples (free summaries for one document set, and axes and table data for three sets, all translated into English) are shown in the appendix.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Free style summarization
</SectionTitle>
    <Paragraph position="0"> The first task is a free style summarization. The inter-annotator agreement based on the word vector metric adopted by TSC evaluation (TSC homepage) is calculated. This is a cosine metric of tf*idf measure of the words in the summaries. Most of the pairs (37 sets out of 40 sets) had values of 0.5 or more, which is much larger than that of automatic systems measured against the human made summaries in TSC-1 (ranging around 0.4 in 10% summary, 0.5 in 40% summary). We can reasonably believe the summaries are very reliable.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Sentence Extraction
</SectionTitle>
    <Paragraph position="0"> Now we will look at the summarization by sentence extraction. Annotators 1 and 3 conducted the task for the entire data, so we will compare the results of those two. We asked the annotators to extract about 20% of the sentences as a summary of each document set, but the actual numbers of extracted sentences are slightly different between the two. Table 2 shows the number of sentences selected by the two annotators with inter-annotator agreement data. The number of sentences selected by both annotators (533) looks low, compared to the number of sentences selected by only one annotator (650 and 746). However, the chi-square test is 513.9, which means that these two results are strongly correlated (less than 0.0001% chance).</Paragraph>
    <Paragraph position="1"> Annotator 1 annotator 3 selected not selected</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Axis
</SectionTitle>
    <Paragraph position="0"> Axis is based on the idea of (McKeown et al. 2001).</Paragraph>
    <Paragraph position="1"> They defined 4 categories of document sets based on the main topic of the document set for the purpose of using different summarization strategies (they actually used two sub-systems), shown in Table 3.</Paragraph>
    <Paragraph position="3"> The documents center around one single event at one place and at roughly the same time, involving the same agents and actions</Paragraph>
    <Paragraph position="5"> The documents deal with one event concerning one person</Paragraph>
    <Paragraph position="7"> Several events occurring at different places and times and usually with different protagonists, are reported</Paragraph>
    <Paragraph position="9"> The number in brackets for each category indicates the number of document sets in the DUC 2001 training data. As can be seen, the number of &amp;quot;person centered&amp;quot; sets is quite high. We believe this is due to the pre-filtering in the DUC data. &amp;quot;Other&amp;quot; is also high, which means more categories may be needed.</Paragraph>
    <Paragraph position="10"> We created new categories based on our study of document sets (other than the 100 sets reported here).</Paragraph>
    <Paragraph position="11"> We defined 13 categories, shown in Table 4, for what we will call the axis of the document set.</Paragraph>
    <Paragraph position="12">  The axis is a combination of two types of information; single or multi, and 6 kinds of named entities (person, location, organization, facility, product and event). &amp;quot;Single&amp;quot; means that all the articles are talking about a single event, person or other entity, whereas &amp;quot;Multi&amp;quot; articles are talking about multiple entities that might participate in similar types of events. We used 6 categories of entity types, which are the major categories defined in the MUC (Grishman and Sundheim 1996) or ACE project (ACE homepage). For example, if a document set is talking about Einstein's biography, it should be tagged as &amp;quot;single-person&amp;quot;, and if a set is talking about earthquakes in California last year, it should be tagged as &amp;quot;multi-event&amp;quot;.</Paragraph>
    <Paragraph position="13"> In order to demonstrate the validity of the categories, we tried to categorize the training data of DUC 2001's multi-document sets into our categories. Two people assigned one or two categories to each set. We allow more than one axis to a document set, as some document sets should be inherently categorized into more than one axis. If we consider only the first choices, the inter annotator agreement ratio is 80% and if we include the second choices, the ratio is 93.3%.</Paragraph>
    <Paragraph position="14"> We believe the categorization is practical. Table 5 shows the distribution of axis categories tagged on our 100 data sets by three annotators. Note that annotators 1 and 2 assigned more than one axis to some data sets, so the totals exceed 100.</Paragraph>
    <Paragraph position="15"> All the categories except multi-facility are used by at least two annotators. Because the axis &amp;quot;other&amp;quot; is used rarely, the set of axes are empirically found to have quite good coverage.</Paragraph>
    <Paragraph position="16"> The inter-annotator agreements are 55, 61 and 67 % among the three annotators. Although the ratios are lower than that on the DUC data (we believe this is because of the pre-filtering of document sets), the agreement is still high at 55-67% even though there are 13 kinds of axis. Note that chi-square test is not suitable to measure the data because of the data sparseness.  There are 39 document sets that have the same axis assigned by the three annotators, and there are only 7 document sets that have three different axes by three annotators (no overlap at all). Even when different categories are tagged, sometimes all of them are understandable and we can say that these are all correct. So for some document sets, more than one category is instinctively correct. This result indicates that, for some large percentage of document sets, it is possible to assign axis(es). We believe for summarizing those document sets, knowing the axis before summarization could be quite helpful. We are seeking a method to automate the process of finding the axis(es).</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Table
</SectionTitle>
    <Paragraph position="0"> A table is a good way to summarize a document set talking about multiple events of the same type, a collection of similar events or chronological events. We asked annotators to make a table for each document set.</Paragraph>
    <Paragraph position="1"> Table 6 shows some statistics of the created tables. The average number of columns is 3.47 and 5.25 for annotator 1 and annotator 2, respectively. Regarding comparison between tables, the percentages of complete overlap (relationship of columns is 1 to 1 and the same information is collected in the column) are 58% and 38%. The percentages of overlap (relationship of columns is not 1 to 1, but the information of the columns is overlapping between the tables) are 94% and 70%.</Paragraph>
    <Paragraph position="2"> We can see that annotator 1 made fewer columns than annotator 2, and most columns made by annotator 1 overlap columns made by annotator 2. So the difference is probably due to the fact that annotator 2 made more detailed tables. (As this is the first such survey, it was not easy to create good instructions) In other words, it might be the case that most important information (which was turned into columns) is simultaneously found by the two annotators.</Paragraph>
    <Paragraph position="3"> Annotator 1 Annotator 2 Ave. num. of column 3.47 5.25  When we compared the tables created by the two annotators one by one, we categorized the results into 5 categories.</Paragraph>
    <Paragraph position="4"> A) Two tables are completely the same B) The information in the tables is the same, but the way of segmenting information into columns is different. For example, one of the tables has a column &amp;quot;visiting activity (of a diplomat)&amp;quot; including information about visiting place, person and purpose, whereas the other table has columns &amp;quot;visiting place&amp;quot;, &amp;quot;the person to meet&amp;quot; and &amp;quot;purpose of the visit&amp;quot;.</Paragraph>
    <Paragraph position="5"> C) Missing one or two columns from either or both of tables (in total). This means one of the tables has one or two fewer columns and the information in the columns is not mentioned in the other table. As we can guess from Table 6, most of the missing columns were found in tables of annotator 1.</Paragraph>
    <Paragraph position="6"> D) Missing more than two columns from tables.</Paragraph>
    <Paragraph position="7"> E) The two tables are completely different in structure, because of the table creator's different point of view.</Paragraph>
    <Paragraph position="8">  There are only a small number of document sets (8) from which the annotators made completely the same table. However, for more than half the document sets, the tables created by the two annotators are quite similar (including &amp;quot;same table&amp;quot;, &amp;quot;only segmentation&amp;quot; and &amp;quot;missing one or two columns&amp;quot;). This is complementary to the result shown in Table 6; for many document sets, the tables by annotator 2 have additional information compared to the tables by annotator 1.</Paragraph>
    <Paragraph position="9"> We also asked the annotators to judge if each document set is suitable to summarize into a table. We made three categories for the survey.</Paragraph>
    <Paragraph position="10"> A) Table is natural for summarizing the document set B) Information can be summarized in table format C) Table is not suitable to summarize the document set The result for the two annotators is shown in Table 8. Annotators 1 and 2 judged 40 and 45 sets to be suitable for a table, 36 and 38 are OK and 24 and 17 are not suitable. This is an interesting result - that for so many document sets (40-45%) a table is judged to be natural for summarizing. Compared to that, only a smaller fraction (17-24%) are judged unsuitable. The relationships between the two annotators' judgments are also shown in Table 8. The Chi-test is 17.94 and the probability is 0.13%; that means that the two judges are highly correlated.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
8 Discussion
</SectionTitle>
    <Paragraph position="0"> We reported a survey for multi-document summarization. We believe the results are encouraging for the pursuit of some novel strategies of multi-document summarization.</Paragraph>
    <Paragraph position="1"> One of them is the notion of axis. As we observed that for some percentage of the document sets, the axis can be tagged with some certainty, we might be able to make an automatic system to find it. Once the axis is correctly found, it might be useful for multi document summarization. For example, if a set is &amp;quot;single-person&amp;quot; then the summary for the set should be centered on the person. This may suggest, for example, generating a summary of type 'biography' (Mani 2001). If a document set is found to be &amp;quot;multi-event&amp;quot;, then the summary should focus on the differences of the events.</Paragraph>
    <Paragraph position="2"> The other result found in the experiment is that a quite large percentage of document sets can be summarized in table format. As this is a preliminary experiment, there is incompleteness in the instruction and we believe further study on this topic is necessary. In addition to setting guidelines for the degree of detail, the style of cell contents shall be more uniform. Currently, cells contain words, phrases and sentences. We believe that by making more careful instructions for annotation, the comparison between different tables can be more systematized. In other words, a systematic evaluation may be possible.</Paragraph>
  </Section>
class="xml-element"></Paper>