<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0902">
  <Title>On the Subjectivity of Human Authored Short Summaries</Title>
  <Section position="4" start_page="9" end_page="11" type="metho">
    <SectionTitle>
3 Production of Human Authored Short
Summaries
</SectionTitle>
    <Paragraph position="0"> Our aim is to investigate an effective, robust approach to summary evaluation. In this paper, we identify and quantify the aspect of human subjectivity while authoring short summaries. To this end, four subjects produced a short summary (approximately 100 characters, or 15 words) for broadcast news stories given a simple instruction set. This summary is referred to as a 'one line' summary because it corresponds approximately to the average sentence length for this data set.</Paragraph>
    <Section position="1" start_page="9" end_page="10" type="sub_section">
      <SectionTitle>
3.1 Author Profiles
</SectionTitle>
      <Paragraph position="0"> Four summary authors are briefly profiled below:  Subject A. A linguist by profession, a polyglot out of interest, and an author by hobby. This subject is fluent in English, Spanish and French; English being the first language. The subject is trained to write summaries and translations.</Paragraph>
      <Paragraph position="1"> Subject B. A manager by qualification and a polyglot by necessity; English is a second language. This subject was trained in making presentations and documentation. We hoped to benefit from the synergy  of both fields for summary production.</Paragraph>
      <Paragraph position="2"> Subject C. A physicist by qualification and currently working towards a PhD in speech recognition. English is the first language. In addition, this subject has an interest in theatre and drama, thus is exposed to literature and related fields.</Paragraph>
      <Paragraph position="3"> Subject D. Working on research in multiparty meetings as a post doctoral fellow. English is the first language for this subject. Experience of meeting summarisation. null All subjects are educated to at least graduate level, and have are fluent in English. It was expected that they could produce summaries of good quality without detailed instruction or further training. A simple instruction set (discussed later) was given, leaving wide room for interpretation about what might be included in the summary. Hence subjectivity was promoted.</Paragraph>
    </Section>
    <Section position="2" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
3.2 Data
</SectionTitle>
      <Paragraph position="0"> The human subjects worked on a small subset of American broadcast news stories from the TDT-2 corpus (Cieri et al., 1999). They were used for NIST TDT evaluations and the TREC-8 and TREC-9 spoken document retrieval evaluations. Each program in the corpus contained 7 to 8 news stories on average, spanning 30 minutes as broadcast which might be reduced to 22 minutes once advertisement breaks were removed. A set of 51 hand transcriptions were manually selected from the corpus. The average length was 487 words in 25 sentences per transcription. null</Paragraph>
    </Section>
    <Section position="3" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
3.3 Instructions
</SectionTitle>
      <Paragraph position="0"> Summary production. A simple instruction was given to the subjects in order to arrive at a summary: a0 Each summary should contain about 100 characters, possibly in the subject's own words.</Paragraph>
      <Paragraph position="1"> As the news stories ranged from 16 to 84 sentences, subjects would have to prioritise information that could be included in their 'one line' summary. The instruction implicitly encouraged the subjects to put as much important information as possible into a summary, while maintaining a good level of fluency.</Paragraph>
      <Paragraph position="2"> It was also a flexible instruction so that subjects were able to use their own expressions when necessary.</Paragraph>
      <Paragraph position="3"> After completion of the task, they commented that this instruction made them experiment with different words to shorten or expand the information they wanted to include. For example, how could an earthquake disaster be expressed in different ways: 8000+ feared dead? a1a2a1a2a1 or thousands of people killed? a1a2a1a2a1 or a lot of people are believed to be dead? Another feature of this instruction was the amount of generalisation that a subject was likely to use. For example, a subject could say US Senate to decide on tobacco bill but given the length constraints, it could be like Senate to vote on bill, hiking tobacco price while adding extra information, but omitting specific details.</Paragraph>
      <Paragraph position="4"> Questionnaire production. When producing summaries, subjects were aware that they also had to prepare questions with the following instructions:  a0 A questionnaire may consists of 2-4 questions; a0 An answer must be found in the particular summary, without reading the entire story; a0 Yes / no questions should not be used; a0 The summary may roughly be reconstructed  from the question-answer set.</Paragraph>
      <Paragraph position="5"> Each fact might be questioned in such a way that the particular summary could be recovered. Ideally we would expect each question to elicit a precise information point chosen for the summary -- e.g., who did it, when did it happen, what was the cause? The question-answer set enabled us to gauge the most relevant information as decided by the subjects, so that their subjectiveness became apparent.</Paragraph>
    </Section>
    <Section position="4" start_page="10" end_page="11" type="sub_section">
      <SectionTitle>
3.4 Full Sample
</SectionTitle>
      <Paragraph position="0"> A 'one line' summary-questionnaire pair was produced for 51 broadcast news stories by each of the four subjects. The statistics in Table 1 show the average number of words and characters for each summary. It is observed that Subjects A (6.1 characters / word) and C (5.8) tended to use longer words than B  words and characters for each summary, and the average number of questions per summary.</Paragraph>
      <Paragraph position="1"> (4.9) and D (5.3). The table also shows how the average number of questions varies between subjects. Table 2 shows a full sample. The complete news story is found in the Appendix. The difference between the four summaries can be clearly observed.</Paragraph>
      <Paragraph position="2"> One noticeable aspect is the amount of abstraction preferred by various subjects. Both Subjects A and D fully utilised words from the news story and made a small amount of abstraction. In particular, Sub-ject A chose to pick out a person ('Fisher') who conducted the study, while D opted for specifics of the study ('dopamine' -- a responsible chemical).</Paragraph>
      <Paragraph position="3"> On the other hand, Subjects B and C have rendered their interpretation of the story in their own expressions. They have produced a highly abstracted summary reflecting the sense of the story while ignoring the specifics -- nevertheless they were very different from each other. All four summaries happen to be of good quality, however it is the sheer divergence in the words, the expressions and subjective interpretation that is striking.</Paragraph>
      <Paragraph position="4"> Word usage among the subjects is also interesting -- e.g., 'visual images' as against 'physical traits'; similarly 'inner feelings' as against 'chemistry'. Such expressions and idioms are open for interpretation, making it difficult to quantify the informativeness of any summary.</Paragraph>
      <Paragraph position="5"> There also exist many factual news stories among the 51 test stories. It is left for a future study to compare between factual and non-factual news, in particular about the amount of abstraction.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="11" end_page="12" type="metho">
    <SectionTitle>
4 Cross Comprehension Test
</SectionTitle>
    <Paragraph position="0"> Each question can extract a relevant answer from the particular summary by the same author. If a question set were applied to a different summary, some answers may be discernible whereas others may not.</Paragraph>
    <Paragraph position="1"> The cross comprehension test achieves this by swap-</Paragraph>
    <Section position="1" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
Subject A
</SectionTitle>
      <Paragraph position="0"> Summary: Fisher's study claims we seek partners using unconscious love maps; women prefer status, men go for physical traits. Questions:  1. Who is the author of this study? 2. What claim does the researcher make concerning our method for seeking a sexual partner? 3. What do women look for in men? 4. What do men go for?</Paragraph>
    </Section>
    <Section position="2" start_page="11" end_page="12" type="sub_section">
      <SectionTitle>
Subject B
</SectionTitle>
      <Paragraph position="0"> Summary: Internal feelings of love between men and women are unique; external features depend on culture. Questions:  1. What are unique? 2. What is this topic about? 3. What differs between men and women? 4. Why does it differ?  1. What do women look for in men? 2. What do men look for in women? 3. What is the chemical that controls attraction?  from broadcast news stories by four subjects. ping a summary-questionnaire pair, i.e., each summary was paired with questions produced by different authors. Figure 1 illustrates the way it works. A single judge examines whether each question can be answered by reading a swapped summary. The judge is a person different from the four summary authors. Further, if the answer is found, it may be relevant, partially relevant, or totally irrelevant to the one expected by the author. Thus, the decision is made from the following four options: relevant: a relevant answer is found -- the answer is deemed to be relevant if it conveys the same meaning as expected by the author even if a different expression is used; partially relevant: an answer is partially relevant;  summary-questionnaire pairs between subjects. For example, a summary by Subject A may be questioned by those set by Subjects B, C, and D. irrelevant: an answer is found, but is totally different from that expected by the author. not found: no answer is found.</Paragraph>
      <Paragraph position="1"> Sample (re-visited). Table 3 shows the summary and questions crossed from the sample in Table 2. For example, when the 'one line' summary authored  by Subject A is matched with Subject B's questions, corresponding answers may be 1. ?; 2. seeking partners; 3. women prefer status, men go for physical traits; 4. unconscious love maps.</Paragraph>
      <Paragraph position="2"> We may thus conclude answers are 'not found', 'relevant', 'irrelevant', and 'partially relevant' because, from Table 2, actual answers sought by B were 1. internal feelings; 2. love between men and women; 3. external features; 4. cultural reason.</Paragraph>
      <Paragraph position="3">  Compensating ill-framed questions. We are aware that not all 'one line' summaries were well written. For example, it may be difficult to reach the expected answer ('external features') for Question</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="12" end_page="12" type="metho">
    <SectionTitle>
3 by Subject B ('What differs between men and women?')
</SectionTitle>
    <Paragraph position="0"> by reading the summary from the same subject.</Paragraph>
    <Paragraph position="1"> Moreover, subjects occasionally set a question that could not be answered properly by reading the particular summary alone. By crossing the summary-questionnaire pair, ill-framed questions are effectively compensated, because they are equally posed to all candidate summaries.</Paragraph>
    <Paragraph position="2"> Judgement difficulty. One potential problem in this scheme is the difficulty a judge may face when choosing from the four options. A judge's decision can also be affected by subjectivity. Our assumptions are that (1) because there are only four options, there is less room for the subjectivity in comparison Summary by Subject A: Fisher's study claims we seek partners using unconscious love maps; women prefer status, men go for physical traits.  Questions by Subject B: 1. What are unique? (N) 2. What is this topic about? (R) 3. What differs between men and women? (I) 4. Why does it differ? (P) Questions by Subject C: 1. What is being discussed? (R) 2. What are the factors affecting the particular event? (R) Questions by subject D: 1. What do men look for in women? (R) 2. What do women look for in men? (R) 3. What is the chemical that controls attraction? (N)  tioned by Subjects B, C, or D? (R), (P), (I), and (N) after each question indicate the answer is relevant, partially relevant, irrelevant, and not found. to the summary writing task, and that (2) a decision between 'relevant' and 'partially relevant' and one between 'irrelevant' and 'not found' are both not very important because the former two are roughly associated with commonly shared information and the latter two correspond to the subjective part. Although the following section shows results by a single judge, we are currently conducting the same experiments using multiple judges in order to quantify our assumptions.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="12" end_page="14" type="metho">
    <SectionTitle>
5 Evaluation Results
</SectionTitle>
    <Paragraph position="0"> Each of the four 'one line' summaries from the 51 broadcast news stories were evaluated using three sets of 'crossed' questions.</Paragraph>
    <Section position="1" start_page="12" end_page="13" type="sub_section">
      <SectionTitle>
5.1 Summary Relevance
</SectionTitle>
      <Paragraph position="0"> Figure 2(a) shows, when paired with questions by other subjects, how many answers could be found in a candidate summary. The figure indicates that summaries authored by the different subjects contained 'relevant' information for less than half (47% overall average for four subjects) of questions. The number goes up slightly (61%) if 'partially relevant' answers are included. The number of answers that were 'not found' indicates the level of subjectivity for this 'summary writing' exercise; more than one third (35%) of information that one subject thought  questionnaire relevance was calculated when evaluated against summaries by other subjects. was the most important was discarded by the others. We surmise that 'irrelevant' answers were also caused by the subjectivity; occasionally authors arrived at contradictory summaries of the same story due to its ambiguous nature. In such cases, questions were produced from that author's subjective view, and they certainly affected the relevance of a summary by the other subject.</Paragraph>
      <Paragraph position="1"> Another notable outcome of this experiment is that the number of answers found 'relevant', 'partially relevant' or 'irrelevant' was 71%, 61%, 54% and 73% for Subjects A, B, C, and D, respectively. This seems roughly proportional to the average length of summaries by each subject (113, 99, 81, and 131 characters, respectively). The longer the summary, the more information one can write in the summary. It is thus hypothesised that only the summary length matters for finding the 'relevant' information in summaries. Looking at this outcome from a different perspective, there is no evidence that one author was more subjective than the others.</Paragraph>
    </Section>
    <Section position="2" start_page="13" end_page="13" type="sub_section">
      <SectionTitle>
5.2 Questionnaire Relevance
</SectionTitle>
      <Paragraph position="0"> Figure 2(b) shows, when paired with summaries by other subjects, how many candidate questions could be answered. It is based on the same evaluation as 2(a), but observed from the different angle. Approximately the same number (55-59%) of 'relevant', and 'partially relevant' answers were found for Subjects A, B, and D. However, it was much higher (80%) for Subject C. The reason seems to be that this subject frequently set questions that might accept a wide range of answers, while other subjects tended to frame questions that required more specific information in the summary; e.g., Subject C's 'what is being discussed?' was a general question that was more likely to have some answer than Subject B's question 'what differs between men and women?'.</Paragraph>
    </Section>
    <Section position="3" start_page="13" end_page="14" type="sub_section">
      <SectionTitle>
5.3 Discussion
</SectionTitle>
      <Paragraph position="0"> The overall number of 'relevant' and 'partially relevant' answers found by the cross comprehension test was just over 61% for four subjects. This accounts for the amount of information that was agreed by all the subjects as important. For more than one third of summary contents, subjects had different opinions about whether they should be in their 'one line' summaries, resulting in categories such as 'irrelevant' or 'not found'. Occasionally these categories resulted from ill-framed questions, but such questions were infrequent. For most of the cases, they were caused by the subjectivity of a different individual.</Paragraph>
      <Paragraph position="1"> We noted earlier that only the summary length matters and there is no evidence that one author was more subjective than the others. It is probably because, given a clear instruction about the summary length (i.e., roughly 100 characters for this task), there is an upper bound for the amount of information that anyone can fit into the summary, while maintaining fluency. When the summary is short, one has to make a serious decision about which important information should go into a summary, and the decision often reflects one's subjective thoughts.</Paragraph>
      <Paragraph position="2"> Our argument is that, assuming the subject's effort, the amount of subjectivity was controlled by the summary length constraints rather than an individual's nature.</Paragraph>
      <Paragraph position="3">  maries by the cross comprehension test.</Paragraph>
      <Paragraph position="4"> The diversity of summaries caused by individual subjectivity may be alleviated by carefully drafting an instruction set. However it probably results in a large list of instructions, and the drafting process certainly will not be straightforward. Further, it is not likely that we can ever completely remove the subjectivity from human work. Indeed, if subjectivity disappeared from human authored summary by well crafted instructions, it would be more like turning human activity into a mechanical process, rather than a machine to simulate human work.</Paragraph>
      <Paragraph position="5"> A non-trivial problem of the approach may be the amount of human effort needed for evaluation. Production of summary-questionnaire pairs may not be difficult, as it is based on a simple instruction set and even accepts ill-framed questions, but it still requires human time. On the other hand, a judge's role is the most critical -- it is labour intensive, and the effect of potentially subjective judgement needs to be studied. null Although certainly not flawless, the cross comprehension test has its own advantage. A simple instruction set is effective; it encourages authors to make their best effort to put as much information into a short summary. Most importantly, the test is robust; it sometimes causes ill-framed questions, but they can be compensated by relative comparison achieved by crossing summary-questionnaire pairs.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="14" end_page="14" type="metho">
    <SectionTitle>
6 Evaluation of Machine Generated
Summaries
</SectionTitle>
    <Paragraph position="0"> The objective of this evaluation is to measure the information content of machine generated summaries using a human authored summary as a yardstick.</Paragraph>
    <Paragraph position="1"> Although very subjective for many cases, a human summary can still be a reference if we do not treat them as a 'gold standard'.</Paragraph>
    <Paragraph position="2"> The cross comprehension test of machine generated and human authored summaries is illustrated in Machine generated summary: senate to vote to approve the expansion of north atlantic treaty organisation to bigger nato means us obligations Summary by subject B: US Senate to decide on NATO expansion; US assesses bigger NATO more arms deal but poor ties with Russia.  Questions by subject D: 1. What is happening to the NATO? 2. Who sees this move as a threat? 3. Who is bearing the main cost?  from the one who wrote the summary. A human authored summary may still be the best summary in many respects, but it will no longer be considered perfect. One may target the relevance level of the human summary (e.g., 61% for the 'one line' summary task from the broadcast news stories) for automatic summarisation research.</Paragraph>
    <Paragraph position="3"> Table 4 shows one example from those with which we are currently experimenting. Answers sought by Subject D were 'expansion', 'Russian', and 'American taxpayers', respectively. Given this question set, answers are 'relevant', 'relevant', and 'not found' for the summary by Subject B, and answers found in the machine generated summary are 'relevant', 'not found', and 'not found', respectively.</Paragraph>
  </Section>
class="xml-element"></Paper>