<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0407"> <Title>Evaluation of Phrase-Representation Summarization based on Information Retrieval Task</Title> <Section position="4" start_page="60" end_page="64" type="metho"> <SectionTitle> 3 Improvements </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="60" end_page="61" type="sub_section"> <SectionTitle> 3.1 Description of Questions </SectionTitle> <Paragraph position="0"> To assess the relevance accurately, the situation of information retrieval should be realistic enough for the subjects to feel as if they really want to know about a given question. The previous experiments gave only a short description of a topic. We consider it is not sufficiently specific and the interpretation of a question must varied with the subjects.</Paragraph> <Paragraph position="1"> We selected two topics (&quot;moon cake&quot; and &quot;journey in Malay. Peninsula&quot;) and assumed three questions. To indicate to the subjects, we set detailed situation including the motivation to know about that or the use of the information obtained for each question. This method satisfies the restriction &quot;to limit the variation in assessment between readers&quot; in the MLUCE Protocol (Minel, et ai. 1997).</Paragraph> <Paragraph position="2"> For each topic, ten documents are selected from search results by major WWW search engines, so that more than five relevant documents are included for each question. The topics, the outline of the questions, the queries for WWW search, and the number of relevant documents are shown in Table 2. The description of Question-a2 that was given to the subjects is shown in Fig. 3.</Paragraph> <Paragraph position="3"> One day just after the mid-autumn festival, my colleague Mr. A brought some moon cakes to the office. He said that one of his Chinese friends had given them to him. They rooked so new to us that we shared and ate them at a coffee break. Chinese eat moon cakes at the mid-autumn festival while Japanese have dumplings then. Someone asked a question why Chinese ate moon cakes, to-which nobody gave the answer. Some cakes tasted sweet as we expected; some were stuffed with salty fillings like roasted pork. Ms. B said that there were over fifty kinds of filling. Her story made me think of a question: What kinds of filling are there for moon cakes sold at the mid-autumn festival in</Paragraph> </Section> <Section position="2" start_page="61" end_page="61" type="sub_section"> <SectionTitle> 3.2 Number of Subjects per Summary Sample </SectionTitle> <Paragraph position="0"> In the previous experiments, one to three subjects were assigned to each summary sample.</Paragraph> <Paragraph position="1"> Because the judgement must vary with the subjects even if a detailed situation is given, we assigned ten subjects per summary sample to reduce the influence of each person's assessment. The only requirement for subjects is that they should be familiar with WWW search process.</Paragraph> </Section> <Section position="3" start_page="61" end_page="61" type="sub_section"> <SectionTitle> 3.3 Relevance Levels </SectionTitle> <Paragraph position="0"> In the previous experiments, a subject reads a summary and judges whether it is relevant or irrelevant. However, a summary sometimes does not give enough information for relevance judgement. In actual information retrieval situations, selecting criteria vary depending on the question, the motivation, and other circumstances. 
<Section position="4" start_page="61" end_page="62" type="sub_section">
<SectionTitle> 3.4 Measures of Accuracy </SectionTitle>
<Paragraph position="0"> In the previous experiments, precision and recall were used to measure accuracy. These measures have two drawbacks: (1) the variance in the subjects' assessments makes them inaccurate, and (2) the performance of each summary sample is not measured.</Paragraph>
<Paragraph position="1"> Precision and recall are widely used to measure information retrieval performance. In the evaluation of summarization, they are calculated as follows.</Paragraph>
<Paragraph position="3"> Precision = |R ∩ S| / |S| and Recall = |R ∩ S| / |R|, where S is the set of documents that are assessed relevant by a subject and R is the set of documents relevant to the question.</Paragraph>
<Paragraph position="5"> In the previous experiments, the assessment standard was not fixed, and some subjects tended to make the relevant set broader while others made it narrower. This variance reduces the significance of the average precision and recall values. Because we introduced four relevance levels and showed the assessment criteria to the subjects, we can assume three kinds of relevant document sets: L3 only, L3 + L2, and L3 + L2 + L1. The set composed only of the documents assessed L3 should yield a high precision score; this case represents a user who wants only high-confidence information, for example because the user is in a hurry or a single answer is sufficient. The set that also includes L1 documents should yield a high recall score; this case represents a user who wants any information related to a specific question.</Paragraph>
<Paragraph position="6"> Precision and recall represent the performance of a summarization method for a certain question; however, they do not indicate why the method performs better or worse. To find the reasons and to improve a summarization method accordingly, it is useful to analyze quality and performance together for each summary sample.</Paragraph>
<Paragraph position="7"> Measuring each summary's performance is necessary for such analysis. Therefore, we introduce the relevance score, which represents the correspondence between a subject's judgement and the correct document relevance.</Paragraph>
<Paragraph position="8"> The score for each pair of subject judgement and document relevance is shown in Table 3.</Paragraph>
<Paragraph position="9"> By averaging the scores of all subjects for each sample, the performances of the summaries are compared.</Paragraph>
<Paragraph position="10"> By averaging the scores of all summary samples for each summarization method, the performances of the methods are compared.</Paragraph>
</Section>
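To make the measures of this subsection concrete, the following sketch shows how precision, recall, the balanced F-measure reported in Section 4.1, and the per-sample relevance score could be computed from the subjects' judgements. It is an illustrative reconstruction under our own naming, not the authors' code; in particular, the relevance-score values must come from Table 3, which is not reproduced here, so the score table is supplied by the caller.

```python
from statistics import mean

def assessed_relevant(judgements: dict, cutoff: int) -> set:
    """Set S of Section 3.4: documents a subject assessed as relevant.
    judgements maps doc_id -> level (0-3); cutoffs 3, 2, 1 yield the
    L3-only, L3 + L2, and L3 + L2 + L1 sets respectively."""
    return {doc for doc, level in judgements.items() if level >= cutoff}

def precision_recall(judgements: dict, relevant: set, cutoff: int):
    """Precision = |R ∩ S| / |S|, Recall = |R ∩ S| / |R| (R = truly relevant docs)."""
    s = assessed_relevant(judgements, cutoff)
    hits = len(s & relevant)
    precision = hits / len(s) if s else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure used in Section 4.1."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def summary_relevance_score(levels: list, doc_is_relevant: bool, score_table: dict) -> float:
    """Average relevance score of one summary sample over its ten subjects.
    score_table maps (judged level, true relevance) -> score, i.e. the
    entries of Table 3 (values not reproduced here)."""
    return mean(score_table[(level, doc_is_relevant)] for level in levels)
```

Averaging summary_relevance_score over all samples of a summarization method then gives the per-method comparison described above.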
<Section position="5" start_page="62" end_page="62" type="sub_section">
<SectionTitle> 4.1 Accuracy </SectionTitle>
<Paragraph position="0"> The precision and recall are shown in Fig. 4, and the F-measure is shown in Fig. 5. The F-measure is the balanced score of precision and recall, calculated as follows:</Paragraph>
<Paragraph position="1"> F-measure = 2 × Precision × Recall / (Precision + Recall)</Paragraph>
<Paragraph position="2"> Figures 4 and 5 show that the phrase-represented summary (C) gives the highest performance. It satisfies both the high-precision and the high-recall requirements. Because WWW searches are performed in various situations, phrase-representation summarization is considered suitable for all of these cases.</Paragraph>
<Paragraph position="3"> The relevance score for each question is shown in Fig. 6. The phrase-represented summary (C) obtains the highest score on average, and the best scores for Question-a2 and Question-b. For Question-a1, though all summaries obtain poor scores, the sentence-extraction summary (B) is the best among them.</Paragraph>
</Section>
<Section position="6" start_page="62" end_page="64" type="sub_section">
<SectionTitle> 4.2 Time </SectionTitle>
<Paragraph position="0"> The time required to assess relevance is shown in Fig. 7. The time for Question-a is the sum of the times for Questions a1 and a2. For Question-a, the phrase-represented summary (C) requires the shortest time. For Question-b, the leading fixed-length characters (A) require the shortest time, a result that runs counter to intuition.</Paragraph>
<Paragraph position="1"> This requires further examination.</Paragraph>
</Section>
</Section>
</Paper>