<?xml version="1.0" standalone="yes"?>
<Paper uid="P94-1002">
  <Title>MULTI-PARAGRAPH SEGMENTATION OF EXPOSITORY TEXT</Title>
  <Section position="6" start_page="21121" end_page="21121" type="evalu">
    <SectionTitle>
EVALUATION
</SectionTitle>
    <Paragraph position="0"> One way to evaluate these segmentation algorithms is to compare against judgments made by human readers, another is to compare the algorithms against texts premarked by authors, and a third way is to see how well the results improve a computational task. This section compares the algorithm against reader judgments, since author markups are fallible and are usually applied to text types that this algorithm is not designed for, and Hearst (1994) shows how to use TextTiles in a task (although it does not show whether or not the results of the algorithms used here are better than some other algorithm with similar goals).</Paragraph>
    <Section position="1" start_page="21121" end_page="21121" type="sub_section">
      <SectionTitle>
Reader Judgments
</SectionTitle>
      <Paragraph position="0"> Judgments were obtained from seven readers for each of thirteen magazine articles which satisfied the length criteria (between 1800 and 2500 words) 5 and which contained little structural demarkation. The judges SOne longer text of 2932 words was used since reader judgments had been obtained for it from an earlier experiment. Judges were technical researchers. Two texts had three or four short headers which were removed for consistency.</Paragraph>
      <Paragraph position="1">  were asked simply to mark the paragraph boundaries at which the topic changed; they were not given more explicit instructions about the granularity of the segmentation. null Figure 3 shows the boundaries marked by seven judges on the Stargazers text. This format helps illustrate the general trends made by the judges and also helps show where and how often they disagree. For instance, all but one judge marked a boundary between paragraphs 2 and 3. The dissenting judge did mark a boundary after 3, as did two of the concurring judges. The next three major boundaries occur after paragraphs 5, 9, 12, and 13. There is some contention in the later paragraphs; three readers marked both 16 and 18, two marked 18 alone, and two marked 17 alone. The outline in the Introduction gives an idea of what each segment is about.</Paragraph>
      <Paragraph position="2"> Passonneau &amp; Litman (1993) discuss at length considerations about evaluating segmentation algorithms according to reader judgment information. As Figure 3 shows, agreement among judges is imperfect, but trends can be discerned. In Passonneau &amp; Litman's (1993) data, if 4 or more out of 7 judges mark a boundary, the segmentation is found to be significant using a variation of the Q-test (Cochran 1950). My data showed similar results. However, it isn't clear how useful this significance information is, since a simple majority does not provide overwhelming proof about the objective reality of the subtopic break. Since readers often disagree about where to draw a boundary marking for a topic shift, one can only use the general trends as a basis from which to compare different algorithms. Since the goals of TextTiling are better served by algorithms that produce more rather than fewer boundaries, I set the cutoff for &amp;quot;true&amp;quot; boundaries to three rather than four judges per paragraph. 6 The remaining gaps are considered nonboundaries.</Paragraph>
    </Section>
    <Section position="2" start_page="21121" end_page="21121" type="sub_section">
      <SectionTitle>
Results
</SectionTitle>
      <Paragraph position="0"> Figure 4 shows a plot of the results of applying the block comparison algorithm to the Stargazer text. When the lowermost portion of a valley is not located at a paragraph gap, the judgment is moved to the nearest paragraph gap. 7 For the most part, the regions of strong similarity correspond to the regions of strong agreement among the readers. (The results for this text were fifth highest out of the 13 test texts.) Note however, that the similarity information around paragraph 12 is weak. This paragraph briefly summarizes the contents of the previous three paragraphs; much of the terminol6Paragraphs of three or fewer sentences were combined with their neighbor if that neighbor was deemed to follow at &amp;quot;true&amp;quot; boundary, as in paragraphs 2 and 3 of the Stargazers text.</Paragraph>
      <Paragraph position="1"> rThis might be explained in part by (Stark 1988) who shows that readers disagree measurably about where to place paragraph boundaries when presented with texts with those boundaries removed.</Paragraph>
      <Paragraph position="2"> ogy that occurred in all of them reappears in this one location (in the spirit of a Grosz ~; Sidner (1986) &amp;quot;pop&amp;quot; operation). Thus it displays low similarity both to itself and to its neighbors. This is an example of a breakdown caused by the assumptions about the subtopic structure. It is possible that an additional pass through the text could be used to find structure of this kind.</Paragraph>
      <Paragraph position="3"> The final paragraph is a summary of the entire text; the algorithm recognizes the change in terminology from the preceding paragraphs and marks a boundary; only two of the readers chose to differentiate the summary; for this reason the algorithm is judged to have made an error even though this sectioning decision is reasonable. This illustrates the inherent fallibility of testing against reader judgments, although in part this is because the judges were given loose constraints.</Paragraph>
      <Paragraph position="4"> Following the advice of Gale et al. (1992a), I compare the Mgorithm against both upper and lower bounds.</Paragraph>
      <Paragraph position="5"> The upper bound in this case is the reader judgment data. The lower bound is a baseline algorithm that is a simple, reasonable approach to the problem that can be automated. A simple way to segment the texts is to place boundaries randomly in the document, constraining the number of boundaries to equal that of the average number of paragraph gaps assigned by judges.</Paragraph>
      <Paragraph position="6"> In the test data, boundaries are placed in about 41% of the paragraph gaps. A program was written that places a boundary at each potential gap 41% of the time (using a random number generator), and run 10,000 times for each text, and the average of the scores of these runs was found. These scores appear in Table 1 (results at 33% are also shown for comparison purposes).</Paragraph>
      <Paragraph position="7"> The algorithms are evaluated according to how many true boundaries they select out of the total selected (precision) and how many true boundaries are found out of the total possible (recall) (Salton 1988). The recall measure implicitly signals the number of missed boundaries (false negatives, or deletion errors); the number of false positives, or insertion errors, is indicated explicitly. null In many cases the algorithms are almost correct but off by one paragraph, especially in the texts that the algorithm performs poorly on. When the block similarity algorithm is allowed to be off by one paragraph, there is dramatic improvement in the scores for the texts that lower part of Table 2, yielding an overall precision of 83% and recall of 78%. As in Figure 4, it is often the case that where the algorithm is incorrect, e.g., paragraph gap 11, the overall blocking is very close to what the judges intended.</Paragraph>
      <Paragraph position="8"> Table 1 shows that both the blocking algorithm and the chaining algorithm are sandwiched between the upper and lower bounds. Table 2 shows some of these results in more detail. The block similarity algorithm seems to work slightly better than the chaining algorithm, although the difference may not prove significant over the long run. Furthermore, in both versions of the algorithm, changes to the parameters of the algorithm  perturbs the resulting boundary markings. This is an undesirable property and perhaps could be remedied with some kind of information-theoretic formulation of the problem.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>