<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1077"> <Title>TextTiling VecTile \[ Subjects</Title> <Section position="6" start_page="592" end_page="592" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="592" end_page="592" type="sub_section"> <SectionTitle> 5.1 The Task </SectionTitle> <Paragraph position="0"> In a pilot study, five subjects were presented with five texts from a popular-science magazine, all between 2,000 and 3,400 words (20 to 35 paragraphs) in length. Section headings and any other layout clues were removed, but paragraph breaks were left in place. The task was therefore not to find paragraph breaks, but breaks between multi-paragraph passages that, in the subject's judgment, marked topic shifts. All subjects were native speakers of English. The instructions read, in part: &quot;... length with section breaks removed. Please mark the places where the topic seems to change (draw a line between paragraphs). Read at normal speed; do not take much longer than you normally would. But do feel free to go back and reconsider your decisions (even change your markings) as you go along.</Paragraph> <Paragraph position="1"> Also, for each section, suggest a headline of a few words that captures its main content.</Paragraph> <Paragraph position="2"> If you find it hard to decide between two places, mark both, giving preference to one and indicating that the other was a close rival.&quot;</Paragraph> </Section> <Section position="2" start_page="592" end_page="592" type="sub_section"> <SectionTitle> 5.2 Results </SectionTitle> <Paragraph position="0"> To obtain an &quot;expert opinion&quot; against which to compare the algorithms, those paragraph boundaries that at least three of the five subjects had marked were counted as &quot;correct&quot; section breaks.</Paragraph> <Paragraph position="1"> (Three out of seven (Litman and Passonneau, 1995; Hearst, 1997) or 30% (Kozima, 1994) are also sometimes deemed sufficient.) For the two systems as well as for the subjects, precision and recall with respect to the set of &quot;correct&quot; section breaks were calculated. The results are listed in Table 1.</Paragraph> <Paragraph position="2"> The context vectors clearly led to improved performance over the counting of pure string repetitions. The simple assignment of section breaks to the nearest paragraph boundary may have introduced noise in some cases; moreover, it is not really part of the task of measuring cohesion. The texts were therefore processed again, this time moving the windows over whole paragraphs at a time and calculating gap values at the paragraph gaps. For each paragraph break, the number of subjects who had marked it as a section break was taken as an indicator of the &quot;strength&quot; of the boundary. There was a significant negative correlation between the values calculated by both systems and that measure of strength.</Paragraph> <Paragraph position="3"> [correlation coefficients not recoverable from this version] </Paragraph> <Paragraph position="4"> In other words, deep gaps in the similarity measure are associated with strong agreement between subjects that the spot marks a section boundary. Although r² is low in both cases, the VecTile system yields the more significant results.</Paragraph> </Section>
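The evaluation just described, deriving &quot;correct&quot; breaks by majority vote, scoring each segmenter's breaks by precision and recall, and correlating paragraph-gap values with boundary strength, can be illustrated with a minimal sketch. The code below is not the authors' implementation; the data layout (sets of paragraph-gap indices per judge, one gap value per paragraph break) and the use of a plain Pearson correlation are assumptions for illustration only.

# Minimal sketch of the evaluation in Section 5.2 (not the authors' code).
# Boundaries are represented as indices of paragraph gaps (0 = gap after paragraph 1).

from math import sqrt

def gold_breaks(judge_marks, min_votes=3):
    """Paragraph gaps marked by at least `min_votes` judges count as 'correct' breaks."""
    votes = {}
    for marks in judge_marks:              # one set of gap indices per judge
        for gap in marks:
            votes[gap] = votes.get(gap, 0) + 1
    return {gap for gap, v in votes.items() if v >= min_votes}, votes

def precision_recall(predicted, gold):
    """Precision and recall of predicted section breaks w.r.t. the 'correct' set."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

def pearson(xs, ys):
    """Plain Pearson correlation between gap values and boundary 'strength' (votes)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: five judges, ten paragraph gaps, gap values from one segmenter.
judge_marks = [{2, 6}, {2, 6, 8}, {2, 5}, {6}, {2, 6}]
gold, votes = gold_breaks(judge_marks)           # here: {2, 6}
p, r = precision_recall({2, 5, 6}, gold)         # precision 0.67, recall 1.0
gap_values = [.9, .8, .3, .85, .7, .6, .2, .8, .5, .9]
strength = [votes.get(i, 0) for i in range(len(gap_values))]
r_corr = pearson(gap_values, strength)           # expected to be negative

On this invented data the deep gaps (low similarity values at gaps 2 and 6) coincide with the breaks most judges marked, which is exactly the negative correlation reported above.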
<Section position="3" start_page="592" end_page="592" type="sub_section"> <SectionTitle> 5.3 Discussion and Further Work </SectionTitle> <Paragraph position="0"> The results discussed above need further support from a larger subject pool, as the level of agreement among the judges was at the low end of what can be considered significant. This is shown by the Kappa coefficients, measured against the expert opinion and listed in Table 2. The overall average was .594.</Paragraph> <Paragraph position="1"> Despite this caveat, the results clearly show that adding collocational information from the training corpus improves the prediction of section breaks and hence, under common assumptions, the measurement of lexical cohesion. It is likely that these encouraging results can be further improved; the following are a few suggestions for doing so. Some factors work against the context vector method. For instance, the system currently has no mechanism for handling words for which it has no context vectors. Often it is precisely the co-occurrence of uncommon words not in the training corpus (personal names, rare terminology, etc.) that ties text together. Such cases pose no challenge to the string-based system, but the VecTile system cannot utilize them. The best solution might be a hybrid system with a backup procedure for unknown words.</Paragraph> <Paragraph position="2"> Another point to note is how well the much simpler TextTile system compares. Indeed, a close look at the figures in Table 1 reveals that the better results of the VecTile system are due in large part to one of the texts, viz. #2. Considering the additional effort and resources involved in using context vectors, the modest boost in performance might often not be worth it in practice. This suggests that pure string repetition is a particularly strong indicator of similarity, and the vector-based system might benefit from a mechanism that gives exact repetitions a higher weight than co-occurrences of merely similar words.</Paragraph> <Paragraph position="3"> Another potentially important parameter is the nature of the training corpus. In this case, it consisted mainly of news texts, while the texts in the experiment were scientific expository texts. A more homogeneous setting might have further improved the results.</Paragraph> <Paragraph position="4"> Finally, the evaluation of results in this task is complicated by the fact that &quot;near-hits&quot; (cases in which a section break is off by one paragraph) do not have any positive effect on the score. This problem has been dealt with in the Topic Detection and Tracking (TDT) project by a more flexible score that becomes gradually worse as the distance between hypothesized and &quot;real&quot; boundaries increases (TDT, 1997a; TDT, 1997b).</Paragraph> </Section> </Section> </Paper>
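The graded scoring idea mentioned at the end of Section 5.3 can be sketched as follows. This is only an illustration of one way to give &quot;near-hits&quot; partial credit, with an invented linear decay and distance tolerance; it is not the TDT evaluation metric itself.

# Sketch of a distance-tolerant boundary score in the spirit of the TDT evaluation
# mentioned above: a hypothesized break earns credit that decays with its distance
# (in paragraphs) from the nearest 'real' break. Not the official TDT metric.

def graded_score(predicted, gold, tolerance=2):
    """Average credit per gold break; a miss by d paragraphs earns max(0, 1 - d/tolerance)."""
    if not gold:
        return 0.0
    total = 0.0
    for g in sorted(gold):
        d = min(abs(g - p) for p in predicted) if predicted else tolerance
        total += max(0.0, 1.0 - d / tolerance)
    return total / len(gold)

# With gold breaks at gaps {2, 6}: an exact hit at 6 scores 1.0,
# a near-hit at 3 (off by one from 2) scores 0.5, so the average is 0.75.
print(graded_score({3, 6}, {2, 6}))   # 0.75

Under a strict hit-or-miss score the near-hit at gap 3 would earn nothing; the graded variant rewards it partially, which is the behaviour the paragraph above argues for.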