<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1009">
  <Title>A Common Theory of Information Fusion from Multiple Text Sources Step One: Cross-Document Structure</Title>
  <Section position="3" start_page="74" end_page="75" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="74" end_page="75" type="sub_section">
      <SectionTitle>
2.1 Document structure
</SectionTitle>
      <Paragraph position="0"> Rhetorical Structure Theory (RST) [Mann &amp; Thompson 88, Mann 00] is a comprehensive theory of text organization. It is based on &quot;text coherence&quot;, or the presence in &quot;carefully written text&quot; of unity that would not appear in random sequences of sentences. RST posits the existence of relations among sentences. Most relations consist of one or more nuclei (the central components of a rhetorical relation) and zero or more satellites (the supporting components of the relation). An example of an RST relation is evidence, which is decomposed into a nucleus (a claim) and a satellite (text that supports the claim). RST is intentionally limited to single documents. With CST, we attempt to describe the rhetorical structure of sets of related documents. Unlike RST, CST cannot rely on the deliberateness of writing style. We can, however, make use of some observations of structure across documents which, while clearly not deliberate in the RST sense, can be quite predictable and useful. In a sense, CST attributes a certain behavior to a &quot;collective document author&quot; (that is, the collectivity of all authors of the related documents).</Paragraph>
      <Paragraph position="1"> A pioneering study in the typology of links among documents is described in [Trigg 83, Trigg &amp; Weiser 87]. Trigg introduces a taxonomy of link types across scientific papers.</Paragraph>
      <Paragraph position="2"> The 80 suggested link types, such as citation, refutation, revision, equivalence, and comparison, are grouped in two categories: Normal (inter-document links) and Commentary (deliberate cross-document links). While the taxonomy is quite exhaustive, it is neither appropriate nor intended for general-domain texts (that is, texts other than scientific articles).</Paragraph>
      <Paragraph position="3"> A great deal of research in the automatic induction of document and hyperdocument structure is due to Salton's group at Cornell [Salton et al. 91]. [Allan 96] presents a graph simplification technique for &quot;hyperlink typing&quot;, that is, assigning link types from Trigg's list to links between sentences or paragraphs of a pair of documents. Allan tested his techniques on sets of very distinct articles (e.g. &quot;John F. Kennedy&quot; and &quot;United States of America&quot; from the Funk and Wagnalls encyclopedia). As the author himself admits, the evaluation in [Allan 96] is very weak and does not indicate whether the techniques actually achieve anything useful.</Paragraph>
      <Paragraph position="4"> More recently, [Salton et al. 97] introduced a technique for document structuring based on semantic hyperlinks (among pairs of paragraphs which are related by a lexical similarity significantly higher than random). The authors represent single documents from the Funk and Wagnalls encyclopedia on topics such as Abortion or Nuclear Weapons in the form of text relationship maps. These maps exploit the bushiness (or number of connecting edges) of a paragraph to decide whether to include it in a summary of the entire article. The assumption underlying their technique is that bushy paths (or paths connecting highly connected paragraphs) are more likely to contain information central to the topic of the article. The summarization techniques described in Salton et al.'s research are limited to single documents.</Paragraph>
      <Paragraph position="5"> One of the goals of CST is to extend the techniques set forth in Trigg, Salton, and Allan's work to cover sets of related documents in arbitrary domains.</Paragraph>
    </Section>
    <Section position="2" start_page="75" end_page="75" type="sub_section">
      <SectionTitle>
2.2 Multi-document summarization
</SectionTitle>
      <Paragraph position="0"/>
    </Section>
  </Section>
</Paper>