<?xml version="1.0" standalone="yes"?>
<Paper uid="P94-1002">
<Title>MULTI-PARAGRAPH SEGMENTATION OF EXPOSITORY TEXT</Title>
<Section position="7" start_page="21121" end_page="21121" type="concl">
<SectionTitle>SUMMARY AND FUTURE WORK</SectionTitle>
<Paragraph position="0">This paper has described algorithms for the segmentation of expository texts into discourse units that reflect the subtopic structure of expository text. I have introduced the notion of the recognition of multiple simultaneous themes, which bears some resemblance to Chafe's Flow Model of discourse and Skorochod'ko's text structure types. The algorithms are fully implemented: term repetition alone, without use of thesaural relations, knowledge bases, or inference mechanisms, works well for many of the experimental texts. The structure it obtains is coarse-grained but generally reflects human judgment data.</Paragraph>
<Paragraph position="1">Earlier work (Hearst 1993) incorporated thesaural information into the algorithms; surprisingly, the latest experiments find that this information degrades the performance. This could very well be due to problems with the algorithm used. A simple algorithm that just posits relations among terms that are a small distance apart according to WordNet (Miller et al. 1990) or Roget's 1911 thesaurus (from Project Gutenberg), modeled after Morris and Hirst's heuristics, might work better. Therefore I do not feel the issue is closed, and instead consider successful grouping of related words as future work. As another possible alternative, Kozima (1993) has suggested using a (computationally expensive) semantic similarity metric to find similarity among terms within a small window of text (5 to 7 words).</Paragraph>
<Paragraph position="2">This work does not incorporate the notion of multiple simultaneous themes but instead just tries to find breaks in semantic similarity among a small number of terms. A good strategy may be to substitute this kind of similarity information for term repetition in algorithms like those described here. Another possibility would be to use semantic similarity information as computed in Schütze (1993), Resnik (1993), or Dagan et al. (1993).</Paragraph>
<Paragraph position="3">The use of discourse cues for detection of segment boundaries and other discourse purposes has been extensively researched, although predominantly on spoken text (see Hirschberg &amp; Litman (1993) for a summary of six research groups' treatments of 64 cue words). It is possible that incorporation of such information may provide a relatively simple way to improve the cases where the algorithm is off by one paragraph.</Paragraph>
</Section>
</Paper>
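
The term-repetition approach summarized in Paragraph 0 lends itself to a compact illustration. The following Python sketch is not the paper's implementation; the block size, tokenizer, and choice of cosine similarity are illustrative assumptions. It scores lexical overlap between adjacent blocks of paragraphs so that dips in the resulting curve suggest candidate subtopic boundaries:

```python
# A minimal sketch of boundary detection by term repetition alone,
# in the spirit of Paragraph 0. Block size is an illustrative
# choice, not the paper's actual parameter setting.
import math
import re
from collections import Counter

def tokenize(paragraph):
    """Lowercase word tokens; a real system would also stem and drop stopwords."""
    return re.findall(r"[a-z]+", paragraph.lower())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similarity_curve(paragraphs, block_size=2):
    """Similarity of term counts in the blocks on either side of each gap."""
    counts = [Counter(tokenize(p)) for p in paragraphs]
    scores = []
    for gap in range(1, len(counts)):
        left = sum(counts[max(0, gap - block_size):gap], Counter())
        right = sum(counts[gap:gap + block_size], Counter())
        scores.append(cosine(left, right))
    return scores  # low points are candidate subtopic boundaries
```

Because only raw term repetition feeds the counts, the sketch mirrors the paper's observation that no thesaural relations, knowledge bases, or inference mechanisms are needed to obtain a coarse-grained structure.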
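
Paragraph 1 proposes positing relations among terms that are a small distance apart in WordNet. Below is a hedged sketch of that idea using NLTK's WordNet interface; the similarity threshold is an assumption for illustration, not a value from the paper or from Morris and Hirst:

```python
# Treat two terms as related if some pair of their senses is close
# in the WordNet hypernym graph. path_similarity is 1/(1 + path
# length), so the 0.25 threshold admits paths of length <= 3.
# Requires: pip install nltk; then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def related(term1, term2, min_similarity=0.25):
    """True if any sense pair of the two terms is a short WordNet
    path apart; path_similarity returns None across parts of speech."""
    for s1 in wn.synsets(term1):
        for s2 in wn.synsets(term2):
            sim = s1.path_similarity(s2)
            if sim is not None and sim >= min_similarity:
                return True
    return False
```

Such relations could then be counted as "repetitions" when building the term-count blocks in the previous sketch, which is one way to test whether a simpler grouping algorithm avoids the performance degradation the paper reports.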
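
The closing paragraph speculates that discourse cues could repair boundaries that are off by one paragraph. A hypothetical sketch of such a repair pass follows; the cue list is a tiny sample chosen for illustration, not the 64 cue words surveyed in Hirschberg &amp; Litman (1993):

```python
# Speculative off-by-one repair: nudge a hypothesized boundary to
# an adjacent paragraph that opens with a discourse cue.
CUE_WORDS = {"now", "well", "so", "finally", "first", "next", "however"}

def starts_with_cue(paragraph):
    words = paragraph.lower().split()
    return bool(words) and words[0].strip(",.;:") in CUE_WORDS

def adjust_boundary(paragraphs, boundary):
    """Shift the boundary by one paragraph if a neighbor, but not
    the current position, opens with a cue word."""
    if starts_with_cue(paragraphs[boundary]):
        return boundary
    for candidate in (boundary - 1, boundary + 1):
        if 0 < candidate < len(paragraphs) and starts_with_cue(paragraphs[candidate]):
            return candidate
    return boundary
```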