<?xml version="1.0" standalone="yes"?> <Paper uid="P94-1002"> <Title>MULTI-PARAGRAPH SEGMENTATION OF EXPOSITORY TEXT</Title> <Section position="3" start_page="0" end_page="9" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> The structure of expository texts can be characterized as a sequence of subtopical discussions that occur in the context of a few main topic discussions. For example, a popular science text called Stargazers, whose main topic is the existence of life on earth and other planets, can be described as consisting of the following subdiscussions (numbers indicate paragraph numbers):
 1-3 Intro - the search for life in space
 4-5 The moon's chemical composition
 6-8 How early proximity of the moon shaped it
 9-12 How the moon helped life evolve on earth
 13 Improbability of the earth-moon system
 14-16 Binary/trinary star systems make life unlikely
 17-18 The low probability of non-binary/trinary systems
 19-20 Properties of our sun that facilitate life
 21 Summary
Subtopic structure is sometimes marked in technical texts by headings and subheadings which divide the text into coherent segments; Brown &amp; Yule (1983:140) state that this kind of division is one of the most basic in discourse. However, many expository texts consist of long sequences of paragraphs with very little structural demarcation. This paper presents fully-implemented algorithms that use lexical cohesion relations to partition expository texts into multi-paragraph segments that reflect their subtopic structure. Because the model of discourse structure is one in which text is partitioned into contiguous, nonoverlapping blocks, I call the general approach TextTiling. The ultimate goal is to not only identify the extents of the subtopical units, but to label their contents as well.
This paper focusses only on the discovery of subtopic structure, leaving determination of subtopic content to future work.</Paragraph> <Paragraph position="1"> Most discourse segmentation work is done at a finer granularity than that suggested here. However, for lengthy written expository texts, multi-paragraph segmentation has many potential uses, including the improvement of computational tasks that make use of distributional information. For example, disambiguation algorithms that train on arbitrary-size text windows, e.g., Yarowsky (1992) and Gale et al. (1992b), and algorithms that use lexical co-occurrence to determine semantic relatedness, e.g., Schütze (1993), might benefit from using windows with motivated boundaries instead.</Paragraph> <Paragraph position="2"> Information retrieval algorithms can use subtopic structuring to return meaningful portions of a text if paragraphs are too short and sections are too long (or are not present). Motivated segments can also be used as a more meaningful unit for indexing long texts. Salton et al. (1993), working with encyclopedia text, find that comparing a query against sections and then paragraphs is more successful than comparing against full documents alone. I have used the results of TextTiling in a new paradigm for information access on full-text documents (Hearst 1994).</Paragraph> <Paragraph position="3"> The next section describes the discourse model that motivates the approach. This is followed by a description of two algorithms for subtopic structuring that make use only of lexical cohesion relations, the evaluation of these algorithms, and a summary and discussion of future work.</Paragraph> </Section></Paper>