<?xml version="1.0" standalone="yes"?> <Paper uid="P94-1002"> <Title>MULTI-PARAGRAPH SEGMENTATION OF EXPOSITORY TEXT</Title> <Section position="4" start_page="9" end_page="21121" type="metho"> <SectionTitle> THE DISCOURSE MODEL </SectionTitle> <Paragraph position="0"> Many discourse models assume a hierarchical segmentation model, e.g., attentional/intentional structure (Grosz & Sidner 1986) and Rhetorical Structure Theory (Mann & Thompson 1987). Although many aspects of discourse analysis require such a model, I choose to cast expository text into a linear sequence of segments, both for computational simplicity and because such a structure is sufficient for the coarse-grained tasks of interest here. 1</Paragraph> <Paragraph position="2"> (Figure 1 caption) Nodes correspond to sentences, and edges between nodes indicate strong term overlap between the sentences.</Paragraph> <Paragraph position="3"> Skorochod'ko (1972) suggests discovering a text's structure by dividing it up into sentences and seeing how much word overlap appears among the sentences.</Paragraph> <Paragraph position="4"> The overlap forms a kind of intra-structure; fully connected graphs might indicate dense discussions of a topic, while long spindly chains of connectivity might indicate a sequential account (see Figure 1). The central idea is that of defining the structure of a text as a function of the connectivity patterns of the terms that comprise it. This contrasts with segmentation guided primarily by fine-grained discourse cues such as register change, focus shift, and cue words. 
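Skorochod'ko-style connectivity can be illustrated with a small sketch. The sentence splitting, the overlap threshold, and the function name are my own assumptions for illustration, not details from the paper.

```python
# Illustrative sketch (not the paper's method): characterize a text's
# structure by word overlap between sentences, as Skorochod'ko suggests.
def overlap_graph(sentences, min_shared=2):
    """Return edges (i, j) linking sentence pairs that share at least
    `min_shared` distinct words (a crude stand-in for "strong term overlap")."""
    word_sets = [set(s.lower().split()) for s in sentences]
    edges = []
    for i in range(len(word_sets)):
        for j in range(i + 1, len(word_sets)):
            if len(word_sets[i] & word_sets[j]) >= min_shared:
                edges.append((i, j))
    return edges

sents = [
    "stars form in dense clouds of gas",
    "dense gas clouds collapse into stars",
    "the shoreline acreage changed over time",
]
print(overlap_graph(sents))  # → [(0, 1)]
```

A fully connected subgraph would then suggest one dense discussion, while a chain of weakly linked sentences would suggest a sequential account.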
From a computational viewpoint, deducing textual topic structure from lexical connectivity alone is appealing, both because it is easy to compute, and also because discourse cues are sometimes misleading with respect to the topic structure (Brown & Yule 1983, §3).</Paragraph> <Paragraph position="5"> 1 Additionally, (Passonneau & Litman 1993) concede the difficulty of eliciting hierarchical intentional structure with any degree of consistency from their human judges.</Paragraph> <Paragraph position="6"> The topology of most interest to this work is the final one in the diagram, the Piecewise Monolithic Structure, since it represents sequences of densely interrelated discussions linked together, one after another. This topology maps nicely onto that of viewing documents as a sequence of densely interrelated subtopical discussions, one following another. This assumption, as will be seen, is not always valid, but is nevertheless quite useful.</Paragraph> <Paragraph position="7"> This theoretical stance bears a close resemblance to Chafe's notion of The Flow Model of discourse (Chafe 1979), in description of which he writes (pp. 179-180): Our data... suggest that as a speaker moves from focus to focus (or from thought to thought) there are certain points at which there may be a more or less radical change in space, time, character configuration, event structure, or, even, world.... At points where all of these change in a maximal way, an episode boundary is strongly present. But often one or another will change considerably while others will change less radically, and all kinds of varied interactions between these several factors are possible. 
Although Chafe's work concerns narrative text, the same kind of observation applies to expository text.</Paragraph> <Paragraph position="8"> The TextTiling algorithms are designed to recognize episode boundaries by determining where thematic components like those listed by Chafe change in a maximal way.</Paragraph> <Paragraph position="9"> Many researchers have studied the patterns of occurrence of characters, setting, time, and the other thematic factors that Chafe mentions, usually in the context of narrative. In contrast, I attempt to determine where a relatively large set of active themes changes simultaneously, regardless of the type of thematic factor. This is especially important in expository text, in which the subject matter tends to structure the discourse more so than characters, setting, etc. For example, in the Stargazers text, a discussion of continental movement, shoreline acreage, and habitability gives way to a discussion of binary and unary star systems.</Paragraph> <Paragraph position="10"> This is not so much a change in setting or character as a change in subject matter. Therefore, to recognize where the subtopic changes occur, I make use of lexical cohesion relations (Halliday & Hasan 1976) in a manner similar to that suggested by Skorochod'ko.</Paragraph> <Paragraph position="11"> Morris and Hirst's pioneering work on computing discourse structure from lexical relations (Morris & Hirst 1991), (Morris 1988) is a precursor to the work reported on here. Influenced by Halliday & Hasan's (1976) theory of lexical coherence, Morris developed an algorithm that finds chains of related terms via a comprehensive thesaurus (Roget's Fourth Edition). 
2 Interestingly, Chafe arrived at the Flow Model after working extensively with, and then becoming dissatisfied with, a hierarchical model of paragraph structure like that of Longacre (1979).</Paragraph> <Paragraph position="12"> For example, the words residential and apartment both index the same thesaural category and can thus be considered to be in a coherence relation with one another. The chains are used to structure texts according to the attentional/intentional theory of discourse structure (Grosz & Sidner 1986), and the extent of the chains corresponds to the extent of a segment. The algorithm also incorporates the notion of &quot;chain returns&quot; - repetition of terms after a long hiatus - to close off an intention that spans over a digression.</Paragraph> <Paragraph position="13"> Since the Morris & Hirst (1991) algorithm attempts to discover attentional/intentional structure, its goals are different from those of TextTiling. Specifically, the discourse structure they attempt to discover is hierarchical and more fine-grained than that discussed here. Thus their model is not set up to take advantage of the fact that multiple simultaneous chains might occur over the same intention. Furthermore, chains tend to overlap one another extensively in long texts. Figure 2 shows the distribution, by sentence number, of selected terms from the Stargazers text. The first two terms have a fairly uniform distribution and so should not be expected to provide much information about the divisions of the discussion. The next two terms occur mainly at the beginning and the end of the text, while terms binary through planet have considerable overlap</Paragraph> <Paragraph position="14"> from sentences 58 to 78. 
There is a somewhat well-demarked cluster of terms between sentences 35 and 50, corresponding to the grouping together of paragraphs 10, 11, and 12 by human judges who have read the text.</Paragraph> <Paragraph position="15"> From the diagram it is evident that simply looking for chains of repeated terms is not sufficient for determining subtopic breaks. Even combining terms that are closely related semantically into single chains is insufficient, since often several different themes are active in the same segment. For example, sentences 37-51 contain dense interaction among the terms move, continent, shoreline, time, species, and life, and all but the latter occur only in this region. Note, however, that the interlinked terms of sentences 57-71 (space, star, binary, trinary, astronomer, orbit) are closely related semantically, assuming the appropriate senses of the terms have been determined.</Paragraph> </Section> <Section position="5" start_page="21121" end_page="21121" type="metho"> <SectionTitle> ALGORITHMS FOR DISCOVERING SUBTOPIC STRUCTURE </SectionTitle> <Paragraph position="0"> Many researchers (e.g., Halliday & Hasan (1976), Tannen (1989), Walker (1991)) have noted that term repetition is a strong cohesion indicator. I have found in this work that term repetition alone is a very useful indicator of subtopic structure, when analyzed in terms of multiple simultaneous information threads. This section describes two algorithms for discovering subtopic structure using term repetition as a lexical cohesion indicator. The first method compares, for a given window size, each pair of adjacent blocks of text according to how similar they are lexically. This method assumes that the more similar two blocks of text are, the more likely it is that the current subtopic continues, and, conversely, if two adjacent blocks of text are dissimilar, this implies a change in subtopic flow. 
The second method, an extension of Morris & Hirst's (1991) approach, keeps track of active chains of repeated terms, where membership in a chain is determined by location in the text. The method determines subtopic flow by recording where in the discourse the bulk of one set of chains ends and a new set of chains begins.</Paragraph> <Paragraph position="1"> The core algorithm has three main parts: 1. Tokenization 2. Similarity Determination 3. Boundary Identification Tokenization refers to the division of the input text into individual lexical units. For both versions of the algorithm, the text is subdivided into pseudo-sentences of a pre-defined size w (a parameter of the algorithm) rather than actual syntactically-determined sentences, thus circumventing normalization problems. For the purposes of the rest of the discussion these groupings of tokens will be referred to as token-sequences. In practice, setting w to 20 tokens per token-sequence works best for many texts. The morphologically-analyzed token is stored in a table along with a record of the token-sequence number it occurred in, and how frequently it appeared in the token-sequence. A record is also kept of the locations of the paragraph breaks within the text. Closed-class and other very frequent words are eliminated from the analysis.</Paragraph> <Paragraph position="2"> After tokenization, the next step is the comparison of adjacent pairs of blocks of token-sequences for overall lexical similarity. Another important parameter for the algorithm is the blocksize: the number of token-sequences that are grouped together into a block to be compared against an adjacent group of token-sequences. This value, labeled k, varies slightly from text to text; as a heuristic it is the average paragraph length (in token-sequences). In practice, a value of k = 6 works well for many texts. 
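The tokenization step above can be sketched in a few lines. This is a minimal illustration assuming whitespace tokenization and a toy stop list; the morphological analysis, the full closed-class word list, and the paragraph-break bookkeeping are omitted, and the function name is my own.

```python
# Toy stop list standing in for the closed-class/frequent-word filter.
STOP = {"the", "a", "an", "of", "to", "and", "in", "is", "it"}

def tokenize(text, w=20):
    """Split `text` into token-sequences of w tokens each and return,
    per token-sequence, a frequency table of the surviving tokens."""
    tokens = [t for t in text.lower().split() if t not in STOP]
    seqs = [tokens[i:i + w] for i in range(0, len(tokens), w)]
    tables = []
    for seq in seqs:
        freq = {}
        for t in seq:
            freq[t] = freq.get(t, 0) + 1
        tables.append(freq)
    return tables
```

With w = 20, a 25-token text yields two token-sequences, the second holding the 5-token remainder.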
Actual paragraphs are not used because their lengths can be highly irregular, leading to unbalanced comparisons.</Paragraph> <Paragraph position="3"> Similarity values are computed for every token-sequence gap number; that is, a score is assigned to token-sequence gap i corresponding to how similar the token-sequences from token-sequence i - k through i are to the token-sequences from i + 1 to i + k + 1. Note that this moving window approach means that each token-sequence appears in k * 2 similarity computations. Similarity between blocks is calculated by a cosine measure: given two text blocks b1 and b2, each with k token-sequences, sim(b1, b2) = ( Σ_t w_t,b1 · w_t,b2 ) / sqrt( (Σ_t w_t,b1²) · (Σ_t w_t,b2²) ), where t ranges over all the terms that have been registered during the tokenization step, and w_t,b1 is the weight assigned to term t in block b1. In this version of the algorithm, the weights on the terms are simply their frequency within the block.4 Thus if the similarity score between two blocks is high, then the blocks have many terms in common. This formula yields a score between 0 and 1, inclusive.</Paragraph> <Paragraph position="4"> These scores can be plotted, token-sequence number against similarity score. However, since similarity is measured between blocks b1 and b2, where b1 spans token-sequences i - k through i and b2 spans i + 1 to i + k + 1, the measurement's x-axis coordinate falls between token-sequences i and i + 1. Rather than plotting a token-sequence number on the x-axis, we plot token-sequence gap number i. The plot is smoothed with average smoothing; in practice one round of average smoothing with a window size of three works best for most texts.</Paragraph> <Paragraph position="5"> Boundaries are determined by changes in the sequence of similarity scores. The token-sequence gap numbers are ordered according to how steeply the slopes of the plot are to either side of the token-sequence gap, rather than by their absolute similarity score. 
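The block comparison can be sketched as follows, using term frequencies as weights per the cosine formula above. This is an illustrative simplification (the window is truncated at the text edges, and `merge`/`gap_scores` are names of my own invention), not the paper's implementation.

```python
import math

def cosine(f1, f2):
    """Cosine similarity between two term-frequency tables."""
    num = sum(f1[t] * f2.get(t, 0) for t in f1)
    den = math.sqrt(sum(v * v for v in f1.values()) *
                    sum(v * v for v in f2.values()))
    return num / den if den else 0.0

def merge(tables):
    """Combine per-token-sequence frequency tables into one block table."""
    out = {}
    for tab in tables:
        for t, v in tab.items():
            out[t] = out.get(t, 0) + v
    return out

def gap_scores(tables, k=6):
    """Similarity at each token-sequence gap i: the k sequences before
    the gap compared against the k sequences after it."""
    scores = []
    for i in range(1, len(tables)):
        left = merge(tables[max(0, i - k):i])
        right = merge(tables[i:i + k])
        scores.append(cosine(left, right))
    return scores
```

A gap falling inside a homogeneous discussion scores near 1; a gap where the vocabulary turns over completely scores 0.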
For a given token-sequence gap i, the algorithm looks at the scores of the token-sequence gaps to the left of i as long as their values are increasing. When the values to the left peak out, the difference between the score at the peak and the score at i is recorded. The same procedure takes place with the token-sequence gaps to the right of i; their scores are examined as long as they continue to rise. The relative height of the peak to the right of i is added to the relative height of the peak to the left. (A gap occurring at a peak will have a score of zero, since neither of its neighbors is higher than it.) These new scores, called depth scores, which correspond to how sharp a change occurs on both sides of the token-sequence gap, are then sorted. Segment boundaries are assigned to the token-sequence gaps with the largest corresponding scores, adjusted as necessary to correspond to true paragraph breaks. A proviso check prevents assignment of very close adjacent segment boundaries; currently there must be at least three intervening token-sequences between boundaries.</Paragraph> <Paragraph position="6"> This helps control for the fact that many texts have spurious header information and single-sentence paragraphs. The algorithm must determine how many segments to assign to a document, since every paragraph is a potential segment boundary. 4 Earlier work weighted the terms according to their frequency times their inverse document frequency; in these more recent experiments, simple term frequencies seem to work better.</Paragraph> <Paragraph position="8"> Any attempt to make an absolute cutoff is problematic, since there would need to be some correspondence to the document style and length. 
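The depth-scoring and boundary-selection steps above can be sketched as follows. This is a minimal sketch: the minimum-separation proviso and a mean-minus-half-standard-deviation cutoff over the depth scores are included, but the adjustment of boundaries to true paragraph breaks is omitted, and the function names are my own.

```python
def depth_scores(scores):
    """For each gap, climb left and right while the similarity scores
    rise, and sum the two relative peak heights."""
    depths = []
    for i, s in enumerate(scores):
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1
        right = i
        while right < len(scores) - 1 and scores[right + 1] >= scores[right]:
            right += 1
        # A gap at a peak gets zero: neither neighbor is higher than it.
        depths.append((scores[left] - s) + (scores[right] - s))
    return depths

def pick_boundaries(depths, min_sep=3):
    """Take gaps in decreasing depth order, keeping a gap only if it
    clears the cutoff and is far enough from already-chosen gaps."""
    order = sorted(range(len(depths)), key=lambda i: depths[i], reverse=True)
    mean = sum(depths) / len(depths)
    sd = (sum((d - mean) ** 2 for d in depths) / len(depths)) ** 0.5
    cutoff = mean - sd / 2
    chosen = []
    for i in order:
        if depths[i] > cutoff and all(abs(i - j) >= min_sep for j in chosen):
            chosen.append(i)
    return sorted(chosen)
```

For the score sequence [0.2, 0.8, 0.1, 0.9, 0.3], the deepest valley is the third gap (depth 1.5), and the nearby shallower valleys are suppressed by the separation proviso.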
A cutoff based on a particular valley depth is similarly problematic. I have devised a method for determining the number of boundaries to assign that scales with the size of the document and is sensitive to the patterns of similarity scores that it produces: the cutoff is a function of the average and standard deviation of the depth scores for the text under analysis. Currently a boundary is drawn only if the depth score exceeds s̄ - σ/2, where s̄ is the average depth score and σ its standard deviation.</Paragraph> </Section> </Paper>