<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1064"> <Title>A Statistical Model for Domain-Independent Text Segmentation</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Algorithm for Finding the Maximum-Probability Segmentation </SectionTitle> <Paragraph position="0"> To find the maximum-probability segmentation $\hat{S}$, we first define the cost of segmentation $S$ as

$$C(S) \equiv -\log \Pr(W \mid S)\Pr(S), \quad (11)$$

where, as in Section 2, we use $\Pr(S) = n^{-m}$ for a text of $n$ words divided into $m$ segments. (A prior based on the average length of a segment (in words) could be used instead; we leave it for future work to compare the suitability of various prior probabilities for text segmentation.)</Paragraph> <Paragraph position="1"> We then minimize $C(S)$ to obtain $\hat{S}$, because

$$\hat{S} = \arg\max_{S} \Pr(W \mid S)\Pr(S) = \arg\min_{S} C(S). \quad (12)$$
</Paragraph> <Paragraph position="2"> We further rewrite Equation (12) in the form of Equation (13) below by using Equation (5) and replacing $n_i$ with $\mathrm{len}(w^i_1 w^i_2 \ldots w^i_{n_i})$, where $\mathrm{len}(\mathit{words})$ is the length of $\mathit{words}$, i.e., the number of word tokens in $\mathit{words}$. Equation (13) is used to describe our algorithm in Section 3.1:

$$\hat{S} = \arg\min_{S} \sum_{i=1}^{m} \Bigl( \sum_{j=1}^{n_i} -\log \frac{f_i(w^i_j) + 1}{\mathrm{len}(w^i_1 w^i_2 \ldots w^i_{n_i}) + k} + \log n \Bigr). \quad (13)$$
</Paragraph> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Algorithm </SectionTitle> <Paragraph position="0"> This section describes an algorithm for finding the minimum-cost segmentation. First, we define the terms and symbols used to describe the algorithm.</Paragraph> <Paragraph position="1"> Given a text $W = w_1 w_2 \ldots w_n$ consisting of $n$ words, we define $g_i$ ($0 \le i \le n$) as the position between $w_i$ and $w_{i+1}$, so that $g_0$ is just before $w_1$ and $g_n$ is just after $w_n$.</Paragraph> <Paragraph position="2"> Next, we define a graph $G = \langle V, E \rangle$, where $V$ is a set of nodes and $E$ is a set of edges. $V$ is defined as

$$V = \{ g_i \mid 0 \le i \le n \} \quad (14)$$

and $E$ as

$$E = \{ e_{ij} \mid 0 \le i < j \le n \}, \quad (15)$$

where the edges are ordered; the initial vertex and the terminal vertex of $e_{ij}$ are $g_i$ and $g_j$, respectively. An example of $G$ is shown in Figure 1.</Paragraph> <Paragraph position="3"> We say that $e_{ij}$ covers $w_{i+1} w_{i+2} \ldots w_j$; that is, $e_{ij}$ represents the segment $w_{i+1} w_{i+2} \ldots w_j$. Following Equation (13), the cost of this segment is

$$c_{ij} = \sum_{t=i+1}^{j} -\log \frac{f_{ij}(w_t) + 1}{(j - i) + k} + \log n, \quad (16)$$

where $k$ is the number of different words in $W$ and $f_{ij}(w)$ is the number of occurrences of $w$ in $w_{i+1} w_{i+2} \ldots w_j$.</Paragraph> <Paragraph position="4"> Given these definitions, we describe the algorithm to find the minimum-cost segmentation, or maximum-probability segmentation, as follows: Step 1. Calculate the cost $c_{ij}$ of edge $e_{ij}$ for $0 \le i < j \le n$ by using Equation (16). Step 2. Find the minimum-cost path from $g_0$ to $g_n$.</Paragraph> <Paragraph position="5"> Algorithms for finding the minimum-cost path in a graph are well known. An algorithm that provides a solution for Step 2 is a simpler version of the algorithm used to find the maximum-probability solution in Japanese morphological analysis (Nagata, 1994); a solution can therefore be obtained by applying a dynamic programming (DP) algorithm. DP algorithms have also been used for text segmentation by other researchers (Ponte and Croft, 1997; Heinonen, 1998). The path thus obtained represents the minimum-cost segmentation in $G$, where edges correspond to segments.</Paragraph>
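<Paragraph position="6"> To make Equation (16) concrete, the following is a minimal Python sketch of the edge cost (our illustration, not the authors' implementation; the function name and the representation of the text as a list of word tokens are our own choices):

import math
from collections import Counter

def edge_cost(words, i, j, k, n):
    """Cost c_ij of edge e_ij, i.e., of the segment words[i:j] (Equation (16))."""
    freq = Counter(words[i:j])   # f_ij(w): occurrences of w within the segment
    length = j - i               # number of word tokens in the segment
    cost = math.log(n)           # log n term contributed by the prior Pr(S) = n^{-m}
    for t in range(i, j):
        # Laplace-smoothed within-segment word probability
        p = (freq[words[t]] + 1) / (length + k)
        cost -= math.log(p)
    return cost
</Paragraph>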
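<Paragraph position="7"> Step 2 can then be sketched as a standard dynamic program over the directed acyclic graph $G$; again, this is an illustrative sketch rather than the authors' C implementation:

def segment(words):
    """Find the minimum-cost path from g_0 to g_n and return the
    corresponding segmentation as a list of (start, end) pairs."""
    n = len(words)
    k = len(set(words))                 # number of different words in W
    best = [float("inf")] * (n + 1)     # best[j]: minimum cost of segmenting words[0:j]
    back = [0] * (n + 1)                # back[j]: initial vertex of the last edge used
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + edge_cost(words, i, j, k, n)
            if c < best[j]:
                best[j], back[j] = c, i
    segments, j = [], n                 # follow back-pointers from g_n to g_0
    while j > 0:
        segments.append((back[j], j))
        j = back[j]
    return list(reversed(segments))

Note that this sketch enumerates all $O(n^2)$ edges; when boundary candidates are restricted to sentence ends, as described below, the graph is much smaller.</Paragraph>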
<Paragraph position="8"> In Figure 1, for example, if $e_{0,1}\,e_{1,4}\,e_{4,6}$ is the minimum-cost path, then $[w_1][w_2 w_3 w_4][w_5 w_6]$ is the minimum-cost segmentation.</Paragraph> <Paragraph position="9"> The algorithm automatically determines the number of segments. The number of segments can, however, also be specified explicitly, by specifying the number of edges in the minimum-cost path.</Paragraph> <Paragraph position="10"> The algorithm allows the text to be segmented anywhere between words; i.e., all the positions between words are candidates for segment boundaries. It is easy, however, to modify the algorithm so that the text can be segmented only at particular positions, such as the ends of sentences or paragraphs. This is done by using a subset of $E$ in Equation (15): we use only the edges whose initial and terminal vertices are candidate boundaries that meet particular conditions, such as being the ends of sentences or paragraphs. We then obtain the minimum-cost path by performing Steps 1 and 2, and the minimum-cost segmentation thus obtained meets the boundary conditions. In this paper, we assume that the segment boundaries are at the ends of sentences.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Properties of the segmentation </SectionTitle> <Paragraph position="0"> Generally speaking, the number of segments obtained by our algorithm is not sensitive to the length of a given text, counted in words. In other words, the number of segments is relatively stable with respect to variation in the text length. For example, the algorithm divides a newspaper editorial consisting of about 27 sentences into 4 to 6 segments, while it divides a long text consisting of over 1000 sentences into 10 to 20 segments. Thus, the number of segments is not proportional to text length.</Paragraph> <Paragraph position="1"> This is due to the term $m \log n$ in Equation (11) (with $\Pr(S) = n^{-m}$, we have $-\log \Pr(S) = m \log n$). The value of this term increases as the number of words increases; it thus suppresses the division of a text when the text is long.</Paragraph> <Paragraph position="2"> This stability is desirable for summarization, because summarizing a given text requires selecting a relatively small number of topics from it. If a text segmentation system divides a given text into a relatively small number of segments, then a summary of the original text can be composed by combining summaries of the component segments (Kan et al., 1998; Nakao, 2000). A finer segmentation can be obtained by applying our algorithm recursively to each segment, if necessary.</Paragraph>
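<Paragraph position="3"> Such recursive refinement might be sketched as follows; this is a hypothetical illustration building on the segment() sketch above, and since the paper does not specify a stopping criterion, the max_len threshold here is our own assumption:

def segment_recursive(words, offset=0, max_len=500):
    """Re-segment any segment longer than max_len word tokens by applying
    the algorithm to it recursively; max_len is an illustrative threshold."""
    out = []
    for s, e in segment(words):
        if e - s > max_len and (e - s) < len(words):
            # recurse on a strictly smaller piece, shifting offsets
            # back into whole-text coordinates
            out.extend(segment_recursive(words[s:e], offset + s, max_len))
        else:
            out.append((offset + s, offset + e))
    return out
</Paragraph>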
<Paragraph position="4"> We have found that our method is good at segmenting a text into a relatively small number of segments; it is not good at segmenting a text into a large number of segments. For example, the method is good at segmenting a 1000-sentence text into 10 segments, and in such a case the segment boundaries seem to correspond well with topic boundaries. But if the method is forced to segment the same text into 50 segments, by specifying the number of edges in the minimum-cost path, then the resulting segmentation often contains very small segments consisting of only one or two sentences. We found empirically that the segments obtained by recursive segmentation were better than those obtained by minimum-cost segmentation when the specified number of segments was somewhat larger than that of the minimum-cost path, whose number of segments is determined automatically by the algorithm.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Material </SectionTitle> <Paragraph position="0"> We used publicly available data to evaluate our system. This data was used by Choi (2000) to compare various domain-independent text segmentation systems. He evaluated C99 (Choi, 2000), TextTiling (Hearst, 1994), DotPlot (Reynar, 1998), and Segmenter (Kan et al., 1998) using the data, and reported that C99 achieved the best performance among these systems.</Paragraph> <Paragraph position="1"> The data are described as follows: "An artificial test corpus of 700 samples is used to assess the accuracy and speed performance of segmentation algorithms. A sample is a concatenation of ten text segments. A segment is the first $n$ sentences of a randomly selected document from the Brown corpus. A sample is characterised by the range $n$." (Choi, 2000) Table 1 gives the corpus statistics.</Paragraph> <Paragraph position="2"> Segmentation accuracy was measured by the probabilistic error metric $P_k$ proposed by Beeferman et al. (1999); a low $P_k$ indicates high accuracy. $P_k$ "is the probability that a randomly chosen pair of words a distance of $k$ words apart is inconsistently classified; that is, for one of the segmentations the pair lies in the same segment, while for the other the pair spans a segment boundary" (Beeferman et al., 1999), where $k$ is chosen to be half the average reference segment length (in words).</Paragraph>
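<Paragraph position="3"> As an illustration of the metric, here is a small Python sketch of the quoted definition (our paraphrase, not the evaluation program in Choi's package):

def p_k(ref_bounds, hyp_bounds, n_words, k):
    """P_k for one sample. ref_bounds/hyp_bounds are sets of boundary
    positions b, meaning a boundary falls just before word b."""
    def same_segment(bounds, i, j):
        # True iff no boundary separates word i from word j
        return not any(i < b <= j for b in bounds)

    pairs = n_words - k   # pairs of words a distance of k apart
    errors = sum(
        same_segment(ref_bounds, i, i + k) != same_segment(hyp_bounds, i, i + k)
        for i in range(pairs)
    )
    return errors / pairs
</Paragraph>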
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Experimental procedure and results </SectionTitle> <Paragraph position="0"> The sample texts were preprocessed (punctuation and stop words were removed, and the remaining words were stemmed) by a program using the libraries available in Choi's package. The texts were then segmented by the systems listed in Tables 2 and 3, with the segmentation boundaries placed at the ends of sentences. The segmentations were evaluated by applying an evaluation program in Choi's package.</Paragraph> <Paragraph position="1"> The results are listed in Tables 2 and 3. U00 is the result for our system when the numbers of segments were determined by the system, and U00(b) is the result for our system when the numbers of segments were given beforehand. C99 and C99(b) are the corresponding results for the systems described in Choi's paper (Choi, 2000). The results for C99(b) in Table 3 are slightly different from those listed in Table 6 of Choi's paper; this is because the original results in that paper were based on 500 samples, while the results in our Table 3 are based on all 700 samples (Choi, personal communication).</Paragraph> <Paragraph position="2"> In these tables, the symbol "**" indicates that the difference in $P_k$ between the two systems is statistically significant at the 1% level, based on a one-sided $t$-test of the null hypothesis of equal means. The probability of the null hypothesis being true is displayed in the row labelled "prob". The column labels, such as "3-11", indicate that the numbers in the column are the averages of $P_k$ over the corresponding sample texts; "Total" indicates the average of $P_k$ over all the text samples.</Paragraph> <Paragraph position="3"> These tables show statistically that our system is more accurate than or at least as accurate as C99. This means that our system is more accurate than or at least as accurate as previous domain-independent text segmentation systems, because C99 has been shown to be more accurate than those systems. Speed performance is not our main concern in this paper, and our implementations of U00 and U00(b) are not optimal; even so, U00 and U00(b), which are implemented in C, run as fast as C99 and C99(b), which are implemented in Java (Choi, 2000), owing to the difference in programming languages.</Paragraph>
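<Paragraph position="4"> The significance test can be sketched as follows; the paper does not state whether the per-sample scores were paired across systems, so the paired form here is our assumption:

from scipy import stats

def compare(pk_ours, pk_baseline):
    """One-sided test of H0: equal mean P_k, against the alternative
    that our per-sample P_k scores are lower (i.e., more accurate)."""
    t, prob = stats.ttest_rel(pk_ours, pk_baseline, alternative="less")
    return t, prob
</Paragraph>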
</Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Evaluation </SectionTitle> <Paragraph position="0"> Evaluation of the output of text segmentation systems is difficult because the required segmentations depend on the application. In this paper, we have used an artificial corpus to evaluate our system. We regard this as appropriate for comparing the relative performance of systems.</Paragraph> <Paragraph position="1"> It is important, however, to assess the performance of systems on real texts. These texts should be domain independent. They should also be multilingual if we want to test the multilinguality of systems. For English, Klavans et al. (1998) describe a segmentation corpus in which the texts were segmented by humans, but there are no such corpora for other languages. We are planning to build a segmentation corpus for Japanese, based on a corpus of speech transcriptions (Maekawa and Koiso, 2000).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Related work </SectionTitle> <Paragraph position="0"> Our proposed algorithm finds the maximum-probability segmentation of a given text. This is a new approach for domain-independent text segmentation. A probabilistic approach, however, has already been proposed by Yamron et al. (1998) for domain-dependent text segmentation (broadcast news story segmentation). They trained a hidden Markov model (HMM) whose states correspond to topics. Given a word sequence, their system assigns each word a topic so that the maximum-probability topic sequence is obtained. Their model is basically the same as that used for HMM part-of-speech (POS) taggers (Manning and Schütze, 1999), if we regard topics as POS tags. Finding topic boundaries is equivalent to finding topic transitions; i.e., a continuous topic or segment is a sequence of words with the same topic.</Paragraph> <Paragraph position="1"> Their approach is indirect compared with our approach, which directly finds the maximum-probability segmentation. As a result, their model cannot straightforwardly incorporate features pertaining to a segment itself, such as the average length of segments. Our model, on the other hand, can incorporate this information quite naturally. Suppose that the length $l$ of a segment follows a normal distribution $N(l; \mu, \sigma)$, with a mean of $\mu$ and a standard deviation of $\sigma$ (Ponte and Croft, 1997). Then Equation (13) can be augmented to

$$\hat{S} = \arg\min_{S} \sum_{i=1}^{m} \Bigl( \sum_{j=1}^{n_i} -\log \frac{f_i(w^i_j) + 1}{n_i + k} - \log N(n_i; \mu, \sigma) + \log n \Bigr). \quad (17)$$

Equation (17) favors segments whose lengths are similar to the average length (in words).</Paragraph> <Paragraph position="2"> Another major difference from their algorithm is that ours does not require training data to estimate probabilities, while theirs does. Therefore, our algorithm can be applied to domain-independent texts, while their algorithm is restricted to domains for which training data are available. It would be interesting, however, to compare our algorithm with theirs for the case when training data are available. In such a case, our model should be extended to incorporate various features, such as the average segment length, clue words, and named entities (Reynar, 1999; Beeferman et al., 1999).</Paragraph> <Paragraph position="3"> Our proposed algorithm naturally estimates the probabilities of words in segments. These probabilities, which are called word densities, have been used to detect important descriptions of words in texts (Kurohashi et al., 1997). That method is based on the assumption that the density of a word is high in a segment in which the word is discussed (defined and/or explained) in some depth. It would be interesting to apply our method to this application.</Paragraph> </Section> </Section> </Paper>