<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2243"> <Title>How to thematically segment texts by using lexical cohesion?</Title> <Section position="4" start_page="1481" end_page="1482" type="metho"> <SectionTitle> 3 Results </SectionTitle> <Paragraph position="0"> A first qualitative evaluation of the method has been done with about 20 texts but without a formal protocol as in (Hearst, 1997). The results of these tests are rather stable when parameters such as the size of the cohesion computing window or the size of the smoothing window are changed (from 9 to 21 words). Generally, the best results are obtained with a size of 19 words for the first window and 11 for the second one.</Paragraph> <Section position="1" start_page="1481" end_page="1482" type="sub_section"> <SectionTitle> 3.1 Discovering document breaks </SectionTitle> <Paragraph position="0"> In order to have a more objective evaluation, the method has been applied to the &quot;classical&quot; task of discovering boundaries between concatened texts. Results are shown in Table 1. As in (Hearst, 1997), boundaries found by the method are weighted and sorted in decreasing order.</Paragraph> <Paragraph position="1"> Document breaks are supposed to be the boundaries that have the highest weights. For the first Nb boundaries, Nt is the number of boundaries that match with document breaks. Precision is given by Nt/Nb and recall, by Nt/N, where N is the number of document breaks. Our evaluation has been performed with 39 texts coming from the Le Monde newspaper, but not taken from the corpus used for building the collocation network. Each text was 80 words long on average. Each boundary, which is a minimum of the cohesion graph, was weighted by the sum of the differences between its value and the values of the two maxima around it, as in (Hearst, 1997).</Paragraph> <Paragraph position="2"> The match between a boundary and a document break was accepted if the boundary was no further than 9 words (after pre-processing).</Paragraph> <Paragraph position="3"> Globally, our results are not as good as Hearst's (with 44 texts; Nb: 10, P: 0.8, R: 0.19; Nb: 70, P: 0.59, R: 0.95). The first explanation for such a difference is the fact that the two methods do not apply to the same kind of texts. Hearst does not consider texts smaller than 10 sentences long. All the texts of this evaluation are under this limit. In fact, our method, as Kozima's, is more convenient for closely tracking thematic evolutions than for detecting the major thematic shifts. The second explanation for this difference is related to the way the document breaks are found, as shown by the precision values. When Nb increases, precision decreases as it generally does, but very slowly.</Paragraph> <Paragraph position="4"> The decrease actually becomes significant only when Nb becomes larger than N. It means that the weights associated to the boundaries are not very significant. We have validated this hypothesis by changing the weighting policy of the boundaries without having significant changes in the results.</Paragraph> <Paragraph position="5"> One way for increasing the performance would be to take as text boundary not the position of a minimum in the cohesion graph but the nearest sentence boundary from this position.</Paragraph> </Section> </Section> class="xml-element"></Paper>