<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1033">
<Title>Using collocations for topic segmentation and link detection</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0">Topic analysis, which aims at identifying the topics of a text, delimiting their extent and finding the relations between the resulting segments, has recently attracted considerable interest. The largest part of this work has been dedicated to topic segmentation, also called linear text segmentation, and to the TDT (Topic Detection and Tracking) initiative (Fiscus et al., 1999), which addresses all the tasks we have mentioned, but from a domain-dependent viewpoint and not necessarily in an integrated way.</Paragraph>
<Paragraph position="1">Systems that implement this work can be categorized according to the kind of knowledge they use. Most of those that perform text segmentation rely only on the intrinsic characteristics of texts: word distribution, as in (Hearst, 1997), (Choi, 2000) and (Utiyama and Isahara, 2001), or linguistic cues, as in (Passonneau and Litman, 1997).</Paragraph>
<Paragraph position="2">They can be applied without restriction on domain but yield poor results when a text does not signal its topical structure through surface clues.</Paragraph>
<Paragraph position="3">Some systems exploit domain-independent knowledge about lexical cohesion: a network of words built from a dictionary in (Kozima, 1993); a large set of collocations collected from a corpus in (Ferret, 1998), (Kaufmann, 1999) and (Choi, 2001). To some extent, this knowledge allows these systems to discard some false topic shifts without losing their independence with regard to domains.</Paragraph>
<Paragraph position="4">The last main type of system relies on knowledge about the topics that may occur in the texts it processes. This is typically the kind of approach developed in TDT, where this knowledge is automatically built from a set of reference texts. The work of Bigi (Bigi et al., 1998) takes the same perspective but focuses on much broader topics than TDT. These systems have a limited scope because of their topic representations, but they are also more precise for the same reason.</Paragraph>
<Paragraph position="5">Hybrid systems that combine the approaches presented above have also been developed and have demonstrated the benefit of such a combination: (Jobbins and Evett, 1998) combined word recurrence, collocations and a thesaurus; (Beeferman et al., 1999) relied on both collocations and linguistic cues.</Paragraph>
<Paragraph position="6">The topic analysis we propose implements such a hybrid approach: it relies on a general language resource, a collocation network, but exploits it together with word recurrence in texts. Moreover, it simultaneously performs topic segmentation and link detection, i.e. determining whether two segments discuss the same topic.</Paragraph>
<Paragraph position="7">In this paper, we detail the implementation of this analysis in the TOPICOLL system, report evaluations of its segmentation capabilities for two languages, French and English, and finally propose an evaluation measure that integrates both segmentation and link detection.</Paragraph>
</Section>
</Paper>