<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1006">
<Title>NLP and IR Approaches to Monolingual and Multilingual Link Detection</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle> 3 Topic Segmentation </SectionTitle>
<Paragraph position="0"> There is no presumption that each story discusses only one topic. We therefore segment stories into small passages according to the topics they discuss and compute passage similarity instead of document similarity. The basic idea is that the significance of some useful terms may be reduced in a long story, because a similarity measure computed over a large number of terms dilutes the effect of those important terms. Computing similarities between small passages lets such terms remain significant.</Paragraph>
<Paragraph position="1"> The first method we adopted is the TextTiling approach (Hearst, 1993). TextTiling subdivides text into multi-paragraph units that represent passages or subtopics, using quantitative lexical analyses to segment the documents. After the TextTiling algorithm is applied, a document is broken into tiles. Suppose one story is broken into three tiles and the other into four tiles; there are then twelve (i.e., 3*4) passage-pair similarities between the two stories. We investigated three strategies to exploit topic segmentation. Strategy (I) computes the similarity from the most similar passage pair. Strategy (II) uses the passage-averaged similarity. Strategy (III) uses a two-state decision (Chen, 2002). However, the results are not as good as we expected: so far, the best performance is almost the same as that of the original method without TextTiling.</Paragraph>
<Paragraph position="2"> Next, we applied another topic segmentation algorithm, developed by Utiyama et al. (2001).</Paragraph>
<Paragraph position="3"> The results show that this segmentation algorithm is better than TextTiling, but the improvement is still not marked. Table 4 shows the experimental results for topic segmentation. For Strategy (III), the first threshold is 0.06, which is also the best threshold for the basic method, and the second threshold varies from 0.04 to 0.07 for segmentation. After topic segmentation, topic words are concentrated in small passages. However, few news stories in the test data discuss more than one topic, and the overall performance depends on the segmentation algorithm. We therefore built an index file similar to the original TDT index file, in which at least one story of each pair discusses multiple topics. We applied the different strategies to this subset to investigate the effect of topic segmentation. The experimental results demonstrate that topic segmentation is useful in this task (Chen, 2002).</Paragraph>
</Section>
</Paper>
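
The passage-level strategies described in Section 3 can be made concrete with a small sketch. The Python code below is a minimal illustration, not the authors' implementation: it assumes the stories have already been segmented into tiles (e.g., by TextTiling), uses raw term counts with cosine similarity in place of the system's actual term weighting, and omits Strategy (III), whose two-state decision is not detailed in this excerpt. The tile texts and identifiers are hypothetical; only the 0.06 threshold is taken from the paper (the best threshold for the basic method).

# Minimal sketch of passage-level link detection after topic segmentation.
# NOT the authors' code: term weighting, segmentation, and tile texts are
# placeholder assumptions; the paper's Strategy (III) is omitted.
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words term vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def passage_similarities(tiles_a, tiles_b):
    """All pairwise tile similarities; 3 tiles x 4 tiles gives 12 values."""
    vecs_a = [Counter(t.lower().split()) for t in tiles_a]
    vecs_b = [Counter(t.lower().split()) for t in tiles_b]
    return [cosine(x, y) for x in vecs_a for y in vecs_b]

def strategy_one(sims):
    """Strategy (I): similarity of the most similar passage pair."""
    return max(sims)

def strategy_two(sims):
    """Strategy (II): passage-averaged similarity."""
    return sum(sims) / len(sims)

# Hypothetical usage with two already-segmented stories (tile texts made up).
story_a_tiles = ["stocks fell sharply on monday", "the central bank met", "analysts expect rate cuts"]
story_b_tiles = ["markets dropped again", "interest rates were left unchanged", "bank officials spoke", "traders stayed cautious"]
sims = passage_similarities(story_a_tiles, story_b_tiles)
linked = strategy_one(sims) > 0.06  # 0.06: the first threshold reported for the basic method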