File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/n04-1040_intro.xml
Size: 4,823 bytes
Last Modified: 2025-10-06 14:02:16
<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1040"> <Title>Multiple Similarity Measures and Source-Pair Information in Story Link Detection</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Background and Related Work </SectionTitle> <Paragraph position="0"> The DARPA TDT story link detection task requires identifying pairs of linked stories. The original languages of the stories are English, Mandarin, and Arabic. The sources include broadcast news and newswire. For the required story link detection task, the research groups tested their systems on a processed version of the data in which the story boundaries have been manually identified, the Arabic and Mandarin stories have been automatically translated to English, and the broadcast news stories have been converted to text by an automatic speech recognition (ASR) system.</Paragraph> <Paragraph position="1"> A number of research groups have developed story link detection systems. The best current technology for link detection relies on cosine similarity between document term vectors with TF-IDF term weighting. In a TF-IDF model, the frequency of a term in a document (TF) is weighted by the inverse document frequency (IDF), the inverse of the number of documents containing the term. UMass (Allan et al., 2000) examined a number of similarity measures in the link detection task, including weighted sum, language modeling, and Kullback-Leibler divergence, and found that cosine similarity produced the best results. More recently, in Lavrenko et al. (2002), UMass found that the clarity similarity measure performed best for the link detection task. In this paper, we also examine a number of similarity measures, both separately, as in Allan et al. (2000), and in combination. In the machine learning field, classifier combination has been shown to provide accuracy gains (e.g., Belkin et al. (1995); Kittler et al.
(1998); Brill and Wu (1998); Dietterich (2000)). Motivated by the performance improvements observed in these studies, we explored combining similarity measures to improve story link detection. CMU hypothesized that the similarity between a pair of stories is influenced by the source of each story. For example, sources in a language that is translated to English will consistently use the same terminology, resulting in greater similarity between linked documents with the same native language. In contrast, sources from radio broadcasts may be transcribed much less consistently than text sources due to recognition errors, so that the expected similarity of a radio broadcast and a text source is less than that of two text sources. They found that similarity thresholds that depended on the type of the story-pair sources (e.g., English/non-English language and broadcast news/newswire) improved story link detection results by 15% (Carbonell et al., 2001). We also investigate how to make use of differences in similarity that depend on the types of sources composing a story pair.</Paragraph> <Paragraph position="2"> We refer to the statistics characterizing story pairs with the same source types as source-pair specific information. In contrast to the source-specific thresholds used by CMU, we normalize the similarity measures based on the source-pair specific information, simultaneously with combining different similarity measures.</Paragraph> <Paragraph position="3"> Other researchers have successfully used machine learning algorithms such as support vector machines (SVM) (Cristianini and Shawe-Taylor, 2000; Joachims, 1998) and boosted decision stumps (Schapire and Singer, 2000) for text categorization. SVM-based systems, such as that described by Joachims (1998), are typically among the best performers for the categorization task.
However, attempts to directly apply SVMs to TDT tasks such as tracking and link detection have not been successful; this has been attributed in part to the lack of sufficient data for training the SVM [1]. In these systems, the input was the set of term vectors characterizing each document, similar to the input used for the categorization task. [1] html, accessed Mar 11, 2004.</Paragraph> <Paragraph position="4"> In this paper, we present a method for using SVMs to improve link detection performance by combining heterogeneous input features, composed of multiple similarity metrics and a statistical characterization of the story sources. We additionally examine the utility of the statistical information by comparing against decision trees, where the statistical characterization is not utilized. We also examine the utility of the similarity values by comparing against voting, where the classifications based on the individual similarity measures are combined.</Paragraph> </Section></Paper>
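The TF-IDF cosine similarity described in the background above can be illustrated with a minimal sketch. It follows the literal definition given in the text (term frequency multiplied by the inverse of the number of documents containing the term); many practical systems use a log-scaled IDF instead. The function names, the whitespace tokenization, and the toy stories are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def tfidf_vector(tokens, df):
    """Weight each term frequency by 1 / document frequency.

    Terms missing from the df table are skipped (an assumption of this sketch).
    """
    tf = Counter(tokens)
    return {term: count / df[term] for term, count in tf.items() if term in df}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_score(story_a, story_b, df):
    """Similarity score for a story pair; higher suggests a link."""
    return cosine_similarity(tfidf_vector(story_a, df), tfidf_vector(story_b, df))


# Hypothetical toy collection, tokenized by whitespace.
stories = [
    "earthquake strikes coastal city".split(),
    "coastal city hit by strong earthquake".split(),
    "stock markets rally after rate cut".split(),
]
df = document_frequencies(stories)

if __name__ == "__main__":
    print(link_score(stories[0], stories[1], df))
    print(link_score(stories[0], stories[2], df))
```

In a link detection setting, a pair would be declared linked when its score exceeds a threshold; the source-pair dependent thresholds attributed to CMU above would correspond to choosing that threshold per source-pair type.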