<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1040"> <Title>Multiple Similarity Measures and Source-Pair Information in Story Link Detection</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 System Description </SectionTitle> <Paragraph position="0"> To determine whether two documents are linked, state-of-the-art link detection systems perform three primary processing steps: 1. preprocessing to create a normalized set of terms for representing each document as a vector of term counts, or term vector 2. adapting model parameters (i.e., IDF) as new story sets are introduced and computing the similarity of the term vectors 3. determining whether a pair of stories are linked based on the similarity score.</Paragraph> <Paragraph position="1"> In this paper, we describe our investigations in improving the basic story link detection systems by using source specific information and combining a number of similarity measures. As in the basic story link detection system, a similarity score between two stories is computed. In contrast to the basic story link detection system, a variety of similarity measures is computed and the prediction models use source-pair-specific statistics (i.e., median, average, and variance of the story pair similarity scores). We do this in a post-processing step using machine learning classifiers (i.e., SVMs, decision trees, or voting) to produce a decision with an associated confidence score as to whether a pair of stories are linked. Source-pair-specific statistics and multiple similarity measures are used as input features to the machine learning based techniques in post-processing the similarity scores. In the next sections, we describe the components and processing performed by our system.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Preprocessing </SectionTitle> <Paragraph position="0"> For preprocessing, we tokenize the data, remove stopwords, replace spelled-out numbers by digits, replace the tokens by their stems using the Inxight LinguistX morphological analyzer, and then generate a term-frequency vector to represent each story. For text where the original source is Mandarin, some of the terms are untranslated.</Paragraph> <Paragraph position="1"> In our experiments, we retain these terms because many are content words. Both the training data and test data are preprocessed in the same way.</Paragraph> <Paragraph position="2"> Our base stoplist is composed of 577 terms. We extend the stoplist with terms that are represented differently by ASR systems and text documents. For example, in the broadcast news documents in the TDT collection &quot;30&quot; is spelled out as &quot;thirty&quot; and &quot;CNN&quot; is represented as three separate tokens &quot;C&quot;, &quot;N&quot;, and &quot;N&quot;. To handle these differences, an &quot;ASR stoplist&quot; was automatically created. Chen et al. (2003) found that the use of an enhanced stoplist, formed from the union of a base stoplist and ASR stoplist, was very effective in improving performance and empirically better than normalizing ASR abbreviations.</Paragraph> <Paragraph position="3"> The training data is used to compute the initial document frequency over the corpus for each term. 
<Paragraph position="2"> The training data is used to compute the initial document frequency over the corpus for each term. The document frequency of term $t$, $df(t)$, is defined to be the number of documents that contain $t$. Separate document frequencies, $df(t)$, and document counts, $N$, are computed for each type of source.</Paragraph> <Paragraph position="3"> Our similarity calculations over documents are based on an incremental TF-IDF model. Term vectors are created for each story, and the vectors are weighted by the inverse document frequency, IDF, i.e., $\log(N/df(t))$. In the incremental model, $df(t)$ and $N$ are updated with each new set of stories in a source file. When the $k$-th set of test documents, $D_k$, is added to the model, the document term counts are updated as: $$df_k(t) = df_{k-1}(t) + df_{D_k}(t)$$ where $df_{D_k}(t)$ denotes the document count for term $t$ in the newly added set of documents $D_k$. The initial document counts $df_0(t)$ were generated from a training set. In a static TF-IDF model, new words (i.e., words that did not occur in the training set) are ignored in further computations. An incremental TF-IDF model uses the new vocabulary in similarity calculations, which is an advantage for the TDT task because new events often contain new vocabulary.</Paragraph> <Paragraph position="4"> Since very low frequency terms $t$ tend to be uninformative, we set a threshold $\theta_d$ such that only terms with $df_k(t) \ge \theta_d$ are used with sources up through $D_k$. For these experiments, we used $\theta_d = 2$.</Paragraph> <Paragraph position="5"> The document frequencies $df_k(t)$, i.e., the number of documents containing term $t$, and the document term frequencies $f(d,t)$ are used to calculate TF-IDF based weights: $$w_k(d,t) = \frac{1}{Z_k(d)} f(d,t) \log\frac{N_k}{df_k(t)}$$ where $N_k$ is the total number of documents and $Z_k(d)$ is a normalization value. For the Hellinger, Tanimoto, and clarity measures, it is computed as: $$Z_k(d) = \sum_t f(d,t) \log\frac{N_k}{df_k(t)}$$ </Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Similarity Measures </SectionTitle> <Paragraph position="0"> In addition to the cosine similarity measure used in baseline systems, we compute story-pair similarity over a set of measures, motivated by the accuracy gains obtained by others when combining classifiers (see Section 2). A vector composed of the similarity values is created and given to a trained classifier, which emits a score. The score can be used as a measure of confidence that the story pairs are linked.</Paragraph> <Paragraph position="1"> The similarity measures that we examined are cosine, Hellinger, Tanimoto, and clarity. Each of the measures captures a different aspect of the similarity of the terms in a document; classifier combination has been observed to perform best when the classifiers produce independent judgments. The cosine distance between the word distributions for documents $d_1$ and $d_2$ is: $$sim_{\cos}(d_1,d_2) = \frac{\sum_t w(d_1,t)\, w(d_2,t)}{\sqrt{\sum_t w(d_1,t)^2} \sqrt{\sum_t w(d_2,t)^2}}$$ This measure has been found to perform well and was used by all the TDT 2002 link detection systems (unpublished presentations at the TDT2002 workshop).</Paragraph> <Paragraph position="2"> In contrast to the Euclidean-distance-based cosine measure, the Hellinger measure is a probabilistic measure. The Hellinger measure between the word distributions for documents $d_1$ and $d_2$ is: $$sim_{Hel}(d_1,d_2) = \sum_t \sqrt{w(d_1,t) \cdot w(d_2,t)}$$ In Brants et al. (2002), the Hellinger measure was used in a text segmentation application and was found to be superior to the cosine similarity.</Paragraph>
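The incremental model and the first two weighted similarity measures can be summarized in code. The sketch below is an illustration under the definitions above, not the authors' implementation: it maintains $df_k(t)$ and $N_k$, applies the $\theta_d$ threshold, computes the normalized TF-IDF weights, and evaluates the cosine and Hellinger measures on term-frequency Counters.

```python
import math
from collections import Counter

class IncrementalTfIdf:
    def __init__(self, training_docs, theta_d=2):
        self.df = Counter()              # df_0(t) from the training set
        self.N = len(training_docs)
        self.theta_d = theta_d
        for doc in training_docs:        # docs are term-frequency Counters
            self.df.update(set(doc))

    def add_docs(self, docs):
        """df_k(t) = df_{k-1}(t) + df_{D_k}(t); N grows by |D_k|."""
        for doc in docs:
            self.df.update(set(doc))
        self.N += len(docs)

    def weights(self, tf: Counter) -> dict:
        # w(d,t) = f(d,t) log(N/df(t)) / Z(d), keeping terms with df >= theta_d
        raw = {t: f * math.log(self.N / self.df[t])
               for t, f in tf.items() if self.df[t] >= self.theta_d}
        z = sum(raw.values()) or 1.0     # Z(d), as used for Hellinger et al.
        return {t: v / z for t, v in raw.items()}

def cosine(w1: dict, w2: dict) -> float:
    dot = sum(v * w2.get(t, 0.0) for t, v in w1.items())
    n1 = math.sqrt(sum(v * v for v in w1.values()))
    n2 = math.sqrt(sum(v * v for v in w2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def hellinger(w1: dict, w2: dict) -> float:
    return sum(math.sqrt(v * w2[t]) for t, v in w1.items() if t in w2)

# Tiny usage example with three toy "documents".
model = IncrementalTfIdf([Counter(a=2, b=1), Counter(a=1, c=1), Counter(b=1, c=2)])
wa = model.weights(Counter(a=2, b=1))
wb = model.weights(Counter(a=1, c=1))
print(cosine(wa, wb), hellinger(wa, wb))
```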
<Paragraph position="3"> The Tanimoto measure (Duda and Hart, 1973) is a measure of the ratio of the number of terms shared by two documents to the number of terms possessed by only one of them. We modified it to use frequency counts, instead of a binary indicator of whether a term is present, and computed it as: $$sim_{Tan}(d_1,d_2) = \frac{\sum_t w(d_1,t)\, w(d_2,t)}{\sum_t w(d_1,t)^2 + \sum_t w(d_2,t)^2 - \sum_t w(d_1,t)\, w(d_2,t)}$$ </Paragraph> <Paragraph position="4"> The clarity measure was introduced by Croft et al. (2001) and shown to improve link detection performance by Lavrenko et al. (2002). It gets its name from the distance to general English, which is called Clarity. We used a symmetric version that is computed as: $$sim_{cla}(d_1,d_2) = -KL(d_1 \| d_2) + KL(d_1 \| GE) - KL(d_2 \| d_1) + KL(d_2 \| GE)$$ where $GE$ is the probability distribution of words for "general English" as derived from the training corpus, and KL is the Kullback-Leibler divergence: $$KL(d_1 \| d_2) = \sum_t p(t|d_1) \log\frac{p(t|d_1)}{p(t|d_2)}$$ In computing the clarity measure, the term frequencies were smoothed with the general English model using a weight of 0.01; this keeps the KL divergence defined when $p(t|d_1)$ or $p(t|d_2)$ is 0. The idea behind the clarity measure is to give credit to similar pairs of documents whose term distributions are very different from general English, and to discount similar pairs of documents whose term distributions are close to general English, which can be interpreted as being nontopical.</Paragraph> <Paragraph position="5"> We also defined the "source-pair normalized cosine" distance as the cosine distance normalized by dividing by the running median of the similarity values corresponding to the source pair: $$sim_{norm}(d_1,d_2) = \frac{sim_{\cos}(d_1,d_2)}{runmed(sim_{\cos}(s_i,s_j))}$$ where $runmed(sim_{\cos}(s_i,s_j))$ is the running median of the similarity values of all processed story pairs in which the source of $d_i$ is the same as that of $d_1$ and the source of $d_j$ is the same as that of $d_2$. This is a finer-grained use of source-pair information than that of CMU, which conditioned decision thresholds on whether or not the sources were cross-language or cross-ASR/newswire (Carbonell et al., 2001).</Paragraph> <Paragraph position="6"> In a base system employing a single similarity measure, the system computes the similarity measure for each story pair, and this score is given to the evaluation program (see Section 4.2).</Paragraph>
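To make two of the measures above concrete, the following sketch computes the symmetric clarity score, smoothing toward a general English distribution with the 0.01 weight mentioned in the text, and the source-pair normalized cosine via a per-source-pair running median. Distributions are plain dicts summing to 1; the exact smoothing details and the running-median bookkeeping are simplifying assumptions rather than the system's actual behavior.

```python
import math
import statistics
from collections import defaultdict

LAMBDA = 0.01   # smoothing weight toward the general English model

def kl(p: dict, q: dict, ge: dict) -> float:
    """KL(p || q), with q smoothed toward general English so it stays nonzero."""
    total = 0.0
    for t, p_t in p.items():
        q_s = (1 - LAMBDA) * q.get(t, 0.0) + LAMBDA * ge.get(t, 1e-9)
        total += p_t * math.log(p_t / q_s)
    return total

def clarity(p1: dict, p2: dict, ge: dict) -> float:
    # Credit pairs far from general English; discount nontopical, English-like pairs.
    return -kl(p1, p2, ge) + kl(p1, ge, ge) - kl(p2, p1, ge) + kl(p2, ge, ge)

class SourcePairNormalizedCosine:
    """Divides each cosine score by the running median for that source pair."""
    def __init__(self):
        self.history = defaultdict(list)

    def score(self, src1: str, src2: str, cos_score: float) -> float:
        key = tuple(sorted((src1, src2)))
        self.history[key].append(cos_score)
        med = statistics.median(self.history[key])
        return cos_score / med if med else 0.0

ge = {"the": 0.6, "news": 0.3, "election": 0.1}
p1 = {"election": 0.7, "news": 0.3}
p2 = {"election": 0.6, "the": 0.4}
print(clarity(p1, p2, ge))

norm = SourcePairNormalizedCosine()
print(norm.score("CNN_asr", "NYT_text", 0.42))
```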
</Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Improving Story Link Detection Performance </SectionTitle> <Paragraph position="0"> We examined a number of methods for improving link detection, including:
- comparing the 5 similarity measures alone,
- combining multiple similarity measures,
- using source-pair specific statistics, and
- alternative methods for combining scores.</Paragraph> <Paragraph position="1"> In contrast to earlier attempts that applied the machine learning categorization paradigm, using the term vectors as input features (Joachims, 1998), to the link detection task, we believed that document term vectors are too fine-grained for SVMs to develop good generalization from a limited amount of labeled training data. Furthermore, the use of terms as input to a learner, as was done in the categorization task (see Section 2), would require frequent retraining of a link detection system, since new stories often discuss new topics and introduce new terms. For our work, we instead used more general characteristics of a document pair, namely the similarities between the pair of documents, as input to the machine learning systems.</Paragraph> <Paragraph position="2"> Thus, in contrast to the term-based systems, the machine learning techniques are used in a post-processing step after the similarity scores are computed. Additionally, to normalize differences in expected similarity among pairs of source types, source-pair statistics are used as features in deciding whether two stories are linked and in estimating the confidence of the decision.</Paragraph> <Paragraph position="3"> In the next sections, we describe our methods for combining the similarity scores using machine learning techniques, and for combining the similarity scores with source-pair specific information.</Paragraph> <Paragraph position="4"> We used an SVM to combine sets of similarity measures for predicting whether two stories are linked because it theoretically has good generalization properties (Cristianini and Shawe-Taylor, 2000), it has been shown to be a competitive classifier for a variety of tasks (e.g., Cristianini and Shawe-Taylor, 2000; Gestal et al., 2000), and it makes full use of the similarity scores and statistical characterizations. We also show empirically in Section 4.3.2 that it provides better performance than decision trees and voting for this task. The SVM is first trained on a set of labeled data in which the input features are the sets of similarity measures and the class labels are the manually assigned decisions as to whether a pair of documents is linked. The trained model is then used to automatically decide whether a new pair of stories is linked. For the support vector machine, we used SVM-light (Joachims, 1999); a polynomial kernel was used in all the reported SVM experiments. In addition to making a decision as to whether two stories are linked, we use the value of the decision function produced by SVM-light as a measure of confidence, which serves as input to the evaluation program.</Paragraph> <Paragraph position="5"> Training SVM-light on a 20,000 story-pair training corpus usually requires less than five minutes on a 1.8 GHz Linux machine, although the time varies considerably with the corpus characteristics. Once the system is trained, however, scoring more than 20,000 new story pairs requires less than one minute.</Paragraph> <Paragraph position="6"> Source-pair-specific information that statistically characterizes each of the similarity measures is used in a post-processing step. In particular, we compute statistics from the training-data similarity scores for different combinations of source modalities and languages. The modality pairs that we considered are asr:asr, asr:text, and text:text, where asr denotes "automatic speech recognition". The language pairs that we used are English:English, English:Arabic, English:Mandarin, Arabic:Arabic, Arabic:Mandarin, and Mandarin:Mandarin.</Paragraph> <Paragraph position="7"> The rows of Table 1 represent possible combinations of source language for the story pairs; the columns represent different combinations of source modality. The alphabetic characters in the cells represent the pair similarity statistics of mean, median, and variance for that condition, obtained from the training corpus. For conditions where training data was not available, we used the statistics of a coarser grouping. For example, if there is no data for the cell with language pair Mandarin:Arabic and modality pair asr:asr, we would use the statistics from language pair non-English:non-English and modality pair asr:asr.</Paragraph>
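A hypothetical sketch of how the classifier input for one story pair might be assembled follows: the similarity scores are concatenated with the (mean, median, variance) statistics for the pair's source condition, backing off to the coarser language grouping when the exact cell of Table 1 is empty. The STATS table, its key scheme, and all numeric values are illustrative stand-ins, not values from the paper.

```python
# (mean, median, variance) per (language pair, modality pair); made-up values.
STATS = {
    ("English:English", "text:text"): (0.21, 0.18, 0.012),
    ("English:non-English", "asr:text"): (0.11, 0.09, 0.017),
    ("non-English:non-English", "asr:asr"): (0.15, 0.13, 0.020),
}

def coarsen(lang_pair: str) -> str:
    """Back off to the English / non-English grouping."""
    parts = ["English" if l == "English" else "non-English"
             for l in lang_pair.split(":")]
    return ":".join(sorted(parts))

def source_pair_stats(lang_pair: str, modality_pair: str):
    exact = STATS.get((lang_pair, modality_pair))
    if exact is not None:
        return exact
    return STATS[(coarsen(lang_pair), modality_pair)]   # coarser grouping

def feature_vector(similarities, lang_pair, modality_pair):
    # In the system, statistics exist per similarity measure; one triple is
    # appended here for brevity.
    return list(similarities) + list(source_pair_stats(lang_pair, modality_pair))

# e.g., [cosine, hellinger, tanimoto, clarity] for a Mandarin:Arabic ASR pair.
print(feature_vector([0.40, 0.31, 0.22, 1.05], "Mandarin:Arabic", "asr:asr"))
```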
<Paragraph position="8"> Prior to use in link detection, an SVM is trained on a set of features computed for each story pair. These include the similarity measures described in Section 3.2 and the corresponding source-pair-specific statistics (average, median, and variance) for each similarity measure. The motivation for using the statistical values is to inform the SVM about the type of source pair under consideration: rather than acting as categorical labels, the source-pair statistics provide a natural ordering of the source-pair types and can be used for normalization. When a new pair of stories is post-processed, the computed similarity measures and the corresponding source-pair statistics are used as input to the trained SVM.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3.3 Other Methods for Combining Similarity Scores </SectionTitle> <Paragraph position="0"> In addition to SVMs, we investigated the utility of decision trees (Breiman et al., 1984) and majority voting (Kittler et al., 1998) as techniques for combining similarity measures and statistical information in a post-processing step. The simplest method that we examined for combining similarity scores is to create a separate classifier for each similarity measure and then to classify based on a combination of the votes of the different classifiers (Kittler et al., 1998). This method does not use the statistical information; the single-measure classifiers use an empirically determined threshold based on training data.</Paragraph> <Paragraph position="1"> Decision trees and SVMs are classifiers that use the similarity scores directly. Decision trees such as C4.5 easily handle categorical data; in our experiments, we noted that although the source-pair-specific statistics were provided as input features, the decision trees treated this statistical information as categorical features. For the decision trees, we used the WEKA implementation of C4.5 (Witten and Frank, 1999).</Paragraph>
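Finally, a minimal sketch of the majority-voting scheme described here: one thresholded classifier per similarity measure, with the final decision taken by vote. The threshold values below are made up for illustration; in the system they were determined empirically from training data.

```python
def vote_linked(similarities, thresholds) -> bool:
    """One vote per measure; the pair is 'linked' if a majority of the
    per-measure classifiers exceed their empirically set thresholds."""
    votes = sum(1 for s, th in zip(similarities, thresholds) if s >= th)
    return votes > len(similarities) / 2

# Five measures: cosine, Hellinger, Tanimoto, clarity, normalized cosine.
print(vote_linked([0.35, 0.28, 0.12, 0.90, 1.40],
                  [0.30, 0.25, 0.15, 0.80, 1.00]))   # -> True (4 of 5 vote yes)
```

</Section> </Section> </Paper>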