Multiple Similarity Measures and Source-Pair Information
in Story Link Detection
Francine Chen Ayman Farahat
Palo Alto Research Center
3333 Coyote Hill Rd.
Palo Alto, CA 94304
a0 fchen, farahat
a1 @parc.com, thorsten@brants.net
Thorsten Brants
Abstract
State-of-the-art story link detection systems,
that is, systems that determine whether two sto-
ries are about the same event or linked, are usu-
ally based on the cosine-similarity measured
between two stories. This paper presents a
method for improving the performance of a link
detection system by using a variety of simi-
larity measures and using source-pair specific
statistical information. The utility of a num-
ber of different similarity measures, including
cosine, Hellinger, Tanimoto, and clarity, both
alone and in combination, was investigated.
We also compared several machine learning
techniques for combining the different types
of information. The techniques investigated
were SVMs, voting, and decision trees, each
of which makes use of similarity and statisti-
cal information differently. Our experimental
results indicate that the combination of similar-
ity measures and source-pair specific statistical
information using an SVM provides the largest
improvement in estimating whether two stories
are linked; the resulting system was the best-
performing link detection system at TDT-2002.
1 Introduction
Story link detection, as defined in the Topic Detection and
Tracking (TDT) competition sponsored by the DARPA
TIDES program, is the task of determining whether two
stories, such as news articles and/or radio broadcasts, are
about the same event, or linked. In TDT an event is de-
fined as “something that happens at some specific time
and place” (TDT, 2002). For example, a story about a tor-
nado in Kansas in May and another story about a tornado
in Nebraska in June should not be classified as linked be-
cause they are about different events, although they both
fall under the same general “topic” of natural disasters.
But a story about damage due to a tornado in Kansas and
a story about the clean-up and repairs due to the same tor-
nado in Kansas are considered linked events.
In the TDT link detection task, a link detection sys-
tem is given a sequence of time-ordered sets of stories,
where each set is from one news source. The system can
“look ahead” N source files from the current source file
being processed when deciding whether the current pair
is linked. Because the TDT link detection task is focused
on streams of news stories, one of the primary differences
between link detection and the more traditional IR catego-
rization task is that new events occur relatively frequently
and comparisons of interest are focused on events that are
not known in advance. One consequence of this is that the
best-performing systems usually adapt to new input. Link
detection is thought of as the basis for other event-based
topic analysis tasks, such as topic tracking, topic detec-
tion, and first-story detection (TDT, 2002).
2 Background and Related Work
The DARPA TDT story link detection task requires iden-
tifying pairs of linked stories. The original language of
the stories are in English, Mandarin and Arabic. The
sources include broadcast news and newswire. For the
required story link detection task, the research groups
tested their systems on a processed version of the data in
which the story boundaries have been manually identified,
the Arabic and Mandarin stories have been automatically
translated to English, and the broadcast news stories have
been converted to text by an automatic speech recognition
(ASR) system.
A number of research groups have developed story
link detection systems. The best current technology for
link detection relies on the use of cosine similarity be-
tween document terms vectors with TF-IDF term weight-
ing. In a TF-IDF model, the frequency of a term in a docu-
ment (TF) is weighted by the inverse document frequency
(IDF), the inverse of the number of documents containing
a term. UMass (Allan et al., 2000) has examined a num-
ber of similarity measures in the link detection task, in-
cluding weighted sum, language modeling and Kullback-
Leibler divergence, and found that the cosine similarity
produced the best results. More recently, in Lavrenko et
al. (2002), UMass found that the clarity similarity mea-
sure performed best for the link detection task. In this
paper, we also examine a number of similarity measures,
both separately, as in Allan et al. (2000), and in combina-
tion. In the machine learning field, classifier combination
has been shown to provide accuracy gains (e.g., Belkin et
al.(1995); Kittler et al. (1998); Brill and Wu (1998); Di-
etterich (2000)). Motivated by the performance improve-
ment observed in these studies, we explored the combina-
tion of similarity measures for improving Story Link De-
tection.
CMU hypothesized that the similarity between a pair
of stories is influenced by the source of each story. For
example, sources in a language that is translated to En-
glish will consistently use the same terminology, result-
ing in greater similarity between linked documents with
the same native language. In contrast, sources from radio
broadcasts may be transcribed much less consistently than
text sources due to recognition errors, so that the expected
similarity of a radio broadcast and a text source is less than
that of two text sources. They found that similarity thresh-
olds that were dependent on the type of the story-pair
sources (e.g., English/non-English language and broad-
cast news/newswire) improved story-link detection re-
sults by 15% (Carbonell et al., 2001). We also investigate
how to make use of differences in similarity that are de-
pendent on the types of sources composing a story pair.
We refer to the statistics characterizing story pairs with the
same source types as source-pair specific information. In
contrast to the source-specific thresholds used by CMU,
we normalize the similarity measures based on the source-
pair specific information, simultaneously with combining
different similarity measures.
Other researchers have successfully used machine
learning algorithms such as support vector machines
(SVM) (Cristianini and Shawe-Taylor, 2000; Joachims,
1998) and boosted decision stumps (Schapire and Singer,
2000) for text categorization. SVM-based systems, such
as that described in (Joachims, 1998), are typically among
the best performers for the categorization task. However,
attempts to directly apply SVMs to TDT tasks such as
tracking and link detection have not been successful; this
has been attributed in part to the lack of enough data for
training the SVM1. In these systems, the input was the
set of term vectors characterizing each document, similar
to the input used for the categorization task. In this pa-
1http://www.ldc.upenn.edu/Projects/TDT3/email/email 402.
html, accessed Mar 11, 2004.
per, we present a method for using SVMs to improve link
detection performance by combining heterogeneous in-
put features, composed of multiple similarity metrics and
statistical characterization of the story sources. We addi-
tionally examine the utility of the statistical information
by comparing against decision trees, where the statistical
characterization is not utilized. We also examine the util-
ity of the similarity values by comparing against voting,
where the classification based on each similarity measure
is combined.
3 System Description
To determine whether two documents are linked, state-of-
the-art link detection systems perform three primary pro-
cessing steps:
1. preprocessing to create a normalized set of terms
for representing each document as a vector of term
counts, or term vector
2. adapting model parameters (i.e., IDF) as new story
sets are introduced and computing the similarity of
the term vectors
3. determining whether a pair of stories are linked
based on the similarity score.
In this paper, we describe our investigations in improv-
ing the basic story link detection systems by using source
specific information and combining a number of similar-
ity measures. As in the basic story link detection system, a
similarity score between two stories is computed. In con-
trast to the basic story link detection system, a variety of
similarity measures is computed and the prediction mod-
els use source-pair-specific statistics (i.e., median, aver-
age, and variance of the story pair similarity scores). We
do this in a post-processing step using machine learning
classifiers (i.e., SVMs, decision trees, or voting) to pro-
duce a decision with an associated confidence score as to
whether a pair of stories are linked. Source-pair-specific
statistics and multiple similarity measures are used as in-
put features to the machine learning based techniques in
post-processing the similarity scores. In the next sections,
we describe the components and processing performed by
our system.
3.1 Preprocessing
For preprocessing, we tokenize the data, remove stop-
words, replace spelled-out numbers by digits, replace the
tokens by their stems using the Inxight LinguistX mor-
phological analyzer, and then generate a term-frequency
vector to represent each story. For text where the original
source is Mandarin, some of the terms are untranslated.
In our experiments, we retain these terms because many
are content words. Both the training data and test data are
preprocessed in the same way.
3.1.1 Stop Words
Our base stoplist is composed of 577 terms. We extend
the stoplist with terms that are represented differently by
ASR systems and text documents. For example, in the
broadcast news documents in the TDT collection “30” is
spelled out as “thirty” and “CNN” is represented as three
separate tokens “C”, “N”, and “N”. To handle these differ-
ences, an “ASR stoplist” was automatically created. Chen
et al. (2003) found that the use of an enhanced stoplist,
formed from the union of a base stoplist and ASR stoplist,
was very effective in improving performance and empir-
ically better than normalizing ASR abbreviations.
3.1.2 Source-specific Incremental TF-IDF Model
The training data is used to compute the initial docu-
ment frequency over the corpus for each term. The docu-
ment frequency of terma0 ,a1a3a2a5a4a6a0a8a7 is defined to be:
a1a3a2a5a4a6a0a8a7a10a9a12a11
a4a6a0a8a7
a13 a9
a14a16a15
a2a17a1
a15a19a18a21a20
a0a23a22a25a24a25a26a27a0
a15a19a18a21a18a29a28
a24
a20a31a30
a11a14a32a15
a2a17a1
a15a19a18a33a20a34a30
a11
a18a21a15
a24a36a35
a28a37a20
Separate document term counts,
a11
a4a6a0a8a7 , and document
counts,a13 , are computed for each type of source.
Our similarity calculations of documents are based on
an incremental TF-IDF model. Term vectors are created
for each story, and the vectors are weighted by the in-
verse document frequency, IDF, i.e.,a38a40a39a42a41a44a43a45a47a46a49a48a51a50 . In the in-
cremental model,
a11
a4a51a0a8a7 and
a13 are updated with each new
set of stories in a source file. When the a52a48a6a53 set of test
documents,a54a34a55 , is added to the model, the document term
counts are updated as:
a11
a55a4a6a0a8a7a56a9
a11
a55a25a57a59a58a4a6a0a8a7a61a60
a11a59a62a64a63
a4a51a0a8a7
where
a11 a62 a63
a46a49a48a51a50 denotes the document count for term
a0 in the
newly added set of documentsa54a34a55 . The initial document
counts
a11a66a65
a4a51a0a8a7 were generated from a training set. In a static
TF-IDF model, new words (i.e., those words, that did not
occur in the training set) are ignored in further computa-
tions. An incremental TF-IDF model uses the new vocab-
ulary in similarity calculations, which is an advantage for
the TDT task because new events often contain new vo-
cabulary.
Since very low frequency terms a0 tend to be uninfor-
mative, we set a threshold a67a42a68 such that only terms with
a1a69a2
a62a37a63
a4a51a0a8a7a71a70a72a67a19a68 are used with sources up througha54a34a55 . For
these experiments, we useda67a73a68a74a9a27a75 .
3.1.3 Term Weighting
The document frequencies,a1a69a2a5a4a51a0a8a7 , the number of doc-
uments containing term a0 , and document term frequen-
cies,a2a5a4a76a1a78a77a79a0a8a7 , are used to calculate TF-IDF based weights
a80
a4a6a0a29a77a79a1a47a7 for the terms in a document (or story)a1 :
a80
a4a6a1a81a77a8a0a8a7a10a9
a82
a83
a4a76a1a84a7
a2a5a4a6a1a81a77a8a0a8a7a5a85a21a38a86a39a73a41
a13
a11
a4a51a0a8a7
(1)
where a13 is the total number of documents and a83 a4a6a1a47a7 is
a normalization value. For the Hellinger, Tanimoto, and
clarity measures, it is computed as:
a83
a4a76a1a84a7a56a9a88a87
a48
a2a5a4a6a1a81a77a8a0a8a7a89a85a21a38a40a39a42a41
a13
a11
a4a51a0a8a7
(2)
For cosine distance it is computed as:
a83
a4a6a1a47a7a56a9
a90a91
a91
a92
a87
a48
a93
a2a5a4a6a1a81a77a8a0a8a7a5a85a94a38a40a39a42a41
a13
a48
a11
a4a51a0a8a7a96a95a98a97
(3)
3.2 Similarity Measures
In addition to the cosine similarity measure used in base-
line systems, we compute story pair similarity over a set
of measures, motivated by the accuracy gains obtained
by others when combining classifiers (see Section 2). A
vector composed of the similarity values is created and
is given to a trained classifier, which emits a score. The
score can be used as a measure of confidence that the story
pairs are linked.
The similarity measures that we examined are cosine,
Hellinger, Tanimoto, and clarity. Each of the measures
captures a different aspect of the similarity of the terms
in a document. Classifier combination has been observed
to perform best when the classifiers produce independent
judgments. The cosine distance between the word distri-
bution for documentsa1a98a58 anda1
a97
is computed as:
a20a21a30
a26a99a4a6a1a58a77a36a1
a97
a7a10a9 a87
a48
a80
a4a76a1a58a77a8a0a8a7a5a85
a80
a4a6a1
a97
a77a8a0a8a7
This measure has been found to perform well and was
used by all the TDT 2002 link detection systems (unpub-
lished presentations at the TDT2002 workshop).
In contrast to the Euclidean distance based cosine mea-
sure, the Hellinger measure is a probabilistic measure.
The Hellinger measure between the word distributions for
documentsa1a98a58 anda1
a97
is computed as:
a20a33a30
a26a100a4a76a1a58a77a79a1
a97
a7a56a9 a87
a48a72a101
a80
a4a6a1a58a77a79a0a8a7a5a85
a80
a4a76a1
a97
a77a8a0a8a7
wherea0 ranges over the terms that occur ina1 a58 ora1
a97
. In
Brants et al. (2002), the Hellinger measure was used in a
text segmentation application and was found to be supe-
rior to the cosine similarity.
The Tanimoto (Duda and Hart, 1973) measure is a mea-
sure of the ratio of the number of shared terms between
two documents to the number possessed by one document
only. We modified it to use frequency counts, instead of a
binary indicator as to whether a term is present and com-
puted it as:
a102a104a103a51a105a107a106a109a108a47a110a104a111a69a108a42a112a36a113a115a114 a116a118a117a19a119
a106a109a108a110a111a3a120a23a113a37a121
a119
a106a109a108a112a111a3a120a69a113
a116 a117a122a119
a106a109a108a47a110a104a111a3a120a69a113
a112a115a123
a119
a106a109a108a42a112a21a111a3a120a23a113
a112a89a124
a119
a106a109a108a47a110a36a111a3a120a23a113a37a121
a119
a106a109a108a42a112a33a111a3a120a23a113a122a125
The clarity measure was introduced by Croft et al.
(2001) and shown to improve link detection performance
by Lavrenko et al. (2002). It gets its name from the dis-
tance to general English, which is called Clarity. We used
a symmetric version that is computed as:
a20a33a30
a26a99a4a6a1a58a96a77a79a1
a97
a7 a9
a1
a2a4a3
a4
a80
a4a51a0a29a77a79a1 a58a94a7a6a5a7a5
a80
a4a6a0a29a77a79a1
a97
a7a79a7
a60
a2a4a3
a4
a80
a4a51a0a29a77a36a1a47a58a33a7a8a5a9a5a10a12a11 a7
a1
a2a4a3
a4
a80
a4a51a0a29a77a36a1
a97
a7a8a5a9a5
a80
a4a51a0a29a77a79a1a58a7a8a7
a60
a2a4a3
a4
a80
a4a51a0a29a77a36a1
a97
a5a7a5a10a12a11 a7
where a10a12a11 is the probability distribution of words for
“general English” as derived from the training corpus,
and KL is the Kullback-Leibler divergence: a2a4a3 a4a49a35a13a5a9a5a14a122a7 a9
a116a16a15
a35a115a4a18a17a37a7a20a19
a15a22a21a24a23
a46
a15
a50
a25a29a46
a15
a50a27a26 In computing the clarity measure, the
term frequencies were smoothed with the General English
model using a weight of 0.01. This enables the KL diver-
gence to be defined whena80 a4a51a0a29a77a36a1 a58a7 ora80 a4a51a0a29a77a36a1
a97
a7 is 0. The
idea behind the clarity measure is to give credit to similar
pairs of documents with term distributions that are very
different from general English, and to discount similar
pairs of documents with term distributions that are close
to general English, which can be interpreted as being non-
topical.
We also defined the “source-pair normalized cosine”
distance as the cosine distance normalized by dividing by
the running median of the similarity values corresponding
to the source-pair:
a20a33a30
a26a100a4a76a1a58a77a79a1
a97
a7a56a9
a116
a48
a80
a4a6a1 a58a42a77a8a0a8a7a29a28
a80
a4a76a1
a97
a77a79a0a8a7
a30a32a31a22a33a35a34a37a36a27a38
a4
a20a21a30
a26 a68a40a39a42a41a68a44a43a21a7
where a30a45a31a6a33a35a34a37a36a27a38 a4a20a33a30a26 a68a39a41a68a43a21a7 is the running median of the
similarity values of all processed story pairs where the
source ofa1a47a46 is the same asa1 a58 and the source ofa1a49a48 is the
same asa1
a97
. This is a finer-grained use of source pair infor-
mation than what was used by CMU, which used decision
thresholds conditioned on whether or not the sources were
cross-language or cross-ASR/newswire (Carbonell et al.,
2001).
In a base system employing a single similarity measure,
the system computes the similarity measure for each story
pair, which is given to the evaluation program (see Sec-
tion 4.2).
3.3 Improving Story Link Detection Performance
We examined a number of methods for improving link de-
tection, including:
a50 compare the 5 similarity measures alone
a50 combine subsets of similarity scores using a support
vector machine (SVM)
a50 combine source-pair statistics with the correspond-
ing similarity score using an SVM, for each of the 5
similarity measures
a50 combine subsets of similarity scores with source pair
information using an SVM
a50 compare SVMs, decision trees, and majority voting
as alternative methods for combining scores
In contrast to earlier attempts that applied the machine
learning categorization paradigm of using the term vec-
tors as input features (Joachims, 1998) to the link detec-
tion task, we believed that the use of document term vec-
tors is too fine-grained for the SVMs to develop good gen-
eralization with a limited amount of labeled training data.
Furthermore, the use of terms as input to a learner, as was
done in the categorization task (see Section 2), would re-
quire frequent retraining of a link detection system since
new stories often discuss new topics and introduce new
terms. For our work, we used more general character-
istics of a document pair, the similarity between a pair
of documents, as input to the machine learning systems.
Thus, in contrast to the term-based systems, the machine
learning techniques are used in a post-processing step af-
ter the similarity scores are computed. Additionally, to
normalize differences in expected similarity among pairs
of source types, source-pair statistics are used as features
in deciding whether two stories are linked and in estimat-
ing the confidence of the decision.
In the next sections, we describe our methods for
combining the similarity scores using machine learning
techniques, and for combining the similarity scores with
source-pair specific information.
3.3.1 Combining Similarity Scores with SVMs
We used an SVM to combine sets of similarity mea-
sures for predicting whether two stories are linked be-
cause theoretically it has good generalization properties
(Cristianini and Shawe-Taylor, 2000), it has been shown
to be a competitive classifier for a variety of tasks (e.g.,
(Cristianini and Shawe-Taylor, 2000; Gestal et al., 2000),
and it makes full use of the similarity scores and statistical
characterizations. We also empirically show in Section
4.3.2 that it provides better performance than decision
trees and voting for this task. The SVM is first trained on
a set of labeled data where the input features are the sets of
similarity measures and the class labels are the manually
assigned decisions as to whether a pair of documents are
linked. The trained model is then used to automatically
decide whether a new pair of stories are linked. For the
support vector machine, we used SVM-light (Joachims,
1999). A polynomial kernel was used in all the reported
SVM experiments. In addition to making a decision as to
whether two stories are linked, we use the value of the de-
cision function produced by SVM-light as a measure of
Table 1: Source Pair Groups
asr:asr asr:text text:text
English:English a b c
English:Arabic d e f
English:Mandarin g h i
Arabic:Arabic j k l
Arabic:Mandarin m n o
Mandarin:Mandarin p q r
confidence, which serves as input to the evaluation pro-
gram.
Training SVM-light on a 20,000 story-pair training cor-
pus usually requires less than five minutes on a 1.8 GHz
Linux machine, although the time is quite variable de-
pending on the corpus characteristics. However, once the
system is trained, testing new story pair similarities re-
quires less than 1 min for over 20,000 story pairs.
3.3.2 Source-Pair Specific Information
Source-pair-specific information that statistically char-
acterizes each of the similarity measures is used in a post-
processing step. In particular, we compute statistics from
the training data similarity scores for different combina-
tions of source modalities and languages. The modal-
ity pairs that we considered are: asr:asr, asr:text, and
text:text, where asr represents “automatic speech recog-
nition”. The combinations of languages that we used
are: English:English, English:Arabic, English:Mandarin,
Arabic:Arabic, Arabic:Mandarin, Mandarin:Mandarin.
The rows of Table 1 represent possible combinations
of source language for the story pairs; the columns rep-
resent different combinations of source modality. The al-
phabetic characters in the cells represent the pair similar-
ity statistics of mean, median, and variance for that con-
dition obtained from the training corpus. For conditions
where training data was not available, we used the statis-
tics of a coarser grouping. For example, if there is no data
for the cell with languages Mandarin:Arabic and modal-
ity pair asr:asr, we would use statistics from the language
pair non-English:non-English and modality pair asr:asr.
Prior to use in link detection, an SVM is trained on a set
of features computed for each story pair. These include
the similarity measures described in Section 3.2 and cor-
responding source-pair specific statistics (average, me-
dian and variance) for the similarity measures. The mo-
tivation for using the statistical values is to inform the
SVM about the type of source pairs that are being con-
sidered. Rather than using categorical labels, the source-
pair statistics provide a natural ordering to the source-pair
types and can be used for normalization. When a new pair
of stories is post-processed, the computed similarity mea-
sures and the corresponding source-pair statistics are used
as input to the trained SVM.
3.3.3 Other Methods for Combining Similarity
Scores
In addition to SVMs, we investigated the utility of de-
cision trees (Breiman et al., 1984) and majority voting
(Kittler et al., 1998) as techniques to combine similarity
measures and statistical information in a post-processing
step. The simplest method that we examined for combin-
ing similarity scores is to create a separate classifier for
each similarity measure and then classify based a combi-
nation of the votes of the different classifiers (Kittler et
al., 1998). This method does not utilize statistical infor-
mation. The single measure classifiers use an empirically
determined threshold based on training data.
Decision trees and SVMs are classifiers that use the
similarity scores directly. Decision trees such as C4.5
easily handle categorical data. In our experiments, we
noted that although source-pair specific statistics were
used as an input feature to the decision tree, the decision
trees treated the source-pair based statistical information
as categorical features. For the decision trees we used
the WEKA implementation of C4.5 (Witten and Frank,
1999).
4 Experiments
We conducted a set of experiments to compare the utility
of combining similarity measures and the use of normal-
ization statistics. We also compared the utility of different
statistical learners.
4.1 Corpora
For our studies, we used corpora developed by the LDC
for the TDT tasks (Cieri et al, 2003). The TDT3 corpus
contains 40,000 news articles and broadcast news stories
with 120 labeled events in English, Mandarin and Ara-
bic from October through December 1998. For our com-
parative evaluations, we initialized the document term
counts and document counts using the TDT2 data (from
TDT 1998). Our “post-processing” system was trained
on the TDT3pub partition of the TDT2001 story pairs,
and tested on the TDT3unp partition of the TDT2001
test story pairs. The source-pair statistics were computed
from the linked story pairs in the TDT2002 dry run test
set. There are 20,966, 27,541, and 20,191 labeled story
pairs in TDT3pub, TDT2002 dry run, and TDT3unp, re-
spectively. The preprocessing and similarity computa-
tions do not require training, although adaptation is per-
formed by incrementally updating the document counts,
document frequencies, and the source-specific similari-
ties, and using the updated values in the computations.
The training data is used to compute similarity data for
training the post-processing systems.
Table 2: Topic-weighted Min Detection Cost for Different
Systems
system min DET
A 0.2368
B 0.3439
C 0.3175
D 0.2606
E 0.2342
4.2 Evaluation Measures
The goal of link detection is to minimize the cost, or
penalty, due to errors by the system. The TDT tasks are
evaluated by computing a “detection cost”:
a54a1a0a3a2
a48
a9a118a54a5a4 a46a7a6a8a6 a85
a9
a4 a46a10a6a11a6 a85
a9
a48a13a12a15a14a17a16
a2
a48
a60a71a54a1a18a20a19 a85
a9
a18a20a19a74a85
a9
a45a22a21a79a45
a57
a48a13a12a15a14a17a16
a2
a48
wherea54a5a4 a46a7a6a8a6 is the cost of a miss, a9 a4 a46a7a6a8a6 is the estimated
probability of a miss, a9 a48a13a12a15a14a17a16 a2a48 is the prior probability that
a pair of stories are linked, a54 a18a23a19 is the cost of a false
alarm, a9 a18a23a19 is the estimated probability of a false alarm,
and a9 a45a24a21a8a45a57 a48a13a12a15a14a17a16 a2a48 is the prior probability that a pair of sto-
ries are not linked. A miss occurs when a linked story pair
is not identified as linked by the system. A false alarm oc-
curs when a pair of stories that are not linked are identified
as linked by the system. A target is a pair of linked sto-
ries; conversely, a non-target is a pair of stories that are
not linked. For the link detection task these parameters
are set as follows: a54a25a4 a46a10a6a8a6 is 1.0, a9 a48a13a12a26a14a11a16 a2a48 is 0.02, anda54a5a18a20a19
is 0.1. The cost for each topic is equally weighted (i.e.,
the cost is topic-weighted, rather than story-weighted)and
normalized so that for a given system, “a4a54a5a0a27a2a48a7
a43
a21a17a14a17a28 can
be no less than one without extracting information from
the source data” (TDT, 2002):
a106a13a29a31a30a20a32
a117
a113a34a33a23a35a11a36a8a37 a114
a29 a30a20a32
a117
a38a40a39a42a41a78a106a13a29a31a43a45a44a42a46a47a46a5a121a15a48
a117a10a49
a36a34a50a51a32
a117
a111a11a29a31a52a54a53 a121a15a48a56a55a57a35a17a55a59a58
a117a10a49
a36a8a50a17a32
a117
a113
and a54a1a0a3a2a48a13a21a11a60 a2 a14a61a12a61a62a63a62 a9
a116
a46
a4a76a54
a46
a0a3a2
a48a7
a43
a21a11a14a51a28a65a64
a14
a0
a15
a35
a30a69a18a21a20 where
the sum is over topicsa30. A detection curve (DET curve)
is computed by sweeping a threshold over the range of
scores, and the minimum cost over the DET curve is iden-
tified as the minimum detection cost or min DET. The
topic-weighted DET cost, or score, is dependent on both
a good minimum cost over the DET curve, and a good
method for selecting an operating point, which is usually
implemented by selecting a threshold. A system with a
very low min DET score can have a much larger topic-
weighted DET score. Therefore, we focus on the mini-
mum DET score for our experiments.
4.3 Results
We present results comparing the utility of different
similarity measures and use of source-pair statistics, as
Table 3: Topic-weighted Min Detection Cost: Combined
Similarity Measures and Source-Pair Specific Informa-
tion (baseline system performance shown in bold)
min DET Cost
similarity measures used source-pair info used?
no yes
cos 0.2801 0.2532
normcos 0.2732 0.2533
Hel 0.3216 0.2657
Tan 0.3008 0.2748
cla 0.2706 0.2496
cos, Hel 0.2791 0.2467
normcos, cla 0.2631 0.2462
cos, normcos, cla 0.2626 0.2430
Hel, normcos, Tan 0.2714 0.2429
cos, normcos, Hel 0.2725 0.2421
cos, Hel, Tan, cla 0.2615 0.2452
cos, normcos, Hel, Tan 0.2736 0.2418
cos, normcos, Hel, cla 0.2614 0.2431
cos, normcos, Tan, cla 0.2623 0.2431
cos, normcos, Hel, Tan, cla 0.2608 0.2431
well as different learners for combining the similarity
measures and source-pair statistics. Our system was the
best performing Link Detection system at TDT2002.
We cannot compare our results with the other TDT2002
Link Detection systems because participants in TDT
agree not to publish results from another site. We did
not participate in TDT2001, but can compare our best
system on the TDT2001 test data (TDT3unp) against
the results of other TDT2001 systems (we extracted
the Primary Link Detection results from slides from the
TDT 2001 workshop, available (as of Mar 11, 2004) at:
http://www.itl.nist.gov/iaui/894.01/tests/tdt/tdt2001/Pape
rPres/Nist-pres/NIST-presentation-v6 files/v3 document
.htm. These results are shown in Table 2. The minimum
cost for our system, a11 , is less (better) than that of the
TDT2001 systems. For this comparison, we set the bias
parameter to 0.2, which reflects the expected number of
linked stories computed from the training data. For the
following comparative results, we set the bias parameter
to reflect the probability of a linked story specified in the
task definition (TDT2002), which resulted in a somewhat
higher cost.
4.3.1 Comparison of Similarity Measures and
Utility of Source-Pair Information
In this section, the effect of combining similarity met-
rics using an SVM and the effect of using source-pair
information is examined. The results in Table 3 are di-
vided into four sections. The upper sections show perfor-
mance as measured by min DET for single similarity met-
rics and the lower sections show performance for com-
bined similarity metrics using an SVM. The columns la-
beled “source-pair info used?” indicate whether source-
pair specific statistics were used as input features to the
SVM. The cosine, normalized cosine, Hellinger, Tani-
moto, and clarity similarity measures are represented as
“cos”, “normcos”, “Hel”, “Tan”, and “cla”, respectively.
The baseline model for comparison is the normalized co-
sine similarity without source-pair information (bolded),
which is very similar to the most successful story link de-
tection models (Carbonell et al., 2001; Allan et al., 2002).
To assess whether the observed differences were signifi-
cant, we compared models at the .005 significance level
using a paired two-sided t-test where the data was ran-
domly partitioned into 10 mutually exclusive sets.
In the upper sections, note that the clarity measure
with source-pair specific statistics exhibits the best per-
formance of the five measures, and is competitive with the
combination measures in the lower right of the table. Note
that the best-performing combination (italicized) did not
include clarity, which may be due in part to redundancy
with the other measures (Kittler et al., 1998). Compared
to the normalized cosine performance of 0.2732, the im-
proved performance of the cosine and normalized cosine
measures when source-pair specific information is used
(0.2532 and 0.2533, respectively; p a0 .005 for both com-
parisons) indicates that simple threshold normalization by
the running mean is not optimal.
Comparison of the upper and lower sections of the table
indicates that combination of similarity measures gener-
ally yields somewhat improved performance over single
similarity link detection systems; the difference between
the best upper model vs the best lower model, i.e., “cla”’
vs “cos, normcos, Hel, Tan, cla” without source-pair info,
and “cla” vs “cos, normcos, Hel, Tan” with source-pair
info, was significant at p a0 .005 for both comparisons.
And comparison of the left and right sections of the ta-
ble indicates that the use of source-pair specific statistics
noticeably improves performance (all models significant
at p a0 .005). The lower right section of Table 3 shows a
generally modest improvement over the best single metric
(i.e., clarity) when using a combination of features with
source-pair information.
These results indicate that although combination of
similarity measures improves performance, the use of
source-pair specific statistics are a larger factor in im-
proving performance. The SVM effectively uses the the
source-pair information to normalize and combine the
scores. Once the scores have been normalized, for some
measures there is little additional information to be gained
from adding additional features, although the combina-
tion of at least two measures removes the necessity of se-
lecting the “best” measure. For reference to the IR met-
rics of precision and recall, we present the results for a
Table 4: Precision and Recall: Combined Similarity Mea-
sures and Source-Pair Specific Information
similarity measures used source pair precision recall
info used?
cos no 87.45 85.33
cla yes 87.07 88.06
cos, Hel yes 88.83 86.75
cos, normcos, Hel, Tan yes 88.37 87.17
Table 5: Topic-weighted Min Detection Cost: Different
Learners for Combining Similarities
models
similarity measures used voting decision SVM
tree
cos, normcos, Hel 0.2802 0.2708 0.2421
cos, normcos, Hel, Tan 0.2810 0.2516 0.2418
cos, normcos, Hel, Tan, cla 0.2632 0.2574 0.2431
selected number of conditions in Table 4.
4.3.2 Comparison of Combination Models
We also investigated the use of other methods for com-
bining similarity measures and using source-pair specific
information. Table 5 compares the performance of voting,
a C4.5 decision tree, and an SVM. Three sets of similarity
measures were compared: 1) cosine, normalized cosine,
and Hellinger, 2) cosine, normalized cosine, Hellinger,
and Tanimoto (the best performing system in Table 3) and
3) the full set of similarity measures. All the SVM sys-
tems show significant improved performance at p a0 .005
over the baseline normalized cosine model, which had a
cost of 0.2732 (Table 3); only one of the decision trees
was significantly better at p a0 .05. The poorer perfor-
mance of voting compared to the baseline may be due in
part to dependencies among the different measures. None
of the decision tree systems were significantly better than
voting at p a0 .05; in comparison, the performance of all
SVMs were significantly better than the corresponding
voting and tree models at p a0 .005. The voting systems
did not use any source-pair information. The decision
trees used source-pair information categorically, but did
not make use of source-pair statistics. The SVMs used the
source-pair statistics, plus categorical source-pair infor-
mation as input features. Thus, the performance of these
systems tends to support the hypothesis that source-pair
information, and more specifically, source-pair similar-
ity statistics, contains useful information for the link de-
tection task. That is, the statistics not only differentiate
the source pairs, but provide additional information to the
classifier.
5 Conclusions
We have presented a set of enhancements for improving
story link detection over the best baseline systems. The
enhancements include the combination of different sim-
ilarity scores and statistical characterization of source-
pair information using machine learning techniques. We
observed that the use of statistical characterization of
source-pair information had a larger effect in improving
the performance of our system than the specific set of sim-
ilarity measures used. Comparing different methods for
combining similarity scores and source-pair information,
we observed that simple voting did not always provide
improvement over the best cosine similarity based sys-
tem, decision trees tended to provide better performance,
and SVMs provided the best performance of all combina-
tion methods evaluated. Our method can be used as post-
processing to the methods developed by other researchers,
such as topic-specific models, to create a system with
even better performance. Our investigations have fo-
cused on one collection drawn from broadcast news and
newswire stories in three languages; experiments on a va-
riety of collections would allow for assessment of our re-
sults more generally.
References
James Allan, Victor Lavrenko, Daniella Malin, and Rus-
sell Swan. 2000. Detections, Bounds, and Timelines:
UMass and TDT-3. In Proceedings of Topic Detection
and Tracking Workshop (TDT-3), Vienna, Virginia.
James Allan, Victor Lavernko, and Ramesh Nallapti.
2002. UMass at TDT 2002. Proceedings of the Topic
Detection & Tracking Workshop.
Nicholas J. Belkin, Paul B. Kantor, Edward A. Fox, and
J.A. Shaw. 1995. Combining the Evidence of Multiple
Query Representations for Information Retrieval. In-
formation Processing and Management, 33:3, pp. 431-
448.
Thorsten Brants, Francine Chen, and Ioannis Tsochan-
taridis. 2002. Topic-Based Document Segmentation
with Probabilistic Latent Semantic analysis. In Interna-
tional conference on Information and Knowledge Man-
agement (CIKM), McLean, VA, pp. 211-218.
Leo Breiman, Jerome H. Friedman, Richard A. Olshen,
and Charles J. Stone. 1984. Classification and Regres-
sion Trees, Wasworth International Group.
Eric Brill and Jun Wu. 1998. Classifier Combination for
Improved Lexical Disambiguation. In Proceedings of
COLING/ACL. pp. 191-195.
Jaime Carbonell, Yiming Yang, Ralf Brown,
Chun Jin and Jian Zhang. 2001. CMU TDT
Report. Slides at the TDT-2001 Meeting.
http://www.itl.nist.gov/iaui/894.01/tests/tdt/tdt2001/
paperpres.htm (and select the CMU presentation)
Francine Chen, Ayman Farahat and Thorsten Brants.
2003. Story Link Detection and New Even Detection
are Asymmetic. Proceedings of HLT-NAACL 2003,
Companion Volume, pp. 13-15.
Christopher Cieri, David Graff, Nii Martey,
and Stephanie Strassel. 2003. The TDT-3
Text and Speech Corpus. Proceedings Topic
Detection and Tracking Workshop, 2000.
http://www.itl.nist.gov/iaui/894.01/tests/tdt/research
links/index.htm
Nello Cristianini and John Shawe-Taylor. 2000. Support
Vector Machines, Cambridge University Press, Cam-
bridge, U.K.
W. Bruce Croft, Stephen Cronon-Townsend, and Victor
Lavrenko. 2001. Relevance Feedback and Personaliza-
tion: A Language Modeling Perspective. In DELOS
Workshop: Personalization and Recommender Sys-
tems in Digital Libraries, pp. 49-54.
Thomas G. Dietterich. 2000. Ensemble Methods in Ma-
chine Learning. In Multiple Classier Systems, Cagliari,
Italy.
Richard O. Duda and Peter E. Hart. 1973. Pattern Classi-
fication and Scene Analysis, John Wiley & Sons, Inc.
Tony van Gestel, Johan A.K. Suykens, Bart Baesens,
Stijn Viaene, Jan Vanthienen, Guido Dedene, Bart de
Moor and Joos Vandewalle. 2000. Benchmarking Least
Squares Support Vector Machine Classifiers. Internal
Report 00-37, ESAT-SISTA, K.U.Leuven.
Thorsten Joachims. 1999. Making large-Scale SVM
Learning Practical. In Advances in Kernel Methods -
Support Vector Learning, B. Schlkopf and C. Burges
and A. Smola (ed.), MIT-Press.
Thorsten Joachims. 1998. Text Categorization with Sup-
port Vector Machines: Learning with Many Relevant
Features. Proceedings of the European Conference on
Machine Learning (ECML), Springer, pp. 137-142.
Josef Kittler, Mohamad Hatef, Robert P.W. Duin, and Jiri
Matas. 1998. On Combining Classifiers. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence,
20(3), pp. 226-239.
Victor Lavrenko, James Allan, E. DeGuzman, D.
LaFlamme, V. Pollard, and S. Thomas. 2002. Rele-
vance Models for Topic Detection and Tracking. In
Proceedings of HLT-2002, San Diego, CA.
Robert E. Schapire and Yoram Singer. 2000. BoosTexter:
A Boosting-based System for Text Categorization. Ma-
chine Learning, 39(2/3), pp. 135-168.
(TDT2002) The 2002 Topic Detection and Track-
ing Task Definition and Evaluation Plan
http://www.itl.nist.gov/iaui/894.01/tests/tdt/tdt2002/
evalplan.htm
Ian H. Witten and Eibe Frank. 1999. Data Mining: Practi-
cal Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufman.
