The Smart/Empire TIPSTER IR System 
Chris Buckley, Janet Walz 
Sabir Research, Gaithersburg, MD 
{chrisb,walz}@sabir.com
Claire Cardie, Scott Mardis, Mandar Mitra, David Pierce, Kiri Wagstaff 
Department of Computer Science 
Cornell University, Ithaca, NY 14853 
{cardie,mardis,mitra,pierce,wkiri}@cs.cornell.edu
1 INTRODUCTION 
The primary goal of the Cornell/Sabir TIPSTER Phase III 
project is to develop techniques to improve the end-user 
efficiency of information retrieval (IR) systems. We have 
focused our investigations in four related research areas: 
1. High Precision Information Retrieval. The goal 
of our research in this area is to increase the accu- 
racy of the set of documents given to the user. 
2. Near-Duplicate Detection. The goal of our work 
in near-duplicate detection is to develop methods 
for delineating or removing from the set of retrieved 
documents any information that the user has already 
seen. 
3. Context-Dependent Document Summarization. 
The goal of our research in this area is to provide 
for each document a short summary that includes 
only those portions of the document relevant to the 
query. 
4. Context-Dependent Multi-Document Summari- 
zation. The goal of our research in this area is 
to provide a short summary for an entire group of 
related documents that includes only query-related 
portions. 
Taken as a whole, our research aims to increase end-user 
efficiency in each of the above tasks by reducing the a- 
mount of text that the user must peruse in order to get the 
desired useful information. 
We attack each task through a combination of statis- 
tical and linguistic approaches. The proposed statistical 
approaches extend existing methods in IR by perform- 
ing statistical computations within the context of another 
query or document. The proposed linguistic approaches 
build on existing work in information extraction and rely 
on a new technique for trainable partial parsing. In short, 
our integrated approach uses both statistical and linguistic 
sources to identify selected relationships among important 
terms in a query or text. The relationships are encoded as 
TIPSTER annotations [7]. We then use the extracted re-
lationships: (1) to discard or reorder retrieved texts (for 
high-precision text retrieval); (2) to locate redundant in- 
formation (for near-duplicate document detection); and 
(3) to generate coherent synopses (for context-dependent 
text summarization). 
An end-user scenario that takes advantage of the ef- 
ficiency opportunities offered by our research might pro- 
ceed as follows: 
1. The user submits a natural language query to the re- 
trieval system, asking for a high-precision search. This 
search will attempt to retrieve fewer documents than a 
normal search, but at a higher quality, so many fewer non- 
useful documents will need to be examined. 
2. The documents in the result set will be clustered so that 
closely related documents are grouped. 
Duplicate documents will be clearly marked so the 
user will not have to look at them at all. 
Near-duplicate documents will also be clearly mar- 
ked. When the user examines a document marked 
as a near-duplicate to a document previously ex- 
amined, the new material in this document is em- 
phasized in color so that it can be quickly perused, 
while the duplicate material can be ignored. 
3. Long documents can be automatically summarized, 
within the context of the query, so that perhaps only 20% 
of the document will be presented. This 20% summary 
would include the material that made the system decide 
the document was useful, as well as other material de- 
signed to set the context for the query-related material. 
4. If the user wishes, an entire cluster of documents can 
be summarized. The user can then decide whether to look 
at any of the individual documents. This multi-document 
summary will once again be query-related. 
One key result of our TIPSTER efforts is the devel- 
opment of TRUESmart, a Toolbox for Research in User 
Efficiency. TRUESmart is a set of tools and data sup- 
porting researchers in the development of methods for im- 
proving user efficiency for state-of-the-art information re- 
trieval systems. TRUESmart allows the integration of sys- 
tem components for high-precision retrieval, duplicate de- 
tection, and context-dependent summarization; it includes 
a simple graphical user interface (GUI) that supports each 
of these tasks in the context of the end-user scenario de- 
scribed above. In addition, TRUESmart aids system eval- 
uation and analysis by highlighting important term rela- 
tionships identified by the underlying statistical and lin- 
guistic language processing algorithms. 
The rest of the paper presents TRUESmart and its un- 
derlying IR and NLP components. Section 2 first pro- 
vides an overview of the Smart IR system and the Empire 
Natural Language Processing (NLP) system. Section 3 
describes the TRUESmart toolbox. To date, we have used 
TRUESmart to support our work in high-precision retriev- 
al and context-dependent document summarization. We 
describe our results in these areas in Sections 4-5 using 
the TRUESmart interface to illustrate the algorithms de- 
veloped and their contribution to the end-user scenario de- 
scribed above. Section 6 summarizes our work in dupli- 
cate detection and describes how the TRUESmart inter- 
face will easily be extended to support this task and in- 
clude linguistic term relationships in addition to statistical 
term relationships. We conclude with a summary of the 
potential advantages of our overall approach. 
2 THE UNDERLYING SYSTEMS: 
SMART AND EMPIRE 
The two main foundations of our research are the Smart 
system for Information Retrieval and the Empire system 
for Natural Language Processing. Both are large systems 
running in the UNIX environment at Cornell University.
2.1 Smart 
Smart Version 13 is the latest in a long line of experi- 
mental information retrieval systems, dating back over 30 
years, developed under the guidance of G. Salton. The 
new version is approximately 50,000 lines of C code and 
documentation. 
Smart Version 13 offers a basic framework for investi- 
gations of the vector space and related models of informa- 
tion retrieval. Documents are fully automatically indexed, 
with each document representation being a weighted vec- 
tor of concepts, the weight indicating the importance of a 
concept to that particular document. The document rep- 
resentatives are stored on disk as an inverted file. Natural 
language queries undergo the same indexing process. The 
query representative vector is then compared with the in- 
dexed document representatives to arrive at a similarity 
and the documents are then fully ranked by similarity. 
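The ranking process just described can be sketched in a few lines; the raw term-frequency weights and toy documents below are illustrative stand-ins for Smart's actual weighting schemes and inverted-file machinery:

```python
import math
from collections import Counter

def index(tokens):
    """Build a weighted concept vector for a text; raw term frequency
    stands in for Smart's weighting schemes."""
    return Counter(tokens)

def cosine(q, d):
    """Similarity between a query vector and a document vector."""
    dot = sum(w * d[t] for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

docs = {
    "d1": index("jail prison overcrowding inmates cope".split()),
    "d2": index("stock market prices fell sharply".split()),
}
query = index("prison overcrowding".split())

# Rank all documents by similarity to the query.
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
```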
Smart Version 13 is highly flexible (i.e., its algorithms 
can be easily adapted for a variety of IR tasks) and very 
fast, thus providing an ideal platform for information re- 
trieval experimentation. Documents are indexed at a rate 
of almost two gigabytes an hour, on systems currently 
costing under $5,000 (for example, a dual Pentium Pro 
200 Mhz with 512 megabytes memory and disk). Re- 
trieval speed is similarly fast, with basic simple searches 
taking much less than a second a query. 
2.2 The Empire System: A Trainable Par- 
tial Parser 
Stated simply, the goal of the natural language process- 
ing (NLP) component for the selected text retrieval tasks 
is to locate linguistic relationships between query terms. 
For this, we have developed Empire 1, a trainable partial 
parser. The remainder of this section describes the as- 
sumptions of our approach and the general architecture of 
the system. 
For the TIPSTER project, we are investigating the role 
of linguistic relationships in information retrieval tasks. A 
linguistic relationship between two terms is any relation- 
ship that can be determined through syntactic or semantic 
interpretation of the text that contains the terms. We are 
focusing on three classes of linguistic relationships that 
we believe will aid the information retrieval tasks: 
1. noun phrase relationships. E.g., determine wheth- 
er two query terms appear in the same (simple) noun 
phrase; find all places where a query term appears 
as the head of a noun phrase. 
1 The name refers to our focus on empirical methods for development 
and evaluation of the system. 
2. subject-verb-object relationships, including the
identification of subjects and objects in gap con- 
structions. These relationships help to identify the 
functional structure of a sentence, i.e., who did what 
to whom. Once identified, Smart can assign higher 
weights to query terms that appear in these topic- 
indicating verb, object, and especially subject posi- 
tions. 
3. noun phrase coreference. Coreference resolution
is the identification of all strings in a document that 
refer to the same entity. Noun phrase coreference 
will allow Smart to create more coherent summaries, 
e.g., by replacing pronouns with their referents as 
identified by Empire. In addition, Smart can use 
coreference relationships to modify its term weight- 
ing function to reflect the implied equality between 
all elements of a noun phrase equivalence class. 
Once identified, the linguistic relationships can be em- 
ployed in a number of ways to improve the efficiency of 
end-users: they can be used (1) to prefer the retrieval of 
documents that also exhibit the relationships; (2) to indi- 
cate the presence of redundant information; or (3) to es- 
tablish the necessary context in automatically generated 
summaries. Our approach to locating linguistic relation- 
ships is based on the following assumptions: 
• The NLP system need recognize only those relation- 
ships that are useful for the specific text retrieval 
application. There may be no need for full-blown 
syntactic and semantic analysis of queries and doc- 
uments. 
• The NLP system must recognize these relationships 
both quickly and accurately. The speed requirement 
argues for a shallow linguistic analysis; the accu- 
racy requirement argues for algorithms that focus 
on precision rather than recall. 
• The NLP component need only provide a compar- 
ative linguistic analysis between a document and a 
query. This should simplify the NLP task because 
individual documents do not have to be analyzed in 
isolation, but only relative to the query. 
Given these assumptions, we have developed Empire, a
fast, trainable, precision-based partial parser. As a partial 
parser, Empire performs only shallow syntactic analysis 
of input texts. Like many partial parsers and NLP systems
for information extraction (e.g., Hobbs et al. [9]), Empire
relies primarily on finite-state technology [16] to recog-
nize all syntactic and semantic entities as well as their re-
lationships to one another. Parsing proceeds in stages --
the initial stages identify relatively simple constituents:
Figure 1: Error-Driven Pruning of Treebank Grammars
simple noun phrases, some prepositional phrases, verb 
groups, and clauses. All linguistic relationships that re- 
quire higher-level attachment decisions are identified in 
subsequent stages and rely on output from earlier stages. 
Our use of finite-state transducers for partial parsing is 
most similar to the work of Abney [1], who employs a
series of cascaded finite-state machines to build up an 
increasingly complex linguistic analysis of an incoming 
sentence. 
Unlike most work in this area, however, we do not use 
hand-crafted patterns to drive the linguistic analysis. In- 
stead, we rely on corpus-based learning algorithms to ac- 
quire the grammars necessary for driving each level of lin- 
guistic relationship identification. In particular, we have 
developed a very simple, yet effective technique for au-
tomating the acquisition of grammars through error-driven
pruning of treebank grammars [6]. As shown in Fig-
ure 1, the method first extracts an initial grammar from 
a "treebank" corpus, i.e., a corpus that has been anno- 
tated with respect to the linguistic relationship of interest. 
Consider the base noun phrase relationship -- the identi- 
fication of simple, non-recursive noun phrases. Accurate 
identification of base noun phrases is a critical component 
of any partial parser; in addition, Smart relies on base NPs 
as its primary source of linguistic phrase information. To 
extract a grammar for base noun phrase identification, we 
tag the training text with a part-of-speech tagger (we use 
Mitre's version of Brill's tagger \[3\]) and then extract as an 
NP rule every unique part-of-speech sequence that covers 
a base NP annotation. 
Next, the grammar is improved by discarding rules 
that obtain a low precision-based "benefit" score when ap- 
plied to a held out portion of the training corpus, the prun- 
ing corpus. The resulting "grammar" can then be used to 
identify base NPs in a novel text as follows: 
1. Run all lower-level annotators. For base NPs, for 
example, run the part-of-speech annotator. 
2. Proceed through the tagged text from left to right, 
at each point matching the rules against the remain- 
ing input. For base NP recognition, match the NP 
rules against the remaining part-of-speech tags in 
the text. 
3. If there are multiple rules that match beginning at 
tag or token ti, use the longest matching rule R. 
Begin the matching process anew at the token that 
follows the last NP. 
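The extraction and matching steps above can be sketched as follows; the part-of-speech tags and training NPs are illustrative, and the pruning-corpus step is omitted for brevity:

```python
def extract_rules(tagged_nps):
    """Each annotated base NP in the training text contributes its
    part-of-speech sequence as one grammar rule."""
    return {tuple(tags) for tags in tagged_nps}

def match_nps(tags, rules):
    """Left-to-right, longest-match application of the NP rules to a
    tagged sentence; returns (start, end) spans of bracketed base NPs."""
    spans, i = [], 0
    while i < len(tags):
        best = max((len(r) for r in rules
                    if tuple(tags[i:i + len(r)]) == r), default=0)
        if best:
            spans.append((i, i + best))
            i += best  # resume matching after the bracketed NP
        else:
            i += 1
    return spans

rules = extract_rules([["DT", "JJ", "NN"], ["DT", "NN"], ["NNS"]])
spans = match_nps(["DT", "JJ", "NN", "VBD", "NNS"], rules)
```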
2.2.1 Empire Evaluation 
Using this simple grammar extraction and pruning algo- 
rithm with the naive longest-match heuristic for applying 
rules to incoming text, the learned grammars are shown to 
perform very well for base noun phrase identification. A 
detailed description of the base noun phrase finder and its 
evaluation can be found in Cardie and Pierce [6]. In sum-
mary, however, we have evaluated the approach on two 
base NP corpora derived from the Penn Treebank [11].
The algorithm achieves 91% precision and recall on base 
NPs that correspond directly to non-recursive noun phras- 
es in the treebank; it achieves 94% precision and recall on 
slightly less complicated noun phrases. 2 
We are currently investigating the use of error-driven 
grammar pruning to infer the grammars for all phases of 
partial parsing and the associated linguistic relationship 
identification. Initial results on verb-object recognition 
show 72% precision when tested on a corpus derived from 
the Penn Treebank. Analysis of the results indicates that 
our context-free approach, which worked very well for 
noun phrase recognition, does not yield sufficient accu- 
racy for verb-object recognition. As a result, we have 
used standard machine learning algorithms (i.e., k-nearest 
neighbor and memory-based learning using the value-dif- 
ference metric) to classify each proposed verb-object bra- 
cketing as either correct or incorrect given a 2-word win- 
dow surrounding the bracketing. In preliminary experi- 
ments, the machine learning algorithm obtains 84% gen- 
eralization accuracy. If we discard all bracketings it clas- 
sifies as incorrect, overall precision for verb-object recog- 
nition increases from 72% to over 80%. The next sec- 
tion outlines our general approach for using learning al- 
gorithms in conjunction with the Empire system. 
2This corpus further simplifies some of the Treebank base NPs
by removing ambiguities that we expect other components of our NLP 
system to handle, including: conjunctions, NPs with leading and trailing 
adverbs and verbs, and NPs that contain prepositions. 
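The error-correction step just described might look roughly like the sketch below; the toy training cases and the set-overlap similarity are stand-ins for Empire's real bracketing contexts and the value-difference metric:

```python
from collections import Counter

def knn_filter(window, cases, k=3):
    """Label a proposed verb-object bracketing 'correct' or 'incorrect'
    by majority vote over the k training cases whose 2-word windows are
    most similar (set overlap stands in for the value-difference metric)."""
    nearest = sorted(cases, key=lambda c: len(set(window) & set(c[0])),
                     reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical training cases: (2-word window around the bracketing, label).
cases = [
    (("quickly", "ate", "cake", "then"), "correct"),
    (("had", "eaten", "the", "cake"), "correct"),
    (("the", "man", "who", "ran"), "incorrect"),
    (("woman", "that", "the", "dog"), "incorrect"),
]
label = knn_filter(("slowly", "ate", "bread", "and"), cases)
```

Bracketings the classifier labels incorrect would simply be discarded, as in the precision-raising filter described above.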
2.2.2 The Role of Machine Learning Algorithms 
As noted above, Empire's finite-state partial parsing meth- 
ods may not be adequate for identifying some linguis- 
tic relationships. At a minimum, many linguistic rela- 
tionships are better identified by taking additional con- 
text into account. In these circumstances, we propose the 
use of corpus-based machine learning techniques -- both 
as a systematic means for correcting errors (as done for 
verb-object recognition above) and for learning to identify 
linguistic relationships that are more complex than those 
covered by the finite-state methods above. 
In particular, we have employed the Kenmore knowl-
edge acquisition framework for NLP systems [4, 5]. Ken-
more relies on three major components. First, it requires 
an annotated training corpus, i.e., a collection of on- 
line documents, that has been annotated with the neces- 
sary bracketing information. Second, it requires a robust 
sentence analyzer, or parser. For this, we use the Empire 
partial parser. Finally, the framework requires an induc- 
tive learning algorithm. Although any inductive learning 
algorithm can be used, we have successfully used case- 
based learning (CBL) algorithms for a number of natural 
language learning problems. 
There are two phases to the framework: (1) a partially 
automated training phase, or acquisition phase, in which 
a particular linguistic relationship is learned, and (2) an 
application phase, in which the heuristics learned dur- 
ing training can be used to identify the linguistic relation- 
ship in novel texts. More specifically, the goal of Ken- 
more's training phase (see Figure 2) is to create a case 
base, or memory, of linguistic relationship decisions. To 
do this, the system randomly selects a set of training sen- 
tences from the annotated corpus. Next, the sentence an- 
alyzer processes the selected training sentences, creating 
one case for every instance of the linguistic relationship 
that occurs. As shown in Figure 2, each case has two 
parts. The context portion of the case encodes the con- 
text in which the linguistic relationship was encountered 
-- this is essentially a representation of some or all of the 
constituents in the neighborhood of the linguistic relation- 
ship as denoted in the flat syntactic analysis produced by 
the parser. The solution portion of the case describes how 
the linguistic relationship was resolved in the current ex- 
ample. In the training phase, this solution information is 
extracted directly from the annotated corpus. As the cases 
are created, they are stored in the case base. 
After training, the NLP system uses the case base with- 
out the annotated corpus to identify new occurrences of 
the linguistic relationship in novel sentences. Given a sen- 
tence as input, the sentence analyzer processes the sen- 
tence and creates a problem case, automatically filling in 
its context portion based on the constituents appearing in the
Figure 2: Kenmore Training/Acquisition Phase.
sentence. To determine whether the linguistic relationship 
holds, Kenmore next compares the problem case to each 
case in the case base, retrieves the most similar training 
case, and returns the decision as indicated in the solution 
part of the case. The solution information lets Empire de- 
cide whether the desired relationship exists in the current 
sentence. 
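The two Kenmore phases can be sketched as follows; the context features and solution labels are hypothetical stand-ins for the constituent representations actually produced by the parser:

```python
def train(annotated_examples):
    """Acquisition phase: store one (context, solution) case per
    instance of the linguistic relationship in the training corpus."""
    return [(frozenset(context), solution)
            for context, solution in annotated_examples]

def apply_case_base(case_base, problem_context):
    """Application phase: retrieve the most similar stored case and
    return its solution part."""
    ctx = frozenset(problem_context)
    _, solution = max(case_base, key=lambda case: len(ctx & case[0]))
    return solution

# Hypothetical constituent features and solution labels.
case_base = train([
    (["np", "rel-pronoun", "vp"], "antecedent=np"),
    (["vp", "adv", "pp"], "no-relation"),
])
answer = apply_case_base(case_base, ["np", "rel-pronoun", "vp", "pp"])
```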
In previous work, we have used Kenmore for part- 
of-speech tagging, semantic feature tagging, information 
extraction concept acquisition, and relative pronoun res- 
olution [5]. We expect that this approach will be neces-
sary for coreference resolution, for some types of subject-
object identification, and for handling gap constructs (i.e.,
for determining that "boy" is the subject of "ate" as well
as the object of "saw" in "Billy saw the boy that ate the
candy"). It is also the approach used to learn the verb-
object correction "heuristics" described in the last section.
2.2.3 Coreference Resolution 
The final class of linguistic relationship is noun phrase
coreference -- for every entity in a text, the NLP system
must locate all of the expressions or phrases that refer to it.
As an example, consider the following: "Bill Clinton, cur- 
rent president of the United States, left Washington Mon- 
day morning for China. He will return in two weeks." In 
this excerpt, the phrases "Bill Clinton," "current president 
(of the United States)," and "he" refer to the same entity. 
Smart can use this coreference information to treat the as- 
sociated terms as equivalents. For example, it can assume 
that all items in the class are present whenever one ap- 
pears. In conjunction with coreference resolution, we are 
also investigating the usefulness of providing the IR sys- 
tem with canonicalized noun phrase forms that make use 
of term invariants identified during coreference. 
To date, we have implemented two simple algorithms 
for coreference resolution to use purely as baselines. Both 
operate only on base noun phrases as identified by Em- 
pire's base NP finder. The first heuristic assumes that 
two noun phrases are coreferent if they share any terms 
in common. The second assumes that two noun phrases 
are coreferent if they have the same head. Both obtained 
higher scores than expected when tested on the MUC6 
coreference data set. The head noun heuristic achieved 
42% recall and 51% precision; the overlapping terms
heuristic achieved 41% recall and precision.
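Both baseline heuristics can be stated in a few lines; the rightmost-word head rule used here is an assumption for illustration:

```python
def tokens(np):
    return np.lower().split()

def same_head(np1, np2):
    """Head-match heuristic; for illustration, assume the head of a
    base NP is simply its rightmost word."""
    return tokens(np1)[-1] == tokens(np2)[-1]

def overlapping_terms(np1, np2):
    """Coreferent if the two noun phrases share any term."""
    return bool(set(tokens(np1)) & set(tokens(np2)))

a, b = "the current president", "the president"
head_coref = same_head(a, b)        # both heads are "president"
term_coref = overlapping_terms(a, b)
```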
2.2.4 Empire Annotators 
All relationships identified by Empire are made available 
to Smart in the form of TIPSTER annotations. We cur- 
rently have the following annotators in operation: 
• tokenizer: identifies tokens, punctuation, etc. 
• sentence finder: based on Penn's maximum entropy 
algorithm [15].
• baseNPs: identifies non-recursive noun phrases. 
• verb-object: identifies verb-object pairs, either by 
bracketing the verb group and entire direct object 
phrase or by noting just the heads of each. 
• head noun coreference heuristic: identifies corefer- 
ent NPs. 
• overlapping terms coreference heuristic: identifies 
coreferent NPs. 
The tokenizer is written in C. The sentence finder is writ- 
ten in Java. All other annotators are implemented in Lu- 
cid/Liquid Common Lisp. 
3 TRUESmart
To support our research in user-efficient information re- 
trieval, we have developed TRUESmart, a Toolbox for 
Research in User Efficiency. As noted above, TRUESmart 
allows the integration, evaluation, and analysis of IR and 
NLP algorithms for high-precision searches, context-de- 
pendent summarization, and duplicate detection. TRUE- 
Smart provides three classes of resources that are neces- 
sary for effective research in the above areas: 
1. Testbed Collections, including test queries and cor-
rect answers 
2. Automatic Evaluation Tools, to measure overall 
how an approach does on a collection. 
3. Failure Analysis Tools, to help the researcher in- 
vestigate in depth what has happened. 
These tools are, to a large extent, independent of the actual 
research being done. However, they are just as vital for 
good research as the research algorithms themselves. 
3.1 TRUESmart Collections 
The testbed collections organized for TRUESmart are all 
based on TREC [19] and SUMMAC [10], the large eval-
uation workshops run by NIST and DARPA respectively. 
TREC provides a number of document collections rang- 
ing up to 500,000 documents in size, along with queries 
and relevance judgements that tell whether a document is 
relevant to a particular query. 
Evaluation of our high-precision research can be done 
directly using the TREC collections. The TREC docu- 
ments, queries, and relevance judgements are sufficient to 
evaluate whether particular high-precision algorithms do 
better than others. 
For summarization research, however, a different test- 
bed is needed. The SUMMAC workshop evaluated sum- 
maries of documents. The major evaluation measured 
whether human judges were able to judge relevance of en- 
tire documents just from the summaries. While very valu- 
able in giving a one-time absolute measure of how well 
summarization algorithms are doing, human-dependent
evaluations are infeasible for a research group to perform
on ongoing research, since different human assessors are
required each time a given document or summary is judged.
Our summarization testbed is based on the SUMMAC 
QandA evaluation. Given a set of questions about a docu- 
ment, and a key describing the locations in the document 
where those questions are answered, the goal is to evaluate 
how well an extraction-based summary of that document 
answers the questions. So the TRUESmart summarization 
testbed consists of 
• A small number of queries 
• A small number of relevant documents per query 
• A set of questions for each query 
• Locations in the relevant documents where each 
question is answered. 
Objective evaluation of near-duplicate information de- 
tection is difficult. As part of our efforts in this area, we 
have constructed a small set (50 pairs) of near-duplicate 
documents of newswire articles. These pairs were deliber- 
ately chosen to encompass a range of duplication amounts; 
we include 5 pairs at cosine similarity .95, 5 pairs at .90, 
and 10 pairs at each of .85, .80, .75, and .70. In addition, 
they have been categorized as to exactly what the rela- 
tionship between the pairs is. For example, some pairs 
are slight rewrites by the same author, some are followup 
articles, and some are two articles on the same subject 
by different authors. We also have queries that will re- 
trieve both of these pairs among the top documents. These 
articles are tagged: corresponding sections of text from 
each document pair are marked as identical, semantically 
equivalent, or different. 
Preparing a testbed for multi-document summariza- 
tion is even more difficult. We have not done this as yet, 
but our initial approach will take as a seed the QandA 
evaluation test collections described above. This gives us 
a query and a set of relevant documents with known an- 
swers to a set of common questions. Evaluation can be 
done by performing a multi-document summarization on 
a subgroup of this set of relevant documents. The final 
summary can be evaluated based upon how many ques- 
tions are answered (a question is answered by a text ex- 
cerpt in the summary if the excerpt in the corresponding 
original document was marked as answering the ques- 
tion), and how many questions are answered more than 
once. If too many questions are answered more than once, 
then the duplicate detection algorithms may not be work- 
ing optimally. If too few questions are answered at all, 
then the summarization algorithms may be at fault. The 
evaluation numbers produced by the final summary can be 
compared against the average evaluation numbers for the 
documents in the group. 
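Under these assumptions, scoring a multi-document summary reduces to counting, per question, how many of its answer-bearing excerpts appear in the summary; the excerpt ids below are hypothetical:

```python
def score_summary(summary_excerpts, answer_key):
    """answer_key maps each question to the set of excerpt ids marked
    as answering it in the original documents. A question is answered
    if the summary contains at least one such excerpt."""
    hits = {q: len(ids & set(summary_excerpts))
            for q, ids in answer_key.items()}
    answered = sum(1 for n in hits.values() if n >= 1)
    answered_repeatedly = sum(1 for n in hits.values() if n > 1)
    return answered, answered_repeatedly

key = {"q1": {"d1:s3", "d2:s7"}, "q2": {"d1:s9"}, "q3": {"d3:s2"}}
answered, repeated = score_summary(["d1:s3", "d2:s7", "d1:s9"], key)
```

A high `repeated` count would point at the duplicate detection algorithms; a low `answered` count would point at the summarizer.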
3.2 TRUESmart Evaluation 
Automatic evaluation of research algorithms is critical for 
rapid progress in all of these areas. Manual evaluation is 
valuable, but impractical when trying to distinguish be- 
tween small variations of a research group's algorithms. 
3.2.1 Trec_eval 
Automatic evaluation of straight information retrieval 
tasks is not new. In particular, we have provided the 
"trec_eval" program to the TREC community to evalu- 
ate retrieval in the TREC environment. It will also be an 
evaluation component in the TRUESmart ToolBox. The 
trec_eval measures are described in the TREC-4 workshop 
proceedings [8].
3.2.2 Summ_eval 
The QandA evaluation of SUMMAC is very close to be- 
ing automatic once questions and keys are created. For 
SUMMAC, the human assessors still judge whether or 
not a given summary answers the questions. Indeed, for 
non-extraction-based summaries, this is required. But for 
evaluation of extraction-based summarization (where the 
summaries contain clauses, sentences, or paragraphs of 
the original document), an automatic approximation of the 
assessor task is possible. This enables a research group 
to fairly evaluate and compare multiple summaries of the 
same document, with no additional manual effort after 
the initial key is determined. Thus we have written the 
"summ_eval" evaluator. This algorithm for the automatic 
evaluation of summaries: 
1. Automatically finds the spans of the text of the orig- 
inal document that were given as answers in the 
keys. 
2. Automatically finds the spans of the text of the orig- 
inal document that appeared in a summarization of 
the document. 
3. Computes various measures of overlap between the 
summarization spans and the answer spans. 
The effectiveness of two summarization algorithms can 
be automatically compared by comparing these overlap 
measures. 
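The overlap computation in step 3 might be approximated as below, treating spans as character offsets into the original document and assuming the spans within each set do not overlap one another:

```python
def overlap(spans_a, spans_b):
    """Characters covered by both span sets; spans are (start, end)
    offsets, assumed non-overlapping within each set."""
    return sum(max(0, min(a1, b1) - max(a0, b0))
               for a0, a1 in spans_a for b0, b1 in spans_b)

def recall_precision(answer_spans, summary_spans):
    inter = overlap(answer_spans, summary_spans)
    recall = inter / sum(e - s for s, e in answer_spans)
    precision = inter / sum(e - s for s, e in summary_spans)
    return recall, precision

# 200 answer characters, 150 summary characters, 100 shared.
r, p = recall_precision([(0, 100), (300, 400)], [(50, 150), (300, 350)])
```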
We ran summ_eval on the summaries produced by 
the systems of the SUMMAC workshop. The compar- 
ative ranking of systems using summ_eval is very close 
to the (presumably) optimal rankings using human asses- 
sors. This strongly suggests that automatic scoring of 
summ_eval can be useful for evaluation in circumstances 
where human scoring is not available.
3.2.3 Dup_eval 
"Dup_eval" uses the same algorithms as summ_eval to mea- 
sure how well an algorithm can detect whether one doc- 
ument contains information that is duplicated in another. 
The key (correct answer) for one document out of a pair 
will give the spans of text in that document that are dupli- 
cated in the other, at three different levels of duplication: 
exact, semantically equivalent, and contained in. The du- 
plicate detection algorithm being evaluated will come up 
with similar spans. Dup_eval measures the overlap be-
tween these sets of spans.
3.3 TRUESmart GUI 
Automatic evaluation is only the beginning of the research 
process. Once evaluation pinpoints the failures and suc- 
cesses of a particular algorithm, analysis of these failures 
must be done in order to improve the algorithm. This anal- 
ysis is often time-consuming and painful. This motivates 
the implementation of the TRUESmart GUI. This GUI is 
not aimed at being a prototype of a user efficiency GUI. 
Instead, it offers a basic end-user interface while giving 
the researcher the ability to explore the underlying causes 
of particular algorithm behavior. 
Figure 3 shows the basic TRUESmart GUI as used 
to support high-precision retrieval and context-dependent 
summarization. The user begins by typing a query into 
the text input box in the middle, left frame. The sam- 
ple query is TREC query number 151: "The document 
will provide information on jail and prison overcrowding 
and how inmates are forced to cope with those conditions; 
or it will reveal plans to relieve the overcrowded condi- 
tion." Clicking the SubmitQ button initiates the search. 
Clicking the NewQ button allows the submission of a 
new query. 3 Once the query is submitted, Smart initi- 
ates a global search in order to quickly obtain an initial 
set of documents for the user. The document number, 
similarity ranking, similarity score, source, date, and ti- 
tle of the top 20 retrieved documents are displayed in the 
upper left frame of the GUI. Clicking on any document 
will cause its query-dependent summary to be displayed 
in the large frame on the right. In Figure 3, the sum- 
mary of the seventh document is displayed. In this run, 
we have set Smart's target summary length to 25% and 
asked for sentence- (rather than paragraph-) based sum- 
maries. Matching query terms are highlighted through- 
out the summary although they are not visible in the 
screen dump. The left, bottom-most frame of the inter- 
face lists the most important query terms (e.g., prison, jail,
3The "ModQ" and "Mod vec" buttons allow the user to modify the 
query and modify the query vector, respectively. Neither will be dis- 
cussed further here. 
inmat(e), overcrowd) and their associated weights (e.g., 
4.69, 5.18, 7.17, 12.54).
After the initial display of the top-ranked documents,
Smart begins a local search in the background: each in- 
dividual document is reparsed and matched once again 
against the query to see if it satisfies the particular high- 
precision restriction criteria being investigated. If it 
doesn't the document is removed from the retrieved set; 
otherwise, the document remains in the final retrieved set 
with a score that combines the global and local score. 
In addition, the user can supply relevance judgements on 
any document by clicking Rel (relevant), NRel (not rel- 
evant), or PRel (probably relevant). Smart uses these 
judgements as feedback, updating the ranking after ev- 
ery 5 judgements by adding new documents and removing 
those already judged from the list of retrieved texts. Fig- 
ure 4 shows the state of the session after a number of rel- 
evance judgements have been made and new documents 
have been added to the top 20. 
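The feedback loop described above (re-ranking after every 5 judgements) can be pictured as follows. This is a minimal illustrative sketch, not the actual Smart code: it assumes a Rocchio-style update over sparse term-weight dictionaries, and all names and parameter values are ours.

```python
from collections import defaultdict

def rocchio_update(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Update a sparse query vector (term -> weight) from relevance judgements."""
    new_q = defaultdict(float)
    for term, weight in query.items():
        new_q[term] += alpha * weight
    for doc in rel_docs:                      # pull the query toward judged-relevant docs
        for term, weight in doc.items():
            new_q[term] += beta * weight / len(rel_docs)
    for doc in nonrel_docs:                   # push it away from judged-non-relevant docs
        for term, weight in doc.items():
            new_q[term] -= gamma * weight / len(nonrel_docs)
    return {t: w for t, w in new_q.items() if w > 0}  # keep positive weights only

def rerank(query, docs):
    """Re-rank unjudged documents by inner product with the updated query vector."""
    def score(doc):
        return sum(w * doc.get(t, 0.0) for t, w in query.items())
    return sorted(docs, key=score, reverse=True)
```

After every fifth judgement, the updated vector would re-score the remaining unjudged documents, adding new ones to the display and dropping those already judged.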
The interface, while basic, is valuable in its own right. It was successfully used for the Cornell/SabIR experiments in the TREC 7 High-Precision track. In this task, users were asked to find 15 relevant documents within 5 minutes for each of 50 queries. This was a true test of user efficiency, and Cornell/SabIR did very well.
The most important use of the GUI, though, is to ex- 
plore what is happening underneath the surface, in order 
to aid the researcher. Operating on either a single docu- 
ment or a cluster of documents, the researcher can request 
several different views. The two main paradigms are: 
(1) the document map view, which visually indicates the 
relationships between parts of the selected document(s); 
and (2) the document annotation view, which displays any 
subset of the available annotations for the selected docu- 
ment(s). Neither view is shown in Figures 3 and 4. 
The document annotation view, in particular, is ex- 
tremely flexible. The interface allows the user to run any 
of the available annotators on a document (or document 
set). Each annotator returns the text(s) and the set of an- 
notations computed for the text(s). The GUI, in turn, dis- 
plays the text with the spans of each annotation type high- 
lighted in a different color. Optionally, the values of each 
annotation can be displayed in a separate window. Thus, 
for instance, a document may be returned with one anno- 
tation type giving the spans of a document summary, and 
other annotation types giving the spans of an ideal sum- 
mary. The researcher can then immediately see what the 
problems are with the document summary. 
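The data exchanged between annotators and the GUI can be pictured as typed character spans over the text. The class and rendering function below are an illustrative sketch of that idea only, not the TIPSTER architecture's actual annotation API; all names are ours, and spans are assumed non-overlapping.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    kind: str    # annotation type, e.g. "summary" or "query_match"
    start: int   # character offset where the span begins
    end: int     # character offset where the span ends (exclusive)

def mark_spans(text, annotations):
    """Render non-overlapping annotated spans inline as [kind:...] for inspection.

    A GUI would instead highlight each annotation type in a different color.
    """
    out, pos = [], 0
    for a in sorted(annotations, key=lambda a: a.start):
        out.append(text[pos:a.start])
        out.append(f"[{a.kind}:{text[a.start:a.end]}]")
        pos = a.end
    out.append(text[pos:])
    return "".join(out)
```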
There is no limit to the number of possible annota- 
tors that can be displayed. Annotators implemented or 
planned include: 
• Query term matches (with values in separate win- 
dow). 
• Statistical and/or linguistic phrase matches. 
• Summary vs. model summary. 
• Summary vs. QandA answers. 
• Two documents concatenated with duplicate infor- 
mation of the second annotated in the first. 
• Coreferent noun phrases. 
• Subject, verb, or object term matches. 
• Verb-object, subject-verb, and subject-object term 
matches. 
• Subjects or objects of gap constructions annotated 
with the inferred filler if it matches an important 
term. 
Analyzing the role of linguistic relationships in the IR 
tasks amounts to requesting the display of some or all 
of the NLP annotators. For example, the user can re- 
quest to see linguistic phrase matches as well as statis- 
tical phrase matches. In the example from Figure 3, the 
resulting annotated summary would show "27 inmates" 
and "Latino inmates" as matches of the query term "in- 
mates" because all instances of "inmates" appear as head 
nouns. Similarly, it would show a linguistic phrase match 
between "jail overcrowding" (paragraph 5 of the sum- 
mary) and "jail and prison overcrowding" (in the query) 
for the same reason. When the output of the linguistic 
phrase annotator is requested, the lower left frame that 
lists query terms and weights is updated to include the 
linguistic phrases from the query and their corresponding 
weights. 
Alternatively, one might want to analyze the role of 
the "subject" annotator. In the running example, this would 
modify the summary window to show matches that in- 
volve terms appearing as the subject of a sentence or clause. 
For example, all of the following occurrences of "inmates" 
would be marked as subject matches with the "inmates" 
query term, which also appears in the subject position 
("inmates are forced"): "inmates were injured" (paragraph 
l ), "inmates broke out" (paragraph 2), "inmates refused" 
(paragraph 2), "inmates are confined" (paragraph 3), etc. 
Smart can give extra weight to these "subject" term match- 
es since entities that appear in this syntactic position are 
often central topic terms. The interface helps the devel- 
oper to quickly locate and determine the correctness of 
subject matches. As an aside, if the "subject gap con- 
struction" annotator were requested, "inmates" would be 
filled in as the implicit subject of "return" in paragraph 2 
and would be marked as a query term match. 
Figure 3: TRUESmart GUI After Initial Query. Note that (other than the text input box) no frame borders, scrolling options, or button borders are visible in this screen dump.
Figure 4: TRUESmart GUI After Relevance Judgements. 
Finally, the role of coreference resolution might also 
be analyzed by requesting to see the output of the coref- 
erence annotator. In response to this request, the docu- 
ment text window would then be updated to highlight in 
the same color all of the entities considered in the same 
coreference equivalence class. As noted above (see Sec- 
tion 2.2), we currently have two simple coreference an- 
notators: one that uses the head noun heuristic and one 
that uses the overlapping terms heuristic. In our exam- 
ple, the head noun annotator would assume, among other 
things, that any noun phrase with "inmates" as its head 
refers to the same entity: "27 inmates", "black and Latino 
inmates", "the inmates", etc. (Note that many of these 
proposed coreferences are incorrect -- the heuristics are 
only meant to be used as baselines with which to compare 
other, better, coreference algorithms.) A quick scan of 
the text with all of these occurrences highlighted lets the 
user quickly determine how well the annotator is working 
for the current example. After limited pronoun resolu- 
tion is added to the coreference annotator, "their" in "their 
cells" (paragraph 2) would also be highlighted as part of 
the same equivalence class. 
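The head noun heuristic is simple enough to sketch directly. Assuming English noun phrases are head-final (the head is the last token), the baseline below groups every noun phrase sharing a head into one equivalence class, deliberately over-merging as noted above; the function names are ours.

```python
def head_noun(np):
    """Crude head extraction: take the last token of the noun phrase."""
    return np.split()[-1].lower()

def head_noun_classes(noun_phrases):
    """Group noun phrases into coreference equivalence classes by shared head noun.

    This is only a baseline: every NP with the same head is assumed coreferent,
    so "27 inmates" and "Latino inmates" land in one class even when they
    actually denote different groups.
    """
    classes = {}
    for np in noun_phrases:
        classes.setdefault(head_noun(np), []).append(np)
    return classes
```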
4 HIGH-PRECISION INFORMATION RETRIEVAL
In order to maintain general-purpose retrieval capabilities, current IR systems attempt to balance precision and recall measures. A
number of information retrieval tasks, however, require 
retrieval mechanisms that emphasize precision: users want 
to see a small number of documents, most of which are 
deemed useful, rather than as many useful documents as possible mixed in with numerous non-useful ones. As a result, our research in high-precision IR concentrates on improving user time efficiency by showing the user only documents that there is very good reason to believe are useful.
Precision is increased by restricting an already re- 
trieved set of documents to those that meet some addi- 
tional criteria for relevance. An initial set of documents is 
retrieved (a global search), and each individual document 
is reparsed and matched against the query again to see if 
it satisfies the particular restriction criteria being investi- 
gated (local matching). If it does, the document is put into 
the final retrieved set with a score of some combination of 
the global and local score. We have investigated a num- 
ber of re-ranking algorithms. Three are briefly described 
below: Boolean filters, clusters, and phrases. 
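The global-then-local re-ranking loop can be sketched as follows. This is an illustration only: the combination weight, the names, and the shape of the restriction test are our assumptions, since the actual criteria vary with the algorithm under investigation.

```python
def rerank_high_precision(global_hits, local_score, passes_filter, w_global=0.5):
    """Restrict a globally retrieved set to documents passing a local
    restriction criterion, scoring survivors by a weighted combination
    of their global and local scores.

    global_hits:   list of (doc_id, global_score) pairs from the global search
    local_score:   callable doc_id -> float, from local re-matching of the query
    passes_filter: callable doc_id -> bool, the restriction criterion under study
    """
    final = []
    for doc_id, g in global_hits:
        if not passes_filter(doc_id):
            continue  # fails the high-precision restriction: drop it
        combined = w_global * g + (1.0 - w_global) * local_score(doc_id)
        final.append((doc_id, combined))
    return sorted(final, key=lambda x: x[1], reverse=True)
```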
4.1 Automatic Boolean Filters 
Smart expands user queries by adding terms occurring in 
the top documents. Maintaining the focus of the query while expanding is difficult; the query tends to drift towards a single aspect of the query while ignoring other aspects. Therefore, it is useful to have a re-ranking algorithm that emphasizes those top documents which cover all aspects of the query.
In recent work [14], we construct (soft) Boolean filters containing all query aspects and use these for re-ranking.
A manually prepared filter can improve average precision 
by up to 22%. In practice, a user is not going to go to the 
difficulty of preparing such a filter, however, so an auto- 
matic approximation is needed. Aspects are automatically 
identified by looking at the term-term correlations among 
the query terms. Highly correlated terms are assumed to 
belong to the same aspect, and less correlated terms are 
assumed to be independent aspects. The automatic filter 
includes all of the independent aspects, and improves av- 
erage precision by 6 to 13%. 
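A rough sketch of the automatic filter construction: terms are grouped into aspects by pairwise correlation (single-link style), and a document is then scored by how many aspects it covers, a soft AND over aspects. The grouping strategy, threshold, and scoring below are illustrative assumptions, not the exact method of [14].

```python
def aspects_from_correlation(terms, corr, threshold=0.3):
    """Greedy single-link grouping: a term joins an existing aspect if its
    correlation with any member exceeds the threshold; otherwise it starts
    a new, independent aspect.

    corr maps frozenset({term1, term2}) -> correlation value.
    """
    groups = []
    for t in terms:
        for g in groups:
            if any(corr.get(frozenset((t, u)), 0.0) > threshold for u in g):
                g.add(t)
                break
        else:
            groups.append({t})
    return groups

def soft_boolean_score(doc_terms, aspect_groups):
    """Soft AND over aspects: each aspect counts as covered if any of its
    terms appears in the document, and the score is the fraction of aspects
    covered, so documents ignoring an aspect are penalized."""
    covered = [1.0 if (g & doc_terms) else 0.0 for g in aspect_groups]
    return sum(covered) / len(covered)
```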
4.2 Clusters 
Clustering the top documents can yield improvements 
from two sources, as we examine in [12]. First, outlier
documents (those documents not strongly related to other 
documents) can be removed. This works reasonably for 
many queries. Unfortunately, it fails catastrophically for 
some hard queries where the outlier may be the only top 
relevant document! Absolute failures need to be avoided, 
so this approach is not currently recommended. The sec- 
ond improvement source is to ensure that query expansion 
terms come from all clusters. This is another method to 
maintain query focus and balance. A very modest im- 
provement of 2 to 3% is obtained; it appears the Boolean 
filter approach above is to be preferred, unless clustering 
is being done for other purposes in any case. 
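Drawing expansion terms from all clusters might be done round-robin, as in the sketch below; the actual selection procedure used in [12] is not specified here, so this is only one plausible reading.

```python
def balanced_expansion(clusters, n_terms):
    """Pick expansion terms round-robin from each cluster's ranked term list,
    so that every cluster among the top documents contributes and the
    expanded query stays balanced across aspects.

    clusters: list of term lists, each ranked by within-cluster importance.
    """
    chosen, i = [], 0
    while len(chosen) < n_terms and any(i < len(c) for c in clusters):
        for c in clusters:
            if i < len(c) and c[i] not in chosen:
                chosen.append(c[i])
                if len(chosen) == n_terms:
                    break
        i += 1
    return chosen
```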
4.3 Phrases 
Traditionally, phrases have been viewed as a precision-enhancing device. In [13] and [12], we examine the benefits of using high quality phrases from the Empire system. We discover that the linguistic phrases, when used
by themselves without single terms, are better than tradi- 
tional Smart statistical phrases. However, neither group of 
phrases substantially improves overall performance over 
just using single terms, especially at the high precision 
end. Indeed, phrases tend to help at lower precisions where 
there are few clues to whether a document is relevant. At 
the high precision end, query balance is more important. 
There are generally several clues to relevance for the high- 
est ranked documents, and maintaining balance between 
them is essential. A good phrase match often hurts this 
balance by over-emphasizing the aspect covered by the 
phrase. 
4.4 TREC 7 High Precision 
Cornell/SabIR recently participated in the TREC 7 High
Precision (HP) track. In this track, the goal of the user 
is to find 15 relevant documents to a query within 5 min- 
utes. This is obviously a nice evaluation testbed for user 
efficient retrieval. We used the TRUESmart GUI and in- 
corporated the automatic Boolean filters described above 
into some of our Smart retrievals. 
Only preliminary results are available now, and once again Cornell/SabIR did very well. All 3 of our users did substantially better than the median. One interesting point is that all 3 users are within 1% of each other: the same 3 users participated in the TREC 6 HP track last year with much more varied results. Last year, the hardware speed and choice of query length differed between the users. We attempted to equalize these factors this year.
The basically identical results suggest (but the sample is 
much too small to prove) that our general approach is rea- 
sonably user-training independent. The major activity of 
the user is judging documents, a task for which all users 
are presumably qualified. The results are bounded by user 
agreement with the official relevance judgements, and the 
closeness of the results may indicate we are approaching 
that upper-bound. 
5 CONTEXT-DEPENDENT 
SUMMARIZATION 
Another application area considered for improving end-user efficiency is reduction of the text of the documents themselves. Longer documents contain a lot of text that may not be of interest to the end-user; techniques that reduce the amount of this text will improve the speed at which the end-user can find the useful material. This type
of summarization differs from our previous work in that 
the document summaries are produced within the context 
of a query. This is done by 
1. expanding the vocabulary of the query with related words, using both a standard Smart cooccurrence-based expansion process and the output of the standard Smart adhoc relevance feedback expansion process;
2. weighting the expanded vocabulary by importance 
to the query; and 
3. performing the Smart summarization using only the 
weighted expanded vocabulary. 
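Steps 1-3 amount to query-biased extraction. The sketch below scores each text unit (sentence or paragraph) by the total weight of expanded-vocabulary terms it contains and fills a target length budget; the scoring and selection details are illustrative assumptions, not Smart's actual summarizer.

```python
def query_biased_summary(units, expanded_vocab, target_ratio=0.25):
    """Extract the highest-scoring text units until the summary reaches the
    target fraction of the document length.

    units: sentences or paragraphs; expanded_vocab: term -> query weight.
    """
    def score(unit):
        return sum(w for t, w in expanded_vocab.items() if t in unit.lower().split())
    budget = target_ratio * sum(len(u) for u in units)
    picked, used = set(), 0
    for i in sorted(range(len(units)), key=lambda i: score(units[i]), reverse=True):
        if used + len(units[i]) <= budget:   # take best-scoring units that still fit
            picked.add(i)
            used += len(units[i])
    # re-emit in document order so the extract reads coherently
    return [units[i] for i in sorted(picked)]
```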
We participated in both the TIPSTER dry run and the 
SUMMAC evaluations of summarization. Once again we 
did very well, finishing within the top 2 groups for the 
SUMMAC adhoc, categorization, and QandA tasks. In- 
terestingly, the top 3 groups for the QandA task all used 
Smart for their extraction-based summaries. 
Using the summ_eval evaluation tool on the SUM- 
MAC QandA task, we are continuing our investigations 
into length versus effectiveness, particularly when com- 
paring summaries based on extracting sentences as op- 
posed to paragraphs. As expected, the longer the sum- 
mary in comparison with the original document, the more 
effective the summary. For most evaluation measures, the 
relationship appears to be linear except at the extremes. 
For short summaries, sentences are more effective than paragraphs. This is expected; the coarse granularity of paragraphs makes it hard to fit entire good paragraphs within a short summary.
However, the reverse seems to be true for longer sum- 
maries, at least for us at our current level of summariza- 
tion expertise. The paragraphs tend to include related 
sentences that individually do not seem to use the par- 
ticular vocabulary our matching algorithms desire. This 
suggests that work on coreference becomes particularly 
crucial when working with sentence based summaries. 
Multi-Document Summarization. Our current work in- 
cludes extending context-dependent summarization tech- 
niques for use in multi-document, rather than single-doc- 
ument, summarization. Our work on duplicate informa- 
tion detection will also be critical for creating these more 
complicated summaries. We have no results to report for 
multi-document summarization at this time. 
6 DUPLICATE INFORMATION 
DETECTION 
Users easily become frustrated when information is du- 
plicated among the set of retrieved documents. This is 
especially a problem when users search text collections 
that have been created from several distinct sources: a 
newswire source may have several reports of the same 
incident, each of which may vary insignificantly. If we 
can ensure that a user does not see large quantities of du- 
plicate information then the user time efficiency will be 
improved. 
Figure 5: Document-Document Text Relationship Map for Articles 3608 and 3610. A line connects two paragraphs if 
their similarity is above a predefined threshold. 
Exact duplicate documents are very easy to detect by 
any number of techniques. Documents for which the basic 
content is exactly the same, but differ in document meta- 
data like Message ID or Time of Message, are also easy 
to detect by several techniques. We propose to compute 
a cosine similarity function between all retrieved docu- 
ments. Pairs of documents with a similarity of 1.0 will be 
identical as far as indexable content terms. 
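The proposed cosine test can be sketched as follows; a pair scoring 1.0 (up to floating-point tolerance) contains the same indexable content terms in the same proportions, even when metadata such as Message ID or Time of Message differs. The function names are ours.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def exact_content_duplicates(docs, eps=1e-9):
    """Flag document pairs whose cosine similarity is 1.0: identical as far
    as indexable content terms, regardless of metadata differences."""
    ids = list(docs)
    return [(ids[i], ids[j])
            for i in range(len(ids)) for j in range(i + 1, len(ids))
            if cosine(docs[ids[i]], docs[ids[j]]) >= 1.0 - eps]
```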
The interesting research question is how to examine 
document pairs that are obviously highly related, but do 
not contain exactly the same terms or vocabulary as each 
other. For this, document-document maps are constructed 
between all retrieved documents which are of sufficient 
similarity to each other. These maps (see Figure 5) show a 
link between paragraphs of one document and paragraphs 
of the other if the similarity between the paragraphs is suf- 
ficiently strong. If all of the paragraphs of a document 
are strongly linked to paragraphs of a second document, 
then the content of the first document may be subsumed 
by the content of the second document. If there are un- 
linked paragraphs of a document, then those paragraphs 
contain new material that should be emphasized when the 
document is shown to the user. 
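The paragraph-linking and subsumption test can be sketched as below, with any paragraph-pair similarity function plugged in; the threshold value and names are illustrative assumptions.

```python
def paragraph_link_map(doc_a, doc_b, sim, threshold=0.6):
    """Build a document-document map: emit a link (i, j) for each paragraph
    pair whose similarity clears the threshold.

    doc_a, doc_b: lists of paragraph strings; sim: similarity function
    over a pair of paragraphs (e.g. cosine over term vectors).
    """
    return [(i, j)
            for i, pa in enumerate(doc_a)
            for j, pb in enumerate(doc_b)
            if sim(pa, pb) >= threshold]

def subsumed(doc_a, links):
    """doc_a's content may be subsumed by the other document if every
    paragraph of doc_a is linked to some paragraph of that document."""
    linked = {i for i, _ in links}
    return linked == set(range(len(doc_a)))
```

Paragraphs left unlinked by the map are exactly the "new material" that should be emphasized when the document is displayed.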
The structure of the document maps is an additional 
important feature to be used to indicate the type of rela- 
tionship between the documents: is one document an ex- 
pansion of another, or are they equivalent paraphrases of 
each other, or is one a summary document that includes 
the common topic as well as other topics. All of this infor- 
mation can be used to decide which document to initially 
show the user. 
Document-document maps can be created presently 
within the Smart system, though they have not been used 
in the past for detection of duplicate content [2, 17, 18].
Figure 5 gives such a document-document map between 
two newswire reports, one a fuller version of the other. 
7 SUMMARY 
In summary, we have developed supporting technology 
for improving end-user efficiency of information retrieval 
(IR) systems. We have made progress in three related ap- 
plication areas: high precision information retrieval, near- 
duplicate document detection, and context-dependent doc- 
ument summarization. Our research aims to increase end- 
user efficiency in each of the above tasks by reducing the 
amount of text that the user must peruse in order to get the desired useful information.
As the underlying technology for the above applica- 
tions, we use a novel combination of statistical and lin- 
guistic techniques. The proposed statistical approaches 
extend existing methods in IR by performing statistical
computations within the context of another query or doc- 
ument. The proposed linguistic approaches build on ex- 
isting work in information extraction and rely on a new 
technique for trainable partial parsing. The goal of the 
integrated approach is to identify selected relationships 
among important terms in a query or text and use the ex- 
tracted relationships: (1) to discard or reorder retrieved 
texts, (2) to locate redundant information, and (3) to 
generate coherent query-dependent summaries. We be- 
lieve that the integrated approach offers an innovative and 
promising solution to problems in end-user efficiency for 
a number of reasons: 
• Unlike previous attempts to combine natural lan- 
guage understanding and information retrieval, our 
approach always performs linguistic analysis rela- 
tive to another document or query. 
• End-user effectiveness will not be significantly com- 
promised in the face of errors by the Smart/Empire 
system. 
• The partial parser is a trainable system that can be 
tuned to recognize those linguistic relationships that 
are most important for the larger IR task. 
In addition, we have developed TRUESmart, a Tool- 
box for Research in User Efficiency. TRUESmart is a 
set of tools and data supporting researchers in the de- 
velopment of methods for improving user efficiency for 
state-of-the-art information retrieval systems. In addition, 
TRUESmart includes a simple graphical user interface that 
aids system evaluation and analysis by highlighting im- 
portant term relationships identified by the underlying sta- 
tistical and linguistic language processing algorithms. To 
date, we have used TRUESmart to integrate and evaluate 
system components in high-precision retrieval and context- 
dependent summarization. 
In conclusion, we believe that our statistical-linguistic 
approach to automated text retrieval has shown promising 
results and has simultaneously addressed four important 
goals for the TIPSTER program -- the need for increased 
accuracy in detection systems, increased portability and 
applicability of extraction systems, better summarization 
of free text, and increased communication across detec- 
tion and extraction systems. 

References

[1] Steven Abney. Partial Parsing via Finite-State Cascades. In Workshop on Robust Parsing, pages 8-15, 1996.

[2] James Allan. Automatic Hypertext Construction. PhD thesis, Cornell University, Ithaca, New York, 1995.

[3] Eric Brill. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, 21(4):543-565, 1995.

[4] C. Cardie. Domain-Specific Knowledge Acquisition for Conceptual Sentence Analysis. PhD thesis, University of Massachusetts, Amherst, MA, 1994. Available as University of Massachusetts CMPSCI Technical Report 94-74.

[5] C. Cardie. Embedded machine learning systems for natural language processing: A general framework. In Stefan Wermter, Ellen Riloff, and Gabriele Scheler, editors, Symbolic, connectionist, and statistical approaches to learning for natural language processing, Lecture Notes in Artificial Intelligence Series, pages 315-328. Springer, 1996.

[6] C. Cardie and D. Pierce. Error-Driven Pruning of Treebank Grammars for Base Noun Phrase Identification. In Proceedings of the 36th Annual Meeting of the ACL and COLING-98, pages 218-224. Association for Computational Linguistics, 1998.

[7] R. Grishman. TIPSTER Architecture Design Document Version 2.2. Technical report, DARPA, 1996. Available at http://www.tipster.org/.

[8] D. K. Harman. Appendix A: Evaluation techniques and measures. In D. K. Harman, editor, Proceedings of the Fourth Text REtrieval Conference (TREC-4), pages A6-A14. NIST Special Publication 500-236, 1996.

[9] J. Hobbs, D. Appelt, J. Bear, D. Israel, M. Kameyama, M. Stickel, and M. Tyson. FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. In E. Roche and Y. Schabes, editors, Finite-State Language Processing, pages 383-406. MIT Press, Cambridge, MA, 1997.

[10] I. Mani, D. House, G. Klein, L. Hirschman, L. Obrst, T. Firmin, M. Chrzanowski, and B. Sundheim. The TIPSTER SUMMAC text summarization evaluation: Final report. Technical report, DARPA, 1998.

[11] M. Marcus, M. Marcinkiewicz, and B. Santorini. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.

[12] Mandar Mitra. High-Precision Information Retrieval. PhD thesis, Department of Computer Science, Cornell University, 1998.

[13] Mandar Mitra, Chris Buckley, Amit Singhal, and Claire Cardie. An analysis of statistical and syntactic phrases. In L. Devroye and C. Chrisment, editors, Conference Proceedings of RIAO-97, pages 200-214, June 1997.

[14] Mandar Mitra, Amit Singhal, and Chris Buckley. Improving automatic query expansion. In W. Bruce Croft, Alistair Moffat, C.J. van Rijsbergen, Ross Wilkinson, and Justin Zobel, editors, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 206-214. Association for Computing Machinery, 1998.

[15] J. Reynar and A. Ratnaparkhi. A Maximum Entropy Approach to Identifying Sentence Boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 16-19, San Francisco, CA, 1997. Morgan Kaufmann.

[16] E. Roche and Y. Schabes, editors. Finite State Devices for Natural Language Processing. MIT Press, Cambridge, MA, 1997.

[17] Gerard Salton, James Allan, Chris Buckley, and Mandar Mitra. Automatic analysis, theme generation and summarization of machine-readable texts. Science, 264:1421-1426, June 1994.

[18] Gerard Salton, Amit Singhal, Chris Buckley, and Mandar Mitra. Automatic text decomposition using text segments and text themes. Technical Report TR95-1555, Cornell University, 1995.

[19] E. M. Voorhees and D. K. Harman. Overview of the sixth Text REtrieval Conference (TREC-6). In E. M. Voorhees and D. K. Harman, editors, The Sixth Text REtrieval Conference (TREC-6). NIST Special Publication 500-240, 1998.
