<?xml version="1.0" standalone="yes"?>
<Paper uid="X98-1017">
  <Title>The Smart/Empire TIPSTER IR System</Title>
  <Section position="4" start_page="107" end_page="108" type="metho">
    <SectionTitle>
2 THE UNDERLYING SYSTEMS:
SMART AND EMPIRE
</SectionTitle>
    <Paragraph position="0"> The two main foundations of our research are the Smart system for Information Retrieval and the Empire system for Natural Language Processing. Both are large systems running in the UNIX environment at Comell University.</Paragraph>
    <Section position="1" start_page="107" end_page="107" type="sub_section">
      <SectionTitle>
2.1 Smart
</SectionTitle>
      <Paragraph position="0"> Smart Version 13 is the latest in a long line of experimental information retrieval systems, dating back over 30 years, developed under the guidance of G. Salton. The new version is approximately 50,000 lines of C code and documentation.</Paragraph>
      <Paragraph position="1"> Smart Version 13 offers a basic framework for investigations of the vector space and related models of information retrieval. Documents are fully automatically indexed, with each document representation being a weighted vector of concepts, the weight indicating the importance of a concept to that particular document. The document representatives are stored on disk as an inverted file. Natural language queries undergo the same indexing process. The query representative vector is then compared with the indexed document representatives to arrive at a similarity and the documents are then fully ranked by similarity.</Paragraph>
      <Paragraph position="2"> Smart Version 13 is highly flexible (i.e., its algorithms can be easily adapted for a variety of IR tasks) and very fast, thus providing an ideal platform for information retrieval experimentation. Documents are indexed at a rate of almost two gigabytes an hour, on systems currently costing under $5,000 (for example, a dual Pentium Pro 200 Mhz with 512 megabytes memory and disk). Retrieval speed is similarly fast, with basic simple searches taking much less than a second a query.</Paragraph>
    </Section>
    <Section position="2" start_page="107" end_page="108" type="sub_section">
      <SectionTitle>
2.2 The Empire System: A Trainable Partial Parser
</SectionTitle>
      <Paragraph position="0"> Stated simply, the goal of the natural language processing (NLP) component for the selected text retrieval tasks is to locate linguistic relationships between query terms.</Paragraph>
      <Paragraph position="1"> For this, we have developed Empire 1, a trainable partial parser. The remainder of this section describes the assumptions of our approach and the general architecture of the system.</Paragraph>
      <Paragraph position="2"> For the TIPSTER project, we are investigating the role of linguistic relationships in information retrieval tasks. A linguistic relationship between two terms is any relationship that can be determined through syntactic or semantic interpretation of the text that contains the terms. We are focusing on three classes of linguistic relationships that we believe will aid the information retrieval tasks: 1. noun phrase relationships. E.g., determine whether two query terms appear in the same (simple) noun phrase; find all places where a query term appears as the head of a noun phrase.</Paragraph>
      <Paragraph position="3"> 1 The name refers to our focus on empirical methods for development and evaluation of the system.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="108" end_page="111" type="metho">
    <SectionTitle>
2. Training
Corpus
</SectionTitle>
    <Paragraph position="0"> .</Paragraph>
    <Paragraph position="1"> subject-verb-object relationships, including the identification of subjects and objects in gap constructions. These relationships help to identify the functional structure of a sentence, i.e., who did what to whom. Once identified, Smart can assign higher weights to query terms that appear in these topicindicating verb, object, and especially subject positions. null noun phrase coreference. Coreference resolution is the identification of all strings in a document that refer to the same entity. Noun phrase coreference will allow Smart to create more coherent summaries, e.g., by replacing pronouns with their referents as identified by Empire. In addition, Smart can use coreference relationships to modify its term weighting function to reflect the implied equality between all elements of a noun phrase equivalence class.</Paragraph>
    <Paragraph position="2"> Once identified, the linguistic relationships can be employed in a number of ways to improve the efficiency of end-users: they can be used (1) to prefer the retrieval of documents that also exhibit the relationships; (2) to indicate the presence of redundant information; or (3) to establish the necessary context in automatically generated summaries. Our approach to locating linguistic relationships is based on the following assumptions: * The NLP system need recognize only those relationships that are useful for the specific text retrieval application. There may be no need for full-blown syntactic and semantic analysis of queries and documents. null * The NLP system must recognize these relationships both quickly and accurately. The speed requirement argues for a shallow linguistic analysis; the accuracy requirement argues for algorithms that focus on precision rather than recall.</Paragraph>
    <Paragraph position="3"> * The NLP component need only provide a comparative linguistic analysis between a document and a query. This should simplify the NLP task because individual documents do not have to be analyzed in isolation, but only relative to the query.</Paragraph>
    <Paragraph position="4"> Given thcse assumptions, we have developed Empire, a fast, trainable, precision-based partial parser. As a partial parser, Empire performs only shallow syntactic analysis of input texts. Like many partial parsers and NLP systems lk~r information extraction (e.g., Hobbs et al. \[9\]), Empire relies primarily on finite-state technology \[16\] to recognize all syntactic and semantic entities as well as their relationships to one another. Parsing proceeds in stages -the initial stages identit~C/ relatively simple constituents:  simple noun phrases, some prepositional phrases, verb groups, and clauses. All linguistic relationships that require higher-level attachment decisions are identified in subsequent stages and rely on output from earlier stages. Our use of finite-state transducers for partial parsing is most similar to the work of Abney \[1\], who employs a series of cascaded finite-state machines to build up an increasingly complex linguistic analysis of an incoming sentence.</Paragraph>
    <Paragraph position="5"> Unlike most work in this area, however, we do not use hand-crafted patterns to drive the linguistic analysis. Instead, we rely on corpus-based learning algorithms to acquire the grammars necessary for driving each level of linguistic relationship identification. In particular, we have developed a very simple, yet effective technique for automating the acquisition of grammars through error-driven pruning oftreebank grammars \[6\]. As shown in Figure 1, the method first extracts an initial grammar from a &amp;quot;treebank&amp;quot; corpus, i.e., a corpus that has been annotated with respect to the linguistic relationship of interest. Consider the base noun phrase relationship -- the identification of simple, non-recursive noun phrases. Accurate identification of base noun phrases is a critical component of any partial parser; in addition, Smart relies on base NPs as its primary source of linguistic phrase information. To extract a grammar for base noun phrase identification, we tag the training text with a part-of-speech tagger (we use Mitre's version of Brill's tagger \[3\]) and then extract as an NP rule every unique part-of-speech sequence that covers a base NP annotation.</Paragraph>
    <Paragraph position="6"> Next, the grammar is improved by discarding rules that obtain a low precision-based &amp;quot;benefit&amp;quot; score when applied to a held out portion of the training corpus, the pruning corpus. The resulting &amp;quot;grammar&amp;quot; can then be used to identify base NPs in a novel text as follows:  1. Run all lower-level annotators. For base NPs, for example, run the part-of-speech annotator.</Paragraph>
    <Paragraph position="7"> 2. Proceed through the tagged text from left to right,  at each point matching the rules against the remaining input. For base NP recognition, match the NP rules against the remaining part-of-speech tags in the text.</Paragraph>
    <Paragraph position="8"> 3. If there are multiple rules that match beginning at tag or token ti, use the longest matching rule R.</Paragraph>
    <Paragraph position="9"> Begin the matching process anew at the token that follows the last NP.</Paragraph>
    <Paragraph position="10">  Using this simple grammar extraction and pruning algorithm with the naive longest-match heuristic for applying rules to incoming text, the learned grammars are shown to perform very well for base noun phrase identification. A detailed description of the base noun phrase finder and its evaluation can be found in Cardie and Pierce \[6\]. In summary, however, we have evaluated the approach on two base NP corpora derived from the Penn Treebank \[11\].</Paragraph>
    <Paragraph position="11"> The algorithm achieves 91% precision and recall on base NPs that correspond directly to non-recursive noun phrases in the treebank; it achieves 94% precision and recall on slightly less complicated noun phrases. 2 We are currently investigating the use of error-driven grammar pruning to infer the grammars for all phases of partial parsing and the associated linguistic relationship identification. Initial results on verb-object recognition show 72% precision when tested on a corpus derived from the Penn Treebank. Analysis of the results indicates that our context-free approach, which worked very well for noun phrase recognition, does not yield sufficient accuracy for verb-object recognition. As a result, we have used standard machine learning algorithms (i.e., k-nearest neighbor and memory-based learning using the value-difference metric) to classify each proposed verb-object bracketing as either correct or incorrect given a 2-word window surrounding the bracketing. In preliminary experiments, the machine learning algorithm obtains 84% generalization accuracy. If we discard all bracketings it classifies as incorrect, overall precision for verb-object recognition increases from 72% to over 80%. The next section outlines our general approach for using learning algorithms in conjunction with the Empire system.</Paragraph>
    <Paragraph position="12"> 2This corpus further simplifies some of the the Treebank base NPs by removing ambiguities that we expect other components of our NLP system to handle, including: conjunctions, NPs with leading and trailing adverbs and verbs, and NPs that contain prepositions.</Paragraph>
    <Paragraph position="13">  As noted above, Empire's finite-state partial parsing methods may not be adequate for identifying some linguistic relationships. At a minimum, many linguistic relationships are better identified by taking additional context into account. In these circumstances, we propose the use of corpus-based machine learning techniques -- both as a systematic means for correcting errors (as done for verb-object recognition above) and for learning to identify linguistic relationships that are more complex than those covered by the finite-state methods above.</Paragraph>
    <Paragraph position="14"> In particular, we have employed the Kenmorc knowledge acquisition framework for NLP systems \[4, 5\]. Kenmore relies on three major components. First, it requires an annotated training corpus, i.e., a collection of on-line documents, that has been annotated with the necessary bracketing information. Second, it requires a robust sentence analyzer, or parser. For this, we use the Empire partial parser. Finally, the framework requires an inductive learning algorithm. Although any inductive learning algorithm can be used, we have successfully used case-based learning (CBL) algorithms for a number of natural language learning problems.</Paragraph>
    <Paragraph position="15"> There are two phases to the framework: (1) a partially automated training phase, or acquisition phase, in which a particular linguistic relationship is learned, and (2) an application phase, in which the heuristics learned during training can be used to identify the linguistic relationship in novel texts. More specifically, the goal of Kenmore's training phase (see Figure 2) is to create a case base, or memory, of linguistic relationship decisions. To do this, the system randomly selects a set of training sentences from the annotated corpus. Next, the sentence analyzer processes the selected training sentences, creating one case for every instance of the linguistic relationship that occurs. As shown in Figure 2, each case has two parts. The context portion of the case encodes the context in which the linguistic relationship was encountered -- this is essentially a representation of some or all of the constituents in the neighborhood of the linguistic relationship as denoted in the flat syntactic analysis produced by the parser. The solution portion of the case describes how the linguistic relationship was resolved in the current example. In the training phase, this solution information is extracted directly from the annotated corpus. As the cases are created, they are stored in the case base.</Paragraph>
    <Paragraph position="16"> After training, the NLP system uses the case base without the annotated corpus to identify new occurrences of the linguistic relationship in novel sentences. Given a sentence as input, the sentence analyzer processes the sentence and creates a problem case, automatically filling in its context portion based on the constituents appearing the  sentence. To determine whether the linguistic relationship holds, Kenmore next compares the problem case to each case in the case base, retrieves the most similar training case, and returns the decision as indicated in the solution part of the case. The solution information lets Empire decide whether the desired relationship exists in the current sentence.</Paragraph>
    <Paragraph position="17"> In previous work, we have used Kenmore for part-of-speech tagging, semantic feature tagging, information extraction concept acquisition, and relative pronoun resolution \[5\]. We expect that this approach will be necessary for coreference resolution, for some types of subject-object identification, and for handling gap constructs (i.e., tbr determining that &amp;quot;boy&amp;quot; is the subject of &amp;quot;ate&amp;quot; as well as the object of &amp;quot;saw&amp;quot; in &amp;quot;Billy saw the boy that ate the candy&amp;quot;). It is also the approach used to learn the verb-object correction &amp;quot;heuristics&amp;quot; described in the last section.  The final class of linguistic relationship is noun phrase eoreference -- for every entity in a text, the NLP system must locate all of the expressions or phrases that refer to it. As an example, consider the following: &amp;quot;Bill Clinton, current president of the United States, left Washington Monday morning for China. He will return in two weeks.&amp;quot; In this excerpt, the phrases &amp;quot;Bill Clinton,&amp;quot; &amp;quot;current president (of the United States),&amp;quot; and &amp;quot;he&amp;quot; refer to the same entity. Smart can use this coreference information to treat the associated terms as equivalents. For example, it can assume that all items in the class are present whenever one appears. In conjunction with coreference resolution, we are also investigating the usefulness of providing the IR system with canonicalized noun phrase forms that make use of term invariants identified during coreference.</Paragraph>
    <Paragraph position="18"> To date, we have implemented two simple algorithms for coreference resolution to use purely as baselines. Both operate only on base noun phrases as identified by Empire's base NP finder. The first heuristic assumes that two noun phrases are coreferent if they share any terms in common. The second assumes that two noun phrases are coreferent if they have the same head. Both obtained higher scores than expected when tested on the MUC6 coreference data set. The head noun heuristic achieved 42% recall and 51% precision; the overlapping terms heuristic achieved 41% recall and precision.</Paragraph>
    <Paragraph position="19">  All relationships identified by Empire are made available to Smart in the form of TIPSTER annotations. We currently have the following annotators in operation: * tokenizer: identifies tokens, punctuation, etc.</Paragraph>
    <Paragraph position="20"> * sentence finder: based on Penn's maximum entropy algorithm \[ 15\].</Paragraph>
    <Paragraph position="21"> * baseNPs: identifies non-recursive noun phrases.</Paragraph>
    <Paragraph position="22"> * verb-object: identifies verb-object pairs, either by bracketing the verb group and entire direct object phrase or by noting just the heads of each.</Paragraph>
    <Paragraph position="23"> * head noun coreference heuristic: identifies coreferent NPs.</Paragraph>
    <Paragraph position="24"> * overlapping terms coreference heuristic: identifies coreferent NPs.</Paragraph>
    <Paragraph position="25"> The tokenizer is written in C. The sentence finder is written in Java. All other annotators are implemented in Lucid/Liquid Common Lisp.</Paragraph>
  </Section>
  <Section position="6" start_page="111" end_page="116" type="metho">
    <SectionTitle>
3 TRUESmart
</SectionTitle>
    <Paragraph position="0"> To support our research in user-efficient information retrieval, we have developed TRUESmart, a Toolbox for Research in User Efficiency. As noted above, TRUESmart allows the integration, evaluation, and analysis of IR and NLP algorithms for high-precision searches, context-dependent summarization, and duplicate detection. TRUESmart provides three classes of resources that are necessary for effective research in the above areas:  l. Testbed Collections, including test queries and correct answers 2. Automatic Evaluation Tools, to measure overall how an approach does on a collection.</Paragraph>
    <Paragraph position="1"> 3. Failure Analysis Tools, to help the researcher in- null vestigate in depth what has happened.</Paragraph>
    <Paragraph position="2"> These tools are, to a large extent, independent of the actual research being done. However, they are just as vital for good research as the research algorithms themselves.</Paragraph>
    <Section position="1" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
3.1 TRUESmart Collections
</SectionTitle>
      <Paragraph position="0"> The testbed collections organized for TRUESmart are all based on TREC \[19\] and SUMMAC \[10\], the large evaluation workshops run by NIST and DARPA respectively.</Paragraph>
      <Paragraph position="1"> TREC provides a number of document collections ranging up to 500,000 documents in size, along with queries and relevance judgements that tell whether a document is relevant to a particular query.</Paragraph>
      <Paragraph position="2"> Evaluation of our high-precision research can be done directly using the TREC collections. The TREC documents, queries, and relevance judgements are sufficient to evaluate whether particular high-precision algorithms do better than others.</Paragraph>
      <Paragraph position="3"> For summarization research, however, a different test-bed is needed. The SUMMAC workshop evaluated summaries of documents. The major evaluation measured whether human judges were able to judge relevance of entire documents just from the summaries. While very valuable in giving a one-time absolute measure of how well summarization algorithms are doing, human-dependent evaluations are infeasible for a research group to perform on ongoing research since different human assessors are required whenever a given document or summary is judged.</Paragraph>
      <Paragraph position="4"> Our summarization testbed is based on the SUMMAC QandA evaluation. Given a set of questions about a document, and a key describing the locations in the document where those questions are answered, the goal is to evaluate how well an extraction-based summary of that document answers the questions. So the TRUESmart summarization testbed consists of  question is answered.</Paragraph>
      <Paragraph position="5"> Objective evaluation of near-duplicate information detection is difficult. As part of our efforts in this area, we have constructed a small set (50 pairs) of near-duplicate documents of newswire articles. These pairs were deliberately chosen to encompass a range of duplication amounts; we include 5 pairs at cosine similarity .95, 5 pairs at .90, and 10 pairs at each of .85, .80, .75, and .70. In addition, they have been categorized as to exactly what the relationship between the pairs is. For example, some pairs are slight rewrites by the same author, some are followup articles, and some are two articles on the same subject by different authors. We also have queries that will retrieve both of these pairs among the top documents. These articles are tagged: corresponding sections of text from each document pair are marked as identical, semantically equivalent, or different.</Paragraph>
      <Paragraph position="6"> Preparing a testbed for multi-document summarization is even more difficult. We have not done this as yet, but our initial approach will take as a seed the QandA evaluation test collections described above. This gives us a query and a set of relevant documents with known answers to a set of common questions. Evaluation can be done by performing a multi-document summarization on a subgroup of this set of relevant documents. The final summary can be evaluated based upon how many questions are answered (a question is answered by a text excerpt in the summary if the excerpt in the corresponding original document was marked as answering the question), and how many questions are answered more than once. If too many questions are answered more than once, then the duplicate detection algorithms may not be working optimally. If too few questions are answered at all, then the summarization algorithms may be at fault. The evaluation numbers produced by the final summary can be compared against the average evaluation numbers for the documents in the group.</Paragraph>
    </Section>
    <Section position="2" start_page="111" end_page="112" type="sub_section">
      <SectionTitle>
3.2 TRUESmart Evaluation
</SectionTitle>
      <Paragraph position="0"> Automatic evaluation of research algorithms is critical for rapid progress in all of these areas. Manual evaluation is  valuable, but impractical when trying to distinguish between small variations of a research group's algorithms.  Automatic evaluation of straight information retrieval tasks is not new. In particular, we have provided the &amp;quot;trec_eval&amp;quot; program to the TREC community to evaluate retrieval in the TREC environment. It will also be an evaluation component in the TRUESmart ToolBox. The trec_eval measures are described in the TREC-4 workshop proceedings \[8\].</Paragraph>
      <Paragraph position="1">  The QandA evaluation of SUMMAC is very close to being automatic once questions and keys are created. For SUMMAC, the human assessors still judge whether or not a given summary answers the questions. Indeed, for non-extraction-based summaries, this is required. But for evaluation of extraction-based summarization (where the summaries contain clauses, sentences, or paragraphs of the original document), an automatic approximation of the assessor task is possible. This enables a research group to fairly evaluate and compare multiple summaries of the same document, with no additional manual effort after the initial key is determined. Thus we have written the &amp;quot;summ_eval&amp;quot; evaluator. This algorithm for the automatic evaluation of summaries:  1. Automatically finds the spans of the text of the original document that were given as answers in the keys.</Paragraph>
      <Paragraph position="2"> 2. Automatically finds the spans of the text of the original document that appeared in a summarization of the document.</Paragraph>
      <Paragraph position="3"> 3. Computes various measures of overlap between the  summarization spans and the answer spans.</Paragraph>
      <Paragraph position="4"> The effectiveness of two summarization algorithms can be automatically compared by comparing these overlap measures.</Paragraph>
      <Paragraph position="5"> We ran summ_eval on the summaries produced by the systems of the SUMMAC workshop. The comparative ranking of systems using summ_eval is very close to the (presumably) optimal rankings using human assessors. This strongly suggests that automatic scoring of summ_eval can be useful for evaluation in circumstances where human scoring is not available  &amp;quot;Dup_eval&amp;quot; uses the same algorithms as summ_eval to measure how well an algorithm can detect whether one document contains information that is duplicated in another. The key (correct answer) for one document out of a pair will give the spans of text in that document that are duplicated in the other, at three different levels of duplication: exact, semantically equivalent, and contained in. The duplicate detection algorithm being evaluated will come up with similar spans. Dup_eval measures the overlap between the these sets of spans.</Paragraph>
    </Section>
    <Section position="3" start_page="112" end_page="116" type="sub_section">
      <SectionTitle>
3.3 TRUESmart GUI
</SectionTitle>
      <Paragraph position="0"> Automatic evaluation is only the beginning of the research process. Once evaluation pinpoints the failures and successes of a particular algorithm, analysis of these failures must be done in order to improve the algorithm. This analysis is often time-consuming and painful. This motivates the implementation of the TRUESmart GUI. This GUI is not aimed at being a prototype of a user efficiency GUI.</Paragraph>
      <Paragraph position="1"> Instead, it offers a basic end-user interface while giving the researcher the ability to explore the underlying causes of particular algorithm behavior.</Paragraph>
      <Paragraph position="2"> Figure 3 shows the basic TRUESmart GUI as used to support high-precision retrieval and context-dependent summarization. The user begins by typing a query into the text input box in the middle, left frame. The sample query is TREC query number 151: &amp;quot;The document will provide information on jail and prison overcrowding and how inmates are forced to cope with those conditions; or it will reveal plans to relieve the overcrowded condition.&amp;quot; Clicking the SubmitQ button initiates the search. Clicking the NewQ button allows the submission of a new query. 3 Once the query is submitted, Smart initiates a global search in order to quickly obtain an initial set of documents for the user. The document number, similarity ranking, similarity score, source, date, and title of the top 20 retrieved documents are displayed in the upper left frame of the GUI. Clicking on any document will cause its query-dependent summary to be displayed in the large frame on the right. In Figure 3, the summary of the seventh document is displayed. In this run, we have set Smart's target summary length to 25% and asked for sentence- (rather than paragraph-) based summaries. Matching query terms are highlighted throughout the summary although they are not visible in the screen dump. The left, bottom-most frame of the interface lists the most important query terms (e.g., prison,jail,  inmat(e), overcrowd) and their associated weights (e.g., 4.69,5.18,7.17, 12.54).</Paragraph>
      <Paragraph position="3"> Alter the initial display of the top-ranked documents, Smart begins a local search in the background: each individual document is reparsed and matched once again against the query to see if it satisfies the particular high-precision restriction criteria being investigated. If it doesn't the document is removed from the retrieved set; otherwise, the document remains in the final retrieved set with a score that combines the global and local score.</Paragraph>
      <Paragraph position="4"> In addition, the user can supply relevance judgements on any document by clicking Rel (relevant), NRel (not relevant), or PRel (probably relevant). Smart uses these judgements as feedback, updating the ranking after every 5 judgements by adding new documents and removing those already judged from the list of retrieved texts. Figure 4 shows the state of the session after a number of relevance judgements have been made and new documents have been added to the top 20.</Paragraph>
      <Paragraph position="5"> The interface, while basic, is valuable in its own right.</Paragraph>
      <Paragraph position="6"> It was successfully used for the Cornell/SablR experiments in the TREC 7 High-Precision track. In this task, users were asked to find 15 relevant documents within 5 minutes for each of 50 queries. This was a true test of user efficiency; and Cornell/SablR did very well.</Paragraph>
      <Paragraph position="7"> The most important use of the GUI, though, is to explore what is happening underneath the surface, in order to aid the researcher. Operating on either a single document or a cluster of documents, the researcher can request several different views. The two main paradigms are: (1) the document map view, which visually indicates the relationships between parts of the selected document(s); and (2) the document annotation view, which displays any subset of the available annotations for the selected document(s). Neither view is shown in Figures 3 and 4.</Paragraph>
      <Paragraph position="8"> The document annotation view, in particular, is extremely flexible. The interface allows the user to run any of the available annotators on a document (or document set). Each annotator returns the text(s) and the set of annotations computed for the text(s). The GUI, in turn, displays the text with the spans of each annotation type highlighted in a different color. Optionally, the values of each annotation can be displayed in a separate window. Thus, for instance, a document may be returned with one annotation type giving the spans of a document summary, and other annotation types giving the spans of an ideal summary. The researcher can then immediately see what the problems are with the document summary.</Paragraph>
      <Paragraph position="9"> There is no limit to the number of possible annotators that can be displayed. Annotators implemented or planned include:  with the inferred filler if it matches an important term.</Paragraph>
      <Paragraph position="10"> Analyzing the role of linguistic relationships in the IR tasks amounts to requesting the display of some or all of the NLP annotators. For example, the user can request to see linguistic phrase matches as well as statistical phrase matches. In the example from Figure 3, the resulting annotated summary would show &amp;quot;27 inmates&amp;quot; and &amp;quot;Latino inmates&amp;quot; as matches of the query term &amp;quot;inmates&amp;quot; because all instances of &amp;quot;inmates&amp;quot; appear as head nouns. Similarly, it would show a linguistic phrase match between &amp;quot;jail overcrowding&amp;quot; (paragraph 5 of the summary) and &amp;quot;jail and prison overcrowding&amp;quot; (in the query) for the same reason. When the output of the linguistic phrase annotator is requested, the lower left frame that lists query terms and weights is updated to include the linguistic phrases from the query and their corresponding weights.</Paragraph>
      <Paragraph position="11"> Alternatively, one might want to analyze the role of the &amp;quot;subject&amp;quot; annotator. In the running example, this would modify the summary window to show matches that involve terms appearing as the subject of a sentence or clause. For example, all of the following occurrences of &amp;quot;inmates&amp;quot; would be marked as subject matches with the &amp;quot;inmates&amp;quot; query term, which also appears in the subject position (&amp;quot;inmates are forced&amp;quot;): &amp;quot;inmates were injured&amp;quot; (paragraph l ), &amp;quot;inmates broke out&amp;quot; (paragraph 2), &amp;quot;inmates refused&amp;quot; (paragraph 2), &amp;quot;inmates are confined&amp;quot; (paragraph 3), etc. Smart can give extra weight to these &amp;quot;subject&amp;quot; term matches since entities that appear in this syntactic position are often central topic terms. The interface helps the developer to quickly locate and determine the correctness of subject matches. As an aside, if the &amp;quot;subject gap construction&amp;quot; annotator were requested, &amp;quot;inmates&amp;quot; would be filled in as the implicit subject of &amp;quot;return&amp;quot; in paragraph 2 and would be marked as a query term match.</Paragraph>
      <Paragraph position="12">  ..... 7. = .~,-.</Paragraph>
      <Paragraph position="13">  situation ~ cr~tsd a paradox in ~Ic~ sore than 200 Jall beds -- r~dy to be filled ~ will r~main ~mpt9 ~ t.~xx~n officials ~9 o~rcrowdin9 continue~ to force the rel~ of ~&amp;quot;e than a hundred suspectsd or convicted criminals a da9 ~ ~Id otJ~'~iso be incarcerated.</Paragraph>
      <Paragraph position="14"> But last week the bond voted to help relieve Oran9e C~nt &amp;quot;a seric~ Jail o~-~-oddln9 ~'~bl~ b9 not ~nelizin9 the cant9 for so-called &amp;quot;double lxmkin &amp;quot; in 216 Of the 384 cells at t~ Intake-Rel~ C~mtsr in ,:~nta ~. &lt;~&gt; &lt;P&gt; there is an odd ~t to the state's r~.Jlations: The Boord of Corrections onl9 has authcrit9 to penalize a cant9 ~nile the Jail in question is under ccr~truction. ~ has previo~19 sued the count9 over Jail conditions.</Paragraph>
      <Paragraph position="15"> &lt;/P&gt; &lt;P&gt; &amp;quot;lOuble bunkin9 &amp;quot;on the cheap&amp;quot; will nac~il9 lead to Prisonor-to-pris~r violence uhich sin Is cells wore intended to eliminate,&amp;quot; Herman ~rots in a letter to the stats board. ~JP&gt; &lt;P&gt; said the stats's requirement for ain le cells is intended to Protect wlrmeable irNt~ ~ such as 9oun9 or sontsll9 disturbed people -- frxm the l~'Ison population and to isolate irmat,~ kno.m to be exc~ivel9 violent.  In addition, Krona said that Isot yeor 43,675 suspectsd or convicted criminals ~ere either turned ewe~ from the Jail or 9ivan an eorl9 release to ease the overoro~flng. O'P&gt; &lt;P&gt; In addition to the 14 requsotsd delxJtias, sheriff's of Ficiala told the Board of C~,-r~tions that tJ~9 ~Jld need 16 more clerical staff ~rkcrs to ~ the additional inmates.</Paragraph>
      <Paragraph position="16"> Rel: 2 PReI: 0 NRel: 1 Elapsed tim: 29,2  Finally, the role of coreference resolution might also be analyzed by requesting to see the output of the coreference annotator. In response to this request, the document text window would then be updated to highlight in the same color all of the entities considered in the same coreference equivalence class. As noted above (see Section 2.2), we currently have two simple coreference annotators: one that uses the head noun heuristic and one that uses the overlapping terms heuristic. In our example, the head noun annotator would assume, among other things, that any noun phrase with &amp;quot;inmates&amp;quot; as its head refers to the same entity: &amp;quot;27 inmates&amp;quot;, &amp;quot;black and Latino inmates&amp;quot;, &amp;quot;the inmates&amp;quot;, etc. (Note that many of these proposed coreferences are incorrect -- the heuristics are only meant to be used as baselines with which to compare other, better, coreference algorithms.) A quick scan of the text with all of these occurrences highlighted lets the user quickly determine how well the annotator is working for the current example. After limited pronoun resolution is added to the coreference annotator, &amp;quot;their&amp;quot; in &amp;quot;their cells&amp;quot; (paragraph 2) would also be highlighted as part of the same equivalence class.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="116" end_page="117" type="metho">
    <SectionTitle>
4 HIGH-PRECISION INFORMA-
TION RETRIEVAL
</SectionTitle>
    <Paragraph position="0"> In order to maintain general-purpose retrieval capabilities, for example, current IR systems attempt to balance their systems with respect to precision and recall measures. A number of information retrieval tasks, however, require retrieval mechanisms that emphasize precision: users want to see a small number of documents, most of which are deemed useful, rather than being given as many useful documents as possible where the useful documents are mixed in with numerous non-useful documents. As a result, our research in high-precision IR concentrates on improving user time efficiency by showing the user only documents that there is very good reason to believe are useful. Precision is increased by restricting an already retrieved set of documents to those that meet some additional criteria for relevance. An initial set of documents is retrieved (a global search), and each individual document is reparsed and matched against the query again to see if it satisfies the particular restriction criteria being investigated (local matching). If it does, the document is put into the final retrieved set with a score of some combination of the global and local score. We have investigated a number of re-ranking algorithms. Three are briefly described below: Boolean filters, clusters, and phrases.</Paragraph>
    <Section position="1" start_page="116" end_page="116" type="sub_section">
      <SectionTitle>
4.1 Automatic Boolean Filters
</SectionTitle>
      <Paragraph position="0"> Smart expands user queries by adding terms occurring in the top documents. Maintaining the focus of the query is difficult while expanding; the query tends to drift away towards some one aspect of the query while ignoring other aspects. Therefore, it is useful to have a re-ranking algorithm that emphasizes those top documents which cover all aspects of the query.</Paragraph>
      <Paragraph position="1"> In recent work \[14\], we construct (soft) Boolean filters containing all query aspects and use these for re-ranking.</Paragraph>
      <Paragraph position="2"> A manually prepared filter can improve average precision by up to 22%. In practice, a user is not going to go to the difficulty of preparing such a filter, however, so an automatic approximation is needed. Aspects are automatically identified by looking at the term-term correlations among the query terms. Highly correlated terms are assumed to belong to the same aspect, and less correlated terms are assumed to be independent aspects. The automatic filter includes all of the independent aspects, and improves average precision by 6 to 13%.</Paragraph>
    </Section>
    <Section position="2" start_page="116" end_page="116" type="sub_section">
      <SectionTitle>
4.2 Clusters
</SectionTitle>
      <Paragraph position="0"> Clustering the top documents can yield improvements from two sources, as we examine in \[12\]. First, outlier documents (those documents not strongly related to other documents) can be removed. This works reasonably for many queries. Unfortunately, it fails catastrophically for some hard queries where the outlier may be the only top relevant document! Absolute failures need to be avoided, so this approach is not currently recommended. The second improvement source is to ensure that query expansion terms come from all clusters. This is another method to maintain query focus and balance. A very modest improvement of 2 to 3% is obtained; it appears the Boolean filter approach above is to be preferred, unless clustering is being done for other purposes in any case.</Paragraph>
    </Section>
    <Section position="3" start_page="116" end_page="117" type="sub_section">
      <SectionTitle>
4.3 Phrases
</SectionTitle>
      <Paragraph position="0"> Traditionally, phrases have been viewed as a precision enhancing device. In \[13\] and \[12\], we examine the benefits of using high quality phrases from the Empire system. We discover that the linguistic phrases, when used by themselves without single terms, are better than traditional Smart statistical phrases. However, neither group of phrases substantially improves overall performance over just using single terms, especially at the high precision end. Indeed, phrases tend to help at lower precisions where there are few clues to whether a document is relevant. At the high precision end, query balance is more important.</Paragraph>
      <Paragraph position="1">  There are generally several clues to relevance for the highest ranked documents, and maintaining balance between them is essential. A good phrase match often hurts this balance by over-emphasizing the aspect covered by the phrase.</Paragraph>
    </Section>
    <Section position="4" start_page="117" end_page="117" type="sub_section">
      <SectionTitle>
4.4 TREC 7 High Precision
</SectionTitle>
      <Paragraph position="0"> Cornell/SablR recently participated in the TREC 7 High Precision (HP) track. In this track, the goal of the user is to find 15 relevant documents to a query within 5 minutes. This is obviously a nice evaluation testbed for user efficient retrieval. We used the TRUESmart GUI and incorporated the automatic Boolean filters described above into some of our Smart retrievals.</Paragraph>
      <Paragraph position="1"> Only preliminary results are available now and once again Cornell/SablR did very well. All 3 of our users did substantially better than the median. One interesting point is that all 3 users are within 1% of each other: The same 3 users participated in the TREC 6 HP track last year with much more varied results. Last year, the hardware speed and choice of query length were different between the users. We attempted to equalize these factors this year. The basically identical results suggest (but the sample is much too small to prove) that our general approach is reasonably user-training independent. The major activity of the user is judging documents, a task for which all users are presumably qualified. The results are bounded by user agreement with the official relevance judgements, and the closeness of the results may indicate we are approaching that upper-bound.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="117" end_page="117" type="metho">
    <SectionTitle>
5 CONTEXT-DEPENDENT
SUMMARIZATION
</SectionTitle>
    <Paragraph position="0"> Another application area considered to improve end-user efficiency is reduction of the text of the documents themselves. Longer documents contain a lot of text that may not be of interest to the end-user; techniques that reduce the amount of this text will improve the speed at which the end-user can lind the useful material. This type of summarization differs from our previous work in that the document summaries are produced within the context of a query. This is done by I. expanding the vocabulary of the query by related words using both a standard Smart cooccurrence based expansion process, and the output of the standard Smart adhoc relevance feedback expansion process; null 2. weighting the expanded vocabulary by importance to the query; and 3. performing the Smart summarization using only the weighted expanded vocabulary.</Paragraph>
    <Paragraph position="1"> We participated in both the TIPSTER dry run and the SUMMAC evaluations of summarization. Once again we did very well, finishing within the top 2 groups for the SUMMAC adhoc, categorization, and QandA tasks. Interestingly, the top 3 groups for the QandA task all used Smart for their extraction-based summaries.</Paragraph>
    <Paragraph position="2"> Using the summ_eval evaluation tool on the SUMMAC QandA task, we are continuing our investigations into length versus effectiveness, particularly when comparing summaries based on extracting sentences as opposed to paragraphs. As expected, the longer the summary in comparison with the original document, the more effective the summary. For most evaluation measures, the relationship appears to be linear except at the extremes. For short summaries, sentences are more effective than paragraphs. This is expected; the granularity of paragraphs makes it tough to fit in entire good paragraphs. However, the reverse seems to be true for longer summaries, at least for us at our current level of summarization expertise. The paragraphs tend to include related sentences that individually do not seem to use the particular vocabulary our matching algorithms desire. This suggests that work on coreference becomes particularly crucial when working with sentence based summaries.</Paragraph>
    <Paragraph position="3"> Multi-Document Summarization. Our current work includes extending context-dependent summarization techniques for use in multi-document, rather than single-document, summarization. Our work on duplicate information detection will also be critical for creating these more complicated summaries. We have no results to report for multi-document summarization at this time.</Paragraph>
  </Section>
  <Section position="9" start_page="117" end_page="118" type="metho">
    <SectionTitle>
6 DUPLICATE INFORMATION
DETECTION
</SectionTitle>
    <Paragraph position="0"> Users easily become frustrated when information is duplicated among the set of retrieved documents. This is especially a problem when users search text collections that have been created from several distinct sources: a newswire source may have several reports of the same incident, each of which may vary insignificantly. If we can ensure that a user does not see large quantities of duplicate information then the user time efficiency will be improved.</Paragraph>
    <Paragraph position="1">  their similarity is above a predefined threshold.</Paragraph>
    <Paragraph position="2"> Exact duplicate documents are very easy to detect by any number of techniques. Documents for which the basic content is exactly the same, but differ in document meta-data like Message ID or Time of Message, are also easy to detect by several techniques. We propose to compute a cosine similarity function between all retrieved documents. Pairs of documents with a similarity of 1.0 will be identical as far as indexable content terms.</Paragraph>
    <Paragraph position="3"> The interesting research question is how to examine document pairs that are obviously highly related, but do not contain exactly the same terms or vocabulary as each other. For this, document-document maps are constructed between all retrieved documents which are of sufficient similarity to each other. These maps (see Figure 5) show a link between paragraphs of one document and paragraphs of the other if the similarity between the paragraphs is sufficiently strong. If all of the paragraphs of a document are strongly linked to paragraphs of a second document, then the content of the first document may be subsumed by the content of the second document. If there are unlinked paragraphs of a document, then those paragraphs contain new material that should be emphasized when the document is shown to the user.</Paragraph>
    <Paragraph position="4"> The structure of the document maps is an additional important feature to be used to indicate the type of relationship between the documents: is one document an expansion of another, or are they equivalent paraphrases of each other, or is one a summary document that includes the common topic as well as other topics. All of this information can be used to decide which document to initially show the user.</Paragraph>
    <Paragraph position="5"> Document-document maps can be created presently within the Smart system, though they have not been used in the past for detection of duplicate content \[2, 17, 18\]. Figure 5 gives such a document-document map between two newswire reports, one a fuller version of the other.</Paragraph>
  </Section>
class="xml-element"></Paper>