<?xml version="1.0" standalone="yes"?> <Paper uid="X96-1003"> <Title>EVALUATION DRIVEN RESEARCH: The Foundation of the TIPSTER Text Program</Title> <Section position="3" start_page="16" end_page="20" type="intro"> <SectionTitle> EVALUATION DRIVEN RESEARCH: TIPSTER Style </SectionTitle> <Paragraph position="0"> So how well has TIPSTER adhered to the Evaluation Driven Research Paradigm as described in the preceding section? My assessment, in a phrase, is very well. Unfortunately a detailed response is beyond the scope of this paper, since the full answer lies in the collective papers contained in the Proceedings of the TIPSTER Text Program (Phase I), the Proceedings of each of the recent Message Understanding Conferences (MUC) and Text Retrieval Conferences (TREC), and the rest of this Proceedings for Phase II. My objective for the remainder of this paper is therefore to give a high-level summary response for each paradigm component and, in the process, to offer a perspective with which you can read and interpret these individual papers.</Paragraph> <Paragraph position="1"> Components of the Evaluation Driven Research Paradigm: During the early TIPSTER planning meetings a large number of important, yet diverse, text handling, processing, and exploitation requirements surfaced. To make matters worse, each of these requirements took on many different forms when we took into account specific applications. Early on we opted to focus the TIPSTER Program on two core problems which seemed to be central to a large number of different operational problems.</Paragraph> <Paragraph position="2"> These two enabling technology areas are now well known and closely associated with the TIPSTER Program: Document Detection and Information Extraction. In Phase I the research goal was to significantly push the state of the art in both fields using multiple, different technical approaches. In Phase II the research goals shifted. The main focus was placed on investigating ways in which the two separate technology areas of document detection and information extraction could synergistically interact within a single, modular TIPSTER system architecture, on developing and deploying operational prototypes based upon the most promising TIPSTER algorithms, and on continuing to advance the overall performance of the best TIPSTER algorithms. In Phase III, we will add a third enabling technology area, text summarization, while continuing to pursue natural extensions of these Phase II goals.</Paragraph> <Paragraph position="3"> * A series of specific tasks which, when successfully accomplished, would move the R&D community significantly closer to the program's final objective.</Paragraph> <Paragraph position="4"> The manner in which the TIPSTER Program has incorporated this component is most easily seen in the design of the multiple tasks that underwrote Phase I. Our evaluation of the pre-TIPSTER state-of-the-art in document detection systems concluded that: * Systems were highly tailored to specific applications. As a result, system designs were highly "stove-piped".
System portability was virtually non-existent.</Paragraph> <Paragraph position="5"> * Systems failed "hard" when they encountered previously unseen vocabulary, linguistic structures, formats, etc.</Paragraph> <Paragraph position="6"> * Practical applications were limited to highly constrained domains with high enough priority to warrant the development expense associated with a highly tailored system solution. In response to these conclusions, Phase I of TIPSTER established multiple, inter-related tasks. All participants were required to demonstrate language portability by performing the same basic tasks in both English and Japanese, and system robustness by successfully handling and processing text documents which contained ungrammatical usage, garbles, new words, and unfamiliar structures.</Paragraph> <Paragraph position="7"> In addition, the document detection participants were required to perform both routing and ad hoc retrieval tasks; to automatically convert detailed, lengthy natural language information need statements covering a wide range of topics into system-specific queries without human intervention; to return relevant documents in priority order based upon each document's perceived degree of relevance; to highlight the most relevant passages within these retrieved documents; and to perform all of these tasks on large (now over 1 million documents and multiple gigabytes), heterogeneous, complex document collections.</Paragraph> <Paragraph position="8"> Similarly, the information extraction participants were additionally required to automatically locate, identify, and standardize information contained in newspaper-style documents within two distinct subject domains: the formation of business joint ventures and microelectronic chip fabrication. This entire extraction task was significantly more difficult than previous extraction tasks when measured along several dimensions (i.e., text corpus complexity, text corpus size, template fill complexity, and the overall nature of the task).</Paragraph> <Paragraph position="9"> One of the most challenging information extraction tasks first articulated during Phase I (namely, system extensibility by analyst end-users) has still not been completely satisfied. Extraction systems are still best extended and modified by the system developers themselves or by individuals who have received significant training. * An agreed-upon and specifically tailored metric and evaluation methodology for periodically measuring progress towards accomplishing each of the chosen tasks.</Paragraph> <Paragraph position="10"> Frequent formal metric-based evaluations have been a hallmark of the TIPSTER Text Program. The relevant evaluations are only highlighted in the following paragraphs. Each of these evaluations has been reported on in detail in the Proceedings of the TIPSTER Text Program (Phase I), in this Proceedings for Phase II, or in the separately published Proceedings for the Message Understanding Conferences (MUC-3 to MUC-6) and Text Retrieval Conferences (TREC-1 to TREC-4). A reader wanting additional details is directed to one or more of these references.</Paragraph> <Paragraph position="11"> During Phase I, all TIPSTER participants were formally evaluated shortly before the 12-, 18-, and 24-month Workshops.</Paragraph> <Paragraph position="12"> In addition, the TIPSTER Text Program established close ties with the Message Understanding Conference (MUC) beginning with MUC-3.
All of the TIPSTER Information Extraction contractors were required to participate in MUC-4, where the subject domain consisted of news reports on terrorism events. MUC-5 coincided with the TIPSTER Phase I 24-month evaluation and consisted of the same information extraction tasks that had been assigned to the Phase I participants (formation of business joint ventures and microelectronic chip fabrication, each domain in two languages, English and Japanese). The non-TIPSTER MUC-5 participants could choose which of the 4 domain-language pairs they wished to be evaluated against. In November 1995, a redesigned MUC-6 was held in which each participant could choose to be evaluated in one or more of the following tasks: a named entity task, a template element task, a scenario template task, and a co-reference task. All four of these tasks were done using English source texts. In May 1996, TIPSTER sponsored a new information extraction evaluation program, the Multilingual Evaluation Task (MET). In MET, the participants performed the MUC-6 named entity task in one or more of the following foreign languages: Spanish, Chinese, and Japanese.</Paragraph> <Paragraph position="13"> Early in Phase I of the TIPSTER Text Program, the decision was made to establish a companion evaluation program based initially on the TIPSTER Phase I document detection tasks. This companion evaluation program became known as the Text Retrieval Conference (TREC). To date, four TRECs have been held and the fifth is currently in progress. During TREC-1 to TREC-3, each participant was evaluated against both a routing task and an ad hoc retrieval task, each consisting of 50 test cases. Beginning with TREC-4, several additional specialty subtasks (referred to within TREC as Tracks) were added. These included a multiple database merging track, a confusion track to examine the effect of corrupted data, a multilingual track to examine retrieval of Spanish language documents, an interactive track, and a filtering track. These TREC Tracks are being continued in TREC-5.</Paragraph> <Paragraph position="14"> The major addition here is that the retrieval of Chinese language documents has been added to the multilingual track.</Paragraph> <Paragraph position="15"> As part of TIPSTER Phase III, the TIPSTER R&D investigations will be expanded into the field of text summarization. Planning is already underway to determine an appropriate metric-based evaluation strategy for text summarization.</Paragraph> <Paragraph position="16"> The impact of the TIPSTER Text Program metric-based evaluations can be readily seen from the single statistic that over 100 institutions have already participated in either a TIPSTER Text Program internal evaluation or one or more of the MUC, MET, and TREC evaluation programs. In fact, a significant majority of these institutions have participated at least twice, and many have participated with even greater frequency.</Paragraph> <Paragraph position="17"> * Sufficient quantities of training and testing data. Each data collection should be carefully selected, formatted, annotated, and otherwise prepared to directly support a specific task.</Paragraph> <Paragraph position="18"> The thirteen different formal metric-based evaluations conducted variously under the banners of the TIPSTER Text Program Phase I (3), MUC (4), MET (1), and TREC (5) could not have been executed without sufficient quantities of training and testing data.
The collection, annotation, tagging, and formatting of the base document collections, along with the creation of the appropriate answer keys to support each separate evaluation program, has been a costly, time consuming, human analyst intensive process. The bulk of these data preparation tasks was concentrated in Phase I, but additional data preparation efforts to support MUC, MET, and TREC have continued, as needed, since the completion of Phase I in 1993. The performance of human analysts in completing these tasks has been routinely measured and has subsequently been used as a benchmark against which the performance of the information extraction and document detection algorithms can be compared.</Paragraph> <Paragraph position="19"> As indicated earlier in this paper, the optimal situation is one in which the data collection effort is 100% completed prior to the start of the associated research task. This did not happen during TIPSTER Phase I. The collection, formatting, and preparation of appropriate document databases, and the creation of topic statements and pooled relevance judgments to support the document detection research tasks and of complex scenario templates, detailed fill rule descriptions, and appropriate answer keys to support the information extraction research task, turned out to be a monumental undertaking. These data preparation tasks in both areas were several orders of magnitude greater than previous efforts. The TIPSTER government sponsors did not fully appreciate this fact until the data collection efforts were underway. We soon found ourselves in the situation where TIPSTER Phase I Program execution and data preparation were occurring simultaneously. It quickly proved very difficult, particularly on the information extraction side, to maintain sufficient training and testing data throughput while at the same time maintaining high data consistency. While the job was eventually completed, it was only through the tireless and sometimes even heroic efforts of a small number of highly motivated and dedicated government researchers that this data preparation effort was brought to a successful conclusion in Phase I. To say the least, this is not a recommended mode of operation.</Paragraph> <Paragraph position="20"> Again, all of these TIPSTER data development activities have been previously reported on in the Proceedings associated with each of the evaluation programs identified earlier. The interested reader is directed to these sources for additional information and details.</Paragraph> <Paragraph position="21"> * A group of several (in fact, the more the merrier) leading-edge research institutions that are willing to participate in a cooperative, corporate program.</Paragraph> <Paragraph position="22"> The cooperativeness and corporateness of the TIPSTER Text Program participants have been repeatedly demonstrated in a wide variety of ways. A few examples are listed below to demonstrate the degree to which this statement has been borne out. * One participant in the Document Detection component of TIPSTER has participated in all three TIPSTER Phase I evaluations and in TREC-1 to TREC-4, and is currently participating in TREC-5. Likewise, one participant in the Information Extraction component has participated in all three TIPSTER Phase I evaluations, MUC-3 to MUC-6, and MET.
A number of other participants come close to matching these participation levels.</Paragraph> <Paragraph position="23"> * Throughout the entire TIPSTER Text Program all of the contractors have willingly shared data files and software modules with the other participants. This clearly allowed the collective program to cover more ground and to move forward faster.</Paragraph> <Paragraph position="24"> * Since its beginning the TIPSTER Text Program has held technical workshops at 6-month intervals. The Phase II 24-month Workshop was the 10th such workshop. A portion of each workshop has been devoted to each contractor describing the technical details of their underlying algorithms and approaches, the results of their internally conducted evaluations and experiments, as well as their successes and failures on the TIPSTER-sponsored formal evaluations. The openness of these presentations has always been highly commendable. To the degree that time permits, the same openness has been evident during each MUC, MET, and TREC.</Paragraph> <Paragraph position="25"> The importance of these forums and open discussions has been repeatedly demonstrated.</Paragraph> <Paragraph position="26"> A report outlining the details of successfully implemented techniques and approaches is made at one workshop by a single participant.</Paragraph> <Paragraph position="27"> Inevitably, at the next workshop, reports are given by several other participants concerning how they were able to successfully and beneficially incorporate these new ideas into their own systems. In this way, a single success has been quickly multiplied.</Paragraph> <Paragraph position="28"> * Establishing and maintaining a cooperative, corporate viewpoint among the program's external participants is made considerably easier if it is evident that a similar cooperative and corporate viewpoint is being regularly demonstrated by the Government sponsors. Over the past seven years a unique bonding chemistry has developed among the large number of Government personnel who have had an active hand in the TIPSTER Program. Since October 1993 the introductory briefing of the TIPSTER Text Program has regularly been given as a joint briefing by Dr. Sarah Taylor of the Office of Research and Development and myself. This briefing has frequently been opened with the observation that "Multiple agencies have been working closely together on this Program since 1989. Why, in the process, we've even become friends." The line usually sparks a snicker or two, because those in the audience seem to know that previous joint programs between these Agencies have not always been so amicable. Almost from day one, there has been an underlying current of give and take, of teamwork, of consensus building. This atmosphere has proven to be quite contagious as new Government participants have joined the TIPSTER Program team, and it has clearly rubbed off onto the other TIPSTER participants.</Paragraph> <Paragraph position="29"> * In the Spring of 1994 the TIPSTER Text Program was nominated by the Community Management Staff as a "Reinvention Laboratory" in recognition of "its teamwork, its customer focus, and the fact that it has broken down existing bureaucratic barriers." Then in March 1996 Vice President Gore presented the National Performance Review Hammer Award to the TIPSTER Text Program for its role in the reinvention of government.
In his remarks, the Vice President lauded the TIPSTER Program's teamwork in spanning the Intelligence Community and partnering with the private sector and leading universities.</Paragraph> <Paragraph position="30"> * Sufficient government funding to cover the cost of all aspects of the Evaluation Driven Research Paradigm.</Paragraph> <Paragraph position="31"> From its inception the TIPSTER Text Program has been a jointly planned, funded, and managed program. It is unlikely that any of the individual participating Agencies could have started and sustained a program of this magnitude by itself. In addition to the three principal funding agencies, additional funds were obtained from a variety of other sources at critical junctures in the program. The most notable example came from the Congressionally funded Dual Use Technology Program, which provided over $5 million in supplemental funds in early 1992, about a quarter of the way through Phase I. This infusion of funds helped raise the TIPSTER Program to a higher level, ensured that its extensive program to collect and prepare sufficient quantities of training and testing data could be completed as planned and at the desired level of quality, and provided the impetus for the TIPSTER Text Program to undertake the development of its first operational prototype system based upon TIPSTER technology (i.e., the HOOKAH Project at the Drug Enforcement Administration). * The implementation of the TIPSTER Phase II Architecture Demonstration System required extensive, detailed coordination among all seven of the TIPSTER Phase II contractors.</Paragraph> <Paragraph position="32"> The timetable which was established for completion of this effort was extremely tight.</Paragraph> <Paragraph position="33"> Any single contractor who chose to drag his or her feet or not fully and openly participate would have put the completion of the whole effort in serious jeopardy. This did not happen, and as a result the TIPSTER Text Program Phase II 12-month Workshop was treated to several demonstrations of this working prototype system built in compliance with the specifications of the TIPSTER Architecture.</Paragraph> </Section></Paper>