<?xml version="1.0" standalone="yes"?> <Paper uid="X96-1002"> <Title>FOCUS OF TIPSTER PHASES I and II</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> FOCUS OF TIPSTER PHASES I and II </SectionTitle> <Paragraph position="0"> The TIPSTER Program began in June 1989 just after the conclusion of the second Message Understanding Conference (MUC-2). The concept that the Advanced Research Projects Agency (ARPA) put forward at that time was based on the promising results of that conference and on the belief that the technologies being demonstrated at MUC for the automated handling of large volumes of text would be of great benefit to a variety of Government agencies.</Paragraph> <Paragraph position="1"> Following a series of meetings, agreement was reached for sharing the planning, funding, and execution of the program. Several major concepts were also outlined and accepted as central to TIPSTER's progress. Initially two phases were planned: two years of research and development into advanced algorithms, followed by two years of development of prototype/demonstration systems.</Paragraph> <Paragraph position="2"> Within these two phases, there would be a separate focus on detection (retrieval and routing) and on extraction (understanding). Portability with regard to domain and language would be emphasized; evaluation of complete systems would be encouraged and scheduled periodically; and, as part of the baseline for these evaluations, the Government would develop large corpora for the training and testing of corpus-based techniques as well as for system development and evaluation.
Proposals were solicited in June of 1990, and eventually three contractors were selected to investigate different approaches to detection and another three were selected for extraction research.</Paragraph> <Paragraph position="3"> The Pre-Existing State Of The Art The two language technologies which ARPA initially hoped to advance through the TIPSTER Program were Document Detection and Information Extraction.</Paragraph> <Paragraph position="4"> Document Detection includes both routing, which involves running static queries against a stream of new data, and retrieval, which involves running ad hoc queries against archival data. Prior to TIPSTER, most analysts faced with finding essential information in large volumes of data used search systems based on Boolean keywords. These systems had been developed more than a decade earlier and typically had the following characteristics: * the user loses an unknown quantity of useful information because the system is unable to retrieve many of the relevant documents (= low recall); * the user must read a very large number of irrelevant documents which the system mistakenly retrieved (= low precision); * the user must scan the entire list of retrieved documents because a good document is just as likely to be at the end of the list as at the beginning (= no ranking or prioritization); * the user must generate by hand all variant spellings or alternate word choices because there are no built-in rules for adding variants; * the user has to understand how the system works and the syntax of queries in order to use the system (= hand built queries).</Paragraph> <Paragraph position="5"> Information Extraction is a technology in which pre-specified types of information are located within free text, extracted, and placed within a database.
Notwithstanding prior ARPA and commercial support for the development of information extraction technology and MUC's positive impact, information extraction had been applied to the database update task as largely a manual procedure. This procedure was costly in terms of labor commitment and the training required, and there was wide variance in the accuracy and consistency of the database content. The techniques typically only worked for English, and were difficult to port to new domains or even to extend within the current domains. Although a few automated information extraction systems had been deployed, they tended to be expensive both in terms of development and ongoing maintenance and were task specific with little reusability.</Paragraph> <Paragraph position="6"> Review of Phase I The TIPSTER Program officially began at a kickoff workshop held in September of 1991. The government reviewed the framework, objectives and plans for the following two years of work. The contractors described their specific approaches to detection and extraction and laid the groundwork for the future sharing of ideas and of software and data resources. The workshop included parallel working sessions for discussion of specific issues in the different areas of research, including details for addressing the different domains and different languages and the government's preparation of the data.</Paragraph> <Paragraph position="7"> The workshops were repeated at 6-month intervals for the duration of Phase I. Selected researchers from other ARPA Human Language Technology (HLT) programs were also invited. In connection with each of the workshops, uniform evaluations of system performance were conducted and were reported during the meetings. There were frequent exchanges of information among the government and contractors (with heavy use of email) and, by the end of the two years, a sizable catalog of shareable resources had been developed.
Both the Message Understanding Conferences (MUCs), which preceded TIPSTER, and the Text REtrieval Conferences (TRECs) evaluated the state of the art and provided a major additional benefit in promoting text-processing research and development outside of the TIPSTER Text contracts, since they were organized by NRaD and NIST and advertised to a wider community. All of the TIPSTER Text contractors were required to participate in MUC or TREC, and MUC-5 and TREC-2, using TIPSTER evaluation techniques, were intentionally timed to coincide with the end of Phase I (the 24-month workshop) so as to provide a measure of the state of the art and identify good performers.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> TIPSTER Phase II </SectionTitle> <Paragraph position="0"> During the last year of Phase I, the Government began planning Phase II. Scenarios were developed to indicate the variety of actual applications of the systems to be developed. A two-tiered program of (1) continued algorithm development and (2) transfer of technology into demonstration projects was defined. The successful concepts of Phase I were continued, with close cooperation among the government agencies and the contractors; regular workshops; and corpora for development and testing. Based on a variety of lessons learned and encouraged by the results of TIPSTER Phase I, a four part program began to informally take shape. While continuing its traditional focus on advanced research and metrics-based evaluation, the need for a supporting architecture was recognized, along with the realization that many of the techniques developed were now sufficiently stable to be applied in an operational environment as demonstration projects.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. Research </SectionTitle> <Paragraph position="0"> Document Detection.
The TIPSTER Phase II prototype systems gave the user Document Detection tools which feature the algorithms and technology developed in Phase I. Recall has improved (more of the relevant documents are retrieved), and so has precision (the user reads fewer useless documents in finding the ones he wants). Moreover, the system, not the user, can now automatically expand the queries to draw in more relevant documents (using concept based tools such as thesauri, or using a natural language description of the subject supplied by the user), and the documents are statistically ranked according to how well they match the query, thus improving the chances that the most useful documents will be near the top of the queue.</Paragraph> <Paragraph position="1"> Information Extraction. As a result of algorithm development in Phase I, TIPSTER Phase II prototype systems were built with the following characteristics:</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. Architecture Concept </SectionTitle> <Paragraph position="0"> The design of the Phase I systems and analysis of the scenarios indicated the complementary nature of detection and extraction operations and the desirability of supporting both capabilities within a single system. There also appeared to be many similar modules in the diverse systems. From this, it was determined that an initial activity of Phase II would be the development of a common, open software architecture for the implementation of text-processing systems. This architecture was to facilitate sharing of the development tasks, transferring technology to actual applications, future R&D into improved algorithms, and continuous upgrading of systems which use the architecture.
The architecture stresses functional and knowledge-based modularity and uses an SGML-like language for tagging text transferred between the modules.</Paragraph> <Paragraph position="1"> This architecture was developed as part of Phase II R&D through the cooperative efforts of multiple contractors, coordinated by an independent Systems Engineering / Configuration Management contractor. The R&D included improvement of algorithms and research into combining the results of the application of diverse extraction and detection techniques.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. Continuing Evaluation </SectionTitle> <Paragraph position="0"> During Phase II, TIPSTER became primary sponsor for both the Message Understanding Conferences and the Text Retrieval Conferences, based on the belief that these forums for evaluation of text-processing technologies are essential to continued success in TIPSTER research and development.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. Demonstration Projects </SectionTitle> <Paragraph position="0"> A Broad Agency Announcement soliciting proposals for participation in Phase II R&D was issued in August 1993. Bidders were judged based on their ideas for research as well as their being potential sources for the demonstration projects.</Paragraph> <Paragraph position="1"> Individual agencies issued separate Requests for Proposals for each such project. For each project, a demonstration system based on the architecture and modules developed in the R&D tier was developed, installed and evaluated in the processing of actual &quot;operational&quot; data.
Needs for architecture and algorithm improvements or additional research were fed back to the R&D projects.</Paragraph> <Paragraph position="2"> A Successful Partnership The continued close cooperation of multiple government organizations in formulating and implementing the TIPSTER program has been a major ingredient in its success. By regularly providing a forum for discussion, the Program has also fostered cooperation among an ever-expanding group of academic institutions and industry vendors, who have shared ideas and resources while pursuing different approaches to the problems of text processing. The use of these technologies will undoubtedly expand well beyond the prototypes and operational systems built during TIPSTER Phase II for a small number of Government agencies, as the world at large recognizes the need for document detection and information extraction.</Paragraph> <Paragraph position="3"> Conclusion of Phase II The 24-Month Workshop which concluded TIPSTER Phase II brought together a large number of researchers and developers to discuss their results and describe their progress since 1994 and to present their findings to a variety of potential customers. The papers summarizing the efforts sponsored by the TIPSTER Program are included in this volume.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> Acknowledgments </SectionTitle> <Paragraph position="0"> Portions of the information above were originally in papers prepared by Roberta H. Merchant and Thomas H. Crystal and published in the</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> TIPSTER Phase I Proceedings.
</SectionTitle> <Paragraph position="0"> We also wish to thank Wiley Harris and Sharon Kaufmann for their suggestions and assistance in assembling and organizing all of the materials presented in this summary book for</Paragraph> </Section> <Section position="9" start_page="0" end_page="11" type="metho"> <SectionTitle> TIPSTER Phase II. </SectionTitle> <Paragraph position="0"/> </Section> </Paper>