<?xml version="1.0" standalone="yes"?>
<Paper uid="X93-1014">
  <Title>ABCD + E A A andB A andB A,B,C,D A,B,C,D B C</Title>
  <Section position="3" start_page="0" end_page="135" type="intro">
    <SectionTitle>
2. DOCUMENT CORPORA
</SectionTitle>
    <Paragraph position="0"> Four language-domain pairs were used in the TIPSTER exercise, abbreviated as EJV, JJV, EME, and JME to reflect the language (English or Japanese) and the domain (Joint Ventures or MicroElectronics). Each of the four language-domain pairs has an associated set of 1200 to 1600 documents (a corpus), divided into a development set and multiple test sets. During the course of the TIPSTER program, up to three test sets were prepared for each language-domain pair, in addition to approximately 1000 development set documents for each corpus. These test sets, which were used for the TIPSTER 12-, 18-, and 24-month evaluations, ranged from 50 to 300 documents each.</Paragraph>
    <Paragraph position="1"> For MUC-5, the first test set was added to the development corpus, the second test set was used for the MUC-5 dry run, and the third test set was used for the MUC-5 evaluation.</Paragraph>
    <Paragraph position="2"> Randomly selected from the overall pool of documents, the test sets reflect a distribution of sources, relevancy, and other document attributes similar to that of the development sets.</Paragraph>
    <Paragraph position="3"> There are a few exceptions, e.g., the first EJV test set does not contain documents from one of the sources added to the development and subsequent test sets.</Paragraph>
    <Paragraph position="4"> These corpora consist of documents from a variety of newswire or newspaper sources, selected by a combination of automatic retrieval and manual filtering techniques. For example, the EJV corpus was retrieved from three text data sources (LEXIS/NEXIS, PROMT, and Wall Street Journal from ACL/DCI or TIPSTER Detection database CD-ROMs) by using traditional keyword-based document retrieval systems. The keywords for EJV included such stems as joint venture, joint, venture, tie-up, collaborate, and cooperate. Though the majority of the documents were pulled by the keyword method, additional candidates were retrieved by randomly browsing through the corpus sources and identifying documents which appeared to be relevant.</Paragraph>
    <Paragraph position="5"> After a large pool of candidate documents was retrieved, these documents were manually scanned and separated into two groups: relevant or irrelevant. In order to test whether the Information Extraction systems were able to discriminate between relevant and irrelevant documents, the four corpora were then seeded with a certain number of irrelevant documents. The percentage of irrelevant documents functioning as &quot;distractors&quot; ranges from about 5% (for English Joint Ventures) to 30% (for Japanese Microelectronics). By comparison, the corpora used for previous MUCs used up to 50% irrelevant documents, stressing the document detection aspect of the task more strenuously than in TIPSTER/MUC-5.</Paragraph>
    <Paragraph position="6"> The 200+ different sources used to build the English-language corpora include the Wall Street Journal, Jiji Press, New York Times, Financial Times, Kyodo News Service, and a variety of technical publications in fields such as communications, airline transportation, rubber &amp; plastics, and food marketing. The Japanese-language sources used for the Japanese corpora include Asahi, Nikkei, and Yomiuri.</Paragraph>
    <Paragraph position="7"> Each document in the four development corpora has an associated filled-in template (see appendices to &quot;Tasks, Domains and Languages for Information Extraction&quot; in this volume), representing the correct template or &quot;answer key&quot; for that document. The development corpora along with their associated templates were provided to the program participants during the course of the program.</Paragraph>
  </Section>
</Paper>