<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1043"> <Title>Test Sets</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> TEXT CLASSIFICATION </SectionTitle>
<Paragraph position="0"> Text classification is central to many information retrieval applications and is also relevant to message understanding applications in text analysis. To appreciate the importance and difficulty of this problem, consider the role that it played in the MUC-3 (Third Message Understanding Conference) performance evaluation. Last year 15 text analysis systems attempted to extract information from news articles about terrorism (Lehnert &amp; Sundheim 1991; Sundheim 1991). According to an extensive set of domain guidelines, roughly 50% of the texts in the MUC-3 development corpus did not contain legitimate information about terrorist activities. Articles that described rumours or lacked specific details were designated as irrelevant, as were descriptions of specific events that targeted military personnel and installations (a terrorist event was defined to be one in which civilians or civilian locations were the apparent or accidental targets in an intentional act of violence). In order to achieve high-precision information extraction, the MUC-3 text analyzers had to differentiate relevant and irrelevant texts without human assistance. A system with a high rate of false positives would tend to generate output for irrelevant texts, and this behavior would show up in the scores for both overgeneration and spurious event counts. An analysis of the MUC-3 evaluation suggests that all of the MUC-3 systems experienced significant difficulty with relevant text classification (Krupka et al. 1991).</Paragraph>
<Paragraph position="1"> Although some texts will inevitably require in-depth natural language understanding capabilities in order to be correctly classified, we will demonstrate that skimming techniques can be used to identify subsets of a corpus that can be classified with very high levels of precision. Our algorithm automatically derives relevancy signatures from a training corpus using selective concept extraction techniques. These signatures are then used to recognize relevant texts with a high degree of accuracy.</Paragraph>
<Paragraph position="2"> This research was supported by the Office of Naval Research, under a University Research Initiative Grant, Contract #N00014-86-K-0764, NSF Presidential Young Investigators Award NSFIST-8351863, and the Advanced Research Projects Agency of the Department of Defense, monitored by the Air Force Office of Scientific Research under Contract No. F49620-88-C-0058.</Paragraph> </Section> </Paper>