File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/92/h92-1043_concl.xml
Size: 2,895 bytes
Last Modified: 2025-10-06 13:56:51
<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1043"> <Title>Test Sets</Title> <Section position="10" start_page="227" end_page="227" type="concl"> <SectionTitle> CONCLUSIONS </SectionTitle> <Paragraph position="0"> The Relevancy Signatures Algorithm was inspired by the fact that human readers are capable of scanning a collection of texts and reliably identifying a subset of those texts that are relevant to a given domain. More importantly, this classification can be accomplished by fast text skimming: the reader hits on a key sentence and a determination of relevancy is made. This method is not adequate if one's goal is to identify all possible relevant texts, but text skimming can be very reliable when a proper subset of relevant texts is sufficient. We designed the Relevancy Signatures Algorithm in an effort to simulate this process.</Paragraph> <Paragraph position="1"> Footnote 5: During the course of this research, we found that about 4% of the irrelevant texts in the MUC-3 development corpus were miscategorized. These errors were uncovered by spot checks: no systematic effort was made to review all the irrelevant texts. We therefore suspect that the actual error rate is probably much higher.</Paragraph> <Paragraph position="2"> In fact, the Relevancy Signatures Algorithm has an advantage over humans insofar as it can automatically derive domain specifications from a set of training texts.</Paragraph> <Paragraph position="3"> While humans rely on domain knowledge, explicit domain guidelines, and general world knowledge to identify relevant texts, the Relevancy Signatures Algorithm requires no explicit domain specification. Given a corpus of texts tagged for domain relevancy, an appropriate dictionary, and suitable natural language processing capabilities, reliable relevancy indicators are extracted from the corpus as a simple side effect of natural language analysis. 
Once this training base has been obtained, no additional capabilities are needed to classify a new text.</Paragraph> <Paragraph position="4"> It follows that the Relevancy Signatures Algorithm avoids the knowledge-engineering bottleneck associated with many text analysis systems. As a result, this algorithm can be easily ported to new domains and is trivial to scale up.</Paragraph> <Paragraph position="5"> With large online text corpora becoming increasingly available to natural language researchers, we have an opportunity to explore operational alternatives to hand-coded knowledge bases and rule bases. As we have demonstrated, natural language processing capabilities can produce domain signatures for representative text corpora that support high-precision text classification. We have obtained high degrees of precision for limited levels of recall, in an effort to simulate human capabilities with a domain-independent discrimination technique.</Paragraph> </Section> </Paper>