<?xml version="1.0" standalone="yes"?>
<Paper uid="W93-0303">
  <Title>Document Filtering Using Semantic Information from a Machine Readable Dictionary</Title>
  <Section position="7" start_page="26" end_page="27" type="concl">
    <SectionTitle>
7. Conclusions
</SectionTitle>
    <Paragraph position="0"> We find these results promising for a preliminary full implementation and test of the SFCoder as a means of semantically representing the content of texts in order to delimit a document set with a high likelihood of containing all documents relevant to an individual query. In a large operational system, the ability to filter out 61% of an incoming flow of millions of documents if the SFCoder alone is used, or 72% of the documents if the SFCoder + Proper Noun Interpreter is used, will have a significant impact on the system's performance.</Paragraph>
    <Paragraph position="1"> In addition, we have been experimenting with the SFC vectors as a means of automatically clustering documents in an established database (Liddy, Paik &amp; Woelfel, 1992). To do this, the document vectors are clustered using Ward's agglomerative clustering algorithm (Ward, 1963) to form classes in a document database. For ad hoc retrieval, query SFC vectors are matched to the SFC vector of each cluster centroid in the database. Clusters whose centroid vectors exhibit a predetermined similarity to the query SFC vector are either presented to the user as a semantically cohesive cluster on which to begin preliminary browsing or passed on to other system components for further processing. A qualitative analysis of the clusters produced in this manner revealed that the use of SFCs combined with Ward's clustering algorithm resulted in meaningful groupings of documents that were similar across concepts not directly encoded in SFCs. Browsers find that documents seem to fit naturally into the cluster to which they are assigned by the system.</Paragraph>
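    <Paragraph> The centroid-matching step described above can be illustrated with a minimal sketch. It assumes the clusters have already been formed (for example by Ward's algorithm) and that similarity is measured by cosine similarity; the similarity measure, the threshold value, the three-dimensional vectors, and the names centroid, cosine, and match_clusters are all hypothetical choices made for illustration and are not taken from the DR-LINK System.
<![CDATA[
```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_clusters(query, clusters, threshold):
    """Return (cluster_id, similarity) pairs at or above threshold, best first."""
    scored = [(cid, cosine(query, centroid(docs))) for cid, docs in clusters.items()]
    hits = [(cid, sim) for cid, sim in scored if sim >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical SFC vectors: one weight per subject field (three fields here).
clusters = {
    "finance": [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]],
    "medicine": [[0.0, 0.2, 0.8], [0.1, 0.1, 0.8]],
}
query = [0.8, 0.2, 0.0]

# Assumed threshold; the paper only says the similarity is "predetermined".
matches = match_clusters(query, clusters, threshold=0.9)
print(matches)
```
]]>
    In this toy example only the "finance" cluster passes the threshold, so it alone would be offered for browsing or passed on to later components.</Paragraph>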
    <Paragraph position="2"> Beyond its uses within the DR-LINK System, the Subject Field Coder has general applicability as a pre-filter for a wide range of other systems. The only adjustment required would be a recomputation of the correlation matrix based on each new corpus. The recomputation is necessary because different corpora represent different domains, and the tendency of SFCs to correlate with other SFCs will vary somewhat from domain to domain. We have used the SFC filter on various corpora and have quickly recomputed a matrix for each.</Paragraph>
    <Paragraph position="3"> Reiterating the opening argument of this paper, we believe that the current situation in information retrieval could be dealt with effectively by treating document retrieval as a multi-stage process in which the first modules of a system filter out those texts with no real likelihood of matching a user's need. The filtering approach is particularly promising for systems that perform a more conceptual style of representation, which is computationally very expensive if applied to all documents regardless of how likely they are to be relevant.</Paragraph>
    <Paragraph position="4"> Acknowledgments: We wish to thank Longman Group, Ltd. for making the machine readable version of LDOCE, 2nd Edition available to us and BBN for making POST available for our use on the DR-LINK Project.</Paragraph>
  </Section>
</Paper>