File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/00/w00-0102_abstr.xml

Size: 2,958 bytes

Last Modified: 2025-10-06 13:41:42

<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0102">
  <Title>Using Long Runs as Predictors of Semantic Coherence in a Partial Document Retrieval System</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> We propose a method for dealing with semantic complexities occurring in information retrieval systems on the basis of linguistic observations. Our method follows from an analysis indicating that long runs of content words appear in a stopped document cluster, and our observation that these long runs predominantly originate from the prepositional phrase and subject complement positions and, as such, may be useful predictors of semantic coherence.</Paragraph>
    <Paragraph position="1"> From this linguistic basis, we test three statistical hypotheses over a small collection of documents from different genres. By coordinating thesaurus semantic categories (SEMCATs) of the long run words to the semantic categories of paragraphs, we conclude that for paragraphs containing both long runs and short runs, the SEMCAT weight of long runs of content words is a strong predictor of the semantic coherence of the paragraph.</Paragraph>
    <Paragraph position="2"> Introduction One of the fundamental deficiencies of current information retrieval methods is that the words searchers use to construct terms are often not the same as those by which the searched information has been indexed. There are two components to this problem, synonymy and polysemy (Deerwester et al., 1990). By the definition of polysemy, a document containing the search terms or indexed with the search terms is not necessarily relevant. Polysemy contributes heavily to poor precision. Attempts to deal with the synonymy problem have relied on intellectual or automatic term expansion, or the construction of a thesaurus.</Paragraph>
    <Paragraph position="3"> Also, the ambiguity of natural language causes semantic complexities that result in poor precision. Since queries are mostly formulated as words or phrases in a language, and the expressions of a language are ambiguous in many cases, the system must have ways to disambiguate the query.</Paragraph>
    <Paragraph position="4"> In order to resolve semantic complexities in information retrieval systems, we designed a method to incorporate semantic information into current IR systems. Our method (1) adopts widely used Semantic Information or Categories, (2) calculates Semantic Weight based on probability, and (3) (for the purpose of verifying the method) performs partial text retrieval based upon Semantic Weight or Coherence to overcome cognitive overload of the human agent. We make two basic assumptions: 1. Matching search terms to semantic categories should improve retrieval precision. 2. Long runs of content words have a linguistic basis for Semantic Weight and can also be verified statistically.</Paragraph>
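The notion of a long run of content words in a stopped document can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the tiny stop list, the rule that punctuation ends a run, the function names, and the hypothetical min_length threshold of 3 are not taken from the source.

```python
import re

# Illustrative stop list only (assumption); the source does not say which list was used.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "by", "for", "from", "in", "is",
    "of", "on", "that", "the", "to", "we", "with",
})

# A token is either a word (letters, with internal hyphens or apostrophes)
# or a single non-letter, non-space character such as punctuation.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|[^\sA-Za-z]")


def content_word_runs(text, stop_words=STOP_WORDS):
    """Return runs of consecutive content words found in text.

    A run ends at a stop word or at any non-letter token (assumption).
    """
    runs = []
    current = []
    for token in TOKEN_PATTERN.findall(text):
        word = token.lower()
        if word[0].isalpha() and word not in stop_words:
            current.append(word)
        else:
            # Boundary reached: close the run collected so far, if any.
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


def long_runs(text, min_length=3, stop_words=STOP_WORDS):
    """Keep only runs with at least min_length words (hypothetical threshold)."""
    return [run for run in content_word_runs(text, stop_words)
            if len(run) >= min_length]
```

Calling long_runs on a paragraph yields the candidate word sequences whose thesaurus categories would then be compared with those of the paragraph. The comparison itself (the SEMCAT weight) is not sketched here because this section does not define it.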
  </Section>
</Paper>