File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/99/p99-1045_intro.xml
Size: 2,656 bytes
Last Modified: 2025-10-06 14:06:57
<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1045"> <Title>Less is more: Eliminating index terms from subordinate clauses</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Efforts to exploit natural language processing (NLP) to aid information retrieval (IR) have generally involved augmenting a standard index of lexical terms with more complex terms that reflect aspects of the linguistic structure of the indexed text (Fagan 1988, Katz 1997, Arampatzis et al. 1998, Strzalkowski et al. 1998, inter alia). This paper shows that NLP can benefit information retrieval in a very different way: rather than increasing the size and complexity of an IR index, linguistic information can make it possible to store less information in the index. In particular, we demonstrate that robust NLP technology makes it possible to omit substantial portions of a text from the index without dramatically affecting precision or recall.</Paragraph> <Paragraph position="1"> This research is motivated by insights from Rhetorical Structure Theory (RST) (Mann & Thompson 1986, 1988). An RST analysis is a dependency analysis of the structure of a text, whose leaf nodes are the propositions encoded in clauses. In this structural analysis, some propositions in the text, called &quot;nuclei,&quot; are more centrally important in realizing the writer's communicative goals, while other propositions, called &quot;satellites,&quot; are less central in realizing those goals, and instead provide additional information about the nuclei in a manner consistent with the discourse relation between the nucleus and the satellite. This asymmetry has an analogue in sentence structure: main clauses tend to represent nuclei, while subordinate clauses tend to represent satellites (Matthiessen and Thompson 1988, Corston-Oliver 1998).</Paragraph> <Paragraph position="2"> From the perspective of discourse analysis, the task of information retrieval can be viewed as attempting to identify the &quot;aboutness,&quot; or global topicality, of a document in order to determine the relevance of the document as a response to a user's query. Given an RST analysis of a document, we would expect that for the purposes of predicting document relevance, information that occurs in nucleic propositions ought to be more useful than information that occurs in satellite propositions. To test this expectation, we experimented with eliminating from an IR index those terms that occurred in certain kinds of subordinate clauses.</Paragraph> </Section> class="xml-element"></Paper>