File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/96/w96-0109_intro.xml
Size: 4,139 bytes
Last Modified: 2025-10-06 14:06:09
<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0109"> <Title>EXPLOITING TEXT STRUCTURE FOR TOPIC IDENTIFICATION</Title> <Section position="2" start_page="0" end_page="101" type="intro"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> Topic identification is the problem of predicting the terms in a text that indicate its subject or theme.</Paragraph> <Paragraph position="1"> In the past, the problem has been addressed mostly by computational linguists in relation to issues such as coreference (Hobbs, 1978), anaphora resolution (Grosz and Sidner, 1986; Lappin and Leass, 1994), or discourse center (Joshi and Weinstein, 1981; Walker et al., 1994). In information retrieval, predicting important terms in a document is crucial for effective retrieval of relevant documents (Salton et al., 1993), though such terms do not necessarily correspond to the subject or the theme. Predicting important terms involves numerical weighting of the terms in a document. Terms with top weights are judged important and representative of the document.</Paragraph> <Paragraph position="2"> A spin-off of information retrieval, known as text categorization, shares a similar research interest.</Paragraph> <Paragraph position="3"> Text categorization concerns associating documents with their classification terms or categories (Lewis, 1992). Because categories in text categorization are determined beforehand to meet the user's specific tastes or needs, they may not serve as a topic or a theme: they need not have any semantic relevance to the contents of the documents.</Paragraph> <Paragraph position="4"> Technically, however, it is straightforward to move from text categorization to topic identification, provided that we can somehow isolate themes in texts and use them as categories to be assigned to those texts. 
But the problem with using text categorization for topic identification is that categories are arbitrarily given by humans, with no regard for the documents that are to be classified. There is thus always a danger of misrepresenting documents. One possible way out is to choose categories not from outside the documents but from within. The feasibility of this idea is explored in this paper.</Paragraph> <Paragraph position="5"> The use of text structure in information retrieval was motivated by the need to deal with large documents, whose breadth of vocabulary may easily mislead the retrieval system into a wrong judgement about their relevance to the query. Indeed, a new area of research known as passage retrieval has emerged to explore methods for using information from various levels of a document's structure, e.g. sentences, sections, paragraphs, and other semantically or rhetorically motivated textual units.</Paragraph> <Paragraph position="6"> Wilkinson (1994) describes weighting methods that combine the similarity measure with various textual categories such as abstract, purpose, and supplementary. Salton (1993) compares full-text retrieval with passage retrieval based on sections and paragraphs, and reports that the latter led to increased effectiveness. Allan (1995) examines the usefulness of passages for relevance feedback, which concerns deriving or learning useful query terms from retrieved documents. Hearst (1993) is an interesting attempt to enhance retrieval performance by using what they call a text tile, a discourse unit determined on the basis of the subject or content of the text. Callan (1994) proposes a hybrid approach that uses both passages and documents.</Paragraph> <Paragraph position="7"> Section 2 introduces the idea of bringing an IR technique to the topic identification task. Section 3 discusses a problem with the proposed method: it performs poorly on large documents. 
Section 4 responds to the problem: we propose using information on text structure to reduce irrelevant material in the document and increase effectiveness. In Section 5, we conduct a set of experiments to determine whether the use of text structure has a positive effect on performance.</Paragraph> </Section> </Paper>