File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/94/h94-1072_intro.xml
Size: 5,361 bytes
Last Modified: 2025-10-06 14:05:46
<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1072"> <Title>DOCUMENT REPRESENTATION IN NATURAL LANGUAGE TEXT RETRIEVAL</Title> <Section position="3" start_page="0" end_page="364" type="intro"> <SectionTitle> 1. INTRODUCTION </SectionTitle>
<Paragraph position="0"> The task of information retrieval is to extract relevant documents from a large collection of documents in response to user queries.</Paragraph>
<Paragraph position="1"> When the documents contain primarily unrestricted text (e.g., newspaper articles, legal documents, etc.), the relevance of a document is established through 'full-text' retrieval. This has usually been accomplished by identifying key terms in the documents (a process known as 'indexing') that can then be matched against terms in queries [2]. The effectiveness of any such term-based approach is directly related to the accuracy with which a set of terms represents the content of a document, as well as how well it distinguishes a given document from other documents. In other words, we are looking for a representation R such that for any text items D1 and D2, R(D1) = R(D2) iff meaning(D1) = meaning(D2), at an appropriate level of abstraction (which may depend on the types and character of anticipated queries).</Paragraph>
<Paragraph position="2"> The simplest word-based representations of content are usually inadequate, since single words are rarely specific enough for accurate discrimination and their grouping is often accidental. A better method is to identify groups of words that create meaningful phrases, especially if these phrases denote important concepts in the database domain. For example, joint venture is an important term in the Wall Street Journal (WSJ henceforth) database, while neither joint nor venture is important by itself. In fact, in an 800+ MByte database, both joint and venture would often be dropped from the list of terms by the system because their inverted document frequency (idf) weights were too low. In large databases comprising hundreds of thousands of documents (see [1] for a detailed introduction to TREC), the use of phrasal terms is not just desirable, it becomes necessary.</Paragraph>
<Paragraph position="4"> An accurate syntactic analysis is an essential prerequisite for the selection of phrasal terms. Various statistical methods, e.g., those based on word co-occurrences and mutual information, as well as partial parsing techniques, are prone to high error rates (sometimes as high as 50%), producing many unwanted associations. Therefore a good, fast parser is necessary, but it is by no means sufficient.</Paragraph>
<Paragraph position="5"> While syntactic phrases are often better indicators of content than 'statistical phrases' -- where words are grouped solely on the basis of physical proximity, e.g., &quot;college junior&quot; is not the same as &quot;junior college&quot; -- the creation of compound terms makes the term-matching process more complex, since in addition to the usual problems of synonymy and subsumption, one must deal with their structure (e.g., &quot;college junior&quot; is the same as &quot;junior in college&quot;). For all kinds of terms that can be assigned to the representation of a document, e.g., words, syntactic phrases, fixed phrases, and proper names, various levels of &quot;regularization&quot; are needed to ensure that syntactic or lexical variations of input do not obscure underlying semantic uniformity. Without actually doing semantic analysis, this kind of normalization can be achieved through the following processes, sketched below: (1) morphological stemming: e.g., retrieving is reduced to retriev; (2) lexicon-based word normalization: e.g., retrieval is reduced to retrieve; (3) operator-argument representation of phrases: e.g., information retrieval, retrieving of information, and retrieve relevant information are all assigned the same representation, retrieve+information; (4) context-based term clustering into synonymy classes and subsumption hierarchies: e.g., takeover is a kind of acquisition (in business), and Fortran is a programming language. An alternative, but less efficient, method is to generate all variants (lexical, syntactic, etc.) of words/phrases in the queries [3].</Paragraph>
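As an illustration only, here is a minimal Python sketch of the four normalization processes; the LEXICON and SYNONYMY tables, the suffix list, and the function names are toy assumptions made for this sketch, not the resources or code of the system described in this paper.

    # (2) lexicon-based normalization table and (4) synonymy classes;
    # toy stand-ins for what a real system derives from large resources.
    LEXICON = {"retrieval": "retrieve", "retrieving": "retrieve"}
    SYNONYMY = {"takeover": "acquisition"}

    def stem(word):
        # (1) crude morphological stemming: 'retrieving' -> 'retriev'
        for suffix in ("ing", "al", "es", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 4:
                return word[:-len(suffix)]
        return word

    def normalize(word):
        # Prefer the lexicon's base form over the bare stem, then map
        # the result to the representative of its synonymy class.
        base = LEXICON.get(word.lower(), stem(word.lower()))
        return SYNONYMY.get(base, base)

    def operator_argument(head, argument):
        # (3) a parsed phrase is reduced to one normalized
        # head+argument term.
        return normalize(head) + "+" + normalize(argument)

    # 'information retrieval', 'retrieving of information', and
    # 'retrieve relevant information' all yield the same term:
    assert operator_argument("retrieval", "information") == "retrieve+information"
    assert operator_argument("retrieving", "information") == "retrieve+information"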
<Paragraph position="6"> In traditional full-text indexing, terms are selected from among words and stems and weighted according to their frequencies and distribution among documents. The introduction of terms derived primarily by linguistic means into the representation of documents changes the balance of frequency-based weighting and therefore calls for more complex term-weighting schemes than those devised and tested on single-word representations. The standard tf.idf scheme (term frequency times inverted document frequency), for example, weights terms proportionately to their global scores (idf) and their in-document frequencies (tf), usually normalized by document length. It is appropriate when most uses of a term are explicit, that is, when the appropriate words actually occur in the text. This, however, is frequently not the case with proper names or phrases, as various anaphors can be used to create implicit term occurrences.</Paragraph>
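For concreteness, the following minimal Python sketch shows one common instantiation of tf.idf, with logarithmic idf and document-length normalization; the exact formula and the toy three-document collection are assumptions made for illustration, not this system's actual weighting.

    import math
    from collections import Counter

    def tf_idf(documents):
        # df(t): how many documents contain each term.
        n_docs = len(documents)
        doc_freq = Counter()
        for doc in documents:
            doc_freq.update(set(doc))
        # weight(t, d) = tf(t, d)/|d| * log(N/df(t)): the in-document
        # frequency, normalized by document length, times the term's
        # global idf score.
        return [
            {term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
             for term, count in Counter(doc).items()}
            for doc in documents
        ]

    # A word occurring in every document gets idf = log(N/N) = 0 and is
    # effectively dropped, while a rarer phrasal term keeps its weight:
    docs = [["joint", "venture", "joint+venture"],
            ["joint", "venture"],
            ["joint", "venture", "profit"]]
    weights = tf_idf(docs)
    print(weights[0]["joint"])          # 0.0
    print(weights[0]["joint+venture"])  # ~0.37

</Section> </Paper>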