File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/94/a94-1028_intro.xml
Size: 8,498 bytes
Last Modified: 2025-10-06 14:05:34
<?xml version="1.0" standalone="yes"?> <Paper uid="A94-1028"> <Title>Robust Text Processing in Automated Information Retrieval</Title> <Section position="2" start_page="0" end_page="169" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The task of information retrieval is to extract relevant documents from a large collection of documents in response to user queries. When the documents contain primarily unrestricted text (e.g., newspaper articles, legal documents), the relevance of a document is established through 'full-text' retrieval.</Paragraph> <Paragraph position="1"> This has usually been accomplished by identifying key terms in the documents (the process known as 'indexing'), which can then be matched against terms in queries (Salton, 1989). The effectiveness of any such term-based approach is directly related to the accuracy with which a set of terms represents the content of a document, as well as to how well it distinguishes a given document from other documents. In other words, we are looking for a representation R such that for any text items D1 and D2, R(D1) = R(D2) iff meaning(D1) = meaning(D2), at an appropriate level of abstraction (which may depend on the types and character of anticipated queries). 1 See (Harman, 1993) for a detailed description of TREC.</Paragraph> <Paragraph position="2"> The simplest word-based representations of content are usually inadequate since single words are rarely specific enough for accurate discrimination, and their grouping is often accidental. A better method is to identify groups of words that create meaningful phrases, especially if these phrases denote important concepts in the database domain. For example, joint venture is an important term in the Wall Street Journal (WSJ henceforth) database, while neither joint nor venture is important by itself. 
In fact, in an 800+ MByte database, both joint and venture would often be dropped from the list of terms by the system because their inverse document frequency (idf) weights were too low. In large databases comprising hundreds of thousands of documents, the use of phrasal terms is not just desirable; it becomes necessary.</Paragraph> <Paragraph position="3"> To illustrate this point, let us consider TREC Topic 104, an information request from which a database search query is to be built. The reader may note various sections of this Topic, with <desc> corresponding to the user's original request, further elaborated in <narr>, and <con> consisting of expert-assigned phrases denoting key concepts to be considered.</Paragraph> <Paragraph position="4"> <top> <num> Number: 104 <dom> Domain: Law and Government <title> Topic: Catastrophic Health Insurance <desc> Description: Document will enumerate provisions of the U.S. Catastrophic Health Insurance Act of 1988, or the political/legal fallout from that legislation.</Paragraph> <Paragraph position="5"> <narr> Narrative: A relevant document will detail the content of the U.S. medicare act of 1988 which extended catastrophic illness benefits to the elderly, with particular attention to the financing scheme which led to a firestorm of protest and a Congressional retreat, or a relevant document will detail the political/legal consequences of the catastrophic health insurance imbroglio and subsequent efforts by Congress to provide similar coverages through a less-controversial mechanism.</Paragraph> <Paragraph position="6"> 2. catastrophic-health program, catastrophic illness, catastrophic care, acute care, long-term nursing home care 3. 
American Association of Retired Persons, AARP, senior citizen, National Committee to Preserve Social Security and Medicare </top> If the phrases are ignored altogether, 2 this query will produce an output where the relevant documents are scattered, as shown in the first table below, which lists the ranks and scores of relevant documents within the top 100 retrieved documents. On the other hand, if we include even simple phrases, such as catastrophic-health program, acute care, home care, and senior citizen, we can considerably sharpen the outcome of the search, as seen in the second table. 3 A query obtained from the fields <title>, <desc> and <narr> will be, as may be expected, much weaker than the one using the <con> field, especially without the phrasal terms, because the narrative contains far fewer specific terms while containing some that may prove distracting, e.g., firestorm. In fact, Broglio and Croft (1993) and Broglio (personal communication, 1993) showed that the exclusion of the <con> field makes the queries quite ineffective, while adding the <narr> field makes them even worse, as they lose precision by as much as 30%. However, adding phrasal terms can improve things considerably. We return to this issue later in the paper.</Paragraph> <Paragraph position="7"> An accurate syntactic analysis is an essential prerequisite for the selection of phrasal terms. Various statistical methods, e.g., those based on word co-occurrences and mutual information, as well as partial parsing techniques, are prone to high error rates (sometimes as high as 50%), turning out many unwanted associations.</Paragraph> <Paragraph position="8"> 2 All single words (except stopwords such as articles or prepositions) are included in the query, including those making up the phrases. 3 Including extra terms in documents changes the way other terms are weighted. This issue is discussed later in this paper.</Paragraph>
<Paragraph position="9"> Therefore, a good, fast parser is necessary, but it is by no means sufficient. While syntactic phrases are often better indicators of content than 'statistical phrases', where words are grouped solely on the basis of physical proximity (e.g., &quot;college junior&quot; is not the same as &quot;junior college&quot;), the creation of compound terms makes the term matching process more complex since, in addition to the usual problems of synonymy and subsumption, one must deal with their structure (e.g., &quot;college junior&quot; is the same as &quot;junior in college&quot;). For all kinds of terms that can be assigned to the representation of a document, e.g., words, syntactic phrases, fixed phrases, and proper names, various levels of &quot;regularization&quot; are needed to ensure that syntactic or lexical variations of input do not obscure underlying semantic uniformity. Without actually doing semantic analysis, this kind of normalization can be achieved through the following processes: 4 (1) morphological stemming: e.g., retrieving is reduced to retriev; (2) lexicon-based word normalization: e.g., retrieval is reduced to retrieve; (3) operator-argument representation of phrases: e.g., information retrieval, retrieving of information, and retrieve relevant information are all assigned the same representation, retrieve+information; (4) context-based term clustering into synonymy classes and subsumption hierarchies: e.g., takeover is a kind of acquisition (in business), and Fortran is a programming language.</Paragraph> <Paragraph position="10"> Introduction of compound terms complicates the task of discovery of various semantic relationships among them. 
For example, the term natural language can often be considered to subsume any term denoting a specific human language, such as English. Therefore, a query containing the former may be expected to retrieve documents containing the latter. The same can be said about language and English, unless language is in fact a part of the compound term programming language, in which case the association between language and Fortran is appropriate. This is a problem because (a) it is standard practice to include both simple and compound terms in document representation, and (b) term associations have thus far been computed primarily at the word level (including fixed phrases), and therefore care must be taken when such associations are used in term matching. This may prove particularly troublesome for systems that attempt term clustering in order to create &quot;meta-terms&quot; to be used in document representation.</Paragraph> <Paragraph position="11"> 4 An alternative, but less efficient, method is to generate all variants (lexical, syntactic, etc.) of words/phrases in the queries (Sparck-Jones &amp; Tait, 1984).</Paragraph> </Section></Paper>