<?xml version="1.0" standalone="yes"?>
<Paper uid="W93-0302">
  <Title>ROBUST TEXT PROCESSING IN AUTOMATED INFORMATION RETRIEVAL</Title>
  <Section position="3" start_page="0" end_page="9" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> A typical information retrieval (IR) task is to select documents from a database in response to a user's query, and rank these documents according to relevance. This has usually been accomplished using statistical methods (often coupled with manual encoding) that (a) select terms (words, phrases, and other units) from documents that are deemed to best represent their contents, and (b) create an inverted index file (or files) that provides easy access to documents containing these terms. An important issue here is that of finding an appropriate combination of term weights which would reflect each term's relative contribution to the information contents of the document. Among many possible weighting schemes, the inverted document frequency (idf) has come to be recognized as universally applicable across a variety of different text collections. Once the index is created, the search process will attempt to match a preprocessed user query (or queries) against representations of documents, in each case determining a degree of relevance between the two which depends upon the number and types of matching terms. Although many sophisticated search and matching methods are available, the crucial problem remains that of an adequate representation of contents for both the documents and the queries.</Paragraph>
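The indexing-and-ranking pipeline sketched above can be illustrated with a minimal implementation. This is an illustrative sketch, not the system described in the paper: the documents are made up, and the idf formula used (log of collection size over document frequency) is one common variant among the many weighting schemes the text mentions.

```python
import math
from collections import defaultdict

def build_index(docs):
    """Build an inverted index mapping each term to the documents
    that contain it, together with raw term frequencies."""
    index = defaultdict(dict)          # term -> {doc_id: tf}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def idf(term, index, n_docs):
    """Inverse document frequency: rarer terms score higher."""
    df = len(index.get(term, {}))
    return math.log(n_docs / df) if df else 0.0

def score(query, docs, index):
    """Rank documents by summed tf * idf over matching query terms."""
    n = len(docs)
    scores = defaultdict(float)
    for term in query.lower().split():
        w = idf(term, index, n)
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf * w
    return sorted(scores.items(), key=lambda x: -x[1])

# Toy collection (hypothetical): the document matching both query
# terms outranks documents matching only one.
docs = {
    "d1": "joint venture announced in tokyo",
    "d2": "the venture capital fund",
    "d3": "joint statement on trade",
}
index = build_index(docs)
ranking = score("joint venture", docs, index)
```

A real system would add stemming, stopword removal, and document-length normalization, but the skeleton of term selection, inverted indexing, and weighted matching is the same.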
    <Paragraph position="1"> The simplest word-based representations of contents are usually inadequate since single words are rarely specific enough for accurate discrimination, and their grouping is often accidental. A better method is to identify groups of words that create meaningful phrases, especially if these phrases denote important concepts in the database domain. For example, joint venture is an important term in the Wall Street Journal (WSJ henceforth) database, while neither joint nor venture is important by itself. In the retrieval experiments with the WSJ database, we noticed that both joint and venture were dropped from the list of terms by the system because their idf weights were too low. In large databases, such as TIPSTER/TREC, the use of phrasal terms is not just desirable, it becomes necessary.</Paragraph>
    <Paragraph position="2"> The question thus becomes, how to identify the correct phrases in the text? Both statistical and syntactic methods were used before with only limited success. Statistical methods based on word co-occurrences and mutual information are prone to high error rates, turning out many unwanted associations.</Paragraph>
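The mutual-information approach criticized above can be sketched in a few lines. This is a generic pointwise mutual information (PMI) scorer over adjacent word pairs, not the specific method of any system cited here; the frequency threshold is an illustrative guard against the spurious rare-pair associations the text warns about.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information:
    log2( P(w1,w2) / (P(w1) * P(w2)) ). High PMI suggests the words
    co-occur more often than chance, but low-frequency pairs can
    still score spuriously high -- the error-rate problem noted in
    the text -- so pairs below min_count are discarded."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue
        p_xy = c / (n - 1)
        p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

Even with thresholding, such purely statistical grouping cannot tell a meaningful phrase from a frequent but accidental juxtaposition, which motivates the syntactic approach pursued in the rest of the paper.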
    <Paragraph position="3"> Syntactic methods suffered from the low quality of generated parse structures, which could be attributed to limited-coverage grammars and the lack of adequate lexicons. In fact, the difficulties encountered in applying computational linguistics technologies to text processing have contributed to a widespread belief that automated natural language processing may not be suitable in IR. These difficulties included inefficiency, lack of robustness, and the prohibitive cost of the manual effort required to build lexicons and knowledge bases for each new text domain. On the other hand, while numerous experiments did not establish the usefulness of linguistic methods in IR, they cannot be considered conclusive because of their limited scale.1 The rapid progress in Computational Linguistics over the last few years has changed this equation in various ways. First of all, large-scale resources became available: on-line lexicons, including the Oxford Advanced Learner's Dictionary (OALD), the Longman Dictionary of Contemporary English (LDOCE), Webster's Dictionary, the Oxford English Dictionary, the Collins Dictionary, and others, as well as large text corpora, many of which can now be obtained for research purposes. Robust text-oriented software tools have been built, including part-of-speech taggers (stochastic and otherwise), and fast parsers capable of processing text at speeds of 4200 words per minute or more (e.g., the TTP parser developed by the author). While many of the fast parsers are not very accurate (they are usually partial analyzers by design),2 some, like TTP, perform in fact no worse than standard full-analysis parsers, which are many times slower and far less robust.3 An accurate syntactic analysis is an essential prerequisite for term selection, but it is by no means sufficient. 
Syntactic parsing of the database contents is usually attempted in order to extract linguistically motivated phrases, which presumably are better indicators of contents than &quot;statistical phrases&quot; where words are grouped solely on the basis of physical proximity (e.g., &quot;college junior&quot; is not the same as &quot;junior college&quot;). However, creation of such compound terms makes the term matching process more complex since, in addition to the usual problems of synonymy and subsumption, one must deal with their structure (e.g., &quot;college junior&quot; is the same as &quot;junior in college&quot;). In order to deal with structure, the parser's
1 Standard IR benchmark collections are statistically too small and the experiments can easily produce counterintuitive results. For example, the Cranfield collection is only approx. 180,000 English words, while the CACM-3204 collection is approx. 200,000 words.</Paragraph>
    <Paragraph position="4"> 2 Partial parsing is usually fast enough, but it also generates noisy data: as many as 50% of all generated phrases could be incorrect (Lewis and Croft, 1990).</Paragraph>
    <Paragraph position="5"> 3 &amp;quot;I'rP has been shown to produce parse structures which sum  no worse m recall, precision and crossing rate than those generated by flill-setle lmguisuc parsers when compared to hand-coded Treebank parse tree,.</Paragraph>
    <Paragraph position="6"> output needs to be &quot;normalized&quot; or &quot;regularized&quot; so that complex terms with the same or closely related meanings would indeed receive matching representations. This goal has been achieved to a certain extent in the present work. As will be discussed in more detail below, indexing terms were selected from among head-modifier pairs extracted from predicate-argument representations of sentences.</Paragraph>
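The effect of this normalization can be illustrated with a toy sketch. The pair representation and the tiny rule for "X in Y" constructions are illustrative assumptions, not the paper's actual predicate-argument extraction, which operates on full parse structures.

```python
def head_modifier_pair(phrase):
    """Reduce a simple noun phrase or an 'X in Y' construction to a
    (head, modifier) pair, so that syntactic variants of the same
    concept receive the same indexing term. A toy sketch: a real
    system derives pairs from predicate-argument structures."""
    words = phrase.lower().split()
    if "in" in words:                  # "junior in college" -> (junior, college)
        i = words.index("in")
        head, modifier = words[i - 1], words[i + 1]
    else:                              # "college junior" -> (junior, college)
        head, modifier = words[-1], " ".join(words[:-1])
    return (head, modifier)
```

Note that "college junior" and "junior in college" normalize to the same pair, while "junior college" yields a different head and so remains a distinct term, matching the distinction drawn in the text.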
    <Paragraph position="7"> The next important task is to achieve normalization across different terms with close or related meanings. This can be accomplished by discovering various semantic relationships among words and phrases, such as synonymy and subsumption. For example, the term natural language can be considered, in certain domains at least, to subsume any term denoting a specific human language, such as English. Therefore, a query containing the former may be expected to retrieve documents containing the latter. The system presented here computes term associations from text at the word and fixed-phrase level and then uses these associations in query expansion.</Paragraph>
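The query-expansion step can be sketched as follows. The association table and its strength values are hypothetical placeholders: the paper computes associations from the corpus itself, whereas here they are hand-written to show the mechanism.

```python
def expand_query(terms, associations, threshold=0.5):
    """Expand a query with associated terms whose association
    strength meets a threshold, so that a query containing a
    subsuming term (e.g. 'natural language') can also retrieve
    documents containing a subsumed one (e.g. 'english')."""
    expanded = list(terms)
    for term in terms:
        for related, strength in associations.get(term, []):
            if strength >= threshold and related not in expanded:
                expanded.append(related)
    return expanded

# Hypothetical association table with made-up strengths.
associations = {"natural language": [("english", 0.8), ("fortran", 0.2)]}
result = expand_query(["natural language", "processing"], associations)
```

The threshold plays the role of the filter discussed next: it admits strong associations such as subsumption while discarding weak or spurious ones.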
    <Paragraph position="8"> A fairly primitive filter is employed to separate synonymy and subsumption relationships from others, including antonymy and complementation, some of which are strongly domain-dependent. This process has led to increased retrieval precision in experiments with smaller and more cohesive collections (CACM-3204).</Paragraph>
    <Paragraph position="9"> In the following sections we present an overview of our system, with the emphasis on its text-processing components. We would like to point out here that the system is completely automated, i.e., all the processing steps, those performed by the statistical core and those performed by the natural language processing components, are done automatically, and no human intervention or manual encoding is required.</Paragraph>
  </Section>
</Paper>