<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1103">
  <Title>Use of Dependency Tree Structures for the Microcontext Extraction</Title>
  <Section position="2" start_page="0" end_page="23" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Empirical methods in natural language processing (NLP) employ learning techniques to automatically extract linguistic knowledge from natural language corpora; for an overview of this field see (Brill and Mooney 1997). This paper aims to show their usefulness in the field of information retrieval (IR). As the effects and the contribution of this discipline to IR have not yet been well examined and evaluated, various uses of NLP techniques in IR are only marginally mentioned in well-known monographs published in the last ten years, e.g. (Salton 1989), (Frakes and Baeza-Yates 1992), (Korfhage 1997).</Paragraph>
    <Paragraph position="1"> A textual IR system stores a collection of documents and special data structures for effective searching. A textual document is a sequence of terms. When analysing the content of a document, terms are the basic processed units -- usually they are words of natural language. When retrieving, the IR system returns documents presumed to be of interest to the user in response to a query. The user's query is a formal statement of the user's information need. The documents that are interesting for the user (relative to the submitted query) are relevant; the others are non-relevant. The effectiveness of IR systems is usually measured in terms of precision, the percentage of retrieved documents that are relevant, and recall, the percentage of relevant documents that are retrieved.</Paragraph>
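The two effectiveness measures defined above can be sketched as follows; this is a minimal illustration assuming hypothetical sets of retrieved and relevant document identifiers, not part of the paper's system.

```python
def precision(retrieved, relevant):
    """Percentage (as a fraction) of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Percentage (as a fraction) of relevant documents that are retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

# Toy example: four documents returned, three actually relevant.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
print(precision(retrieved, relevant))  # 0.5 (d2 and d4 out of four retrieved)
print(recall(retrieved, relevant))     # 2/3 (d2 and d4 out of three relevant)
```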
    <Paragraph position="2"> The starting point of our consideration of IR was a critique of word-based retrieval techniques. Traditional IR systems treat the query as a pattern of words to be matched by documents. Unfortunately, the effectiveness of these word-matching systems is mostly poor because the system retrieves only the documents that contain words that also occur in the query. However, the user does not in fact look for the words used in the query. The user desires the sense of the words and wants to retrieve the documents containing words having the same sense. In contrast to the word-based approach, a sense-based IR system treats the query as a pattern of the required sense. In order to match this sense with the sense of words in documents, the senses of ambiguous words must be determined. Therefore good word sense disambiguation is necessary in a sense-based IR system.</Paragraph>
    <Paragraph position="3"> Ambiguity and synonymity of words are properties of natural language that cause a very serious problem in IR. Both ambiguous words and synonyms depress the effectiveness of word-matching systems. The direct effect of polysemy on word-matching systems is to decrease precision (e.g., queries about financial banks retrieve documents about rivers). Synonymity decreases recall. If one sense is expressed by different synonyms in different documents, the word-matching system will retrieve all the documents only if all the synonyms are given in the query. Unfortunately, polysemy has another negative effect: it also prevents the effective use of thesauri. Consequently, thesauri cannot be directly used to eliminate the problem of synonyms.</Paragraph>
    <Paragraph position="4"> In our opinion, if a retrieval system is not able to identify homonyms and synonyms and to discriminate their senses, ambiguity and synonymity will remain among the main factors causing 1) low recall, 2) low precision, and 3) the known and inevitable fact that recall and precision are inversely related. There is some evidence that lexical context analysis could be a good way to eliminate, or at least reduce, these difficulties -- see below.</Paragraph>
    <Paragraph position="5"> How can we take the step from words towards senses? Since analysing word contexts is the only way to estimate the sense of words, the treatment of word contexts is a central problem in sense-based retrieval.</Paragraph>
    <Paragraph position="6"> Knowing word contexts, we can determine the measure of collocating, i.e. the extent to which a pair of words collocates. The knowledge of collocations can be used in IR for several purposes: making up contextual representations of words, resolving word ambiguity, estimating semantic word similarity, tuning the user's query in interaction with the user, and quantifying the significance of words for retrieval according to the entropy of their contexts.</Paragraph>
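The two quantities this paragraph mentions -- a collocation measure and the entropy of a word's contexts -- can be sketched on a toy corpus. This is an illustrative assumption (pointwise mutual information as one standard collocation score), not the measure the authors themselves adopt; the corpus and all names are hypothetical.

```python
import math
from collections import Counter

# Toy tokenised corpus; each sentence serves as one "context".
corpus = [
    "the bank raised the interest rate".split(),
    "the river bank was flooded".split(),
    "interest in the rate cut grew".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
pairs = Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        for v in sent[i + 1:]:
            pairs[frozenset((w, v))] += 1  # unordered within-sentence co-occurrence

total = sum(unigrams.values())

def pmi(w, v):
    """Extent to which w and v collocate beyond what chance predicts."""
    joint = pairs[frozenset((w, v))] / total
    if joint == 0:
        return float("-inf")
    return math.log2(joint / ((unigrams[w] / total) * (unigrams[v] / total)))

def context_entropy(w):
    """Entropy of the distribution of words co-occurring with w; a low value
    suggests w appears in focused contexts and is significant for retrieval."""
    ctx = Counter()
    for sent in corpus:
        if w in sent:
            ctx.update(v for v in sent if v != w)
    n = sum(ctx.values())
    return -sum((c / n) * math.log2(c / n) for c in ctx.values())
```

On this corpus, `pmi("interest", "rate")` is positive (they co-occur twice), while `pmi("interest", "river")` is negative infinity (they never co-occur), matching the intuition that the former pair collocates.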
    <Paragraph position="7"> Section 2 expresses our motivation: the investigation of word contexts helps us to develop an efficient IR system. The next section is devoted to analysing Czech texts and suggests a construction of dependency microcontext structures making use of the tree structures automatically created in the process of Prague Dependency Treebank annotation. The following part focuses on applications of contextual knowledge in IR and refers to a project working on an experimental IR textual database. Finally, we summarise the results of this study.</Paragraph>
  </Section>
</Paper>