<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1042">
  <Title>INFERENCING IN INFORMATION RETRIEVAL</Title>
  <Section position="4" start_page="218" end_page="218" type="metho">
    <SectionTitle>
2. THE INFORMATION RETRIEVAL PROBLEM
</SectionTitle>
    <Paragraph position="0"> One of the essential characteristics of document storage and retrieval is the parallelism between the indexing and searching processes. Text is subjected to either manual or automatic indexing. If it is manual, there will generally be indexing rules. For example, in the case of NLM's MEDLINE database, one rule says that articles should be indexed with the most specific terms available in NLM's MeSH vocabulary [9].</Paragraph>
    <Paragraph position="1"> Thus, if an article is about aplastic anemia it should be indexed under that term (which is a bona fide MeSH term) and not under either the more general term &quot;anemia&quot; or the even more general term &quot;hematologic diseases&quot;. At search time, the user (or program acting on the user's behalf) needs to take this into account when formulating a search strategy and statement.</Paragraph>
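The specificity rule can be sketched in a few lines of Python. This is only an illustration, not NLM's indexing procedure: the three-term hierarchy is taken from the example above, while the PARENT table and the function names are assumptions made for the sketch.

```python
# Hypothetical three-level fragment of a MeSH-like hierarchy (child -> parent).
PARENT = {
    "aplastic anemia": "anemia",
    "anemia": "hematologic diseases",
    "hematologic diseases": None,
}


def ancestors(term):
    """Return the set of all terms above `term` in the hierarchy."""
    found = set()
    parent = PARENT.get(term)
    while parent is not None:
        found.add(parent)
        parent = PARENT.get(parent)
    return found


def most_specific(candidates):
    """Drop every candidate that is an ancestor of another candidate."""
    covered = set()
    for term in candidates:
        covered |= ancestors(term)
    return sorted(t for t in candidates if t not in covered)


if __name__ == "__main__":
    print(most_specific({"anemia", "aplastic anemia", "hematologic diseases"}))
```

A searcher who knows this rule can reason in the opposite direction: a query on the general term alone will miss articles indexed only under its more specific descendants unless the search is explicitly broadened to include them.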
    <Paragraph position="2"> Even if the indexing is automatic, and it may be as simple as creating an inverted index for all the words in the document, the user (or program) needs to be aware of the conventions for creating that index. This includes recognition of the fact that text words are generally run against a stopword list of function words and other highly frequent words before they are entered in the database. For example, if the user wants to query the MEDLINE database on &quot;the effects of acidosis on ATP&quot; and uses text words only, the two words &quot;acidosis&quot; and &quot;ATP&quot; will individually yield many postings and their coordination will yield another, smaller, set. However, adding in the word &quot;effects&quot; will not make the search results any more precise, since, as a highly frequent word in biomedical documents, this word has been placed on the stopword list.</Paragraph>
    <Paragraph position="3"> Without some knowledge of these conventions, the results can be confusing to end users.</Paragraph>
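A minimal sketch of this convention follows, assuming a whitespace tokenizer and a four-word stopword list. The list, the three citation titles, and all function names are invented for illustration; the sketch also assumes that stopwords are silently dropped from the query, which is one possible behavior and not necessarily how MEDLINE handles them.

```python
from collections import defaultdict

# Hypothetical stopword list; the real MEDLINE list is not given in the text.
STOPWORDS = {"the", "of", "on", "effects"}

# Three invented citation titles standing in for database records.
DOCS = {
    1: "Effects of acidosis on ATP levels",
    2: "Acidosis in renal failure",
    3: "ATP synthesis and the effects of hypoxia",
}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def build_index(docs):
    """Map each indexed word to the set of document ids (its postings)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search_all(index, query):
    """Coordinate (AND) the postings of every non-stopword query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings)


if __name__ == "__main__":
    idx = build_index(DOCS)
    print(search_all(idx, "acidosis ATP"))
    print(search_all(idx, "the effects of acidosis on ATP"))
```

Both queries return the same set in this sketch, because the word "effects" never reaches the index and so cannot narrow the result.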
    <Paragraph position="4"> Feedback of various kinds allows the user to negotiate with the retrieval system. This may involve refining a search statement after viewing the set of titles or documents initially retrieved, or concluding, because the number of postings for a search statement is unacceptably large or small, that the search strategy has been too broad, too narrow, or misformulated in some other way. It may also involve accessing information about the indexing rules or controlled vocabulary used in the system. The effect of this feedback is to make the user more aware of both the potential of the retrieval system and its limitations.</Paragraph>
    <Paragraph position="5"> Most researchers in intelligent interface design assume, to one degree or another, that the user will be &quot;left in the loop&quot; to negotiate with the system, resolving ambiguities, making relevancy judgements, and revising searches based on (user-independent) information supplied by the system. (See [10] for a strong statement about the desirability of giving the user maximum control over the entire search interaction.) Many of the attempts that have been made to apply NLP to information retrieval have involved the search interface; others have involved the indexing process. See [11] for a review of some of the more recent research efforts. The results of applying NLP to the information retrieval problem have not always been encouraging [12]. It is important to recognize why this might be so. First, retrieval experiments have been carried out that use partially developed parsing systems and then compare these results with other non-NLP methods. The results of these comparisons should, therefore, be viewed with caution. In some cases, so-called stemming procedures have been used which embody some linguistic sophistication, but, again, are not fully motivated or developed. The results of these experiments again underscore the limitations of the incomplete methods used. Second, given that the indexing and retrieval processes are so closely related, a successful application of NLP will need to be fully integrated with both processes. Some of the inconclusive results in [13], for example, may derive from a decision to ignore this point.</Paragraph>
  </Section>
  <Section position="5" start_page="218" end_page="220" type="metho">
    <SectionTitle>
3. THE SPECIALIST SYSTEM
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="218" end_page="219" type="sub_section">
      <SectionTitle>
3.1 Lexicon and Parser
</SectionTitle>
      <Paragraph position="0"> Lexical information is central to our parsing system. The lexicon currently contains some 51,000 lexical items, with over 88,000 lexical forms. It includes both general English lexical items and items specific to the domain of biomedicine. Lexical entries are created using our lexicon building tool called Lextool. Lextool is a menu-based system which accepts as input either a file of lexical items or lexical items typed in from the keyboard. With the interactive aid of the user, it generates fully specified lexical frames. Lextool incorporates rules that dictate which slots are permissible for the syntactic category in question. The coding system is closely tied to the codes given in the first edition of the Longman Dictionary of Contemporary English [14], although we have modified this scheme somewhat and have added codes, for example, those for logical interpretation, such as subject control, object raising, etc. We do not have the Longman dictionary in machine readable form, but other online information sources are available to lexical coders in the Lextool environment. These include Dorland's Illustrated Medical Dictionary [5]; Meshtool, our MeSH vocabulary browser; Meta, the Metathesaurus retrieval system; and access to sample sentences from MEDLINE citations which contain the lexical items in question. The two sample records shown below illustrate the type of information that is encoded for lexical items 2.</Paragraph>
      <Paragraph position="1">  entries. We are, however, currently considering a variety of approaches to semantics and our future work in this area may well have an impact on the structure of the lexical entries.</Paragraph>
      <Paragraph position="2"> The record for &quot;sad&quot; illustrates the sort of information we encode for adjectives. Included is variant information (i.e., whether the adjective forms regular comparatives and superlatives); positional information (e.g., whether the adjective is predicative, attributive, or both); adjective type (e.g., the &quot;1&quot; in &quot;attrib(1)&quot; indicates that this is an adjective of quality); information about possible complements (e.g., finite or infinitival complements); and information about any nominalizations.</Paragraph>
      <Paragraph position="3"> The record for &quot;aim&quot; illustrates some of the information we encode for nouns and verbs. Noun frames include variant information and information about possible complements and nominalizations, if relevant. Verbs are the most extensively coded. While any particular complement slot of a verb is optional, at least one from the set &quot;intran&quot;, &quot;tran&quot;, &quot;ditran&quot;, or &quot;cplxtran&quot; must be chosen. In addition, the particular type of object is encoded. For example, &quot;aim&quot; as a verb may be transitive, and if so, it can take a single np as an object or one of a variety of prepositional phrase complements (e.g., &quot;aim at the target&quot;, &quot;aim at winning&quot;, &quot;aim for the best&quot;, etc.). The grammar includes context-free BNF rules together with context-sensitive restrictions. It is based heavily on the Pundit grammar, but we continue to refine and modify it so that it can handle new constructions and additional lexical attributes. A slightly simplified sample parse is shown below.</Paragraph>
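The constraint that a verb frame must fill at least one transitivity slot can be illustrated with a small validation sketch. Nothing here reproduces the actual Lextool record format: the VerbFrame class, the dictionary layout, and the complement labels such as "pp:at" are hypothetical, and the four slot names follow the set listed above, reading the second slot as "tran", which is an assumption.

```python
from dataclasses import dataclass, field

# Slot names for the four complement classes; at least one is required.
TRANSITIVITY_SLOTS = ("intran", "tran", "ditran", "cplxtran")


@dataclass
class VerbFrame:
    """Hypothetical, simplified stand-in for a verb entry in the lexicon."""
    base: str
    # Maps a slot name to the complement patterns recorded under it.
    slots: dict = field(default_factory=dict)

    def validate(self):
        """Raise ValueError unless at least one transitivity slot is filled."""
        if not any(name in self.slots for name in TRANSITIVITY_SLOTS):
            raise ValueError(f"{self.base}: no transitivity slot chosen")
        unknown = set(self.slots) - set(TRANSITIVITY_SLOTS)
        if unknown:
            raise ValueError(f"{self.base}: unknown slots {sorted(unknown)}")
        return True


# Invented complement labels: a bare np object, or a pp headed by at / for.
aim = VerbFrame("aim", {"tran": ["np", "pp:at", "pp:for"]})

if __name__ == "__main__":
    print(aim.validate())
```

A check of this kind is the sort of rule a frame-building tool could apply before accepting an entry, so that every verb reaches the parser with at least one usable complement pattern.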
      <Paragraph position="4">  We have investigated the possibility of using the UMLS semantic types for expressing selectional restrictions. Our initial assessment is that they may be profitably used, but since we are currently developing a general approach to semantics, we have not yet implemented any restrictions of this sort. In the meantime, we report semantic types in the output parse.</Paragraph>
      <Paragraph position="5"> The semantic types are not directly encoded in lexical entries, but are looked up at parse time in our Metathesaurus retrieval application.</Paragraph>
    </Section>
    <Section position="2" start_page="219" end_page="220" type="sub_section">
      <SectionTitle>
3.2 Access to Knowledge Sources
</SectionTitle>
      <Paragraph position="0"> The Metathesaurus application allows users (or programs) to search for Metathesaurus terminology, reporting the term and its source vocabulary; its definition, synonyms, related or associated terms; its semantic types; its lexical tags and variants; or its contexts, e.g., its ancestors or descendants.</Paragraph>
      <Paragraph position="1">  Simplified sample output for some queries for &quot;Gierke's disease&quot; is shown below. Note that &quot;Gierke's disease&quot; is a synonym of &quot;Glycogen Storage Disease Type I&quot;, and is, therefore, mapped to this term throughout.</Paragraph>
      <Paragraph position="2"> [CN = concept name, DEF = definition, VOC = source vocabulary (MSH = MeSH, SNOMED = Systematized Nomenclature of Medicine), STY = semantic type, SY = synonym].
Concept Definition [return to quit]: Gierke's disease
CN: Glycogen Storage Disease Type I
DEF: An autosomal recessive disease in which gene expression of glucose-6-phosphatase is absent, resulting in hypoglycemia due to lack of glucose production. Accumulation of glycogen in liver and kidney leads to organomegaly, particularly massive hepatomegaly. Increased concentrations of lactic acid and hyperlipidemia appear in the plasma. Clinical gout often appears in early childhood.</Paragraph>
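The synonym mapping seen in this output can be sketched as a lookup that resolves any synonym to its concept name before the record is fetched. The sketch is an assumption-laden stand-in for the Metathesaurus application: the CONCEPTS table holds only the one concept quoted above, its dictionary layout is invented, and the function names are hypothetical.

```python
# Tiny invented stand-in for the Metathesaurus; the keys CN, SY and DEF follow
# the legend above, but the record layout itself is an assumption.
CONCEPTS = {
    "Glycogen Storage Disease Type I": {
        "SY": ["Gierke's disease"],
        "DEF": "An autosomal recessive disease in which gene expression of "
               "glucose-6-phosphatase is absent, resulting in hypoglycemia "
               "due to lack of glucose production.",
    },
}


def build_synonym_map(concepts):
    """Map every lowercased concept name and synonym to its concept name."""
    mapping = {}
    for name, record in concepts.items():
        mapping[name.lower()] = name
        for synonym in record.get("SY", []):
            mapping[synonym.lower()] = name
    return mapping


SYNONYM_MAP = build_synonym_map(CONCEPTS)


def lookup(term):
    """Return (concept name, record) for a term or synonym, else None."""
    name = SYNONYM_MAP.get(term.strip().lower())
    if name is None:
        return None
    return name, CONCEPTS[name]


if __name__ == "__main__":
    concept, record = lookup("Gierke's disease")
    print("CN:", concept)
    print("DEF:", record["DEF"])
```

Resolving synonyms once, at lookup time, means every later query (definition, semantic type, context) operates on the same concept name regardless of which surface form the user typed.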
    </Section>
    <Section position="3" start_page="220" end_page="220" type="sub_section">
      <SectionTitle>
3.3 Retrieval Module
</SectionTitle>
      <Paragraph position="0"> As noted above, we have developed a retrieval module in order to test the extent to which NLP techniques may improve information retrieval. The current implementation of the module processes files such as MEDLINE citation records, creates an index for the items in all relevant fields, including MeSH terminology and text words, and provides for Boolean retrieval of these items. In addition to retrieval based on the MeSH vocabulary and text words, the retrieval module also provides for noun phrase extraction, indexing, and retrieval.</Paragraph>
      <Paragraph position="1"> A noun phrase index is created by parsing the textual fields of input records, generating several variants of each noun phrase and computing synonyms of each variant. During retrieval, noun phrases are similarly extracted from a parse of the user's query and processed against the noun phrase index.</Paragraph>
      <Paragraph position="2"> The retrieval module gives us direct access to the test collection of queries and citation records and was heavily used in the experiment reported below.</Paragraph>
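The noun phrase indexing and retrieval steps described above can be sketched as follows. This is a rough illustration under strong assumptions, not the module itself: parsing is skipped entirely (the records arrive with noun phrases already extracted), variant generation is reduced to lowercasing and a naive singular form, and the SYNONYMS table is an invented substitute for a Metathesaurus lookup.

```python
from collections import defaultdict

# Invented synonym table standing in for a Metathesaurus lookup.
SYNONYMS = {"gierke's disease": ["glycogen storage disease type i"]}


def variants(phrase):
    """Generate a few crude variants: lowercased form and naive singular."""
    base = " ".join(phrase.lower().split())
    forms = {base}
    if base.endswith("s") and not base.endswith("ss"):
        forms.add(base[:-1])
    return forms


def expand(phrase):
    """Return every variant of a phrase plus the synonyms of each variant."""
    keys = set()
    for form in variants(phrase):
        keys.add(form)
        keys.update(SYNONYMS.get(form, []))
    return keys


def build_np_index(records):
    """records maps a record id to noun phrases already extracted by a parser."""
    index = defaultdict(set)
    for record_id, phrases in records.items():
        for phrase in phrases:
            for key in expand(phrase):
                index[key].add(record_id)
    return index


def retrieve(index, query_phrases):
    """Union the postings for every expanded form of each query noun phrase."""
    hits = set()
    for phrase in query_phrases:
        for key in expand(phrase):
            hits |= index.get(key, set())
    return hits


if __name__ == "__main__":
    idx = build_np_index({1: ["Gierke's disease"], 2: ["aplastic anemias"]})
    print(retrieve(idx, ["Glycogen Storage Disease Type I"]))
    print(retrieve(idx, ["aplastic anemia"]))
```

Because the same expansion is applied at indexing time and at query time, a record that mentions only a synonym or an inflected form can still be matched, which reflects the parallelism between indexing and searching stressed in Section 2.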
    </Section>
  </Section>
</Paper>