<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1004">
  <Title>Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax*</Title>
  <Section position="3" start_page="0" end_page="24" type="intro">
    <SectionTitle>
2 Background and Introduction
</SectionTitle>
    <Paragraph position="0"> NLP techniques have been applied to the extraction of information from corpora for tasks such as free indexing, i.e. extraction of descriptors from corpora (Metzler and Haas, 1989; Schwarz, 1990; Sheridan and Smeaton, 1992; Strzalkowski, 1996), term acquisition (Smadja and McKeown, 1991; Bourigault, 1993; Justeson and Katz, 1995; Daille, 1996), or extraction of linguistic information, e.g. support verbs (Grefenstette and Teufel, 1995) and event structure of verbs (Klavans and Chodorow, 1992).</Paragraph>
    <Paragraph position="1"> Although useful, these approaches suffer from two weaknesses which we address. First is the issue of filtering term lists; this has been dealt with by constraints on processing and by post-processing over-generated lists. Second is the difficulty of identifying related terms across parts of speech.</Paragraph>
    <Paragraph position="2"> We address these limitations through the use of controlled indexing, that is, indexing with reference to previously available authoritative term lists, such as (NLM, 1995). Our approach is fully automatic, but permits effective combination of available resources (such as thesauri) with language processing technology, i.e., morphology, part-of-speech tagging, and syntactic analysis.</Paragraph>
    <Paragraph position="3"> Automatic controlled indexing is a more difficult task than it may seem at first glance: * controlled indexing on single words must account for polysemy and word disambiguation (Krovetz and Croft, 1992; Klavans, 1995).</Paragraph>
    <Paragraph position="4"> * controlled indexing on multi-word terms must consider the numerous forms of term variations (Dunham, Pacak, and Pratt, 1978; Sparck Jones and Tait, 1984; Jacquemin, 1996).</Paragraph>
    <Paragraph position="5"> We focus here on the multi-word task. Our system exploits a morphological processor and a transformation-based parser for the extraction of multi-word controlled indexes.</Paragraph>
    <Paragraph position="6"> The action of the system is twofold. First, a corpus is enriched by tagging each word unambiguously, and then expanded by linking each word with all its possible derivatives. For example, for English, the word genes is tagged as a plural noun and morphologically connected to genic, genetic, genome, genotoxic, genetically, etc. Second, the term list is dynamically expanded through syntactic transformations which allow the retrieval of term variants. For example, genic expressions, genes were expressed, expression of this gene, etc. are extracted as variants of gene expression.</Paragraph>
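    The two steps above can be illustrated with a minimal Python sketch. It is not the system described in this paper: the hand-written FAMILIES table stands in for the morphological processor, and a simple token window stands in for the syntactic transformations, so it checks co-occurrence only and ignores dependency relations. The names FAMILIES, root_of, and find_variants and the window size are assumptions made for illustration.

```python
import re

# Hypothetical, hand-written morphological families. A real system would
# obtain these links from a morphological processor, not a literal table.
FAMILIES = {
    "gene": {"gene", "genes", "genic", "genetic", "genetically"},
    "expression": {"expression", "expressions", "express", "expressed"},
}


def root_of(word):
    """Return the family root of a word, or None if the word is unknown."""
    word = word.lower()
    for root, members in FAMILIES.items():
        if word in members:
            return root
    return None


def find_variants(term, text, window=4):
    """Return token spans in which every word of `term` is matched by a
    member of its morphological family within `window` tokens.

    This window test is a crude stand-in for syntactic transformations.
    """
    wanted = {root_of(w) for w in term.split()}
    if None in wanted:
        raise ValueError("term contains a word with no known family")
    tokens = re.findall(r"[A-Za-z]+", text)
    spans = []
    for i, tok in enumerate(tokens):
        if root_of(tok) not in wanted:
            continue  # a span must start on a word related to the term
        seen = set()
        for j in range(i, min(i + window, len(tokens))):
            r = root_of(tokens[j])
            if r in wanted:
                seen.add(r)
            if seen == wanted:
                spans.append(" ".join(tokens[i:j + 1]))
                break
    return spans


if __name__ == "__main__":
    sample = ("Genic expressions were measured. The genes were expressed "
              "early, and expression of this gene was stable.")
    for span in find_variants("gene expression", sample):
        print(span)
```

    Run on the sample sentence, the sketch reports the three variants quoted above for gene expression. Because it only tests proximity, it would also accept spans in which the two words are not syntactically related, which is the kind of over-generation that transformations preserving dependency relations are meant to avoid.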
    <Paragraph position="7"> This system relies on a full-fledged unification formalism and thus is well adapted to a fine-grained identification of terms related in syntactically and morphologically complex ways. The same system has been effectively applied both to English and French, although this paper focuses on French (see (Jacquemin, 1994) for the case of syntactic variants in English). All evaluation experiments were performed on two corpora: a training corpus [ECI] (ECI, 1989 and 1990) used for the tuning of the metagrammar and a test corpus [AGR] (AGR, 1995) used for evaluation. [ECI] is a subset of the European Corpus Initiative data composed of 1.3 million words of the French newspaper &quot;Le Monde&quot;; [AGR] is a set of abstracts of scientific papers in the agricultural domain from INIST/CNRS (1.1 million words). A list of terms is associated with each corpus: the terms corresponding to [ECI] were automatically extracted by LEXTER (Bourigault, 1993) and the terms corresponding to [AGR] were extracted from the AGROVOC term list owned by INIST/CNRS.</Paragraph>
    <Paragraph position="8"> The following section describes methods for grouping multi-word term variants; Section 4 presents a linguistically motivated method for lexical analysis (inflectional analysis, part-of-speech tagging, and derivational analysis); Section 5 explains term expansion methods, namely constructions with a local parse through syntactic transformations that preserve dependency relations; Section 6 illustrates the empirical tuning of linguistic rules; Section 7 presents an evaluation of the results in terms of precision and recall.</Paragraph>
  </Section>
</Paper>