<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1004">
  <Title>Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax*</Title>
  <Section position="4" start_page="24" end_page="25" type="metho">
    <SectionTitle>
3 Variation in Multi-Word Terms: A Description of the Problem
</SectionTitle>
    <Paragraph position="0"> Linguistic variation is a major concern in studies on automatic indexing. Variations can be classified into three major categories: * Syntactic (Type 1): the content words of the original term are found in the variant, but the syntactic structure of the term is modified, e.g. technique for performing volumetric measurements is a Type 1 variant of measurement technique.</Paragraph>
    <Paragraph position="1"> * Morpho-syntactic (Type 2): the content words of the original term or one of their derivatives are found in the variant. The syntactic structure of the term is also modified, e.g. electrophoresed on a neutral polyacrylamide gel is a Type 2 variant of gel electrophoresis.</Paragraph>
    <Paragraph position="2"> * Semantic (Type 3): synonyms are found in the variant; the structure may be modified, e.g. kidney function is a Type 3 variant of renal function. This paper deals with Type 1 and Type 2 variations. The two main approaches to multi-word term conflation in IR are text simplification and structural similarity. Text simplification refers to traditional IR algorithms such as (1) deletion of stop words, (2) normalization of single words through stemming, and (3) phrase construction through dictionary matching. (See (Lewis, Croft, and Bhandaru, 1989; Smeaton, 1992) on the exploitation of NLP techniques in IR.) These methods are generally limited. The morphological complexity of the language seems to be a decisive argument for performing rich stemming (Popovič and Willett, 1992). Since we focus on French, a language with rich declensional, inflectional, and derivational morphology, we have chosen the richest and most precise morphological analysis. This is a key component in the recognition of Type 2 variants. For structural similarity, coarse dependency-based NLP methods do not account for the fine structural relations involved in Type 1 variants. For instance, properties of flour should be linked to flour properties and to properties of wheat flour, but not to properties of flour starch (examples are from (Schwarz, 1990)). The last occurrence must be rejected because starch is the argument of the head noun properties, whereas flour is the argument of the head noun properties in the original term. Without careful structural disambiguation over internal phrase structure, these important syntactic distinctions would be overlooked.</Paragraph>
  </Section>
  <Section position="5" start_page="25" end_page="25" type="metho">
    <SectionTitle>
4 Part of Speech Disambiguation and Morphology
</SectionTitle>
    <Paragraph position="0"> First, inflectional morphology is performed in order to obtain the different analyses of word forms. Inflectional morphology is implemented with finite-state transducers on the model used for Spanish (Tzoukermann and Liberman, 1990). The theoretical principles underlying this approach are based on generative morphology (Aronoff, 1976; Selkirk, 1982).</Paragraph>
    <Paragraph position="1"> The system precomputes stems extracted from a large dictionary of French (Boyer, 1993) enhanced with newspaper corpora, for a total of over 85,000 entries.</Paragraph>
    <Paragraph position="2"> Second, a finite-state part of speech tagger (Tzoukermann, Radev, and Gale, 1995; Tzoukermann and Radev, 1996) performs the morpho-syntactic disambiguation of words. The tagger takes the output of inflectional morphological analysis and, through a combination of linguistic and statistical techniques, outputs a unique part of speech for each word in context. Reducing the ambiguity of part of speech tags eliminates ambiguity in local parsing.</Paragraph>
    <Paragraph position="3"> Furthermore, part of speech ambiguity resolution permits construction of correct derivational links.</Paragraph>
    <Paragraph position="4"> Third, derivational morphology (Tzoukermann and Jacquemin, 1997) is applied to generate morphological variants of the disambiguated words. Derivational generation is performed on the lemmas produced by the inflectional analysis and on the part of speech information. Productive stripping and concatenation rules are applied to lemmas.</Paragraph>
    <Paragraph position="5"> The derived forms are expressed as tokens with feature structures 1. For instance, the following set of constraints expresses that the noun modernisateur is morphologically related to the word modernisation 2.</Paragraph>
    <Paragraph position="6"> The &lt;ON&gt; metarule removes the -ion suffix, and the &lt;EUR&gt; rule adds the nominal suffix -eur.</Paragraph>
    <Paragraph position="7">  &lt;reference&gt;.</Paragraph>
    <Paragraph position="8"> &lt;cat&gt; = N
&lt;lemma&gt; = 'modernisation'
&lt;reference&gt; = 52663
&lt;derivation cat&gt; = N
&lt;derivation lemma&gt; = 'modernisateur'
&lt;derivation reference&gt; = 52662
&lt;derivation history&gt; = '&lt;ON&gt;&lt;EUR&gt;'.</Paragraph>
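The stripping and concatenation step illustrated by these constraints can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the rule table RULES, the function names apply_rules and derive, and the dictionary layout are hypothetical and do not reproduce the finite-state implementation used in the system.

```python
# Hypothetical rule table: rule name -> (suffix to strip, suffix to add).
# ON strips -ion and EUR adds -eur, following the example in the text.
RULES = {"ON": ("ion", ""), "EUR": ("", "eur")}


def apply_rules(lemma, rule_names):
    """Apply stripping/concatenation rules in order; return None if one fails."""
    form = lemma
    for name in rule_names:
        strip, add = RULES[name]
        if not form.endswith(strip):
            return None
        form = form[: len(form) - len(strip)] + add
    return form


def derive(lemma, cat, rule_names, derived_cat):
    """Build a feature-structure-like dict linking a lemma to its derived form."""
    derived = apply_rules(lemma, rule_names)
    if derived is None:
        return None
    return {
        "cat": cat,
        "lemma": lemma,
        "derivation": {
            "cat": derived_cat,
            "lemma": derived,
            "history": list(rule_names),
        },
    }


link = derive("modernisation", "N", ["ON", "EUR"], "N")
print(link["derivation"]["lemma"])  # prints: modernisateur
```

Because such rules check only the surface suffix, they overgenerate; the sketch deliberately makes no attempt to validate the semantic link between the two lemmas.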
    <Paragraph position="9">  The morphological analysis performed in this study is detailed in (Tzoukermann, Klavans, and Jacquemin, 1997). It is more complete and linguistically more accurate than simple stemming for the following reasons: * Allomorphy is accounted for by listing the set of possible allomorphs for each word. Allomorphies are obtained through multiple verb stems, e.g. fabriqu-, fabric- (fabricate), or through additional allomorphic rules.</Paragraph>
    <Paragraph position="10"> * Concatenation of several suffixes is accounted for by rule ordering mechanisms. Furthermore, we have devised a method for guessing possible suffix combinations from a lexicon and a corpus. This empirical method, reported in (Jacquemin, 1997), ensures that suffixes which are related within specific domains are considered.</Paragraph>
    <Paragraph position="11"> * Derivational morphology is built with the perspective of overgeneration. The nature of the semantic links between a word and its derivational forms is not checked, and all allomorphic alternants are generated. Selection of the correct links occurs during the subsequent term expansion process, through collocational filtering. Although étable (cowshed) is incorrectly related to établir (to establish), it is very improbable to find a context where établir co-occurs with one of the three words found in the three multi-word terms containing étable: nettoyeur (cleaner), alimentation (feeding), and litière (litter). Since we focus on multi-word term variants, overgeneration does not present a problem in our system.</Paragraph>
  </Section>
  <Section position="6" start_page="25" end_page="26" type="metho">
    <SectionTitle>
5 Transformation-Based Term
Expansion
</SectionTitle>
    <Paragraph position="0"> The extraction of terms and their variants from corpora is performed by a unification-based parser. The controlled terms are transformed into grammar rules whose syntax is similar to PATR-II.</Paragraph>
    <Section position="1" start_page="25" end_page="26" type="sub_section">
      <SectionTitle>
5.1 A Corpus-Based Method for
Discovering Syntactic Transformations
</SectionTitle>
      <Paragraph position="0"> We present a method for inferring transformations from a corpus for the purpose of developing a grammar of syntactic transformations for term variants. To discover the families of term variants, we first consider a notion of collocation which is less restrictive than variation. Then, we refine this notion in order to retain genuine variants and to reject spurious ones. A Type 1 collocation of a binary term is a text window containing its content words w1 and w2, without consideration of the syntactic structure. With such a definition, any Type 1 variant is a Type 1 collocation. Similarly, a notion of Type 2 collocation is defined based on the co-occurrence of w1 and w2, including their derivational relatives.</Paragraph>
      <Paragraph position="1"> A window of d = 5 words is considered sufficient for detecting collocations in English (Martin, Al, and Van Sterkenburg, 1983). We chose a window size twice as large because French is a Romance language with longer syntactic structures due to the absence of compounding, and because we want to be sure to observe structures spanning large textual sequences. For example, the term perte au stockage (storage loss) is encountered in the [AGR] corpus as: pertes occasionnées par les insectes au sorgho stocké (literally: loss of stored sorghum due to the insects). A linguistic classification of the collocations which are correct variants brings up the following families of variations 3.</Paragraph>
      <Paragraph position="2"> * Type 1 variations are classified according to their syntactic structure.</Paragraph>
      <Paragraph position="3"> 1. Coordination: a coordination is the combination of two terms with a common head word or a common argument. Thus, fruits et agrumes tropicaux (literally: tropical citrus fruits or fruits) is a coordination variant of the term fruits tropicaux (tropical fruits).</Paragraph>
      <Paragraph position="4"> 2. Substitution/Modification: a substitution is the replacement of a content word by a term; a modification is the insertion of a modifier without reference to another term. For example, activité thermodynamique de l'eau (thermodynamic activity of water) is a substitution variant of activité de l'eau (activity of water) if activité thermodynamique (thermodynamic activity) is a term; otherwise, it is a modification.</Paragraph>
      <Paragraph position="5"> 3. Compounding/Decompounding: in French, most terms have a compound noun structure, i.e. a noun phrase structure where determiners are omitted, such as consommation d'oxygène (oxygen consumption). The decompounding variation is the transformation of a term with a compound structure into a noun phrase structure such as consommation de l'oxygène (consumption of the oxygen). Compounding is the reciprocal transformation. 3 Variations are generic linguistic functions and variants are transformations of terms by these functions.</Paragraph>
      <Paragraph position="6"> * Type 2 variations are classified according to the nature of the morphological derivation. Often semantic shifts are involved as well (Viegas, Gonzalez, and Longwell, 1996).</Paragraph>
      <Paragraph position="7"> 1. Noun-Noun variations: relations such as result/agent (fixation de l'azote (nitrogen fixation) / fixateurs d'azote (nitrogen fixer)) or container/content (réservoir d'eau (water reservoir) / réserve en eau (water reserve)) are found in this family.</Paragraph>
      <Paragraph position="8">  2. Noun-Verb variations: these variations often involve semantic shifts such as process/result: fixation de l'azote (nitrogen fixation) / fixer l'azote (to fix nitrogen).</Paragraph>
      <Paragraph position="9"> 3. Noun-Adjective variations: the two ways to modify a noun, a prepositional phrase or an adjectival phrase, are generally semantically equivalent, e.g. variation du climat (climate variation) is a synonym of variation climatique (climatic variation).</Paragraph>
      <Paragraph position="10"> A method for term variant extraction based on morphology and simple co-occurrences would be very imprecise. A manual observation of collocations shows that only 55% of the Type 1 collocations are correct Type 1 variants and that only 52% of the Type 2 collocations are correct Type 2 variants. It is therefore necessary to conceive a filtering method for rejecting fortuitous co-occurrences. The following section proposes a filtering system based on syntactic patterns.</Paragraph>
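The window-based notion of collocation defined in this section can be sketched as follows. This is a hedged illustration, not the extraction procedure used in the experiments: the function name find_collocations, the representation of text as a list of lemmas, and the hand-built derivational family for stockage are all assumptions. A Type 1 collocation corresponds to singleton families; a Type 2 collocation corresponds to families that also contain derivational relatives.

```python
def find_collocations(tokens, family1, family2, window=10):
    """Return (i, j) index pairs where one word from each family
    co-occurs inside a span of at most `window` tokens."""
    spans = []
    for i, tok in enumerate(tokens):
        if tok in family1:
            other = family2
        elif tok in family2:
            other = family1
        else:
            continue
        # A span of `window` tokens starting at i ends at i + window - 1.
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[j] in other:
                spans.append((i, j))
    return spans


# Assumed lemmatized form of: pertes occasionnées par les insectes
# au sorgho stocké (the lemmatization is illustrative only).
tokens = ["perte", "occasionner", "par", "le", "insecte", "à", "sorgho", "stocker"]

# Hypothetical derivational family linking stockage to stocker.
hits = find_collocations(tokens, {"perte"}, {"stockage", "stocker"})
print(hits)  # prints: [(0, 7)]
```

With a window of 5 words the same occurrence is missed, which illustrates why a larger window is useful here. Every pair returned by such a procedure is only a candidate; the sketch performs none of the syntactic filtering that the precision figures above show to be necessary.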
    </Section>
  </Section>
  <Section position="7" start_page="26" end_page="27" type="metho">
    <SectionTitle>
6 Empirical Rule Tuning
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="26" end_page="27" type="sub_section">
      <SectionTitle>
6.1 Syntactic Transformations for Type 1 and Type 2 Variants
</SectionTitle>
      <Paragraph position="0"> The concept of a grammar of syntactic transformations is motivated by well-known observations on the behavior of collocations in context (e.g. (Harris et al., 1989)). Initial rules based on surface syntax are refined through incremental experimental tuning.</Paragraph>
      <Paragraph position="1"> We have devised a grammar of French to serve as a basis for the creation of metarules for term variants. For example, the noun phrase expansion rule is 4:</Paragraph>
      <Paragraph position="3"> 4 We use UNIX regular expression symbols for rules and transformations.</Paragraph>
      <Paragraph position="4">  From this rule a set of expansions can be generated:</Paragraph>
      <Paragraph position="6"> In order to balance completeness and accuracy, expansions are limited. After the initial expansion is created for a range of structures, empirical tuning is applied to create a set of maximum coverage metarules. We briefly illustrate this process for coordination. For this example, we restrict transformations to terms with N P N structures, which represent a full 33% of the binary terms. Examples of metarules of Type 1 and Type 2 variations are given in Table 1.</Paragraph>
    </Section>
    <Section position="2" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
6.2 Development of a Coordination Transformation for N P N Terms
</SectionTitle>
      <Paragraph position="0"> The coordination types are first calculated by combining the pattern N1 P2 N3 with possible expansions of a noun phrase with a simple paradigmatic structure:</Paragraph>
      <Paragraph position="2"> The first parenthesis (C A? N A? P) represents a coordinated head noun; the second (A C P) and third (P D? A? N A? C P?) represent respectively an adjective phrase and a prepositional phrase coordinated with the prepositional phrase of the original term.</Paragraph>
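To make the regular-expression reading of these three groups concrete, the following Python sketch matches a sequence of part of speech tags against them. It rests on several assumptions that are not given in the text: tags are encoded as single letters (N noun, A adjective, P preposition, D determiner, C coordinating conjunction), and the three groups are taken as alternatives inserted between the head noun N1 and the final noun N3. The names COORD_NPN and is_coordination_variant are hypothetical.

```python
import re

# The three alternatives described in the text, written over
# single-letter tags (an assumed encoding).
HEAD_COORD = "CA?NA?P"    # coordinated head noun
ADJ_COORD = "ACP"         # coordinated adjective phrase
PP_COORD = "PD?A?NA?CP?"  # coordinated prepositional phrase

# Assumed composition: N1, one of the three inserts, then N3.
COORD_NPN = re.compile(
    "N(?:" + HEAD_COORD + "|" + ADJ_COORD + "|" + PP_COORD + ")N"
)


def is_coordination_variant(tags):
    """Return True if the whole tag sequence matches the assumed pattern."""
    return COORD_NPN.fullmatch("".join(tags)) is not None


print(is_coordination_variant(["N", "C", "N", "P", "N"]))            # True
print(is_coordination_variant(["N", "P", "D", "N", "C", "P", "N"]))  # True
print(is_coordination_variant(["N", "P", "N"]))                      # False
```

The last call shows that the original N P N term itself is not accepted, since each alternative requires a conjunction. The sketch checks tag sequences only; it does not verify that the nouns matched are the content words of the controlled term, which a real metarule would also require.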
      <Paragraph position="3"> Variants were extracted on the [ECI] corpus through this transformation; the following observations and changes have been made.</Paragraph>
      <Paragraph position="4"> First, coordination accepts a substitution which replaces the noun N3 with a noun phrase D? A? N3. For example, the variant température et humidité initiale de l'air (temperature and initial humidity of the air) is a coordination where a determiner precedes the last noun (air).</Paragraph>
      <Paragraph position="5"> Secondly, the observations of coordination variants also suggest that the coordinating conjunction can be preceded by an optional comma and followed by an optional adverb, e.g. la production, et surtout la diffusion des semences (the production, and particularly the distribution of the seeds).</Paragraph>
      <Paragraph position="6"> Thirdly, variants such as de l'humidité et de la vitesse de l'air (literally: of humidity and of the speed of the air) indicate that the conjunction can be followed by an optional preposition and an optional determiner.</Paragraph>
      <Paragraph position="7"> 5 Subscripts represent indexing.</Paragraph>
      <Paragraph position="8"> The three preceding changes are made to expression (3), and the resulting transformation is given in the first line of Table 1 (changes are underlined). Our empirical selection of valid metarules is guided by linguistic considerations and corpus observations. This mode of grammar conception has led us to the following decisions: * reject linguistic phenomena which could not be accounted for by regular expressions, such as sentential complements of nouns; * reject noisy and inaccurate variations such as long distance dependencies (specifically within a verb phrase); * focus on productive and safe variations which are felicitously represented in our framework.</Paragraph>
      <Paragraph position="9"> Accounting for variants which are not considered in our framework would require the conception of a novel framework, probably in cooperation with a deeper analyzer. It is unlikely that our transformational approach with regular expressions could do much better than the results presented here. Table 2 shows some variants of AGROVOC terms extracted from the [AGR] corpus.</Paragraph>
    </Section>
  </Section>
</Paper>