<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1097">
  <Title>Improving Automatic Indexing through Concept Combination and Term Enrichment</Title>
  <Section position="4" start_page="595" end_page="597" type="metho">
    <SectionTitle>
3 Conceptual Phrase Building
</SectionTitle>
    <Paragraph position="0"> The indexes extracted at the preceding step are text chunks that generally form a correct syntactic structure: verb phrases for verbalizations and noun phrases otherwise. When they overlap, these indexes can be combined and replaced by their head words so as to condense and structure the documents. This process is the reverse of the noun phrase decomposition described in (Habert et al., 1996).</Paragraph>
    <Paragraph position="1"> The purpose of automatic indexing entails the following characteristics of indexes:
* frequently, indexes overlap or are embedded one in another (with [AGR-CAND], 35% of the indexes overlap with another one and 37% of the indexes are embedded in another one; with [AGROVOC], the rates are respectively 13% and 5%),
* generally, indexes cover only a small fraction of the parsed sentence (with [AGR-CAND], the indexes cover, on average, 15% of the surface; with [AGROVOC], the average coverage is 3%),
* generally, indexes do not correspond to maximal structures and only include part of the arguments of their head word.</Paragraph>
    <Paragraph position="2"> Because of these characteristics, the construction of a syntactic structure from indexes is like solving a puzzle with only part of the clues, and with a certain overlap between these clues.</Paragraph>
    <Section position="1" start_page="595" end_page="597" type="sub_section">
      <SectionTitle>
Text Structuring
</SectionTitle>
      <Paragraph position="0"> The construction of the structure consists of the following 3 steps:
Step 1. The syntactic head of terms is determined by a simple noun phrase grammar of the language under study. For French, the following regular expression covers 98% of the term structures in the database [AGROVOC] (Mod is any adjectival modifier and the syntactic head is the noun in bold face):
Mod* N N? (Mod | (Prep Art? Mod* N N? Mod*))*
The second source of knowledge about syntactic heads is embodied in transformations. For instance, the syntactic head of the verbalization in (1) is the verb in bold typeface.</Paragraph>
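      <Paragraph> The head determination of Step 1 can be sketched as a regular expression over part-of-speech tags. The sketch below assumes that terms are already tagged with the labels Mod, N, Prep and Art used in the expression, and that the head is the first N matched, which is an assumption of this sketch rather than something stated above. The names TERM_PATTERN and find_head are hypothetical.

```python
import re

# Step 1 expression over a tag string in which every tag is followed by one
# space, so that "Mod ", "N ", "Prep " and "Art " are matched as whole units.
# Group 1 captures the first N, which this sketch assumes to be the head.
TERM_PATTERN = re.compile(
    r"(?:Mod )*(N )(?:N )?"
    r"(?:Mod |Prep (?:Art )?(?:Mod )*N (?:N )?(?:Mod )*)*"
)


def find_head(tagged_term):
    """Return the head word of a term, or None if the expression rejects it.

    tagged_term is a list of (word, tag) pairs.
    """
    tags = "".join(tag + " " for _, tag in tagged_term)
    match = TERM_PATTERN.fullmatch(tags)
    if match is None:
        return None
    # The number of spaces before group 1 is the token index of the head.
    head_index = tags[: match.start(1)].count(" ")
    return tagged_term[head_index][0]


term = [("nappe", "N"), ("d'", "Prep"), ("eau", "N"), ("souterraine", "Mod")]
print(find_head(term))
```

For the tagged term nappe d'eau souterraine, the sketch returns nappe. A tag sequence that does not start with optional modifiers followed by a noun is rejected and yields None.</Paragraph>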
      <Paragraph position="1"> Step 2. A partial relation between the indexes of a sentence is now defined in order to rank in priority the indexes that should be grouped first into structures (the most deeply embedded ones). This definition relies on the relative spatial positions of two indexes i and j and their syntactic heads H(i) and H(j): Definition 3.1 (Index priority) Let i and j be two indexes in the same sentence. The relative priority ranking of i and j is:</Paragraph>
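      <Paragraph> A formula consistent with the two conditions that the case analysis attributes to Definition 3.1 (head embedding and argument embedding) can be written as the disjunction below. This is a reconstruction under that assumption: the symbol $\mathcal{R}$ is taken to denote the priority relation, and $i \mathcal{R} j$ is read as "i has priority over j".

```latex
i \,\mathcal{R}\, j \iff
  \bigl(H(i) = H(j) \land i \subseteq j\bigr)
  \lor
  \bigl(H(i) \neq H(j) \land H(i) \in j \land H(j) \notin i\bigr)
```

Under this reading, the first disjunct holds whenever i = j, so the relation is reflexive.</Paragraph>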
      <Paragraph position="3"> This relation is obviously reflexive. It is neither transitive nor antisymmetric. It can, however, be shown that this relation is not cyclic for 3 elements: $i \mathcal{R} j \land j \mathcal{R} k \Rightarrow \neg(k \mathcal{R} i)$. (This property is not demonstrated here, due to the lack of space.) The linguistic motivations of Definition 3.1 are linked to the composite structure built at Step 3 according to the relative priorities stated by $\mathcal{R}$. We now examine, in turn, the 4 cases of term overlap:
1. Head embedding: 2 indexes i and j, with a common head word and such that i is embedded into j, build a 2-level structure:</Paragraph>
      <Paragraph position="5"> This structuring is illustrated by nappe d'eau (sheet of water) which combines with nappe d'eau souterraine (underground sheet of water) and produces the 2-level structure [[nappe d'eau] souterraine] ([underground [sheet of water]]). (Head words are underlined.) In this case, i has a higher priority than j; it corresponds to $(H(i) = H(j) \land i \subseteq j)$ in Definition 3.1.</Paragraph>
      <Paragraph position="6"> 2. Argument embedding: 2 indexes i and j, with different head words and such that the head word of i belongs to j and the head word of j does not belong to i, combine as follows:</Paragraph>
      <Paragraph position="8"> This structuring is illustrated by nappe d'eau which combines with eau souterraine (underground water) and produces the structure [nappe d'[eau souterraine]] ([sheet of [underground water]]). Here, i has a higher priority than j; it corresponds to $(H(i) \neq H(j) \land H(i) \in j \land H(j) \notin i)$ in Definition 3.1.</Paragraph>
      <Paragraph position="9"> 3. Head overlap: 2 indexes i and j, with a common head word and such that i and j partially overlap, are also combined at Step 3 by making j a substructure of i. This combination is, however, non-deterministic since no priority ordering is defined between these 2 indexes. Therefore, it does not correspond to a condition in Definition 3.1.</Paragraph>
      <Paragraph position="10"> In our experiments, this structure corresponds to only one situation: a head word with pre- and post-modifiers such as importante activité (intense activity) and activité de dégradation métabolique (activity of metabolic degradation).</Paragraph>
      <Paragraph position="11"> With [AGR-CAND], this configuration is encountered only 27 times (0.1% of the index overlaps) because premodifiers rarely build correct term occurrences in French. Premodifiers generally correspond to occasional characteristics such as size, height, rank, etc.</Paragraph>
      <Paragraph position="12"> 4. The remaining case of overlapping indexes with different head words and reciprocal inclusions of head words is never encountered. Its presence would undeniably denote a flaw in the computation of head words.</Paragraph>
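      <Paragraph> The four cases above can be told apart mechanically once each index is reduced to the token positions it covers and the position of its head word. The sketch below assumes that representation. The class Index and the functions has_priority and overlap_case are hypothetical names, and the "unclassified" label covers overlaps that fall outside the four cases discussed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Index:
    """An index occurrence in a sentence (hypothetical representation)."""
    positions: frozenset  # token positions covered by the index
    head: int             # token position of its head word H(i)


def has_priority(i, j):
    """True when i must be grouped before j (the two prioritized cases)."""
    if i.head == j.head:
        return i.positions.issubset(j.positions)  # head embedding
    # argument embedding: H(i) belongs to j and H(j) does not belong to i
    return i.head in j.positions and j.head not in i.positions


def overlap_case(i, j):
    """Name the overlap configuration of two indexes of the same sentence."""
    if i.positions.isdisjoint(j.positions):
        return "no overlap"
    if i.head == j.head:
        nested = i.positions.issubset(j.positions) or j.positions.issubset(i.positions)
        return "head embedding" if nested else "head overlap"
    i_head_in_j = i.head in j.positions
    j_head_in_i = j.head in i.positions
    if i_head_in_j and j_head_in_i:
        return "reciprocal inclusion"  # case 4, reported as never encountered
    if i_head_in_j or j_head_in_i:
        return "argument embedding"
    return "unclassified"  # overlap outside the four cases discussed


# Token positions in "nappe d' eau souterraine": 0 1 2 3
nappe_eau = Index(frozenset({0, 1, 2}), head=0)
nappe_eau_sout = Index(frozenset({0, 1, 2, 3}), head=0)
eau_sout = Index(frozenset({2, 3}), head=2)

print(overlap_case(nappe_eau, nappe_eau_sout), has_priority(nappe_eau, nappe_eau_sout))
print(overlap_case(eau_sout, nappe_eau), has_priority(eau_sout, nappe_eau))
```

In the head overlap case, has_priority is false in both directions, which reflects the absence of a priority ordering between the two indexes.</Paragraph>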
      <Paragraph position="13"> Step 3. A bottom-up structure of the sentences is incrementally built by replacing indexes by trees. The indexes ranked highest at Step 2 are processed first according to the following bottom-up algorithm:
1. build a depth-1 tree whose daughter nodes are all the words in the current sentence and whose head node is S,
2. for all the indexes i in the current sentence, selected by decreasing order of priority,
(a) mark all the depth-1 nodes which are a lexical leaf of i or which are the head node of a tree with at least one leaf in i,
(b) replace all the marked nodes by a unique tree whose head features are the features of H(i), and whose depth-1 leaves are all the marked nodes.</Paragraph>
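      <Paragraph> The bottom-up algorithm of Step 3 can be sketched as follows. This is a minimal illustration and not the system's implementation. It assumes that indexes are given as (positions, head) pairs already sorted by decreasing priority, that head features are reduced to the position of the head word, and that the new tree is placed where the first marked node stood. The names Node, build_structure and bracket are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # nodes are compared by identity, not by content
class Node:
    head: int                                     # token position of the head word
    children: list = field(default_factory=list)  # empty for a lexical leaf

    def leaves(self):
        """Token positions of all lexical leaves under this node."""
        if not self.children:
            return {self.head}
        covered = set()
        for child in self.children:
            covered |= child.leaves()
        return covered


def build_structure(n_tokens, ordered_indexes):
    """Return the daughters of S after grouping the indexes.

    ordered_indexes lists (positions, head) pairs, highest priority first.
    """
    top = [Node(p) for p in range(n_tokens)]  # 1. depth-1 tree under S
    for positions, head in ordered_indexes:   # 2. indexes by decreasing priority
        # (a) mark leaves of the index and trees with a leaf in the index
        marked = [n for n in top if not n.leaves().isdisjoint(positions)]
        if not marked:
            continue
        # (b) replace the marked nodes by one tree headed by H(i); placing it
        # where the first marked node stood is a choice of this sketch
        at = top.index(marked[0])
        top = [n for n in top if n not in marked]
        top.insert(at, Node(head, marked))
    return top


def bracket(node, words):
    """Render a node with one pair of brackets per tree."""
    if not node.children:
        return words[node.head]
    return "[" + " ".join(bracket(c, words) for c in node.children) + "]"


words = ["nappe", "d'", "eau", "souterraine"]
# eau souterraine (head: eau) has priority over nappe d'eau (head: nappe)
top = build_structure(len(words), [({2, 3}, 2), ({0, 1, 2}, 0)])
print(" ".join(bracket(n, words) for n in top))
```

On the argument embedding example, the sketch prints [nappe d' [eau souterraine]], which matches the bracketing given for that case up to tokenization.</Paragraph>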
      <Paragraph position="14"> When considering the sentence given in Table 1, the ordering of the indexes after Step 2 is the following: i2 &gt; i5, i6 &gt; i2, and i4 &gt; i3. (They all result from the argument embedding relation.) The algorithm yields the following structure of the sample sentence:
[Figure: tree structure of the sample sentence. Node labels: ...la respiration et ses rapports avec l'humidité ont été analysées; respiration du sol; humidité et la température; analysées dans le sol; température du sol; sol superficiel d'une forêt; forêt tropicale]</Paragraph>
    </Section>
    <Section position="2" start_page="597" end_page="597" type="sub_section">
      <SectionTitle>
Text Condensation
</SectionTitle>
      <Paragraph position="0"> The text structure resulting from this algorithm condenses the text and brings closer words that would otherwise remain separated by a large number of arguments or modifiers. Because of this condensation, a reindexing of the structured text yields new indexes which are not extracted at the first step.</Paragraph>
      <Paragraph position="1"> Let us illustrate the gains from reindexing on a sample utterance: l'évolution au cours du temps du sol et des rendements (temporal evolution of soils and productivity). At the first step of indexing, évolution au cours du temps (lit. evolution over time) is recognized as a variant of évolution dans le temps (lit. evolution with time). At the second step of indexing, the daughter nodes of the top-most tree build the condensed text: l'évolution du sol et des rendements (evolution of soils and productivity):
[Figure: tree whose top-level nodes read l'évolution du sol et des rendements, with the subtree l'évolution au cours du temps]
This condensed text allows for another index extraction: évolution du sol et des rendements, a Coordination variant of évolution du rendement (evolution of productivity). This index was not visible at the first step because of the additional modifier au cours du temps (temporal). (Reiterated indexing is preferable to overly unconstrained transformations, which burden the system with spurious indexes.)
Both processes--text structuring, presented here, and term acquisition, described in (Jacquemin, 1996)--reinforce each other. On the one hand, acquisition of new terms increases the volume of indexes and thereby improves text structuring by decreasing the non-conceptual surface of the text. On the other hand, text condensation triggers the extraction of new indexes, and thereby furnishes new possibilities for the acquisition of terms.</Paragraph>
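      <Paragraph> The condensation of the sample utterance can be sketched by reducing each daughter of the top-most tree to its head word. The encoding of a node as either a word or a (head word, daughters) pair, the tokenization of the utterance, and the names head_word and condense are assumptions of this sketch.

```python
def head_word(node):
    """A node is a word (str) or a (head_word, daughters) pair (assumed encoding)."""
    return node if isinstance(node, str) else node[0]


def condense(daughters):
    """Condensed text: each daughter of the top-most tree is reduced to its head."""
    return " ".join(head_word(node) for node in daughters)


# Daughters of the top-most tree after structuring the sample utterance;
# the tokenization is an assumption of this sketch.
top = [
    "l'",
    ("évolution", ["évolution", "au", "cours", "du", "temps"]),
    "du", "sol", "et", "des", "rendements",
]
print(condense(top))
```

The index évolution au cours du temps is reduced to its head évolution, so that the condensed string can be submitted to a second pass of index extraction.</Paragraph>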
    </Section>
  </Section>
</Paper>