<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1003">
  <Title>Parsing rule: Noun1 Prep Noun2 Adj -> Parse (1): Head: Noun1; Exp.: Noun2 Adj (Head: Noun2, Exp.: Adj). Parse (2): Head: Noun1 Prep Noun2 (Head: Noun1, Exp.: Noun2); Exp.: Adj</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Computational Terminology
</SectionTitle>
    <Paragraph position="0"> In the domain of corpus-based terminology, two types of tools are currently being developed: tools for automatic term extraction (Bourigault, 1993; Justeson and Katz, 1995; Daille, 1996; Brun, 1998) and tools for automatic thesaurus construction (Grefenstette, 1994). These tools are expected to be complementary in the sense that the links and clusters proposed in automatic thesaurus construction can be exploited for structuring the term candidates produced by the automatic term extractors. In practice, this complementarity is difficult to achieve because term extractors provide mainly multi-word terms, while tools for automatic thesaurus construction yield clusters of single-word terms.</Paragraph>
    <Paragraph position="1"> On the one hand, term extractors focus on multi-word terms for ontological motivations: single-word terms are too polysemous and too generic, so it is necessary to provide the user with multi-word terms that represent finer concepts in a domain. The counterpart of this focus is that automatic term extractors yield large volumes of data that require structuring through a postprocessor. On the other hand, tools for automatic thesaurus construction focus on single-word terms for practical reasons. Since they cluster terms through statistical measures of context similarity, these tools exploit recurring situations. Since single-word terms denote broader concepts than multi-word terms, they appear more frequently in corpora and are therefore more appropriate for statistical clustering.</Paragraph>
    <Paragraph position="2"> The contribution of this paper is to propose an integrated platform for computer-aided term extraction and structuring that results from the combination of LEXTER, a Term Extraction tool (Bourigault et al., 1996), and FASTR, a Term Normalization tool (Jacquemin et al., 1997).</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Components of the Platform for
Computer-Aided Terminology
</SectionTitle>
    <Paragraph position="0"> The platform for computer-aided terminology is organized as a chain of four modules, whose flowchart is given in Figure 1. The modules are: POS tagging First, the corpus is processed by Sylex, a part-of-speech tagger. Each word is unambiguously tagged and receives a single lemma.</Paragraph>
    <Paragraph position="1"> Term Extraction LEXTER, the term extraction tool, acquires term candidates from the tagged corpus. In a first step, LEXTER exploits the part-of-speech categories to extract maximal-length noun phrases. It relies on boundary markers together with a shallow grammar of noun phrases. In a second step, LEXTER recursively decomposes these maximal-length noun phrases into two syntactic constituents (Head and Expansion).</Paragraph>
    <Paragraph position="2"> Term Clustering The term clustering tool groups the term candidates produced at the preceding step through a self-indexing procedure followed by a graph-based classification. This task is performed by FASTR, a term normalizer that has been adapted to the task at hand. (FASTR can be downloaded at www.limsi.fr/Individu/jacquemi/FASTR.)</Paragraph>
    <Paragraph position="3"> Figure 1: Flowchart of the platform for computer-aided terminology.</Paragraph>
    <Paragraph position="4"> Validation The last step of thesaurus construction is the validation of automatically extracted clusters of term candidates by a terminologist and a domain expert. The validation is performed through a database interface. The links are automatically updated through the entire base, and a structured thesaurus is progressively constructed.</Paragraph>
    <Paragraph position="5"> The following sections provide more details about the components and evaluate the quality of the terms thus extracted.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="16" type="metho">
    <SectionTitle>
3 Term Extraction
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Term Extraction for the French
Language
</SectionTitle>
      <Paragraph position="0"> Term extraction tools perform statistical and/or syntactic analysis of text corpora in specialized technical or scientific domains. Term candidates correspond to sequences of words (most often noun phrases) that are likely to be terminological units. These candidates are ultimately validated as entries by a terminologist in charge of building a thesaurus. LEXTER, the term extractor, is applied to the French language.</Paragraph>
      <Paragraph position="1"> Since French is a Romance language, the syntactic structure of terms and compounds is very similar to the structure of non-compound and non-terminological noun phrases. For instance, in French, terms can contain prepositional phrases with determiners, such as: paroi/Noun de/Prep l'/Det uretère/Noun (ureteral wall). Because of this similarity, the detection of terms and their variants in French is more difficult than in English. The input of our term extraction tool is an unambiguously tagged corpus. The extraction process is composed of two main steps: Splitting and Parsing.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="16" type="sub_section">
      <SectionTitle>
3.2 Splitting
</SectionTitle>
      <Paragraph position="0"> The shallow parsing techniques implemented in the Splitting module detect morpho-syntactic patterns that cannot be parts of terminological noun phrases and that are therefore likely to indicate noun phrase boundaries. Splitting techniques are used in other shallow parsers such as (Grefenstette, 1992). In the case of LEXTER, the noun phrases isolated by splitting are not intermediary data; they are not used by any other automatic module to index or classify documents. The extracted noun phrases are term candidates which are proposed to the user.</Paragraph>
      <Paragraph position="1"> In such a situation, splitting must be performed with high precision.</Paragraph>
      <Paragraph position="2"> In order to correctly process some problematic splittings, such as coordination, attributive past participles, and preposition + determiner sequences, the system acquires and uses corpus-based selectional restrictions of adjectives and nouns (Bourigault et al., 1996).</Paragraph>
      <Paragraph position="3"> For example, in order to disambiguate PP attachments, the system possesses a corpus-based list of adjectives which accept a prepositional argument built with the preposition à (at). These selectional restrictions are acquired through Corpus-Based Endogenous Learning (CBEL) as follows: during a first pass, all the adjectives in a predicative position followed by the preposition à are collected. During a second pass, each time a splitting rule has eliminated a sequence beginning with the preposition à, the preceding adjective is discarded from the list. Empirical analyses confirm the validity of this procedure. More complex CBEL procedures are implemented in LEXTER in order to acquire nouns sub-categorizing the preposition à or the preposition sur (on), adjectives sub-categorizing the preposition de (of), past participles sub-categorizing the preposition de (of), etc.</Paragraph>
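      <Paragraph> The two-pass CBEL procedure can be sketched as a simple set difference (an illustrative reduction; the observation and elimination lists would in reality be produced by the tagger and the splitting rules):

```python
# Sketch of two-pass Corpus-Based Endogenous Learning (CBEL).
# Pass 1 collects adjectives seen in predicative position before the
# preposition "à"; pass 2 discards any adjective whose following
# "à ..." sequence was eliminated by a splitting rule.

def cbel_adjectives(pass1_observations, pass2_eliminations):
    candidates = set(pass1_observations)         # collected in pass 1
    return candidates - set(pass2_eliminations)  # pruned in pass 2

# hypothetical data: "conforme" and "identique" survive both passes
kept = cbel_adjectives(
    pass1_observations=["conforme", "identique", "présent"],
    pass2_eliminations=["présent"],
)
```

The surviving adjectives are then treated as sub-categorizing à, which licenses keeping a following à-phrase inside the noun phrase instead of splitting it off.</Paragraph>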
      <Paragraph position="4"> Ultimately, the Splitting module produces a set of text sequences, mostly noun phrases, which we  refer to as Maximal-Length Noun Phrases (henceforth MLNP).</Paragraph>
    </Section>
    <Section position="3" start_page="16" end_page="16" type="sub_section">
      <SectionTitle>
3.3 Parsing
</SectionTitle>
      <Paragraph position="0"> The Parsing module recursively decomposes the maximal-length noun phrases into two syntactic constituents: a constituent in head position (e.g. bronchial cell in the noun phrase cylindrical bronchial cell, and cell in the noun phrase bronchial cell), and a constituent in expansion position (e.g. cylindrical in the noun phrase cylindrical bronchial cell, and bronchial in the noun phrase bronchial cell). The Parsing module exploits rules in order to extract two subgroups from each MLNP, one in head position and the other in expansion position. Most MLNP sequences are ambiguous: two (or more) binary decompositions compete, corresponding to several possibilities of prepositional phrase or adjective attachment. The disambiguation is performed by a corpus-based method which relies on endogenous learning procedures (Bourigault, 1993; Ratnaparkhi, 1998). An example of such a procedure is given in Figure 2.</Paragraph>
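      <Paragraph> The recursive binary decomposition can be sketched as follows (a simplified illustration assuming English-like modifier-head order and ignoring the attachment disambiguation that LEXTER performs):

```python
# Sketch of recursive Head/Expansion decomposition of a noun phrase.
def decompose(np_words):
    """Return the NP and all constituents obtained by recursively
    splitting off the leftmost modifier (the Expansion) from the
    remainder (the Head)."""
    if len(np_words) <= 1:
        return [np_words]
    head = np_words[1:]        # constituent in head position
    expansion = np_words[:1]   # constituent in expansion position
    return [np_words] + decompose(head) + decompose(expansion)

candidates = decompose(["cylindrical", "bronchial", "cell"])
```

For cylindrical bronchial cell this yields the constituents bronchial cell and cell in head position, and cylindrical and bronchial in expansion position, matching the example above.</Paragraph>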
    </Section>
    <Section position="4" start_page="16" end_page="16" type="sub_section">
      <SectionTitle>
3.4 Network of term candidates
</SectionTitle>
      <Paragraph position="0"> The sub-groups generated by the Parsing module, together with the maximal-length noun phrases extracted by the Splitting module, are the term candidates produced by the Term extraction tool.</Paragraph>
      <Paragraph position="1"> This set of term candidates is represented as a network: each multi-word term candidate is connected to its head constituent and to its expansion constituent by syntactic decomposition links. An excerpt of a network of term candidates is given in Figure 3. Vertical and horizontal links are syntactic decomposition links produced by the Term Extraction tool. The oblique link is a syntactic variation link added by the Term Clustering tool.</Paragraph>
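      <Paragraph> The network can be sketched as a set of labelled decomposition links (the function name and data layout are illustrative assumptions):

```python
# Sketch: connect each multi-word candidate to its head and expansion
# constituents through labelled syntactic decomposition links.
def build_network(decompositions):
    """decompositions: {term: (head, expansion)}; returns link triples."""
    links = []
    for term, (head, expansion) in decompositions.items():
        links.append((term, "head", head))
        links.append((term, "expansion", expansion))
    return links

network = build_network({
    "cylindrical bronchial cell": ("bronchial cell", "cylindrical"),
    "bronchial cell": ("cell", "bronchial"),
})
```

Paradigmatic series then fall out directly: collecting all links that share the same third element under the label "head" lists the candidates built on the same head term.</Paragraph>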
      <Paragraph position="2"> The building of the network is especially important for the purpose of term acquisition. The average number of multi-word term candidates is 8,000 for a 100,000-word corpus. The feedback from several experiments in which our Term Extraction tool was used shows that the more structured the set of term candidates is, the more efficiently the validation task is performed. For example, the structuring through syntactic decomposition allows the system to highlight lists of terms that share the same term either in head position or in expansion position. Such paradigmatic series are frequent in term banks, and initiating the validation task by analyzing such lists appears to be a very efficient validation strategy.</Paragraph>
      <Paragraph position="3"> This paper proposes a novel technique for enriching the network of term candidates through  the addition of syntactic variation links to syntactic decomposition links.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="16" end_page="18" type="metho">
    <SectionTitle>
4 Term Clustering
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="16" end_page="17" type="sub_section">
      <SectionTitle>
4.1 Adapting a Normalization Tool
</SectionTitle>
      <Paragraph position="0"> Term normalization is a procedure used in automatic indexing for conflating various term occurrences into unique canonical forms. More or less linguistically-oriented techniques are used in the literature for this task. Basic procedures such as (Dillon and Gray, 1983) rely on function word deletion, stemming, and alphabetical word reordering. For example, the index library catalogs is transformed into catalog librar through such simplification techniques.</Paragraph>
      <Paragraph position="1"> In the platform presented in this paper, term normalization is performed by FASTR, a shallow transformational parser which uses linguistic knowledge about the possible morpho-syntactic transformations of canonical terms (Jacquemin et al., 1997). Through this technique, syntactically and morphologically related occurrences, such as stabilisation de prix (price stabilization) and stabiliser leurs prix (stabilize their prices), are conflated. Term variant extraction in FASTR differs from previous work such as (Evans et al., 1991) because it relies on a shallow syntactic analysis of term variations instead of window-based measures of term overlaps. In (Sparck Jones and Tait, 1984) a knowledge-intensive technique is proposed for extracting term variations. This approach has however never been applied to large-scale term extraction because it is based on a full semantic analysis of sentences. Our approach is more realistic because it does not involve large-scale knowledge-intensive interpretation of texts.</Paragraph>
      <Paragraph position="2"> Our approach to the clustering of term candidates is to group the output of LEXTER by conflating term candidates with other term candidates, instead of conflating corpus occurrences with controlled terms. Our technique can be seen as a kind of self-indexing in which term candidates are indexed by themselves through FASTR, for the purpose of conflating candidates that are variants of each other. Thus, the term candidate cellule bronchique cylindrique (cylindrical bronchial cell) is a variant of the other candidate cellule cylindrique (cylindrical cell) because an adjectival modifier is inserted in the first term. Through the self-indexing procedure these two candidates belong to the same cluster.</Paragraph>
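      <Paragraph> The self-indexing procedure can be sketched by pairing every candidate with every other and drawing a variation link whenever one candidate is the other with extra modifiers inserted (a crude stand-in for FASTR's transformation rules, which also constrain the categories of the inserted words):

```python
# Sketch of self-indexing: link candidate pairs related by insertion.
def is_insertion_variant(short, long):
    """True if `long` contains all words of `short` in order, plus at
    least one inserted word (e.g. cellule cylindrique ->
    cellule bronchique cylindrique)."""
    it = iter(long)
    return len(long) > len(short) and all(w in it for w in short)

def self_index(candidates):
    return [(a, b) for a in candidates for b in candidates
            if is_insertion_variant(a.split(), b.split())]

links = self_index(["cellule cylindrique",
                    "cellule bronchique cylindrique",
                    "paroi de l'uretère"])
```

With this toy input, the only link produced connects cellule cylindrique to its insertion variant cellule bronchique cylindrique.</Paragraph>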
    </Section>
    <Section position="2" start_page="17" end_page="17" type="sub_section">
      <SectionTitle>
4.2 Types of Syntactic Variation Rules
</SectionTitle>
      <Paragraph position="0"> Because of this original framework, specific variation patterns were designed in order to capture inter-term variations. In this study, we restrict ourselves to syntactic variations and ignore morphological modifications. The variation patterns can be classified into the following two families: Internal insertion of modifiers The insertion of one or more modifiers inside a noun phrase structure. For instance the following transformation NAInsAj:</Paragraph>
      <Paragraph position="2"> Noun1 Adj2 -> Noun1 (Adj){1,3} Adj2 describes the insertion of one to three adjectival modifiers inside a Noun-Adjective structure in French. Through this transformation, the term candidate cellule bronchique cylindrique (cylindrical bronchial cell) is recognized as a variant of the term candidate cellule cylindrique (cylindrical cell). Other internal modifications account for adverbial and prepositional modifiers.</Paragraph>
      <Paragraph position="3"> Preposition switch &amp; determiner insertion In French, terms, compounds, and noun phrases have comparable structures: generally a head noun followed by adjectival or prepositional modifiers. Such terms may vary through lexical changes without significant structural modifications. For example, the transformation NPNSynt: Noun1 Prep Noun2 -> Noun1 (Prep (Det)?)? Noun2 accounts for preposition suppressions such as fibre de collagène / fibre collagène (collagen fiber), additions of determiners, and/or preposition switches such as revêtement de surface / revêtement en surface (surface coating). The complete rule set is shown in Table 1. Each transformation given in the first column conflates the term structure given in the second column with the term structure given in the third column.</Paragraph>
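      <Paragraph> Over a flat word/TAG encoding, the two rule families can be approximated with regular expressions (the encoding and tag letters are assumptions for illustration; FASTR itself uses richer transformation rules):

```python
import re

# NAInsAj: Noun Adj matched against Noun (Adj){1,3} Adj -> insertion
NA_INS_AJ = re.compile(r"^(\S+)/N(?: \S+/A){1,3} (\S+)/A$")
# NPNSynt: Noun1 Prep Noun2 matched against Noun1 (Prep Det?)? Noun2
NPN_SYNT = re.compile(r"^(\S+)/N(?: \S+/P(?: \S+/D)?)? (\S+)/N$")

def conflate_npn(tagged):
    """Reduce an NPN-family occurrence to its (Noun1, Noun2) pair so
    that preposition switches and suppressions conflate."""
    m = NPN_SYNT.match(tagged)
    return (m.group(1), m.group(2)) if m else None

ins = NA_INS_AJ.match("cellule/N bronchique/A cylindrique/A")
a = conflate_npn("revêtement/N de/P surface/N")
b = conflate_npn("revêtement/N en/P surface/N")
```

Here ins recovers the base pair (cellule, cylindrique), and a equals b, so the preposition switch de / en is conflated.</Paragraph>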
    </Section>
    <Section position="3" start_page="17" end_page="18" type="sub_section">
      <SectionTitle>
4.3 Clustering
</SectionTitle>
      <Paragraph position="0"> The output of FASTR is a set of links between pairs of term candidates in which the target candidate is a variant of the source candidate. In order to facilitate the validation of links by the expert, this output is converted into clusters of term candidates. The syntactic variation links can be considered as the edges of an undirected graph G whose nodes are the term candidates. A node n1 representing a term t1 is connected to a node n2 representing t2 if and only if there is a transformation T such that T(t1) = t2 or T(t2) = t1. Each connected subgraph Gi of G is considered as a cluster of term candidates likely to correspond to similar concepts. (A connected subgraph Gi is such that for every pair of nodes (n1, n2) in Gi, there exists a path from n1 to n2.) For example, t1 = nucléole proéminent (prominent nucleolus), t2 = nucléole central proéminent (prominent central nucleolus), t3 = nucléole souvent proéminent (frequently prominent nucleolus), and t4 = nucléole parfois proéminent (sometimes prominent nucleolus) are four term candidates that build the star-shaped four-term cluster illustrated by Figure 4. Each edge is labelled with the syntactic transformation T that maps one of the nodes to the other.</Paragraph>
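      <Paragraph> The clusters are exactly the connected components of the variation graph, which can be computed with a standard breadth-first search (a sketch; node names are placeholders):

```python
from collections import deque

# Sketch: connected components of the undirected variation graph.
def clusters(nodes, edges):
    adj = {n: set() for n in nodes}
    for a, b in edges:          # undirected: record both directions
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            n = queue.popleft()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            queue.extend(adj[n] - seen)
        components.append(comp)
    return components

# star-shaped cluster around t1 (as in Figure 4) plus an isolated term
comps = clusters(["t1", "t2", "t3", "t4", "t5"],
                 [("t1", "t2"), ("t1", "t3"), ("t1", "t4")])
```

The star around t1 comes out as one four-term cluster, while the unlinked candidate t5 remains a singleton.</Paragraph>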
    </Section>
  </Section>
  <Section position="7" start_page="18" end_page="18" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> Experiments were made on three different corpora described in Table 2. The first two lines of Table 2 report the size of the corpora and the number of term candidates extracted by LEXTER from these corpora. The third and fourth lines show the number of links between term candidates extracted by FASTR and the number of connected subgraphs corresponding to these links. Finally, the last two lines report statistics on the size of the clusters and the ratio of term candidates that belong to one of the subgraphs produced by the clustering algorithm. Although the variation rules implemented in the Term Structuring tool are rather restrictive (only syntactic insertion has been taken into account), the number of links added to the network of term candidates is noticeably high. On average, 10% of the multi-word term candidates produced by LEXTER belong to one of the clusters resulting from the recognition of term variants by FASTR.</Paragraph>
    <Paragraph position="1"> Frequencies of syntactic variations are reported in Table 3. A screenshot showing the type of validation proposed to the expert is given in Figure 5.</Paragraph>
  </Section>
  <Section position="8" start_page="18" end_page="20" type="metho">
    <SectionTitle>
6 Expert Evaluation
</SectionTitle>
    <Paragraph position="0"> Evaluation was performed by three experts, one in each domain represented by each corpus. These experts had already been involved in the construction of terminological products through the analysis of the three corpora used in our experiments: an ontology for a case-memory system dedicated to diagnosis support in pathology (\[Broussais\]), a semantic dictionary for the Menelas Natural Language Understanding system (\[Menelas\]), and a structured thesaurus for a computer-assisted technical writing tool (\[DER\]). The precision rates are very satisfactory (from 93% to 98%, corresponding to the error rates of 7% and 2% given in the last line of Table 4), and show that the proposed method represents significant progress in corpus-based terminology.</Paragraph>
    <Paragraph position="1"> Only a few links are judged conceptually irrelevant by the experts. For example, image d'embole tumorale (image of a tumorous embolus) is not considered a correct variant of image tumorale (image of a tumor) because the first occurrence refers to an embolus while the second one refers to a tumor.</Paragraph>
    <Paragraph position="2"> The experts were required to assess the proposed links and, in case of a positive reply, to provide a judgment about the actual conceptual relation between the connected terms. Although they performed the validation independently, the three experts proposed very similar types of conceptual relations between term candidates connected by syntactic variation links. At a coarse-grained level, they proposed the same three types of conceptual relations: Synonymy Both connected terms are considered equivalent by the expert: embole tumorale (tumorous embolus) / embole vasculaire tumorale (vascular tumorous embolus).</Paragraph>
    <Paragraph position="3"> The preceding example corresponds to a frequent situation of elliptic synonymy: the notion of integrated metonymy (Kleiber, 1989).</Paragraph>
    <Paragraph position="4"> In the medical domain, it is common knowledge that an embole tumorale is an embole vasculaire tumorale, just as everyone knows that sunflower oil is a synonym of sunflower seed oil.</Paragraph>
    <Paragraph position="5"> Generic/specific relation One of the two terms denotes a concept that is finer than the other one: cellule épithéliale cylindrique (cylindrical epithelial cell) is a specific type of cellule cylindrique (cylindrical cell).</Paragraph>
    <Paragraph position="6"> Attributive relation As in the preceding case, there is a non-synonymous semantic relation between the two terms. One of them denotes a concept richer than the other because it carries an additional attribute: a noyau volumineux irrégulier (large irregular nucleus) is a noyau irrégulier (irregular nucleus) that is additionally volumineux (large).</Paragraph>
  </Section>
</Paper>