<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1403"> <Title>Lexically-Based Terminology Structuring: Some Inherent Limits</Title> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Acquiring links through lexical inclusion of terms </SectionTitle> <Paragraph position="0"> The present work induces hierarchical relations between terms when the constituent words of one term lexically include those of the second term (section 3.1). When comparing these relations with those that preexist in the MeSH, precision can reach 29.3% and recall 13.7% (section 3.2). We focus here on the analysis of the relations that are not found in the MeSH (section 3.3), which we develop in the next section (section 4).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Lexical inclusion </SectionTitle> <Paragraph position="0"> The method we use here for inducing hierarchical relations between terms is basically a test of lexical inclusion: we check whether a term a0 (parent) is 'included' in another term a1 (child), i.e., whether all words in a0 occur in a1. We assume that this type of inclusion is a clue to a hierarchical relation between terms, as in acides gras / acides gras indispensables (fatty acids / fatty acids, essential).</Paragraph> <Paragraph position="1"> To detect this type of relation, we test whether all the content words of a0 occur in a1. We do this on segmented terms with a gradually increasing normalization of word forms. Basic normalizations are performed first: conversion to lower case, and removal of punctuation, numbers and 'stop words'.</Paragraph> <Paragraph position="2"> Subsequent normalizations rely on morphological resources: lemmatization (with the two alternate inflectional lexicons) and stemming with a derivational lexicon. 
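As an illustration only, the inclusion test with basic normalization can be sketched in Python as follows. The stop-word list and the names content_words and is_included are assumptions made for this sketch, not the original implementation; lemmatization and stemming, which need the morphological lexicons, are left out.

```python
import string

# Hypothetical stop-word list, for illustration only; the original list is not given.
STOP_WORDS = frozenset({"de", "du", "des", "la", "le", "les", "et", "a", "au", "aux", "en"})


def content_words(term, stop_words=STOP_WORDS):
    """Basic normalization: lower-case, drop punctuation, numbers and stop words."""
    # Replace every punctuation character with a space before splitting into words.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = term.lower().translate(table).split()
    return frozenset(w for w in words if not w.isdigit() and w not in stop_words)


def is_included(parent, child):
    """True if every content word of `parent` occurs in `child` and the word sets differ."""
    p, c = content_words(parent), content_words(child)
    return bool(p) and p != c and p.issubset(c)


print(is_included("acides gras", "acides gras indispensables"))
print(is_included("acides gras", "acides amines"))
```

Comparing sets of words rather than word sequences is a choice of this sketch: it ignores word order, so the parent words may appear anywhere in the child term.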
Terms are indexed by their words to speed up the computation of term inclusion over all term pairs of the whole MeSH thesaurus.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Application to MeSH and quantification </SectionTitle> <Paragraph position="0"> This structuring method has been applied to the flat list of 19,638 terms of the MeSH thesaurus. As expected, the number of links induced between terms increases when applying inflectional normalization and again with derivational normalization.</Paragraph> <Paragraph position="1"> We evaluated the quality of the links obtained with this approach by comparing them automatically with the original structure of the MeSH and computing recall and precision metrics. We summarize here the main results; a detailed evaluation can be found in (Grabar and Zweigenbaum, 2002).</Paragraph> <Paragraph position="2"> Depending on the normalization, up to 29.3% of the links found are correct (precision), and up to 13.7% of the direct MeSH links are found by lexical inclusion (recall). We also examined whether each term was correctly placed under one of its ancestors: this was true for up to 26% of the terms (recall); and the placement suggestions were correct in up to 58% of the cases (precision). The recall of links increases when applying more complete morphological knowledge (inflection then derivation).</Paragraph> <Paragraph position="3"> Precision evolves in the opposite direction: injecting more extensive morphological knowledge (derivation vs inflection) means taking more 'chances' when generating links between terms: the precision with no normalization (raw results) is 29.3% vs 22.5% when using all normalizations (lem-stem-med). 
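For readers who want the two measures spelled out, here is a minimal Python sketch of precision and recall over sets of (parent, child) links. The function name precision_recall and the toy links are invented for illustration; they are not MeSH data and do not reproduce the figures reported here.

```python
def precision_recall(induced, reference):
    """Precision and recall of induced (parent, child) links against reference links."""
    induced, reference = set(induced), set(reference)
    correct = induced.intersection(reference)
    # Precision: share of induced links that exist in the reference structure.
    precision = len(correct) / len(induced) if induced else 0.0
    # Recall: share of reference links that were recovered by induction.
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Toy links, invented for illustration only.
toy_induced = {("t1", "t1 x"), ("t2", "t2 y"), ("t3", "z t3"), ("t4", "w t4")}
toy_reference = {("t1", "t1 x"), ("t2", "t2 y"), ("t5", "u"), ("t6", "v"), ("t7", "s")}

print(precision_recall(toy_induced, toy_reference))
```

With these toy sets, two of the four induced links are in the reference (precision 0.5) and two of the five reference links are recovered (recall 0.4).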
Depending on the type of normalization, the best precision obtained for links is 43%.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Human analysis of 'new' relations </SectionTitle> <Paragraph position="0"> The evaluations presented in the previous section quantify the match between the induced relations and existing MeSH relations. However, they give no explanation for the fact that 70% of the induced relations are not considered relevant by the MeSH.</Paragraph> <Paragraph position="1"> This is what we study in the remainder of this paper: why these terms are not hierarchically related in the MeSH, and what kinds of relations exist between them.</Paragraph> <Paragraph position="2"> According to the position of the words of the 'parent' term in the 'child' term, we divide the extra-MeSH relations into three sets: (a) the parent concept is at the head position in the child concept: absorption/absorption intestinale; (b) the parent concept is at the tail (expansion) position in the child concept: abdomen/tumeur abdomen; (c) other types of positions. Each set of relations is sampled by randomly selecting a 20% subset, both without normalization (raw) and with inflectional and derivational normalizations (med-lem-stem). 
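The partition and the sampling can be sketched in Python as below. This is a hedged illustration, not the procedure actually used: the names position_class and sample_fraction, the prefix/suffix test on word lists, and the fixed random seed are all assumptions of the sketch.

```python
import random


def position_class(parent_words, child_words):
    """Return 'head', 'expansion' or 'other' from where the parent words sit in the child."""
    n = len(parent_words)
    if n == 0 or parent_words == child_words:
        raise ValueError("parent must be non-empty and different from child")
    if child_words[:n] == parent_words:
        return "head"  # parent words open the child term
    if child_words[-n:] == parent_words:
        return "expansion"  # parent words close the child term
    return "other"


def sample_fraction(relations, fraction=0.2, seed=0):
    """Randomly select a fixed fraction of the relations (seeded for repeatability)."""
    rng = random.Random(seed)
    pool = list(relations)
    return rng.sample(pool, round(fraction * len(pool)))


print(position_class(["absorption"], ["absorption", "intestinale"]))
print(position_class(["abdomen"], ["tumeur", "abdomen"]))
print(position_class(["bacterie", "aerobie"], ["bacterie", "gram-negatif", "aerobie"]))
```

On the three examples taken from the text, the sketch returns head, expansion and other respectively.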
Table 1 presents the number of analyzed relations (total = 194).</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 An analysis of new, lexically-induced relations </SectionTitle> <Paragraph position="0"> We first examine the issues encountered when trying to identify the head of each term (section 4.1), then review in turn each analyzed subset: head (section 4.2), expansion (section 4.3) and other relations (section 4.4).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Finding the head </SectionTitle> <Paragraph position="0"> In French, the semantic head of a noun phrase is usually located at the beginning of this phrase (this contrasts with English, where the semantic head is generally at the end of NPs). Moreover, as is often the case with terms, MeSH terms do not include determiners, so that the semantic head is usually the first word here. We therefore rely on a heuristic for determining 'head' and 'expansion' subsets: the head is the first word of the term, and the expansion is the last word. 
This is correct most of the time, but in some cases, the semantic head is positioned at the end of the term, generally separated by a comma, a tradition sometimes followed in thesauri: filoviridae/filoviridae, infections, leishmania/leishmania tropica, infection, quinones/quinone reductases, neurone/neurone moteur, maladie, syndrome/bouche main pied, syndrome.</Paragraph> <Paragraph position="1"> These cases must be hand-corrected and distributed into the following classes.</Paragraph> <Paragraph position="2"> We also encountered another kind of error, due to overzealous derivational knowledge: contracture/contraction musculaire, biologie/testament biologique, where contracture (a muscle disease) and contraction (normal muscle function) have both been stemmed to the same base word; the expansion adjective biologique is derived from the noun biologie, but its sense is generally more specific than biologie.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 'Head' subset </SectionTitle> <Paragraph position="0"> Let us first discard a case where it seems that we encountered a translation error. An examination of the structure of the English MeSH and a search on Web pages show that in the French MeSH, acide linoleique alpha should read acide linolenique alpha, which is a kind of acide linolenique (and not a kind of acide linoleique). The induced relation: acide linoleique/acide linoleique alpha is therefore incorrect; with the correct spelling, the lexical inclusion: acide linolenique/acide linolenique alpha would reveal a correct hierarchical relation.</Paragraph> <Paragraph position="1"> 4.2.1 The head is not the 'genus' of the term We encountered cases where the whole term did not have an is-a relation with the head as defined above. This happens in two types of situations.</Paragraph> <Paragraph position="2"> The first situation is due to syntactic reasons. 
In the following induced relation, acides amines / acides amines, peptides et proteines, the larger term is an enumeration, with the sense of a logical OR. It is therefore the genus term, of which each component (e.g., acides amines) is a sub-type.</Paragraph> <Paragraph position="3"> The second situation is due to semantic reasons. Lexical induction of hierarchical relations assumes inheritance of the defining features of the genus term (e.g., a fatty acid, essential is a kind of fatty acid). However, it is well known that this is not always true: a plaster cat is not a cat (i.e., a mammal, etc.). This is sometimes modeled as a type coercion phenomenon. We found quite a few 'plaster cats' in our terms: personnalite/personnalite compulsive, voix/voix oesophagienne.</Paragraph> <Paragraph position="4"> For instance, personnalite here describes 'behavior-response patterns that characterize the individual', whereas personnalite compulsive (compulsive personality disorder) describes a mental disorder. Disorders (or diseases) are different objects from behaviors in the MeSH.</Paragraph> <Paragraph position="5"> This depends on the choice of term names in the terminology (here, the MeSH). Terms like absorption, investissement, etc., have specific senses that make them polysemous. To determine a precise sense, these terms have to be specialized by their contexts: investissement/investissement (psychanalyse), absorption/absorption cutanee, goitre/goitre ovarien. Here, investissement alone (investment) has the financial sense, whereas in investissement (psychanalyse), it has its more generic sense. In a similar way, absorption has a specific meaning in chemistry, and goitre alone is a disorder of the thyroid gland. 
These cases are often non-ambiguous in the original English version of the same terms: for instance, investissement (psychanalyse) (fr) is a translation of cathexis (en).</Paragraph> <Paragraph position="6"> A related case occurs when the name of a parent term is underspecified: acides/acides pentanoiques, acne/acne rosacee.</Paragraph> <Paragraph position="7"> In these examples, acides means inorganic acids (footnote 1) and acne means acne vulgaris, but the convention adopted is to use these single words to name the corresponding concepts.</Paragraph> <Paragraph position="8"> Finally, some induced links, although absent from the MeSH, are potentially correct is-a links, but the designers of the MeSH have made a different modeling choice: amyotrophies/amyotrophies spinales enfance, hyperplasie/hyperplasie epitheliale focale, centre public sante/centre public sante mentale, rectocolite/rectocolite hemorragique, penicillines/penicilline g.</Paragraph> <Paragraph position="9"> A general representational choice in the MeSH, as in some other medical terminologies (e.g., SNOMED), is to differentiate on the one hand signs or symptoms and on the other hand diseases (a more fully characterized pathological state). 
This is the case for amyotrophies and hyperplasie (signs or symptoms) vs amyotrophies spinales enfance and hyperplasie epitheliale focale (disease of the nervous system, of the mouth).</Paragraph> <Paragraph position="10"> For some reason, a centre public sante mentale (public mental health center) is considered not to share all the attributes of a general centre public sante (public health center), which prevents them from being in a parent-child relationship: they are only siblings in the MeSH thesaurus.</Paragraph> <Paragraph position="11"> Penicillines, in the MeSH, have been chosen to refer to a therapeutic class of drugs (under antibiotics, under chemical actions), whereas penicilline g is considered as a chemical substance.</Paragraph> <Paragraph position="12"> The structuring involved in these instances reflects the ontological commitments of the terminology designers, and cannot be recovered by lexical inclusion (footnote 2).</Paragraph> <Paragraph position="13"> Footnote 1: Note, though, that if inorganic acids was named this way, it would be impossible to link it by lexical induction to other, more specific types of inorganic acids.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 'Expansion' subset </SectionTitle> <Paragraph position="0"> When a 'parent' term is in 'expansion' position (end position) in a 'child' term, we assume that the semantic head of the child term is modified; the induced relation is indeed expected not to be is-a.</Paragraph> <Paragraph position="1"> Some of the main cases found are close to those for the 'head' subset. 
Among others, we find again enumerations (see subsection 4.2.1): immunodepresseurs / antineoplasiques et immunodepresseurs and syntactic ambiguity (subsection 4.2.2): oncogene/antigene viral oncogene, where the word oncogene is a noun in the first term and an adjective in the second one.</Paragraph> <Paragraph position="2"> Many of the relations found in the 'expansion' subset are partitive: abdomen/muscle droit abdomen, amerique centrale/indien amerique centrale, argent/nitrate argent.</Paragraph> <Paragraph position="3"> (human body parts, a continent and its peoples, and chemical substances).</Paragraph> <Paragraph position="4"> In some instances, a general type of link between terms can be detected: caused-by: myxome/virus myxome, but in most other cases, we have what looks like a specific thematic relation between a predicate and its argument: comportement alimentaire/troubles comportement alimentaire, bovin/pneumonie interstitielle atypique bovin, hopital/capacite lits hopital, services sante/fermeture service sante, macrophage/activation macrophage.</Paragraph> <Paragraph position="5"> Note that some of these expansion relations involve adjectival derivations of nouns: cubitus/nerf cubital, genes/epreuve complementation genetique.</Paragraph> <Paragraph position="6"> Footnote 2: They might be amenable to distributional methods if their contexts of occurrence are different enough.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 'Other' subset </SectionTitle> <Paragraph position="0"> In this last subset, the 'parent' term can be at any position in the 'child' term other than head or expansion. It can also be non-contiguous, accepting modifiers or some other intervening elements. 
All these cases are actually similar to those of the 'expansion' subset except those of the form: bacterie aerobie/bacterie gram-negatif aerobie where bacterie remains the head of the term.</Paragraph> <Paragraph position="1"> The following examples reproduce the general cases of the 'expansion' subset with additional modifiers: arteres/anevrysme artere iliaque, hepatite b/virus hepatite b canard, encephalite/virus encephalite equine ouest, sommeil/troubles sommeil extrinseques, irrigation/liquide irrigation endocanalaire, maladie/assurance maladie personne agee.</Paragraph> <Paragraph position="2"> In some of them, adjectival derivation is involved: cellules/molecule-1 adhesion cellulaire vasculaire, chimie/produits chimiques inorganiques, dent/implantation dentaire sous-periostee.</Paragraph> <Paragraph position="3"> Some relations are characteristic of the language of chemical compounds: cytochrome c/ubiquinol-cytochrome c reductase, diphosphate/uridine diphosphate acide glucuronique, lysine/histone-lysine n-methyltransferase.</Paragraph> <Paragraph position="4"> The 'other' subset also hosted the following morphosyntactic ambiguity: cilie/cellule ciliee externe where the words cilie (noun, an invertebrate organism) and ciliee (inflected form of adjective cilie, which characterizes a type of cell) are conflated by lemmatization. This error is mainly due to the fact that the MeSH is written with unaccented uppercase letters: the adjective is actually spelled cilie, which would be unambiguous here.</Paragraph> </Section> </Section> </Paper>