<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1087"> <Title>Enhancing automatic term recognition through recognition of variation</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"> Terms are linguistic units that are assigned to concepts and used by domain specialists to describe and refer to specific concepts in a domain.</Paragraph> <Paragraph position="1"> In this sense, terms are preferred designators of concepts. In text, however, concepts are frequently denoted by different surface realisations of preferred terms, which we denote as their term variants. Consequently, a concept can be linguistically represented using any of the surface forms that are variants of the corresponding preferred term. We consider the following types of term variation: (i) orthographic: e.g. usage of hyphens and slashes (amino acid and amino-acid), lower and upper cases (NF-KB and NF-kb), spelling variations (tumour and tumor), different Latin/Greek transcriptions (oestrogen and estrogen), etc.</Paragraph> <Paragraph position="2"> (ii) morphological: the simplest variations are related to inflectional phenomena (e.g. singular, plural). Derivational transformations can lead to variants in some cases (cellular gene and cell gene), but not always (activated factor vs.</Paragraph> <Paragraph position="3"> activating factor); (iii) lexical: genuine lexical synonyms, which may be interchangeably used (carcinoma and cancer, haemorrhage and blood loss); (iv) structural: e.g. 
possessive constructions expressed with prepositions (clones of human and human clones), prepositional variants (cell in blood, cell from blood), term coordinations (adrenal glands and gonads); (v) acronyms and abbreviations: very frequent term variation phenomena in technical sublanguages, especially in biomedicine; sometimes they may even be the preferred terms (DNA for deoxyribonucleic acid).</Paragraph> <Paragraph position="4"> Note that variation types (i)-(iii) affect individual constituents, while (iv) and (v) involve variation in the structure of the preferred term. In any case, they do not &quot;change&quot; the meaning, as they refer to the same concept. Daille et al. (1996) and Jacquemin (1999, 2001) further identified types of variation that modify the meaning of terms.</Paragraph> <Paragraph position="5"> Although many authors mention the problems related to term variation, few have dealt with linking the corresponding term variants. Also, the recognition of variants is typically performed as a separate operation, and not as part of ATR.</Paragraph> <Paragraph position="6"> The simplest technique for handling some types of term variation (e.g. morphological) is based on stemming: if two term forms share a stemmed representation, they are considered mutual variants (Jacquemin and Tzoukermann, 1999; Ananiadou et al., 2000). However, stemming may result in ambiguous denotations related to &quot;over-stemming&quot; (i.e. resulting in the conflation of terms which are not real variants) and &quot;under-stemming&quot; (i.e. 
resulting in the failure to link real term variants).</Paragraph> <Paragraph position="7"> Other approaches to the recognition of term variants use preferred terms and known synonyms from existing term dictionaries and approximate string matching techniques to link or generate different term variants (Krauthammer et al., 2001; Tsuruoka and Tsujii, 2003).</Paragraph> <Paragraph position="8"> Jacquemin (2001) presents a rule-based system, FASTR, which supports several hundred meta-rules dealing with morphological, syntactic (i.e. structural) and semantic term variation. Term variation recognition is based on the transformation of basic term structures into variant structures. However, the variants recognised by FASTR are more conceptual variants than terminological ones, as non-terminological units (such as verb phrases, extended insertions, etc.) are also linked to terms in order to improve indexing and retrieval.</Paragraph> </Section> <Section position="5" start_page="0" end_page="4" type="metho"> <SectionTitle> 3 Incorporating term variation into ATR </SectionTitle> <Paragraph position="0"> Our approach to ATR combines the C-value method (Frantzi et al., 2000) with the recognition of term variation, which is incorporated as an integral part of the term extraction process.</Paragraph> <Paragraph position="1"> C-value is a hybrid approach combining term formation patterns with corpus-based statistical measures. Term formation patterns act as linguistic filters to a POS tagged corpus: filtered sequences are considered as potential realisations of domain concepts (term candidates). They are subsequently assigned termhoods (i.e. likelihood to represent terms) according to a statistical measure. 
The measure amalgamates four corpus-based characteristics of a term candidate, namely its frequency of occurrence, its frequency of occurrence as a form nested within other candidate terms, the number of candidate terms inside which it is nested, and the number of words it contains.</Paragraph> <Paragraph position="2"> The original C-value method treats term variants that correspond to the same concept as separate term candidates. Consequently, because each variant receives its own frequency of occurrence instead of contributing to a single frequency calculated for a term candidate unifying all variants, the corpus-based measures and termhoods are distributed across different variants. Therefore, we aim to enhance the statistical evaluation of termhoods through conflation of different surface representations of a given term, and through joint frequencies of occurrence of all equivalent surface forms that correspond to a single concept.</Paragraph> <Paragraph position="3"> In order to conflate equivalent surface expressions, we carry out linguistic normalisation of individual term candidates (see examples in Table 1). Firstly, each term candidate is mapped to a specific canonical representative (CR) by semantically isomorphic transformations. Then, we establish an equivalence relation, where two term candidates are related iff they share the same CR.</Paragraph> <Paragraph position="4"> The equivalence classes of this relation are denoted as synterms: a synterm contains surface term representations sharing the same CR.</Paragraph> <Paragraph position="5"> Our aim is to form synterms prior to the statistical estimation of termhoods for term candidates.</Paragraph> <Paragraph position="6"> Therefore, after the extraction of individual term candidates, we normalise them in order to generate synterms, where the normalisation is performed according to the typology of variations described in Section 2. 
More precisely, we consider separately the normalisation of variations that affect term candidate constituents and variations that involve structural changes. The general architecture of our ATR approach is presented in Figure 1.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Normalising term constituent variation </SectionTitle> <Paragraph position="0"> In the case of variations that do not affect the structure of terms, the formation of CRs is based on a POS tagger (for inflectional variation) and simple heuristics (for orthographic normalisation).</Paragraph> <Paragraph position="1"> For example, different transcriptions of neoclassical combining forms are treated by replacements of specific character combinations (ae → e, ph → f) in such forms (and only in such forms). Inflectional normalisation is based on POS tagging: a canonical term candidate form is a singular form containing no possessives (Down's syndrome → down syndrome).</Paragraph> <Paragraph position="2"> In order to address lexical variants, one can use dictionaries of synonyms where the preferred terms are used for normalisation purposes ({hepatic microsomes, liver microsomes} → liver microsomes). In the experiments reported here, we did not attempt to normalise lexical variation.</Paragraph> </Section> <Section position="2" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 3.2 Normalising term structure variation </SectionTitle> <Paragraph position="0"> Variations affecting term structure are less frequent but more complex. Here we consider two types of term variation: prepositional term candidates and coordinated term candidates (for a detailed analysis of these variations see Nenadic et al., 2004).</Paragraph> <Paragraph position="1"> Prepositional term candidates are normalised by transformation into corresponding expressions without prepositions. 
Using the prepositions of, in, for and by as anchors, we generate semantically isomorphic CRs by inversion. For example, the candidate nuclear factor of activated T cell is transformed into activated T cell nuclear factor.</Paragraph> <Paragraph position="2"> Here is a simplified example of a rule describing the transformation of a term candidate that contains the preposition of: if structure of term candidate is ... In order to address the problem of determining the boundaries of term constituents in text (to the right and left of prepositions), for each prepositional term candidate we generate all possible nested candidates + and their corresponding CRs. For example, for the candidate regulation of gene expression, we generate both gene regulation and gene expression regulation. Since this approach also generates a number of false candidates, additional heuristics are used to enhance precision, such as removing adverbials and determiners, using a stop list of terminologically irrelevant prepositional expressions (e.g. number of ..., list of ..., case of ..., in presence of ...), etc.</Paragraph> <Paragraph position="3"> A similar approach is used for the recognition of coordinated term candidates: coordinating conjunctions (and, or, but not, as well as, etc.) are used as anchors, and when a coordinating structure is recognised in text, the corresponding CRs of the candidate terms involved are generated.</Paragraph> <Paragraph position="4"> We differentiate between head coordination (where term heads are coordinated, e.g. adrenal glands and gonads) and argument coordination (where term arguments/modifiers are coordinated, e.g. SMRT and Trip-1 mRNAs).</Paragraph> <Paragraph position="5"> The recognition and extraction of coordinated terms is highly ambiguous even for human specialists, since coordinated terms and term conjunctions share the same structures (see Table 2). 
Also, similar patterns cover both argument and head coordinations, which makes it difficult to extract coordinated constituents (i.e. terms). Not only is the recognition of term coordinations and their subtypes ambiguous, but the internal boundaries of coordinated terms are also blurred. In a separate study, we have shown that morphosyntactic features are insufficient both for the successful recognition of coordinations and for the extraction of coordinated terms: in many cases, the correct interpretation and decoding of term coordinations is only possible with sufficient background knowledge (Nenadic et al., 2004).</Paragraph> <Paragraph position="6"> + Each constituent extracted from a nested prepositional term candidate has to follow a pattern used for the extraction of individual candidate terms. (Table 2 example: adrenal glands and gonads.) In order to address the problems of structural ambiguities and boundaries of coordinated terms, we also generate all possible nested coordination expressions and corresponding term candidates.</Paragraph> <Paragraph position="7"> For example, from a candidate coordination viral gene expression and replication we generate two pairs of coordinated term candidates: (i) viral gene expression and viral gene replication; (ii) viral gene expression and viral replication. Patterns for the extraction of term candidates from coordinations have been acquired semi-manually for a subset of term coordinations. For each pattern, we define a procedure for the extraction of coordinated term candidates and generation of the corresponding CRs (see Table 3 for examples). The generated candidates from coordinated structures are subsequently treated as individual term candidates.</Paragraph> </Section> <Section position="3" start_page="1" end_page="4" type="sub_section"> <SectionTitle> 3.3 Normalising acronym variation </SectionTitle> <Paragraph position="0"> We treat acronym extraction as part of the ATR process (see Figure 1). 
In Nenadic et al. (2002) we suggested a simple procedure for acquiring acronyms and their expanded forms (EFs), based mainly on the orthographic and syntactic features of the contexts in which acronyms were introduced. The model is based on three types of patterns: acronym patterns (defining common internal acronym structures and forms), definition patterns (based on syntactic patterns which describe typical contexts where acronyms are introduced in text) and matching patterns (the set of matching rules between acronyms and their corresponding EFs).</Paragraph> <Paragraph position="1"> Acronyms also exhibit variation (e.g. RAR alpha, RAR-alpha, RARA, RARa, RA receptor alpha etc.</Paragraph> <Paragraph position="2"> are all acronyms for retinoic acid receptor alpha). Therefore, in addition to extracting acronyms, we further gather all acronym variants and their EFs, and we map them to a single CR. Since in this paper acronyms are taken as term variants, we &quot;replace&quot; acronym occurrences by the CR of their EFs. In order to bypass the problem of acronym ambiguity, we replace/normalise only acronyms that are introduced in a given document.</Paragraph> <Paragraph position="3"> Table 3: Extraction of term candidates from coordinations (nested denotes the generation of all possible linearly nested substrings). Example: function or surface antigenic profile → surface antigenic profile, function antigenic profile.</Paragraph> </Section> <Section position="4" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 3.4 Calculating termhoods with variants </SectionTitle> <Paragraph position="0"> Term variants sharing the same CR are grouped together into synterms, and the calculation of C-values (i.e. 
termhoods) is performed for the whole synterm rather than for individual term candidates.</Paragraph> <Paragraph position="1"> The main reason for doing this is to avoid the distribution of frequencies of occurrence of term candidates across different variants, as these frequencies have a significant impact on estimating termhoods. Instead of providing separate frequencies of occurrence and obtaining termhoods for individual term candidates, we provide a single frequency of occurrence and a joint termhood calculated for a synterm, which unifies all variants.</Paragraph> <Paragraph position="2"> Similarly to the estimation of C-values for individual term candidates (Frantzi et al., 2000), the formula for calculating the termhoods for synterms is as follows: $CV(c) = \ln|CR| \cdot f(CR)$ if CR is not nested, and $CV(c) = \ln|CR| \cdot \bigl(f(CR) - \frac{1}{|T_{CR}|} \sum_{x \in T_{CR}} f(x)\bigr)$ if CR is nested, where c denotes a synterm whose elements share a canonical representative (denoted as CR in the formula), f(CR) corresponds to the cumulative frequency with which all term candidates from the synterm c occur in a given corpus, |CR| denotes the average length of the term candidates (the number of constituents), and $T_{CR}$ is a set of all synterms whose CRs contain the given CR as a nested substring.</Paragraph> <Paragraph position="3"> This approach ensures that all term variants are naturally dealt with jointly, thus supporting the fact that they denote the same concept. As a consequence, we expect precision to be enhanced by considering joint frequencies of occurrence and termhoods for all variants of candidate terms, while recall should benefit from the introduction of new candidates through consideration of different variation types.</Paragraph> </Section> </Section></Paper>