<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2167">
  <Title>A METHODOLOGY FOR AUTOMATIC TERM RECOGNITION</Title>
  <Section position="3" start_page="0" end_page="1036" type="intro">
    <SectionTitle>
3 A METHODOLOGY FOR AUTOMATIC TERM RECOGNITION
</SectionTitle>
    <Paragraph position="0"> We investigated the relevance of other disciplines to automatic term recognition, such as Information Science - especially techniques of automatic indexing. We concluded that non-linguistic based techniques (statistical and probability based ones), while providing gross means of characterizing texts and measuring the behaviour and content-bearing potential of words, are not refined enough for our purposes. In Terminology we are interested as much in word-forms occurring with high frequency as in rare ones or those of the middle frequencies. We are interested in all units that may be acting as terms in a collection of texts. However, we do not deny the useful role of such techniques. They have their place in that they may usefully complement other techniques.</Paragraph>
    <Paragraph position="1"> We chose to concentrate on potential contributions of Linguistics, especially from lexical morphology, and were interested in developing methodologies for term recognition that apply theoretically motivated ideas about term formation. Theoretical Linguistics deals exclusively with general language word structure. We designed an integrated model of word and term structure based on the results of an analysis of Immunology terms in the sublanguage of Medicine (for English) and on models to be found in the literature on general language (Selkirk, 1982; Mohanan, 1986).</Paragraph>
    <Paragraph position="2"> Medical terminology relies heavily on Greek (mainly) and Latin neoclassical elements for the creation of terms such as 'erythrocyte' and 'angioneurotic'. In the literature of theoretical Linguistics there are no satisfactory accounts of the neoclassical vocabulary and no formal motivated classification of neoclassical wordforms exists. In Terminology, most accounts of term structure remain at an unformalised descriptive level and this is particularly true for discussions of neoclassical vocabularies.</Paragraph>
    <Paragraph position="3"> This overall lack of formal description of neoclassical elements appears to be due to their occupying a peripheral or ambiguous place in most analyses of word and term formation in English. We found this to be unsatisfactory for the following reason: it is anomalous to conceive of English word formation as being somehow separated from term formation, especially as terms constitute the majority of English words. Therefore, we strove to set up an integrated model of word and term structure which would, importantly, account adequately for the neoclassical component.</Paragraph>
    <Paragraph position="4"> The word structure of English can be said to comprise 3 category types, i.e. Word, Root and Affix (Selkirk, 1982) [1]. However, there is great confusion in the literature as to the morphological status of Greek and Latin neoclassical forms, i.e. whether they are roots, affixes or even both. Models which describe them as affixes allow the generation of forms such as *affix+affix. Many models, including the unformalised ones of conventional dictionaries, characterise neoclassical elements vaguely as 'combining elements', which suggests some kind of extra-morphological status (or wastebasket status). [Footnote 1: Selkirk is cited here only as a reference point: we shall develop our own model, as shall be seen.]</Paragraph>
    <Paragraph position="5"> Such forms thus apparently defy attempts to provide an integrated account in terms of the accepted morphological categories.</Paragraph>
    <Paragraph position="6"> In our approach, we introduced a fourth category type comb, to help handle the neoclassical wordstock of English. This does not, in itself, resolve the problem of how to (sub)classify neoclassical elements: we will address this aspect below in detail.</Paragraph>
    <Paragraph position="7"> Firstly, though, we discuss our concomitant adoption of a level ordered approach to the morphological analysis of English words and terms.</Paragraph>
    <Paragraph position="8"> Level ordering places strong constraints on the cooccurrence or order of classes of affix and hence is a powerful mechanism in helping to identify whether a wordform is well-formed or not, whether a wordform may be segmented in a particular way or not, etc.</Paragraph>
    <Paragraph position="9"> Numerous models incorporating level ordering have been proposed in morphology and morphophonology.</Paragraph>
    <Paragraph position="10"> There is debate on how many levels should be identified and what the relationships between levels are. Level ordering also has its critics. We do not enter into these debates here; however, we have found, in experiments over the years, that level ordering is of great use in a computational morphology environment. Sproat (1992) has recently made a similar suggestion, likewise finding that there is a gain in grouping rules according to Level.</Paragraph>
    <Paragraph position="11"> There is nevertheless broad agreement that, in English, Level 1 and Level 2 are affixational levels dealing with latinate morphology (Class I affixation) and native morphology (Class II affixation), respectively. Level 1 feeds Level 2, therefore native affixation must be attached outside latinate morphology. There is less agreement about the relationship between Class II affixation and native compounding and whether one needs to identify a separate native compounding level. For various reasons we do not have space to go into here, we choose to recognize a distinct native compounding Level 3. Moreover, we importantly recognize a Level 0, which is reserved for non-native (i.e. neoclassical) compounding. In other words, compounding purely involving neoclassical elements must be completed before affixation takes place.</Paragraph>
    <Paragraph position="12"> Thus, the four distinct levels of our model (Levels 0 to 3, respectively) are:
      1. Non-native compounding (neoclassical compounding)
      2. Class I affixation
      3. Class II affixation
      4. Native compounding
    Each level has two characteristics: it is cyclic and optional. Cyclicity accounts for recursive structures, i.e. we might find forms such as the following:
      prefix-II + word + suffix-I + suffix-I [2]
    where Level 1 rules apply twice before Level 2 ones.</Paragraph>
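The ordering constraint described above can be illustrated with a minimal sketch. It assumes that a derivation is given as the list of levels of the operations applied, innermost first; the names LEVELS and respects_level_ordering are hypothetical and are not part of the system described in the paper. Optionality is modelled by allowing levels to be skipped, and cyclicity by allowing a level to repeat.

```python
from typing import Sequence

# Level numbers as described in the text; the labels are illustrative only.
LEVELS = {
    0: "non-native (neoclassical) compounding",
    1: "Class I affixation",
    2: "Class II affixation",
    3: "native compounding",
}


def respects_level_ordering(operation_levels: Sequence[int]) -> bool:
    """Return True if the operations, listed innermost first, never step back to a lower level.

    A level may be skipped (levels are optional) or repeated (levels are cyclic).
    """
    previous = None
    for level in operation_levels:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        if previous is not None and previous > level:
            return False  # e.g. a Class I affix attached outside a Class II affix
        previous = level
    return True


# prefix-II + word + suffix-I + suffix-I: two Level 1 operations, then one Level 2 operation.
example_ok = respects_level_ordering([1, 1, 2])
# A Class I suffix attached outside a Class II suffix violates the ordering.
example_bad = respects_level_ordering([2, 1])
```

Under these assumptions the first example is accepted and the second is rejected, which is the filtering effect the text attributes to level ordering.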
    <Paragraph position="13"> To apply our model, we used the Edinburgh Cambridge Morphological Analyser and Dictionary System (Ritchie et al., 1992), a component of the Natural Language Toolkit developed for the UK Alvey IT Programme. This offers a Koskenniemi-type analyser (here restricted to handling morphographemic phenomena) and a general purpose unification based analyser which allows the morphologist to express her knowledge via feature bundles of attribute-value pairs in a context-free grammar framework.</Paragraph>
    <Paragraph position="14"> Our model is instantiated in our computational wordform grammar as follows. The analysis strategy used by the wordform grammar parser is that of a bottom-up chart parser. Each rule in our grammar is marked for level or levels. Lexical entries are also marked for level. Thus, a Class I suffix like 'ous' as in 'glorious' is marked for Level 1. Monomorphemic non-affix native lexical entries are also marked by default for Level 1. Thus, if we have the wordform 'glorious', then, in a computational environment, string segmentation, morphographemic rule application and dictionary look-up will yield: glory ((cat noun)(level 1)) and ous ((cat suffix)(level 1)). These two representations are added to the data structure (a chart). Rules with Level 1 as their domain may now apply, as the basic condition for their activation is present in the chart. They will match with these representations and yield: glory + ous ((cat adjective)(level 1)), which is still a Level 1 object. This is added to the chart and no further rules apply. This representation may now be generated as a word of English. As Levels are optional, in this case the rules associated with higher Levels do not apply. If we take an underived monomorphemic native wordform, this can be seen conceptually as passing through Levels 1-3, with vacuous rule application. All such wordforms are marked as Level 1 in the dictionary, and thus will not be considered, as is correct, by Level 0 rules. The fact that an object is marked for some Level does not block it at that Level: it merely indicates that this is the first Level at which rules may apply. Recall that we do not know, in bottom-up analysis, whether e.g. we are dealing with an underived form until we have finished the analysis; thus we must allow for underived forms to potentially combine with affixes or participate in compounding.</Paragraph>
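The 'glorious' walk-through above can be sketched as follows. This is an illustration under stated assumptions, not the Edinburgh Cambridge analyser: segmentation and morphographemic rule application are taken as already done, feature bundles are plain dictionaries, and the names LEXICON, apply_level1_suffix_rule and the suffix fields "suffixes" and "makes" (a subcategorisation frame and a resulting category) are hypothetical encodings chosen for the example.

```python
from typing import Optional

# Hypothetical result of segmentation and dictionary look-up for 'glorious'.
LEXICON = {
    "glory": {"cat": "noun", "level": 1},
    "ous": {"cat": "suffix", "level": 1, "suffixes": "noun", "makes": "adjective"},
}


def apply_level1_suffix_rule(stem: dict, suffix: dict) -> Optional[dict]:
    """Combine a stem and a suffix if both are Level 1 objects and their categories match."""
    if suffix["cat"] != "suffix" or stem["level"] != 1 or suffix["level"] != 1:
        return None
    if suffix["suffixes"] != stem["cat"]:
        return None  # the suffix does not subcategorise for this stem category
    # The category supplied by the suffix becomes the category of the mother node.
    return {"cat": suffix["makes"], "level": 1}


# The chart starts with the two lexical representations.
chart = [LEXICON["glory"], LEXICON["ous"]]
result = apply_level1_suffix_rule(chart[0], chart[1])
if result is not None:
    chart.append(result)  # glory + ous ((cat adjective)(level 1))
```

A real chart parser would index edges by string position and try every applicable rule; the sketch keeps only the single rule application the text describes.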
    <Paragraph position="15"> Besides the use of four levels in our morphological analysis, we additionally introduced a diacritic feature which explicitly marks degrees of boundness for neoclassical roots. Analysis of a corpus of Immunology texts, by various (semi-)automatic methods, produced classifications of neoclassical elements into roots and affixes. [Footnote 2: I, II correspond to Class I affixation and Class II affixation, respectively.]</Paragraph>
    <Paragraph position="16"> Neoclassical roots make up our new category comb and display three degrees of boundness: totally free (e.g. cyst), partially bound (e.g. myel- or -myel) and totally bound (e.g. ten) [3]. Totally bound forms cannot appear on their own and cannot appear in compound final root position without being suffixed. Partially bound forms cannot appear on their own, but can stand in compound final root position without suffixation. Totally free forms can appear in any position, suffixed or not, and can stand on their own. All neoclassical roots are marked in the dictionary with level information level 0 and a value for the boundness feature [4]. Those neoclassical elements that we have classed as affixes are dealt with largely at Level 1. In addition to level ordering and boundness information, other characteristics of our implementation are the use of morphosyntactic head, feature value percolation and relativized head (Di Sciullo and Williams, 1987). The important issue for us was to determine whether a wordform is a general language word or a potential term. In our system, we demonstrated how this could be achieved for affixed forms, neoclassical compounds and certain types of native compound. We labelled certain suffixes as typically term forming suffixes on the basis of a sublanguage corpus analysis, attaching the feature value (wordtype term) to their dictionary entry (each affix has its own lexical entry). We can then ensure that a suffix with this feature percolates its value to the mother node. We used only two wordtype values in our system: term and word. Besides employing the notions of head and percolation from Lexicalist Morphology, we also used the notion of relativized head. This refinement of the notion of head helped us percolate the relevant information in cases where the morpheme bearing the label (wordtype term) was not in syntactic head position according to the Righthand Head Rule.</Paragraph>
    <Paragraph position="17"> Our wordform grammar rules generate the following word and term forms involving suffixation (note: prefixation is similar to suffixation thus is not shown):
      term → word + term_suffix
      term → term + term_suffix
      word → word + word_suffix
      term → term + word_suffix
    Compounding operates in a similar fashion:
      term → term + word
      term → term + term
      term → word + term
      word → word + word</Paragraph>
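The suffixation and compounding rules above share one pattern: the mother node is a term as soon as either daughter carries the term value. A minimal sketch of that pattern as a lookup table follows; the names PERCOLATION and mother_wordtype are hypothetical, and the table abstracts away from categories, levels and the head mechanism that the actual grammar uses.

```python
# (left daughter, right daughter) -> mother wordtype.
# The right daughter is either a suffix or the second member of a compound.
PERCOLATION = {
    ("word", "term"): "term",  # term -> word + term_suffix ; term -> word + term
    ("term", "term"): "term",  # term -> term + term_suffix ; term -> term + term
    ("word", "word"): "word",  # word -> word + word_suffix ; word -> word + word
    ("term", "word"): "term",  # term -> term + word_suffix ; term -> term + word
}


def mother_wordtype(left: str, right: str) -> str:
    """Return the wordtype percolated to the mother node for two daughters."""
    try:
        return PERCOLATION[(left, right)]
    except KeyError:
        raise ValueError(f"unknown wordtype pair: {(left, right)!r}") from None
```

The last table row is the case where the term value does not come from the right-hand daughter, which is where the text appeals to the relativized head.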
    <Paragraph position="18"> Our use of a unification based word grammar then allowed features associated with known terminological elements to be attached to overall wordforms, thus characterising them as potential terms for later assessment by the terminologist. The notion of terminological head of a wordform is important in this respect: this refers to the element of a complex wordform which confers term-hood on the whole wordform, which may not be the same as the morphosyntactic head.</Paragraph>
    <Paragraph position="19"> [Footnote 3: We could have worked with three types of comb; however, we prefer our current solution as it appears more flexible and expressive to us.]</Paragraph>
    <Paragraph position="20"> [Footnote 4: We only use two values for bound; however, boundness is interpreted by a combination of bound and level values to give us our 3-way distinction.]</Paragraph>
    <Paragraph position="21"> As yet, we are only capable of determining terminological status for an unknown word, or wordform containing an unknown morpheme, if it contains a known terminological element (revealed by prior corpus analysis and coded appropriately in the dictionary). For known morphemes there is no problem. By using notions of Level Ordering, we can furthermore impose strong constraints on the form a word (or term) may take. Thus, we can filter and reject as nonwords or nonterms wordforms where an analysis without Level Ordering might postulate a valid wordform of English.</Paragraph>
    <Paragraph position="22"> We provide an analysis of a potential term in the following.</Paragraph>
    <Paragraph position="23"> Final representation of the analysis of 'leukaemia': ((bound -) (compound -) (level 1) (wordtype term) (category noun)). This is the final representation, which postulates that the word 'leukaemia' is a noun, term, Level 1, non-compound lexical unit.</Paragraph>
    <Paragraph position="24"> Rule name: L0-to-L1-by-n-or-adj-suffixing. Representation of lexical entry: ((compound -) (wordtype term) (bound -) (level 0) (category comb)).</Paragraph>
    <Paragraph position="25"> We have simplified this example somewhat for exposition: in the EdCam system, a dictionary entry contains fields other than the two shown here (the orthographic form followed by associated morphosyntactic information). Underline denotes a feature variable, whose name indicates the set of possible values taken by the feature. All Level 0 objects have terminological status in our corpus, thus we may safely mark wordtype directly on the mother. The feature suffixes is used as a subcategorisation frame whose value must unify with that of the affixed object. The feature makes indicates what category the affix turns the object it attaches to into. The value of this feature is the one that is percolated, via unification of variable values, to the mother node, to give it its category specification. Subcategorisation and makes information is stored in the morphosyntax field of an affix's lexical entry. Our suffixing rules are basically all of this form, with variants to take care of suffixation at different Levels. There are several rules that take care of mapping between Levels 0 and 1, as in the above example. With prefixes, which are typically not category changing, we have a three-way unification. The compound feature is used at two levels, the neoclassical level and the native level. Compounds are assigned one syntactic parse only, a left branching one, to avoid problems with overgeneration. A top-level filter takes care of allowing only wordforms that are potential terms to be passed out as results: ((category _any-cat)(bound -)(level _1-2-or-3)(wordtype term)). Note that no Level 0 objects can be so output and that each object must be unbound, have a major lexical category (not suffix, prefix or comb) and be of wordtype term.</Paragraph>
  </Section>
</Paper>