<?xml version="1.0" standalone="yes"?> <Paper uid="A92-1019"> <Title>DILEMMA-2: A LEMMATIZER-TAGGER FOR MEDICAL ABSTRACTS</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> DILEMMA-2: A LEMMATIZER-TAGGER FOR MEDICAL ABSTRACTS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Hans Paulussen Facultés Universitaires Notre-Dame de la Paix, </SectionTitle> <Paragraph position="0"> rue de Bruxelles 61, B-5000 Namur, Belgium</Paragraph> </Section> </Section> <Section position="2" start_page="0" end_page="142" type="metho"> <SectionTitle> Abstract 0 Introduction </SectionTitle> <Paragraph position="0"> This paper reports on the development of DILEMMA-2*, a lemmatizer-tagger for the sublanguage of medical abstracts. The program is an extension of DILEMMA-1, a lemmatizer-tagger for general English texts.</Paragraph> <Paragraph position="1"> In the first section a brief outline is given of DILEMMA-1. Particular attention is paid to the original concept of a default category which is linked with a categorial graph by means of a pointer system. In the second section we show why DILEMMA-1 was not able to achieve a suitable score when lemmatizing medical abstracts, the main reason being its inability to recognize sublanguage-specific vocabulary. In the next section a description is given of the most important errors along with their solutions; these errors are categorized as gaps or wrong assignments. The former could be dealt with in either a suffix list or a gaps filler default. 
The latter mainly concerned wrongly assigned past participles and errors in noun, verb or adjective assignment.</Paragraph> <Paragraph position="2"> After implementation of the proposed solutions, a comparison is made between the results of DILEMMA-1 and DILEMMA-2, showing that the results of DILEMMA-1 have been improved substantially within a sublanguage context by using linguistic (i.e. sublanguage) knowledge rather than ad hoc remedies.</Paragraph> <Paragraph position="3"> DILEMMA-2 was developed as part of a research contract for Elsevier Science Publishers (ESP), Amsterdam, The Netherlands. The development of DILEMMA-1 was carried out as part of contract research for Van Dale Lexicografie Publishers, Utrecht, The Netherlands. In this paper we describe DILEMMA-2, a lemmatizer-tagger for medical abstracts, which is an updated version of DILEMMA-1, a lemmatizer-tagger for general texts. After a brief outline of DILEMMA-1 we give a description of the types of errors we found when running the general lemmatizer on medical abstracts. This is followed by some examples of the solutions we proposed and implemented in DILEMMA-2. Finally, the results of DILEMMA-1 and DILEMMA-2 are compared, showing that a sublanguage approach can lead to workable results in the [...] general English texts, developed at the University of Antwerp during the academic year 1985-1986 (see [MARTIN 88b]). For each word of the text it tries to find its lemma (or dictionary entry form) and its grammatical category, and subcategories (or specifiers) where necessary. Being a lemmatizer, not a parser, DILEMMA-1 is limited to a relatively basic level of syntactic analysis, whose output can, however, be used as input to a more powerful syntactic analyzer. In this way, a lemmatizer can be considered an invaluable tool for corpus linguistics. To carry out the task of assigning grammatical categories and possible specifiers, DILEMMA-1 looks at word forms from four different points of view. 
First of all, word forms are looked at out of context (dictionary lookup, morphological procedures). In a second step the immediate context is taken into account: word forms are analyzed and checked by looking at the words immediately preceding and following them. In a third step, the protosyntactic module, a larger context (such as verb patterns) is taken into account. Finally, in the temporary memory, word forms are checked by looking at the whole text.</Paragraph> <Paragraph position="4"> Like most modern lemmatizers, DILEMMA-1 uses as much linguistic knowledge as possible, by translating any regularity on the lexico-morphological level into rules, thus keeping the dictionary small. But in the case of DILEMMA-1 the size of the dictionary is extremely small when compared to other lemmatizer-taggers: a little over 3600 words. For a comparison between DILEMMA-1 and CLAWS, another well-known lemmatizer-tagger for English, see [MARTIN 88b]. In passing it can be noted that DILEMMA-1 uses a dictionary only half the size of that used in CLAWS. This smallness is due to criteria adopted on the macro- and micro-level of the dictionary. The vertical macro-level is concerned with the words to select as entries, whereas the horizontal micro-level deals with the information to store next to each dictionary entry.</Paragraph> <Paragraph position="6"> On the macro-level, the dictionary entries are selected according to the following three principles: frequency, closure (of classes, e.g. prepositions) and irregularity (e.g.</Paragraph> <Paragraph position="7"> irregular verbs, irregular plurals). In other words, the dictionary only contains words which are frequent, which belong to closed classes, or which cannot be deduced from the grammar of the English language. To construct the list of frequent words, we used Van Dale's English-Dutch Dictionary ([MARTIN 89]), where very frequent words are labeled F4 or F3. 
These frequency codes were the result of an earlier research project ([MARTIN 83], [MARTIN 88a]).</Paragraph> <Paragraph position="8"> On the micro-level, categorial information is stored preferentially. Each word is given a preferential default category which can shift to other categories along a categorial graph manipulated by the program (see Fig. 1).</Paragraph> <Paragraph position="9"> This way of storing categorial information is based on the fact that DILEMMA-1 also looks for regularity in the categories words can have, and that is what makes it so different from other lemmatizers.</Paragraph> <Paragraph position="10"> Table 1: part of the DILEMMA dictionary
wordform | lemma | DC | ptr | specifiers
king | king | noun | l |
kiss | kiss | verb | r |
kit | kit | noun | l |
kitchen | kitchen | noun | n |
knee | knee | noun | l |
kneel | kneel | verb | n |
knelt | kneel | verb | n | pastpapa
knew | know | verb | n | past
Being a morphologically poor language, English has a large number of grammatical homonyms. DILEMMA-1 starts from the assumption (i) that English words can have different categories, (ii) that each word has a default category (DC), and (iii) that the necessary categorial shifts can be systematized. The DC, which is the main category of a word, is established on the basis of frequency, analogy and/or meaning. Next to a DC, each word in the dictionary (see Table 1) has a pointer (left, right or neither) indicating the direction in which a category can shift through a categorial graph which was established after calculating the combination and frequency of categories. The word &quot;kiss&quot;, for example, has the category 'verb' as DC, and can shift 'right' to the 'noun' category. All categorial shifts are guided by condition-action rules in the rule component of the DILEMMA-1 program. Note also that the categories 'numeral' and 'interjection' are not integrated in the graph. The numeral only has a predecessor, viz. noun; the interjection has neither a predecessor nor a successor. 
[Table 2, fragment]
participating | verb | ing
in | prep
the | det | art
Geneva | noun | prop
conference | noun | sg
The use of categorial information and pointers changes the dictionary into an economical and dynamic set of lexemes. This is perhaps the most striking feature of the modular architecture of DILEMMA-1, and it also explains why the program can run even within a PC environment. For a fuller account of the DILEMMA-1 program we refer to [MARTIN 88b].</Paragraph> <Paragraph position="12"> An output sample of DILEMMA-1 is shown in Table 2, which is a sentence from the BROWN corpus (see [KUCERA 67]). The first column is the text, the second is the lemmatized form, and the following columns give the category and possible specifiers. When a word is not recognized, or when recognition is doubtful, it is flagged by a double asterisk.</Paragraph> <Paragraph position="13"> DILEMMA-1 was tested on a number of general language text samples and proved to be a very powerful tool. A sample error analysis on 6 texts taken from a standard British English corpus (the LOB corpus [JOHANSSON 78]) shows, for example, that for general language texts DILEMMA-1's success rate does not drop below 90%, nor does it exceed 97%, on average leading to a [scoring reference point of 93.50%]. Nevertheless, when DILEMMA-1 was tested on a number of medical abstracts, its scoring reference point of 93.50% was not reached at all. 'Best results' were more likely to lie within the 90% area, the average being about 86% (see Table 6). The object of this research project was to restore the success rate for lemmatizing medical abstracts without changing the philosophy behind the DILEMMA-1 program, which was developed as a robust, preferential, dynamic system in which items can take different values governed by constraints. 
Moreover, in a language such as English, categories are often functional instead of lexical (which explains, in part, the small size of the lexicon).</Paragraph> </Section> <Section position="3" start_page="142" end_page="142" type="metho"> <SectionTitle> 2 A Sublanguage Approach </SectionTitle> <Paragraph position="0"> When running the DILEMMA-1 program on medical abstracts, we found that most errors are related to the sublanguage of medical abstracts. For example, most gaps in the output are due to a lack of sublanguage-specific vocabulary in the DILEMMA-1 dictionary: e.g. astrocyte, fibrillary, acidic, GFAP. Another point which supports the idea of sublanguage influence is that the more abstracts resemble general language texts, the closer their results come to the general language range. In an extreme case there was only one error in a text of 42 words (success rate = 97.62%). Very unlike the average medical abstract, this text contained no symbols or abbreviations, and it had short, non-complex sentence structures. For a fuller account of lexical differences between sublanguage and general language lexicons, see [MCNAUGHT 91].</Paragraph> <Paragraph position="1"> An example showing that the sublanguage features are not solely confined to the lexical level is the following sentence, where 'counts' (which can be either 'verb' or 'noun') must shift from 'verb' to 'noun' when found at the beginning of a sentence: e.g. Counts of neocortical cells did not reveal differences in cell numbers.</Paragraph> <Paragraph position="2"> This categorial shift is a sublanguage shift, as categorial and syntactic ambiguity does not exist here, i.e. 
sentence-initial verbal constructions such as imperatives and questions do not occur in the sublanguage of medical abstracts.</Paragraph> <Paragraph position="3"> To improve the DILEMMA-1 program, we not only tackled the problems from a sublanguage approach, but also decided to implement all program adaptations in a separate module which the user can call up whenever required. Such a modular architecture makes it easier to adapt the program to another sublanguage.</Paragraph> <Paragraph position="4"> Although not new, the sublanguage approach is being adopted more and more in the implementation of real-world applications, where computational linguists are constantly confronted with the question of how to organize the vast amount of world knowledge. The domains can be very diverse, as can be seen in the examples of [CHEVALIER 78] (automatic translation of weather forecasts), [DEVILLE 89] (automatic man-machine dialogue system handling requests for administrative information) and [PALMER 90] (physics world problems for college students involving pulley systems). Only by strictly defining the limits of the application domain can one write programs without having to resort to brute force techniques.</Paragraph> <Paragraph position="5"> Even if we stress the sublanguage-specific character of the errors in the DILEMMA-1 output, there were of course also a number of general errors, most of which could not be solved within the DILEMMA-1 framework, which presupposes no (clause) syntactic knowledge. In the rest of this paper we will focus our attention on the modifications added at the sublanguage level.</Paragraph> </Section> <Section position="4" start_page="142" end_page="144" type="metho"> <SectionTitle> 3 DILEMMA-2 </SectionTitle> <Paragraph position="0"> DILEMMA-2 is the result of the corrections we made to DILEMMA-1 within the context of medical abstracts. 
The errors encountered were either gaps or wrong assignments (see Table 6 at the end of this section).</Paragraph> <Section position="1" start_page="143" end_page="143" type="sub_section"> <SectionTitle> 3.1 Gaps </SectionTitle> <Paragraph position="0"> As explained in section 2, most gaps were sublanguage-specific words. Putting the missing scientific terms in the dictionary was not considered a good solution: this would have been against the basic principle of the DILEMMA concept, which was to keep the dictionary as small as possible; in any case, it would have been a practically impossible task (ESP has a database of more than 100,000 scientific terms). Insofar as the missing terms showed some regularity at the lexico-morphological and syntactic level, they could be dealt with outside the dictionary, by using a sublanguage-specific suffix list and a sublanguage-specific gaps filler default. The suffix list was arrived at by considering medical sublanguage from a broad point of view, situating it within the functional domain of scientific writing. Table 4 gives a sample of scientific suffixes which have categorial power and which are typical for the formation of medical terms in a broad sense of the term.</Paragraph> <Paragraph position="1"> As to the default for the remaining gaps, it became apparent from the sample analysis that medical texts, like most scientific texts, heavily nominalize. As a result, we expect remaining gaps to be (part of) NPs. Given the contents of the existing DILEMMA dictionary, this leads us to a choice between (predominantly) nouns and adjectives. Therefore, only in a last module is DILEMMA-2 allowed to fill out all remaining gaps as nouns unless these gaps: (a) occur in prototypical patterns for adjectives such as [...] (pron is mentioned here, because it has not been shifted to [...] unless followed by Noun, then they are shifted to Adjective. 
[...] These are typical endings which can also yield adjectives and/or verbs.</Paragraph> <Paragraph position="2"> In these cases the gaps remain unfilled and are flagged for further processing by a higher module, such as a clause module or a syntactic parser.</Paragraph> </Section> <Section position="2" start_page="143" end_page="144" type="sub_section"> <SectionTitle> 3.2 Wrong Assignments </SectionTitle> <Paragraph position="0"> Although wrong assignments are far less frequent than gaps, they can be important insofar as they can give rise to wrong results in further processing (e.g. in establishing NPs), and insofar as they are no longer easily recoverable. The most important of these errors are related either to cases where simple past and past participle are difficult to distinguish, or to errors concerning noun, verb or adjective assignment. Another problem, which we will not deal with in this paper, is the ample use of differently structured abbreviations, such as: MH, VAHR, b.i.d., mRna. These were handled by an abbreviations procedure.</Paragraph> <Paragraph position="1"> A major problem, well known in English tagging and so not restricted to medical texts, is the wrong specification of verbs which can be either simple past or past participle. As long as the specifier is not disambiguated, it is referred to as PAST_PAPA. In the following example both 'revealed' and 'increased' were assigned PAST, whereas 'increased' should have been assigned PAPA: e.g. Western blot analysis revealed increased levels of GFAP in Mo(br/y) forebrain and cerebellum.</Paragraph> <Paragraph position="2"> In the context of another ESP project on automatic indexing of medical abstracts, it is important to correctly delineate NPs and therefore to recognize PAPAs. However, within the framework of a lemmatizer-tagger this is not an easy task. Again, a sublanguage approach is of great help here. 
Attributive PAPAs in our sublanguage occur much more often than in general English texts. Consequently, in some very strictly defined contexts we could partly disambiguate the PAST_PAPA problem, as in the following three examples: (a) When a PAST_PAPA is preceded by an ING-form and followed by a noun, select PAPA: e.g. physicians are expressing increased willingness.</Paragraph> <Paragraph position="3"> (b) When a PAST_PAPA is preceded by a noun and followed by a preposition or a particle, select PAPA: e.g. cells isolated from ...</Paragraph> <Paragraph position="4"> (c) When a PAST_PAPA is found at the beginning of a sentence, select PAPA: e.g.</Paragraph> <Paragraph position="5"> Affected males suffer profound deficits in oxidative metabolism.</Paragraph> <Paragraph position="6"> Each case was implemented in a condition-action rule, so that example (a), when written as a C function, looks as follows:</Paragraph> <Paragraph position="8"> This function states that if the specifier of the selected word is C_past, and it has an alternative specifier C_papa, and the specifier of the preceding word is C_ing, and the category of the next word is C_noun, then change the specifier of the selected word into C_papa and eliminate the alternative specifier. As stated above, our proposal of PAST_PAPA rules was based on observations of the Elsevier corpus of medical abstracts at our disposal. We found that the contexts in which wrongly coded attributive PAPAs occur can, as a rule, be characterized as: [...] (e.g. Affected males suffer profound deficits ...).</Paragraph> <Paragraph position="9"> In the case of errors concerning noun, verb or adjective assignment, we encountered similar context problems, and again, only in very strict contexts could a rule be applied.</Paragraph> </Section> </Section></Paper>