<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0309">
  <Title>Biomedical Text Retrieval in Languages with a Complex Morphology</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Morphological Analysis for Medical IR
</SectionTitle>
    <Paragraph position="0"> Morphological analysis for IR has requirements which differ from those for NLP proper. Accordingly, the decomposition units vary, too. Within a canonical NLP framework, linguistically significant morphemes are chosen as nondecomposable entities and defined as the smallest content-bearing (stem) or grammatically relevant units (affixes such as prefixes, infixes and suffixes). As an IR alternative, we here propose subwords (and grammatical affixes) as the smallest units of morphological analysis. Subwords differ from morphemes only if the meaning of a combination of linguistically significant morphemes is (almost) equal to that of another nondecomposable medical synonym. In this way, subwords preserve a sublanguage-specific composite meaning that would get lost if they were split up into their constituent morpheme parts.</Paragraph>
    <Paragraph position="1"> Hence, we trade linguistic atomicity against medical plausibility and claim that the latter is beneficial for boosting the system's retrieval performance. For instance, a medically justified minimal segmentation of 'diaphysis' into 'diaphys|is' will be preferred over a linguistically motivated one ('dia|phys|is'), because the first can be mapped to the quasi-synonym stem 'shaft'. Such a mapping would not be possible with the overly unspecific morphemes 'dia' and 'phys', which occur in numerous other contexts as well (e.g., 'dia|gnos|is', 'phys|io|logy'). Hence, a decrease of the retrieval system's precision would be highly likely due to over-segmentation of semantically opaque compounds. Accordingly, we distinguish the following decomposition classes: Subwords like {'gastr', 'hepat', 'nier', 'leuk', 'diaphys', ...} are the primary content carriers in a word. They can be prefixed, linked by infixes, and suffixed. As a particularity of the German medical language, proper names may appear as part of complex nouns (e.g., 'Parkinson|verdacht' ['suspicion of Parkinson's disease']) and are therefore included in this category.</Paragraph>
    <Paragraph position="2"> Short words, with four characters or less, like {'ion', 'gene', 'ovum'}, are classified separately, applying stricter grammatical rules (e.g., they cannot be composed at all). Their stems (e.g., 'gen' or 'ov') are not included in the dictionary in order to prevent false ambiguities. The price one has to pay for this decision is the inclusion of derived and composed forms in the subword dictionary (e.g., 'anion', 'genet', 'ovul'). Acronyms such as {'AIDS', 'ECG', ...} and abbreviations (e.g., 'chron.' [for 'chronical'], 'diabet.' [for 'diabetical']) are nondecomposable entities in morphological terms and do not undergo any further morphological variation, e.g., by suffixing.</Paragraph>
    <Paragraph position="3"> Prefixes like {'a-', 'de-', 'in-', 'ent-', 'ver-', 'anti-', ...} precede a subword.</Paragraph>
    <Paragraph position="4"> Infixes (e.g., '-o-' in 'gastr|o|intestinal', or '-s-' in 'Sektion|s|bericht' ['autopsy report']) are used as a (phonologically motivated) 'glue' between morphemes, typically as a link between subwords.</Paragraph>
    <Paragraph position="5"> Derivational suffixes such as {'-io-', '-ion-', '-ie-', '-ung-', '-itis-', '-tomie-', ...} usually follow a subword.</Paragraph>
    <Paragraph position="6"> Inflectional suffixes like {'-e', '-en', '-s', '-idis', '-ae', '-oris', ...} appear at the very end of a composite word form, following the subwords or derivational suffixes.</Paragraph>
    <Paragraph position="7"> Prior to segmentation, a language-specific orthographic normalization step is performed. It maps the German umlauts 'ä', 'ö', and 'ü' to 'ae', 'oe', and 'ue', respectively, translates 'ca' to 'ka', etc. As of January 2002, the morphological segmentation procedure for German incorporates a subword dictionary composed of 4,648 subwords, 344 proper names, and an affix list composed of 117 prefixes, 8 infixes and 120 (derivational and inflectional) suffixes, making up 5,237 entries in total. 186 stop words are not used for segmentation. In terms of domain coverage, the subword dictionary is adapted to the terminology of clinical medicine, including scientific terms, clinicians' jargon and popular expressions.</Paragraph>
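The orthographic normalization step described above can be sketched as a small rewrite-rule pass. This is a hypothetical illustration, not the system's actual rule table: the rule list below (including the 'ß' mapping) is an assumption beyond the umlaut and 'ca'/'ka' rules stated in the text.

```python
# Hedged sketch of language-specific orthographic normalization.
# The rule table is illustrative; only the umlaut and 'ca'->'ka'
# rules are stated in the paper, the rest are assumptions.
RULES = [("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss"), ("ca", "ka")]

def normalize(word: str) -> str:
    """Apply rewrite rules to a token before segmentation."""
    word = word.lower()
    for source, target in RULES:
        word = word.replace(source, target)
    return word
```

Applying the same normalization to both queries and documents keeps the index terms comparable regardless of spelling variants.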
    <Paragraph position="8"> The subword dictionary is still in an experimental stage and needs ongoing maintenance. Subword entries that are considered strict synonyms are assigned a shared identifier. This thesaurus-style extension is particularly directed at foreign-language (mostly Greek or Latin) translations of source-language terms, e.g., German 'nier' EQ Latin 'ren' (EQ English 'kidney'), as well as at stem variants.</Paragraph>
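The shared-identifier scheme above amounts to a lookup from subwords to synonym-class ids. A minimal sketch, in which the class ids ('C001', ...) and the synonym table are invented for illustration:

```python
# Toy illustration of the thesaurus-style extension: strict synonyms
# share one class identifier that replaces them as the index term.
# The ids and entries below are hypothetical, not the system's data.
SYNONYM_CLASSES = {"nier": "C001", "ren": "C001", "kidney": "C001",
                   "leuk": "C002"}

def index_term(subword: str) -> str:
    """Map a subword to its synonym-class id, or keep the subword itself."""
    return SYNONYM_CLASSES.get(subword, subword)
```

Indexing class ids instead of subwords makes German 'nier' and Latin 'ren' match the same documents without query-time expansion.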
    <Paragraph position="9"> The morphological analyzer implements a simple word model using regular expressions and processes input strings following the principle of 'longest match' (both from the left and from the right). It performs backtracking whenever recognition remains incomplete. If a complete recognition cannot be achieved, the incomplete segmentation results, nevertheless, are considered for indexing. In case the recognition procedure yields alternative complete segmentations for an input word, they are ranked according to preference criteria, such as the minimal number of stems per word, minimal number of consecutive affixes, and relative semantic weight.2</Paragraph>
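The longest-match-with-backtracking principle can be sketched as follows. This is a simplified sketch under stated assumptions: it matches from the left only, uses a toy lexicon, and omits the right-to-left pass, the ranking of alternative segmentations, and the indexing of incomplete results that the analyzer described above performs.

```python
# Minimal sketch of left-to-right 'longest match' segmentation with
# backtracking, assuming a toy subword/affix lexicon.
LEXICON = {"gastr", "o", "intestinal", "dia", "diaphys", "is", "phys"}

def segment(word, lexicon=LEXICON):
    """Return one segmentation, preferring the longest prefix at each step;
    None signals incomplete recognition."""
    if word == "":
        return []
    # Try candidate prefixes from longest to shortest (backtracking on failure).
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in lexicon:
            rest = segment(word[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None
```

With this lexicon, 'diaphysis' segments into the medically preferred 'diaphys' + 'is' rather than 'dia' + 'phys' + 'is', because the longer prefix wins.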
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experimental Setting
</SectionTitle>
    <Paragraph position="0"> As document collection for our experiments we chose the CD-ROM edition of MSD, a German-language handbook of clinical medicine (MSD, 1993). It contains 5,517 handbook-style articles (about 2.4 million text tokens) on a broad range of clinical topics using biomedical terminology.</Paragraph>
    <Paragraph position="1"> In our retrieval experiments we tried to cover a wide range of topics from clinical medicine. Due to the importance of searching health-related contents both for medical professionals and the general public we collected two sets of user queries, viz. expert queries and layman queries.</Paragraph>
    <Paragraph position="3"> some semantically important suffixes, such as '-tomie' ['-tomy'] or '-itis'; ω=1 is assigned to prefixes and derivational suffixes; ω=0 holds for inflectional suffixes and infixes.</Paragraph>
    <Paragraph position="4"> Expert Queries. A large collection of multiple-choice questions from the nationally standardized year 5 examination questionnaire for medical students in Germany constituted the basis of this query set. Out of a total of 580 questions, we selected 210 that explicitly address clinical issues (in conformance with the range of topics covered by the MSD). We then asked 63 students (between the 3rd and 5th study year) from our university's Medical School, during regular classroom hours, to formulate free-form natural language queries in order to retrieve documents that would help in answering these questions, assuming an ideal search engine.</Paragraph>
    <Paragraph position="5"> Acronyms and abbreviations were allowed, but the length of each query was restricted to a maximum of ten terms. Each student was assigned ten topics at random, so we ended up with 630 queries from which 25 were randomly chosen for further consideration (the set contained no duplicate queries).</Paragraph>
    <Paragraph position="6"> Layman Queries. The operators of a German-language medical search engine (http://www.dr-antonius.de/) provided us with a set of 38,600 logged queries.</Paragraph>
    <Paragraph position="7"> A random sample (n=400) was classified by a medical expert as to whether the queries contained medical jargon or the wording of laymen.</Paragraph>
    <Paragraph position="8"> Only those queries which were unequivocally classified as layman queries (through the use of non-technical terminology) ended up in a subset of 125 queries, from which 27 were randomly chosen for our study.</Paragraph>
    <Paragraph position="9"> The judgments for identifying relevant documents in the whole test collection (5,517 documents) for each of the 25 expert and 27 layman queries were carried out by three medical experts (none of them was involved in the system development). Given such a time-consuming task, we investigated only a small number of user queries in our experiments.</Paragraph>
    <Paragraph position="10"> This also elucidates why we did not address inter-rater reliability. The queries and the relevance judgments were hidden from the developers of the sub-word dictionary.</Paragraph>
    <Paragraph position="11"> For an unbiased evaluation of our approach, we used a home-grown search engine (implemented in Python). It crawls text/HTML files, produces an inverted file index, and assigns salience weights to terms and documents based on a simple tf-idf metric. The retrieval process relies on the vector space model (Salton, 1989), with the cosine measure expressing the similarity between a query and a document. The search engine produces a ranked output of documents.</Paragraph>
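The tf-idf weighting and cosine similarity at the core of such an engine can be sketched compactly. This is a generic sketch of the vector space model, not the engine's actual code; the exact tf-idf variant used is an assumption.

```python
# Hedged sketch of tf-idf weighting and cosine similarity in the
# vector space model; the weighting variant (raw tf times ln(N/df))
# is an assumption, not necessarily the engine's exact formula.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> list of {term: weight} vectors."""
    df = Counter(t for doc in docs for t in set(doc))
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Ranking then amounts to sorting documents by their cosine score against the query vector.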
    <Paragraph position="12"> We also incorporate proximity data, since this information becomes particularly important in the segmentation of complex word forms. So a distinction must be made between a document containing 'append|ectomy' and 'thyroid|itis' and another one containing 'append|ic|itis' and 'thyroid|ectomy'. Our proximity criterion assigns a higher ranking to adjacent and a lower one to distant search terms. This is achieved by an adjacency offset which is added to the cosine measure of each document. For a query q consisting of n terms, q = (t1, t2, ..., tn), the offset takes into account the minimal distance between each pair of terms (ti, tj) in a document.</Paragraph>
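The proximity criterion can be illustrated as follows. The minimal-distance computation matches the description above, but the mapping from distances to an offset (averaging reciprocal distances over term pairs) is an assumed stand-in, since the paper's exact offset formula is not reproduced here.

```python
# Illustrative proximity scoring. min_distance follows the text above;
# adjacency_offset uses an ASSUMED formula (mean of 1/mindist over all
# query-term pairs), not the paper's exact adjacency offset.
def min_distance(positions_a, positions_b):
    """Smallest absolute gap between any occurrence pair of two terms."""
    return min(abs(i - j) for i in positions_a for j in positions_b)

def adjacency_offset(doc_tokens, query_terms):
    """Score added to the cosine measure: higher for adjacent terms."""
    pos = {t: [i for i, tok in enumerate(doc_tokens) if tok == t]
           for t in query_terms}
    pairs = [(a, b) for a in query_terms for b in query_terms if a < b]
    pairs = [(a, b) for a, b in pairs if pos[a] and pos[b]]
    if not pairs:
        return 0.0
    return sum(1.0 / min_distance(pos[a], pos[b])
               for a, b in pairs) / len(pairs)
```

Under this sketch, a document with 'append' directly followed by 'ectomy' receives a larger offset than one where the two subwords are several tokens apart.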
    <Paragraph position="14"> We distinguished four different conditions for the retrieval experiments, viz. plain token match, trigram match, plain subword match, and subword match incorporating synonym expansion: Plain Token Match (WS). A direct match between text tokens in a document and those in a query is tried. No normalizing term processing (stemming, etc.) is done prior to indexing or evaluating the query. The search was run on an index covering the entire document collection (182,306 index terms).</Paragraph>
    <Paragraph position="15"> This scenario serves as the baseline for determining the benefits of our approach.3 Trigram Match (TG). As an alternative lexicon-free indexing approach (which is more robust relative to misspellings and suffix variations) we considered each document and each query indexed by all of their substrings with character length '3'.</Paragraph>
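Trigram extraction for the TG condition is straightforward; a minimal sketch, where the handling of tokens shorter than three characters is an assumption not specified in the text:

```python
# Sketch of lexicon-free trigram indexing: every character substring
# of length 3 of a token becomes an index term. Keeping tokens shorter
# than three characters whole is an assumption.
def trigrams(token):
    """All character 3-grams of a token."""
    if len(token) < 3:
        return [token]
    return [token[i:i + 3] for i in range(len(token) - 2)]
```

Because misspellings and suffix variants still share most of their trigrams, this index degrades gracefully where exact token match fails.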
    <Paragraph position="16"> Subword Match (SU). We created an index building upon the principles of the subword approach as described in Section 2. Morphological segmentation yielded a shrunk index, with 39,315 index terms remaining. This equals a reduction rate of 78% compared with the number of text types in the collection.4 Synonym-Enhanced Subword Match (SY). Instead of subwords, synonym class identifiers which stand for several subwords are used as index terms. The following add-ons were supplied for further parametrizing the retrieval process: Orthographic Normalization (O). In a preprocessing step, orthographic normalization rules (cf. Section 2) were applied to queries and documents.</Paragraph>
    <Paragraph position="17"> Adjacency Boost (A). Information about the position of each index term in the document (see above) is made available for the search process.</Paragraph>
    <Paragraph position="18"> Table 1 summarizes the different test scenarios.</Paragraph>
  </Section>
</Paper>