<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1140"> <Title>High-Performance Tagging on Medical Texts</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Medical Tagging with Off-the-shelf </SectionTitle> <Paragraph position="0"> Technology For the first series of experiments, we chose two representatives of the currently prevailing data-driven tagging approaches, Brill's rule-based tagger (Brill, 1995) and TNT, a statistical tagger (Brants, 2000). As we are primarily concerned with German-language input, we used the German rule extension package for Brill's tagger, which was originally developed on English data. TNT, on the other hand, is based on a statistical model and is therefore basically language-independent. It implements the Viterbi algorithm for second-order Markov models (Brants, 2000), in which states of the model represent tags and the output represents words. The best POS tag for a given word is determined by the highest probability of its occurrence given the two previous tags. Tags for unknown words are assigned by a probabilistic suffix analysis; smoothing is done by linear interpolation.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Experiment 1: Medical Tagging with </SectionTitle> <Paragraph position="0"> Standard Tagset Trained on NEGRA The German default version of TNT was trained on NEGRA, the largest publicly available manually annotated German newspaper corpus (composed of 355,095 tokens and POS-tagged with the general-purpose STTS tagset; cf. Skut et al. (1997)).
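As a concrete illustration of the trigram Viterbi scheme just described, the following sketch decodes a tag sequence from hand-set transition and emission tables. All names and probabilities are illustrative, not TnT's actual NEGRA estimates; TnT proper additionally smoothes the trigram probabilities by linear interpolation and handles unknown words via suffix analysis.

```python
def viterbi_trigram(words, tagset, trans, emit):
    """Second-order (trigram) Viterbi decoding, in the spirit of TnT.

    trans[(t1, t2, t3)] approximates P(t3 | t1, t2);
    emit[(tag, word)] approximates P(word | tag).
    Unseen events get a tiny floor instead of proper smoothing.
    """
    chart = {("BOS", "BOS"): 1.0}   # best prob. of a path ending in this tag pair
    back = []                        # back[i][(t_prev, t_cur)] = tag two steps back
    for w in words:
        nxt, ptr = {}, {}
        for (t1, t2), p in chart.items():
            for t3 in tagset:
                q = p * trans.get((t1, t2, t3), 1e-6) * emit.get((t3, w), 1e-6)
                if q > nxt.get((t2, t3), 0.0):
                    nxt[(t2, t3)], ptr[(t2, t3)] = q, t1
        chart = nxt
        back.append(ptr)
    pair = max(chart, key=chart.get)         # best final tag pair
    seq = list(pair)
    for i in range(len(words) - 1, 1, -1):   # follow the back-pointers
        seq.insert(0, back[i][(seq[0], seq[1])])
    return seq[-len(words):]
```

For a two-word clause such as "der Befund" with DET/NN tables, the decoder recovers the expected DET-NN sequence; in a real tagger the tables are estimated from an annotated corpus such as NEGRA.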
The Brill tagger comes with an English default version also trained on general-purpose language corpora like the PENN TREEBANK (Marcus et al., 1993).</Paragraph> <Paragraph position="1"> In order to compare the performance of both taggers on German data, the Brill tagger was retrained on the German NEGRA newspaper corpus, with the parameters recommended in the training manual.</Paragraph> <Paragraph position="2"> In a second round, we set aside a subset of a newly developed German-language medical corpus (21,000 tokens in 1,800 sentences). We here refer to this text corpus as the FRAMED subset and describe its superset, FRAMED (Wermter and Hahn, 2004), in more depth in Section 4.1. Three human taggers, trained on the STTS tagset and on the guidelines used for tagging the NEGRA corpus, annotated the FRAMED subset according to NEGRA standards. The interrater reliability for this part of the manual annotation was 96.7% (standard deviation: 0.6%), based on a random sample of 2000 tokens (10% of the evaluation corpus).</Paragraph> <Paragraph position="3"> The performance of both taggers, TNT and Brill, with their NEGRA newspaper-trained parameterization was then measured on the FRAMED subset. In addition, since both TNT and Brill allow the inclusion of an external backup lexicon, their performance was also measured by plugging in two such medical backups.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Results from Medical Tagging with </SectionTitle> <Paragraph position="0"> Standard Tagset Trained on NEGRA We measured tagging accuracy by the ratio of the number of correct POS assignments to text tokens (as defined by the gold standard, viz.
the manually annotated corpus) and the number of all POS assignments to text tokens from the test set.</Paragraph> <Paragraph position="1"> Table 1 reveals that the n-gram-based TNT tagger outperforms the rule-based Brill tagger on the FRAMED subset medical corpus, both being trained on the NEGRA newspaper corpus. The inclusion of a small medical backup lexicon (composed of 171 entries which cover the most frequently mistagged tokens, such as measure units, Latinate medical terms, abbreviations, etc.) boosted TNT's performance to 96.7%, which is on a par with the state-of-the-art performance of taggers on newspaper texts. A much more comprehensive medical backup lexicon, which contained the first one plus the German Specialist Lexicon, a very large repository of domain-specific medical terms (totalling 95,969 entries), had, much to our surprise, almost no effect on the tagging results.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> TNT BRILL </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> The results for the German version of Brill's tagger, both its default version (91.9%) and the lexicon add-on (93.4%), are still considerably better than those of its English default version reported by Campbell and Johnson (2001) for English medical input (89.0%).</Paragraph> <Paragraph position="3"> 3 An Inquiry into Corpus Similarity The fact that an n-gram-based statistical POS tagger like TNT, trained on newspaper and tested on medical language data, falls 1.5% short of state-of-the-art performance figures may at first come as a surprise.</Paragraph> <Paragraph position="4"> It has been observed by Campbell and Johnson (2001) and Friedman and Hripcsak (1999), however, that medical language shows less variation and complexity than general, newspaper-style language.
Our second series of experiments, quantifying the grammatical differences and similarities between newspaper and medical language on the TNT-relevant POS n-gram level, may shed some explanatory light on the tagger's performance.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Experiment 2: Measuring Corpus Similarity </SectionTitle> <Paragraph position="0"> For this purpose, we collected a large medical document collection of mostly clinical texts (i.e., pathology, histology and surgery reports, discharge summaries). We refer to this collection (composed of 2480K tokens) as BIGMED. Next, we randomly split BIGMED into six subsamples of NEGRA size (355K tokens). This was meant to ensure statistically sound comparability and to break up the medical subgenres. The same procedure was repeated for a collection of German newspaper and newswire texts collected from the Web. All twelve samples (six medical ones, henceforth called MED, and six newspaper ones, henceforth called NEWS, also composed of 2480K tokens to ease partitioning) were then automatically tagged by TNT based on its newspaper-trained parameterization.</Paragraph> <Paragraph position="1"> Since NEGRA is the newspaper corpus on which the default version of TNT was trained, its statistical comparison with MED should elucidate the tagger's performance on medical texts without changing the training environment. Moreover, a parallel comparison with other newspaper texts (NEWS) may help in further balancing these results. Because TNT is a Markovian tagger based on tri-, bi- and unigram POS sequences, the statistics were based on the POS n-gram sequences in the different corpora. For this purpose, we extracted all POS trigram, bigram and unigram type sequences from NEGRA, MED, and NEWS. Their numbers are reported in Table 2 (see rows 1, 4 and 7). We then generated a distribution of these types based on three ranges of occurrence frequencies.
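The extraction of POS n-gram types and their grouping into three occurrence-frequency ranges can be sketched as follows. The function and band names are ours and purely illustrative; the cut-offs mirror the "fewer than ten" and "more than 1000" ranges used in the evaluation.

```python
from collections import Counter

def pos_ngram_types(tag_sequences, n):
    """Count occurrences of POS n-grams; the keys are the n-gram types."""
    counts = Counter()
    for tags in tag_sequences:
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts

def frequency_bands(counts, low=10, high=1000):
    """Group n-gram types into three occurrence-frequency ranges."""
    bands = {"rare": 0, "mid": 0, "frequent": 0}
    for c in counts.values():
        if c > high:               # more than 1000 occurrences
            bands["frequent"] += 1
        elif c >= low:             # between 10 and 1000 occurrences
            bands["mid"] += 1
        else:                      # fewer than 10 occurrences
            bands["rare"] += 1
    return bands
```

Running both functions over the tagged MED, NEWS, and NEGRA samples yields the type counts and the three-part distributions of the kind reported in Tables 2 and 3.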
The results are reported in Table 3.</Paragraph> <Paragraph position="2"> We then determined how many POS n-gram types were common between NEGRA and MED and common between NEGRA and NEWS (see Table 2, rows 2, 5 and 8). Each of these common POS n-gram types was subjected to a χ² test in order to measure whether their common occurrence in both corpora was just random (null hypothesis) or whether that particular n-gram was indicative of the similarity between the two corpora (i.e., between NEGRA and MED, on the one hand, and between NEGRA and NEWS, on the other hand). This interpretation of χ² statistics has already been evaluated against other corpus similarity measures and was shown to perform best (Kilgarriff, 2001), assuming a non-normal distribution (cf. also Table 3). The χ² metric sums the differences between observed and expected values in all squares of the table and scales them by the magnitude of the expected values. The number of all common significant POS n-grams (i.e., those whose values exceed the critical value of 3.841 at a probability level of α = 0.05) is indicative of the magnitude of corpus similarity. These results are reported in Table 2 (see rows 3, 6 and 9).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Results from Measuring Corpus Similarity </SectionTitle> <Paragraph position="0"> As shown in Table 2 (rows 1, 4 and 7), the number of unique POS n-gram types was considerably lower in MED.
Compared with NEGRA, MED had 29% fewer trigram types, 19% fewer bigram types and 4% fewer unigram types (i.e., POS tags), whereas NEWS even had slightly more types at all n-gram levels.</Paragraph> <Paragraph position="1"> This lower number of trigram and bigram types is also reflected in the three-part distribution in Table 3: The number of POS trigrams occurring fewer than ten times is almost one third lower in MED than in NEGRA or in NEWS; similarly, but less pronounced, this can be observed for POS bigrams. On the other hand, the number of trigram types occurring more than 1000 times is even higher for MED, and the number of bigram and unigram types is about the same when scaled against the total number of types. This indicates a rather high POS trigram and bigram type dispersion in newspaper corpora, whereas medical narratives appear to be more homogeneous.</Paragraph> <Paragraph position="2"> Table 2 (rows 2, 5 and 8) indicates that the number of POS trigram and bigram types common to both corpora was much smaller for the NEGRA-MED comparison than it was for NEGRA-NEWS. In other words, more of the NEGRA POS n-gram types appeared in the NEWS corpus as well, whereas far fewer showed up in the MED corpus. At this level of comparison, sublanguage differences clearly show up. Compared with the total number of POS n-gram types in each corpus, however, the common ones cover much more of the MED corpus than of the NEGRA corpus. The coverage for NEGRA and NEWS is about the same.</Paragraph> <Paragraph position="3"> The number of common POS n-gram types that are χ² significant (Table 2: rows 3, 6, and 9) shows the magnitude of corpus similarity.
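For one common n-gram type, the χ² computation over the 2x2 contingency table (occurrences of that n-gram vs. all other n-gram tokens, in each of the two corpora) can be sketched as follows. This is a plain rendering of the standard formula, not the authors' code.

```python
def chi_square_2x2(count_a, total_a, count_b, total_b):
    """Chi-square for one n-gram occurring count_a/total_a times in corpus A
    and count_b/total_b times in corpus B. Sums (O - E)^2 / E over the four
    cells; values above 3.841 are significant at alpha = 0.05 (1 d.o.f.)."""
    grand = total_a + total_b
    row = count_a + count_b                  # the n-gram's occurrences overall
    observed = [count_a, total_a - count_a, count_b, total_b - count_b]
    expected = [
        total_a * row / grand,
        total_a * (grand - row) / grand,
        total_b * row / grand,
        total_b * (grand - row) / grand,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

Identical relative frequencies in the two corpora yield χ² = 0, while strongly diverging frequencies push the statistic past the 3.841 critical value.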
For the common trigram types, it was almost four times higher in the NEGRA-MED comparison than for NEGRA-NEWS; for the common bigram types it was more than twice as high, and for the unigram types 20% higher.</Paragraph> <Paragraph position="4"> Finally, Table 4 shows that the top-ranked POS trigrams, bigrams and unigrams common to NEGRA and MED (columns 2 to 4) exhibit a strikingly different χ² magnitude compared to those common to NEGRA and NEWS (columns 5 to 7). This means that, in regard to their top POS n-grams, NEGRA and MED are highly similar, whereas NEGRA and NEWS are less so. Interestingly, for each n-gram level the top 5 ranks remain unchanged across all six NEGRA-MED comparisons, whereas they have a different ranking in almost each of the six NEGRA-NEWS comparisons. It seems as though the most characteristic similarities between medical sublanguage and newspaper language are highly consistent and predictable, whereas the intra-newspaper comparison shows weak and inconsistent similarities.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Tagging with Medical Resources </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 FRAMED, an Annotated Medical Corpus </SectionTitle> <Paragraph position="0"> FRAMED, the FReiburg Annotated MEDical corpus (Wermter and Hahn, 2004), combines a variety of relevant medical text genres focusing on clinical reports. The clinical text genres cover discharge summaries, pathology, histology and surgery reports. The non-clinical ones consist of medical expert texts (from a medical textbook) and health care consumer texts taken from the Web. It has already been mentioned that medical language, as used in these clinical documents, has some unique properties not found in newspaper genres.
Among these features are the use of Latin and Greek terminology (sometimes also mixed with the host language, here German), various ad hoc forms for abbreviations and acronyms, a variety of (sometimes idiosyncratically used) measure units, enumerations, and some others. These may not be marginal sublanguage properties and thus may have an impact on the quality of tagging procedures. In order to test this assumption, we enhanced the NEGRA-rooted STTS tagset with three dedicated tags which capture ubiquitous lexical properties of medical texts not covered by this general-purpose tagset, thus yielding the STTS-MED tagset.1 Our three student annotators then annotated the FRAMED medical corpus with the extended STTS-MED tagset. The mean of the inter-annotator consistency of this annotation effort was 98.4% (with a standard deviation of 0.6%).</Paragraph> <Paragraph position="1"> A look at the frequency ranking of the dedicated medical tags shows that they bear some relevance for annotating medical corpora. Out of the 54 tag types occurring in the FRAMED corpus, ENUM is ranked 14, LATIN is ranked 19, and FDSREF is ranked 33.</Paragraph> <Paragraph position="2"> In terms of absolute frequencies, all three additional tags account for 1613 (out of 100,141) tag tokens (ENUM: 866, LATIN: 560, FDSREF: 187). To test the overall impact of these three additional tags, we ran the default NEGRA-newspaper-based TNT on our FRAMED medical corpus and compared the resulting STTS tag assignments with those from the extended STTS-MED tagset. The additional tags accounted for only 24% of the differences between the two assignments (1613/6685). Hence, their introduction by no means fully explains any improved tagging results (compared with the reduced newspaper tagset). The other sublanguage properties mentioned above (e.g., abbreviations, acronyms, measure units etc.)
are already covered by the original tagset.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Experiment 3: Re-Training TNT on </SectionTitle> <Paragraph position="0"/> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> FRAMED </SectionTitle> <Paragraph position="0"> In a third series of experiments, we compared TNT's performance with respect to the general newspaper language and the medical sublanguage.</Paragraph> <Paragraph position="1"> For this purpose, the tagger was newly trained and tested on a random sample (100,198 tokens) of the NEGRA newspaper corpus with the standard STTS tagset, and, in parallel, re-trained and tested on the FRAMED medical corpus annotated with the extended STTS-MED tagset (whose dedicated medical tags are ENUM for enumerations, LATIN for Latinate terms, and FDSREF for reference patterns related to formal document structure). For this evaluation, we used learning curve values (see Table 5) that indicate the tagging performance when using training corpora of different sizes. Our experiments started with 5,000 tokens and ranged to the size of the entire corpus (minus the test set). At each size increment point, the overall accuracy, as well as the accuracies for known and unknown words, was measured, while also considering the percentage of unknown words.</Paragraph> <Paragraph position="2"> The tests were performed on random partitions of the corpora that use up to 90% as training set (depending on the training size) and 10% as test set. In this way, the test data was guaranteed to be unseen during training.
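One fold of this partitioning scheme might be set up as follows. This is an illustrative sketch; the authors' exact procedure is not specified beyond the 90%/10% split and the token budgets, and the function name is ours.

```python
import random

def one_fold(sentences, train_tokens, seed=0):
    """Hold out 10% of the sentences as the test set, then fill the training
    set from the remainder up to the requested token budget (sketch)."""
    rng = random.Random(seed)
    order = sentences[:]
    rng.shuffle(order)
    n_test = max(1, len(order) // 10)     # 10% of the sentences held out
    test, pool = order[:n_test], order[n_test:]
    train, used = [], 0
    for sent in pool:
        if used >= train_tokens:          # token budget reached
            break
        train.append(sent)
        used += len(sent)
    return train, test
```

Rotating the held-out 10% slice then gives the repeated evaluation runs whose accuracies are averaged.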
This process was repeated ten times, each time using a different 10% as the test set, and the single outcomes were then averaged.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Results from Medical Tagging with Medical Resources </SectionTitle> <Paragraph position="0"> Table 5 (columns 4-9) reveals that the FRAMED-trained TNT tagger outperforms the NEGRA-trained one at all training points and across all types of accuracies we measured. Trained with the largest possible training size (viz. 90,000 tokens), the tagger's overall accuracy for its FRAMED parameterization scores 98.0%, compared to 95.7% for its NEGRA parameterization. The performance differences between FRAMED and NEGRA range between 2.3 (at training points 90,000 and 70,000) and 3.3 percentage points (at training point 5,000). The tagging accuracy for known tokens is higher for FRAMED than for NEGRA (98.7% vs. 97.6% at training point 90,000). The differences here are less pronounced, ranging from 1.0 to 1.3 percentage points.</Paragraph> <Paragraph position="1"> By far the largest performance difference can be observed with respect to the tagging accuracy for unknown words (cf. Table 5, columns 4 and 5), ranging from 5.8 (at training point 30,000) to 6.6 percentage points (at training points 10,000 and 40,000). The FRAMED-trained tagger scores above 90% in seven out of ten points and never falls below 80%. The NEGRA-based tagger, on the other hand, remains considerably below 90% at all points, and even falls below 80% at the first two training points. This performance difference is clearly one factor which contributes to the FRAMED tagger's superior results. The difference in the average percentage of unknown words is the other dimension where both environments diverge (cf. Table 5, columns 2 and 3).
Whereas the percentage of unknown words starts out equally high at the lowest training sizes (5,000 and 10,000), this rate drops much faster for the FRAMED-trained tagger. At the highest possible training point, only 12.5% of the words are unknown, compared to still almost 18% for the NEGRA-trained tagger, resulting in a 5.4 percentage point difference. Thus, both the high tagging accuracy for unknown words and their lower rate in the first place seem to be key to the superior performance of the FRAMED-trained TNT tagger.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> Campbell and Johnson (2001) have argued that general-purpose off-the-shelf NLP tools are not readily portable and extensible to the analysis of medical texts. By evaluating the English version of Brill's rule-based tagger (Brill, 1995), they conclude that taggers trained on general-purpose language resources, such as newspaper corpora, are not suited to medical narratives but rather need time-consuming and costly retraining on manually tagged medical corpora. Interestingly though, it has also been observed (Friedman and Hripcsak, 1999; Campbell and Johnson, 2001) that medical language shows less variation and complexity than general, newspaper-style language, thus exhibiting typical properties of a sublanguage. Setting aside the difference in vocabulary between medical and nonmedical domains, the degradation in performance of general-language off-the-shelf NLP tools for MLP applications then seems counter-intuitive. Our first and second series of experiments were meant to explain this puzzling state of affairs.</Paragraph> <Paragraph position="1"> The results of these experiments shed a different light on the portability and extensibility of off-the-shelf NLP tools for the analysis of medical narratives than was hypothesized by Campbell and Johnson (2001).
A statistical POS tagger like TNT, which is trained on general-purpose language by default, only falls 1.5% short of the state-of-the-art performance in a medical environment. An easy-to-set-up medical backup lexicon eliminates this difference entirely. It appears that it is the underlying language model which determines whether a POS tagger is more or less suited to be ported to the medical domain, not the surface characteristics of medical sublanguage. Moreover, lexical backup facilities turn out to be a significant asset to MLP. Much to our surprise, a full-scale, carefully maintained lexicon did not substantially improve the tagger's performance in comparison with a heuristically assembled brief list of the most common tagging mistakes.</Paragraph> <Paragraph position="2"> A reason for the statistical tagger's superior performance may be derived from our comparative corpus statistics, which were the focus of our second series of experiments. Concerning POS n-grams, the data points to a less varied and less complex grammar of medical sublanguage(s). Not only is the number of POS n-gram types much lower for medical narratives than for general-language newspaper texts, but the distribution also favors high-frequency types (occurring more than 1000 times) in MED. Another indicator of a simpler POS n-gram grammar in medical narratives is the fact that the absolute number of POS n-gram types common to NEGRA and MED is much lower than for NEGRA and NEWS. Scaled against the total number of types in MED, however, the common ones cover a bigger part of the medical narratives, whereas they cover less of NEGRA. For POS trigrams, half of NEGRA is congruent with three quarters of MED; for POS bigrams, three quarters of NEGRA is congruent with nine tenths of MED.</Paragraph> <Paragraph position="3"> Common POS n-grams that are χ² significant indicate that two corpora are similar with respect to them.
Their number was significantly higher for the NEGRA-MED comparison than for NEGRA-NEWS.</Paragraph> <Paragraph position="4"> Hence, the congruency of a high proportion of POS n-gram types between NEGRA and MED is not accidental. At the POS n-gram type level, this shows a higher degree of similarity between NEGRA and medical narratives than between NEGRA and other newspaper texts. Furthermore, the high χ² values for the top-ranked POS n-grams indicate that they are especially characteristic of the NEGRA-MED similarity. Eight of the top-ranked trigrams and bigrams can be identified as parts of a noun phrase. All of them contain a prenominal adjective (ADJA in Table 4), six a common noun (NN in Table 4). The prenominal adjective is by far the most characteristic POS unigram for medical-newspaper inter-language similarity. None of these observations hold for newspaper intra-language similarity. Our third series of experiments showed that Markovian taggers like TNT improve their performance substantially when trained on medical data. Indeed, we were able to achieve a performance boost which goes beyond current state-of-the-art numbers.
This seems to be even more notable inasmuch as the tagger's retraining was done on a comparatively small-sized corpus (90,000 tokens).</Paragraph> <Paragraph position="5"> These experiments suggest two explanations.</Paragraph> <Paragraph position="6"> First, annotating medical texts with a medically enhanced tagset took care of medical sublanguage properties not covered by general-purpose tagsets.</Paragraph> <Paragraph position="7"> Second, several tagging experiments on newspaper language, whether statistical (Ratnaparkhi, 1996; Brants, 2000) or rule-based (Brill, 1995), report that the tagging accuracy for unknown words is much lower than the overall accuracy.2 Thus, the lower percentage of unknown words in medical texts seems to be a sublanguage feature beneficial to POS taggers, whereas the higher proportion of unknown words in newspaper language seems to be a prominent source of tagging errors. This is witnessed by the tagging accuracy for unknown words, which is much higher for the FRAMED-trained tagger than for the newspaper-trained one. For the medical tagger, there is only a 5 percentage point difference between overall and unknown word accuracy at training point 90,000, whereas, for the newspaper tagger, this difference amounts to 8.8 percentage points. This may be interrelated with another property of sublanguages, viz. their lower number of word types: At each training point, the lexicon of the FRAMED tagger is 20 percentage points smaller than that of the newspaper tagger. TNT's handling of unknown words relies on the probability distribution for a particular (formal) suffix of some fixed length (cf. Brants (2000)). Thus, guessing an unknown word's category is easier with a small-sized tagger lexicon, because there are fewer choices for the POS category of a word with a particular suffix.</Paragraph> <Paragraph position="8"> Only recently has the accuracy of data-driven POS taggers moved beyond the '97% barrier' (derived from newspaper corpora).
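A stripped-down version of such suffix-based guessing can be sketched as follows. TnT proper additionally weights the suffix statistics by tag priors and interpolates across suffix lengths; the function names and the fallback tag here are illustrative only.

```python
from collections import Counter, defaultdict

def build_suffix_model(tagged_lexicon, max_len=4):
    """Collect tag counts for every word-final suffix up to max_len characters."""
    tables = defaultdict(Counter)
    for word, tag in tagged_lexicon:
        for k in range(1, max_len + 1):
            if len(word) >= k:
                tables[word[-k:]][tag] += 1
    return tables

def guess_tag(tables, word, max_len=4):
    """Back off from the longest known suffix to shorter ones."""
    for k in range(min(max_len, len(word)), 0, -1):
        counts = tables.get(word[-k:])
        if counts:
            return counts.most_common(1)[0][0]   # majority tag for this suffix
    return "NN"   # fallback: a frequent open-class tag
```

With a small lexicon, fewer distinct tags compete for each suffix, which matches the observation above that a smaller tagger lexicon makes unknown-word guessing easier.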
This was partly achieved by computationally more expensive models than TNT's efficient unidirectional Markovian one. For example, Giménez and Màrquez (2003) report an accuracy of 97.13% for their SVM-based tagger. The best automatically learned POS-tagging result reported so far (97.24%) is Toutanova et al. (2003)'s feature-based cyclic dependency network tagger. Although reaching the 98% accuracy level constitutes a breakthrough, it is of course conditioned by the medical sublanguage we are working with. Still, the application of language technologies in certain sublanguage domains like medicine and, more recently, genomics and biology, is rapidly gaining importance, and thus our results also have to be considered from this perspective.</Paragraph> <Paragraph position="9"> 2These authors report on differences between 7.7 and 11.5 percentage points.</Paragraph> </Section> </Paper>