A tagger/lemmatiser for Dutch medical language 
Peter Spyns 
University of Gent, Division of Medical Informatics 
De Pintelaan 185 (5K3), B-9000 Gent, Belgium 
Peter. Spynscrug. ac. be 
Abstract 
In this paper, we want to describe a tag- 
ger/lemmatiser for Dutch medical voca- 
bulary, which consists of a full-form dic- 
tionary and a morphological recogniser 
for unknown vocabulary coupled to an 
expert system-like disambiguation mo- 
dule. Attention is also paid to the main 
datastructures: a lexical database and 
feature bundles implemented as direc- 
ted acyclic graphs. Some evaluation re- 
sults are presented as well. The tag- 
ger/lemmatiser currently functions as a 
lexical front-end for a syntactic parser. 
For pure tagging/lemmatising purposes, 
a reduced tagset (not suited for sentence 
analysis) can be used as well. 
1 Introduction 
Medical patient reports consist mainly of 
free text, combined with results of va- 
rious laboratories. While nmnerical data 
can easily be stored and processed for ar- 
chiving and research purposes, free text 
is rather difficult to be processed by a 
computer, although it contains the most 
relevant information. IIowever, only a 
few NLP-driven systems have actually 
been implemented (lfi'iedman and John- 
son, 1992). 
For Dutch, a prototype covering a lar- 
ger part of the Dutch grammar and me- 
dical vocabulary is under development. 
This paper focuses on a spin-off-- c.q. a 
contextual tagger/lemmatiser (T/L) -- 
of the lexical component of the Dutch 
Medical Language Processor (DMLP) 
(Spyns and De Moor, 1996). A T/L is 
quite valuable for several kinds of corpus 
studies concerning the medical vocabu- 
lary (co-occurrence patterns, statistical 
data, . .. ). For efficient sentence analysis 
in particular, it is necessary to disambi- 
guate the results of morphological ana- 
lysis before they can be passed oil the 
parser. 
In the following sections, we will describe 
in detail the different knowledge bases 
(cf. section 2) and the implementation of 
tile major data structures (cf. section 3). 
Each section is illustrated by an cxaIn- 
ple or some implementation details. The 
subsequent section (4) is devoted to the 
evaluation. The paper ends with a dis- 
cussion (section 5). 
2 Linguistic Knowledge 
In essence, the T/L is a generate-and- 
test engine. All possible morphologi- 
cal analyses of a word are provided (by 
the database or tile word recogniser cf. 
section 2.1), (generator), and the con- 
textual disambiguator (cf. section 2.2), 
(test engine), must reduce as much as 
possible tile potentially valid analyses to 
the one(s) effectively applicable in the 
context of the given input sentence 1 
2.1 Lexlcal Front-end 
The dictionary is conceived as a full 
form dictionary in order to speed up 
the tagging process. Experiments (De- 
haspe, 1993b) have shown that full form 
retrieval is in most of the cases signifi- 
cantly faster than canonical form com- 
putation and retrieval. (cf. also (Ritehie 
et al., 1992, p.201)). The lexical data- 
base for Dutch was built using several 
resources: an existing electronic valency 
dictionary 2 and a list of words extrac- 
ted from a medical corpus (cardiology 
patient discharge summaries). The al- 
ready existing electronic dictionary and 
1Before the actual linguistic analysis takes place, 
some preprocessing (marking of sentence boundaries, 
etc.) is done. 
2This resulted from the K.U. Leuvcn PROTON- 
project (Dehaspe and Van Langendonck, 1991) 
1147 
the newly coded entries were converted 
and merged into a common representa- 
tion in a relational database (Dehaspe, 
1993a). A Relational DataBase Manage- 
ment System (RDBMS) can handle very 
large amounts of data while guarante- 
eing flexibility and speed of execution. 
Currently, there are some 100.000 full 
forms in the lexical database (which is 
some 8000 non inflected forms). For the 
moment, the database contains for the 
major part simple wordforms. Complex 
wordforms nor idiomatic expressions are 
yet handled in a conclusive manner. 
Itowevcr, since an exhaustive dictionary 
is an unrealistic assumption, an intel- 
ligent word recognlser tries to cope 
with all the unknown word forms (Spyns, 
1994). The morphological recogniser 
tries to identify the unknown form by 
computing its potential linguistic charac- 
teristics (including its canonical form). 
For this purpose, a set of heuristics that 
combine morphological (inflection, deri- 
vation and compounding) as well as non 
morphological (lists of endstrings cou- 
pled to their syntactic category) know- 
ledge. When these knowledge sources 
do not permit to identify the unknown 
forms, they are marked as guesses and 
receive the noun category. 
Actually, a difference is made between 
the regular full form database dictionary 
and a much smaller canonical form 
dictionary. The latter consist of auto- 
matically generated entries. Those ent- 
ries are asserted as temporary canoni- 
cal form lexicon entries and do not need 
to be calculated again by the recogniser 
part of the T/L when encountered a se- 
cond time in the submitted text. A sub- 
stantial speedup can be gained that way. 
2.2 The Disambiguator 
The contextual 3 disambiguator of the 
DMLP is implemented as an "expert- 
like system" (Spyns, 1995), which does 
not only take the immediate left and/or 
right neighbour of a word in the sentence 
into account, but also the entire left or 
right part of the sentence, depending on 
the rule. E.g. if a simple form of the 
verb 'hebben' \[have\] appears, the auxilL 
ary reading is kept only if a past particL 
ple is present in the context 4 
aWe only consider the syntactic context. 
4Unlike in English, the past participle in Dutch 
does not need to occupy a position adjacent to the 
auxiliary. 
The rule base can be subdivided into 21 
i.ndependent rule sets. A specific mecha- 
nism selects the appropriate ruleset to be 
triggered. Some rulesets are internally 
ordered. Iit that case, if the most speci- 
fic rule is fired, the triggering of the more 
general rules is prevented. In other cases, 
all the rules of a ruleset are triggered se- 
quentially. Some rules are mutually ex- 
clusive. The rules are implemented as 
Prolog clauses, which guarantees a de- 
clarative style of the rules (at !east to a 
large extent). 
The control mechanism works with an 
agenda that contains the position of the 
words ill the input sentence. The posi- 
tion in the sentence uniquely identifies a 
word (and thus its corresponding (group 
of different) morphological reading(s)). 
Every position in the agenda is sequenti- 
ally checked whether it can be disambi~ 
guated or not. If an ambiguous word is 
encountered, its position is kept on the 
agenda. For every clement of the agenda, 
all possible binary combinations of the 
syntactic categories are tried (failure dri- 
ven loop). 1'o avoid infinite loops (repea- 
tedly firing the same rule that is not able 
to alter the current set of morphological 
readings), the same ruleset can only be 
fired once for the word on the same posi- 
tion during the same pass. As long as the 
disambiguator can reducc the number of 
readings and the agenda is not empty, a 
new pass is performed. 
3 Software Engineering 
In order to preserve the reusability of the 
dictionary, an extra software layer hides 
the database. This layer transforms the 
information from the database into a fea- 
ture bundle containing the application 
specific features. The software layer re- 
stricts and adapts the "view" (just like 
the SQL-views) the programs have on 
the information of a lexical entry . This 
methods allows that all sorts of informa- 
tion can be coupled to a lexical entry in 
the database while only the information 
relevant for a specific NLP-application 
passes "the software filter". Besides the 
qualitative aspect, the filter can also af- 
fect the quantitative aspect by collapsing 
or expanding certain entries (e.g. the 1st 
and 2nd person singular of many verbs 
constitute the same entry in the data- 
base but are differentiated afterwards) or 
excluding specific combinations after ex- 
amination of the input. 
1148 
The feature bundles constitute the 
main datastructure of the T/L.Atself. 
They arc conceived as Directed Aey- 
clic Graphs, which are implemented as 
open ended Prolog lists (Gazdar and 
Mellish, 1989). This "low level" imple- 
mentation is only known by the predica- 
tes that make up the interface. Graph- 
unification provides a neat and easy way 
to impose various restrictions. A lingui- 
stic restriction can be exl)rcssed in terms 
of feature value pairs, which in turn can 
be represented as a l)AG. This DAG acts 
as filter towards other DAGs. The DAGs 
that are unifyable with the "filter DAG" 
meet the imposed restriction. The only 
thing to do is to define the appropriate 
filters. The contextual rules mainly con- 
sist of such filter DAGs. 
The T/L, able to analyse words lacking 
from the dictionary, is intended to fimc- 
tion primarily as a lexical front-end for 
the DMIA ) syntactic analyser (Spyns and 
Adriaens, 1992). Itowever, as the result 
of the tagging and lemmatising process 
consists of feature bundh's implemented 
as DAGs, the output format can be ad- 
apted very easily if required (by defining 
various "format filters"). The output 
format can be transduced to the format 
required by the "SAC-tools" o1' the Sy- 
stem Management 'lbols of the Menelas- 
project (Ogonowski, 1993). Another fib 
ter transforms the output to the format 
of the Multi-TMe semantic tagger (Ceu- 
sters, 1994). 
4 Evaluation 
In order to assess the performance of the 
T/L, several data sets were used. A 
learning set of 1314 tokens (5 reports) 
from the cardiology department (cardio) 
should eliminate as much as possible 
errors due to unknown vocabulary. A 
new large test set of 3167 tokens of 35 
neurosurgical reports was fed to the T/L 
to see how robust it is when confronted 
with the vocabulary of a comt)letely new 
domain. The t)roblem with an applica- 
tion of this type is the trade-olr between 
overkill (a good analysis is injustly dis- 
carded) and undershoot (an invalid ana- 
lysis is kept). The extensive tagset (tag- 
setl) provides all the morphosyntaetic 
information as required by the DMLP 
parser for sentence analysis, while the re- 
duced tagset (tagset2) consists of 15 ('a- 
tegories and 25 speciliers (which gives 43 
meaningfifl combinations). This simplifi- 
'fable 1: results of contextual tagging with an ex- 
tensive tagset (tagsetl) versus a reduced one (tag- 
set2) on the eardio and neuro sets 
bail 
2 
1 
b ad 
2 
1 
tagsetl 
1314 
102 
129 
cardio tagset2 cardio 
100% 1314 100% 
1083 
tagsetl neuro 
3167 100% 
39 
92.23 % 75 97.03 % 
82.42 % 1200 91.32 % 
446 
389 85.91% 
2332 73.63 % 
tagset2 
3167 
276 
261 
2630 
neuro 
100 % 
91.28 % 
83.04 % 
cation of the syntactic information grea- 
tly improves the results. 
All the results were manually examined 
and synthesised (of. table 1). As soon 
as even one feature of the complete fea- 
ture bundle with linguistic information 
is wrong, the analysis as a whole is con- 
sidered to be incorrect. All the words 
that have wrong, lacking, doubtful or 
more than 2 competing analyses are con- 
sidered as bad. Sometimes, two compe- 
ting readings could not be disambigua- 
ted without sernantico-pragmatic know- 
ledge. In addition, we deliberately left 
some ambiguities pending for the syntac- 
tic parser to avoid the danger of over- 
kill (el. also (Jaeobs and l:\[au, 1993, 
pp.166--167) on this matter). These ea- 
ses of "double analysis" are grouped in 
the "class 2''. The question whether 
these cases should be considered as bad 
or correct is left open "~ 
The difference between the results is 
mainly due to the amount of unknown 
vocabulary (around 9 % for the cardio 
set VS. around 18% for the neuro set 
which results in a difference of 82.42 % 
vs. 73.63 % and 91.32 % vs. 83.04 %) 
and the nature of the tagsets (82.42 % 
vs. 91.32 % and 73.63 % vs. 8'.1.0/1%). 
5 Discussion 
As tar as we know, only one T/L for me- 
dical English exists (Paulussen and Mar- 
tin, 1992), which has recently been ad- 
apted to medical Dutch and extended 
with semantic lal)elling (Maks and Mar-- 
tin, 1996). Most of ttle T/Ls 6 attain a 
5Probably, the ~mswer will be different depending 
on tile task of tile T/L: "pure" tagging or auxiliary 
function for the parser. 
~Cf. (l'aulussen, 1992) for a detailed overview 
and discussion of some T/l,s - including CGC, Tag- 
2149 
95% - 97% score, although for ENGCG 
a 99.7 % succes rate is claimed (Tapa- 
nainen and Jiirvinen, 1994). All these 
taggers use a rather restricted tagset. 
Therefore, we consider it fair to com- 
pare only our results on tagset2 with the 
scores of the mentioned T/Ls. It must 
be mentioned as well that word order in 
medical Dutch can be rather free. Mo- 
reover, medical sublanguage sometimes 
deviates considerably from the standard 
grammar rules. E.g. determiners can be 
easily skipped, which enhances the ditIi- 
culty to distinguish a noun from certain 
conjugated verbal forms. As a conclu- 
sion, we believe that, our T/L performs 
relatively well and still has potentialities 
for improvement. 
Acknowledgements 
Parts of this work were supported by 
the MENELAS (AIM #2023) (Zweigen- 
baum, 1995) and DOME (MLAP #63- 
221) projects (S~roussi, 1995) of the E.U. 
We also would like to thank Luc Dehaspe 
for his work on the lexical database (De- 
haspe, 1993a). 
References 
Ceusters W., 1994, The Generation of 
MULTI-lingual Specialised Lexicons by 
using Augmented Lemmatizer-Taggers, 
Multi-TALE Delivrable #1, 
Dehaspe L. & Van Langendonck W., 
1991, Automated Valency Dictionary of 
Dutch Verbs, K.U. Leuven. 
Dehaspe L., 1993a, Report on the buil- 
ding of the MENELAS lexical database, 
Technical Report 93-002, Division of Me- 
dical Informatics, K.U. Leuven. 
Dehaspe L., 1993b, Full form retrie- 
val versus canonical form computation of 
morphological data: a performance ana- 
lysis, Technical Report 93-004, Division 
of Medical Informatics, K.U. Leuven. 
Friedman C. & Johnson S., 1992, Medi- 
cal Text Processing: Past achievements, 
future directions, in Ball M. & Collen 
M., Aspects of the Computer-based Pati- 
ent Record: 212 - 228, Berlin: Springer - 
Verlag. 
Gazdar G. & Mellish C., 1989, Natural 
Language Processing in Prolog: an in- 
troduction to computational linguistics, 
Addison-Wesley. 
git, Parts, Claws, Dilemma, the Pare Tagger and 
"Brill tagger" - as well as (Voutilainen, 1995). 
the 
Jacobs P. &Rau L., 1993, Innovations 
in text interpretation, in Artificial Intel- 
ligence 63:143 - 191 . 
Maks I. & Martin W., 1996, MULTI- 
TALE: Linking Medical Concepts by 
means of Frames, Proc. of COLING 96, 
Copenhagen. 
Ogonowski A., 1993, SAC Manuel Uti- 
lisateur, GSI-ERLI, Internal Report. 
Paulussen H., 1992, Automatic Gram- 
malical Tagging: description, compa- 
rison and proposal for augmentation, 
U.I.A., Wilrijk (M.A. thesis). 
Paulussen H. & Martin W., 1992, 
DILEMMA-2: A Lemmatizer-Tagger for 
Medical Abstracts, in Proc. of ANLP 
92, 141 - 146, Trento. 
Ritchie G., R.ussell G., Black A. & Pul- 
man S., 1992, Computational Morpho- 
logy: Practical Mechanisms for the Eng- 
lish Lexicon, MIT Press. 
Sdroussi B., & DOME Consortium, 
1995, Document Management in Heal- 
thcare: Final Report, DOME Deliverable 
#D.02, Paris. 
Spyns P. & Adriaens G., 1992, Applying 
and Improving the Restriction Grammar 
Approach for Dutch Patient Discharge 
Summaries, Proc. of COLING 92, 1254 
- 1268, Nantes. 
Spyns P., 1994, A robust category gues- 
ser for Dutch Medical language, in Proc. 
of ANLP 94, 150-155, Stuttgart. 
Spyns P., 1995, A contextual Disambi- 
guator for Dutch medical language, in 
Proc. of the 7th Benelux Workshop on 
Logic Programming, Gent. 
Spyns P. & De Moor G., 1996, A Dutch 
Medical Language Processor, in Interna- 
tional Journal of Bio-Medical Enginee- 
ring, (in press). 
Tapanainen P. & Jiirvinen T., 1994, 
Syntactic Analysis of natural language 
using linguistic rules and corpus-based 
patterns, in Proc. of COLING 94, 629 
- 634, Kyoto. 
Voutilainen A., 1995, A syntax-based 
part-of-speech analyser , in Proc. of 
EACL 95, Dublin. 
Zweigenbaum P. & MENELAS Consor- 
tium, 1995, Menelas: Coding and In- 
formation Retrieval from Natural Lan- 
guage Patient Discharge Summaries, in 
Laires M., Ladeira M. & Christensen J., 
Health in the New Communications Ape, 
IOS Press, Amsterdam, 82 - 89 . 
1150 
