In: Proceedings of CoNLL-2000 and LLL-2000, pages 111-114, Lisbon, Portugal, 2000.
Minimal Commitment and Full Lexical Disambiguation: 
Balancing Rules and Hidden Markov Models 
Patrick Ruch and Robert Baud and Pierrette Bouillon and Gilbert Robert* 
Medical Informatics Division, University Hospital of Geneva 
ISSCO, University of Geneva {ruch, baud}@dim.hcuge.ch, {bouillon, robert}@issco.unige.ca 
Abstract 
In this paper we describe the construction of a part-of-speech tagger
intended both for medical document retrieval and for XP extraction. To
serve both purposes, we have designed a two-part system: for retrieval,
we rely on a rule-based architecture, called minimal commitment, which
can be complemented by a data-driven tool (HMM) when full
disambiguation is necessary.
1 Introduction 
Nowadays, most medical information is stored in textual documents 1,
but such a large amount of data may remain useless if the relevant
information cannot be retrieved in a reasonable time. Although some
large-scale information retrieval (IR) evaluations, made on
unrestricted corpora (Hersh et al., 1998) and on medical texts (Hersh,
1998), are quite critical towards linguistic engineering, we believe
that natural language processing is the best solution to two major
problems of text retrieval engines: expansion of the query and lexical
disambiguation. Disambiguation can be divided into MS
(morpho-syntactic, i.e. the part-of-speech (POS) and some other
features) and WS (word-sense) disambiguation. Although we aim at
developing a common architecture for processing both the MS and the WS
disambiguation (Ruch et al., 1999), this paper focuses on MS tagging.
* We would like to thank Thierry Etchegoyhen and Erik Tjong Kim Sang
for their helpful assistance while writing this paper. The Swiss
National Foundation supported the present study.
1 While our studies were made on French corpora, the examples are
provided in English, when possible, for the sake of clarity.
2 Background 
Before starting to develop our own MS tagger, we conducted some
preliminary studies on generally available systems; although these
studies go far beyond the scope of this paper, we would like to report
on the main conclusions. Both statistical taggers (HMM) and
constraint-based systems were assessed. Two guidelines framed the
study: performance and minimal commitment. We call minimal commitment 2
the property of a system that does not attempt to resolve ambiguities
when it is not likely to resolve them well. Such a property seems
important for IR purposes, where we might prefer noise to silence in
the recall process. However, it must remain optional, as some other
tasks (such as NP extraction, or phrase chunking (Abney, 1991)) may
need full disambiguation.
2.1 Data-driven tools 
We adapted the output of our morphological analyser for tagging
purposes (Bouillon et al., 1999). We trained and wrote manual biases
for an HMM tagger, but results were never far above 97% (i.e. about 3%
of error); with an average ambiguity level of around 16%, it means that
almost 20% of the ambiguous tokens were given a wrong tag. We attempted
to set a confidence threshold, so that for similarly weighted
transitions the system would keep the ambiguity, as in (Weischedel et
al., 1993), but the results were not satisfactory.
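The step from a 3% overall error to "almost 20%" of the ambiguities can be made explicit with a short calculation. This is only a sketch of the arithmetic, under the assumption (not stated in the text) that every tagging error falls on a lexically ambiguous token; the variable names are ours.

```python
# Figures quoted in the text for the HMM tagger.
error_rate = 0.03       # about 3% of all tokens receive a wrong tag
ambiguity_rate = 0.16   # about 16% of all tokens are lexically ambiguous

# Assumption: unambiguous tokens are never mis-tagged, so all errors
# are concentrated on the ambiguous tokens.
wrong_share = error_rate / ambiguity_rate

print(f"share of ambiguous tokens given a wrong tag: {wrong_share:.3f}")
```

Under that assumption the share is just under one fifth, which matches the order of magnitude reported above.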
2.2 Constraint-based systems 
We also looked at more powerful principle-based parsers, and tests were
conducted on FIPSTAG 3 (a Government and Binding chart parser (Wehrli,
1992)). Although this system performed well on general texts, with
about 0.7% of errors, its results on medical texts were about the same
as those of stochastic taggers. As we could not adapt our medical
morphological analyser to this very integrated system, it had to cope
with several unknown words.

2 M. Marcus was perhaps the first to use this expression; more
recently, a quite similar idea can be found in Silberztein (1997).
3 For a MULTEXT-like description of the FIPSTAG tagset see Ruch P,
1997: Table de correspondance GRACE/FIPSTAG, available at
http://latl.unige.ch/doc/etiquettes.ps

Token      Lemma          Lexical tag(s)
fast       fast           a
section    section        nc[s]
of         of             sp
the        the            dad
internal   internal       a
faces      face/to face   nc[p]/v[s03]
Table 1: Tag-like representation of MS lexical features

Token      Lexical tags   Disambiguated tag
fast       a              a
section    nc[s]          nc[s]
of         sp             sp
the        dad            dad
internal   a              a
faces      nc[p]/v[s03]   nc[p]
Table 2: Example of tagging

3 Methods
In order to assess the system, we selected a corpus (40000 tokens)
drawn equally from three types of documents: reports of surgery,
discharge summaries and follow-up notes. This ad hoc corpus is split
into 5 equivalent sets. The first one (set A, 8520 words) serves to
write the basic rules of the tagger, while the other sets (set B, 8480
tokens, C, 7447 tokens, D, 7311 tokens, and E, 8242 tokens) are used
for assessment purposes and incremental improvements of the system.
3.1 Lexicon, morphological analysis and guesser
The lexicon, with around 20000 entries, exhaustively covers the whole
ICD-10. The morphological analyser is morpheme-based (Baud et al.,
1998): it maps each inflected surface form of a word to its canonical
lexical form, followed by the relevant morphological features. Words
absent from the lexicon go through a two-step guessing process. First,
the unknown token is analysed into its morphemes; if this first stage
fails, a last attempt is made to guess the hypothetical MS tags of the
token. The first stage is based on the assumption that unknown words in
medical documents are very likely to belong to the medical jargon; the
second one supposes that neologisms follow regular inflectional
patterns. With regard to morpho-syntax, both stages are functionally
equivalent, as each one provides a set of morpho-syntactic information,
but they behave radically differently regarding WS information. For
guessing WS categories only the first-stage guesser is relevant, as
inflectional patterns are not sufficient for guessing the semantics of
a given token. Thus, the ending able very probably characterises an
adjective, but does not provide any semantic information 4 on it.
Let us consider two examples of words absent from the lexicon. First,
allomorph: the prefix part, allo, and the suffix part, morph, are
listed in the lexicon, with all the MS and the WS features; therefore
the word is recognized by the first-stage guesser. Second, allocution:
it cannot be split into known morphemes, as cution is not a morpheme,
but the ending tion maps to some features (noun, singular) in the
second-stage guesser. As the underlying objective of the project is to
retrieve documents, the main and most complete information is provided
by the first-stage guesser, and the second stage is only of interest
for MS tagging, as in (Chanod and Tapanainen, 1995). Finally (tab. 1),
some of the morpho-syntactic features provided by the lemmatizer are
expressed in the MS tagset 5, to be processed by the tagger (tab. 2).
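The two-stage guessing process can be illustrated with a minimal Python sketch. It is not the actual guesser: the morpheme lexicon, the ending table, the feature values and the function names (segment, guess) are all hypothetical and reduced to what the two examples allomorph and allocution need.

```python
# Toy resources; every entry below is an illustrative assumption.
MORPHEMES = {
    "allo": {"sem": "other"},                   # hypothetical WS feature
    "morph": {"ms": "nc[s]", "sem": "form"},    # hypothetical MS/WS features
}
ENDINGS = {"tion": "nc[s]", "able": "a"}        # ending -> guessed MS tag


def segment(token, morphemes):
    """Split token into known morphemes (longest first), or return None."""
    if not token:
        return []
    for end in range(len(token), 0, -1):
        head = token[:end]
        if head in morphemes:
            rest = segment(token[end:], morphemes)
            if rest is not None:
                return [head] + rest
    return None


def guess(token):
    """Stage 1: morpheme analysis; stage 2: inflectional ending."""
    parts = segment(token, MORPHEMES)
    if parts:
        # MS features are taken from the last morpheme (an assumption).
        ms = MORPHEMES[parts[-1]].get("ms")
        return {"stage": 1, "ms": ms, "morphemes": parts}
    for ending, tag in ENDINGS.items():
        if token.endswith(ending):
            # Only MS information is available at this stage.
            return {"stage": 2, "ms": tag, "morphemes": None}
    return {"stage": None, "ms": None, "morphemes": None}


print(guess("allomorph"))
print(guess("allocution"))
```

In this sketch allomorph is fully covered by known morphemes and is handled by the first stage, while allocution falls through to the ending table, which yields a morpho-syntactic tag but no semantic information.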
4 A minimal set of lexical semantic types, based on the UMLS, has been
defined in (Ruch et al., 1999).
5 The MS tagset tends to follow the MULTEXT lexical description for
French, modified within the GRACE action
(http://www.limsi.fr/TLP/grace/doc/GTR-3-2.1.tex). However, this is not
always possible, as this description does not allow any morpheme
annotation.
Evaluation                        1-Set B       2-Set C       3-Set D       4-Set E
Tokens with lexical ambiguities   1178 (13.9)   1273 (17.1)   1132 (15.5)   1221 (14.8)
Tokens correctly tagged           8243 (97.2)   7177 (96.4)   7137 (97.6)   8082 (98.1)
Tokens still ambiguous, with GC   161 (1.9)     183 (2.5)     136 (1.9)     101 (1.2)
Tokens ambiguous, without GC      -             9 (0.1)       2 (0)         9 (0.1)
Tokens incorrectly tagged         76 (0.9)      78 (1.0)      36 (0.5)      51 (0.6)
Table 3: Results for each evaluation (GC stands for good candidates)

Statistical evaluation on the residual ambiguity   MFT           HMM
Tokens correctly tagged                            8136 (98.7)   8165 (99.1)
Tokens incorrectly tagged                          107 (1.3)     78 (0.9)
Table 4: Processing the residual ambiguity
3.2 Studying the ambiguities 
Our first investigations aimed at assessing the overall ambiguity of
medical texts. We found that 1227 tokens (14.4% of the whole sample 6)
were ambiguous in set A, and 511 tokens (6.0%) were unknown. We first
decided not to deal with unknown words; therefore they were not taken
into account in the first assessment (cf. Performances). However, some
frequent words were missing, so that, together with the MS guesser, we
could improve the guessing score by adding some lexemes. Thus, adding
232 entries to the lexicon and linking it with the Swiss compendium
(for drugs and chemicals) yields an unknown word rate of less than 3%.
This result also includes the pre-processing of patients' and
physicians' names (Ruch et al., 2000). Concerning the ambiguities, we
found that 5 tokens were responsible for half of the ambiguities, while
in unrestricted corpora this number seems to be around 16 (Chanod and
Tapanainen, 1995).
6 For comparison, the average ambiguity rate is about 25-30% in
unrestricted corpora.
3.2.1 Local rules
We separated set A into 8 subsets of about 1000 tokens, in order to
write the rules. We wrote around 50 rules (which generated more than
150 operative rules) for the first subset, while for the 8th, only 12
rules were necessary to reach a score close to 100% on set A. These
rules use intermediate symbols (such as the Kleene star) in order to
ease and improve the rule-writing process; these symbols are replaced
when the operative rules are generated.
Here is an example of a rule:
prop[**];v[**]/nc[**] --> prop[**];v[**]
This rule says 'if a token is ambiguous between (/) a verb (v),
whatever (**) features it has (3rd or 1st/2nd person, singular or
plural), and a common noun, whatever (**) features it has, and such a
token is preceded by a personal pronoun (prop), whatever (**) features
this pronoun has (3rd or 1st/2nd person), then the ambiguous token can
be rewritten as a verb, keeping its original features (**)'.
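The effect of the rule above can be sketched in a few lines of Python. This is an illustration only, not the rule compiler used in the system: the representation of a token as a (word, list of candidate tags) pair, the helper names and the tag prop[s03] are assumptions made for the example. The sketch also assumes that the rule fires only when the preceding pronoun is itself unambiguous, which seems in the spirit of minimal commitment but is not stated in the text.

```python
def base(tag):
    """Strip the bracketed features: 'v[s03]' -> 'v'."""
    return tag.split("[", 1)[0]


def apply_pronoun_verb_rule(tokens):
    """Apply prop[**];v[**]/nc[**] --> prop[**];v[**] from left to right.

    tokens is a list of (word, candidate_tags) pairs; a new list is returned.
    """
    out = []
    for i, (word, tags) in enumerate(tokens):
        prev_tags = out[i - 1][1] if i > 0 else []
        # Assumption: the left context must be an unambiguous pronoun.
        prev_is_pronoun = len(prev_tags) == 1 and base(prev_tags[0]) == "prop"
        is_verb_noun = len(tags) > 1 and {base(t) for t in tags} == {"v", "nc"}
        if prev_is_pronoun and is_verb_noun:
            # Keep the verb reading(s) with their original features.
            tags = [t for t in tags if base(t) == "v"]
        out.append((word, tags))
    return out


sentence = [("it", ["prop[s03]"]), ("faces", ["nc[p]", "v[s03]"])]
print(apply_pronoun_verb_rule(sentence))
```

When the left context is not a personal pronoun, the token is left untouched and keeps all its candidate tags, so the ambiguity is passed on rather than resolved arbitrarily.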
4 Performances 
4.1 Maximizing the minimal commitment
Four successive evaluations were conducted (tab. 3); after each
session, the necessary rules were added in order to get a tagging score
close to 100%. In parallel, words were entered into the lexicon, and
productive endings were added to the MS guesser. The second, third, and
fourth evaluations were performed with the MS guesser activated. Let us
note that translation phenomena (Paroubek et al., 1998), which turn the
lexical category of a word into another one, seem rare in medical texts
(only 3 cases were not foreseen in the lexicon).
A success rate of 98% (tab. 3, evaluation 4) is not a bad result for a
tagger, but the main result concerns the error rate: with less than 1%
of error, the system seems particularly minimally committed 7. Another
interesting result concerns the residual ambiguity (tokens still
ambiguous, with GC): in set E, at least half of these ambiguities could
be handled by writing more rules. However, some of these ambiguities
are clearly intractable with such contextual rules, and would demand
more lexical information, as in le patient présente une douleur
abdominale brutale et diffuse (the patient shows an acute and diffuse
abdominal pain/the patient shows an acute abdominal pain and
distributes* 8), where diffuse could be an adjective or a verb.
7 Let us note that in the assessment 1, the system had
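The percentages quoted for evaluation 4 can be recomputed from the counts reported in Table 3 and the size of set E given in the Methods section. The short Python check below is only a sketch of that arithmetic; the dictionary labels and the helper name rate are ours.

```python
SET_E_TOKENS = 8242  # size of set E, as given in the Methods section

# Counts for evaluation 4 (set E), taken from Table 3.
counts = {
    "correctly tagged": 8082,
    "still ambiguous, with GC": 101,
    "incorrectly tagged": 51,
}


def rate(count, total):
    """Percentage of total, rounded to one decimal as in Table 3."""
    return round(100 * count / total, 1)


rates = {label: rate(n, SET_E_TOKENS) for label, n in counts.items()}
for label, value in rates.items():
    print(f"{label}: {value}%")
```

The low error rate is obtained at the price of leaving a little over 1% of the tokens ambiguous, which is the trade-off that minimal commitment is meant to make.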
4.2 Maximizing the success rate 
A last experiment was made: on set E, which had been disambiguated by
the rule-based tagger, we decided to apply two more disambiguations, in
order to handle the residual ambiguity. First, we apply the most
frequent tag (MFT) model, as a baseline, then the HMM. Both the MFT and
the HMM transitions are calculated on the set B+C+D, tagged manually,
but without any manual improvement (bias) of the model.
Table 4 shows that for the residual ambiguity, i.e. the ambiguity which
remained intractable for the rule-based tagger, the HMM provides an
interesting disambiguation accuracy 9.
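The MFT baseline applied to the residual ambiguity can be sketched as follows. This is a minimal illustration under our own assumptions, not the implementation evaluated in Table 4: the function names, the (word, candidate tags) representation and the toy training pairs are hypothetical, and ties or unseen words simply fall back to the first remaining candidate.

```python
from collections import Counter, defaultdict


def train_mft(tagged_corpus):
    """Count tag frequencies per word from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return counts


def resolve_residual(tokens, counts):
    """Fully disambiguate tokens left ambiguous by the rule-based pass."""
    resolved = []
    for word, candidates in tokens:
        if len(candidates) == 1:
            # Already disambiguated by the rules: keep the decision.
            resolved.append((word, candidates[0]))
            continue
        seen = counts.get(word, Counter())
        # Choose only among the candidates the rules left open.
        best = max(candidates, key=lambda tag: seen[tag])
        resolved.append((word, best))
    return resolved


# Toy training data standing in for the manually tagged sets.
training = [("diffuse", "a"), ("diffuse", "a"), ("diffuse", "v[s03]")]
model = train_mft(training)
print(resolve_residual([("douleur", ["nc[s]"]), ("diffuse", ["v[s03]", "a"])], model))
```

Because the choice is restricted to the tags that survived the rule-based pass, the statistical step can only complete the rules' output and never overrides a decision they have already made.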
5 Conclusion 
We have presented a rule-based tagger for electronic medical records.
The first target of this tool is disambiguation for IR purposes;
therefore we decided to design a system without any heuristics. As a
second target, the system will be used for conducting NP extraction
tasks and shallow parsing: the system must then be able to provide a
fully disambiguated output; therefore we used the HMM tool for
completing the disambiguation task.

References 
Steven Abney. 1991. Parsing by chunks. In R. Berwick, S. Abney, and
C. Tenny, editors, Principle-based parsing, pages 257-278. Kluwer.
Robert Baud, C. Lovis, and A.-M. Rassinoux. 1998. Morpho-semantic
parsing of medical expressions. In Proceedings of AMIA'1998, pages
760-764, Orlando.
Pierrette Bouillon, R. Baud, G. Robert, and P. Ruch. 1999. Indexing by
statistical tagging. In Proceedings of the JADT'2000, pages 35-42,
Lausanne.
Jean-Pierre Chanod and Pasi Tapanainen. 1995. Tagging French: comparing
a statistical and a constraint-based method. In Proceedings of EACL'95,
pages 149-156, Dublin.
William Hersh, S. Price, D. Kraemer, B. Chan, L. Sacherek, and
D. Olson. 1998. A large-scale comparison of boolean vs. natural
language searching for the TREC-7 interactive track. In TREC 1998,
pages 429-438.
William Hersh. 1998. Information retrieval at the millennium. In
Proceedings of AMIA'1998, pages 38-45, Lake Buena Vista, FL.
Patrick Paroubek, G. Adda, J. Mariani, J. Lecomte, and M. Rajman. 1998.
The GRACE French part-of-speech tagging evaluation task. In Proceedings
of LREC'1998, Granada.
Patrick Ruch, J. Wagner, P. Bouillon, R. Baud, A.-M. Rassinoux, and
G. Robert. 1999. Medtag: Tag-like semantics for medical document
indexing. In Proceedings of AMIA'99, pages 35-42, Washington.
Patrick Ruch, R. Baud, A.-M. Rassinoux, P. Bouillon, and G. Robert.
2000. Medical document anonymization with a semantic lexicon. In
Proceedings of AMIA'2000, Los Angeles.
Max Silberztein. 1997. The lexical analysis of natural languages. In
Emmanuel Roche and Yves Schabes, editors, Finite-State Language
Processing, pages 176-205. MIT Press.
Eric Wehrli. 1992. The IPS system. In Proceedings of COLING-92, pages
870-874.
Ralph Weischedel, M. Meteer, R. Schwartz, L. Ramshaw, and J. Palmucci.
1993. Coping with ambiguity and unknown words through probabilistic
models. Computational Linguistics, 19(2):359-382.
