Accenting unknown words in a specialized language
Pierre Zweigenbaum and Natalia Grabar
DIAM — STIM/DSI, Assistance Publique – Hôpitaux de Paris
& Département de Biomathématiques, Université Paris 6
{ngr,pz}@biomath.jussieu.fr
Abstract
We propose two internal methods for ac-
centing unknown words, which both learn
on a reference set of accented words the
contexts of occurrence of the various ac-
cented forms of a given letter. One method
is adapted from POS tagging, the other is
based on finite state transducers.
We show experimental results for letter
e on the French version of the Medical
Subject Headings thesaurus. With the
best training set, the tagging method ob-
tains a precision-recall breakeven point
of 84.2A64.4% and the transducer method
83.8A64.5% (with a baseline at 64%) for
the unknown words that contain this let-
ter. A consensus combination of both in-
creases precision to 92.0A63.7% with a re-
call of 75%. We perform an error analysis
and discuss further steps that might help
improve over the current performance.
1 Introduction
The ISO-latin family, Unicode or the Universal
Character Set have been around for some time now.
They cater, among other things, for letters which can
bear different diacritic marks. For instance, French
uses four accented es(éèêë) besides the unaccented
form e. Some of these accented forms correspond to
phonemic differences. The correct handling of such
accented letters, beyond US ASCII, has not been
immediate and general. Although suitable charac-
ter encodings are widely available and used, some
texts or terminologies are still, for historical rea-
sons, written with unaccented letters. For instance,
in the French version of the US National Library
of Medicine’s Medical Subject Headings thesaurus
(MeSH, (INS, 2000)), all the terms are written in
unaccented uppercase letters. This causes difficul-
ties when these terms are used in Natural Language
interfaces or for automatically indexing textual doc-
uments: a given unaccented word may match several
words, giving rise to spurious ambiguities such as,
e.g., marche matching both the unaccented marche
(walking) and the accented marché (market).
Removing all diacritics would simplify match-
ing, but would increase ambiguity, which is al-
ready pervasive enough in natural language pro-
cessing systems. Another of our aims, besides,
is to build language resources (lexicons, morpho-
logical knowledge bases, etc.) for the medi-
cal domain (Zweigenbaum, 2001) and to learn lin-
guistic knowledge from terminologies and cor-
pora (Grabar and Zweigenbaum, 2000), including
the MeSH. We would rather work, then, with lin-
guistically sound data in the first place.
We therefore endeavored to produce an accented
version of the French MeSH. This thesaurus in-
cludes 19,971 terms and 9,151 synonyms, with
21,475 different word forms. Human reaccentua-
tion of the full thesaurus is a time-consuming, error-
prone task. As in other instances of preparation of
linguistic resources, e.g., part-of-speech-tagged cor-
pora or treebanks, it is generally more efficient for a
human to correct a first annotation than to produce
it from scratch. This can also help obtain better con-
sistency over volumes of data. The issue is then to
find a method for (semi-)automatic accentuation.
The CISMeF team of the Rouen University Hos-
                                            Association for Computational Linguistics.
                            the Biomedical Domain, Philadelphia, July 2002, pp. 21-28.
                         Proceedings of the Workshop on Natural Language Processing in
pital already accented some 5,500 MeSH terms
that are used as index terms in the CISMeF online
catalog of French-language medical Internet sites
(Darmoni et al., 2000) (www.chu-rouen.fr/cismef).
This first means that less material has to be reac-
cented. Second, this accented portion of the MeSH
might be usable as training material for a learning
procedure.
However, the methods we found in the literature
do not address the case of ‘unknown’ words, i.e.,
words that are not found in the lexicon used by the
accenting system. Despite the recourse to both gen-
eral and specialized lexicons, a large number of the
MeSH words are in this case, for instance those in
table 1. One can argue indeed that the compila-
cryomicroscopie dactylolyse
decarboxylases decoquinate
denitrificans deoxyribonuclease
desmodonte desoxyadrenaline
dextranase dichlorobenzidine
dicrocoeliose diiodotyrosine
dimethylamino dimethylcysteine
dioctophymatoidea diosgenine
Table 1: Unaccented words not in lexicon.
tion of a larger lexicon should reduce the propor-
tion of unknown words. But these are for the most
part specialized, rare words, some of which we did
not find even in a large reference medical dictionary
(Garnier and Delamare, 1992). It is then reasonable
to try to accentuate automatically these unknown
words to help human domain experts perform faster
post-editing. Moreover, an automatic accentuation
method will be reusable for other unaccented textual
resources. For instance, the ADM (Medical Diagno-
sis Aid) knowledge base online at Rennes University
(Seka et al., 1997) is another large resource which is
still in unaccented uppercase format.
We first review existing methods (section 2). We
then present two trainable accenting methods (sec-
tion 3), one adapted from part-of-speech tagging, the
other based on finite-state transducers. We show ex-
perimental results for letter e on the French MeSH
(section 4) with both methods and their combina-
tion. We finally discuss these results (section 5) and
conclude on further research directions.
2 Background
Previous work has addressed text accentuation, with
an emphasis on the cases where all possible words
are assumed to be known (listed in a lexicon). The
issue in that case is to disambiguate unaccented
words when they match several possible accented
word forms in the lexicon – the marche/marché ex-
amples in the introduction.
Yarowsky (1999) addresses accent restoration in
Spanish and in French, and notes that they can be
linked to part-of-speech ambiguities and to seman-
tic ambiguities which context can help to resolve.
He proposes three methods to handle these: N-gram
tagging, Bayesian classification and decision lists,
which obtain the best results. These methods rely
either on full words, on word suffixes or on parts-
of-speech. They are tested on ‘the most problem-
atic cases of each ambiguity type’, extracted from
the Spanish AP Newswire. The agreement with hu-
man accented words reaches 78.4–98.4% depending
on ambiguity type.
Spriet and El-Bèze (1997) use an N-gram model
on parts-of-speech. They evaluate this method on a
19,000 word test corpus consisting of news articles
and obtain a 99.31% accuracy. In this corpus, only
2.6% of the words were unknown, among which
89.5% did not need accents. The resulting error rate
(0.3%) accounts for nearly one half of the total er-
ror rate, but is so small that it is not worth trying to
guess accentuation for unknown words.
The same kind of approach is used in project
RÉACC (Simard, 1998). Here again, unknown
words are left untouched, and account for one fourth
of the errors. We typed the words in table 1
through the demonstration interface of RÉACC on-
line at www-rali.iro.umontreal.ca/Reacc/: none of
these words was accented by the system (7 out of
16 do need accentuation).
When the unaccented words are in the lexicon,
the problem can also be addressed as a spelling cor-
rection task, using methods such as string edit dis-
tance (Levenshtein, 1966), possibly combined with
the previous approach (Ruch et al., 2001).
However, these methods have limited power when
a word is not in the lexicon. At best, they might say
something about accented letters in grammatical af-
fixes which mark contextual, syntactic constraints.
We found no specific reference about the accentua-
tion of such ‘unknown’ words: a method that, when
a word is not listed in the lexicon, proposes an ac-
cented version of that word. Indeed, in the above
works, the proportion of unknown words is too small
for specific steps to be taken to handle them. The sit-
uation is quite different in our case, where about one
fourth of the words are ‘unknown’. Moreover, con-
textual clues are scarce in our short, often ungram-
matical terms.
We took obvious measures to reduce the number
of unknown words: we filtered out the words that
can be found in accented lexicons and corpora. But
this technique is limited by the size of the corpus that
would be necessary for such ‘rare’ words to occur,
and by the lack of availability of specialized French
lexicons for the medical domain.
We then designed two methods that can learn ac-
centing rules for the remaining unknown words: B4CXB5
adapting a POS-tagging method (Brill, 1995) (sec-
tion 3.3); B4CXCXB5 adapting a method designed for learn-
ing morphological rules (Theron and Cloete, 1997)
(section 3.4).
3 Accenting unknown words
3.1 Filtering out know words
The French MeSH was briefly presented in the in-
troduction; we work with the 2001 version. The part
which was accented and converted into mixed case
by the CISMeF team is that of November 2001. As
more resources are added to CISMeF on a regular
basis, a larger number of these accented terms must
now be available. The list of word forms that oc-
cur in these accented terms serves as our base lex-
icon (4861 word forms). We removed from this
list the ‘words’ that contain numbers, those that are
shorter than 3 characters (abbreviations), and con-
verted them in lower case. The resulting lexicon in-
cludes 4054 words (4047 once unaccented). This
lexicon deals with single words. It does not try to
register complex terms such as myocardial infarc-
tion, but instead breaks them into the two words my-
ocardial and infarction.
A word is considered unknown when it is not
listed in our lexicon. A first concern is to filter out
from subsequent processing words that can be found
in larger lexicons. The question is then to find suit-
able sources of additional words.
We used various specialized word lists found on
the Web (lexicon on cancer, general medical lex-
icon) and the ABU lexicon (abu.cnam.fr/DICO),
which contains some 300,000 entries for ‘gen-
eral’ French. Several corpora provided accented
sources for extending this lexicon with some med-
ical words (cardiology, haematology, intensive care,
drawn from the current state of the CLEF corpus
(Habert et al., 2001), and drug monographs). We
also used a word list extracted from the French ver-
sions of two other medical terminologies: the In-
ternational Classification of Diseases (ICD-10) and
the Microglossary for Pathology of the Systematized
Nomenclature of Medicine (SNOMED). This word
list contains 8874 different word forms. The total
number of word forms of the final word list was
276 445.
After application of this list to the MeSH, 7407
words were still not recognized. We converted these
words to lower case, removed those that did not in-
clude the letter e, were shorter than 3 letters (mainly
acronyms) or contained numbers. The remaining
5188 words, among which those listed in table 1,
were submitted to the following procedure.
3.2 Representing the context of a letter
The underlying hypotheses of this method are that
sufficiently regular rules determine, for most words,
which letters are accented, and that the context of
occurrence of a letter (its neighboring letters) is a
good basis for making accentuation decisions. We
attempted to compile these rules by observing the
occurrences of eéèêë in a reference list of words
(the training set, for instance, the part of the French
MeSH accented by the CISMeF team). In the fol-
lowing, we shall call pivot letter a letter that is part
of the confusion set eéèêë (set of letters to discrimi-
nate).
An issue is then to find a suitable description of
the context of a pivot letter in a word, for instance
the letter é in excisée. We explored and compared
two different representation schemes, which under-
lie two accentuation methods.
3.3 Accentuation as contextual tagging
This first method is based on the use of a part-of-
speech tagger: Brill’s (1995) tagger. We consider
each word as a ‘string of letters’: each letter makes
one word, and the sequence of letters of a word
makes a sentence. The ‘tag’ of a letter is the ex-
pected accented form of this letter (or the same letter
if it is not accented). For instance, for the word en-
dometre (endometer), to be accented as endomètre,
the ‘tagged sentence’ is e/e n/n d/d o/o m/m e/è t/t
r/r e/e (in the format of Brill’s tagger). The regular
procedure of the tagger then learns contextual accen-
tuation rules, the first of which are shown on table 2.
Brill Format Gloss
(1) e é NEXT2TAG i e.iB5 eAXé
(2) e é NEXT1OR2TAG o e.?oB5 eAXé
(3) e é NEXT1OR2TAG a e.?aB5 eAXé
(4) e é NEXT1OR2WD e e.?eB5 eAX é
(5) e é NEXT2TAG h e.hB5eAX é
(6) é è NEXTBIGRAM ne éne B5 é AXè
(7) é e NEXTBIGRAM me émeB5éAX e
(8) e é NEXTBIGRAM tr etr B5 eAXé
(9) é e NEXT1OR2OR3TAG x é.?.?xB5 éAXe
(10) e é NEXT1OR2TAG y e.?yB5 eAX é
(11) e é NEXT2TAG u e.uB5eAX é
(12) e é SURROUNDTAG ti teiB5 eAXé
(13) é è NEXTBIGRAM se éseB5 éAX è
Table 2: Accentuation correction rules, of the form
‘change D8
BD
to D8
BE
if test true on DC [DD]’. NEXT2TAG =
second next tag, NEXT1OR2TAG = one of next 2 tags,
NEXTBIGRAM =next2words,NEXT1OR2OR3TAG = one
of next 3 tags, SURROUNDTAG = previous and next
tags,
Given a new ‘sentence’, Brill’s tagger first assigns
each ‘word’ its mots frequent tag: this consists in
accenting no e. The contextual rules are then ap-
plied and successively correct the current accentu-
ation. For instance, when accenting the word flex-
ion, rule (1) first applies (if e with second next tag
= i, change to é) and accentuates the e to yield fléx-
ion (as in ...émie). Rule (9) applies next (if é with
one of next three tags = x, change to e) to correct
this accentuation before an x, which finally results
in flexion. These rules correspond to representations
of the contexts of occurrence of a letter. This rep-
resentation is mixed (left and right contexts can be
combined, e.g.,inSURROUNDTAG, where both imme-
diate left and right tags are examined), and can ex-
tend to a distance of three letters left and right, but
in restricted combinations.
3.4 Mixed context representation
The ‘mixed context’ representation used by
Theron and Cloete (1997) folds the letters of a word
around a pivot letter: it enumerates alternately
the next letter on the right then on the left, until it
reaches the word boundaries, which are marked with
special symbols (here,
CM
for start of word, and $ for
end of word). Theron & Cloete additionally repeat
an out-of-bounds symbol outside the word, whereas
we dispense with these marks. For instance, the
first e in excisée (excised) is represented as the
mixed context in the right column of the first row of
table 3. The left column shows the order in which
the letters of the word are enumerated. The next two
rows explain the mixed context representations for
the two other es in the word. This representation
Word Mixed Context=Output
CM
exc i sée$
2.1345678
x
CM
cisee$=e
CM
exci sée$
876542.13
es$icxe
CM
=é
CM
exci sée$
8765432.1
$esicxe
CM
=e
Table 3: Mixed context representations.
caters for contexts of different sizes and facilitates
their comparison.
Each of these contexts is unaccented (it is meant
to be matched with representations of unaccented
words) and the original form of the pivot letter is
associated to the context as an output (we use the
symbol ‘=’ to mark this output). Each context is
thus converted into a transducer: the input tape is the
mixed context of a pivot letter, and the output tape is
the appropriate letter in the confusion set eéèêë.
The next step is to determine minimal discrimi-
nating contexts (figure 1). To obtain them, we join
all these transducers (OR operator) by factoring their
common prefixes as a trie structure, i.e., a determin-
istic transducer that exactly represents the training
set. We then compute, for each state of this trans-
ducer and for each possible output (letter in the con-
fusion set) reachable from this state, the number of
paths starting from this state that lead to this output.
^allergie$, ^chirurgie$
i
^réfugié$
^cytologie$
^échographie$
^lipoatrophie$
r
h
u
o
e
é
e
e
$
g
3 é
505 e,
65 e
6 e
1 é
63 e
1 é
86 e,
Figure 1: Trie of mixed contexts, each state showing
the frequency of each possible output.
We call a state unambiguous if all the paths from
this state lead to the same output. In that case, for
our needs, these paths may be replaced with a short-
cut to an exit to the common output (see figure 1).
This amounts to generalizing the set of contexts by
replacing them with a set of minimal discriminating
contexts.
Given a word that needs to be accented, the first
step consists in representing the context of each of
its pivot letters. For instance, the word biologie:
$igoloib
CM
. Each context is matched with the trans-
ducer in order to find the longest path from the start
state that corresponds to a prefix of the context string
(here, $igo). If this path leads to an output state, this
output provides the proposed accented form of the
pivot letter (here, e). If the match terminates earlier,
we have an ambiguity: several possible outputs can
be reached (e.g., hémorragie matches $ig).
We can take absolute frequencies into account to
obtain a measure of the support (confidence level)
for a given output C7 from the current state CB:how
much evidence there is to support this decision. It
is computed as the number of contexts of the train-
ing set that go through CB to an output state labelled
with C7 (see figure 1). The accenting procedure can
choose to make a decision only when the support
for that decision is above a given threshold. Table 4
Context Support Gloss Examples
$igo=e 65 –ogie cytologie
$ih=e 63 –hie lipoatrophie
$uqit=e 77 –tique amélanotique
u=e 247 -eu- activateur, calleux
x=e 68 -ex- excisée
Table 4: Some minimal discriminating contexts.
shows some minimal discriminating contexts learnt
from the accented part of the French MeSH with a
high support threshold. However, in previous exper-
iments (Zweigenbaum and Grabar, 2002), we tested
a range of support thresholds and observed that the
gain in precision obtained by raising the support
threshold was minor, and counterbalanced by a large
loss in recall. We therefore do not use this device
here and accept any level of support.
Instead, we take into account the relative frequen-
cies of occurrence of the paths that lead to the dif-
ferent outputs, as marked in the trie. A probabilistic,
majority decision is made on that basis: if one of the
competing outputs has a relative frequency above a
given threshold, this output is chosen. In the present
experiments, we tested two thresholds: 0.9 (90% or
more of the examples must support this case; this
makes the correct decision for hémorragie)and1
(only non-ambiguous states lead to a decision: no
decision for the first e in hemorragie,whichwe
leave unaccented).
Simpler context representations of the same fam-
ily can also be used. We examined right contexts
(a variable-length string of letters on the right of the
pivot letter) and left contexts (idem, on the left).
3.5 Evaluating the rules
We trained both methods, Brill and contexts (mixed,
left and right), on three training sets: the 4054 words
of the accented part of the MeSH, the 54,291 lem-
mas of the ABU lexicon and the 8874 words in the
ICD-SNOMED word list. To check the validity of
the rules, we applied them to the accented part of
the MeSH. The context method knows when it can
make a decision, so that we can separate the words
that are fully processed (CU,alles have lead to deci-
sions) from those that are partially (D4) processed or
not (D2) processed at all. Let CU
CR
the number of correct
accentuations in CU. If we decide to only propose an
accented form for the words that get fully accented,
we can compute recall CA
CU
and precision C8
CU
figures
as follows: CA
CU
BP
CU
CR
CUB7D4B7D2
and C8
CU
BP
CU
CR
CU
. Similar
measures can be computed for D4 and D2,aswellas
for the total set of words.
We then applied the accentuation rules to the 5188
accentable ‘unknown’ words of the MeSH. No gold
standard is available for these words: human vali-
dation was necessary. We drew from that set a ran-
dom sample containing 260 words (5% of the total)
which were reviewed by the CISMeF team. Because
of sampling, precision measures must include a con-
fidence interval.
We also tested whether the results of several meth-
ods can be combined to increase precision. We sim-
ply applied a consensus rule (intersection): a word
is accepted only if all the methods considered agree
on its accentuation.
The programs were developed in the Perl5 lan-
guage. They include a trie manipulation package
which we wrote by extending the Tree::Trie pack-
age, online on the Comprehensive Perl Archive Net-
work (www.cpan.org).
4 Results
The baseline of this task consists in accenting no e.
On the accented part of the MeSH, it obtains an ac-
curacy of 0.623, and on the test sample, 0.642. The
Brill tagger learns 80 contextual rules with MeSH
training (208 on ABU and 47 on CIM-SNOMED).
The context method learns 1,832 rules on the MeSH
training set (16,591 on ABU and 3,050 on CIM-
SNOMED).
Tables 5, 6 and 7 summarize the validation results
obtained on the accented part of the MeSH. Set de-
notes the subset of words as explained in section 3.5.
Cor. stands for the number of correctly accented
words.
Not surprizingly, the best global precision is ob-
tained with MeSH training (table 6). The mixed
context method obtains a perfect precision, whereas
Brill reaches 0.901 (table 5). ABU and CIM-
SNOMED training also obtain good results (table 7),
again better with the mixed context method (0.912–
0.931) than with Brill (0.871–0.895). We performed
the same tests with right and left contexts (table 6):
precision can be as good for fully processed words
(set CU) as that of mixed contexts, but recall is always
lower. The results of these two context variants are
therefore not kept in the following tables. Both pre-
cision and recall are generally slightly better with
the majority decision variant. If we concentrate on
the fully processed words (CU), precision is always
higher than the global result and than that of words
with no decision (D2). The D2 class, whose words
are left unaccented, generally obtain a precision well
over the baseline. Partially processed words (D4)are
always those with the worst precision.
training set cor. recall precisionA6ci
MeSH 3646 0.899 0.901A60.009
ABU 3524 0.869 0.871A60.010
CIM-SNOMED 3621 0.893 0.895A60.009
Table 5: Validation: Brill, 4054 words of accented
MeSH.
context set cor. recall precisionA6ci
right D2 1906 0.470 0.747A60.017
D4 943 0.233 0.804A60.023
CU 324 0.080 1.000A60.000
D8D3D8 3173 0.783 0.784A60.013
left D2 743 0.183 0.649A60.028
D4 500 0.123 0.428A60.028
CU 1734 0.428 1.000A60.000
D8D3D8 2977 0.734 0.736A60.014
mixed D2 7 0.002 1.000A60.000
D4 0 0.000 0.000A60.000
CU 4040 0.997 1.000A60.000
D8D3D8 4047 0.998 1.000A60.000
majority decision (0.9)
mixed D2 2 0.000 1.000A60.000
D4 0 0.000 0.000A60.000
CU 4045 0.998 1.000A60.000
D8D3D8 4047 0.998 1.000A60.000
Table 6: Validation: different context methods,
MeSH training, 4054 words of accented MeSH.
Precision and recall for the unaccented part of
the MeSH are showed on tables 8 and 9. The
global results with the different training sets at
breakeven point, with their confidence intervals, are
not really distinguishable. They are clustered from
0.819A60.047 to 0.842A60.044, except the unambigu-
ous decision method trained on MeSH which stands
a bit lower at 0.800A60.049 and the Brill tagger
trained on ABU (0.785). If we only consider fully
processed words, precision can reach 0.884A60.043
(ICD-SNOMED training, majority decision), with a
recall of 0.731 (or 0.876A60.043 / 0.758 with MeSH
training, majority decision).
Consensus combination of several methods (ta-
ble 8) does increase precision, at the expense of
recall. A precision/recall of 0.920A60.037/0.750 is
ABU training (strict)
set cor. recall precisionA6ci
D2 368 0.091 0.864A60.033
D4 227 0.056 0.668A60.050
CU 3164 0.780 0.964A60.006
D8D3D8 3759 0.927 0.929A60.008
majority decision (0.9)
cor. recall precisionA6ci
111 0.027 0.860A60.060
77 0.019 0.524A60.081
3585 0.884 0.951A60.007
3773 0.931 0.932A60.008
CIM-SNOMED training
D2 176 0.043 0.752A60.055
D4 114 0.028 0.425A60.059
CU 3400 0.839 0.959A60.007
D8D3D8 3690 0.910 0.912A60.009
majority decision (0.9)
57 0.014 0.803A60.093
51 0.013 0.300A60.069
3607 0.890 0.948A60.007
3715 0.916 0.918A60.008
Table 7: Validation: mixed contexts, strict (thresh-
old = 1) and majority (threshold = 0.9) decisions,
4054 words of accented MeSH.
training set cor. recall precisionA6ci
MeSH 219 0.842 0.842A60.044
ABU 204 0.785 0.785A60.050
CIM-SNOMED 218 0.838 0.838A60.045
Combined methods
mesh/Brill + mesh/majority 195 0.750 0.920A60.037
mesh/Brill + mesh/majority
CU
185 0.712 0.930A60.036
mesh+abu+cim-snomed/Brill 178 0.685 0.927A60.037
+ mesh/majority
Table 8: Evaluation on the rest of the MeSH: Brill,
estimate on 5% sample (260 words).
obtained by combining Brill and the mixed context
method (majority decision), with MeSH training on
both sides. The same level of precision is obtained
with other combinations, but with lower recalls.
5 Discussion and Conclusion
We showed that a higher precision, which should
make human post-editing easier, can be obtained in
two ways. First, within the mixed context method,
three sets of words are separated: if only the ‘fully
processed’ words CU are considered (table 9), preci-
sion/recall can reach 0.884/0.731 (CIM-SNOMED,
majority) or 0.876/0.758 (MeSH, majority). Second,
the results of several methods can be combined with
a consensus rule: a word is accepted only if all these
methods agree on its accentuation. The combination
of Brill mixed contexts (majority decision), for in-
stance with MeSH training on both sides, increases
precision to 0.920A60.037 with a recall still at 0.750
(table 8).
The results obtained show that the methods pre-
sented here obtain not only good performance on
their training set, but also useful results on the tar-
MeSH training (strict)
set cor. recall precisionA6ci
D2 19 0.073 0.731A60.170
D4 15 0.058 0.429A60.164
CU 174 0.669 0.874A60.046
D8D3D8 208 0.800 0.800A60.049
majority decision
cor. recall precisionA6ci
8 0.031 0.727A60.263
11 0.042 0.458A60.199
197 0.758 0.876A60.043
216 0.831 0.831A60.046
ABU training (strict)
D2 30 0.115 0.882A60.108
D4 32 0.123 0.711A60.132
CU 153 0.588 0.845A60.053
D8D3D8 215 0.827 0.827A60.046
majority decision
13 0.050 0.929A60.135
11 0.042 0.786A60.215
194 0.746 0.836A60.048
218 0.838 0.838A60.045
CIM-SNOMED training
D2 27 0.104 0.818A60.132
D4 19 0.073 0.487A60.157
CU 168 0.646 0.894A60.044
D8D3D8 214 0.823 0.823A60.046
majority decision
14 0.054 0.824A60.181
9 0.035 0.321A60.173
190 0.731 0.884A60.043
213 0.819 0.819A60.047
Table 9: Evaluation on the rest of the MeSH: mixed
contexts, estimate on same 5% sample.
get data. We believe these methods will allow us to
reduce dramatically the final human time needed to
accentuate useful resources such as the MeSH the-
saurus and ADM knowledge base.
It is interesting that a general-language lexicon
such as ABU can be a good training set for accent-
ing specialized-language unknown words, although
this is true with the mixed context method and the
reverse with the Brill tagger.
A study of the 44 errors made by the mixed con-
text method (table 9, MeSH training, majority deci-
sion: 216 correct out of 260) revealed the follow-
ing errors classes. MeSH terms contain some En-
glish words (academy, cleavage) and many Latin
words (arenaria, chrysantemi, denitrificans), some
of which built over proper names (edwardsiella).
These loan words should not bear accents; some of
their patterns are correctly processed by the meth-
ods presented here (i.e., unaccented eae$, ella$), but
others are not distinguishable from normal French
words and get erroneously accented (rena of are-
naria is erroneously processed as in rénal; académy
as in académie). A first-stage classifier might help
handle this issue by categorizing Latin (and English)
words and excluding them from processing. Our
first such experiments are not conclusive and add as
many errors as are removed.
Another class of errors are related with mor-
pheme boundaries: some accentuation rules which
depend on the start-of-word boundary would need
to apply to morpheme boundaries. For in-
stance, pilo/erection fails to receive the é of r
CM
e=é
(
CM
érection), apic/ectomie erroneously receives an é
as in cc=é (cécité). An accurate morpheme seg-
menter would be needed to provide suitable input
to this process without again adding noise to it.
In some instances, no accentuation decision could
be made because no example had been learnt for a
specific context (e.g., accentuation of céfalo in ce-
faloglycine).
We also uncovered accentuation inconsistencies
in both the already accented MeSH words and the
validated sample (e.g., bacterium or bactérium in
different compounds). Cross-checking on the Web
confirmed the variability in the accentuation of rare
words. This shows the difficulty to obtain consistent
human accentuation across large sets of complex
words. One potential development of the present au-
tomated accentuation methods could be to check the
consistency of word lists. In addition, we discovered
spelling errors in some MeSH terms (e.g., bethane-
chol instead of betanechol prevents the proper ac-
centuation of beta).
Finally, further testing is necessary to check the
relevance of these methods to other accented letters
in French and in other languages.
Acknowledgements
We wish to thank Magaly Douyère, Benoît Thirion
and Stéfan Darmoni, of the CISMeF team, for pro-
viding us with accented MeSH terms and patiently
reviewing the automatically accented word samples.

References
[Brill1995] Eric Brill. 1995. Transformation-based error-
driven learning and natural language processing: A
case study in part-of-speech tagging. Computational
Linguistics, 21(4):543–565.
[Darmoni et al.2000] Stéfan J. Darmoni, J.-P. Leroy,
Benoît Thirion, F. Baudic, Magali Douyere, and
J. Piot. 2000. CISMeF: a structured health resource
guide. Methods Inf Med, 39(1):30–35.
[Garnier and Delamare1992] M. Garnier and V. Dela-
mare. 1992. Dictionnaire des Termes de Médecine.
Maloine, Paris.
[Grabar and Zweigenbaum2000] Natalia Grabar and
Pierre Zweigenbaum. 2000. Automatic acquisition
of domain-specific morphological resources from the-
sauri. In Proceedings of RIAO 2000: Content-Based
Multimedia Information Access, pages 765–784,
Paris, France, April. C.I.D.
[Habert et al.2001] Benoît Habert, Natalia Grabar, Pierre
Jacquemart, and Pierre Zweigenbaum. 2001. Build-
ing a text corpus for representing the variety of medi-
cal language. In Corpus Linguistics 2001, Lancaster.
[INS2000] Institut National de la Santé et de la Recherche
Médicale, Paris, 2000. Thésaurus Biomédical
Français/Anglais.
[Levenshtein1966] V. I. Levenshtein. 1966. Binary codes
capable of correcting deletions, insertions, and rever-
sals. Soviet Physics-Doklandy, pages 707–710.
[Ruch et al.2001] Patrick Ruch, Robert H. Baud, Antoine
Geissbuhler, Christian Lovis, Anne-Marie Rassinoux,
and A. Rivière. 2001. Looking back or looking all
around: comparing two spell checking strategies for
documents edition in an electronic patient record. J
Am Med Inform Assoc, 8(suppl):568–572.
[Seka et al.1997] LP Seka, C Courtin, and P Le Beux.
1997. ADM-INDEX: an automated system for index-
ing and retrieval of medical texts. In Stud Health Tech-
nol Inform, volume 43 Pt A, pages 406–410. Reidel.
[Simard1998] Michel Simard. 1998. Automatic inser-
tion of accents in French text. In Proceedings of the
Third Conference on Empirical Methods in Natural
Language Processing, Grenade.
[Spriet and El-Bèze1997] Thierry Spriet and Marc El-
Bèze. 1997. Réaccentuation automatique de textes.
In FRACTAL 97, Besançon.
[Theron and Cloete1997] Pieter Theron and Ian Cloete.
1997. Automatic acquisition of two-level morpholog-
ical rules. In Ralph Grishman, editor, Proceedings
of the Fifth Conference on Applied Natural Language
Processing, pages 103–110, Washington, DC, March-
April. ACL.
[Yarowsky1999] David Yarowsky. 1999. Corpus-based
techniques for restoring accents in Spanish and French
text. In Natural Language Processing Using Very
Large Corpora, pages 99–120. Kluwer Academic Pub-
lishers.
[Zweigenbaum and Grabar2002] Pierre Zweigenbaum
and Natalia Grabar. 2002. Accenting unknown words:
application to the French version of the MeSH. In
Workshop NLP in Biomedical Applications, pages
69–74, Cyprus, March. EFMI.
[Zweigenbaum2001] Pierre Zweigenbaum. 2001. Re-
sources for the medical domain: medical terminolo-
gies, lexicons and corpora. ELRA Newsletter, 6(4):8–
11.
