Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 843–850, Vancouver, October 2005. c©2005 Association for Computational Linguistics
Paradigmatic Modi ability Statistics for the Extraction
of Complex Multi-Word Terms
Joachim Wermter
Jena University
Language & Information Engineering
JULIE Lab
Friedrich-Schiller-Universitcurrency1at Jena
Fcurrency1urstengraben 30
D-07743 Jena, Germany
joachim.wermter@uni-jena.de
Udo Hahn
Jena University
Language & Information Engineering
JULIE Lab
Friedrich-Schiller-Universitcurrency1at Jena
Fcurrency1urstengraben 30
D-07743 Jena, Germany
udo.hahn@uni-jena.de
Abstract
We here propose a new method which sets
apart domain-speci c terminology from
common non-speci c noun phrases. It
is based on the observation that termino-
logical multi-word groups reveal a con-
siderably lesser degree of distributional
variation than non-speci c noun phrases.
We de ne a measure for the observable
amount of paradigmatic modi ability of
terms and, subsequently, test it on bigram,
trigram and quadgram noun phrases ex-
tracted from a 104-million-word biomedi-
cal text corpus. Using a community-wide
curated biomedical terminology system as
an evaluation gold standard, we show that
our algorithm signi cantly outperforms
a variety of standard term identi cation
measures. We also provide empirical ev-
idence that our methodolgy is essentially
domain- and corpus-size-independent.
1 Introduction
As we witness the ever-increasing proliferation of
volumes of medical and biological documents, the
available dictionaries and terminological systems
cannot keep up with this pace of growth and, hence,
become more and more incomplete. What’s worse,
the constant stream of new terms is increasingly get-
ting unmanageable because human curators are in
the loop. The costly, often error-prone and time-
consuming nature of manually identifying new ter-
minology from the most recent literature calls for
advanced procedures which can automatically assist
database curators in the task of assembling, updat-
ing and maintaining domain-speci c controlled vo-
cabularies. Whereas the recognition of single-word
terms usually does not pose any particular chal-
lenges, the vast majority of biomedical or any other
domain-speci c terms typically consists of multi-
word units.1 Unfortunately these are much more
dif cult to recognize and extract than their singleton
counterparts. Moreover, although the need to assem-
ble and extend technical and scienti c terminologies
is currently most pressing in the biomedical domain,
virtually any (sub-) eld of human research/expertise
in which we deal with terminologically structured
knowledge calls for high-performance terminology
identi cation and extraction methods. We want to
target exactly this challenge.
2 Related Work
The automatic extraction of complex multi-word
terms from domain-speci c corpora is already an
active  eld of research (cf., e.g., for the biomedi-
cal domain Rind esch et al. (1999), Collier et al.
(2002), Bodenreider et al. (2002), or Nenadi·c et
al. (2003)). Typically, in all of these approaches
term candidates are collected from texts by vari-
ous forms of linguistic  ltering (part-of-speech tag-
ging, phrase chunking, etc.), through which candi-
dates obeying various linguistic patterns are iden-
ti ed (e.g., noun-noun, adjective-noun-noun com-
binations). These candidates are then submitted to
frequency- or statistically-based evidence measures
1Nakagawa and Mori (2002) claim that more than 85% of
domain-speci c terms are multi-word units.
843
(such as the C-value (Frantzi et al., 2000)), which
compute scores indicating to what degree a candi-
date quali es as a term. Term mining, as a whole,
is a complex process involving several other com-
ponents (orthographic and morphological normal-
ization, acronym detection, con ation of term vari-
ants, term context, term clustering; cf. Nenadi·c et al.
(2003)). Still, the measure which assigns a termhood
value to a term candidate is the essential building
block of any term identi cation system.
For multi-word automatic term recognition
(ATR), the C-value approach (Frantzi et al., 2000;
Nenadi·c et al., 2004), which aims at improving the
extraction of nested terms, has been one of the most
widely used techniques in recent years. Other po-
tential association measures are mutual information
(Damerau, 1993) and the whole battery of statisti-
cal and information-theoretic measures (t-test, log-
likelihood, entropy) which are typically employed
for the extraction of general-language collocations
(Manning and Schcurrency1utze, 1999; Evert and Krenn,
2001). While these measures have their statistical
merits in terminology identi cation, it is interesting
to note that they only make little use of linguistic
properties inherent to complex terms.2
More linguistically oriented work on ATR by
Daille (1996) or on term variation by Jacquemin
(1999) builds on the deep syntactic analysis of term
candidates. This includes morphological and head-
modi er dependency analysis and thus presupposes
accurate, high-quality parsing which, for sublan-
guages at least, can only be achieved by a highly
domain-dependent type of grammar. As sublan-
guages from different domains usually reveal a high
degree of syntactic variability among each other
(e.g., in terms of POS distribution, syntactic pat-
terns), this property makes it dif cult to port gram-
matical speci cations to different domains.
Therefore, one may wonder whether there are
cross-domain linguistic properties which might be
bene cial to ATR and still could be accounted for
by only shallow syntactic analysis. In this paper,
we propose the limited paradigmatic modi ability of
terms as a criterion which meets these requirements
and will elaborate on it in detail in Subsection 3.3.
2A notable exception is the C-value method which incorpo-
rates a term’s likelihood of being nested in other multi-word
units.
3 Methods and Experiments
3.1 Text Corpus
We collected a biomedical training corpus of ap-
proximately 513,000 MEDLINE abstracts using the
following query composed of MESH terms from
the biomedical domain: transcription factors, blood
cells and human.3 We then annotated the result-
ing 104-million-word corpus with the GENIA part-
of-speech tagger4 and identi ed noun phrases (NPs)
with the YAMCHA chunker (Kudo and Matsumoto,
2001). We restrict our study to NP recognition
(i.e., determining the extension of a noun phrase but
refraining from assigning any internal constituent
structure to that phrase), because the vast majority of
technical or scienti c terms surface as noun phrases
(Justeson and Katz, 1995). We  ltered out a num-
ber of stop words (determiners, pronouns, measure
symbols, etc.) and also ignored noun phrases with
coordination markers ( and ,  or , etc.).5
n-gram cut-off NP term candidates
length tokens types
no cut-off 5,920,018 1,055,820bigrams
c ≥ 10 4,185,427 67,308
no cut-off 3,110,786 1,655,440trigrams
c ≥ 8 1,053,651 31,017
no cut-off 1,686,745 1,356,547quadgrams
c ≥ 6 222,255 10,838
Table 1: Frequency distribution for noun phrase term candi-
date tokens and types for the MEDLINE text corpus
In order to obtain the term candidate sets (see Ta-
ble 1), we counted the frequency of occurrence of
noun phrases in our training corpus and categorized
them according to their length. For this study, we re-
stricted ourselves to noun phrases of length 2 (word
bigrams), length 3 (word trigrams) and length 4
(word quadgrams). Morphological normalization of
term candidates has shown to be bene cial for ATR
(Nenadi·c et al., 2004). We thus normalized the nom-
3MEDLINE (http://www.ncbi.nlm.nih.gov) is the
largest biomedical bibliographic database. For information re-
trieval purposes, all of its abstracts are indexed with a controlled
indexing vocabulary, the Medical Subject Headings (MESH,
2004).
4http://www-tsujii.is.s.u-tokyo.ac.jp/
GENIA/postagger/
5Of course, terms can also be contained within coordinative
structures (e.g.,  B and T cell ). However, analyzing their in-
herent ambiguity is a complex syntactic operation, with a com-
paratively marginal bene t for ATR (Nenadi·c et al., 2004).
844
inal head of each noun phrase (typically the right-
most noun in English) via the full-form UMLS SPE-
CIALIST LEXICON (UMLS, 2004), a large repository
of both general-language and domain-speci c (med-
ical) vocabulary. To eliminate noisy low-frequency
data (cf. also Evert and Krenn (2001)), we de ned
different frequency cut-off thresholds, c, for the bi-
gram, trigram and quadgram candidate sets and only
considered candidates above these thresholds.
3.2 Evaluating Term Extraction Quality
Typically, terminology extraction studies evaluate
the goodness of their algorithms by having their
ranked output examined by domain experts who
identify the true positives among the ranked can-
didates. There are several problems with such an
approach. First, very often only one such expert
is consulted and, hence, inter-annotator agreement
cannot be determined (as, e.g., in the studies of
Frantzi et al. (2000) or Collier et al. (2002)). Fur-
thermore, what constitutes a relevant term for a par-
ticular domain may be rather dif cult to decide  
even for domain experts  when judges are just ex-
posed to a list of candidates without any further con-
text information. Thus, rather than relying on ad
hoc human judgment in identifying true positives in
a candidate set, as an alternative we may take al-
ready existing terminolgical resources into account.
They have evolved over many years and usually re-
 ect community-wide consensus achieved by expert
committees. With these considerations in mind, the
biomedical domain is an ideal test bed for evaluat-
ing the goodness of ATR algorithms because it hosts
one of the most extensive and most carefully curated
terminological resources, viz. the UMLS METATHE-
SAURUS (UMLS, 2004). We will then take the mere
existence of a term in the UMLS as the decision cri-
terion whether or not a candidate term is also recog-
nized as a biomedical term.
Accordingly, for the purpose of evaluating the
quality of different measures in recognizing multi-
word terms from the biomedical literature, we as-
sign every word bigram, trigram, and quadgram in
our candidate sets (see Table 1) the status of being
a term (i.e., a true positive), if it is found in the
2004 edition of the UMLS METATHESAURUS.6 For
6We exclude UMLS vocabularies not relevant for molecular
biology, such as nursing and health care billing codes.
example, the word trigram  long terminal repeat 
is listed as a term in one of the UMLS vocabular-
ies, viz. MESH (2004), whereas  t cell response 
is not. Thus, among the 67,308 word bigram candi-
date types, 14,650 (21.8%) were identi ed as true
terms; among the 31,017 word trigram candidate
types, their number amounts to 3,590 (11.6%), while
among the 10,838 word quadgram types, 873 (8.1%)
were identi ed as true terms.7
3.3 Paradigmatic Modi ability of Terms
For most standard association measures utilized for
terminology extraction, the frequency of occurrence
of the term candidates either plays a major role
(e.g., C-value), or has at least a signi cant impact
on the assignment of the degree of termhood (e.g.,
t-test). However, frequency of occurrence in a train-
ing corpus may be misleading regarding the deci-
sion whether or not a multi-word expression is a
term. For example, taking the two trigram multi-
word expressions from the previous subsection, the
non-term  t cell response appears 2410 times in
our 104-million-word MEDLINE corpus, whereas
the term  long terminal repeat (long repeating se-
quences of DNA) only appears 434 times (see also
Tables 2 and 3 below).
The linguistic property around which we built our
measure of termhood is the limited paradigmatic
modi ability of multi-word terminological units. A
multi-word expression such as  long terminal re-
peat contains three token slots in which slot 1 is
 lled by  long , slot 2 by  terminal and slot 3 by
 repeat . The limited paradigmatic modi ability of
such a trigram is now de ned by the probability with
which one or more such slots cannot be  lled by
other tokens. We estimate the likelihood of preclud-
ing the appearance of alternative tokens in particular
slot positions by employing the standard combina-
tory formula without repetitions. For an n-gram (of
size n) to select k slots (i.e., in an unordered selec-
tion) we thus de ne:
C(n, k) = n!k!(n−k)! (1)
7As can be seen, not only does the number of candidate
types drop with increasing n-gram length but also the propor-
tion of true terms. In fact, their proportion drops more sharply
than can actually be seen from the above data because the vari-
ous cut-off thresholds have a leveling effect.
845
For example, for n = 3 (word trigram) and k = 1
and k = 2 slots, there are three possible selections
for each k for  long terminal repeat and for  t cell
response (see Tables 2 and 3). k is actually a place-
holder for any possible token (and its frequency)
which  lls this position in the training corpus.
n-gram freq P -Mod (k=1,2)
long terminal repeat 434 0.03
k slots possible selections sel freq modsel
k = 1 k1 terminal repeat 460 0.940
long k2 repeat 448 0.970
long terminal k3 436 0.995
mod1 =0.91
k = 2 k1 k2 repeat 1831 0.23
k1 terminal k3 1062 0.41
long k2 k3 1371 0.32
mod2 =0.03
Table 2: P -Mod and k-modi abilities for k = 1 and k = 2
for the trigram term  long terminal repeat 
n-gram freq P -Mod (k=1,2)
t cell response 2410 0.00005
k slots possible selections sel freq modsel
k = 1 k1 cell response 3248 0.74
t k2 response 2665 0.90
t cell k3 27424 0.09
mod1 =0.06
k = 2 k1 k2 response 40143 0.06
k1 cell k3 120056 0.02
t k2 k3 34925 0.07
mod2 =0.00008
Table 3: P -Mod and k-modi abilities for k = 1 and k = 2
for the trigram non-term  t cell response 
Now, for a particular k (1 ≤ k ≤ n; n = length of
n-gram), the frequency of each possible selection,
sel, is determined. The paradigmatic modi ability
for a particular selection sel is then de ned by the
n-gram’s frequency scaled against the frequency of
sel. As can be seen in Tables 2 and 3, a lower fre-
quency induces a more limited paradigmatic modi -
ability for a particular sel (which is, of course, ex-
pressed as a higher probability value; see the column
labeled modsel in both tables). Thus, with s being
the number of distinct possible selections for a par-
ticular k, the k-modi ability, modk, of an n-gram
can be de ned as follows (f stands for frequency):
modk(n-gram) :=
sproductdisplay
i=1
f(n-gram)
f(seli, n-gram) (2)
The paradigmatic modi ability, P -Mod, of an n-
gram is the product of all its k-modi abilities:8
P -Mod(n-gram) :=
nproductdisplay
k=1
modk(n-gram) (3)
Comparing the trigram P -Mod values for k =
1, 2 in Tables 2 and 3, it can be seen that the term
 long terminal repeat gets a much higher weight
than the non-term  t cell response , although their
mere frequency values suggest the opposite. This is
also re ected in the respective list rank (see Subsec-
tion 4.1 for details) assigned to both trigrams by the
t-test and by our P -Mod measure. While  t cell re-
sponse has rank 24 on the t-test output list (which
directly re ects its high frequency), P -Mod assigns
rank 1249 to it. Conversely,  long terminal repeat 
is ranked on position 242 by the t-test, whereas it
occupies rank 24 for P -Mod. In fact, even lower-
frequency multi-word units gain a prominent rank-
ing, if they exhibit limited paradigmatic modi abil-
ity. For example, the trigram term  porphyria cu-
tanea tarda is ranked on position 28 by P -Mod,
although its frequency is only 48 (which results in
rank 3291 on the t-test output list). Despite its lower
frequency, this term is judged as being relevant for
the molecular biology domain.9 It should be noted
that the termhood values (and the corresponding list
ranks) computed by P -Mod also include k = 3 and,
hence, take into account a reasonable amount of fre-
quency load. As can be seen from the previous rank-
ing examples, still this factor does not override the
paradigmatic modi ability factors of the lower ks.
On the other hand, P -Mod will also demote true
terms in their ranking, if their paradigmatic modi -
ability is less limited. This is particularly the case if
one or more of the tokens of a particular term often
occur in the same slot of other equal-length n-grams.
For example, the trigram term  bone marrow cell 
occurs 1757 times in our corpus and is thus ranked
quite high (position 31) by the t-test. P -Mod, how-
ever, ranks this term on position 550 because the to-
8Setting the upper limit of k to n (e.g., n = 3 for trigrams)
actually has the pleasant side effect of including frequency in
our modi ability measure. In this case, the only possible selec-
tion k1k2k3 as the denominator of Formula (2) is equivalent to
summing up the frequencies of all trigram term candidates.
9It denotes a group of related disorders, all of which arise
from a de cient activity of the heme synthetic enzyme uropor-
phyrinogen decarboxylase (URO-D) in the liver.
846
ken  cell also occurs in many other trigrams and
thus leads to a less limited paradigmatic modi abil-
ity. Still, the underlying assumption of our approach
is that such a case is more an exception than the rule
and that terms are linguistically more ‘frozen’ than
non-terms, which is exactly the intuition behind our
measure of limited paradigmatic modi ability.
3.4 Methods of Evaluation
As already described in Subsection 3.2, standard
procedures for evaluating the quality of termhood
measures usually involve identifying the true posi-
tives among a (usually) arbitrarily set number of the
m highest ranked candidates returned by a particu-
lar measure, a procedure usually carried out by a do-
main expert. Because this is labor-intensive (besides
being unreliable), m is usually small, ranging from
50 to several hundreds.10 By contrast, we choose
a large and already consensual terminology to iden-
tify the true terms in our candidate sets. Thus, we
are able to dynamically examine various m-highest
ranked samples, which, in turn, allows for the plot-
ting of standard precision and recall graphs for the
entire candidate set. We thus provide a more reli-
able evaluation setting for ATR measures than what
is common practice in the literature.
We compare our P -Mod algorithm against the
t-test measure,11 which, of all standard measures,
yields the best results in general-language collo-
cation extraction studies (Evert and Krenn, 2001),
and also against the widely used C-value, which
aims at enhancing the common frequency of occur-
rence measure by making it sensitive to nested terms
(Frantzi et al., 2000). Our baseline is de ned by the
proportion of true positives (i.e., the proportion of
terms) in our bi-, tri- and quadgram candidate sets.
This is equivalent to the likelihood of  nding a true
positive by blindly picking from one of the different
sets (see Subsection 3.2).
10Studies on collocation extraction (e.g., by Evert and Krenn
(2001)) also point out the inadequacy of such evaluation meth-
ods. In essence, they usually lead to very super cial judgments
about the measures under scrutiny.
11Manning and Schcurrency1utze (1999) describe how this measure
can be used for the extraction of multi-word expressions.
4 Results and Discussion
4.1 Precision/Recall for Terminology Extraction
For each of the different candidate sets, we incre-
mentally examined portions of the ranked output
lists returned by each of the three measures we con-
sidered. The precision values for the various por-
tions were computed such that for each percent point
of the list, the number of true positives found (i.e.,
the number of terms) was scaled against the overall
number of candidate items returned. This yields the
(descending) precision curves in Figures 1, 2 and 3
and some associated values in Table 4.
Portion of Precision scores of measures
ranked list
considered P -Mod t-test C-value
1% 0.82 0.62 0.62
Bigrams 10% 0.53 0.42 0.41
20% 0.42 0.35 0.34
30% 0.37 0.32 0.31
baseline 0.22 0.22 0.22
1% 0.62 0.55 0.54
Trigrams 10% 0.37 0.29 0.28
20% 0.29 0.23 0.23
30% 0.24 0.20 0.19
baseline 0.12 0.12 0.12
1% 0.43 0.50 0.50
Quadgrams 10% 0.26 0.24 0.23
20% 0.20 0.16 0.16
30% 0.18 0.14 0.14
baseline 0.08 0.08 0.08
Table 4: Precision scores for biomedical term extraction at
selected portions of the ranked list
First, we observe that, for the various n-gram
candidate sets examined, all measures outperform
the baselines by far, and, thus, all are potentially
useful measures for grading termhood. Still, the
P -Mod criterion substantially outperforms all other
measures at almost all points for all n-grams exam-
ined. Considering 1% of the bigram list (i.e., the  rst
673 candidates) precision for P -Mod is 20 points
higher than for the t-test and the C-value. At 1%
of the trigram list (i.e., the  rst 310 candidates),
P -Mod’s lead is 7 points. Considering 1% of the
quadgrams (i.e., the  rst 108 candidates), the t-test
actually leads by 7 points. At 10% of the quadgram
list, however, the P -Mod precision score has over-
taken the other ones. With increasing portions of all
ranked lists considered, the precision curves start to
converge toward the baseline, but P -Mod maintains
a steady advantage.
847
 0
 0.2
 0.4
 0.6
 0.8
 1
100908070605040302010
Portion of ranked list (in %)
Precision: P-Mod
Precision: T-test
Precision: C-value
Recall: P-Mod
Recall: T-test
Recall: C-value
Base
Figure 1: Precision/Recall for bigram biomedical term extrac-
tion
 0
 0.2
 0.4
 0.6
 0.8
 1
100908070605040302010
Portion of ranked list (in %)
Precision: P-Mod
Precision: T-test
Precision: C-value
Recall: P-Mod
Recall: T-test
Recall: C-value
Base
Figure 2: Precision/Recall for trigram biomedical term ex-
traction
 0
 0.2
 0.4
 0.6
 0.8
 1
100908070605040302010
Portion of ranked list (in %)
Precision: P-Mod
Precision: T-test
Precision: C-value
Recall: P-Mod
Recall: T-test
Recall: C-value
Base
Figure 3: Precision/Recall for quadgram biomedical term ex-
traction
The (ascending) recall curves in Figures 1, 2 and
3 and their corresponding values in Table 5 indicate
which proportion of all true positives (i.e., the pro-
portion of all terms in a candidate set) is identi ed by
a particular measure at a certain point of the ranked
list. For term extraction, recall is an even better indi-
cator of a particular measure’s performance because
 nding a bigger proportion of the true terms at an
early stage is simply more economical.
Recall Portion of Ranked List
scores of
measures P -Mod t-test C-value
0.5 29% 35% 37%
0.6 39% 45% 47%
Bigrams 0.7 51% 56% 59%
0.8 65% 69% 72%
0.9 82% 83% 85%
0.5 19% 28% 30%
Trigrams 0.6 27% 38% 40%
0.7 36% 50% 53%
0.8 50% 63% 66%
0.9 68% 77% 84%
0.5 20% 28% 30%
0.6 26% 38% 40%
Quadgrams 0.7 34% 49% 53%
0.8 45% 62% 65%
0.9 61% 79% 82%
Table 5: Portions of the ranked list to consider for selected
recall scores for biomedical term extraction
Again, our linguistically motivated terminology
extraction algorithm outperforms its competitors,
and with respect to tri- and quadgrams, its gain is
even more pronounced than for precision. In order to
get a 0.5 recall for bigram terms, P -Mod only needs
to winnow 29% of the ranked list, whereas the t-test
and C-value need to winnow 35% and 37%, respec-
tively. For trigrams and quadgrams, P -Mod only
needs to examine 19% and 20% of the list, whereas
the other two measures have to scan almost 10 ad-
ditional percentage points. In order to obtain a 0.6,
0.7, 0.8 and 0.9 recall, the differences between the
measures narrow for bigram terms, but they widen
substantially for tri- and quadgram terms. To obtain
a 0.6 recall for trigram terms, P -Mod only needs to
winnow 27% of its output list while the t-test and
C-value must consider 38% and 40%, respectively.
For a level of 0.7 recall, P -Mod only needs to an-
alyze 36%, while the t-test already searches 50% of
the ranked list. For 0.8 recall, this relation is 50%
848
(P -Mod) to 63% (t-test), and at recall point 0.9,
68% (P -Mod) to 77% (t-test). For quadgram term
identi cation, the results for P -Mod are equally su-
perior to those for the other measures, and at recall
points 0.8 and 0.9 even more pronounced than for
trigram terms.
We also tested the signi cance of differences for
these results, both comparing P -Mod vs. t-test and
P -Mod vs. C-value. Because in all cases the ranked
lists were taken from the same set of candidates (viz.
the set of bigram, trigram, and quadgram candidate
types), and hence constitute dependent samples, we
applied the McNemar test (Sachs, 1984) for statis-
tical testing. We selected 100 measure points in the
ranked lists, one after each increment of one percent,
and then used the two-tailed test for a con dence in-
terval of 95%. Table 6 lists the number of signi cant
differences for these measure points at intervals of
10 for the bi-, tri-, and quadgram results. For the bi-
gram differences between P -Mod and C-value, all
of them are signi cant, and between P -Mod and
t-test, all are signi cantly different up to measure
point 70.12 Looking at the tri- and quadgrams, al-
though the number of signi cant differences is less
than for bigrams, the vast majority of measure points
is still signi cantly different and thus underlines the
superior performance of the P -Mod measure.
# of # of signi cant differences comparing
measure P -Mod with
points t-test C-val t-test C-val t-test C-val
10 10 10 9 9 3 3
20 20 20 19 19 13 13
30 30 30 29 29 24 24
40 40 40 39 39 33 33
50 50 50 49 49 43 43
60 60 60 59 59 53 53
70 70 70 69 69 63 63
80 75 80 79 79 73 73
90 84 90 89 89 82 83
100 93 100 90 98 82 91
bigrams trigrams quadgrams
Table 6: Signi cance testing of differences for bi-, tri- and
quadgrams using the two-tailed McNemar test at 95% con -
dence interval
12As can be seen in Figures 1, 2 and 3, the curves start to
merge at the higher measure points and, thus, the number of
signi cant differences decreases.
4.2 Domain Independence and Corpus Size
One might suspect that the results reported above
could be attributed to the corpus size. Indeed, the
text collection we employed in this study is rather
large (104 million words). Other text genres and do-
mains (e.g., clinical narratives, various engineering
domains) or even more specialized biological sub-
domains (e.g., plant biology) do not offer such a
plethora of free-text material as the molecular biol-
ogy domain. To test the effect a drastically shrunken
corpus size might have, we assessed the terminology
extraction methods for trigrams on a much smaller-
sized subset of our original corpus, viz. on 10 million
words. These results are depicted in Figure 4.
 0
 0.2
 0.4
 0.6
 0.8
 1
100908070605040302010
Portion of ranked list (in %)
Precision: P-Mod
Precision: T-test
Precision: C-value
Recall: P-Mod
Recall: T-test
Recall: C-value
Base
Figure 4: Precision/Recall for trigram biomedical term ex-
traction on the 10-million-word corpus (cutoff c ≥ 4, with
6,760 term candidate types)
The P -Mod extraction criterion still clearly out-
performs the other ones on that 10-million-word cor-
pus, both in terms of precision and recall. We also
examined whether the differences were statistically
signi cant and applied the two-tailed McNemar test
on 100 selected measure points. Comparing P -Mod
with t-test, most signi cant differences could be ob-
served between measure points 20 and 80, with al-
most 80% to 90% of the points being signi cantly
different. These signi cant differences were even
more pronounced when comparing the results be-
tween P -Mod and C-value.
5 Conclusions
We here proposed a new terminology extraction
method and showed that it signi cantly outperforms
849
two of the standard approaches in distinguishing
terms from non-terms in the biomedical literature.
While mining scienti c literature for new termino-
logical units and assembling those in controlled vo-
cabularies is a task involving several components,
one essential building block is to measure the de-
gree of termhood of a candidate. In this respect, our
study has shown that a criterion which incorporates
a vital linguistic property of terms, viz. their lim-
ited paradigmatic modi ability, is much more pow-
erful than linguistically more uninformed measures.
This is in line with our previous work on general-
language collocation extraction (Wermter and Hahn,
2004), in which we showed that a linguistically mo-
tivated criterion based on the limited syntagmatic
modi ability of collocations outperforms alternative
standard association measures as well.
We also collected evidence that the superiority of
the P -Mod method relative to other term extraction
approaches holds independent of the underlying cor-
pus size (given a reasonable offset). This is a crucial
 nding because other domains might lack large vol-
umes of free-text material but still provide suf cient
corpus sizes for valid term extraction. Finally, since
we only require shallow syntactic analysis (in terms
of NP chunking), our approach might be well suited
to be easily portable to other domains. Hence, we
may conclude that, although our methodology has
been tested on the biomedical domain only, there are
essentially no inherent domain-speci c restrictions.
Acknowledgements. This work was partly supported by
the European Network of Excellence  Semantic Mining in
Biomedicine (NoE 507505).
References
Olivier Bodenreider, Thomas C. Rind esch, and Anita Burgun.
2002. Unsupervised, corpus-based method for extending a
biomedical terminology. In Stephen Johnson, editor, Pro-
ceedings of the ACL/NAACL 2002 Workshop on ‘Natural
Language Processing in the Biomedical Domain’, pages 53 
60. Philadelphia, PA, USA, July 11, 2002.
Nigel Collier, Chikashi Nobata, and Jun’ichi Tsujii. 2002. Au-
tomatic acquisition and classi cation of terminology using a
tagged corpus in the molecular biology domain. Terminol-
ogy, 7(2):239 257.
B·eatrice Daille. 1996. Study and implementation of com-
bined techniques for automatic extraction of terminology. In
Judith L. Klavans and Philip Resnik, editors, The Balanc-
ing Act: Combining Statistical and Symbolic Approaches to
Language, pages 49 66. Cambridge, MA: MIT Press.
Fred J. Damerau. 1993. Generating and evaluating domain-
oriented multi-word terms from text. Information Process-
ing & Management, 29(4):433 447.
Stefan Evert and Brigitte Krenn. 2001. Methods for the
qualitative evaluation of lexical association measures. In
ACL’01/EACL’01  Proceedings of the 39th Annual Meet-
ing of the Association for Computational Linguistics and the
10th Conference of the European Chapter of the ACL, pages
188 195. Toulouse, France, July 9-11, 2001.
Katerina T. Frantzi, Sophia Ananiadou, and Hideki Mima.
2000. Automatic recognition of multi-word terms: The C-
value/NC-value method. International Journal on Digital
Libraries, 3(2):115 130.
Christian Jacquemin. 1999. Syntagmatic and paradigmatic rep-
resentations of term variation. In Proceedings of the 37rd
Annual Meeting of the Association for Computational Lin-
guistics, pages 341 348. College Park, MD, USA, 20-26
June 1999.
John S. Justeson and Slava M. Katz. 1995. Technical terminol-
ogy: Some linguistic properties and an algorithm for identi-
 cation in text. Natural Language Engineering, 1(1):9 27.
Taku Kudo and Yuji Matsumoto. 2001. Chunking with support
vector machines. In NAACL’01, Language Technologies
2001  Proceedings of the 2nd Meeting of the North Amer-
ican Chapter of the Association for Computational Linguis-
tics, pages 192 199. Pittsburgh, PA, USA, June 2-7, 2001.
Christopher D. Manning and Hinrich Schcurrency1utze. 1999. Foun-
dations of Statistical Natural Language Processing. Cam-
bridge, MA; London, U.K.: Bradford Book & MIT Press.
Hiroshi Nakagawa and Tatsunori Mori. 2002. A simple but
powerful automatic term extraction method. In COMPU-
TERM 2002  Proceedings of the 2nd International Work-
shop on Computational Terminology, pages 29 35. Taipei,
Taiwan, August 31, 2002.
Goran Nenadi·c, Irena Spasi·c, and Sophia Ananiadou. 2003.
Terminology-driven mining of biomedical literature. Bioin-
formatics, 19(8):938 943.
Goran Nenadi·c, Sophia Ananiadou, and John McNaught. 2004.
Enhancing automatic term recognition through recognition
of variation. In COLING 2004  Proceedings of the 20th In-
ternational Conference on Computational Linguistics, pages
604 610. Geneva, Switzerland, August 23-27, 2004.
Thomas C. Rind esch, Lawrence Hunter, and Alan R. Aronson.
1999. Mining molecular binding terminology from biomed-
ical text. In AMIA’99  Proceedings of the Annual Sym-
posium of the American Medical Informatics Association,
pages 127 131. Washington, D.C., November 6-10, 1999.
Lothar Sachs. 1984. Applied Statistics: A Handbook of Tech-
niques. New York: Springer, 2nd edition.
MESH. 2004. Medical Subject Headings. Bethesda, MD: Na-
tional Library of Medicine.
UMLS. 2004. Uni ed Medical Language System. Bethesda,
MD: National Library of Medicine.
Joachim Wermter and Udo Hahn. 2004. Collocation extrac-
tion based on modi ability statistics. In COLING 2004  
Proceedings of the 20th International Conference on Com-
putational Linguistics, pages 980 986. Geneva, Switzerland,
August 23-27, 2004.
850
