Identifying Terms by their Family and Friends
 Diana Maynard Sophia Ananiadou
 
Dept. of Computer Science, University of Sheffield
Regent Court, 211 Portobello St, Sheffield, S1 4DP, UK
d.maynard@dcs.shef.ac.uk
 
Computer Science, School of Sciences, University of Salford
Newton Building, Salford, M5 4WT, U.K.
s.ananiadou@salford.ac.uk
 
Abstract
 Multi-word terms are traditionally identified using statistical techniques or, 
more recently, using hybrid techniques combining statistics with shallow 
linguistic information. Approaches to word sense disambiguation and machine
translation have taken advantage of contextual information in a more meaningful
way, but terminology has rarely followed suit. We present an approach to term
recognition which identifies salient parts of the context and measures their 
strength of association to relevant candidate terms. The resulting list of 
ranked terms is shown to improve on that produced by traditional methods, in 
terms of precision and distribution, while the information acquired in the 
process can also be used for a variety of other applications, such as 
disambiguation, lexical tuning and term clustering.
 
Introduction
 
Although statistical approaches to automatic term recognition, e.g. (Bourigault,
1992; Daille et al., 1994; Enguehard and Pantera, 1994; Justeson and Katz, 1995;
Lauriston, 1996), have achieved relative success over the years, the addition of
suitable linguistic information has the potential to enhance results still
further, particularly in the case of small corpora or very specialised domains,
where statistical information may not be so accurate. One of the main reasons
for the current lack of diversity in approaches to term recognition lies in the
difficulty of extracting suitable semantic information from specialised corpora,
particularly in view of the lack of appropriate linguistic resources. The
increasing development of electronic lexical resources, coupled with new methods
for automatically creating and fine-tuning them from corpora, has begun to pave
the way for a more dominant appearance of natural language processing techniques
in the field of terminology. The TRUCKS approach to term recognition
(Term Recognition Using Combined Knowledge Sources) focuses on identifying
relevant contextual information from a variety of sources, in order to enhance
traditional statistical techniques of term recognition.

Although contextual information has been previously used, e.g. in general
language (Grefenstette, 1994) and in the NC-Value method for term recognition
(Frantzi, 1998; Frantzi and Ananiadou, 1999), only shallow syntactic information
is used in these cases. The TRUCKS approach identifies different elements
of the context which are combined to form the Information Weight, a measure of
how strongly related the context is to a candidate term. The Information Weight
is then combined with the statistical information about a candidate term and
its context, acquired using the NC-Value method, to form the SNC-Value. Section
2 describes the NC-Value method. Section 3 discusses the importance of contextual
information and explains how this is acquired. Sections 4 and 5 describe the
Information Weight and the SNC-Value respectively. We finish with an evaluation
of the method and draw some conclusions about the work and its future.
 
The NC-Value method
 
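The NC-Value method uses a combination of linguistic and statistical
information. Terms are first extracted from a corpus using the C-Value method
(Frantzi and Ananiadou, 1999), a measure based on frequency of occurrence and
term length (|a|, the number of words in a). This is defined formally as:

\[
Cvalue(a) =
\begin{cases}
\log_2 |a| \cdot f(a) & \text{if } a \text{ is not nested} \\
\log_2 |a| \left( f(a) - \frac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right) & \text{if } a \text{ is nested}
\end{cases}
\]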
 
where a is the candidate string, f(a) is its frequency in the corpus, T_a is the
set of candidate terms that contain a, and P(T_a) is the number of these
candidate terms. Two different cases apply: one for terms that are found as
nested, and one for terms that are not. If a candidate string is not found as
nested, its termhood is calculated from its total frequency and length. If it is
found as nested, termhood is calculated from its total frequency, length,
frequency as a nested string, and the number of longer candidate terms it
appears in.

The NC-Value method builds on this by incorporating contextual information in
the form of a context factor for each candidate term. A context word can be any
noun, adjective or verb appearing within a fixed-size window of the candidate
term. Each context word is assigned a weight, based on how frequently it appears
with a candidate term. These weights are then summed for all context words
relative to a candidate term. The Context Factor is combined with the C-Value
to form the NC-Value:

\[
NCvalue(a) = 0.8 \cdot Cvalue(a) + 0.2 \cdot CF(a)
\]

where a is the candidate term, Cvalue(a) is the C-Value for the candidate term,
and CF(a) is the context factor for the candidate term.

Category   Verb   Prep   Noun   Adj
Weight     1.2    1.1    0.9    0.7

Table 1: Weights for categories of boundary words
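The C-Value and NC-Value calculations described above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the logarithmic length factor and the 0.8/0.2 combination weights are assumptions taken from the published description of the C-Value/NC-Value method (Frantzi and Ananiadou, 1999), and the function names, the tuple-of-words representation and the toy frequencies are hypothetical.

```python
import math
from collections import defaultdict


def c_values(freq):
    """Compute a C-Value for every candidate term.

    `freq` maps a candidate term (a tuple of words) to its corpus frequency.
    A candidate counts as nested if it occurs as a contiguous sub-sequence
    of a longer candidate in `freq`.
    """
    def contains(longer, shorter):
        n = len(shorter)
        return any(longer[i:i + n] == shorter
                   for i in range(len(longer) - n + 1))

    # T_a: the longer candidate terms that contain a
    containers = defaultdict(list)
    for a in freq:
        for b in freq:
            if len(b) > len(a) and contains(b, a):
                containers[a].append(b)

    scores = {}
    for a, f_a in freq.items():
        # Assumed length factor; note it is 0 for single-word candidates
        length_factor = math.log2(len(a))
        t_a = containers[a]
        if not t_a:
            # Not nested: frequency and length only
            scores[a] = length_factor * f_a
        else:
            # Nested: discount the average frequency of the containing terms
            nested_freq = sum(freq[b] for b in t_a)
            scores[a] = length_factor * (f_a - nested_freq / len(t_a))
    return scores


def nc_value(c_value, context_factor, w_c=0.8, w_cf=0.2):
    """Combine C-Value and context factor (weights are assumed defaults)."""
    return w_c * c_value + w_cf * context_factor


# Toy frequencies, invented for illustration only
toy_freq = {("contact", "lens"): 10, ("soft", "contact", "lens"): 4}
toy_scores = c_values(toy_freq)
```

In this toy input, "contact lens" is nested inside "soft contact lens", so its frequency of 10 is reduced by the frequency of the longer term before the length factor is applied.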
 
Contextual Information: a Term's Social Life

Just as a person's social life can provide valuable clues about their
personality, so we can gather much information about the nature of a term by
investigating the company it keeps. We acquire this knowledge by extracting
three different types of contextual information: 1. syntactic; 2.
terminological; 3. semantic.

Syntactic knowledge

Syntactic knowledge is based on words in the context which occur immediately
before or after a candidate term, which we call boundary words. Following
"barrier word" approaches to term recognition (Bourigault, 1992; Nelson et
al., 1995), where particular syntactic categories are used to delimit candidate
terms, we develop this idea further by weighting boundary words according to
their category. The weight for each category, shown in Table 1, is allocated
according to its relative likelihood of occurring with a term as opposed to a
non-term. A verb, therefore, occurring immediately before or after a candidate
term, is statistically a better indicator of a term than an adjective is. By "a
better indicator", we mean that a candidate term occurring with it is more
likely to be valid. Each candidate term is assigned a syntactic weight,
calculated by summing the category weights for the context boundary words
occurring with it.

Terminological knowledge

Terminological knowledge concerns the terminological status of context words. A
context word which is also a term (which we call a context term) is likely to be
a better indicator than one which is not. The terminological status is
determined by applying the NC-Value approach to the corpus, and considering the
top third of the list of ranked results as valid terms. A context term (CT)
weight is then produced for each candidate term, based on its total frequency of
occurrence with all relevant context terms. The CT weight is formally described
as follows:

\[
CT(a) = \sum_{d \in T_a} f_a(d)
\]

where a is the candidate term, T_a is the set of context terms of a, d is a
word from T_a, and f_a(d) is the frequency of d as a context term of a.

Semantic knowledge

Semantic knowledge is obtained about context terms using the UMLS Metathesaurus
and Semantic Network (NLM, 1997). The former provides a semantic tag for each
term, such as Acquired Abnormality. The latter provides a hierarchy of semantic
types, from which we compute the similarity between a candidate term and the
context terms it occurs with. An example of part of the network is shown in
Figure 1.
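The syntactic weight and the context term (CT) weight can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the category weights are those of Table 1, while the function names, the input representation, and the choice to give zero weight to categories missing from Table 1 are assumptions made for the example.

```python
from collections import Counter

# Category weights from Table 1
CATEGORY_WEIGHTS = {"Verb": 1.2, "Prep": 1.1, "Noun": 0.9, "Adj": 0.7}


def syntactic_weight(boundary_categories):
    """Sum the category weights of the boundary words of one candidate term.

    `boundary_categories` holds one category label per boundary-word
    occurrence. Categories absent from Table 1 contribute 0 (an assumption).
    """
    return sum(CATEGORY_WEIGHTS.get(cat, 0.0) for cat in boundary_categories)


def ct_weight(context_words, valid_terms):
    """Total frequency of context words that are themselves valid terms.

    `context_words` lists every context-word occurrence of one candidate
    term; `valid_terms` stands in for the top third of the NC-Value list.
    """
    counts = Counter(context_words)
    return sum(n for word, n in counts.items() if word in valid_terms)


# Toy data, invented for illustration only
toy_syn = syntactic_weight(["Verb", "Adj", "Noun"])
toy_ct = ct_weight(["retina", "shows", "retina", "lesion"], {"retina", "lesion"})
```

Here the toy candidate has a verb, an adjective and a noun as boundary words, and three of its four context-word occurrences are context terms.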
 
Similarity is measured because we believe that a context term which is
semantically similar to a candidate term is more likely to be significant than
one which is less similar. We use the method for semantic distance described in
(Maynard and Ananiadou, 1999a), which is based on calculating the vertical
position and horizontal distance between nodes in a hierarchy. Two weights are
calculated:

• positional: measured by the combined distance from root to each node;
• commonality: measured by the number of shared common ancestors multiplied by
  the number of words (usually two).

Similarity between the nodes is calculated by dividing the commonality weight by
the positional weight to produce a figure between 0 and 1, 1 being the case
where the two nodes are identical, and 0 being the case where there is no common
ancestor. This is formally defined as follows:

\[
sim(w_1 \ldots w_n) = \frac{com(w_1 \ldots w_n)}{pos(w_1 \ldots w_n)}
\]

where com(w_1...w_n) is the commonality weight of the words and pos(w_1...w_n)
is the positional weight of the words.

Figure 1: Fragment of the Semantic Network (nodes shown include ORGANISM and
ALGA)

Let us take an example from the UMLS. The similarity between a term belonging
to the semantic category Plant and one belonging to the category Fungus would be
calculated as follows:

• Plant has the semantic code TA111 and Fungus has the semantic code TA112.
• The commonality weight is the number of nodes in common, multiplied by the
  number of terms we are considering. TA111 and TA112 have 4 nodes in common
  (T, TA, TA1 and TA11). So the weight will be 4 * 2 = 8.
• The positional weight is the total height of each of the terms (where the
  root node has a height of 1). TA111 has a height of 5 (T, TA, TA1, TA11 and
  TA111), and TA112 also has a height of 5 (T, TA, TA1, TA11 and TA112). The
  weight will therefore be 5 + 5 = 10.
• The similarity weight is the commonality weight divided by the positional
  weight, i.e. 8 / 10 = 0.8.

The Information Weight

The three individual weights described above are calculated for all relevant
context words or context terms. The total weights for the context are then
combined according to the following equation:

\[
IW(a) = \sum_{b \in C_a} syn_a(b) + \sum_{d \in T_a} f_a(d) \cdot sim_a(d)
\]

where a is the candidate term, C_a is the set of context words of a, b is a word
from C_a, f_a(b) is the frequency of b as a context word of a, syn_a(b) is the
syntactic weight of b as a context word of a, T_a is the set of context terms of
a, d is a word from T_a, f_a(d) is the frequency of d as a context term of a,
and sim_a(d) is the similarity weight of d as a context term of a. This
basically means that the Information Weight is composed of the total
terminological weight, multiplied by the total semantic weight, and then added
to the total syntactic weight of all the context words or context terms related
to the candidate term.

The SNC-Value

The Information Weight gives a score for each candidate term based on the
importance of the contextual information surrounding it. To obtain the final
SNC-Value ranking, the Information Weight is combined with the statistical
information obtained using the NC-Value method.
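As a rough illustration of the similarity measure and of how the individual weights are combined, the following Python sketch reproduces the Plant/Fungus example. It rests on several assumptions that are not stated in the text: that each additional character of a semantic code corresponds to one level of the hierarchy, that the Information Weight is the sum of the syntactic weights plus the frequency-weighted similarities (following the prose description), and that the SNC-Value is a simple sum of NC-Value and Information Weight. All function names are hypothetical.

```python
def ancestors(code):
    """Path from the root to `code`, e.g. 'TA11' -> ['T', 'TA', 'TA1', 'TA11'].

    Assumes every extra character in a code is one level deeper.
    """
    return [code[:i] for i in range(1, len(code) + 1)]


def similarity(code_a, code_b):
    """Commonality weight divided by positional weight for two nodes."""
    path_a, path_b = ancestors(code_a), ancestors(code_b)
    shared = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        shared += 1
    commonality = shared * 2                  # shared nodes times number of terms
    positional = len(path_a) + len(path_b)    # combined height of both nodes
    return commonality / positional


def information_weight(syn_weights, context_terms):
    """Total syntactic weight plus frequency-weighted similarity.

    `syn_weights` holds one syntactic weight per context word;
    `context_terms` holds (frequency, similarity) pairs, one per context term.
    """
    return sum(syn_weights) + sum(f * s for f, s in context_terms)


def snc_value(nc, iw):
    """Assumed additive combination of NC-Value and Information Weight."""
    return nc + iw


plant_fungus = similarity("TA111", "TA112")
```

With the codes from the worked example, `plant_fungus` evaluates to 8 / 10 = 0.8, and two identical codes give a similarity of 1.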
 
The SNC-Value is expressed formally as:

\[
SNCValue(a) = NCValue(a) + IW(a)
\]

where a is the candidate term, NCValue(a) is the NC-Value of a, and IW(a) is the
Information Weight of a. For details of the NC-Value, see (Frantzi and
Ananiadou, 1999).

An example of the final result is shown in Table 2. This compares the top 20
results from the SNC-Value list with the top 20 from the NC-Value list. The
terms in italics are those which were considered as not valid.

Table 2: Top 20 results for the SNC-Value and NC-Value

We shall discuss the results in more detail in the next section, but we can note
here three points. Firstly, the weights for the SNC-Value are substantially
greater than those for the NC-Value. This, in itself, is not important, since it
is the position in the list, i.e. the relative weight, rather than the absolute
weight, which is important. Secondly, we can see that there are more valid terms
in the SNC-Value results than in the NC-Value results. It is hard to make
further judgements based on this list alone, because we cannot say whether one
term is better than another, if the two terms are both valid. Thirdly, we can see
that more of the top 20 terms are valid for the SNC-Value than for the NC-Value:
17 (85%) as opposed to 10 (50%).

Evaluation

The SNC-Value method was initially tested on a corpus of 800,000 eye
pathology reports, which had been tagged with the Brill part-of-speech
tagger (Brill, 1992). The candidate terms were first extracted using the
NC-Value method (Frantzi, 1998), and the SNC-Value was then calculated. To
evaluate the results, we examined the performance of the similarity weight
alone, and the overall performance of the system.

Evaluation methods

The main evaluation procedure was carried out with respect to a manual
assessment of the list of terms by 2 domain experts. There are, however,
problems associated with such an evaluation. Firstly, there is no gold standard
of evaluation, and secondly, manual evaluation is both fallible and
subjective. To avoid this problem, we measure the performance of the system
in relative terms rather than in absolute terms, by measuring the improvement
over the results of the NC-Value as compared with manual evaluation. Although
we could have used the list of terms provided in the UMLS, instead of a
manually evaluated list, we found that there was a huge discrepancy between this
list and the list validated by the manual experts (only 20% of the terms they
judged valid were found in the UMLS). There are also further limitations to the
UMLS, such as the fact that it is only specific to medicine in general, but not
to eye pathology, and the fact that it is organised in such a way that only the
preferred terms, and not lexical variants, are actively and consistently
present.

We first evaluate the similarity weight individually, since this is the main
principle on which the SNC-Value method relies. We then evaluate the SNC-Value
as a whole by comparing it with the NC-Value, so that we can evaluate the impact
of the addition of the deeper forms of linguistic information incorporated in
the Information Weight.

Similarity Weight
 
One of the problems with our method of calculating similarity is that it relies
on a pre-existing lexical resource, which means it is prone to errors and
omissions. Bearing in mind its innate inadequacies, we can nevertheless evaluate
the expected theoretical performance of the measure by concerning ourselves
only with what is covered by the thesaurus. This means that we assume
completeness (although we know that this is not the case) and evaluate it
accordingly, ignoring anything which may be missing.

The semantic weight is based on the premise that the more similar a context
term is to the candidate term it occurs with, the better an indicator that
context term is. So the higher the total semantic weight for the candidate term,
the higher the ranking of the term and the better the chance that the candidate
term is a valid one. To test the performance of the semantic weight, we sorted
the terms in descending order of their semantic weights and divided the list
into 3, such that the top third contained the terms with the highest semantic
weights, and the bottom third contained those with the lowest. We then compared
how many valid and non-valid terms (according to the manual evaluation) were
contained in each section of the list.

Table 3: Semantic weights of terms and non-terms (columns: Section, top set,
middle set, bottom set)

The results, depicted in Table 3, can be interpreted as follows. In the top
third of the list, 76% were terms and 24% were non-terms, whilst in the middle
third, 56% were terms and 44% were non-terms, and so on. This means that most of
the valid terms are contained in the top third of the list and the fewest valid
terms are contained in the bottom third of the list. Also, the proportion of
terms to non-terms in the top of the list is such that there are more terms than
non-terms, whereas in the bottom of the list there are more non-terms than
terms. This therefore demonstrates two things:

• more of the terms with the highest semantic weights are valid, and fewer of
  those with the lowest semantic weights are valid;
• more valid terms have high semantic weights than non-terms, and more
  non-terms have lower semantic weights than valid terms.

We also tested the similarity measure to see whether adding some statistical
information would improve its results, and regulate any discrepancies in the
uniformity of the hierarchy. The methods which intuitively seem most plausible
are based on information content, e.g. (Resnik, 1995; Smeaton and Quigley,
1996). The information content of a node is related to its probability of
occurrence in the corpus. The more frequently it appears, the more likely it is
to be important in terms of conveying information, and therefore the higher
weighting it should receive. We performed experiments to compare two such
methods with our similarity measure. The first considers the probability of the
MSCA of the two terms (the lowest node which is an ancestor of both), whilst the
second considers the probability of the nodes of the terms being compared.
However, the findings showed a negligible difference between the three methods,
so we conclude that there is no advantage to be gained by adding statistical
information, for this particular corpus. It is possible that with a larger
corpus or different hierarchy, this might not be the case.

Table 4: Precision of SNC-Value and NC-Value
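The procedure used to test the semantic weight (sorting candidates by weight, splitting the list into thirds and comparing the share of valid terms in each third) can be sketched as below. The function name, the input format and the toy data are hypothetical and serve only to illustrate the procedure; they do not reproduce the figures reported for Table 3.

```python
def precision_by_thirds(candidates):
    """Share of valid terms in the top, middle and bottom third of a list.

    `candidates` is a list of (semantic_weight, is_valid) pairs, where
    `is_valid` reflects a manual judgement.
    """
    # Highest semantic weight first
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    n = len(ranked)
    bounds = [0, n // 3, 2 * n // 3, n]
    shares = []
    for start, end in zip(bounds, bounds[1:]):
        section = ranked[start:end]
        valid = sum(1 for _, ok in section if ok)
        shares.append(valid / len(section) if section else 0.0)
    return shares


# Toy data, invented for illustration only
toy_candidates = [(0.9, True), (0.8, True), (0.7, False),
                  (0.6, True), (0.5, False), (0.4, False)]
toy_shares = precision_by_thirds(toy_candidates)
```

A ranking that behaves as the premise predicts yields shares that decrease from the top third to the bottom third, as in this toy input.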
 
Overall Evaluation of the SNC-Value
 
We first compare the precision rates for the SNC-Value and the NC-Value (Table
4), by dividing the ranked lists into 10 equal sections. Each section contains
250 terms, marked as valid or invalid by the manual experts. In the top section,
the precision is higher for the SNC-Value, and in the bottom section, it is
lower. This indicates that the precision span is greater for the SNC-Value, and
therefore that the ranking is improved. The distribution of valid terms is also
better for the SNC-Value, since of the valid terms, more appear at the top of
the list than at the bottom.

Figure 2 (axes: precision against section of list)

Looking at Figure 2, we can see that the SNC-Value graph is smoother than that
of the NC-Value. We can compare the graphs more accurately using a method we
call comparative upward trend. Because there is no one ideal graph, we instead
measure how much each graph deviates from a monotonic line downwards. This is
calculated by dividing the total rise in precision percentage by the length of
the graph. A graph with a lower upward trend will therefore be better than a
graph with a higher upward trend. If we compare the upward trends of the two
graphs, we find that the trend for the SNC-Value is 0.9, whereas the trend for
the NC-Value is 2.7. This again shows that the SNC-Value ranking is better than
the NC-Value ranking, since it is more consistent.

Table 5 shows a more precise investigation of the top portion of the list (where
it is to be expected that terms are most likely to be valid, and which is
therefore the most important part of the list). We see that the precision is
most improved here, both in terms of accuracy and in terms of distribution of
weights. At the bottom of the top section, the precision is much higher for the
SNC-Value. This is important because ideally, all the terms in this part of the
list should be valid.

Table 5: Precision of SNC-Value and NC-Value for top 250 terms

Conclusions

In this paper, we have described a method for multi-word term extraction which
improves on traditional statistical approaches by incorporating more specific
contextual information. It focuses particularly on measuring the strength of
association (in semantic terms) between a candidate term and its context.
Evaluation shows improvement over the NC-Value approach, although the
percentages are small. This is largely because we have used a very small corpus
for testing.

The contextual information acquired can also be used for a number of other
related tasks, such as disambiguation and clustering. At present, the semantic
information is acquired from a pre-existing domain-specific thesaurus, but there
are possibilities for creating such a thesaurus automatically, or enhancing an
existing one, using the contextual information we acquire (Ushioda, 1996;
Maynard and Ananiadou, 1999b).

There is much scope for further extensions of this research. Firstly, it could
be extended to other domains and larger corpora, in order to see the true
benefit of such an approach. Secondly, the thesaurus could be tailored to the
corpus, as we have mentioned. An incremental approach might be possible, whereby
the similarity measure is combined with statistical information to tune an
existing ontology. Also, the UMLS is not designed as a linguistic resource, but
as an information resource. Some kind of integration of the two types of
resource would be useful so that, for example, lexical variation could be more
easily handled.
 
References

D. Bourigault. 1992. Surface grammatical analysis for the extraction of terminological noun phrases. In Proc. of International Conference on Computational Linguistics, pages 977-981, Nantes, France. 

Eric Brill. 1992. A simple rule-based part of speech tagger. In Proc. of 3rd Conference of Applied Natural Language Processing.

B. Daille, E. Gaussier, and J. M. Lange. 1994. Towards automatic extraction of monolingual and bilingual terminology. In Proc. of International Conference on Computational Linguistics, pages 515-521.

Chantal Enguehard and Laurent Pantera. 1994. Automatic natural acquisition of a terminology. Journal of Quantitative Linguistics, 2(1):27-32.

K.T. Frantzi and S. Ananiadou. 1999. The C-Value/NC-Value domain independent method for multi-word term extraction. Journal of Natural Language Processing, 6(3):145-179.

K.T. Frantzi. 1998. Automatic Recognition of Multi-Word Terms. Ph.D. thesis, Manchester Metropolitan University, England.

G. Grefenstette. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers.

J.S. Justeson and S.M. Katz. 1995. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1:9-27.

Andy Lauriston. 1996. Automatic term recognition: performance of linguistic and statistical learning techniques. Ph.D. thesis, UMIST, Manchester, UK. 

D. G. Maynard and S. Ananiadou. 1999a. Identifying contextual information for term extraction. In Terminology and Knowledge Engineering (TKE '99), pages 212-221, Innsbruck, Austria.

D. G. Maynard and S. Ananiadou. 1999b. A linguistic approach to context clustering. In Proc. of Natural Language Processing Pacific Rim Symposium (NLPRS), pages 346-351, Beijing, China.

S. J. Nelson, N. E. Olson, L. Fuller, M. S. Tuttle, W. G. Cole, and D. D. Sherertz. 1995. Identifying concepts in medical knowledge. In Proc. of 8th World Congress on Medical Informatics (MEDINFO), pages 33-36.

NLM. 1997. UMLS Knowledge Sources. National Library of Medicine, U.S. Dept. of Health and Human Services, 8th edition, January.

P. Resnik. 1995. Disambiguating noun groupings with respect to WordNet senses. In Proc. of 3rd Workshop on Very Large Corpora. MIT. 

A. Smeaton and I. Quigley. 1996. Experiments on using semantic distances between words in image caption retrieval. In Proc. of 19th International Conference on Research and Development in Information Retrieval, Zurich, Switzerland.

Akira Ushioda. 1996. Hierarchical clustering of words. In Proc. of 16th International Conference on Computational Linguistics, pages 1159-1162. 
