Exogeneous and Endogeneous Approaches to Semantic 
Categorization of Unknown Technical Terms 
Farid Cerbah 
Dassault Aviation - DPR/DESA - 78, quai Marcel Dassault
92552 Saint-Cloud cedex 300 - France 
farid.cerbah@dassault-aviation.fr
Abstract 
Acquiring and updating terminological resources are difficult and tedious tasks, especially when semantic information should be provided. This paper deals with Term Semantic Categorization. The goal of this process is to assign semantic categories to unknown technical terms. We propose two approaches to the problem that rely on different knowledge sources. The exogeneous approach exploits contextual information extracted from corpora. The endogeneous approach relies on a lexical analysis of the technical terms. After describing the two implemented methods, we present the experiments that we conducted on significant test sets. The results demonstrate that term categorization can provide a reliable help in the terminology acquisition processes.
1 Introduction 
Terminological resources have been found useful in many NLP applications, including Computer-Aided Translation and Information Retrieval. However, to have a significant impact on applications, terminological knowledge should not be limited to flat nomenclatures of single-word and multi-word terms. It should include semantic knowledge, such as semantic categories and various kinds of semantic relations (hyperonymy/hyponymy, synonymy). This paper focuses on the assignment of semantic categories to technical terms. Semantic categories may play a crucial role in many applications, particularly when disambiguation processes are required. For example, in our applicative framework, semantic categories ensure a coarse sense division of polysemous terms and are actually used to improve the French-to-English translation of the technical documentation.

Assigning semantic categories to technical terms is a difficult and time-consuming task. Highly specialized skills are required and, even though the major concepts represented in these terminological resources pertain to the aeronautic domain, various related knowledge areas are concerned, such as Data Processing, Mechanics and Manufacturing Processes. Our goal is to elaborate a tool that helps terminologists to assign semantic categories when updating the reference terminology. We think that significant support can be provided to the terminologists by taking advantage of existing categorized terms and their usages in documents. Such a tool can be integrated within a terminology acquisition environment as a complement to a term extraction component.
We distinguish two kinds of approaches to semantic categorization. In a way similar to corpus-based methods for Word Sense Disambiguation (WSD) (Yarowsky, 1992; Ide and Véronis, 1998), an exogeneous approach exploits contextual information extracted from corpora in order to determine the most plausible categories. By contrast, an endogeneous approach relies solely on a lexical analysis of multi-word terms, which are very frequent in terminological databases. This approach is based on the assumption that lexical units used in the composition of technical terms are relevant indicators of semantic domains.
The rest of the paper is organized as follows. Section 2 describes the terminological resources used in this study. Section 3 compares term categorization with related issues, such as thesaurus extension, WSD and term clustering. Sections 4 and 5 describe the two proposed methods. Results and evaluation are given and discussed in section 6. Directions for further research are pointed out in the last section.
French                              English                              Categories  POS
anti-dérapage à long terme          long-action antislip                 NAV         N
assiette de consigne au décollage   required takeoff attitude            CKI         N
barre de remorquage                 tow bar                              TOO         N
dérive                              fin                                  STR         N
dérive                              wander                               PRO, FLP    N
dériver                             to unrivet                           MEC, MAI    V
dérivée                             derivative                           COM         N
embout coulissant                   sliding endpiece                     ENG, LGR    N
enregistreur de fatigue             fatiguemeter                         FPL         N
enregistreur de paramètres          flight data recorder                 FPL         N
jeu de protecteurs boudin cabine    set of cockpit seal protectors       TOO         N
bit d'appui touche                  keystroke bit                        DPR         N
amplificateur téléphone de bord     flight crew interphone amplifier     RTL         N
ne pas déboucher                    to be blind                          MAI         LV

MAI: Maintenance, NAV: Navigation, CKI: Cockpit Indications, TOO: Tools, FLP: Flight Parameters, ENG: Engines, LGR: Landing Gear, STR: Aircraft Structure, DPR: Data Processing, RTL: Radiocommunications.

Table 1: A sample of the terminological database.
2 The Terminological Resources 
We use in this study a French/English bilingual terminology of the aeronautic domain. This hand-crafted database results from a multi-disciplinary effort involving technical writers, translators, terminologists and engineers. In its current state, the database contains 12,267 French/English term couples, structured in 70 semantic categories. As already observed in several terminological databases, multi-word terms cover the larger part of the database (nearly 80%). The described terms are mostly nouns but the database also contains about 200 verbs and verbal phrases. Table 1 gives some examples of terms and a short description of the associated categories. These linguistic resources are integrated in a computer-aided translation environment used by technical writers.

Semantic categories have originally been introduced in order to distinguish the various senses of polysemous terms. Each term couple is annotated with one or more categories specifying the contexts in which the translation is recommended. An entry is associated to a term for each identified meaning. For example, the French term dépassement has at least two possible meanings, corresponding to two different translations: overflow in the Data Processing category (DPR) and out-of-flushness in the Aircraft Structure category (STR). As shown in the examples of table 1, the assignment of semantic categories has been extended to monosemous terms.

In our experiments of term categorization, only the French terms have been used.
3 Related Work 
Term Semantic Categorization is in several respects similar to Thesaurus Extension (Uramoto, 1996; Tokunaga et al., 1997). Our methods are close to those used for positioning unknown words in thesauri. However, the two issues can be differentiated with respect to the manipulated data. A thesaurus is intended to cover a large set of conceptual domains while a terminological database is focused accurately on a specific topic and its related domains. For example, in (Tokunaga et al., 1997), the thesaurus to be extended contains more than 500 categories. This tends to make the problem harder, but, since many categories are strictly independent, it is easier to find distinctive features between categories. By contrast, our terminological database contains only 70 categories. But, in this restricted set, we find categories corresponding to close or even overlapping knowledge areas. It is more difficult to differentiate them.
Furthermore, the endogeneous approach, which exploits the multi-word nature of terminological units, cannot be applied to thesaurus extension because of the large amount of single-word thesaurus entries [1].

It is useful to compare exogeneous term categorization with corpus-based WSD methods. In both cases, contextual information extracted from corpora is used in order to assign the most plausible semantic tags to words. In WSD, the contextual cues that co-occur with the target word constitute the main training source whereas, in term categorization, the contextual information occurring with the term to be categorized should not be included in training data since the term is supposed to be unknown. The only relevant training sources are the contextual cues surrounding the already categorized terms. This is a basic difference that explains why WSD tasks usually achieve better performance than term categorization and thesaurus extension.

In a terminology acquisition framework, Habert et al. (1998) propose an exogeneous categorization method of unknown simple words. They use local context of simple words provided by a term extraction system.

Endogeneous term categorization can also be compared with some approaches to term clustering (Bourigault and Jacquemin, 1999; Assadi, 1997). These approaches take advantage of the lexical and syntactic structures of technical terms in order to build semantic clusters.
[1] For Japanese, (Tokunaga et al., 1997) reports some promising experiments of endogeneous categorization to thesaurus extension. The approach relies on properties of Japanese word formation rules and, thus, it can hardly be adapted for other languages. Their experiments suggest that exogeneous and endogeneous approaches are complementary.

4 Exogeneous Categorization
We tested several classification models. Our first experiments were carried out with Example-based classifiers. We used our own implementation of the K-nearest neighbors algorithm (kNN), and then the TiMBL learner (Daelemans et al., 1999), which provides several extensions to kNN, well-suited for NLP problems. Nevertheless, in the current state of our work, better results were obtained with a probabilistic classifier similar to the one used by (Tokunaga et al., 1997) for thesaurus extension. Due to lack of space, only this method will be described in this paper.

We use as contextual cues the open-class words (nouns, verbs, adjectives, adverbs) that co-occur in the corpus with the technical terms. More precisely, the cues are open-class words surrounding the occurrences of the term in some window of predefined size. Each new term to be categorized is represented by the overall set of contextual cues that have been extracted from a part of the corpus (test corpus).
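The cue collection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `extract_cues`, the tag names in `OPEN_CLASS_TAGS`, and the input layout (a list of `(lemma, tag)` pairs plus token spans of term occurrences) are assumptions made for illustration. The default window of 20 tokens on each side follows the setting reported in section 6; multi-word cues, which the paper allows, are not handled here.

```python
from collections import Counter

# Assumed tag names; the tagset actually used by the authors is not given.
OPEN_CLASS_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def extract_cues(tagged_tokens, term_spans, window=20):
    """Count open-class lemmas found around each occurrence of a term.

    tagged_tokens: list of (lemma, tag) pairs for one document.
    term_spans: list of (start, end) token indices, end exclusive,
                one pair per occurrence of the term.
    window: number of tokens inspected on each side of the occurrence.
    """
    cues = Counter()
    for start, end in term_spans:
        left = max(0, start - window)
        right = min(len(tagged_tokens), end + window)
        for i in range(left, right):
            if start <= i < end:
                continue  # skip the tokens of the term itself
            lemma, tag = tagged_tokens[i]
            if tag in OPEN_CLASS_TAGS:
                cues[lemma] += 1
    return cues


# Hypothetical tagged sentence containing the term "train de atterrissage".
tokens = [
    ("le", "DET"), ("pilote", "NOUN"), ("vérifier", "VERB"), ("le", "DET"),
    ("train", "NOUN"), ("de", "ADP"), ("atterrissage", "NOUN"),
    ("avant", "ADP"), ("décollage", "NOUN"),
]
print(extract_cues(tokens, [(4, 7)], window=2))
```

Returning a `Counter` keeps the cue frequencies, which are the quantities needed later to estimate the co-occurrence probabilities.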
4.1 Probability Model 
Let us consider a term $T$ for which the contextual cues $\{w_i\}_{i=1}^{n}$ have been collected in the test corpus. The categorization of this term amounts to finding the category $C^*$ that maximizes the probability $P(C \mid T)$:

$$C^* = \arg\max_{C} P(C \mid T) \tag{1}$$

According to the exogeneous approach, the probability that a term $T$ belongs to category $C$ depends on the contextual cues of $T$:

$$C^* = \arg\max_{C} \sum_{i=1}^{n} P(C \mid w_i) \times P(w_i \mid T) \tag{2}$$

After applying Bayes' rule:

$$C^* = \arg\max_{C} P(C) \sum_{i=1}^{n} \frac{P(w_i \mid C) \, P(w_i \mid T)}{P(w_i)} \tag{3}$$

The probabilities of equation 3 are estimated from training data:

• $P(w_i \mid C)$ is the probability that a word $w_i$ co-occurs with a term belonging to category $C$. It is estimated in the following way:

$$P(w_i \mid C) = \frac{N_w(w_i, C)}{\sum_{j} N_w(w_j, C)} \tag{4}$$

$N_w(w_i, C)$ is the number of times that $w_i$ co-occurs with a term belonging to category $C$. This probability accounts for the weight of cue $w_i$ in category $C$.
• $P(w_i \mid T)$ is the probability that $w_i$ co-occurs with $T$:

$$P(w_i \mid T) = \frac{N_w(w_i, T)}{\sum_{j} N_w(w_j, T)} \tag{5}$$

$N_w(w_i, T)$ is the number of times that $w_i$ co-occurs with $T$.

• $P(w_i)$ is the probability of cue $w_i$ [2]:

$$P(w_i) = \frac{N_w(w_i)}{\sum_{j} N_w(w_j)} \tag{6}$$

• $P(C)$ is the prior probability that a term of the corpus belongs to the category $C$:

$$P(C) = \frac{N_t(C)}{\sum_{i} N_t(C_i)} \tag{7}$$

where $N_t(C)$ is the occurrence number in training data of terms belonging to $C$. This probability accounts for the weight of category $C$ in the corpus.
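To make the scoring of equation 3 concrete, the following Python sketch estimates the four probabilities from raw counts and ranks the categories for one test term. It is only an illustration under stated assumptions: the class name `ExogeneousClassifier` and its methods are hypothetical, no smoothing is applied, and cues never seen in training are simply skipped (the paper does not say how such cues are handled).

```python
from collections import Counter, defaultdict


class ExogeneousClassifier:
    """Count-based estimate of equation 3 (illustrative sketch only)."""

    def __init__(self):
        self.cue_cat = defaultdict(Counter)  # N_w(w, C)
        self.cue_total = Counter()           # N_w(w)
        self.cat_terms = Counter()           # N_t(C)

    def add_training_occurrence(self, category, cues):
        """Record one occurrence of a categorized term and its cue counts."""
        self.cat_terms[category] += 1
        for w, n in cues.items():
            self.cue_cat[category][w] += n
            self.cue_total[w] += n

    def rank(self, test_cues):
        """Return (category, score) pairs sorted by decreasing score."""
        n_terms = sum(self.cat_terms.values())
        n_cues = sum(self.cue_total.values())
        n_test = sum(test_cues.values())
        scores = {}
        for c, n_c in self.cat_terms.items():
            p_c = n_c / n_terms                       # equation 7
            cat_total = sum(self.cue_cat[c].values())
            total = 0.0
            for w, n in test_cues.items():
                # Assumption: cues unseen in training contribute nothing.
                if self.cue_total[w] == 0 or cat_total == 0:
                    continue
                p_w_c = self.cue_cat[c][w] / cat_total  # equation 4
                p_w_t = n / n_test                      # equation 5
                p_w = self.cue_total[w] / n_cues        # equation 6
                total += p_w_c * p_w_t / p_w
            scores[c] = p_c * total                     # equation 3
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


clf = ExogeneousClassifier()
clf.add_training_occurrence("DPR", Counter({"mémoire": 2, "donnée": 1}))
clf.add_training_occurrence("ENG", Counter({"turbine": 2, "carburant": 1}))
ranking = clf.rank(Counter({"mémoire": 1, "donnée": 1}))
print(ranking)
```

Keeping the full ranking rather than only the best category matches the evaluation setting of section 6, where the first k categories are proposed to the terminologist.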
4.2 Training and Test 
The exogeneous classifier starts with the selection of test documents in the corpus. Technical terms found in these documents will form the test set. The remaining documents represent the training corpus. Training and test stages are the following:
• POS tagging. The test and training cor- 
pora are tagged with MultAna, a tagger de- 
signed as an extension of the Multex mor- 
phological analyzer (Petitpierre and Rus- 
sell, 1995). Occurrences of the technical 
terms are identified during this stage and 
the terms to be categorized are those which 
are identified in the test corpus. 
• Extraction of contextual cues. For 
each term occurrence in training and test 
data, the contextual cues are collected. 
Only the lemmas of open-class words are 
used and cues may correspond to multi- 
word terms. Each test term is then repre- 
sented by the set of cues which have been 
collected in test data. 
[2] Note that the categorization process could be simplified by eliminating P(wi), since this quantity is constant for all categories.

Incrémentation (Increment) [DPR]
DPR (0.2106), CKI (0.2027), I(J)R (0.1967)
Décrémentation (Decrement) [DPR]
DPR (0.2111), CKI (0.1843), I'd)R (0.1654)
Renseigner (Inform) [NAV]
CKI (0.24), VOR (0.21), DPR (0.1276)
Entrée d'air (Air intake) [ENG]
ENG (0.1895), FUE (0.1214), DOC (0.1192)
Moteur (Engine) [ENG]
ENG (0.1494), CDV (0.1285), DOC (0.1095)
Effacement de données (Data clearing) [RTL]
DPR (0.2059), DOC (0.1357), RTL (0.129)
Téléphone de piste (Ground telephone) [RTL]
RTL (0.1251), ELE (0.1131), EQX (0.1011)

Figure 1: Some results of the exogeneous categorization.
• Frequency calculation and probability estimation. Training data are explored to compute the frequencies (occurrences and co-occurrences) of cues, terms and categories. As mentioned earlier (section 3), the cue occurrences which have been collected around the test terms are ignored during this step. The probabilities required for the categorization operation are then computed.
• Categorization of the test terms. The most probable categories are assigned to each test term (see section 4.1). Figure 1 gives some examples of exogeneous categorization [3].
[3] Some category labels are described in table 1.

5 Endogeneous Categorization
Our approach to endogeneous categorization is simpler. It is exclusively based on a quantitative analysis of the lexical composition of technical terms. Henceforth, the open-class words used to compose technical terms will be called terminological components. The endogeneous approach relies on a much more restricted source of data than the exogeneous approach, since the component set of a terminological database is quantitatively limited compared with the set of contextual cues extracted from corpora. Nevertheless, we make the assumption that this quantitative limitation is partly compensated by the strong discrimination power of terminological components.

ARR                                   MAP
architecture (architecture)   8.440   fonderie (casting)            7.910
encadrement (framing)         8.440   placage (cladding)            7.872
(foam)                        7331    schoopage (schoop process)    7.232
capotage (fairing)            7.445   drapage (layup)               7.111
châssis (rack)                7.353   cuisson (baking)              6.912
élastique (elastic)           7.020   moule (mould)                 6.838
cloison (bulkhead)            6.632   compacter (to compact)        6.591
escamotable (retractable)     6.441   broche (pin)                  6.591
suspension (shock mount)      6.266   chimique (chemical)           6.134
ventilation (ventilation)     6.114   usinage (milling)             6.101

Table 2: The top ten most significant terminological components of categories ARR (Arrangement) and MAP (Manufacturing Processes).
The training phase assigns to each category a set of representative components with respect to some association score. The categorization phase determines the most plausible categories of a term according to its components.
5.1 Association Score 
To estimate the dependency between components and categories, we experimented with several association criteria. The choice of these criteria has been influenced by the comparative study described in (Yang and Pedersen, 1997) on feature selection criteria for text categorization. We tested several measures, including component frequency, information gain and mutual information. Our best results were achieved with mutual information, which is estimated using:

$$I(w, C) \approx \log_2 \frac{N_w(w, C) \times N_w}{N_w(C) \times N_w(w)} \tag{8}$$

• $N_w(w, C)$ is the frequency of component $w$ in category $C$.
• $N_w$ is the total number of component occurrences.
• $N_w(C)$ is the total number of component occurrences in category $C$. This factor reduces the effect of the components weakly represented in category $C$, compared with the other components of $C$.
• $N_w(w)$ is the frequency of component $w$ in the terminological database. This factor reduces the effect of the components that denote basic concepts spread all over the database. For example, the components speed, altitude, pressure, have high frequencies in category FLP (Flight Parameters), but, as basic concepts, they also appear frequently in many categories.

Table 2 gives for two categories the ten most representative components according to this score.

The association score between a term $T$ (with components $\{w_i\}_{i=1}^{n}$) and a category $C$ is given according to the components of $T$:

$$A_t(T, C) = \frac{N_t(C)}{N_t} \sum_{i=1}^{n} I(w_i, C) \tag{9}$$

$N_t(C)$ is the number of terms pertaining to category $C$ and $N_t$ is the total number of terms. The factor $\frac{N_t(C)}{N_t}$ favors larger categories.

The categorization task determines the category $C^*$ that maximizes the association score:

$$C^* = \arg\max_{C} A_t(T, C) \tag{10}$$
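Equations 8 to 10 can be sketched in a few lines of Python. This is an illustrative reading of the formulas, not the authors' code: the function names are hypothetical, each training item is assumed to be a pair (list of component lemmas, single category), so a term annotated with several categories would be listed once per category, and a component never observed in a category is given a score of 0 because the logarithm is undefined there. The paper does not state how that case is treated.

```python
import math
from collections import Counter, defaultdict


def train_endogeneous(training_terms):
    """Collect the counts needed by equations 8 and 9.

    training_terms: iterable of (components, category) pairs.
    """
    comp_cat = defaultdict(Counter)  # N_w(w, C)
    comp_total = Counter()           # N_w(w)
    cat_terms = Counter()            # N_t(C)
    for components, category in training_terms:
        cat_terms[category] += 1
        for w in components:
            comp_cat[category][w] += 1
            comp_total[w] += 1
    return comp_cat, comp_total, cat_terms


def mutual_information(w, c, comp_cat, comp_total):
    """Equation 8; returns 0.0 for unseen pairs (an assumption)."""
    n_wc = comp_cat[c][w]
    if n_wc == 0:
        return 0.0
    n_all = sum(comp_total.values())   # N_w
    n_c = sum(comp_cat[c].values())    # N_w(C)
    return math.log2(n_wc * n_all / (n_c * comp_total[w]))


def rank_categories(components, model):
    """Equations 9 and 10: categories sorted by decreasing A_t(T, C)."""
    comp_cat, comp_total, cat_terms = model
    n_t = sum(cat_terms.values())
    scores = {}
    for c, n_c in cat_terms.items():
        mi_sum = sum(mutual_information(w, c, comp_cat, comp_total)
                     for w in components)
        scores[c] = (n_c / n_t) * mi_sum
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy training set built from entries of table 1 (components lemmatized by hand).
model = train_endogeneous([
    (["enregistreur", "fatigue"], "FPL"),
    (["enregistreur", "paramètre"], "FPL"),
    (["barre", "remorquage"], "TOO"),
    (["jeu", "protecteur"], "TOO"),
])
endo_ranking = rank_categories(["enregistreur", "vol"], model)
print(endo_ranking)
```

On this toy data the component "enregistreur" occurs only in FPL, so FPL is ranked first; with real data the prior factor of equation 9 would additionally favor the larger categories.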
5.2 Training and Test 
Only multi-word terms can be categorized with this method since our endogeneous approach is by nature not relevant for simple words. A test set of compound terms is extracted from the terminological database. The remaining terms are used for training. The training terms are analyzed in order to assign to each category its terminological components. Then, component frequencies and association scores are computed.
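The split described above can be sketched as follows. The helper name `split_compound_terms`, the entry layout (a pair of component list and category) and the fixed random seed are assumptions made for illustration; the sampling procedure actually used is only described as random in section 6.

```python
import random


def split_compound_terms(entries, test_size, seed=0):
    """Draw a random test set among multi-word entries.

    entries: list of (components, category) pairs.
    Returns (train, test); every entry not drawn, including
    single-word terms, stays in the training set.
    """
    compound_idx = [i for i, (components, _) in enumerate(entries)
                    if len(components) > 1]
    rng = random.Random(seed)
    test_idx = set(rng.sample(compound_idx, test_size))
    train = [e for i, e in enumerate(entries) if i not in test_idx]
    test = [e for i, e in enumerate(entries) if i in test_idx]
    return train, test


entries = [
    (["barre", "remorquage"], "TOO"),
    (["dérive"], "STR"),
    (["embout", "coulissant"], "ENG"),
    (["enregistreur", "fatigue"], "FPL"),
    (["bit", "appui", "touche"], "DPR"),
]
train, test = split_compound_terms(entries, test_size=2)
print(len(train), len(test))
```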
#T      k=1     k=2     k=3     k=4     k=5
312     45.07   68.22   79.15   88.10   91.40
89      42.25   73.14   86.95   91.57   98.19
98      50.65   84.94   91.11   94.93   96.96
120     48.11   78.95   91.26   96.83   97.92
253     59.17   81.98   92.35   96.75   98.65
125     73.16   89.30   96.85   98.68   99.59
234     68.19   87.63   94.14   96.99   98.31
205     49.05   83.86   95.12   98.61   99.47
203     47.25   72.18   85.66   94.40   98.48
105     44.55   69.61   87.28   93.97   97.11
Tot. 1744
Avg.    52.75   78.98   89.99   95.08   97.61

Table 3: Results for the exogeneous approach. Ten experiments have been run for a total test set of 1744 terms.
#T      k=1     k=2     k=3     k=4     k=5
229     47.96   62.05   71.23   75.77   78.50
232     42.60   56.38   62.28   69.73   74.23
228     41.5    57.89   63.03   72.76   76.27
227     45.54   61.05   68.57   76.74   80.18
239     46.38   61.01   67.93   73.41   75.94
229     43.41   57.54   65.42   71.16   75.34
231     43.75   60.74   70.04   73.97   77.62
228     42.85   60.18   66.97   70.50   73.85
237     45.68   58.62   66.37   72.10   74.67
240     42.30   58.82   68.48   70.29   74.05
Tot. 2320
Avg.    44.19   57.77   67.03   72.64   76.06

Table 4: Results for the endogeneous approach. Ten experiments have been run for a total test set of 2320 terms.
During the test phase, each test term is annotated with the most plausible categories according to its components.
6 Experiments and Evaluation 
To estimate the accuracy of the exogeneous method, we used a domain-specific corpus of 541,964 words, composed of documents pertaining to various textual genres (software specifications, maintenance procedures, manufacturing notices...). This corpus covered 63 of the 70 categories. Each run starts with the selection of a document among the corpus documents. The known terms identified in this document are considered as test terms. We used relatively wide contexts. The cues were extracted in a window of ±20 words around the term. Each run involved more than 70,000 contexts of term occurrences. To test the endogeneous approach, test sets of compound terms have been randomly extracted from the terminological database.
We adopted an evaluation scheme similar to that defined in (Tokunaga et al., 1997) for thesaurus extension. The categorization is considered successful if the right category appears among the k first categories assigned by the classifier. Within a semi-automatic acquisition framework, this evaluation scheme is more suitable than strict evaluation where only the first category assigned by the classifier is considered as relevant (evaluation restricted to k=1) [4]. From our application perspective, it is useful to provide to the terminologist a restricted set of less than 5 plausible categories instead of the complete set of 70 categories without prior filtering.
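The evaluation scheme above amounts to a top-k accuracy. The Python sketch below shows one way to compute it; the function name `top_k_accuracy` and the choice to count a term as correct when any of its reference categories appears among the first k are assumptions, since the paper does not detail the treatment of terms annotated with several categories. The three rankings are taken from figure 1.

```python
def top_k_accuracy(rankings, gold, k):
    """Percentage of terms whose reference category is in the first k.

    rankings: list of category lists, best category first.
    gold: list of sets of reference categories, aligned with rankings.
    """
    hits = 0
    for ranked, reference in zip(rankings, gold):
        if set(ranked[:k]) & set(reference):
            hits += 1
    return 100.0 * hits / len(gold)


# Rankings from figure 1: Entrée d'air, Renseigner, Effacement de données.
rankings = [
    ["ENG", "FUE", "DOC"],
    ["CKI", "VOR", "DPR"],
    ["DPR", "DOC", "RTL"],
]
gold = [{"ENG"}, {"NAV"}, {"RTL"}]
for k in (1, 2, 3):
    print(k, round(top_k_accuracy(rankings, gold, k), 2))
```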
In the experiments described in (Tokunaga et al., 1997), k takes the values 5, 10, 20 and 30 for averaged performance ranging from 26.4% to 55.9% (the choice is made among 544 categories). Some of their precise experiments yielded an accuracy greater than 80% for k = 30. In our experiments, we measured accuracy for k=1 to 5. Some results are given in tables 3 and 4. The scores are higher than those achieved in thesaurus extension, especially with the exogeneous approach (from 52.75% to 97.61%). We should however keep in mind that we deal with a different kind of data (see section 3).
[4] Some limitations of strict evaluation are also pointed out in (Resnik and Yarowsky, 1999).

7 Conclusion and Further Work
We have presented in this paper two approaches to term semantic categorization that have been fully implemented and tested on significant test sets. The results achieved in this work demonstrate that term categorization tasks could be integrated within a semi-automatic terminology acquisition process to provide an active support to terminologists.

The solutions to this problem can be considerably improved and we have identified several promising directions for further research.

Our experiments show that exogeneous categorization is noticeably the more effective of the two approaches. However, it requires many more knowledge sources and more computational overhead. It is more exposed to data sparseness, since large amounts of contextual data are not always available, especially in technical domains. We should stress that this study benefited from the availability of a highly relevant corpus. This means that, for the sake of robustness, other methods (even less efficient) and relevant knowledge sources should not be neglected. The two proposed approaches are complementary in the sense that they take advantage of distinct knowledge sources. Further work will investigate the various ways to combine them in order to improve the overall performance.

The use of relational information, and particularly syntactic relations, is another major direction for further research. Exogeneous categorization is based on a bag of words/lemmas model since wide contexts of lemmatized words were used, without consideration for the positions of these cues and their potential syntactic relationships with the target terms. Syntactic information extracted from local context, such as verb-object relations, is another major source for exogeneous categorization that has been exploited in thesaurus extension methods. The endogeneous approach can also be improved by exploiting the syntactic structure of the technical terms. In our approach, all components of technical terms are equally weighted, independently of their syntactic roles within the terms. More accurate association scores can be introduced by taking advantage of head/modifier relations.

Finally, we should note that the bilingual nature of our terminological resources has not been taken into account. Minor changes are required to make the two classifiers work for English. Further experiments will be conducted on the English resources. In this bilingual context, either the French or the English expression (or both) could be used to categorize a given term.

References 

H. Assadi. 1997. Knowledge Acquisition from Texts: Using an Automatic Clustering Method Based on Noun-Modifier Relationships. In 35th Annual Meeting of the Association for Computational Linguistics, Madrid.

D. Bourigault and C. Jacquemin. 1999. Term Extraction + Term Clustering: An Integrated Platform for Computer-Aided Terminology. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL '99), pages 15-22, Bergen.

W. Daelemans, J. Zavrel, K. van der Sloot, and Antal van den Bosch. 1999. TiMBL: Tilburg Memory Based Learner, Version 2.0, Reference Guide. Tilburg University.

B. Habert, A. Nazarenko, P. Zweigenbaum, and J. Bouaud. 1998. Extending an Existing Specialized Semantic Lexicon. In Proceedings of First International Conference on Language Resources and Evaluation, pages 663-668, Granada.

N. Ide and J. Véronis. 1998. Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art. Computational Linguistics, 24(1):1-40.

D. Petitpierre and G. Russell. 1995. MMORPH, The Multext Morphology Program. Technical report, Multext Deliverable 2.3.1.

P. Resnik and D. Yarowsky. 1999. Distinguishing systems and distinguishing senses: new evaluation methods for Word Sense Disambiguation. Natural Language Engineering, 5(2):113-133.

T. Tokunaga, A. Fujii, and M. Iwayama. 1997. 
Extending a thesaurus by classifying words. In 
Association for Computational Linguistics, editor, ACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources, pages 16-21. 

N. Uramoto. 1996. Positioning unknown words in a thesaurus by using information extracted from a corpus. In Proceedings of Coling 96, pages 956-961.

Y. Yang and J. Pedersen. 1997. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97).

D. Yarowsky. 1992. Word Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora. In Proceedings of Coling 92, Nantes.
