Positioning Unknown Words in a Thesaurus by 
Using Information Extracted from a Corpus 
Na,ohiko URAMOTO 
IBM R(~,qea, rch, Tokyo Resea, r<h Lal>or~Ltory 
1623-14 Shinlo-tsuruma,, Y;una, to-shi, Kana,g~Lwa,-ken 242 Japan 
uramoto({~t rl.il)ln.(:o.j t) 
Abstract 
This p~q)er describes a. method for positio,ing un- 
known words in an existing thesa,rus by using word- 
to-word rela.tionships with relation (case) markers 
extracted from a large corpus. A suitable area (if the 
thesaurus for an unknown woM ix estimated l)y inte- 
grating the human intuition I)urled in the thesaurus 
and statistical data extracted from the corpus. To 
overcome the prohlem of data sparseness, distin- 
guishing features of each node, called "viewpoints" 
are. extracted a.utomatically and used to calcMa.te 
the similarity between the unknown woM and a. 
word in the thesaurus. The results of a.tl experi- 
ment confirm the COrltril)ution of viewl)oints to the 
I)ositioning task. 
1 Introduction 
Thesauruses are among the most useful knowledge 
resources for natural language processing. For ex- 
ample, English thesauruses sttch as Roger's The- 
saurus and WordNet \[4\] are. wideJy used for tasks 
in this area \[,5, 6, a\]. Howew~r, most existing the- 
sauruses are compiled by hand, and eonsequently~ 
the following three problems occur when they are 
used For NI,P systems. 
First, existing thesauruses have insufficient vo- 
cal)ularles~ especially in I~nguages other than En- 
glish. In J~pa, n, there are no free thesauruses that 
can I)e shared by researchers. Vu rthermore, ge.eral-- 
domain thesauruses do not (:over (lomMn-specifie 
terms. 
Se(:ond~ the human intuition used in constrllct- 
ing thesauruses is not explicit. Most existing the- 
sa.tlrttses are hand-crafted by ol)serving huge amounts 
of data on the usage of words. The data a.n(l human 
judgements used in (;onstructhlg thesauruses would 
be very useful in NLP systems; unfi)rtunately, how- 
eve h this information is not represente(I in the the- 
sauruses. 
'l~hird, the structure of thesauruses is subjec-- 
hive. The depth and (lensity of nodes it, (tree-llke.) 
thesauruses directly a:\[l~ct tilt', calculated distances 
between words. For example, n(>des fi)r biological 
words haw=, many levels, while abstract words are 
classified in relatively shallow lew4s. However, ex- 
isting thesauruses only represent unif(lrm relation- 
ships between words. 
This pa, per describes a way of overcoming the 
prol)lems, using a, medium-size Japanese thesaurus 
aim large corpus. 'Yhe main goal of our work is 
to expand the thesaurus automatically, explicitly 
including distinguishing features (viewpoints), and 
to construct a domain-setlsitive thesaurus system. 
:Co expand the vocabulary of the thesaurus, it 
is important to position new words in it automati- 
(:ally. In this paper, words that are not contained in 
the thesa.urus but that appeared in the corpus more 
than once are called unknown words, l '\['he proper 
positions of the unknown words in the thesaurus are 
estimated by using woM-to-word relationships ex- 
tracte(l from a large:sca.le corpus. This task may be 
similar to word-sense disambiguation, which deter- 
I|Hlles the correct sellse of a, word from several pre- 
defined candidates. However, in l)ositioning a word 
whose sense is 1,11knowtl, a suitable position must be 
selected from thousands of nodes (words) it, the the- 
saurus, and therefore it is very difficult to position 
the word with pinpoint accuracy. Instead~ in this 
paper~ we give a method for determining the area 
in which the unknown words belongs. For example, 
suppose the word "SENTOUKI" (fighter) 2 is not 
contained in a thesaurus. Calculation of the simi- 
larity between the word and those in the thesaurus 
a.sslgns it to tile area \[flying vehicle \[air plane, heli- 
copter\]\]. 
Viewpoints are features that distinguish a. node 
from other nodes in the thesaurus, and are good 
(:lues for estimating the area to which an unknown 
wor<l should be assign(d. The area can be efficiently 
estimate(/ by extracting viewpoints. 
Several systems have used Wor<lNet a.nd statls: 
tieal infi)rmation from large corpora \[31 5, 6\]. How- 
eve.r, there are two common problems: noisy co- 
occurrence of words a,nd data sparseness. In Word- 
Net, since each node it, the thesaurus is a set of 
words that haw~ synonym relationships (SynSet)~ 
wtrious methods for similarity cah'ulatlon using the 
SynSet classes have been proposed. In this t)aper~ 
ISAMAP \[8\], a hand-crafted Japanese thesaurus, is 
used as a (:ore. To overcome the problems of noise 
1Tha.t is, unknown words do not lne~\[\[ very |ow-frequency 
words. 
~A Jttpa.nese word in ISAMAP is represented by a pair of 
capital Rom~tn letters and the word's English tr~mslatlon. 
956 
J*.-~/J (Physical Object) J3,\[~ (Phenomenon) 
4'f,(,?,;tu<~ (Creature) I'N.{~ (Relation) 
:hllN~ (Abstract Object) II,~/: (Time) 
)J;?)? (Method) Jg-/i)i (Location) 
~rigOJ (Action) ~'2111J (Space) 
),'~'~'\[:. (Attribute) q'i.l'~; (Unit) 
~fL'~ (State) @'f"l~ (Operation) 
)3 (Force) 
Fig. 1: '1'() 1) Categ(,'ies ,)\[" ISAMAI' 
and (la,t;a si)arse.ess, relati(mshil)s ofc(m imcl,ed ,odes 
in the 1.hesa.urus are used. t{,'s,ik \[)r(qmsed a. class- 
tmse(I a, pproach, in which sets o\[" words are .sed 
insl:ead of words \[5\]. ht his apl>r,)a<Jt , each bynset 
is used as a, class. In our apl)roach ~ (,n tim other 
ha, n(\[, an a,r,'m, l, ha.l; coul;a.ins ('otlrm(:t,,(l no(los iu 1.ho 
thesa, tlrus is use(l as a class. The .odes are on. 
ne('ted by IS=A relatio.ships as well as syn(mym 
rela, i,ionshil)s ~ and theref'ore large areas rel)resenl, 
strong similarities to unknow, words. 
2 Knowledge Sources 
This sectlo. (lescrihes the thesaurus and stal islical 
da.l;a used in l.his pa.lmr. A Jalm.U(,se uoult t;h('salJrus 
ca.lh~.<l 1SAMAI' is a. set <)f' IS=A r,qa.lio,shil,s. 1l 
contains at)out 4,000 nou.s wil, h a.lu)ut tel, h,vels. 
Each node of \[SAMAI ) is a. woM or a woM a.l,l 
it, s (one or two) synonyms. Figure 1 shows the t,o 1) 
categories of ISAMAP. Some words are 1)la,:ed al. 
mult, iph~ l>osil,i(ms iu l, he thesa,rus, tg)," ,>xantlde , 
SENSUIKAN (sulmla.ri,e)" is cla,ssilied a.s "wa.l,er 
wJdcle" and "weapon." 
To extract viewpoints For the ,,xlsling sl, ructure o\[' 
t,}le 1;hesa, llrtls ill oMer to position u,klmwu words 
in i% a collect, io. of pairs of words a,d t.heir rela~ 
t;ion markers, togel.her wil.h l.}mir \['requ,mcy, was ex 
Lra.(:l;ed from acorpus. The source of the words was 
artMes pu blished in a. aa.p.umse , ewspape r (Ni kkei 
St,inbun) iu 1993. The a.rl.icles were mOrld,,)h)gi 
tally analyzed a, utoma.tica,lly, a,,d t t,,', stored i, t,}te 
following form: 
oc(wordl, rel, word2) = n 
This mea.ns 1;ha,l; ~ordl a.nd word2 occur n 1:i rues 
wil;h a relation marker rel. I(elati<)u markers (:(re- 
sist of cause markers such as "G A", "WO", alld ~"N I'~'~ 
and adn<)minal Forms of adjecl,ives a.d a(ljefq,ive 
nouns. The statisl, ical (lal,a \[',)r each relationshi l) 
are shown in Figure 2. 
We use restrictive relatiolmhil)s with 1,he mark 
ers, rather than wor<l 2-grains, for two reaso,s. In 
the tluknowli-word=sense disa, ml)iguatiotl task, the 
n u tuber of possi hi,; ca.di(late word-s(mses (l),)sitions 
iu the thesaurus, in this paper) is very larg% a,.d 
thus it; is iml)ortant t,o re(luc(~ noises t\[la, l, l,r(wenl, 
the output of a result. Secou(\[~ 1;hese case rela- 
ti,)nships <'an I>e used l,o i(h.ntify classilical, iolt view- 
points for thesauruses. For examph~, suppose that, 
i o 
(a) 
a l'ea \] a Y (2,21 
Vig. 3: Marke, d Nodes i. the Thesaurus 
the woMs plaue and ship ~re located tmlow vehi- 
cb-. We can say that "planes fly" a.d "a plane h, 
1he sky," but not "ships fly" or "a shil~ i. the sky." 
That is, (plane, SUBJ, fly) a.d (pla.e, i., sky) can 
I., called viewlmints for the word "pla,.e." 
3 Positioning Unknown Words 
'l'his sectio, descrilms the procedure fi)r positioning 
()\[" words iu \[SAMAI ). In this taM<, tim inlmt is a 
woM to t)e l~la,('ed somewhere in ISAMAI ). 'l'h(~ goal 
is 1,o (letermine tim most suil, a,t)le area for the woM. 
The procedure consists (,1" 1he following three steps: 
Step l: I';xtraclh)n of viewl)ohgs I'or each node in 
ISAMAP. 
Step 2: I<xtra,ction or('andidat,,~ areas f.r the input, 
WOF(\[. 
Step 3: F, wLhm.tion ol" the ca,,ldldates and selection 
<)f' the rooM. preferabh+ area+. 
3.1 Basic Idea 
The I)asic idea. is very simple. For a, unk.ow. 
wor(l~ I,\[le word l,o-.word rela.l.ions\[lips l.ha.L (~oltl;a, in it, 
are exl.racte(l. 'l'h,, similaril;y between the word a.nd 
each ,,,)de in ISAMAP is calculated. The nodes for 
w hich tim similarity exc(mds a predefi ned l,h reshold 
ar(' Illal'k(~(t a,u(1 cOIIllCCl;('d ill the l;}H~Sallrtl.S. '\['\[m 
left tree in Figure 3 shows nodes in 1.he t.hesa.urus. 
'l'h(, ma.rke, I nodes are represented I)y hlack cir- 
ch's. For st raightfi)rwa.rd statistical similarity caJ- 
cula.l.i(ms, there are ma W similar words, inclu(tiltg 
,()isy words. \[n this l)aimr) the followi,g three hy- 
tmtheses a, re used to resolw~ l, he probhmt. First, the 
marked words \[})rm cerl;aiu areas (connect(~d nodes) 
<)fwords in tlmt.hesaurus. Tim areas tha.l, occupy a 
large sl~ace are preferred. The right tree in Figure 3 
~;\[IOWS a,i'c~aS of words. 
Se(:o,t<l, specific words, that is to sa,y~ words a,t 
h)wer hwels of trees a, re preferred. In Figure 3, 
areal is pro.ferre(l to area2. 
Third, ea.ch node in the thesaurus has viewpoi'nts 
that distinguish it: from ol.her nodes. The view- 
lmi.ts fi)r ca,el, node are ext,ract,ed 1)y using case 
aml modilication relati(mships t;i~a,t contain sta, tis- 
1,teal data extracted \['rom the corpus, lfa, n unknowll 
word has the sa, me viewpoints as a certain ,ode, t,\[le 
simila,ril, y tbr 1,he .ode is weighted. 'rh(~ next sub 
secti(m (h~s(:ril)es how viewlmints axe exl, ra,cted. 
3.2 Extraction of Viewpoints 
A viewpoi.t is a set of disl, i.guishing M,,tures \['.r 
each node i. a thesaurus. The viewpoint of a no(le 
957 
Ma, rkcr 
GA 
we 
HE 
NI 
DE 
TO 
NA 
Total 
l)istinct lillilll)er 
394.887 
483,400 
18,564 
451,986 
225,247 
176,738 
78,079 
,51,001 
1,879,902 
Total illltnber 
817,030 
1,2101581 
53,876 
1,114,877 
61,4619 
570,475 
569,837 
881,25,5 
5,832,550 
Relatioiiship 
Subject (e.g. liia.n go) 
Object (e.g. drink coffee) 
Goal (e.g. go to offme) 
Goal, etc. (e.g. go to church) 
histruinent, etc. (e.g. hit with }laniiner) 
Accompanier (e.g. ili~ll and wonia.n) 
a(InonllnalizMion (e.g. basic word) 
Adnomlna,liza.tion (e.g. large t)uilding) 
Fig. 2: Numl)er of Sta.tistical l)atlt 
node is defined ~s a. list, O~od<, marker, word). '\['hough 
Stl(:h features are implicitly used in the creation of 
most existing thesauruses according to hilRiaJi in-- 
tuition, they axe lost wheii the constructe(l t\]~le-- 
sara'uses are used. An exception is the Wor(1Net~ 
in which the distinguishing \[ea.tures a.re nlP~nually 
listed. In tliis l)aper~ the distinguishing fea.tllres aa'e 
extracted automaticMly, reflecting the characteris- 
tics of the corpus to be used. 
For example, Figure ,5 shows a f)a.rt of ISAMAP. 
The viewpoint of a. node in the the.sa.urus is esti- 
mated by using a certa.in l)rocedure. Suppose we 
want to extract the viewpoint o\[' the noun "HF, I{IKOP- 
UTAA" (helicopter). Tile wor(t occurs 131 1.iines in 
our corpus. Figure 3.2 shows exa.mples of the rela_ 
tlonshil)s. 
For each rela.tionship~ a sea.tell is nlade h)r nodes 
that have the sa.me relationship. In tile c~se of 
the pattern "TUKAU" (use), 385 nodes with the 
s~mm relationship ~tre extra.cted Ijrom a.reas, scat- 
tered throughout ISAMAP. On the other ha.nd~ the 
pattern "TOBU" (fly) shares only two nodes, h.e- 
licopter aaid aiwlaue. The nodes have direct IS- 
A relationships; in other words, the nodes are c.an 
be connected in the hierareDy of nodes. Since the 
viewpoints of a node are inherited by its children 
in many cases, the existence of the connected nodes 
that include ISA relationships is strong evidence for 
the viewpoints. In this case., (fly, SUB) is a view- 
point for the node "airpla'n% " which is the topmost 
of the connected nodes. 
Viewpoints are extracted by calculating th.c typi- 
cMness of word-to-wo,¥t relationsh@s. Given a node 
nd a'nd its candidate viewpoi.n.t (a pair of a relation 
marker rel and a wo,d w), the typicalness of the 
viewpoint is calculated as 
typicalness(nd, tel, w) 
_ ( E<~ o..( .... ~,'~) E. ,, o<,( ....... ~,<:) 
- "°:' t o.(.,..,,-,' ' 
where N is a set of no&s i'n \]b'AMAP, and C 
is a set of conuected no&s that contain the: word 
w. Examples o i the vie'tvpoi,nts (whose typicalness 
exceeds 0.5.) are as follows: 
I 
flying vehicle land vehicle water vehicle / / 
iic(~ket balloon /% a ear train coach s I 
air plane helicopter cargo ship patrol boat 
Fig. 5: Exanlple of Viewpoints the Thesaurus 
Node (word) 
airplane 
Viewpoints 
(fly, SUBJ), (land, SUBJ), 
(take off, SUBJ) 
rocket (la/in('h, OB3) 
ship (come alongside the pi~,r, SUB J), 
(sink, SUB J) 
land vehicle (transportation , by) 
3.3 Example for Positioning Words 
in ISAMAP 
Let us consider ~n exainph~ to see how Mgorithm 
works. Suppose the word "SEN'I'OUKI" (fighter a) 
is to be placed in the thesaurus. 
First;, for each node in the 1SAMAP, the slmi- 
l~rity between the word and the node is cMculated. 
The similarity is ca.lculated according to the rollow- 
ing formula.: 
st're(w1, w~ ) = rnax(simi , sire2) 
(o,.(w,, ,., v) o,,(w~, ,_',_s,) '~ st.,,,, -_ E ,, + 
,,~p o,,(_, ",v) ) 
(o,~(v_~, y, 2':) + o,,0,, ,,, y~) \] W si71~ 2 
~' ,. o~0,, ,.,_) o4v,,.,-) i pC= P 
1 ) is set of words that co-occurs with wiorw2, 
and the argument "_" can be any words. If the 
simib~rity va.lue exceeds a pre-delined threshold, the 
node is marked. Figure 6 shows marked nodes that 
h~ve high similarity. 
aIn English, a fighter meltns 1)oth a t)lane and a, person; 
however, the original Jetpturese word SENTOUKI means only 
a pla,ne. 
958 
Word M,t,rker ~ W,,,,l 1 
TUKAU (to !,,~e) ~:0 -I to i :!yAATA (J,(t .... \] w°-.-l- 
SIMA (,,nTciVislaT{,l 7 l)1< !~ 5 \[ HUI{UMU (to contain) _J W() \[ 2 \[ 
K~VtTI;JY-()~-.o~-- NI \] .'}J I(ANPAN (oll dock) ~_ 1)1< __L~_j2 
Fig. 4: l,xa, inl)le (if' llel~tlonshil,s for "h(qicopt(,r." 
5 
6 
7 
9 
io 
13 
Word 
\[IITO (hu ma.n) 
KABU (stock) 
SE IH IN ()}m,n ufa,('t t, )'(:) 
MONO (object) 
K IN (gold) 
a Y UUTAK U (house) 
____ GIJYU'I'tlSYA (engineer) 
KIG YOIJ (compa.IG) 
BUH IN ipa,rts) 
SET UB I (fa,cilides) 
HeN (1)(iok) 
TAI (pa.rty) 
\[ KO UI< U U )< I 0d,. pla.,le) 
HEIKI (weapon) 
- N(;,h;::kl - 
1.0 
0.0.1.6 
0.0 
0.0.6.3.0.6.0 
0.0.0. I.0. I.0 
- t(,~i;~tidt~li \[1;.~ - 
}nL(l, protect 
purch~Ls(',~ h~Lve: buy 
purcha,s% ha~v% buy 
purcha.se, h;Ge, buy 
\])llrch~so, \[ia, ve, re}) 
l)UeCh~se, have, buy 
1.2.0.0.2.3 
0.0.0.0.1 
0.0.O.t.3 
0.0.0.0.0.:1 .'2 ,) 
I.~.0.0.14 
ha,v% tit!y, prot(~ct 
l)urchasc, buy, ('Xl)ort 
l)urcha,se, buy, ha, ve 
l)tlr('ll~tSo: ha.v(~ \])~ly 
0.0.0.0.0.0.1.0.2.0 
0.0.0.0.0.2 purchase, ha.w~, buy 
sc)!d, da.i}g(;r, collision 
l)}!rcha,s(',, fly, buy 
Fig. 6: Marked !redes will! ma.tctmd !'(da.tiouships 
(~h)uma / engineer (<:) r~omployee flying vehicle ~-helicopter 
", air plane 
Hayer ( d ) 
~man vehicle ~- bicycle 
--woman "- bus 
(b) (d) weapon ~- nuclear weapon food -- confoctionoD/ 
k missi)o 
Fig. 7: Cat,didati, cii,,oc(,io,s for "light(,r" 
Areas tha,t co)d;a,i)! ma, rked !moles ~r(: (m,hmla,ti~d. 
The results ~Lre given in Figure 7. 'l'h(~ )Host suite.hie 
a.rea, for tim word "fighter" must be s(de(:ted from 
mull.!pie c~mdida.te sets of con imcth)l!s. 
The I\]uM liha~se is tim evaJua.thm of tim ca.udi 
(hm,s. E~L(:h (:aii!(lida.te is (wa.hlail;ed aic('(irding (,o t.l)e 
fl)lh)wing t'o(!r criteria!. 
Criterion 1: The size of the ca.,dida.t(~. Giv(,~ 
~ri inl2UL word w (in this ca,se, "fight(.'"), a, ud a, 
nod(: (,b~t is conta,ined i, the ca,ndida, te C, CI - 
\]C,<,<~,;c c' "~i"" ("", ",od,:). 
Criterion 2: q'he h(:ight ()f" l.hc (:a.)!dida.t('. C2 is 
th(; number of levels in the ca,,dida,l,e. For c×a.mI)lc, 
in the c,~ndidaA;e (a.) in Figure 7, C2 = 2. 
Criterion 3: The a.ver~ge det)l.h of the nodes. For 
exa.mph,, the depth of the node "a.irpla.ne', whose. 
node-id is 0.0.O.0.0.0. l.0.2.0, is 10. 
Criterion 4: The nund)er of viewpoints. For ex- 
a.inple, c~L),lida,lc (a) (whose top node is "human") 
ha,s the l~rgest imml-)er (if no(h,s. However, ~s show. 
in Figure 6, the ma.tclmd rehd, ionshilis ("ba.d hu- 
me.u/fight ~r ~nd "l)rotect hulua.u/fighter") are not 
typica.l (~×pressions h)r (he word "fighteF'; th~( is, 
the r(:h~.tii)nsltips a.rc not vii:wpoints. On !hi" other 
ha.,d, "a.iriih~rm" in cn.ndi(hLte (c) sha.res t\]!(, "fighter 
(a.irpla),') fly", which is the viewpi)int of th('. n<)<le 
"~Lirphm(C' C,4 is the numl)i~r (if ma.tclmd rela,tion- 
ships tha.t a.re ('l))tsider(~d as viewpoints of the node 
ill (,h(~ ca.)ldid~Lte. 
) :') Tim totaJ i)refi~r(mcc P(word) is/'I (71 + it 2c,- + 
p3C3+p4C4, wh(:re p!, p2:/>3, amd P4 axe weights for 
ea.(:h crit(~rion. Intuitively, ~u!d according to a pre- 
limhl~!'y (,Xl)Crime.t ~ 1,he conl.ril)ution of C3 sho!!ld 
ca.try more weight tha.n the ol.her criteria. (in our ex- 
Imrhtte)d,, p) : I, P2 --= I ~ P3 --= 0.4, ~nd P4 = 3). 'l'h(~ 
mosl. l)r(ff(~ra.ble ca.ndid~Ll.e for l.h(, word "fighttC' is 
(c); tha.t is, "fighter" is I)la,ced in the +~)'ea whose 
top node is "tlyi+!g vehich:." 
4 Experiment and Discussion 
This section describes some (,Xl)eriments for l)()si - 
)ioning woMs in ISAMAP. Figure 4 shows pa, rt of 
the rest!Its. In the eXl)erhnent ~ 2,000 nodes with the 
root "physical obje(:t" in ISAMAt' were used. 
959 
eo tl r t 
presklent 
A ustralia, 
present 
the House of Rei)resenl.atives 
author 
SelllillaJ" 
lllllSell ill 
wife 
t}eayy oil 
Word l)osition 
(orga.niza.tion (utiion, meeting, pa.rt.y, clas.~)) 
(human (man, woma.,, lawyer, family emphlyee, etc)) 
(,.~,tio,, (J.,pa,,, Chi,,., r.~si~,)) 
(objec.t (food, hat, l,arts, etc)) 
(organiza,tion (unioll~ meeting~ pa,rty: team, eL(',)) 
(h,!,.~,,, (,,.~., we.,..., l~,wye,', r,~:,,,ib, e,.l,l,,yee, ere)) 
(equipment (seho(,I, public equipment, pa.rking, etc)) 
• (eq,ipn,e!,t (scho(,l, l, ublic equipment,pa,'khig, etc)) 
(human (man, w,,ma.., la.wyer, family emph)yee, etc)) (,,l~,\]e(:t (,.,~teri..l (r,,eJ (g~,~, p,'tr,)Je,l,~O))) 
Fig. 8: Result 
IO0 
90 
gO 
70 
60 
50 
40 
'gr,d~t' 
b0 lO0 150 200 2gO ~00 
Numbel of felationahipa 
Fig. 9: Rela.tionshi t) I)etweell the uuml>er of rela, 
tionships and the a.ccura,('y (ff 1)ositil)ning 
of the node. For example, the likely area of "heavy 
oir' is "((,I)jeet (lnaterla.l (ruel (g~s, l)em)let.~O)))" , 
whose top node is very a,bsl, ract. \[towew'~r, the rela- 
t.ionshi 1) "heavy oil and gas ~i'' suggests the position 
of "heavy oil." 
By ,sing the proposed method, the existing the- 
sa, urus was expanded to cower a large quantity el 
text. Though ISAMAI ) was designed for general 
purposes, the method alh)ws it to reflect a specific 
domain through tile use of a. dt)ma.in-dependent col 
pus. One o1" our gems is m develop a corpus-based 
thesaurus, c(msisting I)f' a c()re thesaurus such as 
ISAMAP and a eorl)us that reflects domain knowl- 
edge. When a thesaurus is used for NLI ) applica- 
tions, such as a.n information retrlewd a,nd disam: 
....... , ..... • ,,. bigu~tion system there is no need for it 1.o have 1 lie exDerlll\]ellL Vlel(l(~(1 SeVOI'~LI ODserv;LLIOIIS. VIeV~:- • 
• " • % ~ , . • .... , well-defined tree-like strtlcl, tlre The system ca31 rise pOlllLS aJ'o sbrollg Cl/les lot (lel,er\[llllllll~ Llle Still,- ....... 
al)le positions in the thesa, rus for unknown words, the thesaurus a~ a, bl~ck box via certain functions. 
In co-occurrence based similarity calcula.tion, words 
with strong similarities but whose relatioashit)s seem 
st;r~nge to huma.n intuitioa reduce the a.ceuracy of 
the proposed method. Howew~r, ill trla, lly cases~ 
these strong similarities are caused by less typica.l 
co-occurrences. In Figure 6, the words "buy, .... tmr - 
chase," aim "h~ve" convey less informa.tiw, relation 
ships than viewpoint relationships. 
If ttmre are many role.lion,ships \[br an unkuown 
word, the possibility of the existence of viewpoints 
will increase. However, some relationships ma.y be 
noisy. Figure 9 shows the relationships between the 
mtmber of relationships a.nd the ~ccuraey of posi- 
tioning. In this case, the a.cctlra.cy mea.ns the per- 
centa,ge of words tot which the most preh~ra.lde area 
estimated hy the proposed method contained the 
node theft the word really belonged to. As shown 
in Figure 9, 50-100 relationshil~s are needed to es- 
t.im~te the nodes. On the other }ran(l, too ma, ny 
relationships prevent the ext.raction of useful view 
points. 
It is very di\[\[ictdt 1.o position a, word with pin 
point accuracy. Experiment showed lhat the fol 
lowing heuristic is usefll\], If all ullkltoWIl word lies 
conjunctive relationships with a node (word) in a, 
pa,rtieular area. it can be positioned a.s a sibling 
I%)r examph', the following f/l.ctions are needed fi)r 
a corpus based thesa, tlrtls system: 
'positiou(w): ret;,rns the position (or pa.th) of the 
word w. 
supcvordi,~al, e(@: returns the superordinate words 
of the word w. 
subovdinale(w): returns the subordinate words of 
tim word w. 
simihrr(,~): returns the words similar to w. 
dista'nc~-(wJ, w~): returns the distance between wl 
and w2. 
It is important that the return vMues of the 
%nctions should be depend on the corpus and the 
locM context of words, s q'he proposed method can 
t)e used to reMize these functions. Viewpoints make 
it po.ssible to realize ;t, dynamic interpretation of dis- 
Lance. 
5 Related Work 
The method proposed here is rela.ted to two top- 
ics in the litera.ture: ~utomatic construction of a 
thesaurus and word-sense d isa.mbiguation. 
4The marker TO iudic~tes the conju.ction. 
For this lmrpose., the functions can be exp~nded to con- 
taiu tit(', local conlexl o\[ the word as augmeutations of the 
functions (e.g. posltion(w, context(w))). 
960 
Tltl;re tiave I)eeli sevei'a,l st u{lies of' the a, ti I,litnatit + 
ct}iisl;t:llt;lJ{}tl (}\[' lJtt+sa, tirlises or ,'-;e(,s 1}1 ~ IS-A i'olaJJiin 
sllil)s \[7, 1\]. In i,ttese si, udies, l tl{" c{)nMruclx,<l rela, 
tionshi i}s :-;Oitl{~{,inie,'-; {Io ii(il; liiaJ,t2h ti tl itl gt, ll in l, llilio it. 
IS-A re, la.tionshil)s {1o ilol, a,t)tiea, r itl {;}tc~ {tt)rl)()l';,i, ox- 
t}li{:itly+ a.iid it, is t,}iert,f'{}r{! dillicult 1,(} ('xl, ril.tq t heiti 
witlitiul, including tltii:-;y re/a, t iliiishilis, Ill otlr ;'t. l) 
I)roa>ch~ a, (2()1"{2 lJL('S~t, III'IIS iS /IN('~(\] |,O iIIL(B~I'~L(,{+ ~Itltll+l.il 
int uitit)n with cor l}tis- l ia,sed co occu I'l'{+li{T' in fl)rtita_ 
l,ion. 
ga, rowsky i}roiJ()s(~(I a, iilC, l, hod f'{ir word {lisa.tit 
t}igua,tion using |(o<get's The,~a, urus \[9\]. In his at)- 
f}r()a,ch} a wor(\[ wtlose sr, ilSO:-; ;i,r(' \](il()Wll (;i, word 
itla,y have severa,\] ,'-;c'iiso,w,) is disa,tiltiigual,e(l I)y liB- 
Dig "sa.lienl, words" for oa.ch word s(qise. A <~et of. 
sa, lieiit words is a, llsl, {)f wilt'lib with rio r{qati()n- 
shit}s. I n {lilt" ;i,t>l)roa{'li } wlir{I- 1,o word rela,i,ionsilil}S 
wii, h iti:-t.rkers a,re used, in ()t'{l(~t" {,{i r{~tIIiC{~ li{}ises a,n{I 
to (;xl, ra,{;L viewl}oints. 8(}ili{~ ol, ller ineth()ds tip w~ir{l- 
seiisc~ {lisa~litl}igtla,l, iOli liSiiigj W{irdN(% Ira,re t){;+~il pro 
l}{}scd \[7> 6} 3\]. Their a,l}l)roa,t:hes are siinila, r tl} ()u i':% 
wiLh~ the difl'{u'ence that tile s(~ltst~ {}f ~L w{}r(I to I}l" 
placed ill 1;h0, 1,he,,-ia, ill'llS is tlIIkiLoW/t. T}lolts;t, ll(ls (}\[ 
lit)dos ht lJlo. l;h(~sa, tlrllS ;/,pc: candidat,es, a, tt{l Ll,,re 
fore} liloro sut'll,le knowh'dge is ne(~<l(~(l. /Is,' (}1' a, 
core Lhe, sa>tlrLl8 a,tld vlew|}oili{\]s tha, t, is~ word 1,o 
word relati()nships wiLh rclaLio, n~a,rkcrs makes 
it l}{)ssible 1,o (~stima,te a. suit,a,I}h~ area for a,i u, 
known word. 
6 Conclusion 
This tin,per has (lescri\]led a. m<g, ll(id for l)osili(}lt 
iltg tltlkilowii w(ir(ts ht a, tl exisiJtig Lh(,,'-;a.llrll.'-; /}y t\[,',.;- 
ilig wor(l-l,o-wllr{I re\]a,lJons}iil}S with relaJ, i()n tna.rlo- 
ers extracl;e(l frotl\] ;t, la,rge {!(ll'\]}tlS. SuiLM>le a,rl)~ks in 
the/,\]l(;~l+ill;tlS f<}r liliktiOWll WOl'dS wl~l'e es/iuiai,e{l I}y 
iul,(~p;r;/,1,ing hiiltia,ii iiil, uil,lon ll/irh+{I in Lhe l,}io~a.l\[rils 
with ~l,atistical (laA, a, e×t, ra,{%ed flOiti /Jie {~()rpu:-;. Ex 
l}l~t'itll(~ti{,s :-;h()w<~{l thai; ;t,s,'-;igllillp~ '"viewl)oinl,s" rt)r 
e,%ch nolle gives i lilf}orl,ani; in forin al,i(}tl (~a, tl l ic, iise{t 
1,(} csCiina.te Stlital}h~ t}osition,~ iti tile lhesauru~ ~t)l' 
tl~ikiloWii word,<.4. Tile fi}ll(}wiliP£ lX)l}ics ,<-;hoti hi \])e i ti 
vestigai;ed in ful, tlre w(}rk: 
ill W\]ioti a,il tlttkiiOWtl wor(l \]las sevoral word- 
se\[isos, derivative iii("a, iiilig~ {)f it {oii{I 1,() lie 
t)tiried a~lii()ii~ the (',a.n{lidal,e,~. if" we take the 
exli~llllllc ()\[ the wor{I "fi<ghl,cr" use(I in 1,tlis l}a, 
i)l% "w{~a4}oii" is r(;cl}~itiT,,(~d ;~s a, {~a,,{lillalx~ 
a, reP~} but is not g;ive~i a sl, r(}ll~ sinlilarity. Otie 
r(;a,,<-;{)il for tltc prol}hqii is the \]ack o(' view 
l}()iitt,,~. Morc~ lo{'al {'ontexts o\[" 1,he word ai'(~ 
tit;t;{led to s\[}o(:ify ~ii{'li lil('~a,iiiit~:-;. 
I The sitiiila,riLy v~i,lll('~ and viewpoints (;;t,tt t}t ~ 
u><-;e(l to refine l,h(~ sl\[,l'llCl,/lr{~ o\[" the 1,ht~silAli'ilS. 
They rna, ke il, l}ossillie 1,o l'}l;t,lL~(': dy\]tami(;a.lly 
the rela,1;ionshil)s luq,ween words in the the 
sa, tlrti:-; a,{x:or{ling to d()ina, in-sensitiv(~ cor\[}us. 
WI~ ~l't~ il()W (hwehll)ittg Lhe getiera, l \['uiiclJ{ins 
described iii 1Jie i)r(wious se<%ion to realize a 
large-scale thosatir/is t(}l' NIJ > systoillS. 
• I\[" tho lllllil\]){~l" of o{~Cllrl'~ll¢t~,~ of a,lt ullklloWll 
wor\[Is is low, the proposed method t, en{ls to 
(ml,t)ul, larger a.rl,a.+<+ ;i.s<:pO~ith)ns. (_)th('+r con.+ 
l.ra.ints 8llt~}l it.s rise o/" local (:OlIl,l~Xt ~l,ro rl+. 
qtl ired. 
Acknowledgentent 
W'e w(iul(l li ke t(} tha,n k ProL th:izu nii Ta,naka,, Tokyo 
Illsl,il tll,<' (}\['l'(~{;}lilliil}~y fop a,llowiil~ tlS l,(} Jibe ISALM AI'. 
References 
\[1\] M. A. Hearst. "Autouiatic Acquisiti(m of tty: 
l){)iiytti,s t't'(}tti I,a, rge Text, (J, orl}tira'. In I>roc+++:d - 
i'I~gs of COLING-9+<.7, pages 539-5,15, 1992. 
\[2\] M. A. llea,rst a.n{I (l. (lri+i'onst,tq,te. "Pt, efiu 
ing A ut(lina,tica,lly-I)is(:()vtwt~(I I,exica,1 t(ela+tiou 
ships: (J(}lnilitiilig Wea+k 'l'o(:hlliqtleS for ~lsl'oii~t'r 
l(esults", lit \[)roct:f:di.#lg.+ of l\]l.f+ A A A I Wo'rk.+hop 
on ,5'lal, istically-bas+::d N L P 7kchniqu~ s+ F, ages 
64-72, 1992. 
\[3\] X. I,i. "A WordNet.-l}ased Algorit, hlu for Word 
S('~ilS\['~ l)isanll}iguation '~. In l)'roce+rdings o i lh+- 
3'rd A'.'..'tm\[ Workshop ou VePy La'rgc Corp<J'm, 
lia.ges 1368---1374, I995. 
\[4\] A. Milh% tl.. Bex'kwil.h, C. I~llt}a.tllil, 1). Gi'tls, 
I(+ Milh'r, and R. 'll,ngi. "Five Pal}ers {}n Wt}rd- 
N(q,'. 'I'echiiica,I Ptel}ort (',S I, Report 43, (Jogiii 
l iV{! ,q<:iOil{:(+ I,M}ora,l,(}ry} l)rhlceton Uiiiversil, y\] 
1990. 
\[5\] 1 }. t(esiiil(. "Wor(lNlq, a,n(l l)istri\[}ute{l Ana, ly 
sis: A C, lass-I}ased Al}lirtiach I,(} I,exica, I I)isc{}v- 
cry'. In I'roc<<d'i+,gs of t/+,+¢ A AAI Wo'rksD.op o, 
.b'~alisii<:<Uly-Das~d NLI ) T+-:ch.niqu+:s~ pages 48- 
56, 1992. 
\[(i\] I ). Restiik. "l)isa, ml}igua, t,ittg Noun (Ir(}illiiitg 
with Ileal)eel, t(} Wi}r(lNet 8ellgOs +, lit P'rocc~d- 
i'l~gs of lh~ 3rd A',.,u(d Workshop o', V~ry l.a~y\]< 
Co'rpora+ pages 5/1--68, 1995. 
\['7'\] T. 8i.rza.lkowski. "Buihling A I:exica.I l)omaiil 
Ma t} frolii Text (I(}rpt}ra,'. It} P'roc+-:cdings <,f 
COI, IN(Lg/#, l}ages 604---616~ 1994. 
\[8\] I1. 'Pa,na,ka,. "Construction (if a, Tht;sa.urus 
Based (}tl Sul}lq'ordi}ia, tel~ltl)ordina, te C,()ilCC!\]}t >'1 
(it, Japanese). #PSJ, ,_S'I(/-/VL, {14(,1):~,5-44, 
1 {<}87. 
\[9\] I). Ya, rowsky. "W{}r{l-Sense I)isatil.l)iguit.ti(in I}s- 
iilg St,a,tistJ{:a,l Models (if Roger's C+/,l,c'glirles 
Tra,ined on \],arge (Jtit'i}tlr;-t, ~'l. Ill \["roc+:fdi, gs of 
COLING-99, page.<+ 454-460, 1992. 
961 
