Semantics 

WORD SENSE ACQUISITION FOR MULTILINGUAL TEXT 
INTERPRETATION * 
Paul S. Jacobs 
I:iiforiqriaJ, iou Technology l,a.bora.tory 
(I\]", l{.esca, rch aud I)evc\]opumni; (Tentei: 
~(:li('.necta,(\[y, NY 12301 USA 
I)sj a.cobs((#crd.ge.coin 
Abstract 
Wc discuss ~r method for usi,,g ~ut,om~tted cor 
pllS ana.lysis to ;tcquil:e word sense iliforiliaJ,\[Oll 
for inuli;\]lingual text inl;e, rl)reL~tLiOll, Our systl!ili, 
,q \[1 O(', U IN, cxtra.cl, s da.l.~ ~'1'o111 li(!ws stories with 
brewed coverage in 7h~p~mv.~( ~md English. Our al> 
pi'o~tch focuses Oll tyiil~ l.otAel.her word SOllS(~s> ilS- 
ing a. (:olribin~t~ion of workl kuowledge (ontology) 
with word knowledge (corpus da.t~t). ~;e expla.in 
l, he approach a.nd it;s results in SI\[O(',UN. 
1. INTRODUCTION 
Text intcrprei;ation resca.rch has recelil, ly come I;o foctts 
eli data. exl, raction t;he probleui of producing struc- 
l;lll'(!(\[ ilil'orr~i~l;ion f'l;Olil free l,eXf,> usually {,o popitllt{c 
a. dal,~dmse. Oilce I;he key iiil()riual, ioll has lmeli c×- 
lir~*ctcd, it C~ill \])(~ lln(!(l l,o help i~tlalyze tlic (',Olll;oni,s 
o\[' lm:ge volunies of Ix'.×t;s, (h:~cct lTencls, and retrieve 
selccl;ed informaliion. I)~ta exl;l'~ction is ~i, the center 
of the problem o\[ uuma.ging l~rge vollllllC8 of' toxl;. 
()UF gF()llp lla,S led da.ta (,,xl;iJ~tctioii work \['or a lltll\[l- 
1oer of ycal:s, dew;loping new ~rcllitcci;urcs and lexi- 
cons for uaLllra\] \]migltage proc('.ssing an(\[ t;(:si, ing thes('. 
reel;hods in a v;u:iety of ~q)plications \[,h~cobs aii(| l{,au, 
1993; ,|~cohs, 1990; ;/~(:ol)s mid l{,au, 1990\]. lu the 
last two yore'n> as lmrl; of the /\].S. gjOVCl>tilliOllt>s a I{,PA 
TII)8'I'I,;I{ progi;~un, wc have cxt,cnded this research 
I>o halidle bro~t(lCl! doni0~his, wil,h higher a.OC, lll'a('.y, ;rod 
t,o I)ro(:(:SS l;exLs ill Ulll\]l;il)l(~ \[a, ligllag~(~s \[,lacol)s cl al., 
1993\]. 
The goal of processing texts hi a new l~uiguage 
is nol, only to show thai; I,he I)asic algorithnis ~tr(! 
language-indel)endent , but; ~dno Lo preserve as iilil('h 
kno.wh'.dgc a,s possible ~cross laAlgll3,gc8, and, wliere 
applicable, across doniains. For example, in mlapt- 
ing an English system i,o handle Ja,paillese lcxl;n, it, 
is iUll)or\[,a.ul, that I;\]le ;lapancsc. sysi;(Hfl (:oli\[igllrat;ioH 
il;la, kes (ISC ils lnuch as possibh; o\[' 1,he general knowl 
edge, mid cw;n l, he English vo('.a, blilai~y, i,ll~ti, iJic sysl,ciri 
has. Thin tilaxili;lizcn t;lie l)(~r\['orltlalic.(~, a.li(\[ iuiliillliZos 
l;hc. a\[nol/lll; o\[ worl(, each l;ilne l;he sys\[,('.lli is applied 
i;o a new language. 
SI\[()GUN in unique in a, iliilnbcr o\[ ways, btll, it 
in pa.i:ticularly dist;inguinhed by l, he nharing of knowl- 
edge rcsollrces lit difl'erelll, languages. The appl:o~ch to 
lnlflt,illngu~d inl;erprel;ation involw;s two key elements: 
Fit:st, t, he. sysl, eni hichldes ~ core onl, ology of ~d)Otlt 
1,000 concepl,s thag Sul)pori, word ncnnen hi l;he core 
English and '/apancse lexicons, which arc tdso identical 
in sl;ructure. Second, our systenl acquires much of its 
doinain-spcci\[ic kuowlcdge, including combinai, ions of 
words and phrases, Dora corpus data, casing the map- 
liing of word (:lass hdbrn-lation into a new language. 
l,'or example, Lhe \],\]liglish verb establish corl:f!sponds 
very closely to L\]lC. J;tl)aliese WOrd <sc:lsizrilsu (1!~ ~;<:). 
Ill the 'I'I\])S'I'I!',I{, domain of johll; Velitllres, both cs- 
lablisb mid .scls'a'rils'tt ~tr( ~, used to descril)e l;hc (:real.ion 
Of COlnl)allics ('%stalAisti a joinl; VCIILIlI?O\] >), prodil(:tS 
("csl,ablish a. l,eicconiinunicatiolls and dalia nei, work"), 
facilitien ("esi;~blish a. fi~ctory"), and other more ab- 
str*t(:t conccpi, s (e.g. "esli;~biish a nl;ronger foothold hi 
Eul:opo'). 'l!he TIPS'FEI{ task, which requires dis- 
l;inct inforuiatioll for companies, facilities, activities, 
aud pro<blots, makes ii; cl:ucial to disti nguish these dill 
re.toni; word usages regardless of langmtgc. 
SIIO(',UN's results on the final q'IPSTEI(, bench 
irlal'k couipal:ed very favor;d)ly 1,o l;iiosc of other sys 
t, ellls \[.lac.obs el aL, 1993\]. 'l'here are Illaliy ditfcrent 
ways Do view a.nd im~dyze lille li+l&l/y dil\[7.~renl; I)crlc.h 
lnark sl,al;istics, bllt the area ill which SH OGUN's ~l)- 
proa.ch was mosl, clcl-trly distinguished wan in recall 
the percent~ge of dal;~t 1"17Olfl cacti 1;esl, seA, l;\['lag was 
correctly extracted by l, he In'ogrmn. On thin mea.mlre, 
,~1\[O(\] \[I N exl;r&cl;e(t, on ;tverage, 37% more corl'ecl; in- 
formalion tium rely other system in rely conligm'atfion. 
SIIOGUN had somcwhag lower precision (13% lower 
ON aw~rage) titan the highest, prccisiou sysW, m in each 
configuration, meaning that S\[IOG tin also produced ~ 
somewh~d; \]m'ger a.tYiOilltl; o\[ irlcorrecl; information i,h~m 
other sysl,elns. The sysl,elfi> in both |a.llgtl&ges, oil;on 
idenllificd inform~fl, ion tha.l, w~ts noi; found 1)y ~41iy ottier 
nysi, cm, a FOnlli{ ~haL We al;l;ribul;e l.o }u~ving better coy- 
el:age il/ its knowh!dge base l;han o~iloF sysl;Ol\[ln. 
*rl'lliS resem'ch w*Ls Sl)Oll~or()d ill pint. l)y the A(Ivitm:cd \]~.e- 
search Projects Agency (\])O\[)) &rid el.her governincnt ~tgeneles. 
'l'he views *rod COliCliiSiOiP; i:Otltailied ill this d, lcnnicnt are those 
of I;he &ltt;itors ~tild should nol, be hll.erl)rel;ed ;ts represenl.hlg the 
oltlciM policies, eli.her expressed or imlfllcd, .f l.he advmiced 
l{ese&i'cil \]~I'Ojec\[s Agency or l.hc \[J~ (low~i'lilllellL 
The rest, of this paper will describe tile problem 
o\[" multilingual interpretation a.s it; ~ppe~rs in the T\[P- 
STEI{ task, Lhen present: our sohttion, elnphas\]zing 
l,:nowledge st.t:ucl;urcs and knowledge ttcquisition. 
665 
Input: raw texts 
I_U GECk~ 
Japanese 
microelectronics 
Output: object-oriented database "templates" 
<TIEUP RELATIONSHIP-0403-g> := \[ 
TIE UP STATUS: EXISTING \[ 
ENTITY: <Eb\[I'ITY 0403-15> <ENTITY 0403-12>--d......~= 
ACTIVITY: </~-CTIV1TY-0403. 6> \[ <ENTITY 0"~03-12> ,=< 
\ ALIASES:" EC- sthom" 
<ENTrFY 0403-15> '- \ \[ I,OCATION: "UK" (UNKNOWN) Europe (CON'HNENT) 
NAME: Ge,/eral Eie-ctric \ I TYPE: COMPANY AIAASES: "GE" \ ENTITY RELATI!)NSHIP: <ENTITYRELATIONSHIP 0403-8> 
LOCATION: Untied States (C~UNTItY) ~ELA'I IONSHIP- 0403-I 1> 
TYPE: COMPANY ~ \[ 
ENTITY RELATIONSHIP:<EN~ITY P, ELATIONSHIP 0403-10> 
I <ENTITY RELATIONSHIP-O~)3~-I 1> \] 
<ACTIVITY 0403-6> := INDUSTRY: <INDUSTRY-0403-8> 
<INDUSTRY~0403-8> := PRODUCT/SERVICE: (- "gas turbines") 
Figure 1: The TII'STI,',\]~, (MUC-5) tlabt extraction task 
2. TIPSTER TASKS 
TIPS'I~Et{ is a program of the U.S. government Ad- 
winced I{eseareh Projects Agency (AI~,PA).** 'lb em- 
phasize portability across languages and domains, 
the teams in 'I'IPSTEll. dat~ extraction were re- 
quired to develop capabilities and perfbrm benchmark 
tests in two languages English and Japanese and 
two domains -microeh.'eCronics and joint wmtures- 
resulting in four sets of bendunark results in each eval- 
uation. The fi~lal evaluation, known as MUC-5 \[Sund- 
helm, 1993\], was held in August, 1993, and inehldetl 
the four TIPSq?I'~I{ data c'xtraction contractors as well 
as it;:; other sites from four countries. 
Figure 1 illustrates the basic TIPSTEI{ data ex- 
traction task. hi each configuration, systems process 
a sO, of texts and produce a set of database entries, or 
templates. The temple~tes are specified as pt~rt of each 
domain; thus the Japanese l, emplates in the joint ven- 
ture domain are Mmost identicM in structure to the 
English joint venture templates. The task, for each 
tc~xt, combines the recognition of high-level concepts 
(such as the identitication o\[ a joint wmture in a text) 
**Our project, which included GE Corporate l/esearch and 
\])ewdopmen{,, the Center for Machine Translation at Carnegie 
Mellon University, and Mm:tin Marietta Managmnent and Data 
Systems (formerly tIE Aerospace), was one of four reruns in the 
data extraction component of TII~ST\]EH. 
with the discrimination of the meaning of iudividnal 
phrases (such as descriptions of products) and the res- 
olution of references. D~r examl)le , Figure 2 shows a 
very simple example of a production joint venture be- 
tween two companies. 
For each of these texts, the data that must be ex- 
tracted includes the generation of typed objects (such 
as entities and relationships) and slot fills that incorpo- 
rate information, either directly or through inferences, 
from l he texts. Much of this information comes from 
the recognition of highqevel entities and relationships 
such as that shown in Figure i. The rest includes 
much more detailed information, such ms the activity, 
fSeilities and financing involved in a joint venture. Fig- 
ure 3 shows this part of the infornmtion for the samph; 
text, in tim format of the actual correct responses, with 
itMieized annotations to show where tile information 
cotnes from in the example. 
rHw slot; fills in T\[PSq'F,R templates include "set 
tills" drawn from a~ tixed list, such as the text code 
PRODUCTIOB for mauufaetm:ing and the nulnericM code 
20 ("Food and ldndred products")*** lbr processed 
tbod i)rodnction, "string fills" drawn from the ae- 
***'\]'he tmmerical codes for the PRODUCT/SERVICE slot (and the 
groupings of the U. S. govermnent Standm'd hldustry Classifi- 
cation (SIC) scheme. 
666 
<DOCNO> 0659 4/I)OCNO> 
<DI)> SI'\]'TEMla;I';I/28, 1989, 'I'IlIIRSI)AY <II)1)> 
<St)> Oapyright (c) 1989 Kyodo News Service <IS()> 
<TXT> 
KIKKOMAN COI'~ P. Wll ,I, 1 .INK UP WH'tl A TAIWANESE FOOD FIRM IN OCTOI~,I';R TO PI.~OllIJCF, SOY SAUCE IN TAIWAN, COMF'ANY 
OFI~ICI ALS SA \[I ) TI l URSIkAY. 
PRI,',SIDENT KIKKOMAN, CAPITALIZED AT 81) MIIA,ION TAIWAN YUAN (ABOUT 440 MILLION YEN), WILl, BE OWNED 50 PI,\]/(\],;NT 
EACI I I!.Y KIKKOlVlAN AND PRI.;SIDP;NT ENTERPP, ISES CORI'., TAIWAN'S LARGEST FOODSTUFI: MAKER, 
TIlE JOINT VI';NTURE WII,L MANUFACTURESOY SALlCE A\]'TIIE TAIWANESE FIRM'S H,ANTWITII KIKKOMAN'S TECIINOI+OGICAI, 
ASSISTANCF, AND I)ISTRII/UTI'; TIlE PROI)UCT UNDER THI:, KIKKOMAN IIRANI) NAME. 'Fill,', ANNUAL SAI,ES TAI.I.GF:I' lS S1'71' AT 
AROUNI) 3,0110 KII,OLITI~;RS W1TII\[N A I,'t,;W YEAI<S, TI IEY SAIl). 
.... l,'igure 2: A s+unlflC iul)ut; l.cxl; 
<FACII.H'Y 0659-1> := 
I+oCKrION: Taiwan (COIINTP.Y) ,,.IN O(.'TOIII;'R 70 I'ROI)UCE SOY SAU('I,J IN TAIWAN, 
TYPE: FAC1OI~.Y "1711", JOINT VI'~NTURF, WII,LMANUI,TiCTURI';SOYSAUCEATTHIi "IAIWANI':SI,', I"IRM'S 
<INI)USTP, Y- (1659. I> := I'IANT 
INI)USTI(Y- TYPE: PROI)UCTION 
I'I.~(}I)UCr/SERVICli: (20 "SOY ISAtJCtq") 
<INI)USTI.~Y- 0659 2> := 
INI)USTIIY TYPE: SAI ,I,'.S _.ANIJI)ISTRIIHH'IiTHICI'RODUCI\].. 
PllOI)UC'I'/SEI/VICI{: (51 "SOY \[SAUCIt\]") / (51 "\[TI IE I'ROI)LJCTI") 
<ACTIVITY 0659-1> := 771E.IO/N/'VI~NI'URE WII,I,MANI/I,AC?'I\]RI,:SOYSAU(?EATTIll,; "IAIWANI;SI( I,'IRM'S 
tNIJUS'fI<Y: <INI )I/S'I'I<Y 0659-1 > t'I,ANT 
ACTIVH'Y SITI,;: (<FAClI.ITYq)659 1> <};,NTFFY 0659 3>) 
STAI.(TTIME: <TIME 0659 l> ...INOCTOBIqR'IOI'ROIJU('I".SOYSAUUI,;IN "I;IlWAN, 
<ACHVH'Y 0659 2> := 
INI)USTRY: <\[NI)LISTI<Y. 11659 2> 
ACTIVITY SITE: ('litiwan (('OUNTRY) <I?,NTITY l)659 3>) 
<TIMI~ 11659 1> := 
IIL/RING: 1(189 
<OWN\[';RSIIIP 0659 1> ::: ...(,'AI'HTII,IZI{I)A'I'gOMII,LION "IAIWAN YUAN+.. 
WII,I, BE OWNI,,D 50 I'H¢(JI;NT EACII BY KIKKOMAN AND I'RESII)F+NT' I';N'I'I!RI'RISI,:S OWNED: <I';NTFI'Y 0659 3> 
CORP. TOTAL CAPI'FAI,IZATION; 80001)(100 TWI) 
OWNIiRSHII' .%: (<ENTITY 0659 I> 50) (<I~N'HTY {)659 2> 50) 
t"igtm~ 3: t}m't, oF (:orr('cL answer for t, ext; 0659 
ta\]a,l Lcxt, xuch as ''SOY SAUCE'', pointers t,o ol;her 
ohj(,.cLs such as <ER'TITY-0659-:I.> aud a, wtrid;y or 
"tmrm~-dizc<l" fills such as Taiwan (COUNTRY). Tim 
s<.'t fills ol'Lcu Ca, l;t, ui:e. Io(;~d iul'orm;-tt, ion in Lhc l;cxl,, 
while the ol)j(~ct;s (consisl,ing 17t' a,n idctttiIi(n: wil;h ;-1 t'e 
lal;cd Letltpla,I;c fills) oFt, on involve infL't:~uces from many 
different Imrt;s of the Lext. For exmnplc, in Lhis case, 
the objcc, t, ACTIVITY+-0659-1 re\[h~cLs Lhc fairly subLlc 
disl, incl;ion lhaL th<, vcnhu'e will be mamd'~{d,udng soy 
sauce aL l>reside~nL l,',nl;(,rp|fises' phmt (the resull, oli rcI: 
crencc resolution) but that l, hc sMes will bo ca.tried 
ouL somewhere else in 'l'~fiwan (l, hc result, of a. real in 
I'ercnco+). \[n this part, of 1,he t;ask, tm+\ior object-tew!l 
decisions O\[I;CH hinge on the itll,crl>l:(!l;;d,iotl o\[" t;he indi 
vi(hml wor(ts, ma, king |,he t,ask very l(~xicon--im;cnsiw:. 
|nL('rl>rCl;ing the a.cl;iviLy itd'orma, Ci,:)n; LhaL is, 
McnLil'ying whaL each wult,ure is (loiug along wit, h l, he 
approl)ri+:d;c \[)lX)dtl(;Lx alld co(\[(?s, I:(;quil;CS I,;t,owb:xtgo. 
a.bout word us++ge in context. Activil,y words like build, 
e,s~aklish, and create axe .iusL as COIIIIIIO\[i as wot'(ls like 
pr'od't.:e and ma~t.uJhclure+ \[n rna, tty (:a,scs, whel;hcr 
xomc't;hing ix a ,joint vellt, ltrO a,<'.t;ivil,y or not depends 
on ~ fairly detailed ;-ut~dysis of l, he.sc words "build. 
ing a \[a, clory" is dili'crcnl; fi'om "lmilding ;_~ nc;w pla, ne", 
"l)uil(ling a, I)usin(.ss", aml of c(mrsc, \['l:Ollt 'q)ldldiHg a 
prcseltO:?'. Th('.sc similar phr~l,X<'.s <m,u tic)l, Oflly O.vokc 
difl'ercnt; i>roducL <:(>(los, lmL Mso cau ol'l;el~ M\[~!ct, lhl! 
high-level consLrua, l of a, sl,ory, 'l'hc ilfl,erl)rel,aLiol\]s 
o{' word scns,:~s cotJm 1,,agcl,her wit Jr <lotmfiu ~md t+ask 
knc>wl,+(Igc in exLra.ct, ing I;h,:~' a.pl+roprial;<~ inl'ormai, ion 
fi'om t, Im++u I~hr~ses. 
Because ~me of 1,he go~ds of Lhis projocL wa,~; t.o 
dew;l<, I) mot;hods o1' ha, ndling new dom~dns and lau 
guagc, s, il. w+~s import,a.*tk t,o cope. wil, h l;ilcsc <q'ucb'd 
dill'o+rcnccs \]u word usa,ge i.u a geuct's.I way. This in<tar 
lmrtilAouing tim knowledge o1 Lhc sysl;e\]n lilt,() f'O/il' 
coml)ouenLs: (1) gcm~ric, (:g)doma.in dul,eu<hmL, (3) 
la.ll.gu:-tgedepctMot~L, aud (d) (\]o\[i'l:aiil a.il(l, latlgtlal,;c 
depcnd(mL. WiLh Lhc d(q;a.il ()17 a+llarlySis t, ha l imri,s oF 
Lll,~ task requir,c;, such ax I;hosc~ de+scribed al',,:)ve, it, i;; 
esscnl;iM not; only l,c, minimize l;hc ;-ttUO/llll; of ktlov, q 
,:~(lgc that, is d<'t)c'n<hmt, on elf, her language or domain, 
bul; aim() to minimizv the off'oft, of acquiring knowledge 
tha, t, is dependent, on ciLhcr domain or bmgt)age, aud, 
eSl),:'.<;ialiy , knowlodgc (ha.l, is d,~!lmndc\]ll, on bol;h. The 
sccl,ions thai, follow will covc.r t,hcse astmcl,s c~f Otll; so- 
667 
lntion to the TII'STEI{, problem. 
3. LEXICON &: ONTOLOGY 
The previous section flamed some of the problems of 
data extraction in TIPSTEI{. with an emphasis on the 
aspects of the task that require substantial amounts 
of knowledge. We also presented our approach to the 
task by explaining tire synergistic objectives of creat- 
ing generic resources and developing knowledge acqui- 
sition methods. This section will focus on the generic 
resources, while the next section will concentrate on 
acquisition methods. 
The main generic resource of SIIO(\] UN is its core 
ontology of about 1,000 concepts, which was devel- 
oped to support GE's NLToolset lexicon \[,lacobs and 
Rau, 1993; Mcl{oy, 1992\] and had been tested fairly 
thoroughly on a variety of data extraction tusks prior 
to 'HPSTEI{. We augmented the core ontology using 
the CMU ontology from machine l;ranslation \[KBM, 
1989\] and used the extended ontology as the basis tbr 
Japanese lexicon development. The idea of this effort 
was that the Japanese lexicon would mirror the exist- 
ing English lexicon, allowing fbr sharing of tire domain 
independent components of the knowledge base across 
langnages as well as the sharing of any (lomain-specific 
knowledge that would be added. 
For example, the following is the English entry for 
the verb esiablish and its related forms: 
( establish 
:POS verb 
:G-DERI1/S ((-er noun tr actor) (-ment noun ...)) 
: SENSES 
(( establishl 
:EXAMPLES (she established superiority * ... ) 
:SYNTAX (one-obj thatcomp whcomp prespart) 
: TYPE *primary* 
:PAR (c-causal-event) 
: SYNONYMS (set_up) ) 
( est ablish2 
:EXAMPLES (the court established fault) 
:SYNTAX (one-obj thatcomp whcomp prespart) 
: TYPE *primary* 
:PAR (c-deciding) 
: SYNONYMS (determine) 
)) 
: X-DERIVS 
( ( establish-ment-x 
:X-DEKIVS (-meat noun) 
:EXAMPLES (the eating establishment) 
: EXPRESS c-organizat ion 
)) ) 
The ,lapanese lexicon now consists of about \] 3,000 
words. This is somewhat more than the. 10,000 unique 
roots of the English lexicon, but tire /,;nglish lexicon is 
still much richer in morphology and more thoroughly 
tested than the Japanese. Nevertheless, the two lexi- 
cons are roughly comparable and certainly eOmlmJ;ible. 
For example, the Japanese entry for sclsurilsu (~-~.) 
is the following: 
:POS usa 
:G-DENIES () 
:SENSES 
(( setsnritsul 
:SYNTAX () 
:TYPE *primary* 
:PAR (c~causa\].-event) 
:SYNONYMS (establish set_up) 
:NOTE (:nttd-kana ('''15\['~) ~ _'U),,) :jv-dom) 
)) 
The main link between the English and Japanese 
lexicons is through the :PAR field (for parent) in each 
word sense, which joins that sense to its parent in the 
ontology. In this case, the common parent betweeJt es- 
tablish and selsurilsu, c-causal-event (the bringing 
about of events or effects), is a t'airly general category 
that includes two senses of ope~t as well as a variety of 
others like duplicalc iloll(\[ bridge. The reason that eslab- 
lish ends up in this general class is that it is very hard 
to confine any sense of the word to ereation events. 
I\]aving a shared ontology and lexicon format has 
certain adw~,ntages. It is a requirement for using a 
common language processing framework across lan- 
gaages, and it ensm:es that words with similar meat> 
ings in different languages end up with similar repre- 
sentations and ontological restrictions. The next sec- 
tion discusses how this coHufn)ll framework inllst be 
extended for domain-specific usage. 
4. ACQUISITION 
In a task like TIPSTEI/,, we cannot ca.ptm:e all the sub- 
tie distinctions that the task requires in the (:()re lexi-. 
con. Each domain, like joint ventures, requires a large 
amount of very specific knowh'.dge, not only about 
how words like eslablish behave, but also about simple 
racts like that oJ\]ice supplies usually includes things 
like pens and papers while office equipmenl usually in- 
ehnles machines like computers and copiers. Because 
many of these facts are at the intersection of world 
knowledge and word knowledge (that is, they are pat- 
terns of language use that relleet real-word concepts), 
even the most specific pieces of knowledge often seenl 
to apply across hmguages. 
The degree t;o which ontok)gy contributes to in- 
terpretation in any particular domain was, in geu- 
eral, somewhat less than we might have expected. 
For example, the category c-causal-event, inchnles 
not only words that don't haw~ anything to do with 
joint ventures, but also words thai in the .joint ven- 
ture domain could be misinterpreted. The category in .... ~Jh ~" ~)L 
Japanese lnchldcs senses o words hke ~)i:~;~ an( {~:,~, 
which hehave, very similarly to sclsurilsu (iEgM.), but 
doesn't iuel ude many others tIntt a/so behave similarly. 
lit English joint ventures, the extended ('.lass of words 
used to describe the" establishment of a new con~l)a.uy 
includes plan, set 'up, form, and create. In ,lal)anese, 
the class i~n:ludes a~13", /J~a~, ~f/a~, a~, > <, ~md 
668 
,m~. \[n hoth (:a,s(!s, l;hese word classes ~u'e de.t,ermined 
from exaa'aining corpus da, l;{~, with a i)articulaa' empha 
sis on words Ih~l; a, re used to desct:il)e t;he tbrmat;ion of 
new companies. This includes words from different on- 
I,ologicaJ groups aud excludes cerbfin woMs from the 
c-causal-event e;ttegory. 
As wc ha,re l)oinl, ed out, words like c,~Rddisk +rod 
,++clsuviL+u aa'e so eritica,l to the undersl;mlding of joinl, 
venl, ures that; ktlowle(lge td)oul; such words (:a.l bc \[mud 
coded \['O1' ea.(:h ta,lgua.gc' and (IomMn. Ilowcver, doin,~,; 
tiffs hand-coding for ina, ny aspects of the TIIWI'I'H{, 
task woutd not only involve au ext.raordinary amoutd; 
ol'eltbrt, IiIH; it; would thwarl; one (If I,lm maitl ohjcct,ives 
of the proje<:t t,o develo I) methods that ease porl,a, hil- 
i~y ~teross langua,ges and domaius. 
Our "lMddlc groun(.l" s,.)lutiou to capt,H'iug t, hc 
more specialized k,lowledge, rulying heir, her or, gew.:ric 
knowledge nor on l~mgltttge spc<:itic cm:odings, was 
to crc'a, te word classes to rcpres(ull, 1;tl(! informa t, io,l 
needed in the TII)S'I'EIi, dnta extra('tion task, Ix) a,p- 
ply these word cla.sses a crons hmgua.gcs, and t;o CXliand 
them using ~mtomated ('.orp/ls ;i, ila.lysis. ~V(! ol)sel;ve(\[ 
that, a, lt,hough ,lal)nnesc +rod English ha(I ditl'ereut vo 
cabularies atM properties, the ust~ge of words iu G~ch 
,I;tl)attese corpus was very similar I,() the usage of (:ore 
iia, rable I:mglish words iu /:orpora f'l'Olll the s~unc do+ 
mnin. I:'()l' cxatnl)Ic, I, he word tq~tipm+'nl hl English 
joint ventures is w.ry simihu' to I, he woM ,~o'uch~ (: +'~.) 
in ,lat)~u.ese, and tiw l;ask Sl)ecillc dist, in(%ions are t;h(! 
same iu l:,nglish aml ,hq)am!se (e.g., the (listim:tious 
~m,ong ,n,xlica\] cqUil>,H++'n|, , l, ra, nsl),:)rl:atiot~ cqttil)meHt, , 
a,n(I elect;rieal eqtfil/nmu\[;). 
We U)ok a, dvant;~ge of I;his ohserwlLion itl <hwel- 
oping a. two.sta.ge proe('.ss of (hwclopil~g word group 
h~gs across la.ugug~ges. Ou(:e the tmtjor groul/iugs were 
detined (w.a.nuaJly), l;hc autx)tna.l, cd I/IX)tess ()F COrl)US 
a,naJysis consisted of (l) eXl)a.,Miug word class(:s by as- 
soci+~t;ing con(Inon, t'el~d,ively unaml)iguot,s words with 
other classes, and (2) lurthcr CXlmmliu~.~ and hh',tify 
ing aJnbiguitics using a "llool,si;ral)l/ing" in'oc(~ss. 'l'h(~ 
I)oot, st,ral)l)ilu{; i)ro(:+!ss usc.:l the k,towh:dg," that hnd 
already bee. cn(:o<h'd I,o classi/'y a chunk of l,(!xL (ti>r 
example, deci(liug tha, t a Im, rl,icuhu: l)hrasc described 
I, rallsl)orl, at, iOll eq.fil)nl(ml,), and assu,ni\[tg t,hal, wor(Is 
with a high degree of association with that (:al,cgory 
l'Gllnl; ;I.\[S() I)(! ,'et~d;e(I. 
The (it'at st,age o\[' Lhe pl'O(:css st,,~tl'LCd wit.h, E)r 
bo/,h E.glish ;u,d .lal)auese , a sel, of words !,hal, were 
closely i(l(;ntilied wit;h husiness ~ct, ivil, ies (lil,:c "tllaUU- 
fa, cl;ures', a, nd "distrilmt, es'). Using a COrl)US ol7 shout, 
\[0 mi\]lion words (English I'rom l;he W:d! ,Vh'c+t ,/oar- 
'n.al aud .l+q)a,mse from Nikk(:i ,5'hinblt;+), we t o(>k the 
wo,'ds l, haJ. weI:C tnost, likely to a.l>l)car within a window 
or three words of au "a.ct, ivi/.y" word, am:i iri(xt, tmmu 
Mly, to assigu them to pro, duct. classes. Tit(: ~+ta.t, isl.ical 
aualysis used a, weighl,ed mul,tiM in\['orm~d:iou statist, is. 
'l'tds resulted in initiM ,~;roul)hlgs of words iul,o ch,~sscs 
corresponding to I)arti(:ular producl, groups, or codes. 
For example, the following ix IJ.: Euglish class sort( 
st)onding roughly U) SIt', cod(: 38, "Mcastlriug, analyz- 
ing, +rod controlling instrumsnl, s": 
biomedical copier copiers lens lenses 
instrument pacemakers photocopy photocopier 
photocopiers radar navigational microfilm 
monitoring navigation guidance avionics photo 
photographic photography camera clocks watches 
eyeglasses suuglasses glasses Polaroid frames 
The second stage o\[ corpus ~malysis was the 
"hoot, st, rapping" process, I:rom t, he texts that included 
the "good" activity I,ernls, the program assigued ase(, 
of wor(\[ classes, such as thai, almvc, ha,sod on its exisL- 
ing l.:.llOWhxlge base. I"/)r exaanple, Jr" "eyeglasses" ap- 
peared in an activity LexL, t.hat (,ext would h,c nssigucd 
to group 38, a.long wiLh wh~tever other ca.t;egorh:s also 
;H)l)earccl. 'l'hcn, I'or ca.oh word appearing ha every 
i;(}xI; ill tflle eOl:|)llS, w(! a.g;lill appIhxl the IIIIIIMIH.\[ in- 
I'orlna.tion sLaLisi,ie 1,o \[in,:\[ t, he siguilicant rcla.t;iouships 
hel:weeH wor(ls slit\[ gl'Otll)S. Whet, a wotx\[ could lie as-- 
sociated with more t,han one grout) , this I)rocens iden 
tiffed phrases I,hat could help (,o distinguish Lhe woM 
sense, ;rod collecl;ed 811(:\[/ al\[ll)ig/l()tlS W(/I'(IS in t~ sepa- 
ra, I;e list, so that, they could be dealt wil, h ma, tma, lly, i\[' 
lie ccsn& i:y. 
I:igurc 4, shows, for +~ .I ap;mesu sample, t;hc results 
of the c()l'pllS a cm, lysis I)rocess, itMudiug the identi\[i- 
era, ion of the "producl:" words, with \['re(it.mci(!n a.ml 
weights, aml the anMysis of whether the corpus data 
(:oulirllie,:\] wtml, was knowll al>ouL each word. 
In the TIPS'I'I':Ii+ hem:hmarl,:s, ,.re relied (m 
u.umally-corr(.ct, ed list,n, unillg I,he sl,a,t, inl,ic;d weighl;s 
only lo help i'es/)lw: dill'st'slices in select, big among Imll 
titdc' I)otent,ial product descriptious. Ilowcw'.r, in our 
own t,c.sl,s, we t'ouud the i)m'l~:)rmancc o\[' the ma, tmally 
(~dit;ed kuowhxlge (m the activity portion of the tern-+ 
pla/,e to bc only slightly helJ;er I, hau t, hc fully a ut, o 
mated sample. The kuowh',dge huse of word groups 
it.eluded over 4000 woMs in Iq,lglish uud over 2000 in 
,\] &pall(~Se. 
All,houglt SII()(~UN ban 1)Cell I,(!sLIM ilt a series ol' 
govcrnmcut hunchmarl(s, w('. still consider l his method 
to I)c ouly ~ sl,m'l,iug I)Oim,. There arc mauy i)rohlems, 
'1'11('. corpora used for l.:aiuing wer(~ not a good reprc+- 
scut,a.t,ive sample, I)ecnuse t, hey were (h';twtl \['roHI dil- 
\['ereul, sourc,::s frc+tu the t;est sa.mph.s due t.o limit.at.ions 
iu the a.vaila, bility o\[' rcl)rescntat;iv(~ l, rah~ing ma£eri- 
aln. The .Japanese training relied (m s(;glltenl;ing Lhe 
t;rainiJ G corpus inl,o words, a process t;\]l~l.t, occanio.mlly 
iutroduced error. ()thor sourc(~s of' error i t.:h.hxl eases 
where our initial manual groupiltgs iuvolv<,d misinl;er 
prct;aLious of l;hc t;asl,:. 
Neverl,heless, hot:h the cor(~ ontology n.ud the au 
I,omal, cd l;ra.ildng reel,hod had ~ sigtdli,::aul, impac.I; ou 
SIIO(ltJN's renulls iu TIPS'I'I;;IL The Ilexl, seetiol. 
prcselH;s a. hrief SllllllIIF~ry O\[' /,h()s(! result,s. 
5. R,ESULTS 
I:igu.,c 5 shows I,he overall recall, l/recision a ud I"- 
IIIC~)SlII'(+ SCOI'(!S \['(:'1' SII()(~UN on t, lm E>ur c,.)ltIigtiration 
o1' tile \[iual TIPS'I'I';II. (M tO(', 5) Iw.uchniark. I';:IV a.nd 
669 
# Score Word 
424 52.1 ~'7 }" 
373 77.3 72~{ ?,h 
327 88.9 d; 
314 95,2 2/.~ 
288 38.2 I.~\] 13- 
242 36.2 -~?~ 
218 47.3 F~\[ IkJ 
198 29.2 :q 2/If ::t-- 3t -- 
189 57.8 ~ul~ 
186 51.4 ,~rn" ~' 
154 28.5 Y?~- / z 
137 33.1 ~\['d,~ {g 
128 25.8 I~") p 
127 352.3 ,~" K,S~ 
125 93.0 \[J ~"~- I" 
123 26.2 t~ 
118 988.6 P 0 S 
118 37,O 7"9 2/F 
ll0 108.1 '2 7 /" ')~'/ 
108 26, I ,il~{d" 
105 32.9 >1": ,JlJJ j~ 
104 37.5 ,)d 
93 58,1 ;(i"d\[\] 
> (find-industry-conflicts) 
.R ~'-( 7" ambiguous: ((36 3475 4.6) (48 390 4.1) (73 1082 4.7) (78 205 5.9) (27 62 4.5)) 
,~(\]p: new: ((45 47 4.8)) 
")':}~ ambiguous: ((28 1994 6.8) (20 400 4,6)) 
~J/:~\[-;l; ambiguous: ((62 128 5.0) (63 160 6.1) (65 80 5. l)) 
~J -- ¢/'-- new: ((26 120 9.1))) 
"tz ~N ambiguous: ((36 5640 4.4) (48 1026 4.6)) 
I, S I ambiguous: ((36 3977 4.4) (48 913 5.0)) 
{iJ~q { new: ((26 5 5.8)) 
~,~,,~ new: ((54 2 4.1)) 
JT~-F confirmed: ((61 1201 5.6)) 
5(~ll new: ((13 274 4.4)) 
-2 t) y\]- 2/ new: (161 26 5.2)) 
{\[:,,X,+l, new: ((54 5 4.1)) 
4' 2/3/-- .71 ~ & ~ "1" 14"~ q" ambiguous: ((36 1800 6.5) (48 144 5.5) (73 552 6.5) (78 108 7.7)) 
"~"/' 2/new: ((78 36 4.1)) 
2/2/;'3~--~ 2/b ambiguous: ((36 6603 6.2) (73 2026 6.3) (48 528 5.2) (78 396 7.5)) 
t\[I ri3"_~ new: ((33 36 4,0)) 
5\[':'~.-." confirmed: ((65 656 5.1)) 
~\[':)~(,g ambiguous: ((36 11231 4.3) (73 3437 4.3) (78 677 5.5)) 
_T_2/J;/ confim~ed: ((37 11744.4)) 
U 7j~.',y b contirmcd: ((35 264 4.2)) 
5~J'J 9 3/ confirmed: (03 48 4.7)) 
7"/( ,7, ~ O --,5. different: ((54 4 4. I )) vs. (20) 
Figure 4: Some results of corpus analysis 
JJV are the F, nglish and Japanese joint venture tests, 
and EME and JME are the two microelectronics test 
sets. R,ecall is the percentage of possible information 
that is correctly identified by the system. Precision is 
the percentage of information produced by the system 
that is correct. The F-measure is the geometric mean 
of recall and precision. 
Rec /)re F-meas 
EJV 57 49 52.8 
JJV 57 64 60.1 
EME 50 48 49.2 
3ME 60 53 56.3 
Figure 5: SIIOGUN Scores for MUC-5 
Scores as low as 50 recall may appear low, and cer- 
tainly leave room h)r improvement. A 50 recall mea- 
sure means that the system only correctly recovered 
half of the possible information, on average, from each 
text. llowever, by t~ number of relative comparisons, 
these nmnbers are good. They are a significant im- 
provement over previous benchmarks, and are close to 
(;he recall and precision scores of the GE system on 
nmch easier tests. The TIPSTER (,ask is quite dif- 
ficult, wif, h trained human iiitelligence analysts often 
producing recall scores in the 70s. 
As we have pointed out, SHOGUN's recall was, 
on average, 37% higher than any other system in each 
configuration, although the precision was 13% lower 
than the system with the best precision in each toni(a- 
ural, ion. For example, the next best systeIrt in Fmglish 
joint ventm'es (l~aV) had 38 recall and 58 precis(old, 
and the next best system in Japanese .joint ventures (a 
different system) had 42 recall and 67 precision. 
Much of the difh.'rence in perff)rmance between 
SIIOGUN and other systems can be attributed to dif- 
ticult portions of the task, where SHOGUN somethnes 
ln*d recall scores as rllllch as 3 or 4 tidies 3.8 high as 
other systems. The portions of the joint ventm:e tem- 
plate shown in Figure 3 are examples of' such com- 
ponents, l~ecause these were. the most knowledge- 
intensiw.' components of the task, we believe that the 
results validate SIIOG UN's approach to knowledge ac- 
quisition. Certainly the system had much better cover: 
age than other systems, and we attribute this result to 
the representation gild automation used in word sense 
interpretation. 
Figm:e 6 gives an inh)rmal analysis of the level and 
type ofeflbrt used ill each configuration. Although the 
Japanese scores were generally higher thtm English, 
the Japanese contigurations largely relied on the En- 
glish knowledge development. The level of ell'or( for 
Japanese joint ventures was higher than English be- 
cause the English system started out with nmch more 
than the aal>anese system (for example, we ah:eady had 
a fairly well developed English nan\](? recognition eom- 
poIlent). By contrast, the Japanese microelectronics 
configuration derived ahnost entirely fl:om the English, 
with ahnost no eflbrt required t}om ,lapanese speakers. 
Many other sites participated in the TIPSTF, I{, 
project and the MUC evahmtions, including two oth- 
ers \[Cowie and Pustqjovsky, 1993; Weischedel el al., 
1993\] that covered both domains and both languages, 
and one ol, her \[Lehnert el al., 1993\] that ff)cused on lex- 
ical acquisition, although only in English. In addition, 
670 
Domain~Language 
English joint ventures 
Effort~Skiff Level 
1 person-year, system developers, 
native English speakers 
Japanese joint ventures 1.5 person-years, mostly Japanese 
college students with non-native 
developers 
English micro-electronics 3 person-months, system developers, 
native speakers, no knowledge of ME 
Japanese micro-electronics 2 person-months, non-developers, 
non-native speakers (with some help 
from natives, developers) 
Other Notes 
Some effort not reflected in results 
Difficult to measure because 
of many experiments 
Least efficient, but most 
interesting effort 
Best overall results 
Lowest overafl results (but 
explained by sample variation) 
Last configuration done, least 
work, good results (but not 
refined) 
Figure 6: I:,\[\['ort required for eax;h dOlilaili lind language 
l, here \]ms I)c(:n oliher signific,aut relidx~d work hi rolmst 
processing of I;exts, notM)ly \[llobbs cl al.> 1992\]; how 
ever, this rc,semmh has gener~dly emphasize(l synt,;u%ic 
coverage rather th;m lcxical (:overage. Finally, rc,latcd 
researc,h in lexical acquisition \[Zernik, ;1991\] focuses 
on corm lexicnl resources r~tller I,|l;.tli on c,ustomizing 
the lexicon through the use of a rel)resentative corlms. 
I\]enc,e, the rese.;u:ch i;}iat we have t)reseni, ed has ad- 
vanced the state of the a, rt \])()tit in t, he use of I;\]ie corpus 
I,o ideni, ify word sl!iise in|'ornw.l, ion ~il(I the denionstrn: 
Lion of lrutll,ilingua.\] CalmbililAos. 
6. CONCLUSION 
SII()GUN's N)l)roach t;o word sense interpretation 
a.cross l\[mguages uses I;mguagc-in(lc,i)endenl~ infoi:uia,- 
l, ioIl to troll D word S(?llS(?S in dili);renl, la, llglla,ges. A 
core onto\]ogy of 1,000 c,ouc,epts links senses tlu~t ~m'. 
domain-indel)endcnt, llowever, ill th(! l,asks on which 
SIIOG UN lilts been tested, (\[Olll&iil sl)e('ific word SellS(? 
inforniation is more ('title,at. l,'or this i\[ioi'(~ sl)ecia.l- 
ized sense knowledge, tim sysl;eni ilS(~s &ll iimow~tiv('. 
method of ta:aining a.nd I)ool,stra.pl)iug using a~ corpus, 
usillg i~ sta.tisl, ic,al mmlysis to help assign words mid 
phrases 1;o la.nguago-inde\[>end0n{ groupings. This ap 
l)rOac,h signific,anl;ly sped the ;:requisition el' word sense 
infbrm<%ion in 'I'II)S'I'I';I{,, resulting in high (:overage 
on the niost diilicult (;olllpOltO.ilt~S of the task. 
References 
\[Cowie ;Uld l'ustejovsky, 1993\] d. Cowic aml ,1. l)ust.e - 
jovsky. I)cscription of the I)IDEI{OT systcln as 
usc,d for TII)S'L'ER I.ext. In P~vceedings of lhc !I'HL 
ST'I';I~ Pha.se l Final Mecliu9, Sel)tembcr 1993. 
\[Hobbs el aL, 1992\] ,l. R. l\]obbs, 1). E. Appelt, 
J. I~em', M. 'l;yson, ~md 1). Magerman. II,ol)ust 
processing of re;d-world natural-hmguag(~ texts. In 
Paul S. ;lac.ol)s, edil;or, 7'ezl-Ba~cd Intcllige~l Sys- 
lems: Currcnl Research and P'r'aclice in h@)rmalioT~ 
lCxh'aclio'n and Relrieval. Lawrence Erlbaum Asso- 
c,ial;es, IlillsdMe, N J, 1992. 
\[Jacobs and H.~m, 11990\] Pmfl Ja.c,obs and l,is~ l(au. 
SCIS()It,: Extracting information from on-line news. 
CommmHcation.s of the As.sociatioT~ for (:omqralhtg 
Machiuc'ry, 33(I1):88 97, Novcndn'a' 1990. 
\[.lacobs tu,d Rau, 1997,\] P. S..I;~cs)bs mid I. F. Rau. 
Innovations iu l;ext inl:erpret.ation. ArhJic~al lnlelli- 
.qence, 63:143 \]9l, 1993. 
IJacol)s cl aL, 19931 I ). .lacol)s, (I. Krupka, I,. I(.au, 
M. Ma.uldin, T. Mit~mmr;~, T. Kittmi, I. Sider, aud 
I~. Childm The 'I'\[PS'I?F, II,/SI\[O(,'UN proj(!cl;. In 
t)roce~di,g.~ of lhc 7'1l~$7'1','1~, Phase 1 I,'i,al Meet- 
i'ng, S~ul M;~l.co, CA, Sq)l;eml)er 1993. Mol:gau 14dill\[ L 
IIN~IIII, 
\[,lacobs, 1990\] Paul ,la.cobs. To p~u:se or not i,o pars(',: 
I{,elation-driv(m t,cxt sldinmiitg. In Proceeding.~ of 
lhe 73irleenlh l',h/rnaliomd Uo~d?rcnc~ on Complt- 
lalional Li,guislics, p;~ges L94 198, llelsinki, Fin 
hmd, 1990. 
\[KBM, 1989\] The KIIMT RCl)ort. Te(:huic,aL report, 
(hinter for Ma('hiu(: Translation, Carnegie IVlellotl 
Univ(wsity, 1989. 
\[l,ehncrt el al,, 1993\] W. I,ehnert, ,I. Mc(3a.i'l;hy> 
S. Soderland, E. /~,iloff, (L (\]a.r(lie, ,I. I)(q.erson, and 
F. I"cug. I)escril)tion o\[' the CII{(J/\[S system used 
for 'I'\[I)S'\['EI{ \[;ext extra(:tion. In IJroccc¢liT~gs of ther 
771),b"1'1¢1~ Phase 1 l"inal Mccliug, Sepl:eml)er 1993. 
\[Mcl/,oy, 19921 Stlsa.tt Mc,l{oy. Using mull;il)le knowl- 
edge sources for word sense discrimina.tiou. (:om.pu- 
tatioual I,i~guislic,s, 18(I), March 1992. 
\[Sun(lheina, 1993\] I~;c!th Sundheim, editor. Proceed- 
in:l s oJ'lhe l"iJTh Mcs.sagc Understanding Cm@rcncc 
(MU(.,'-5). Morgan I(;ulfmaiui I)ublishers, San Ma 
tee, (ia., August 1993. 
\[Weisc,hedel ctal., 19921\] R. Weisc,hedel, 1). Ayuso, 
S. Ik)isen, II. Fox, II.. Ingi'i¢ h T. Matsuka.wa, 
C. I)apa.georgiou, 1). Ma.clm.ughlin, M. Kitagawa, 
T. Sakai, J. Abe, I\[. Ilosihi, Y. Miyanloto, ;rod 
S. Miller. BIIN PI,UM exe('lttivesmimlary. In Pro- 
ceedings of the 7'll),5'T1¢1~ l'Dasc l Final Mccling, 
Sel)tenlber 1993. 
\[Zernik, 1.991\] U. Zerllik, editor, l,~:l:ical Acq~tisi- 
lion: Using O'n-Lrnc l~(so'urcc.s to Ihtild a I;cmicou. 
l,;~wr('uc,e Erlba.uul Assoc,iates, lliLlsdale, N,I, 1991. 
671 
