Automatic Thesaurus Generation through Multiple Filtering 
Kyo KAGEURA t, Keita TSUJI*, and Akiko, N. AIZAWA * 
• National Institute of hffonnatics 
2-1.-2 Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430 Japan 
E-Mail: {kyo,akiko} @nii.ac.jp 
tGraduate School of Education, University of Tokyo, 
7-3-1 Hongo, Bunkyo-ku, Tokyo, 113 Japan 
E-Mail: i34188{hn-unix.cc.u-tokyo.ac.jp 
Abstract; 
11, this paper, we propose a method of gen- 
(',rating bilingual keyword eh.lsters or thesauri 
from parallel or comi.m, able bilingual corpora. 
The method combines nmrphological and lex- 
ical processing, bilingual word aligmnent, and 
graph-theoretic cluster generation. An experi- 
ment shows that the method is promising. 
1 Introduction 
In this paper, we propose a method of auto- 
matte bilingual thesaurus generation by a com- 
bination of methods or multiple tiltering. The 
procedure consists of three modules: (i) a mor- 
phological and lexical processing module, (it) a 
translation pair extraction module, and (iii) a 
cluster generation module. The method takes 
parallel or comparable corpora as input, and 
produces as outlmt bilingual keyword clusters 
with a reasonable computational cost. 
Our aim is to construct domain-oriented 
bilingual thesauri, which are much in need both 
for cross- IR and tbr technical tr~msla- 
tors. We assume domain-dependent parallel or 
comparM)le corpora as a source of inibrmation, 
which are. abundant in case of Japanese and En- 
glish. 
The techniques used in each module are 
reasonably well developed, including statistical 
word alignment methods. Itowever, there re- 
main at le.ast three problems: (i) ambiguity of 
multiple hNmx combinations ill an aligmnent, 
which cannot be resolved by purely statistical 
methods, (it) syntagmatie unit mismatches, es- 
pecially in such cases as English and Jal)anese, 
and (iii) difficulty ill final cleaning-up 1 . 
In this paper, we show that the proper com- 
bination of the above modules can be useful es- 
pecially for resolving the cleaning-up problem 
and can produce good results in bilingual ellis- 
ter or thesaurus generation. 
2 Method 
The procedure for thesaurus generation con- 
sists of the following three main nlodules. 
(1) Morphological and lexical processing mod- 
ule: keyword milts 2 for English and Japanese 
are extracted separately. 
(2) Translation pair extraction module: statis- 
tical weighting is applie.d to a corpus which has 
been through the morl}hological and lexical pro- 
cessing module. The ailn of this stage is not to 
determine mdque translation pairs, but to re- 
strict translation candidates to a reasonable ex- 
tent. 
(3) Cleaning and cluster generation module: 
a bilingual keyword graph is constructed on 
the basis of" the pairs extracted at translation 
pair extraction module, and a graph-theoretic 
method is applied to the keyword graph, to gen- 
erate proper keyword clusters by removing er- 
roneous links. 
If we want to obtain a clean lexicon, minor trans- 
lation variations tend to be omitted, while many errors 
would be included if we want to retain minor variations. 
2 The word 'keyword' implies words that are impof 
tant with respect to documents or domains. In this pa- 
per, we use the word for convenience, roughly in the 
81une se~lse as "content-bearing words". If necessary, a 
module of keyword or terin weighting (e.g. Fl'antzi $¢ 
Ananiadou 1995; Nakagawa & Mort 1998) can be incor- 
porated easily. 
397 
2.1 Morphological gz lexical processing 
At this stage, basic lexical units or keyword 
candidates are extracted. We separately extract 
minimum or shortest units and maxinnlm or 
longest complex units as syntagmatic units for 
keyword candidates. So two outputs are pro- 
duccd from this module, i.e. a bilingual key- 
word corpora of minimum units and another of 
maximum units. 
The processing proceeds as follows: 
(a) Morphological analysis 
First, the cortms is morphologically anal- 
ysed and POS-tagged. Currently, JUMAN3.5 
(Kurohashi ~z Nagao 1998) is used for Japanese 
and LT_POS/LT_CHUNK (Mikheev 1996) is 
used for English. 
(bl) Extraetlon of minimum units 
Minimum units in English are simply de- 
fined as non-flmctional simple words extracted 
from the output of LT_POS. Minimum mean- 
ingful units in Japanese are defined as: 
C_Prefix* (C_AdvlC_AdjlN) C Suffix* 
where C_ indicates that the unit should consist 
of either Chinese characters or Katakana 3 . 
(b2) Extraction of maximum units 
Maximum complex units for English are the 
units extracted by LT_CHUNK, with some ad- 
hoc modifications. 
Maximum complex units fin' Japanese are 
defined by the following basic pattern, 
^C_Adj * (C_Affix l C_tdv l C_Adj \[ N) + 
where ^C means that the unit should begin with 
either Chinese character or Katakana. The pat- 
tern remains deliberately coarse, to absorb er- 
rors by JUMAN. Coarse patterns with simple 
character type restrictions produce better re- 
sults than grammatically well-defined syntag- 
matic patterns. A separate stop word list for 
affixes is also prepared together with an excep- 
tional treatment routine, to make the Japanese 
units better corresl)ond to English units 4 . 
After these processes, two corpora, one con- 
sisting of minimum units and the other of max- 
3 In addition, we have made a few ad-hoc rules to 
screen out some consistent errors produced by the mor- 
phological analysers. 
4 For instance, the Japanese suffix 'Ill' is eliminated 
because it corresponds in most cases to the English word 
'for', which tends to be excluded fi'om chunks made by 
LT_CHUNK. 
imum units, are created. 
Intermediate constituent units are not ex- 
tracted, because their inter-lingnal unit corre- 
spondence is less reliable. Also, many impor- 
tant intermediate units of longer complex units 
appear themselves as an independent complex 
unit in a large domain-specific corpus, and, even 
if they do not, intermediate units can be ex- 
tracted on the basis of minimum and maximum 
translation pairs if necessary. 
2.2 Extraction of translation candidates 
The module for extracting translation can- 
didate pairs consists of statistical weighting and 
postprocessing. These are applied to the data of 
nfinimum units and maximum units separately. 
After that, the two data are merged to make 
input for the cluster generation module. 
(a) Statistical weighting 
Many methods of extracting lexical transla- 
tion pairs have been proposed (Daille, Gaussier 
& Langd 1994; Eijk t993; Fung 1995; Gale £~ 
Church 1991; Hiemstra 1996; ltull 1998; Ku- 
piec 1993; Melamed 1996; Smadja, McKeown & 
Hatzivmssiloglou 1996). Though it, is ditficult to 
evaluate the performance of existing methods as 
they use ditferent corpora for evaluation 5 , the 
performance does not seem to be radically dif- 
ferent. We adopted log-likelihood ratio (Dan- 
ning 1993), which gave the best pertbrmance 
among crude non-iterative methods in our test 
experiments 6 . 
(b) Postproeessing filter 
As the output of statistical weighting is sim- 
ply a weighted list of all English and Japanese 
co-occurring pairs, it; is necessary to restrict 
translation candidates so that they can be ef- 
t~ctively used in the graph-theoretic cluster gen- 
eration module. In addition to restricting pos- 
sible translation pairs, it is necessary to deter- 
mine unique translation pairs for hapax legom- 
ena. We use both macro- and micro-filtering 
heuristics to restrict translation candidates. 
'~ A common testbed exists for French-English align- 
ment (Veronis 1996-99) but not for Japanese-English. 
6 At the time of writing this paper, we have finished a 
preliminary comparative experiments of various meth- 
ods, among which the method proposed by Melamed 
(1996) gave by far the best result. We are thus plan- 
ning to replace this module with the method proposed 
by Melamed (1996). 
398 
Two macro heuristics, applied to the over- 
all list of pairs, are defined, i.e. (i) a proper 
translation should have a statistical score higher 
than the threshold Xs, and (ii) a keyword 
should have maximally Xc translations or Xp x 
token frequertcy when the frequency is less 
than Xc. 
Micro heuristics uses the information within 
each alignment; we assume that a keyword in 
one  only has one translation within 
an aligninent r . Selecting unique pairs in each 
alignment is achieved by recursively taking a 
pair with tile highest score within an alignment, 
ead~ time deleting other pairs which have the 
same English or Japanese elements 8 . 
After this process, the data of nlininmm 
units and maximum units are merged, which 
constitutes input for the, next stage. 
2.3 Graph-theoretic cluster generation 
Up to this stage, the cooccurrence inforlna- 
tion used to extract pairs has the depth of only 
one. In order to eliminate erroneous transla- 
tions, we re-organise the extracted pairs into 
graph and use multi-level topological informa- 
tion by applying tile graph-theoretic method. 
For exi)lanation , let us assume that we obtained 
the data in Table 1 fi'om the previous module 
~us an input to this module. 
Firstly, the initial ke!jword graph is con- 
structed, where each node represents an English 
or JaI)anese keyword, and a link or edge repre- 
se.nts the pairwise relation of corresponding key- 
words. W(' define the capacit~j or strength of a 
link by the frequency of occurrence of the pair 
in the corpus, i.e.. the nmnber of alignments 
in which the pair occurs 9 . Figure 1 shows the 
This is not true for longer alignment units such as 
full texts. However, this will apply to parallel titles and 
abstracts which are readily available. Many lexical align- 
ment methods starting fi'om sentence-levd aligmnents 
assume this or some variations of this. 
Many maximum unit pairs in fact have the same 
score. We used the arithmetic mean of the constituent 
minimum units to resolve aligmnent ambiguity. 
9 The score of likelihood ratio is a possible alternative 
for link eai)acity, but the result of a preliminary experi- 
ment was no better. In addition, after selecting pairs by 
threshold, whether a pair constitutes a proper transla- 
tion or not is not a matter of weight, because threshold 
setting implies that all pairs above that are regarded as 
correct. So we adopt simple frequency as the link ca- 
pacity. Itowever, we notice a lack of atfinity/)etween the 
Japanese keywords English keywords frequency 
- U -- b ~ information retrieval 1 
g- -- q -- b ° keyword 39 
5- --~ .7. b }~..~ information retrieval 1 
5- 4 ~ X b ~,.'~'~ text retrieval 6 
5- 4- X I- ~'~.~ text search 3 
J}~ >1 t~. ~rll/-~ iiii" keyword 1 
~ J."~.l'~ {~.~. "~"~ information retrieval 1 
'\]~i{fi}'~ information gathering 4 
'l'~ N~ information retreival t 
i'~ ~'~ information retrieval 320 
'1~ N~'5.'~ information search 5 
t~ ~f{l\[?{ ~JS inibrmation gathering 6 
'\]'~11~ information retrieval 1 
~i~t;~'~ bibliographic search 1 
~i}J~tt~ document retrieval 11 
~ ~: ~,~ "-.4~ document retrieval 19 
~'~ text retrieval 1 
Table 1. Input exanlple for cluster generation 
iifitial keyword graph made from t, he data in 
Table 1. The task now is to detect and re- 
move erroneous links to generate independent 
graphs or clusters consisting of I)roper transla- 
tion pairs 1° . 
infornlalion galhcrillg O,ihlie,t, llqphic r,'uie~td) 
Figure I. Example of initial keyword graph 
The detection algorithm is based on the sim- 
ple principle that sets of links, which decompose 
a connected keyword cluster into disjoint sub- 
clusters when they are removed fronl tile origi- 
nal cluster, are candidates for improt)er transla- 
tions. In graph theory, such a link set is called 
an edge cut and the edge cut with the min- 
imal total capacity is called a m, inimum edge 
cut. A minimum edge cut does not necessarily 
imply a single translation error. An efficient al- 
statistical alignment method we used here and the deft- 
nition of link cat)acity, which is currently under exami- 
nation and will be iml)roved by renewing the alignment 
module. 
m This approach is radically different fl'om statisti- 
cally oriented word clustering (Deerwester et. al 1990; 
Finch 1993; Grefenstette 1994; Schiitze & Pedersen 1997; 
Strzalkowski 1994); this is why we use the word '(:luster 
generation' instead of 'clustering'. 
399 
I~eworU t2~,f%1: kyewo~rd .~)/;/#o;'.','} 
i I ~,,/~ 0 toxt search =F - '/-- 4"~J " "~" z / 
,t,,,,,,,~{D/ /:r~z, kt~£ ~ .... core cl I~te v~.~ I (le'tl relR~i(ID ~Y'v"t 
g'~l,~t#t~'..~;J*\/\ \x. . ~',.J.Y~¢?;t,' (Ivld.'-,r¢,kJ I ~lnfoll~latlon rotr~lva/ £ ) Id~.~u.lent 
inJorln(ttion .I~. ~20~. \ .A~ 1 ~;'etrt~:val retrievtdl /V xl \~f~l ~\ 19 
r. / i ,,,~.~" , ~u~nl rotfiova, 
x\ 6 ~ifo a h . / , I 
I \ I thihho r.pldcTl~iz't~fl* :':-... "6../. \.", f :. 
infon'hzdiongalh'el~ng ~ \ ~/ ~ '\[~ ~ / bibl~graphic sea r-ch 
'u! 
(a) (b) 
Figure 2. Steps of graph-theoretic cluster generation 
(c) 
gorithm exists for minimum edge cut detection 
(Nagamochi 1993). 
Our procedure first checks links that should 
not be eliminated, using the conditions: (i) the 
frequency is no less than Na, (ii) the Japanese 
and English notations are identical, or (iii) ei- 
ther of the Japanese or English expressions have 
only one corresponding translation (Figure 2 
(a); it is assumed that N~ = N/~ = Ne = 3). 
Secondly, core keywords whose fi'equency is no 
less than NZ are checked (Figure 2 (b)). This is 
used for the restriction that each cluster should 
include at le~t one core keyword. Lastly, edge 
cuts with a total capacity of less than Ne are 
detected and removed (Figure 2 (c)). This pro- 
cedure is repeated recursivety until no fllrther 
application is possible. Figure 3 shows the state 
after these steps are applied. 
./ 
/ 
! 
I. 
"'" .......... :L ..... ~ ........... 2 ......... ~'- 
Figure 3. Generated clusters 
3 Experiment 
3.1 Settings and procedures 
We applied the method to Japanese and 
English bilingual parallel corpus consisting of 
25534 title pairs in the field of computer sci- 
ence. Table 2 shows the basic quantitative infor- 
mation after morphological and lexical filtering 
was applied. 
Mininmm units 
Japanese Token: 178091 Type: 14938 
English Token: 154554 Type: 12634 
Maximum units 
Japanese Token: 89742 Type: 38813 
English Token: 80018 Type: 41693 
Table 2. Basic quantity of the data 
In the pair extraction module, the threshold 
Xs' was set to 1011 . The parameter X c w~s 
set to 10 and Xp to 0.5. As a result, 28905 
translation candidate pairs were obtained, with 
24855 Japanese and 23430 English keywords. 
Of these, 20071 pairs occurred only once and 
3581 only twice. The most frequent pair oc- 
curred 3196 times in the corpus. 8242 (28.5%) 
were minimum unit pairs, and 20663 (71.5%) 
were maximum unit pairs. 
Table 3 shows the number of keywords which 
had N translations. On average, a Japanese 
keyword had 1.16 English translations, while 
an English keyword had 1.23 Japanese trans- 
lations. 
N Jap. Eng. N ,lap. En. 
1 21796 19778 5 62 157 
2 2409 2693 6 10 59 
3 412 437 7 7 17 
4 159 285 8 0 4 
Table 3. Number of translations 
ix This is purely heuristic. Minimum units and maxi- 
mum units are given ditferent scores. But only 3 pairs 
below this threshold were proper translation pairs in 100 
random samples of minimum unit pairs, and 5 in 100 
samples of maxinmn~ units. 
400 
Evaluating recall and precision on the ba- 
sis of 100 randonfly selected title pairs, which 
consisted of 778 keyword token pairs, the pre- 
cision tokenwise was 84.06% (654 correct trans- 
lations) and the recall was 87.08% (654 of 751 
correct pairs). Typewise precision was 81.65% 
(543 correct of 665 pairs). 
The initial keyword graph generated fi'om 
these 28905 translation candidates consisted of 
19527 independent subgraphs, with the largest 
cluster containing 2701 pairs (i.e. 9.3% of all 
the pairs). The cluster generation method was 
applied with parameters Na =: 4, Ne = 10 
and N/~ = 1) 2 . As a result, 893 translation 
pairs were removed, and 20357 bilingual clus- 
ters were generated. The maximum cluster now 
contained only 64 pairs. Table 4 :shows the num- 
ber of clusters by size given by number of pairs. 
size no. of clusters size no. of (:lusters 
1 16693 5-9 322 
2 2354 10-19 52 
3 504 20-64 22 
4 410 
Table 4. Number of chlsters by size 
3.2 Overall evaluation 
The result was manually evaluated fi'om two 
points of view, i.e. consistency of clusters and 
cocrectness of link removal ~3 . 
(1) rib check the internal consistency, clusters 
were classified into three groul)S by size, and 
were separately evaluated. 2000 'small' clusters, 
consisting of only one pair, were randomly sam- 
pled and evaluated as 'correct' (c), 'more or less 
correct' (m) or ~wrong' (w). 4t}0 medimn size 
clusters consisting of 2-9 pairs and all the 74 
large clusters consisting of 10 or more pairs were 
evaluated as 'consistent' (c: consisting only of 
closely related keywords), 'mostly consistent' 
(Ill: consisting mostly of related keywords), 'hy- 
brid' (t1: consisting of two or more different key- 
word groups: 11) or q)ad' (w). Table 5 shows 
the result of the evaluation. The general per- 
formance is very good, with more or less 80% of 
the clusters being meaningflfl. 
12 This is again determined heuristically. For an exami- 
nation of the effect of parameters, see Aizawa & Kageura 
(to apl)ear). 
~3 The evaluation was done by the first author. Cur- 
rently no cross-checking has been carried out. 
For small clusters, the performance was sep- 
arately evaluated for minimuln and maximuln 
refit pairs. Note that the ratio of maximum 
unit pairs is comparatively higher in the small 
cluster than the overall average. Most pairs 
ewfluated as partially correct, as well as some 
wrong pairs, suffered from mismatch of the syn- 
tagmatic units. 
c m w total 
Small 1389 370 241 2000 
(69.5%) (18.5%) (12.1%) (100%) 
milfimum 288 26 69 383 
(75.2%) (6.8%) (18.0%) 19.2% 
maximum 1101 344 172 1617 
(68.1%) (21.3%) (10.6%) 80.9% 
c m h w 
Medium 116 148 32 104 
(29.0%) (37.0%) (8.0%) (26.0%) 
Large 8 18 43 5 
(lo.8%) (24.3%) (58.1%) (&8%) 
Table 5. Evaluation of internal consistency 
73% of tile medium sized clusters were 'cor- 
reel), 'mostly correct' or 'hybrid'. Among the 
'lnostly con'ect' and 'hybrid' clusters, 97 (91 
and 6 respectively) were mainly caused by the 
mismatch of the units. For instance, in the 
case: { Kid, i~iN'fL, ~i~@, optimization, opti- 
mal, optimisation, optimum, network optimiza- 
tion }, the last English keyword has the excess 
unit 'network'. Other 'mostly correct' and 'hy- 
brid' chtsters were due to the l)roblem of corpus 
frequencies. 
Among the large clusters, more than half 
were qlybrid '14 . Among the hnostly correct' 
and qlybrid' large (;lusters, only 8 (3--t-5) were 
due to unit mismatch, while 53 (15+38) were 
due to quantitative factors. This shows a strik- 
ing contrast to the medium sized clusters. Large 
hybrid clusters tended to include lnany common 
word pairs which occur fi'equently. For instance, 
in the largest chlster, ') x ¢ .z, system' (3196), 
'lJ~l~} development' (1097), '~tki~\] design' (1073), 
and 'NiL enviromnent' (890) are included due 
to indirect associations. The tbllowing are two 
examples of hybrid clusters, whose hybridness 
comes fi:om quantitative factors and unit mis- 
matches respectively: 
Example 1: ~fg2:/~.{C6/~tJ/4) -x" ')/overview/outline/ 
summary/smmnarization/overall 
14 And most of the sub-clusters in these hybrid clusters 
are 'mostly correct'. 
401 
/pattern/patterns/patten/patterm matching 
In the first case, the 'overall' group and the 
'summary' group are mixed up. In the sec- 
ond case, the mismatch of syntagmatic units is 
caused by borrowed words. In fact, many errors 
caused by the mismatch of syntagmatic units in- 
volve borrowed words written in Katakana. 
(2) To look at the perfbrmance of graph- 
theoretic cluster generation, we exanfined the 
removed pairs fl'om two points of view, i.e. the 
correctness of link removal and the internal con- 
sistency of clusters generated by link remowfl. 
For the former, we introduced three categories 
for evaluation: mismatched pairs correctly re- 
moved (c), proper translation pairs wrongly re- 
moved (w), and pairs of related meaning re- 
moved (p). The consistency of newly generated 
clusters were evaluated in the same manner as 
above. 
c p w total 
cc 90 (10.1) 53 (5.9) 39 (4.4) 182 (20.4) 
cm 148 (16.6) 56 (6.6) 32 (3.6) 236 (26.4) 
ch 96 (10.8) 20 (2.2) 6 (0.7) 122 (13.7) 
mm 44 (4.9) 29 (3.3) 30 (3.4) 103 (tl.5) 
mh 52 (5.8) lS (1.5) 5 (0.6) 70 (7.8) 
hh 30 (3.4) 3 (0.3) 3 (0.3) 36 (4.0) 
xc 42 (4.7) 9 (1.0) 9 (1.0) 60 (6.7) 
xm 28 (3.t) 8 (0.9) 20 (2.2) 56 (6.3) 
xh 8 (0.9) 2 (0.3) 5 (0.6) 15 (1.7) 
xx 4 (0.5) 1 (0.1) 8 (0.9) 13 (1.5) 
all 542 (60.7) 194 (21.7) 157 (17.6) 893 (100) 
Table 6. Evaluation of removed links 
Table 6 shows the result of evaluation of all 
the 893 removed pairs. 'c' 'p' and 'w' in the top 
row indicate types of removed links, and 'cc', 
'cm' etc. in the leftmost column indicate inter- 
nal consistencies of two clusters generated by 
link removal. A total of 157 (17.6%) of the re- 
moved links were correct links wrongly removed, 
but among them, 115 links did not produce 
'bad' clusters. If we consider them to be toler- 
able, only 42 removals (4.7%) were fatal errors. 
By exanfining the renloved links, wc found 
that the links removed at the higher edge capac- 
ity included more wrongly removed pairs. For 
instance, among 142 edges removed at capacity 
4 (which is the maximum deletable value set by 
N,~), 41 or 28.9% were wrongly removed correct 
translations, while among 288 links removed at 
capacity l, only 15 or 5.2 % were correct trans- 
lations. 
4 Discussion 
From the experiment, we have found some 
factors that affect performance. 
(1) Many errors were produced at the stage of 
extracting keyword milts, by syntagmatic mis- 
match. A substantial nmnber of them involved 
Japanese Katakana keywords. Thereibre, in ad- 
dition to the general refinement of the morpho- 
logical processing module, the perfbrmance will 
be improved if we use string proxinfity informa- 
tion to determine syntagmatic units 15 . 
(2) We expect that some errors produced by 
statistical weighting and filtering could be re- 
moved by applying stemming and orthographic 
normalisations, which are not flflly exploited in 
the current implementation. Looking back from 
the cluster generation stage, frequently occur- 
ring keywords tend to cause problems due to 
indirect associations. At the time of writing, we 
are radically changing the statistical alignment 
module based on Melamed (1996) and incorpo- 
rating iterative alignment anchoring routine so 
that the method can be applied not only to titles 
but also to abstracts, etc. Used in conjunction 
with string proximity and stemming inforina- 
tion, we might be able to retain nfinor va.riations 
properly. 
(3) At the cluster generation stage, we observed 
that correct links tend to be wrongly removed 
for higher capacities of edge cut. In the cur- 
rent implementation, the parameter values re- 
main the same for all the clusters. Performance 
will be improved by introducing a method of 
dynamically changing the parameter w-dues ac- 
cording to the cluster size and the frequencies 
of their constituent pairs. 
5 Conclusion 
We have proposed a method of constructing 
bilingual thesauri automatically, fl'om parallel 
or comparable corpora. The experiment showed 
that the performance is fairly good. We are cur- 
rently improving the method further, along the 
lines discussed in the previous section. Further 
experiments are currently being carried out, us- 
ing the data of narrower domains (e.g. artificial 
ls This can also be used for resolving hapax ambiguity. 
402 
intelligence) as well as abstracts instead of ti- 
tles. 
At the next stag(.', we are 1)lanning to eval- 
uate the method fi'om the point of view of per- 
formance of generated clusters in practical ap- 
plications. We are currently planning to apply 
the generated clusters to query expansion and 
user navigation in cross-lingual Ill., as well as to 
on-line dictionary lookup systems used as trans- 
lation aids. 
Acknowledgement 
This research is a part of the research 
project "A Study on Ubiquitous Information 
Systems tbr Utilization of Highly Distributed 
Information FLesources", fimded by the Japan 
Society for the Promotion of Science. 

References 

\[1\] Aizawa, A. N. and Kageura, K. (to appear) "A 
grai)h-/)ased al)proach to the autoinatic gen- 
eration of multilingual keyword clusters." In: 
Bouligmflt, D., Jacquemin, C. and l'tIomme, 
M-C. (eds.) Recent Advances in Computational 
7~rminology. Amsterdam: John Benjanfins. 

\[2\] Dagan, I. and Church, K. (1994) "Termight: 
Identifying and translating technical terminol- 
ogy. " Prec. of the Fourth ANLP. p.34 40. 

\[3\] Daille, B., Gaussier, E. and Langd, J. M. (t994) 
"Towards automatic extraction of monolingual 
and bilingual terminology." COLING'9~. p. 
515-.521. 

\[4\] Deerwester, S., Dumais, S. T., Furnas, G. W., 
Landauer, T. K. and Harshman, R. (1990) "In- 
dexing by latent semantic analysis." JASIS. 
41(6), p. 391 407. 

\[5\] Dunning, T. (1993) "Accurate reel,hods for the 
statistics of surprise and coincidence." Compu- 
tational Lin.quistics. 19(1), p. 61 74. 

\[6\] Eijk, van der P. (1993) "Automating the acqui- 
sition of bilingual terminology." Prec. of the 6th 
EACL. p. 11.3-119. 

\[7\] Finch, S. P. (1993) Finding Structure in Lan- 
9ua.qe. PhD Thesis. Edinbourgh: University of 
Edinbourgh. 

\[8\] Frantzi, K. T. and Ananiadou, S. (1995) "Sta- 
tistical measures for terminological extraction." 
Proc. of 3rd Int'l Conf. on Statistical Analysis 
of Textual Data. p. 297-308. 

\[9\] Fung, P. (t995) "A t)attcrn matching method 
for finding noun and proper noun translations 
fi'om noisy parallel cort)ora.." Proe. of 33rd 
A CL. p. 233 236. 

\[10\] Gale, W. A. and Church, K. W. (1991.) "Idem 
tifying word correspondences in parallel texts." 
Proc. of DARPA &~eech. and Natural Lan.quwe 
Workshop. p. 152-157. 

\[11\] Grefenstette, G. (1994) Explorations in Auto- 
matic Thesaurus Discovery. Boston: Kluwer 
Academic. 

\[12\] tfiemstra, D. (1996) Using Statistical Methods 
to Creat a Bilingual Dictionary. MSc Thesis, 
Twcnte University. 

\[13\] Ifull, D. A. (1998) "A practical approach to ter- 
minology aligmnent." Computerm'98. p. 1---7. 

\[14\] Kitamura, M. and Matsumoto, Y. (1997) "Au- 
tomatic Extraction of Translation Patterns in 
Parallel Corpora." Transactions of IPSJ. 38(4), 
p. 727- 735. 

\[15\] Kupiec, J. (1993) "An algorithm for finding 
noun phrase correspondences in bilingual col 
pora." 15"oc. of 31st ACL. p.17--22. 

\[16\] Kurohashi, S. and Nagao, M. (1998) Japanese 
Morphological Analysis System .luman versioT~ 
3.5 User's Mawaal. Kyoto: Kyoto University." 

\[17\] Melamed, I. D. (1996) "Automatic construction 
of clean broad-coverage translation lexicons." 
2nd Conference of the Association for Mach, ine 
Translation in the Americas. p. 125-134. 

\[18\] Mikheev, A. (1996) '%earning pro:t-of-speech 
guessing rules from lexicon." COLING'96, p. 
770-775. 

\[19\] Nagamochi, H. (1993) "Minimum cut, in a 
graph." In: Fujisige, S. (ed.) Discrete Struc- 
ture and Algorithms H (Chapter 4). Tokyo: 
Kindaikagakusha. 

\[20\] Nal~Gawa , H. and Mori, T. (1998) "Nested col 
location and COml)ound noun for term extrac- 
tion." Computerm'98. p. 64 70. 

\[21\] Sch{itze, It. and Pedersen, J.O. (1997) "A 
cooccurrence-based thesaurus and two appli- 
cations to information retrieval." Information 
Processing and Management. 33(3), I).307-318. 

\[22\] Smadja, F., MeKeown, K. R. and Hatzivas- 
siloglou, V. (1996) "Translating collocations 
for bilingual lexicons: A statistical apt)roach." 
Computational Linguistics. 22(1), p. \]-38. 

\[23\] Strzalkowski, T. (1994) "Building a lexicM do- 
main map from text corpora." COLING'94, 
t).604-610. 

\[24\] Veronis, J. (1996-) "ARCADE: Evaluation of 
parallel text alignment systems." 
ht tl)://www.lpl.univ-aix.fi'/projects/arcade/ 

\[25\] Yonezawa, K. and Matsumoto, Y. (1998) 
"Zoshinteki taiouzuke ni yoru taiyaku tekisuto 
lmra no hol?yaku hyougen no cyusyutu." Proe 
of the \]#h, Annual Meeting of th.e Association 
for NLP. p. 576-579. 
