Learning Bilingual Collocations by Word-Level Sorting 
Masahiko Haruno Satoru Ikehara Takefumi Yamazaki 
NTT (',ommunicatiolt Sci('nce l,al)s. 
1-2:156 Take Yokosuka-Shi 
Ka.nagawa. 2:18-03, ,la,l:)a,n 
haruno@n'c'ckb, nt-I;, j p ikehara@nttkb, rt'c-I;, j p yamazaki~nttkb, ntt. j p 
Abstract 
This paper I)roposes ;t new tnethod 
for learning bilingual colloca, tions from 
sentence-aligned paralM corpora. Our 
method COml)ris('s two steps: (1) ex- 
tracting llseftll word chunks (n-grmns) 
by word-level sorting and (2) construct- 
ing bilingua,l ('ollocations t)y combining 
the word-(;hunl(s a(-quired iu stag(' (1). 
We apply the method to a very ('hal- 
lenging text l)~tir: a stock market 1)ul- 
let;in in Japanese and il;s abstract in En-- 
glish. I)om;tin sl)ecific collocations are 
well captured ewm if they were not con- 
ta.ined in the dictionaric's of economic 
tel?IllS. 
1 Introduction 
In the field of machitm translation, there is a, 
growing interest in corl)llS-I)ased al)l)roa.('hes (Sato 
and Nagao, 1990; l)a.gan mid (',hutch, 199d; Mat. 
sulnoto et M., 19.93; Kumano m,d ll imka.wa., 199d ; 
Smadja et al., 1996). The main motiw~.tion be- 
hind this is to well ha.ndle, domain specific ex- 
pressions. I~ach apl~licatiotl dom~dn has va.rious 
kinds of collocations ranging from word-level to 
sentence-level. '\]'he correct use of these colloca- 
tions grea.l.ly inlluellcC's the qua.lity ofoutpttt texts. 
Ilexa.uso such detaih'd collocations ;~r~'~ <tillicult 1:o 
hand-conlpile, the automatic extra(:tion of bilin- 
gual collocations is needed. 
A number of studies haw> aJ.tetnpte(I to extract 
bilinguaJ collocations from paralM corpora. These 
studies c~m be classified into two directions. One 
is hased on the full parsing techniques. (Mat,- 
sumoto et a l., 1993) I)roposed a. method to find out 
phrase-lew'l correspondences, while resolving syn- 
tactic ambiguities a.t the same time. Their ninth- 
()(Is determine t)hrase eorresl)ondences I)y using 
the phrase structures of I,he two hmgua,ges and ox- 
isting bilingual dict.iona.ries. Unfi)rl.unately tl,'se 
al)proaches are protnising only for (,he compara-- 
I, ively short sentences l, hat ca, I>e a,a\]yze(I I>y ;t 
(',I+,:Y type l>arser. 
The other direction for extracting bilingual cob 
local.ions involw+:s statistics. (Fung, 1995) ac- 
quired bilingual word correspondences without 
sentetlce alignment. Although these methods 
;|re rob/Is| ~HI(I aSSllllle rio illfOl'lll~ttiOll SOltrce~ 
their outputs are just word-word corresl)otMences. 
(Kupiec, 1993; Kumano and }lirakawa, 1!194) ex- 
tracted noun phr~me (NP) correspondences from 
aligned parallel corpora. |n (Kupiec, 1993), Nl's 
in English and French t;exts are+ first extracted 
by a. NP recoguizer. Their correspotldence prol> 
abilities arc then gradually relined by using an 
EM-like iteration algorithm. (t,~uma.no and Ili- 
rakawa, 1994) lirst extracted .Japanese N Ps in the 
S&III(? way, and comhined statistics with a bilin- 
gtta.l dictionary tbr MT 1o find out NP (-orrespon- 
dences. Although their apl)ro+tches a.t.ta.ined high 
accuracy for the+ task considered, the most cru- 
cial knowledge for MT is tnorc COml~lex corre- 
spOlldelices Sllch ~-LS NI'-VP corres\[,Oll(teltces atHI 
senl.et,'e-hwe\[ eorrespotldences. It seems di\[\[icutt 
I.o extend these statistical lllethods to ~t I)roa.(ler 
rmtge of collocations because they are specialized 
to N l's o1: sillglc" words. 
(Smmlj~t et al., 1996) proposed a generM 
method to extract a I)roader range of colloca.tions. 
They first extract English collocations using the 
Xtract systetn (Smadja, 1993), and theu look for 
French coutlterparts. Their search strategy is an 
itemtive combina.tion of two elements. This is 
ha,sed on the intuitive ide~ tim| "if a set of words 
('onstitutes a collocation, its subset will Mso be 
correla.ted". Although this idea is corre~(:t, the it- 
era|ire combination strategy generates a. mlmber 
ol + useless expressions. In fa.ct, Xtract. employs a. 
rol)ust l",nglish pa.rser to lilter out the wrong collo- 
ca.tions which form more thaal ha.If lhe candidates. 
In other hmgua,ges such as Japanese, pa,rser-lmscd 
prmfi.g cannot be used. Another drawback of 
their approa, ch is that only the longesl, n-gram 
is adopl.ed. That is, when 'Ja.lmn-US auto trade 
talks' is ardol)ted as ;/collocation, ',lapall-IlS' can- 
not bc recognized as a. collocal,ion though it is i.- 
dependently used very often. 
In thi,~ pN)er, we propose an alt,ernative method 
based oil word-lewd sorting. Our method com- 
525 
prises two steps: (1) extracting useful word chunks 
(n-grams) by word-level sorting and (2) constrnct- 
ing bilingual collocations by combining the word- 
chunks acquired at stage (1). Given sentence- 
aligned texts in two languages(Haruno and Ya- 
mazaki, 1996), the first step detects useful word 
chunks by sorting and counting all uninterrupted 
word sequences in sentences. In this phase, we de- 
veloped a new technique for extracting only useful 
chunks. The second step of the method evalu- 
ates the statistical similarity of the word chunks 
appearing in the corresponding sentences. Most 
of the fixed (uninterrupted) collocations are di- 
rectly extracted from the word chunks. More flex- 
ible (interrupted) collocations are acquired level 
by level by iteratively combining the chunks. The 
proposed method, which uses effective word-level 
sorting, not only extracts fixed collocations with 
high precision, but also avoids the combinatorial 
explosion involved in searching flexible colloca- 
tions. In addition, our method is robust and suit- 
able for real-world applications because it only as- 
sumes part-of-speech taggers for both languages. 
Even if the part-of-speech taggers make errors in 
word segmentation, the errors can be recovered in 
the word chunk extraction stage. 
2 Two Types of Japanese-English 
Collocations 
In this section, we briefly classify the types of 
Japanese-English collocations by using the ma- 
terial in Table 1 as an example. These texts 
were derived from a stock market bulletin written 
in Japanese and its abstract written in English, 
which were distributed electrically via a computer 
network. 
In Table 1, (~g-~,~'l-~/Tokyo Forex), (H~I~!IYJ 
~\[~n~\]{~ /auto talks between Japan and the U.S.) 
and (~k,.'~/ahead of) are Japanese-English col- 
locations whose elements constitute uninterrupted 
word sequences. We call hereafter this type of col- 
location fixed eolloeatlon. Although fixed col- 
location seems trivial, more than half of all use- 
ful collocations belong to this class. Thus, it is 
important to extract fixed collocations with high 
precision. In contrast, ( b')t-t~'~ ~ ~1~ ¢ki~?,_ 
~ / The U.S. currency was quoted at -~ ) and ( b" 
)t.~'~ ~ ~l~ ~k_2~ /The dollar stood ..~)1 
are constructed from interrupted word sequences. 
We will call this type of collocation flexible col- 
location. From the viewpoint of machine learn- 
ing, flexible collocations are much more difficult 
to learn because they involve the combination of 
elements. The points when extracting flexible col- 
locations is how the number of combination (can- 
didates) can be reduced. 
Our learning method is twofold according to 
the collocation types. First, useful uninterrupted 
1 ~. represents any sequence of words. 
word chunks are extracted by the word-level sort- 
ing method. To find out fixed collocations, we 
evaluate stochastic similarity of the chunks. Next, 
we iteratively combin the chunks to extract flexi- 
ble collocations. 
3 Extracting Useful Chunks by 
Word-Level Sorting 
3.1 Previous Research 
With the availability of large corpora and mem- 
ory devices, there is once again growing interest in 
extracting n-grams with large values of n. (Nagao 
and Mori, 1994) introduced an efficient method 
for calculating an arbitrary number of n-grams 
from large corpora. When the length of a text 
is I bytes, it occupies l consecutive bytes in mem- 
ory as depicted in Figure 1. First, another table 
of size l is prepared, each field of which represents 
a pointer to a substring. A substring pointed to 
by the (i - 1)th entry of the table constitutes a 
string existing from the ith character to the end 
of the text string. Next, to extract common sub- 
strings, the pointer table is sorted in alphabetic 
order. Two adjacent words in the pointer table 
are compared and the lengths of coincident prefix 
parts are counted(Gonnet et al., 1992). 
For example, when 'auto talks between Japan 
and the U.S.' and 'auto talks between Japan and 
China' are two adjacent words, the nmnber of co- 
incidences is 29 as in 'auto talks between Japan and 
'. The n-gram frequency table is constructed by 
counting the number of pointers which represent 
the same prefix parts. Although the method is ef- 
ficient for large corpora, it involves large volume 
of fractional and unnecessary expressions. The 
reason for this is that the method does not con- 
sider the inter-relationships between the extracted 
strings. That is, the method generates redundant 
substrings which are subsumed by longer strings. 
text ntr|hg (I oharaoter~: I bytes) 
la, ol.ter table 
Figure 1: Nagao's Approach 
To settle this problem, (Ikehara et al., 1996) 
proposed a method to extract only useful strings. 
Basically, his methods is based on the longest- 
match principle. When the method extracts a 
longest n-gram as a chunk, strings subsumed by 
the chunk are derived only if the shorter string of_ 
tell appears independently to the longest chunk. 
If 'auto talks between Japan and the U.5'.' is ex- 
tracted as a chunk, 'Japan and the U.S.'is also 
526 
Tokyo Forex 5 PM: Dollar at 84.21-84.24 yen 
The dollar stood 0.26 yen lower at 84.21-84.24 at 5 p.m. 
Forex market trading was extremely quiet ahead of fnrther auto talks between Japan and the U.S., slated 
for early dawn Tuesday. 
The U.S. currency was quoted at 1.361-1.3863 German marks at 5:15 p.m. 
Table 1: Sample of Target Texts 
extracted because 'Japan and the U.S.' is used 
so often independently as in 'Japan and the U.S. 
agreed ...'. However, 'Japan and the' is not ex- 
tracted because it always appears in the context 
of 'Japan and the U.S.'. The method strongly 
suppresses fractional and unnecessary expressions. 
More than 75 % of the strings extracted by Na- 
gao's method are removed with the new method. 
3.2 Word-Level Sorting Method 
al Ipl l I@ i I \[dl@~tlh\] \[ 
poln~r imbl~ 
¢O:~nl ~llm~r 
Figure 2: Word-Level Sorting Approach 
The research described in the previous section 
deals with character-based n-grams, which gener- 
ate excessive numbers of expressions and requires 
large memory for the pointer table. Thus, from 
a practical point of view, word-based n-grams are 
preferable in order to further suppress fractional 
expressions and pointer table use. In this paper, 
we extend Ikehara's method to handle word-based 
n-grams. First, both Japanese and English texts 
are part-of-speech (POS) tagged 2 and stored in 
memory as in Figure 2. POS tagging is required 
for two main reasons: (1) There are no explicit 
word delimiters in Japanese and (2) By using POS 
information, useless expressions can be removed. 
In Figure 2, '@' and '\0' represent the explicit 
word delimiter and the explicit sentence delimiter, 
respectively. Compared to previous research, this 
data structure has the following advantages. 
2We use in this phase the JUMAN morphological 
analyzing system (Kurohashi et al., 11994) for tagging 
Japanese texts and Brill's transformation-based tag- 
get (Brill, 1994) for tagging English texts. We would 
like to thank all people concerned for providing us 
with the tools. 
1. Only heads of each word are recorded in the 
pointer table. As depicted in Figure 2, this 
remarkably reduces memory use because the 
pointer table also contains other string char- 
acteristics as Figure 3. 
2. As depicted in Figure 2, only expressions 
within a sentence are considered by introduc- 
ing the explicit sentence delimiter '\0'. 
3. Only word-level coincidences are extracted 
by introducing the explicit word delimiter 
'@'. This removes strings arising from a 
partial match of different words. For exam- 
ple, the coincident string between 'Japan and 
China' and 'Japan and Costa Rica' is 'Japan 
and'in our method, while it is 'Japan and C' 
in previous methods. 
colnol ~°ont adopt dance 
~4 
(15 
1-/0 2 
I lC)*I 
104 
string 
I0 
I¢) 
16 
I 6 
16 
J~p. n~-v andc~a ¢ "h m,,ov 
J,..,.,,<.o...tc.~,c'o.,. ~1o. 
a ap. n~-q an d¢,~ t |~,*~ 1 Js 
Ju pa t t(~ an,tC,~ t I~ U S 
J it pit t~ ttn¢U~, t I~<~ ~ I S 
Ja p a i ~¢U ~n (ICa) II~C~O_ • / 
Figure 3: Sorted Pointer Table 
Next, the pointer table is sorted in alpha- 
betic order as shown in Figure 3. In this table, 
sentno, and coincidence represent which sen- 
fence the string appeared in and how many char- 
acters are shared by the two adjacent strings, re- 
spectively. That is, eoineidenee delineates can- 
didates for usefifl expressions. Note here that the 
coincidence between Japan@and@China... and 
Japan@and@Costa Rica... is l0 as mentioned 
above. 
Next, in order to remove useless subsumed 
strings, the pointer table is sorted according to 
sentno.. In this stage, adopt is filled with '1' 
or '0' , each of which represents if or not if a 
string is subsumed by longer word chnnks, respec- 
tively. Sorting by sentno, makes it much easier 
to check the subsumption of word chunks. When 
527 
both 'Japan and the U.S.' and 'Japan and the' 
arise from a sentence, the latter is removed be- 
cause the former subsumes the latter. 
Finally, to determine which word-chunks to ex- 
tract, the pointer table is sorted once again in al- 
phabetic order. In this stage, we count how many 
times a string whose adopt is 1 appears in the 
corpus. By thresholding the frequency, only use- 
tiff word chunks are extracted. 
4 Extracting Bilingual 
Collocations 
In this section, we will explain how Japanese- 
English collocations are constructed from word 
chnnks extracted in the previous stage. First, 
fixed collocations are induced in the following way. 
We use the contingency matrix to evaluate the 
similarity of word-chunk occurrences in both lan- 
guages. Consider the contingency matrix, shown 
Table 2, for Japanese word chunk cjp,~ and English 
word chunk c~,g. The contingency matrix shows: 
(a) the number of Japanese-English corresponding 
sentence pairs in which both Cjp n and ce,~g were 
found, (b) the number of Japanese-English cor- 
responding sentence pairs in which just c~, v was 
found, (c) the number of Japanese-English cor- 
responding sentence pairs in which just ejp,~ was 
fonnd, (d) the mnnber of Japanese-English col 
responding sentence pairs in which neither chunk 
was found. 
Ceng a b 
c d 
Table 2: Contingency Matrix 
If ejpn and Cen.q are good translations of one 
another, a should be large, and b and c should 
bc small. In contrast, if the two are not good 
translations of each other, a should be small, mid 
baud c should be large. To make this argument 
more precise, we introduce mutual information ~s 
follows. Thresholding the mutual information ex- 
tracts fixed collocations. Note that mutual in- 
formation is reliable in this case because the fre- 
quency of each word chunk is thresholded at the 
word chunk extraction stage. 
p,'ob(q,,,,, c~,,.,) = log "(" + ~ + ~ + d) 
log v,.ob(,:j,,,)v,.ob(~,~,,~) (, + b)(, + c) 
Next, we sumnmrize how flexible collocations 
are extracted. The following is a series of proce- 
dures to extract flexible collocations. 
1. For any pair of chunks in a Japanese sen- 
tence, compute mutual information. Con> 
bine the two chunks of highest mutual in- 
formation. Iteratively repeat this procedure 
and construct a tree level by level. 
2. For any pair of chunks in an English sen- 
tence, repeat the operations done in the the 
Japanese sentence. 
3. Perform node matching between trees of 
both langnages by using mutual information 
of Japanese and English word chunks. 
tin ,~l~ore R 
Figure 4: Constructing Flexible Collocations 
The first two steps construct monolingual simi- 
larity trees of word chnnks in sentences. The third 
step iteratively evalnates the bilingual similarity 
of word chunk combinations by using the above 
trees. Consider the example below, in which the 
underlined word chunks construct a flexible col- 
location (~ Yif/~°~.~t~,f~t~_~,:x ~g ~, I-iti~'~3: 
~¢_k~-L/~:/~ rose ~ on the oil products spot 
market in Singapore). First, two similarity trees 
are constructed as shown in Figure 4. Graph 
matching is then iteratively attempted by compnt- 
ing mutual inforlnation fbr groups of word chunks. 
In the present implementation, the system com- 
bines three word chunks at most. The technique 
we use is similar to the parsing-b~sed methods 
for extracting bilingual collocation(Matsumoto et 
al., 1993). Our method replaces the parse trees 
with the similarity trees and thus avoids the com- 
binatorial explosion inherent to the parsing-ba~sed 
methods. 
lia:ample: , ,, 
Naphtha and gas oil rose 
on the oil products spot market in Singapore 
5 Preliminary Evaluation and 
Discussion 
We performed a preliminary ewduation of tile 
proposed method by using 10-days Japanese stock 
market bulletins and their Fnglish abstracts, each 
containing 2000 sentences. The text was first au-- 
tomatically aligned and then hand-checked by a 
hum~m supervisor. A sample passage is displayed 
in TM~Ie 1. 
In this experiment, we considered only the word 
chunks thai; appeared more than 4 times for fixed 
collocations and more than 6 times for flexible col- 
locations. Table 4 illustrates the fixed collocations 
acquired by our method. Almost all collocat.ions 
in Table 4 involw~ domain specilic jargon, which 
528 
~tmil~Tse English 
DI(j\](~I-~ ~ I 1\] ~ Tokyo Forex ~ I)olhu" ~tt, ~ yen 
b',i!--'l-1 ~ "~\[+\]\]~ 9 ~I ~ ~_~k ?c The 1,J.S. clirrency WitS (llloted at 
were sold ~ dropped as well 
I I~I~ ~" ©fJ~'i~(|l b/'e I la,nk of ,lapiin injected 
P~-d~'n Y -- ~-\]!~ ~" OlllrOll ~ ~lllllil, GlllO Forcsl, fy -- 
Tal)h; 3: Saniples of li'lexible (~ollocations 
No___ Jttl)O lles(: 
= ~.k'c 
,t JAFCO ~-~ 
-7--~ 
s q-:~.k b/%'~ ~, 6 
t0 ,j,~,~ ~ 
~ 2" ":; Ta) 5~ ~~t~ - 
14 
15 
16 
17 
18 
2o 
22 
~3 
24 2~L_ 
_ 26 
27 
28 
29 
3O 
31 
3~ 
aa_~ 
34 
as 
36 
_ 37 
C B Q s\]z~j 
~kA 
,~ 0 f~ff kt 
f/19~l Ig} 
' t11~ 
Eiil411sh 
Tokyo Forex 
ahead of 
German mark 
Japan Associated lCin~nce 
in contrast 
remained aidelined watching 
fear 
Tokyo Gold future~ (\]in: 
slow 
wait-and-ace mood 
Loco-London gold 
~inat mark 
Convertible bond~ 
dealers 
t radin~; volume 
h~i~h-yieldera 
Nikkei 300 futuren Aft-opg: 
~-cln: 
con| r~ct ended 
economic ~timulun packa e~a¢ -- 
cloted t~t 
futurea cls: 
bond Int~rket 
convertible bondn 
nikkei future~ aft-opg: 
~ \[. disheartened by 
-'..0')~1.~7)~ ~ .... pet|at|on of 
%Lo~,9 
~'1~ 
4~q: 
e@;ed up 
hish-tech sharen 
wMt-and-~e¢ mood 
Surnit omo Forestry 
U.S.-dapttn ~uto t~lks 
npecul~tive buying of_ 
Tokyo ~ me.8: ~ 
intereat rate~ 
the dapan-U,S. &uto dispute 
N -oT. T .J ii i)l-i 11 thG(~ 
as f~9fll~ 
*~ 'l',t31~)l 
4 a ~Cl 4Al~ -- , 
,m-,4~ 
4o /\]'I I a)9'd I) 
~--7-- ~Ltc 
--~-~--~ 
56 
57 
o~ 0-~: 3 5t 
68 
70 
72 
English 
bond~ and bond futures 
public funds 
inttitutional inventors 
benchmark 
semiconductor-related ~tocks 
forei6n inve~tor~ 
hlgh-tech ~tocks 
turnover 
small-lot ~ellinfz - 
r~cord high 
benchmark 
low 
Tokyo Stockn 2nd Sec 
were weak 
individual inveatora 
pretax profit 
The firnt ~ection of TSI~, 
the Nikkei ~tock average 
Tokyo CB~ O qp~ i 
long-term government bondll 
were ~rnded at 
impor terll 
advanced 
coverin~ 
Showa Denko 
volume w~ 
hi ta new year'~ hi~rh 
ruling co~litlon 
q:i~JJ~Tif~ __ 
tl~\[:~l~R'~r,i,:}~,~..,~ • Nikkei World Commoditie,: 
~ir,J I r/ 
Sumltomo Special Metals 
Nikkei 300 futuren Mn~ 
OSl~ 
year'~ low 
ln~ close 
inched np 
Ta, ble 4: Siunples of Fi×cd Collocation,<~ 
cannot, be const.rueted composit, ionally. For exam- 
phi, No 9 nieans 'Tokyo (~ohl FuLure, m~rkel; ended 
trading R)r the (lay', but was never written as 
such. As well as No. 9 , a nuuflml: ofseut;ence-level 
collocations were also extracl, ed. No. 9, No. 18, 
No. 23, No. 2< No. 35, No. 56 and No. 67 a.re 
t,ypica,l heads of Llle stock markel; report. These 
exi)rcssioiis a.pllear eweryda.y in st.ock markel, re- 
ports. 
IlL is inl, eresl, iil E I4) not, ic(~ lhe variel,y o\[ fixed 
colh)ca.tions. They dill'~'r in their consl.rucl.ions; 
noun phrases, verll phrases, I)rel)osit.iolml phrase<; 
and sentrnce--level. All, hough coltventionaJ nleLll- 
otis focus on houri llhrases or |,ry t;o en(:onll/ass 
all kinds of (-olloca.tions at the sanie time, we be- 
liew" l, ha, t, fixed colloca, tion is au ilnporl,anl, class o\[' 
colh)cation. It is useful to iltl,ensively sl,udy fixed 
collocations because 1,he (:ollocatioll of lilore com-- 
plex structures is (lillic.lt to h'i,', regardle'~,~ of 
the mf~l,hod used. 
'I'MAe 3 exemplifies the flexible colloca.tions we 
acquired fronl the saint cOrllUS. No. 1 to No. 4 are 
typical exprossions in stock nlarkc'l, reports. These 
collocation are eXl;l'enlc.ly useful for l,ellll)lal, e-- 
based nlachine /.ra.nsla.tiol~ sysl.enls. No. 5 is a.n 
examph~ o1' a useless ('ol\[ocalriOIt. BOt\]l Olnron 
a, nd ~unii|,omo Forcst;ry arc cotupap, y names 1,lid, l; 
co-ocem- I'requenl, ly i. sl,ock uia,l'kel, i'el)ort;s , bul, 
t.he.qc two conlpanics ha,ve uo direct relal;iou. In 
fact, nlore I.han half of a.II lh!xibh~ collocations ac- 
quired were like No. 5. To remove useh>ss coJJ()(';t- 
lions, co,stra.inl.s <)n l;ll<" <'haracl.er tyl>eS would I)e 
useful. Most useful ,lapa/ICSe /lcxiblt' (:ollocai.iOllS 
coul;;lin al, least one ilira.gamt 3 ch~u-acter. Thus, 
3 ,I a i)~nese has (,}n'c(~ t,y pe,~ of ch ara~ctcrs ( II ira.ga.na, 
I(atak;~na., and t<anjO, each of which has dilt't!rcnt 
a.n.)uttts of i.lbrntalio.. In ( OllLl,t,qt, Enl-lish ha.s ouly 
529 
many useless collocations can be removed by im- 
posing this constraint on extracted strings. 
It is also interesting to compare our results 
with a Japanese-English dictionary for economics 
(Iwatsu, 1990). About half of Table 4 and all of 
Table 3 are not listed in the dictionary. In partic- 
ular, no verb-phrase or sentence-level collocations 
are not covered. These collocations are more use- 
ful for translators than noun phrase collocations, 
but greatly differ from domain to domain. Thus, it 
is difficult in general to hand-compile a dictionary 
that contains these kinds of collocations. Because 
our method automatically extracts these colloca- 
tions, it will be of significant use in compiling do- 
main specific dictionaries. 
Finally, we briefly describe the coverage of the 
proposed method. For the corpus examined, 70 % 
of the fixed collocations and 35 % of the flexible 
collocations output by the method were correct. 
This level of performance was achieved in the face 
of two problems. 
• The English text was not a literal transla- 
tion. Parts of Japanese sentence were often 
omitted and sometimes appeared in a differ- 
ent English sentence. 
• The data set was too small. 
We are now constructing a larger volume of cor- 
pus to address the second problem. 
6 Conclusion 
We have described a new method for learning 
bilingual collocations from parallel corpora. Our 
method consists of two steps: (1) extracting use- 
ful word chunks by the word-level sorting tech- 
nique and (2) constructing bilingual collocations 
by combining these chunks. This architecture re- 
flects the fact that fixed collocations play a more 
crucial role than accepted in previous research. 
Our method not only extracts fixed collocations 
with high precision but also reduces the combi- 
natorial explosion that would be otherwise con- 
sidered inescapable in extracting flexible colloca- 
tions. Although our research is in the preliminary 
stage and tested with a small number of Japanese 
stock market bulletins and their English, the ex- 
perimental results have shown a number of inter- 
esting collocations that are not contained in a dic- 
tionary of economic terms. 
References 
Eric Brill. 1994. Some advances in 
transformation-based part of speech tagging. In 
Proc. 12th AAAI, pages 722-727. 
Ido Dagan and Ken Church. 1994. Termight: 
identifying and translating technical terminol- 
one type of character. 
ogy. In Proc. Fourth Conference on Applied 
Natural Language Processing, pages 34-40. 
Pascale Fung. 1995. A pattern matching method 
for finding noun and proper noun translations 
from noisy parallel corpora. In Proc. 33rd ACL, 
pages 236-243. 
Gaston H. Gonnet, Ricardo A. Baeza-Yates, and 
Tim Snider, 1992. Information Retrieval, chap- 
ter 5, pages 66 82. Prentice-Hall. 
Masahiko lIaruno and Takefinni Yamazaki. 1996. 
High-Performance Bilingual Text Alignment 
Using Statistical and Dictionary Information. 
In Proc. 34th A CL. 
Satoru Ikehara, Satoshi Shirai, and Hajime 
Uehino. 1996. A statistical method for extract- 
ing unitnerrupted and interrupted collocations 
from very large corpora. In Proc. COLING96. 
Keisuke Iwatsu. 1990. TREND: Japanese-English 
Dictionary of Current Terms. Shougakkan. 
Akira Kumano and Hideki Hirakawa. 1994. 
Building an MT dictionary from parallel texts 
based on linguisitic and statistical information. 
In Proc. 15th COLING, pages 76-81. 
Julian Kupiec. 1993. An algorithm for finding 
noun phrase correspondences in bilingual cor- 
pora. In the 3lst Annual Meeting of ACL, 
pages 17-22. 
Sadao Kurohashi, Toshihisa Nakamura, Yuji Mat- 
sumoto, and Makoto Nagao. 1994. Improve- 
ments of Japanese morphological analyzer JU- 
MAN. In Proc. International Workshop on 
Sharable Natural Language Resources, pages 
22-28. 
Yuji Matsumoto, tIiroyuki Ishimoto, and Takehito 
Utsuro. 1993. Structural matching of paral- 
lel texts. In the 31st Annual Meeting of ACL, 
pages 23-30. 
Makoto Nagao and Shinsuke Mort. 1994. A new 
method of n-gram statistics for large number of 
n and automatic extraction of words and pha- 
rases from large text data of japanese,. In Proc. 
15th COLING, pages 611 615. 
Satoshi Sato and Makoto Nagao. 1990. Toward 
memory-based translation. In Proc. 13th COL- 
ING, pages 247-252. 
Frank Smadja, Kathleen McKeown, and Vasileios 
tlatzivassiloglou. 1996. Translating colloca- 
tions for bilingual lexicons: A statistical ap- 
proach. Computational Linguistics, 22(1):1- 38, 
March. 
\['rank Smadja. 1993. Retrieving collocations 
from text: Xtract. Computational Linguistics, 
19(1):143 177, March. 
530 
