Automatic Extraction of Word Sequence 
Correspondences in Parallel Corpora 
Mihoko Kitamura 
Kansai Laboratory 
Oki Electric Industry Co., Ltd. 
kit a@kansai, oki. co. jp 
Yuji Matsumoto 
Graduate School of Information Science 
Nara Institute of Science and Technology 
mat su@is, aist-nara, ac. jp 
Abstract 
This paper proposes a method of finding correspondences of arbitrary length word sequences 
in aligned parallel corpora of Japanese and English. Translation candidates of word sequences are 
evaluated by a similarity measure between the sequences defined by the co-occurrence frequency 
and independent frequency of the word sequences. The similarity measure is an extension of Dice 
coefficient. An iterative method with gradual threshold lowering is proposed for getting a high 
quality translation dictionary. The method is tested with parallel corpora of three distinct domains 
and achieved over 80~0 accuracy. 
1 Introduction 
A high quality translation dictionary is indispensable for machine translation systems with good 
performance, especially for domains of expertise. Such dictionaries are only effectively usable for 
their own domains, much human labour will be mitigated if such a dictionary is obtained in an 
automatic way from a set of translation examples. 
This paper proposes a method to construct a translation dictionary that consists of not only 
word pairs but pairs of arbitrary length word sequences of the two languages. All of the pairs are 
extracted from a parallel corpus of a specific domain. The method is proposed and is evaluated 
with Japanese-English parallel corpora of three distinct domains. 
Several attempts have been made for similar purposes, but with different settings. (see \[Kupiec 
93\]\[Kumano & Hirakawa 94\]\[Smadja 96\]) 
Kupiec and Kumano ~ Hirakawa propose a method of obtaining translation patterns of noun 
compound from bilingual corpora. Kumano & Hirakawa stand on a different setting from the other 
works in that they assume ordinary bilingual dictionary and use non-parallel (non-aligned) corpora. 
Their target is to find correspondences not only of word level but of noun phrases and unknown 
words. However, the target noun phrases and unknown words are decided in the preprocessing 
stage. 
Brown et al. use a probabilistic measure for estimating word similarity of two languages in 
their statistical approach of language translation \[Brown 88\]. In their work of aligning of parallel 
texts, Kay & RSscheisen used the Dice coefficient as the word similarity for insuring sentence level 
correspondence \[Kay & RSscheisen 93\]. 
Kitamura & Matsumoto use the same measure to calculate word similarity in their work of 
extraction of translation patterns. The similarity measure is used as the basis of their structural 
matching of parallel sentences so as to extract structural translation patterns. In texts of expertise a 
number of word sequence correspondences, not word-word correspondences, are abundant especially 
in the form of noun compounds or of fixed phrases, which are keys for better performance. Though 
the method proposed in this paper deals only with consecutive sequences of words and is intended 
to provide a better base for the structural matching that follows, the results themselves show very 
useful and informative translation patterns for the domain. 
Our method extends the usage of the Dice coefficient in two ways: It deals not only with cor- 
respondence between the words but with correspondence between word-sequences, and it modifies 
the formula measure so that more plausible corresponding pairs are identified earlier. 
79 
2 Related Work and Some Results 
Brown et. al., used mutual information to construct corresponding pairs of French and English 
words. A French word f is considered to be translated into English word ej that gives the maximum 
mutual information: 
P(e.~ l f) MI(ej, f) = log P(ej) 
Probabilities P(ej I f) and P(ej) are calculated from parallel corpus by counting the occurrences 
and co-occurrences of ej and f. 
Kay & Rbscheisen used the following Dice coefficient for calculating the similarity between 
English word we and French word w I. In the formula, f(we), f(wl) represent the numbers of 
occurrences of we and wl, and f(we, wl) is the number of simultaneous occurrences of those words 
in corresponding sentences. 
2f(we, w f) sire(we, = f(w ) + f(wj) 
Kitamura & Matsumoto used the same formula for calculating word similarity from Japanese- 
English parallel corpora. A comparison between the above two method is done on a parallel corpus 
and the results are reported in \[Ohmori 96\]. They applied both approaches to a French-English 
corpus of about one thousand sentence pairs. The results are shown in Table 1 where the correctness 
is checked by human inspection. Since both methods show very inaccurate results for the words 
of one occurrence, only the words of two or more occurrences are selected for inspection. Table 1 
shows the proportion that a French word is paired with the correct English words checked with the 
top, three and five highest candidates. 
Num. of words 
Mutual Information 697 
Dice coefficient 574 
1st candidate within best 3 within best 5 
43.6% 60.0% 65.4% 
46.2% 65.0% 66.5% 
Table 1: Comparison of Mutual Information and Dice coefficient 
The results show that though Dice coefficient gives a slightly better correctness both methods 
do not generate satisfactory translation pairs. 
\[Kupiec 93\] and \[Kumano & Hirakawa 94\] broaden the target to correspondences between word 
sequences such as compound nouns. Kupiec uses NP recognizer for both English and French and 
proposed a method to calculate the probabilities of correspondences using an iterative algorithm 
like the EM algorithm. He reports that in one hundred highest ranking correspondences ninety of 
them were correct. Although the NP recognizers detect about 5000 distinct noun phrases in both 
languages, the correctness ratio of the total data is not reported. 
Kumono & Hirakawa's objective is to obtain English translation of Japanese compound nouns 
(noun sequences) and unknown words using a statistical method similar to Brown's together with 
an ordinary Japanese-English dictionary. Japanese compound nouns and unknown words are de- 
tected by the morphological analysis stage and are determined before the later processes. Though 
they assume unaligned Japanese-English parallel corpora, alignment is performed beforehand. In 
an experiment with two thousand sentence pairs, 72.9% correctness is achieved by the best corre- 
spondences and 83.8% correctness by the top three candidates in the case of compound nouns. The 
correctness ratios for unknown words are 54.0% and 65.0% respectively. 
Smadja proposes a method of finding translation patterns of continuous as well as discontinu- 
ous collocations between English and French \[Smadja 96\]. The method first extracts meaningful 
collocations in the source language(English) in advance by the XTRACT system. Then, aligned 
corpora are statistically analized for finding the corresponding collocation patterns in the target 
language(French). To avoid possible combinational explosion, some heuristics is introduced to filter 
implausible correspondences. 
Getting translation pairs of complex expression is of great importance especially for technical 
domains where most domain specific terminologies appear as complex nouns. There are still a 
80 
............................................... _P~((eJ.C_9.~or~ ............................................. 
Japanese Corpus English Corpus 
I ............ I l ............... ..................... ;;: ........................................................... ..... ;;;;; 
1. Morphological Analysis ---ff-~..~"~- ...... "--'"1 Morphological Analysis ...... ~ ~'~ Content Word Extraction \[ r,.~'. ¢f¢_. 
\[ Content Word Extraction I English | 
~__,~uunm~j Word Sequence ~action ~ctionary__.~ 2. Word_Selquenc e Extraction 
I |_ 
3. Setting of Minu~um Occurence Condition 
~-ranslation |~i~ ..... 4. Extraction of Translation Canditates re er o 
5. Similarity Calculation threshold 
decrement 
. ........ :: ..........  6:.O? r?J.ation -'4°f rransistion 
......... .......... _____ 7~7,;g'-" .~....~ Japanese-English 
Pair ~ ......   
yes. 
Figure 1: The flow of finding the correspondences of word sequences 
number of other interesting and meaningful expression that should be translated in a specific way. 
We propose a method of finding corresponding translation pairs of arbitrary length word sequences 
appearing in parallel corpora and an algorithm that gradually produces "good" correspondences 
earlier so as to reduce noises when extracting less plausible correspondences. 
3 Overview of the Method 
Figure 1 shows the flow of the process to find the correspondences of Japanese and English word 
sequences. Both Japanese and English texts are analyzed morphologically. 
We make use of two types of co-occurrences: Word co-occurrences within each language corpus 
and corresponding co-occurrences of those in the parallel corpus. In the current setting, all words 
and word sequences of two or more occurrences are taken into account. Since frequent co-occurrence 
suggests higher plausibility of correspondence, we set a similarity measure that takes co-occurrence 
frequencies into consideration. Deciding the similarity measure in this way reduces the computa- 
tional overhead in the later processes. If every possible correspondence of word sequences is to be 
calculated, the combination is large. Since high similarity value is supported by high co-occurrence 
frequency, a gradual strategy can be taken by setting a threshold value for the similarity and by 
iteratively lowering it. Though our method does not assume any bilingual dictionary in advance, 
once words or word sequences are identified in an earlier stage, they are regarded as decisive en- 
tries of the translation dictionary. Such translation pairs are taken away from the co-occurrence 
data, then only the remaining word sequences need be taken into consideration in the subsequent 
iterative steps. Next section describes the details of the algorithm. 
4 The Algorithm 
The step numbering of the following procedure corresponds to the numbers appearing in Figure 1. 
In the current implementation, the Translation Dictionary is empty at the beginning. Steps 1 and 
2 are performed on each language corpus separately. 
1. Japanese and English texts are analyzed morphologically and all content words (nouns, verbs, 
adjectives and adverbs) are identified. 
81 
2. All content words of two or more occurrences are extracted. Then, word sequences of length 
two that are headed by a previously extracted word are extracted, provided they appear at 
least twice in the corpus. In the same way, a word sequence w of length i + 1 is taken into 
consideration only when its prefix of length i has been extracted and w appears at least 
twice in the corpus. This process is repeated until no new word sequences are obtained. The 
subsequent steps handle only those extracted word sequences. It would be natural to set a 
maximum length for the candidate word sequences, which we really have it be between 5 and 
10 in the experiments. 
3. A threshold for minimum frequency of occurrence (.f,~in) is decided, and the following process 
is repeated, every time decrementing the threshold by some extent. 
4. For the word sequence occurring more than fmin times, the numbers of total occurrence and 
total bilingual co-occurrence are counted. This is done for all the pairs of such Japanese and 
English word sequences. It is not the case for a pair that already appeared in the Translation 
Dictionary. 
5. For each pair of bilingual word sequences, the following similarity value (sim(w.r, wE)) is 
calculated, where wj and WE are Japanese and English word sequences, and fj, fe and fie 
are the total frequency of wj in the Japanese corpus, that of wE in the English corpus and 
the total co-occurrence frequency of wj and WE appearing in corresponding sentences. 
sim(wj,wE) = (log2 fje) fj2~ fe 
This formula is a modification of the Dice coefficient, weighting their similarity measure by 
logarithm of the pair's co-occurrence frequency. Only the pairs with their sire(w j, wE) value 
greater than log 2 frain are considered in this step. The fact that no word sequence occurring 
less than frnin times cannot yield greater similarity value than log 2 frnin assures that all 
pairs of word sequences with the occurrence more than fmin times are surely taken into 
consideration. 
6. The most plausible correspondences are then identified using the similarity values so calcu- 
lated: 
(a) For an English word sequence WE, let WJ = {Wjl,Wj2,"',wjn} be the set of all 
Japanese word sequences such that sim(wji, wE) > log2 f,~i,~. The set is called the 
candidate set for WE. For each Japanese word sequence w.t its candidate set is constructed 
in the same way. 
(b) Of the candidate set WJ for wE, if the candidate with the highest similarity value 
(w.ti = arg max sim(wjk,WE)) again selects wE as the candidate with the highest w~kEWJ 
similarity (wE = arg max sirn(wji,WEm)), where WE is the candidate set for w.tl, wEmEWE 
the pair (wji, WE) is regarded as a translation pair. 
7. The approved translation pairs are registered in the Translation Dictionary until no new pair 
is obtained, then the threshold value fmin is lowered, and the steps 4 through 6 are repeated 
until fmin reaches a predetermined value. 
5 Experiments of Translation Pair Extraction 
5.1 The settings 
We used parallel corpora of three distinct domains: (1) a computer manual (9,792 sentence pairs), 
(2) a scientific journal (12,200 sentence pairs), and (3) business contract letters (10,016 sentence 
pairs). All the Japanese and English sentences are aligned and morphologically analyzed 1. 
The settings of the experiments are as follows: The maximum length of the extracted word 
sequences is set at 10. The initial value of .fmi, is set at the half of the highest number of occurrences 
of extracted word sequences and is lowered by dividing by two until it reaches to or under 10, then 
it is lowered by one in each iteration until 2. 
1 Japanese and English morphological analyzers of Machine Translation System PENSIVE were used. PENSI~E is a 
trademark of Osaka Gas corporation, OGIS-RI, and Oki Electric Industry Co.,Ltd. 
82 
Business 
Science 
Manual 
single word word seq. 
total occurrence > 2 
Eng. Jap. Eng. Jap. 
2,300 3,739 2,218 '3,568 
7,254 9,415 6,764 8,856 
3,701 4,926 3,478 4,799 
occurrence _> 2 
Eng. Jap. 
73,026 72,574 
16,555 24,998 
32,049 38,796 
Table 2: Numbers of extracted words and word sequences 
Num. of threshold correct 
pairs 
1151 
575 
287 
143 
71 
35 
17 
10 
9 
8 
7 
6 
5 
4 
3 
total 
2 
3 
4 
12 
19 
48 
103 
164 
53 
67 
82 
134 
163 
318 
755 
1,927 
near 
miss 
2 0 0 
3 0 0 
4 0 0 
12 0 0 
18 1 0 
48 0 0 
101 2 0 
155 8 1 
51 2 0 
63 4 0 
75 6 1 
114 20 0 
145 15 3 
257 50 11 
502 195 59 
1,550 302 75 
correctness accumulative incorrect 
(+near)   correctness% 
100(100) 100(100) 100(100) 
100(100) 
100(100) 94.7(100) 
100(100) 98.1(100) 
94.5(99.4) 96.2(100) 
94.0(100) 91.5(98.8) 
85.1(100) 89.0(98.2) 
80.8(96.5) 66.5(92.2) 
100(100) 100(100) 
100(100) 97.5(100) 
98.9(100) 99.0(100) 
96.6(99.7) 96.6(99.8) 
96.2(99.8) 95.5(99,6) 
93.5(99.7) 92.6(99.4) 
89.4(98.6) 
80.4(96.1) 
Table 3: Results of Business Letter 
Table 2 summarizes the numbers of word sequences extracted by Step 2. For each corpus the 
table shows the numbers of distinct content words, those of two or more occurrences, and the 
numbers of word sequences of two or more occurrences. 
5.2 The results 
Tables 3, 4 and 5 shows the statistics obtained from the experiments. The columns specify the 
numbers of approved translation pairs. The correctness of the translation pairs are checked by a 
human inspector. A "near miss" means that the pair is not perfectly correct but some parts of the 
pair constitute the correct translation. 
It is noticeable that the pairs with high frequencies give very accurate translation in the cases 
of the computer manual and the business letters, whereas the scientific journal does not necessarily 
gives high accuracy to highly frequent pairs. The reason is that the former two corpora are really 
in a homogeneous domain, while the corpus of scientific journal is a complex of distinct scientific 
fields. The former two corpora reveal a worse performance with the pairs with low frequency 
threshold. This is because those corpora frequently contain a number of lengthy fixed expression 
or particular collocations. One such example is that "p type (silicon)" frequently collocates with 
"n type (silicon)," making the correspondence uncertain. 
The science journal shows a stable accuracy of translation pair extraction. The accuracy exceeds 
90% in most of the stages. The reason would be that scientific papers do not repeat fixed expression 
and the terminologies are used not in a fixed way. 
Table 6 summarizes the combination of the length of English and Japanese word sequences. 
The fraction in each entry shows the number of correct pairs over the number of extracted pairs. 
This table indicates that translation pairs of lengthy or unbalanced sequences are safely regarded 
83 
Num. of threshold 
pairs 
68 
34 
17 
10 
9 
8 
7 
6 
5 
4 
3 
2 
total 
1 
21 
69 
142 
52 
69 
66 
105 
168 
292 
536 
1,307(500) 
2,828(2,021) 
near correct 
miss 
1 0 
19 1 
64 5 
133 8 
49 3 
69 0 
63 2 
99 6 
155 12 
263 25 
494 34 
(445) (46) 
(1,854) (129) 
incorrect 
0 
1 
0 
1 
1 
0 
1 
0 
1 
4 
8 
(9) 
correctness 
(+near)% 
lOO(lOO) 
90.5(95.2) 
92.8(10o) 
93.7(99.3) 
94.2(98.1) 
100(100) 
95.5(98.5) 
94.3(100) 
92.3(98.8) 
90.1(98.6) 
92.2(97.2) 
89.0(97.4) 
accumulative 
correctness% 
100(100) 
90.9(95.5) 
92.3(98.9) 
93.1(99.1) 
93.3(97.9) 
94.6(99.2) 
94.7(99.0) 
94.7(99.2) 
94.1(99.1) 
92.9(99.0) 
92.6(98.4) 
91.7(98.1) 
(38) 91.7(98.1) 
Table 4: Results of Science Journal 
threshold Num. of i correct pairs 
209 
104 
52 
26 
13 
10 
9 
8 
7 
6 
5 
4 
3 
2 
total 
1 
4 
19 
55 
145 
81 
58 
75 
106 
126 
214 
367 
629 
1,401(500) 
3,281(2,380) 
near ~ incorrect miss 
1 0 0 
4 0 0 
19 0 0 
54 0 1 
140 5 0 
76 5 0 
55 2 1 
68 5 2 
99 7 0 
118 7 1 
198 13 3 
330 26 11 
519 97 13 
(395 / (87) (18) 
(2,076) (254) (50) 
correctness 
(+near)% 
100(100) 
100(100) 
100(100) 
98.1(98.1) 
96.6(100) 
93.8(100) 
94.8(98.3) 
90.7(93.6) 
93.4(100) 
93.7(99.2) 
92.5(98.6) 
89.9(97.0) 
82.5(97.9) 
79.0(96.4) 
87.2(97.9) 
accumulative 
correctness% 
100(100) 
lOO(lOO) 
100(100) 
98.7(98.7) 
97.3(99.6) 
96.4(99.7) 
96.1(99.4) 
95.2(99.1) 
94.9(99.3) 
94.6(99.3) 
94.1(99.1) 
92.9(98.5) 
89.4(98.3) 
87.2(97.9) 
Table 5: Results of Computer Manual 
-Business Length of Eng. Seq. 
Letters 1 2 3 4 5 6 7 8 9 10 
Length 
of 
Jap. 
Seq. 
1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
823/843 43/58 0/6 0/1 0 
32/45 401/450 17/55 1/23 0/5 
0 79/122 72/90 7/23 0/8 
0 6/21 29/45 15/23 2/5 
0 3/10 2/13 7/14 3/10 
0 0 2/4 2/3 0/1 
0 0/1 0 0/2 0 
0 0/1 0 0/1 0/1 
0 0 0 0 0 
o o/1 o/1 o o 
0 0 0 0 0 
0/4 0/1 0 0/1 0 
o/4 o/4 o o o 
1/2 0/1 0 0 0/1 
2/3 0 0/1 0/1 0/1 
0/2 0 0/1 0/1 0/2 
0/1 0 0/1 0 0 
0/1 0 0 0 0/1 
0 0 0 0 0 
1/1 o o o 0/6 
Table 6: Length Combination of Word Sequences and their Accuracy (Business Letter) 
84 
Japanese English Similarity 
-- 1. Business Letter -- 
~¢ (~) ~ 
~, (/¢) ~z~ ~ t~ 
-- 2. Science Journal -- 
u.x 79..~.x I~!.£ ~F.r 
n ~ "~9 ~-~ 
~4,P.x $.~ I-'7-~ 
-- 3. Computer Manual -- 
4 >'#$'~' l- 7VPX 
.'f >'#~ 1- 7"= b~V I P 
exclusive license 4.95 
dispute(,) controversy (or) difference (which may) arise 4.34 
trade secret 3.72 
effective date (of this) agreement 3.12 
business hour 2.92 
utility model(,) trademark(,) design (or) copyright 2.81 
irrevocable confirm(ed) letter (of) credit 2.81 
technique manufacture know-how 2.62 
patent(s)(,) know-how (or) technical information 1.06 
hemorrhage fever virus 3.19 
methyl acrylate 3.17 
Los Alamos national laboratory 2 
n type 1.78 
p type 1.78 
university (of) California (at) Davis 1.58 
wireless network 1.19 
fiber(-)optic network 1.19 
hemorrhage fever 1.14 
internet 5.25 
internet address 2.83 
double precision float point 1.79 
internet protocol IP 1.78 
internet protocol 1.66 
DoD internet 1.6 
name (to) address map(ping) 1.58 
internet service 1.45 
indicates "near miss" and * indicates "incorrect". 
Table 7: Samples of Corresponding Word Sequences 
as incorrect correspondences. 
Tables 7 and 8 list samples of translation pairs extracted from the experiments. Table 7 lists 
some of typical word sequence pairs. Many of Japanese translation of English technical terms are 
automatically detected. Table 8 lists the top 30 pairs from the experiment on the business contract 
letters. 
The method is capable of getting interesting translation patterns. For example, "~l~l~" and 
"~ll~" are found to correspond to "trade secret" and "business hour" respectively. Note that 
Japanese word "~" is translated into different English words according to their occurrences with 
distinct word. 
Table 9 shows the recall ratio based on the results of the experiments. The figures show the 
numbers of words that are included at least one extracted translation pairs. The recall rates 
are shown in parentheses, which indicates how much proportion of the words with two or more 
occurrences in the corpora are finally participated in at least one translation pair. The major 
reason that the recall is not sufficiently high is that we decided to use a rather severe condition on 
selecting a translation pairs in Step 6 in the algorithm. The condition may be loosen to get better 
recall ratio though we may lose high precision. We have not yet tested our method with other 
conditions. 
85 
Japanese English 
-- Freq.Stage 1151 -- 
A 
A 
~}± company 
,1' ~ ~ 5,'-- licensee 
Freq.Stage 575 -- 
~ distributor 
~i~ ~ product 
9 -~ seller 
Freq.Stage 287 -- 
~- buyer 
~ party 
~ writing 
article 
Freq.Stage 143- 
b b 
a a 
A B C ABC 
~ information 
X Y Z XYZ 
~ patent 
~ technical 
~J right 
~ trademark 
C C 
~:t~ territory 
~,~ necessary 
Freq.Stage 71 -- 
~ ~ technical information 
~ consignee 
4 "Y )l, ~" 4 royalty 
~gJ PAT hereinafter 
d d 
~ sale 
~ exclusive 
~ manufacture 
i~ obligation 
: 
10.73 
10.47 
9.55 
9.26 
9.24 
8.92 
8.84 
8.39 
8.34 
8.07 
8.01 
7.99 
7.87 
7.77 
7.65 
7.64 
7.60 
7.50 
7.41 
7.26 
7.26 
7.08 
6.99 
6.84 
6.82 
6.76 
6.75 
6.74 
6.72 
6.59 
3952 
2436 
1471 
2511 
999 
940 
1276 
754 
778 
332 
324 
354 
489 
327 
455 
520 
664 
369 
218 
668 
332 
214 
225 
295 
198 
126 
626 
235 
578 
228 
4081 
2521 
1562 
2996 
1039 
970 
1394 
860 
837 
345 
335 
362 
549 
333 
545 
558 
869 
401 
231 
693 
356 
227 
244 
377 
223 
130 
804 
278 
930 
255 
4720 
2715 
1679 
3127 
1116 
1112 
1584 
858 
955 
344 
340 
388 
561 
370 
505 
670 
769 
438 
226 
1033 
410 
241 
259 
331 
220 
130 
920 
271 
648 
287 
A indicates "near miss". 
Table 8: Sample of Top Correspondences (Business Letter(Best 30)) 
Corpus English 
Business 867 
Science 2,240 
Manual 1,922 
(recall) Japanese (recall) 
(39.1%) 1,005 (28.2%) 
(33.1%) 2,359 (26.6%) 
(55.3%) 2,224 (46.3%) 
Table 9: Numbers of words identified 
86 
6 Conclusion 
A method for obtaining translation dictionary from parallel corpora was proposed, in which not 
only word-word correspondences but arbitrary length word sequence correspondences are extracted. 
This work is originally motivated for the purpose of improving the performance of our translation 
pattern extraction from parallel corpora \[Kitamura & Matsumoto 95\], in which translation patterns 
are extracted by syntactically analyzing both Japanese and English sentences and by structurally 
matching them. Some discrepancy is caused by poor quality of translation dictionary. This is why 
we tried to pursue a way to obtain better translation dictionary from parallel corpora. We believe 
that the proposed method gives results of good performance compared with previous related work. 
The translation pairs obtained through our method are directly usable as the base resource for MT 
systems based on translation memory \[Lehmann 95\]. 
We hope to acquire better translation patterns by combining the current results with our work 
of structural matching for finding out fine grained correspondence. 

References 
P.F. Brown. A Statistical Approach to Language Translation. In COLING-88, volume 1, pages 
71-76, 1988. 
M. Kay and M. RSscheisen. Text-Translation Alignment. Computational Linguistics, 19(1):121- 
142, 1993. 
M. Kitarnura and Y. Matsumoto. A Machine Translation System based 6n Translation Rules 
Acquired from Parallel Corpora. In Recent Advances in Natural Language Processing, pages 
27-44, 1995. 
A. Kumano and H. Hirakawa. Building an MT Dictionary from Parallel Texts Based on Linguistic 
and Statistical Information. In COLING-94, volume 1, pages 76-81, 1994. 
J. Kupiec. An Algorithm for Finding Noun Phrase Correspondences in Bilingual Corpora. In 
31st Annual Meeting of the Association for Computational Linguistics * Proceedings of the 
Conference (ACL93), pages 23-30, 1993. 
H. Lehmann. Machine Translation for Home and Business Users. In Proceedings of MT Summit 
V,1995 
K. Ohmori, J. Tsutsumi, and M. Nakanishi. Building Bilingual Word Dictionary Based on Statisti- 
cal Information. In Proceedings of The Second Annual Meeting of The Association for Natural 
Language Processing, pages 49-52, 1996. (in Japanese) 
F. Smadja, K.R. McKeown and V. Hatzivassiloglou. Translating Collocations for Bilingual Lexi- 
cons: A Statistical Approach. Computational Linguistics, 22(1):1-38, 1996. 
