A Comparative Study on Translation Units
for Bilingual Lexicon Extraction
Kaoru Yamamotoa0 and Yuji Matsumotoa0 and Mihoko Kitamuraa0a2a1
Graduate School of Information Science, Nara Institute of Science and Technologya0
8916-5 Takayama, Ikoma, Nara, Japan
a3 kaoru-ya, matsu, mihoko-k
a4 @is.aist-nara.ac.jp
Research & Development Group, Oki Electric Industry Co., Ltd.a1
kita@kansai.oki.co.jp
Abstract
This paper presents on-going research
on automatic extraction of bilingual
lexicon from English-Japanese paral-
lel corpora. The main objective of
this paper is to examine various N-
gram models of generating transla-
tion units for bilingual lexicon ex-
traction. Three N-gram models, a
baseline model (Bound-length N-gram)
and two new models (Chunk-bound N-
gram and Dependency-linked N-gram)
are compared. An experiment with
10000 English-Japanese parallel sen-
tences shows that Chunk-bound N-
gram produces the best result in terms
of accuracy (83%) as well as coverage
(60%) and it improves approximately
by 13% in accuracy and by 5-9% in
coverage from the previously proposed
baseline model.
1 Introduction
Developments in statistical or example-based MT
largely rely on the use of bilingual corpora.
Although bilingual corpora are becoming more
available, they are still an expensive resource
compared with monolingual corpora. So if one
is fortune to have such bilingual corpora at hand,
one must seek the maximal exploitation of lin-
guistic knowledge from the corpora.
This paper presents on-going research on au-
tomatic extraction of bilingual lexicon from
English-Japanese parallel corpora. Our approach
owes greatly to recent advances in various NLP
tools such as part-of-speech taggers, chunkers,
and dependency parsers. All such tools are
trained from corpora using statistical methods
or machine learning techniques. The linguistic
“clues” obtained from these tools may be prone
to some error, but there is much partially reliable
information which is usable in the generation of
translation units from unannotated bilingual cor-
pora.
Three N-gram models of generating transla-
tion units, namely Bound-length N-gram, Chunk-
bound N-gram, and Dependency-linked N-gram
are compared. We aim to determine character-
istics of translation units that achieve both high
accuracy and wide coverage and to identify the
limitation of these models.
In the next section, we describe three mod-
els used to generate translation units. Section 3
explains the extraction algorithm of translation
pairs. In Sections 4 and 5, we present our ex-
perimental results and analyze the characteristics
of each model. Finally, Section 6 concludes the
paper.
2 Models of Translation Units
The main objective of this paper is to determine
suitable translation units for the automatic acqui-
sition of translation pairs. A word-to-word cor-
respondence is often assumed in the pioneering
works, and recently Melamed argues that one-to-
one assumption is not restrictive as it may appear
in (Melamed, 2000). However, we question his
claim, since the tokenization of words for non-
segmented languages such as Japanese is, by na-
ture, ambiguous, and thus his one-to-one assump-
tion is difficult to hold. We address this ambigu-
ity problem by allowing ’overlaps’ in generation
of translation units and obtain single- and multi-
word correspondences simultaneously.
Previous works that focus on multi-word
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 28 .
Figure 1: sample sentence
Pierre
Pierre-Vinken
Pierre-Vinken-years
Pierre-Vinken-years-old
Pierre-Vinken-years-old-join
Vinken
Vinken-years
Vinken-years-old
Vinken-years-old-join
Vinken-years-old-join-board
years
years-old
years-old-join
years-old-join-board
years-old-join-board-nonexecutive
old
old-join
old-join-board
old-join-board-nonexecutive
old-join-board-nonexecutive-director
join
join-board
join-board-nonexecutive
join-board-nonexecutive-director
join-board-nonexecutive-director-Nov.
board
board-nonexecutive
board-nonexecutive-director
board-nonexecutive-director-Nov
nonexecutive
nonexecutive-director
nonexecutive-director-Nov
director
director-Nov
Nov
Figure 2: Bound-length N-gram
correspondences include (Kupiec, 1993) where
NP recognizers are used to extract translation
units and (Smadja et al., 1996) which uses the
XTRACT system to extract collocations. More-
over, (Kitamura and Matsumoto, 1996) extracts
an arbitrary length of word correspondences
and (Haruno et al., 1996) identifies collocations
through word-level sorting.
In this paper, we compare three N-gram mod-
els of translation units, namely Bound-length N-
gram, Chunk-bound N-gram, and Dependency-
linked N-gram. Our approach of extracting bilin-
gual lexicon is two-staged. We first prepare N-
grams independently for each language in the par-
allel corpora and then find corresponding trans-
lation pairs from both sets of translation units
in a greedy manner. The essence of our algo-
rithm is that we allow some overlapping transla-
tion units to accommodate ambiguity in the first
stage. Once translation pairs are detected dur-
ing the process, they are decisively selected, and
the translation units that overlaps with the found
translation pairs are gradually ruled out.
In all three models, translation units of N-gram
are built using only content (open-class) words.
This is because functional (closed-class) words
such as prepositions alone will usually act as
noise and so they are filtered out in advance.
A word is classified as a functional word if it
matches one of the following conditions. (The
Penn Treebank part-of-speech tag set (Santorini,
1991) is used for English, whereas the ChaSen
part-of-speech tag set (Matsumoto and Asahara,
2001) is used for Japanese.)
part-of-speech(J) “a5a7a6 -a8a9a5a10a6 ”, “a5a11a6 -a12 ”, “a5a11a6 -a13a15a14
a16 ”, “
a5a17a6 -a18a7a19 ”, “a5a17a6 -a20a11a21 -a22a24a23a7a6a11a25a27a26 ”, “a5a28a6 -a20
a21 -a29a7a6a10a30a27a31 ”, “a5a17a6 -a20a11a21 -a22a27a23a7a6 ”, “a20a17a32a10a6 ”, “a23a7a6 -
a20a10a21 ”, “a23a7a6 -a13a33a14
a16 ”, “
a22a11a6 ”, “a22a34a23a10a6 ”, “a35a37a36a10a6 -a13
a14
a16 ”, “
a35a27a36a10a6 -a20a38a21 ”, “a39a10a40 ”
part-of-speech(E) “CC”, “CD”, “DT”, “EX”, “FW”, “IN”,
“LS”, “MD”, “PDT”, “PR”, “PRS”, “TO”, “WDT”,
“WD”, “WP”
stemmed-form(E) “be”
symbols punctuations and brackets
We now illustrate the three models of transla-
tion units by referring to the sentence in Figure
1.
Pierre
Pierre-Vinken
Vinken
years old
join board
nonexecutive
nonexecutive-director
director
Nov
Figure 3: Chunk-bound N-gram
Bound-length N-gram
Bound-length N-gram is first proposed in (Ki-
tamura and Matsumoto, 1996). The translation
units generated in this model are word sequences
from uni-gram to a given length N. The upper
bound for N is fixed to 5 in our experiment. Fig-
ure 2 lists a set of N-grams generated by Bound-
length N-gram for the sentence in Figure 1.
Chunk-bound N-gram
Chunk-bound N-gram assumes prior knowledge
of chunk boundaries. The definition of “chunk”
varies from person to person. In our experiment,
the definition for English chunk task complies
with the CoNLL-2000 text chunking tasks and the
definition for Japanese chunk is based on “bun-
setsu” in the Kyoto University Corpus.
Unlike Bound-length N-gram, Chunk-bound
N-gram will not extend beyond the chunk bound-
aries. N varies depending on the size of the
chunks1. Figure 3 lists a set of N-grams gener-
ated by Chunk-bound N-gram for the sentence in
Figure 1.
Dependency-linked N-gram
Dependency-linked N-gram assumes prior
knowledge of dependency links among chunks.
In fact, Dependency-linked N-gram is an en-
hanced model of the Chunk-bound model in
that, Dependency-linked N-gram extends chunk
boundaries via dependency links. Although
dependency links could be extended recur-
sively in a sentence, we limit the use to direct
dependency links (i.e. links of immediate
mother-daughter relations) only. Two chunks of
dependency linked are concatenated and treated
as an extended chunks. Dependency-linked
N-gram generates translation units within the
1The average number of words in English and Japanese
chunks are 2.1 and 3.4 respectively for our parallel corpus.
Pierre
Pierre-Vinken
Pierre-Vinken-join
Vinken
Vinken-join
years
years-old
old
Pierre-Vinken-old
join board
join-board
nonexecutive
nonexecutive-director
director
Nov.
join-Nov
Figure 4: Dependency-linked N-gram
extended boundaries. Therefore, translation units
generated by Dependency-linked N-gram (Figure
4) become the superset of the units generated by
Chunk-bound N-gram (Figure 3).
The distinct characteristics of Dependency-
linked N-gram from previous works are two-fold.
First, (Yamamoto and Matsumoto, 2000) also
uses dependency relations in the generation of
translation units. However, it suffers from data
sparseness (and thus low coverage), since the en-
tire chunk is treated as a translation unit, which is
too coarse. Dependency-linked N-gram, on the
other hand, uses more fine-grained N-grams as
translation units in order to avoid sparseness. Sec-
ond, Dependency-linked N-gram includes “flex-
ible” or non-contiguous collocations if depen-
dency links are distant in a sentence. These col-
locations cannot be obtained by Bound-length N-
gram with any N.
3 Translation Pair Extraction
We use the same algorithm as (Yamamoto and
Matsumoto, 2000) for acquiring translation pairs.
The algorithm proceeds in a greedy manner. This
means that the translation pairs found earlier (i.e.
at a higher threshold) in the algorithm are re-
garded as decisive entries. The threshold acts
as the level of confidence. Moreover, translation
units that partially overlap with the already found
translation pairs are filtered out during the algo-
rithm.
The correlation score between translation units
a41a43a42 and a41a45a44 is calculated by the weighted Dice Co-
efficient defined as:
a46a48a47a50a49a52a51a53a41a43a42a55a54a50a41a45a44a57a56a59a58a60a51a62a61a62a63a65a64a67a66a69a68a70a42a71a44a48a56a73a72
a68a2a42a74a44
a68a70a42a76a75a77a68a65a44
model English Japanese
Bound-length 8479 11587
Chunk-bound 4865 5870
Dependency-linked 8716 11068
Table 1: Number of Translation Units
where a68a55a44 and a68a70a42 are the numbers of occurrences of
a41 a42 anda41 a44 in Japanese and English corpora respec-
tively, and a68a70a42a71a44 is the number of co-occurrences of
a41a43a42 and a41a45a44 .
We repeat the following until the current
threshold a68a70a78a71a79a55a80a81a80 reaches the predefined minimum
threshold a68a83a82a85a84a53a86 .
1. For each pair of English unit a87a89a88 and Japanese unit a87a48a90
appearing at least a91a69a92a94a93a96a95a62a95 times, identify the most likely
correspondences according to the correlation scores.
a97 For an English pattern
a87a89a88 , obtain the correspon-
dence candidate set PJ = a98a99a87a48a90a81a100 , a87a48a90a62a101 , ..., a87a83a90a103a102a11a104
such that sim(a87 a88 ,a87 a90a103a105 ) a106a108a107a53a109a96a110
a101
a91a69a92a94a93a111a95a62a95 for all k.
Similarly, obtain the correspondence candidate
set PE for a Japanese patterna87 a90
a97 Register (
a87a70a88 ,a87a65a90 ) as a translation pair if
a87 a90a113a112 a114a69a115a55a116a69a117a2a114a69a118
a119a81a120a94a121a69a122a48a123a45a124a48a125a96a126a128a127a27a129
a87 a88a111a130 a87 a90a103a105a55a131
a87 a88a132a112 a114a69a115a55a116a69a117a2a114a69a118
a119a134a133a71a121a65a122a48a123a136a135a136a125a96a126a128a127a27a129
a87 a90a55a130 a87 a88a62a105a57a131
The correlation score of (a87a89a88 ,a87a65a90 ) is the highest
among PJ fora87a136a88 and PE fora87a48a90 .
2. Filter out the co-occurrence positions for a87 a88 , a87 a90 , and
their overlapped translation units.
3. Lower a91a69a92a71a93a96a95a62a95 if no more pairs are found.
4 Experiment and Result
4.1 Experimental Setting
Data for our experiment is 10000 sentence-
aligned corpus from English-Japanese business
expressions (Takubo and Hashimoto, 1995). 8000
sentences pairs are used for training and the re-
maining 2000 sentences are used for evaluation.
Since the data are unannotated, we use NLP
tools (part-of-speech taggers, chunkers, and de-
pendency parsers) to estimate linguistic informa-
tion such as word segmentation, chunk bound-
aries, and dependency links. Most tools employ a
statistical model (Hidden Markov Model) or ma-
chine learning (Support Vector Machines).
Translation units that appear at least twice are
considered to be candidates for the translation
a91 a92a94a93a111a95a62a95 e c acc e’ c’ acc’
100.0 0 0 n/a 0 0 n/a
50.0 0 0 n/a 0 0 n/a
25.0 1 1 1.0000 1 1 1.0000
12.0 10 9 0.9000 11 10 0.9090
10.0 9 9 1.0000 20 19 0.9500
9.0 7 7 1.0000 27 26 0.9629
8.0 10 10 1.0000 37 36 0.9729
7.0 12 11 0.9166 49 47 0.9591
6.0 25 25 1.0000 74 72 0.9729
5.0 29 28 0.9655 103 100 0.9708
4.0 70 68 0.9714 173 168 0.9710
3.0 114 109 0.9561 287 277 0.9651
2.0 646 490 0.7585 933 767 0.8220
1.9 58 54 0.9310 991 821 0.8284
1.8 67 60 0.8955 1058 881 0.8327
1.7 186 131 0.7043 1244 1012 0.8135
1.6 105 93 0.8857 1349 1105 0.8191
1.5 220 161 0.7318 1569 1266 0.8068
1.4 267 182 0.6816 1836 1448 0.7886
1.3 309 228 0.7378 2145 1676 0.7813
1.2 459 312 0.6797 2604 1988 0.7634
1.1 771 404 0.5239 3375 2392 0.7087
Table 2: Accuracy(Bound-Length N-gram)
pair extraction algorithm described in the previ-
ous section. This implies that translation pairs
that co-occur only once will never be found in
our algorithm. We believe this is a reasonable
sacrifice to bear considering the statistical nature
of our algorithm. Table 1 shows the number of
translation units found in each model. Note that
translation units are counted not by token but by
type.
We adjust the threshold of the translation pair
extraction algorithm according to the following
equation. The threshold a68a2a78a71a79a55a80a81a80 is initially set to
100 and is gradually lowered down until it reaches
the minimum threshold a68a83a82a85a84a53a86 2, described in Sec-
tion 3. Furthermore, we experimentally decre-
ment the threshold a68 a78a71a79a55a80a81a80 from 2 to 1 with the
remaining uncorrelated sets of translation units,
all of which appear at least twice in the corpus.
This means that translation pairs whose correla-
tion score is 1 a137 sim(a41a138a42 ,a41a139a44 ) a137 0 are attempted to
find correspondences2.
2Note that
a91 a92a50a140a2a141a111a141 plays two roles: (1) threshold for the
co-occurrence frequency, and (2) threshold for the correla-
tion score. During the decrement of a91a69a92 a140a70a141a96a141 form 2 to 1,
the effect is solely on the latter threshold (for the correla-
tion score), and the former threshold (for the co-occurrence
frequency) does not alter and remains 2.
a91 a92a94a93a96a95a62a95 e c acc e’ c’ acc’
100.0 1 1 1.0 1 1 1.0
50.0 13 12 0.9230 14 13 0.9285
25.0 26 25 0.9615 40 38 0.95
12.0 63 61 0.9682 103 99 0.9611
10.0 35 35 1.0 138 134 0.9710
9.0 20 20 1.0 158 154 0.9746
8.0 17 16 0.9411 175 170 0.9714
7.0 40 39 0.975 215 209 0.9720
6.0 38 37 0.9736 253 246 0.9723
5.0 84 84 1.0 337 330 0.9792
4.0 166 160 0.9638 503 490 0.9741
3.0 198 195 0.9848 701 685 0.9771
2.0 870 816 0.9379 1571 1501 0.9554
1.9 112 106 0.9464 1683 1607 0.9548
1.8 109 105 0.9633 1792 1712 0.9553
1.7 266 239 0.8984 2058 1951 0.9480
1.6 155 139 0.8967 2213 2090 0.9444
1.5 292 253 0.8664 2505 2343 0.9353
1.4 365 327 0.8958 2870 2670 0.9303
1.3 448 391 0.8727 3318 3061 0.9225
1.2 599 483 0.8063 3917 3544 0.9047
1.1 890 481 0.5404 4807 4025 0.8373
Table 3: Accuracy(Chunk-bound N-gram)
a68a70a78a71a79a55a80a81a80a142a58
a143
a144a145
a91 a92a71a93a96a95a62a95a96a146a69a147
a129
a91 a92a71a93a96a95a62a95 a106 a147a96a148 a131
a149
a148
a129
a147a96a148a151a150 a91a57a92a71a93a96a95a62a95a152a106
a149
a148 a131
a91 a92a71a93a96a95a62a95a154a153
a149
a129
a149
a148a151a150 a91 a92a71a93a96a95a62a95a152a150a155a147 a131
a91 a92a71a93a96a95a62a95a154a153a38a148a48a156
a149
a129
a147 a106a157a91 a92a94a93a111a95a62a95 a106
a149
a131
The result is evaluated in terms of accuracy
and coverage. Accuracy is the number of cor-
rect translation pairs over the extracted translation
pairs in the algorithm. This is calculated by type.
Coverage measures “applicability” of the correct
translation pairs for unseen test data. It is the
number of tokens matched by the correct trans-
lation pairs over the number of tokens in the un-
seen test data. Acuracy and coverage roughly cor-
respond to Melamed’s precision and percent cor-
rect respectively (Melamed, 1995). Accuracy is
calculated on the training data (8000 sentences)
manually, whereas coverage is calculated on the
test data (2000 sentences) automatically.
4.2 Accuracy
Stepwise accuracy for each model is listed in Ta-
ble 2, Table 3, and Table 4. “a68a70a78a71a79a55a80a158a80 ” indicates
the threshold, i.e. stages in the algorithm. “e”
is the number of translation pairs found at stage
“a68a2a78a71a79a55a80a81a80 ”, and “c” is the number of correct ones
found at stage “a68a2a78a71a79a55a80a81a80 ”. The correctness is judged
by an English-Japanese bilingual speaker. “acc”
a91 a92a94a93a111a95a62a95 e c acc e’ c’ acc’
100.0 1 1 1.0 1 1 1.0
50.0 13 12 0.9230 14 13 0.9285
25.0 26 25 0.9615 40 38 0.95
12.0 62 60 0.9677 102 98 0.9607
10.0 32 31 0.9687 134 129 0.9626
9.0 20 23 1.15 158 152 0.9620
8.0 16 16 1.0 174 168 0.9655
7.0 43 43 1.0 217 211 0.9723
6.0 40 39 0.975 257 250 0.9727
5.0 85 83 0.9764 342 333 0.9736
4.0 166 162 0.9759 508 495 0.9744
3.0 205 201 0.9804 713 696 0.9761
2.0 949 849 0.8946 1662 1545 0.9296
1.9 115 107 0.9304 1777 1652 0.9296
1.8 105 103 0.9809 1882 1755 0.9325
1.7 268 230 0.8582 2150 1985 0.9232
1.6 156 145 0.9294 2306 2130 0.9236
1.5 288 244 0.8472 2594 2374 0.9151
1.4 373 300 0.8042 2967 2674 0.9012
1.3 434 344 0.7926 3401 3018 0.8873
1.2 576 383 0.6649 3977 3401 0.8551
1.1 855 417 0.4877 4832 3818 0.7901
Table 4: Accuracy(Dependency-linked N-gram)
lists accuracy, the fraction of correct ones over ex-
tracted ones by type. The accumulated results for
“e”, “c” and “acc” are indicated by ’.
4.3 Coverage
Stepwise coverage for each model is listed in Ta-
ble 5, Table 6, and Table 7. As before, “a68a2a78a71a79a55a80a81a80 ”
indicates the threshold. The brackets indicate
language: “E” for English and “J” for Japanese.
“found” is the number of content tokens matched
with correct translation pairs. “ideal” is the upper
bound of content tokens that should be found by
the algorithm; it is the total number of content to-
kens in the translation units whose co-occurrence
frequency is at least “a68 a78a71a79a55a80a81a80 ” times in the original
parallel corpora. “cover” lists coverage. The pre-
fix “i ” is the fraction of found tokens over ideal
tokens and the prefix “t ” is the fraction of found
tokens over the total number of both content and
functional tokens in the data. For 2000 test par-
allel sentences, there are 30255 tokens in the En-
glish half and 38827 tokens in the Japanese half.
The gap between the number of “ideal” tokens
and that of total tokens is due to filtering of func-
tional words in the generation of translation units.
a91 a92a71a93a96a95a62a95 found(E) ideal(E) i cover(E) t cover(E) found(J) ideal(J) i cover(J) t cover(J)
100.0 0 445 0 0 0 486 0 0
50.0 0 1182 0 0 0 1274 0 0
25.0 46 2562 0.0179 0.0015 46 2564 0.0179 0.0011
12.0 156 4275 0.0364 0.0051 146 4407 0.0331 0.0037
10.0 344 4743 0.0725 0.0113 334 4935 0.0676 0.0086
9.0 465 4952 0.0939 0.0153 455 5247 0.0867 0.0117
8.0 511 5242 0.0974 0.0168 501 5593 0.0895 0.0129
7.0 577 5590 0.1032 0.0190 567 5991 0.0946 0.0146
6.0 744 5944 0.1251 0.0245 734 6398 0.1147 0.0189
5.0 899 6350 0.1415 0.0297 891 6894 0.1292 0.0229
4.0 1193 6865 0.1737 0.0394 1195 7477 0.1598 0.0307
3.0 1547 7418 0.2085 0.0511 1549 8257 0.1875 0.0398
2.0 2594 8128 0.3191 0.0857 2617 9249 0.2829 0.0674
1.9 2686 8128 0.3304 0.0887 2713 9249 0.2933 0.0698
1.8 2831 8128 0.3483 0.0935 2858 9249 0.3090 0.0736
1.7 2952 8128 0.3631 0.0975 2983 9249 0.3225 0.0768
1.6 3180 8128 0.3912 0.1051 3214 9249 0.3474 0.0827
1.5 3387 8128 0.4167 0.1119 3423 9249 0.3700 0.0881
1.4 3587 8128 0.4413 0.1185 3628 9249 0.3922 0.0934
1.3 3836 8128 0.4719 0.1267 3901 9249 0.4217 0.1004
1.2 4106 8128 0.5051 0.1357 4184 9249 0.4523 0.1077
1.1 4470 8128 0.5499 0.1477 4558 9249 0.4928 0.1173
Table 5: Coverage(Bound-length N-gram)
a91 a92a71a93a96a95a62a95 found(E) ideal(E) i cover(E) t cover(E) found(J) ideal(J) i cover(J) t cover(J)
100.0 52 1374 0.0378 0.0017 52 1338 0.0388 0.0013
50.0 371 2813 0.1318 0.0122 372 2643 0.1407 0.0095
25.0 695 5019 0.1384 0.0229 696 4684 0.1485 0.0179
12.0 1251 7129 0.1754 0.0413 1246 6873 0.1812 0.0320
10.0 1478 7629 0.1937 0.0488 1463 7441 0.1966 0.0376
9.0 1607 7917 0.2029 0.0531 1590 7715 0.2060 0.0409
8.0 1690 8208 0.2058 0.0558 1673 8075 0.2071 0.0430
7.0 1893 8535 0.2217 0.0625 1879 8463 0.2220 0.0483
6.0 2023 8939 0.2263 0.0668 2015 8854 0.2275 0.0518
5.0 2464 9390 0.2624 0.0814 2445 9318 0.2623 0.0629
4.0 2893 9800 0.2952 0.0956 2882 9891 0.2913 0.0742
3.0 3425 10380 0.3299 0.1132 3439 10625 0.3236 0.0885
2.0 4702 11020 0.4266 0.1220 4737 11439 0.4141 0.1220
1.9 4869 11020 0.4418 0.1609 4906 11439 0.4288 0.1263
1.8 5020 11020 0.4555 0.1659 5057 11439 0.4420 0.1302
1.7 5177 11020 0.4697 0.1711 5214 11439 0.4558 0.1342
1.6 5388 11020 0.4889 0.1780 5423 11439 0.4740 0.1396
1.5 5621 11020 0.5100 0.1857 5676 11439 0.4961 0.1461
1.4 5907 11020 0.5360 0.1952 5971 11439 0.5219 0.1537
1.3 6227 11020 0.5650 0.2058 6298 11439 0.5505 0.1622
1.2 6513 11020 0.5910 0.2152 6589 11439 0.5760 0.1697
1.1 6787 11020 0.6158 0.2243 6874 11439 0.6009 0.1770
Table 6: Coverage(Chunk-bound N-gram)
a91a57a92a71a93a96a95a62a95 found(E) ideal(E) i cover(E) t cover(E) found(J) ideal(J) i cover(J) t cover(J)
100.0 52 1370 0.0379 0.0017 52 1334 0.0389 0.0013
50.0 370 2806 0.1318 0.0122 371 2629 0.1411 0.0095
25.0 693 5010 0.1383 0.0229 694 4675 0.1484 0.0178
12.0 1238 7117 0.1739 0.0409 1233 6845 0.1801 0.0317
10.0 1429 7611 0.1877 0.0472 1424 7428 0.1917 0.0366
9.0 1583 7906 0.2002 0.0523 1576 7714 0.2043 0.0405
8.0 1689 8201 0.2059 0.0558 1682 8074 0.2083 0.0433
7.0 1945 8522 0.2282 0.0642 1925 8455 0.2276 0.0495
6.0 2083 8930 0.2332 0.0688 2064 8854 0.2331 0.0531
5.0 2481 9376 0.2646 0.0820 2458 9317 0.2638 0.0633
4.0 2918 9792 0.2979 0.0964 2901 9893 0.2932 0.0747
3.0 3473 10367 0.3350 0.1147 3490 10633 0.3282 0.0898
2.0 4736 11011 0.4301 0.1565 4769 11450 0.4165 0.1228
1.9 4893 11011 0.4443 0.1617 4926 11450 0.4302 0.1268
1.8 5032 11011 0.4569 0.1663 5063 11450 0.4421 0.1303
1.7 5155 11011 0.4681 0.1703 5192 11450 0.4534 0.1337
1.6 5369 11011 0.4876 0.1774 5398 11450 0.4714 0.1390
1.5 5630 11011 0.5113 0.1860 5672 11450 0.4953 0.1460
1.4 5908 11011 0.5365 0.1952 5963 11450 0.5207 0.1535
1.3 6205 11011 0.5635 0.2050 6275 11450 0.5480 0.1616
1.2 6415 11011 0.5825 0.2120 6487 11450 0.5665 0.1670
1.1 6657 11011 0.6045 0.2200 6744 11450 0.5889 0.1736
Table 7: Coverage(Dependency-linked N-gram)
Bounded
Chunk Dependency
(1)
(2)
(3)
(4)
(5)
(6) (7)
(1) 1992   (5) 237
(2)  115   (6) 471
(3) 1447   (7) 331
(4)   48
Figure 5: Venn diagram
5 Discussion
Of the three models, Chunk-bound N-gram yields
the best performance both in accuracy (83%) and
in coverage (60%)3. Compared with the Bound-
length N-gram, it achieves approximately 13%
improvement in accuracy and 5-9% improvement
in coverage at threshold 1.1.
Although Bound-length N-gram generates
more translation units than Chunk-bound N-
gram, it extracts fewer correct translation pairs
(and results in low coverage). A possible expla-
nation for this phenomenon is that Bound-length
N-gram tends to generate too many unnecessary
translation units which increase the noise for the
3We did not evaluate results when
a91a69a92a94a93a111a95a62a95 = 1.0, since it
means threshold 0 , i.e. random pairing.
model English Japanese
B look forward a159a161a160a37a162a34a163a27a164a161a165
B look forward a166a10a167 -a164a168a165
B look forward a169 -a167a171a170
B do not hesitate a172a34a173 -a174a155a175
C Hong Kong a176a10a177
C San Diego a178a171a179a28a180a52a181a183a182a17a184
C Parker a185a38a186a24a187a34a186
D free (of) charge a188a27a189
D point out a190a10a191
D go press a192a17a193 (-a163 -) a194a17a164
D affect a195a24a196 (-a197 -) a198a24a199a24a165
Table 8: correct translation pairs
extraction algorithm.
Dependency-linked N-gram follows a similar
transition of accuracy and coverage as Chunk-
bound N-gram. Figure 5 illustrates the Venn di-
agram of the number of correct translation pairs
extracted in each model. As many as 3439 trans-
lation pairs from Dependency-linked N-gram and
Chunk-bound N-gram are found in common.
Based on these observation, we could say that
dependency links do not contribute significantly.
However, as dependency parsers are still prone
to some errors, we will need further investigation
with improved dependency parsers.
Table 8 lists the sample correct translation pairs
that are unique to each model. Most transla-
tion pairs unique to Chunk-bound N-gram are
named entities (NP compounds) and one-to-one
correspondence. This matches our expectation, as
translation units in Chunk-bound N-gram are lim-
ited within chunk boundaries. The reason why the
other two failed to obtain these translation pairs
is probably due to a large number of overlapped
translation units generated. Our extraction algo-
rithm filters out the overlapped entries once the
correct pairs are identified, and thus a large num-
ber of overlapped translation units sometimes be-
come noise.
Bound-length N-gram and Dependency-linked
N-gram include longer pairs, some of which are
idiomatic expressions. Theoretically speaking,
translation pairs like “look forward” should be ex-
tracted by Dependency-linked N-gram. A close
examination of the data reveals that in some sen-
tences, “look” and “forward” are not recognized
as dependency-linked. These preprocessing fail-
ures can be overcome by further improvement of
the tools used.
Based on the above analysis, we conclude that
chunking boundaries are useful clues in build-
ing bilingual seed dictionary as Chunk-bound N-
gram has demonstrated high precision and wide
coverage. However, for parallel corpora that in-
clude a great deal of domain-specific or idiomatic
expressions, partial use of dependency links is de-
sirable.
There is still a remaining problem with our
method. That is how to determine translation
pairs which co-occur only once. One simple ap-
proach is to use a machine-readable bilingual dic-
tionary. However, a more fundamental solution
may lie in the partial structural matching of par-
allel sentences (Watanabe et al., 2000). We in-
tend to incorporate these techniques to improve
the overall coverage.
6 Conclusion
This paper reports on-going research on extract-
ing bilingual lexicon from English-Japanese par-
allel corpora. Three models including a previ-
ously proposed one in (Kitamura and Matsumoto,
1996) are compared in this paper. Through pre-
liminary experiments with 10000 bilingual sen-
tences, we obtain that our new models (Chunk-
bound N-gram and Dependency-linked N-gram)
gain approximately 13% improvement in accu-
racy and 5-9% improvement in coverage from
the baseline model (Bound-length N-gram). We
present quantitative and qualitative analysis of the
results in three models. We conclude that chunk
boundaries are useful for building initial bilingual
lexicon, and that idiomatic expressions may be
partially handled with by dependency links.

References
M. Haruno, S. Ikehara, and T. Yamazaki. 1996.
Learning bilingual collocations by word-level sort-
ing. In COLING-96: The 16th International Con-
ference on Computational Linguistics, pages 525–
530.
M. Kitamura and Y. Matsumoto. 1996. Automatic ex-
traction of word sequence correspondences in par-
allel corpora. In Proc. 4th Workshop on Very Large
Corpora, pages 79–87.
J. Kupiec. 1993. An algorithm for finding noun phrase
correspondences in bilingual corpora. In ACL-93:
31st Annual Meeting of the Association for Compu-
tational Linguistics, pages 23–30.
Y. Matsumoto and M. Asahara. 2001. Ipadic users
manual. Technical report.
I.D. Melamed. 1995. Automatic evaluation and uni-
form filter cascades for inducing n-best translation
lexicons. In Proc. of 3rd Workshop on Very Large
Corpora, pages 184–198.
I.D. Melamed. 2000. Models of translational equiva-
lence. In Computational Linguistics, volume 26(2),
pages 221–249.
B. Santorini. 1991. Part-of-speech tagging guidelines
for the penn treebank project. Technical report.
F. Smadja, K.R. McKeown, and V. Hatzuvassiloglou.
1996. Translating collocations for billingual lexi-
cons: A statistical approach. In Computational Lin-
guistics, volume 22(1), pages 1–38.
K. Takubo and M. Hashimoto. 1995. A Dictionary
of English Bussiness Letter Expressions. Nihon
Keizai Shimbun, Inc.
H. Watanabe, S. Kurohashi, and E. Aramaki. 2000.
Finding structual correspondences from bilingual
parsed corpus for corpus-based translation. In
COLING-2000: The 18th International Conference
on Computational Linguistics, pages 906–912.
K. Yamamoto and Y. Matsumoto. 2000. Acquisi-
tion of phrase-level bilingual correspondence us-
ing dependency structure. In COLING-2000: The
18th International Conference on Computational
Linguistics, pages 933–939.
