Using the Web as a Bilingual Dictionary
Masaaki NAGATA
NTT Cyber Space Laboratories
1-1 Hikarinooka, Yokosuka-shi
Kanagawa, 239-0847 Japan
nagata@nttnly.isl.ntt.co.jp
Teruka SAITO
Chiba University
1-33 Yayoi-cho, Inage-ku
Chiba-shi, Chiba, 263-8522 Japan
t-saito@icsd4.tj.chiba-u.ac.jp
Kenji SUZUKI
Toyohashi University of Technology
1-1 Hibarigaoka, Tempaku-cho, Toyohashi-shi
Aichi, 441-8580 Japan
ksuzuki@ss.ics.tut.ac.jp
Abstract
We present a system for extracting an
English translation of a given Japanese
technical term by collecting and scor-
ing translation candidates from the web.
We first show that there are a lot of par-
tially bilingual documents in the web
that could be useful for term translation,
discovered by using a commercial tech-
nical term dictionary and an Internet
search engine. We then present an al-
gorithm for obtaining translation candi-
dates based on the distance of Japanese
and English terms in web documents,
and report the results of a preliminary
experiment.
1 Introduction
In the field of computational linguistics, the term
‘bilingual text’ is often used as a synonym for
‘parallel text’, which is a pair of texts written in
two different languages with the same semantic
contents. In Asian languages such as Japanese,
Chinese and Korean, however, there are a large
number of ‘partially bilingual texts’, in which the
monolingual text of an Asian language contains
several sporadically interlaced English words as
follows:
“… 黄斑変性 (macular degeneration) …”
The above sentence is taken from a Japanese
medical document, which says “Since glaucoma
is now manageable if diagnosed early, macular
degeneration is becoming a major cause of visual
impairment in developed nations”. These partially
bilingual texts are typically found in technical
documents, where the original English technical
terms are indicated (usually in parentheses)
just after the first usage of the Japanese technical
terms. Even if you don’t know Japanese, you
can easily guess that ‘黄斑変性’ is the translation of
‘macular degeneration’.
Partially bilingual texts can be used for machine
translation and cross-language information
retrieval, as well as bilingual lexicon construction,
because they not only give a correspondence
between Japanese and English terms, but also
give the context in which the Japanese term is
translated to the English term. For example, the
Japanese word ‘変性’ can be translated into many
English words, such as ‘degeneration’, ‘denaturation’,
and ‘conversion’. However, words in
the Japanese context, such as those meaning
‘disease’ and ‘impairment’, can be used as informants
guiding the selection of the most appropriate
English word.
In this paper, we investigate the possibility
of using web-sourced partially bilingual texts as
a continually-updated, wide-coverage bilingual
technical term dictionary.
Extracting the English translation of a given
Japanese technical term from the web on the fly
is different from collecting a set of arbitrarily many
pairs of English and Japanese technical terms.
The former can be thought of as example-based
translation, while the latter is a tool for bilingual
lexicon construction.
Internet portals are starting to provide on-line
bilingual dictionary and translation services.
However, technical terms and new words are
unlikely to be well covered because they are too
specific or too new. The proposed term translation
extractor could be a useful Internet tool for
human translators to compensate for the weaknesses of
existing on-line dictionaries and translation
services.
In the following sections, we first investigate
the coverage provided by partially bilingual texts
in the web as discovered by using a commercial
technical term dictionary and an Internet search
engine. We then present a simple algorithm
for extracting English translation candidates of a
given Japanese technical term. Finally, we report
the results of a preliminary experiment and dis-
cuss future work.
2 Partially Bilingual Text in the Web
2.1 Coverage of Fields
It is very difficult to measure precisely in which
fields of science there are large numbers of
partially bilingual texts in the web. However, it is
possible to get a rough estimate of the relative
amount in different fields by repeatedly asking a
search engine for documents containing both the
Japanese and English technical terms in each field.
For this purpose, we used a Japanese-to-
English technical term dictionary licensed from
NOVA, a maker of commercial machine transla-
tion systems. The dictionary is classified into 19
categories, ranging from aeronautics to ecology to
trade, as shown in Table 1. There are 1,082,594
pairs of Japanese and English technical terms1.
We randomly selected 30 pairs of Japanese
and English terms from each category and sent
queries to an Internet search engine, Google
(Google, 2001), to see whether there are any doc-
uments that contain both Japanese and English
technical terms. The fourth column in Table 1
shows the percentage of queries (J-E pairs) that
returned at least one document.
1 The dictionary can be searched on their web site (NOVA
Inc., 2000).
It is very encouraging that, on average, 42% of
the queries returned at least one document. The
results show that the web is worth mining for
bilingual lexicons, particularly in fields such as
aeronautics, computers, and law.
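The sampling procedure above can be sketched as follows. This is only an illustration under stated assumptions: `has_hit` is a hypothetical callable standing in for a search-engine query (the paper used Google, whose interface is not described here), and the toy dictionary and toy "web" are invented for the example.

```python
import random
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (Japanese term, English term)


def estimate_coverage(
    dictionary: Dict[str, List[Pair]],
    has_hit: Callable[[str, str], bool],
    samples_per_field: int = 30,
    seed: int = 0,
) -> Dict[str, float]:
    """Per field, the fraction of sampled pairs for which has_hit is True."""
    rng = random.Random(seed)
    coverage = {}
    for field, pairs in dictionary.items():
        if not pairs:
            continue  # nothing to sample in this field
        sample = rng.sample(pairs, min(samples_per_field, len(pairs)))
        hits = sum(1 for ja, en in sample if has_hit(ja, en))
        coverage[field] = hits / len(sample)
    return coverage


# Toy stand-in for the web: a list of document strings (hypothetical data).
toy_web = ["黄斑変性 (macular degeneration)", "胃 stomach"]


def toy_has_hit(ja: str, en: str) -> bool:
    # A "hit" is any document that contains both terms.
    return any(ja in doc and en in doc for doc in toy_web)


toy_dictionary = {
    "medical": [("黄斑変性", "macular degeneration"), ("胃液", "gastric juice")],
}
result = estimate_coverage(toy_dictionary, toy_has_hit)
```

With a real search engine, `has_hit` would issue one query per pair and report whether any document came back.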
2.2 Classification of Format
In order to implement a term translation extractor,
we have to analyze the format, or structural pat-
tern of the partially bilingual documents. There
are at least three typical formats in the web. Fig-
ure 1 shows examples.
• aligned paragraph format
• table format
• plain text format
In ‘aligned paragraph’ format, each paragraph
contains one language and the paragraphs with
different languages are interlaced. This format
is often found in web pages designed for both
Japanese and foreigners, such as official docu-
ments by governments and academic papers by
researchers (usually title and abstract only).
In ‘table’ format, each row contains a pair
of equivalent terms. They are not necessarily
marked by the TABLE tag of HTML. This format
is often found in bilingual glossaries, of which
there are many in the web. Some portals offer
hyperlinks to such bilingual glossaries, such as
kotoba.ne.jp (kotoba.ne.jp, 2000).
In ‘plain text’ format, phrases of a different
language are interlaced in the monolingual text of
the baseline language. The vast majority of
partially bilingual documents in the web belong to
this category.
The formats of the web documents are so
wildly different that it is impossible to
automatically classify them to estimate the relative
quantities belonging to each format. Instead, we
examined the distance (in bytes) from a Japanese
technical term to its corresponding English technical
term in the documents retrieved from the web by
the experiment described in Section 2.1.
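A minimal sketch of such a signed byte-distance measurement is given below. The paper does not state the character encoding or how the gap is measured, so the EUC-JP encoding, the gap definition (end of one term to start of the other) and the choice of the nearest English occurrence are all assumptions made for illustration.

```python
from typing import List


def occurrences(haystack: bytes, needle: bytes) -> List[int]:
    """Start offsets of every occurrence of needle in haystack."""
    positions, start = [], 0
    while needle:
        i = haystack.find(needle, start)
        if i == -1:
            break
        positions.append(i)
        start = i + 1
    return positions


def signed_gap(ja_start: int, ja_len: int, en_start: int, en_len: int) -> int:
    """Positive if the English term follows the Japanese term, negative if it precedes."""
    if en_start >= ja_start + ja_len:
        return en_start - (ja_start + ja_len)
    if en_start + en_len <= ja_start:
        return (en_start + en_len) - ja_start
    return 0  # overlapping spans


def byte_distances(document: str, ja: str, en: str, encoding: str = "euc_jp") -> List[int]:
    """For each Japanese occurrence, the signed gap to the nearest English occurrence."""
    data = document.encode(encoding, errors="replace")
    ja_b, en_b = ja.encode(encoding), en.encode(encoding)
    en_positions = occurrences(data, en_b)
    distances = []
    for ja_start in occurrences(data, ja_b):
        gaps = [signed_gap(ja_start, len(ja_b), p, len(en_b)) for p in en_positions]
        if gaps:
            distances.append(min(gaps, key=abs))
    return distances


after = byte_distances("黄斑変性 (macular degeneration)", "黄斑変性", "macular degeneration")
before = byte_distances("macular degeneration: 黄斑変性", "黄斑変性", "macular degeneration")
```

In the first toy document the English term follows the Japanese term after a space and an opening parenthesis, so the gap is 2 bytes; in the second it precedes the Japanese term, so the gap is negative.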
Figure 2 shows the results. Positive distance
indicates that the English term appeared after the
Japanese term, while negative distance indicates
the reverse. It is observed that the English and
Japanese terms are likely to appear very close to
Registration for Foreign Residents and Birth Registration
[Japanese text]
The official name for registration for foreign residents in Japan, as
determined by the Ministry of Justice, is “Alien Registration”.
...
Anyone staying in Japan for more than 90 days, children
born in Japan, ...
[Japanese text]
...
(http://www.pref.akita.jp/life/g090.htm)
(a) An example of ‘aligned paragraph format’ taken from a life guide for foreigners.
[Japanese title]
...
あ
喘ぎ呼吸 gasping respiration
アカラシア achalasia
亜急性細菌性心内膜炎 subacute bacterial endocarditis
...
い
胃 stomach
胃液 gastric juice
異化 catabolism
...
(http://apollo.m.ehime-u.ac.jp/GHDNet/98/waei.html)
(b) An example of ‘table format’ taken from a medical glossary.
[Japanese heading]
[Japanese text] 二酸化炭素 (CO2)、メタン (CH4)、亜酸化窒素 (N2O)、[Japanese text] 温室効果ガス (Green House Gases: GHGs) [Japanese text]
...
(http://www.eic.or.jp/cop3/ondan/ondan.html)
(c) An example of ‘plain text format’ taken from a document on global warming.
Figure 1: Three typical formats of partially bilingual documents in the web
Table 1: The percentage of documents including both Japanese and English words
field                   words    samples  found  example (English side of the J-E pair)
aeronautics and space   17862    30       57%    ecliptic coordinates
architecture            32049    30       30%    load capacity
biotechnology           59766    30       50%    phylogeny
business                50201    30       57%    short selling
chemicals               122232   30       43%    methyl formate
computers               117456   30       57%    OS loader
defense                 4787     30       17%    signature
ecology                 32440    30       40%    permafrost
electronics             87942    30       47%    internal gear pump
energy                  15804    30       50%    cyclotron heating
finance                 57097    30       37%    operating expenses
law                     36033    30       60%    sponsor
math and physics        76304    30       40%    deformation energy
mechanical engineering  86371    30       30%    tetragonal system
medical                 135158   30       27%    orthopedics
metals                  25595    30       37%    electrochemical machining
ocean                   13215    30       43%    mooring trial
(industrial) plant      95756    30       53%    plotter
trade                   16526    30       20%    remunerative price
total                   1082594  570      42%
[Figure 2 plot: histogram of the number of occurrences (vertical axis, 0 to 250) against the distance in bytes from Japanese words to English words (horizontal axis, -200 to 200)]
Figure 2: Distance from Japanese terms to English terms
each other. 28% (=233/847) of English terms
appeared just after (within 10 bytes) the corresponding
Japanese terms. 58% (=490/847) of English
terms appeared within ±50 bytes. They probably
reflect either table or plain text format.
Although 28% (=237/847) of English terms
appeared outside the window of ±200 bytes,
we find this ‘distance heuristic’ very powerful,
so it was used in the term translation algorithm
described in the next section.
3 Term Translation Extraction
Algorithm
Let $J$ and $E$ be Japanese and English technical
terms which are translations of each other. Let $d$
be a document, and let $D(J)$ be the set of documents
which include the Japanese term $J$. Let $P(J, E)$
be a statistical translation model which gives the
likelihood (or score) that $J$ and $E$ are translations
of each other.
Figure 3 shows the basic (conceptual) algorithm
for extracting the English translation of a
given Japanese technical term from the web. First,
we retrieve all documents $D(J)$ that contain the
given Japanese technical term $J$ using a search
engine. We then eliminate the Japanese-only
documents. For each English term $E$ contained in
the (partially) bilingual documents, we compute
the translation probability $P(J, E)$, and select the
English term $\hat{E}$ which has the highest translation
probability.
foreach d in D(J)
  if d is a bilingual document then
    foreach E in d
      compute P(J, E)
    end
  endif
end
output \hat{E} = \arg\max_E P(J, E)
Figure 3: Conceptual algorithm for extracting
English translation of Japanese term
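The conceptual algorithm can be sketched in a few lines. The sketch makes several assumptions that are not in the paper: an "English term" is taken to be a maximal run of ASCII words, the score is passed in as a function because no translation model is specified, and the `closeness` score used in the toy run is a hypothetical placeholder based on character offsets.

```python
import re
from typing import Callable, Iterable, List, Optional

# Maximal runs of ASCII words separated by single spaces (an assumed, crude
# definition of "English term").
ENGLISH_RUN = re.compile(r"[A-Za-z][A-Za-z0-9-]*(?: [A-Za-z][A-Za-z0-9-]*)*")


def english_candidates(document: str) -> List[str]:
    """Lower-cased English-looking runs found in the document."""
    return [m.group(0).lower() for m in ENGLISH_RUN.finditer(document)]


def extract_translation(
    ja_term: str,
    documents: Iterable[str],
    score: Callable[[str, str, str], float],
) -> Optional[str]:
    """Return the English candidate with the highest score, or None."""
    best_term, best_score = None, float("-inf")
    for doc in documents:
        if ja_term not in doc:
            continue
        # Japanese-only documents yield no candidates and are skipped implicitly.
        for en in english_candidates(doc):
            s = score(ja_term, en, doc)
            if s > best_score:
                best_term, best_score = en, s
    return best_term


def closeness(ja_term: str, en_term: str, doc: str) -> float:
    # Hypothetical score: the closer the two terms, the higher the score.
    return -abs(doc.lower().find(en_term) - doc.find(ja_term))


docs = ["黄斑変性 (macular degeneration) GHGs", "黄斑変性"]
best = extract_translation("黄斑変性", docs, closeness)
```

In this toy run the candidate nearest to the Japanese term is selected; the second document is Japanese-only and contributes nothing.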
In practice, it is often prohibitive to download
all documents that include the Japanese term.
Moreover, a reliable Japanese-English statistical
translation model is not available at the moment
because of the scarcity of parallel corpora.
Rather, one of the aims of this research is to collect
the resources for building such translation models.
We therefore employed a very simplistic
approach.
Instead of using all documents including the
Japanese term, we used only a predetermined
number of documents (the top 100 documents based
on the rank given by the search engine). This
entails the risk of missing the documents including
the English terms we are looking for.
Instead of using a statistical translation model,
we used a scoring function in the form of a
geometric distribution as shown in Equation (1).
$$S(J, E) = p\,(1 - p)^{\mathrm{floor}(d(J, E)/10)} \qquad (1)$$
Here, $d(J, E)$ is the byte distance between
Japanese term $J$ and English term $E$. It is divided
by 10 and the integer part of the quotient is used as
the variable in the geometric distribution (floor
indicates the flooring operation). The parameter (the
average) of the geometric distribution $p$ is set to
0.6 in our experiment.
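As a small worked illustration, the scoring function can be written as below. The function name and the `bucket` parameter are illustrative, and taking the absolute value of the distance is an assumption: the paper reports both positive and negative distances but does not say how the sign is handled in the score.

```python
import math


def distance_score(byte_distance: int, p: float = 0.6, bucket: int = 10) -> float:
    """Geometric-distribution score p * (1 - p) ** k with k = floor(|distance| / bucket)."""
    k = abs(byte_distance) // bucket  # assumption: the sign of the distance is ignored
    return p * (1.0 - p) ** k


s0 = distance_score(0)    # k = 0
s10 = distance_score(10)  # k = 1
s25 = distance_score(25)  # k = 2
```

With p = 0.6 the score is 0.6 for distances below 10 bytes and is multiplied by 0.4 for each further 10 bytes, giving roughly 0.24 and 0.096 for the next two buckets.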
There is no theoretical background to the scoring
function in Equation (1). It was designed, after
trial and error, so that the likelihood of candidate
pairs being translations of each other
decreases exponentially as the distance between the
two terms increases. Starting from the score of
0.6, it is multiplied by 0.4 (that is, it decreases by
60%) for every 10 bytes.
Table 3: Term translation extraction accuracy
tested by 34 Japanese terms
rank  exact     partial-1  partial-2
1     15% (5)   15% (5)    18% (6)
5     29% (10)  29% (10)   41% (14)
10    47% (16)  53% (18)   62% (21)
50    56% (19)  71% (24)   79% (27)
all   62% (21)  76% (26)   91% (31)
If we observe the same pair of Japanese and
English terms more than once, it is more likely
that they are valid translations. Therefore, we sum
the score of Equation (1) over all occurrences of the
pair $(J, E)$ and select the highest scoring English
term $\hat{E}$ as the translation of the Japanese term $J$.
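The accumulation step can be sketched as follows. The observation list is hypothetical: the byte distances are invented for illustration and chosen only so that the totals resemble the kind of output reported in Section 4.2. As before, ignoring the sign of the distance is an assumption.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def distance_score(byte_distance: int, p: float = 0.6, bucket: int = 10) -> float:
    # Geometric score; ignoring the sign of the distance is an assumption.
    return p * (1.0 - p) ** (abs(byte_distance) // bucket)


def rank_candidates(observations: Iterable[Tuple[str, int]]) -> List[Tuple[str, float]]:
    """Sum the score over every (English term, byte distance) observation and sort."""
    totals: Dict[str, float] = defaultdict(float)
    for english_term, byte_distance in observations:
        totals[english_term] += distance_score(byte_distance)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical observations for a single Japanese query term.
observations = [
    ("fa", 2), ("fa", 3), ("fa", 1),  # seen three times, right next to the query
    ("ohr", 4), ("ohr", 55),          # once close, once about 50 bytes away
    ("operating cost", 5),            # seen once, close
]
ranking = rank_candidates(observations)
```

A term seen three times right next to the query accumulates about 3 × 0.6 = 1.8, which is how a frequent but wrong candidate can outrank a correct translation that was observed only once.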
4 Experiments
4.1 Test Terms
In order to factor out the characteristics of the
search engine and the proposed term extraction
algorithm, we used, as a test set, those words that
are guaranteed to have at least one retrieved
document that includes both Japanese and English
terms.
First, we randomly selected 50 pairs of
Japanese and English terms from the pairs used
in the experiment described in Section 2.1. They
are shown in Table 2. We then sent each
Japanese term as a query to an Internet search
engine, Google, and downloaded the top 100 web
documents. In Table 2, “o” indicates that at least one of the
downloaded documents included both terms. “x”
indicates that no document included both terms.
This resulted in a test set of 34 pairs of Japanese
and English terms.
For example, although there are a lot of
documents which include both “西” and “west”, the
top 100 documents retrieved by “西” as the query
did not contain “west” since “西” is a highly
frequent Japanese word.
Table 2: A list of Japanese and English technical terms used in the experiment.
o/x  English term                         o/x  English term
o    National Information Infrastructure  x    specific strength
o    terrestrial planet                   o    earth cable
o    load capacity                        o    tenuazonic acid
o    multiple factor                      o    ethology
o    radionuclide                         o    job shop scheduling
o    Government Printing Office           o    launcher
x    expense reporting                    o    methyl formate
o    network game                         o    war game
o    Phoenix                              x    west
x    first day of winter                  o    cycle time
o    half duplex circuit                  o    market research
o    internal gear pump                   o    closed loop
o    cyclotron heating                    x    operating expenses
x    well-being                           o    world market
x    faith                                o    courtroom
x    treatise                             x    sponsor
o    address                              x    climate study
o    geomagnetic reversal                 x    edge
o    density                              o    end artery
o    orthopedics                          x    steelmaking process
x    knob                                 o    mooring trial
o    low pressure turbine                 o    petcock
x    stay                                 o    navigation system
x    total pressure                       o    debit
x    foreign exchange rate                o    optical fiber
4.2 Extraction Accuracy
Table 3 shows the extraction accuracy of the
English translations of Japanese terms. Since both
Japanese and English terms could occur as a
subpart of longer terms, we need to consider local
alignment to extract the English subpart
corresponding to the Japanese query. Instead of doing
this alignment, we introduced two partial match
measures as well as exact matching.
In Table 3, ‘exact’ indicates that the output
is exactly matched to the correct answer, while
‘partial-1’ indicates that the correct answer was a
subpart of the output; ‘partial-2’ indicates that at
least one word of the output is a subpart of the
correct answer.
For example, the eye disease ‘黄斑変性’,
whose translation is ‘macular degeneration’, is
sometimes more formally referred to as
‘加齢性黄斑変性’, whose translation is ‘age-related
macular degeneration’. ‘Partial-1’ holds if
‘age-related macular degeneration’ is extracted when
the query is ‘黄斑変性’. ‘Partial-2’ holds if
‘degeneration’ is included in the output when the
query is ‘黄斑変性’.
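The three match measures can be sketched with simple string matching. The normalisation (lower-casing and whitespace collapsing) and the function names are assumptions; the paper only says that simple string matching was used.

```python
from typing import Dict


def normalize(term: str) -> str:
    """Lower-case and collapse whitespace (an assumed normalisation)."""
    return " ".join(term.lower().split())


def match_flags(output: str, answer: str) -> Dict[str, bool]:
    """Exact, partial-1 and partial-2 matches between an output and the correct answer."""
    out, ans = normalize(output), normalize(answer)
    return {
        # the output is exactly the correct answer
        "exact": out == ans,
        # the correct answer is a subpart of the output
        "partial-1": ans in out,
        # at least one word of the output is a subpart of the correct answer
        "partial-2": any(word in ans for word in out.split()),
    }


longer = match_flags("age-related macular degeneration", "macular degeneration")
shorter = match_flags("degeneration", "macular degeneration")
synonym = match_flags("operating cost", "operating expenses")
```

Because the check is a plain substring test, very short output words can match by accident; a word-level comparison would be stricter.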
It is encouraging that useful outputs (either ex-
act or partial matches) are included in the top 10
candidates with the probability of around 60%.
Since we used simple string matching to mea-
sure the accuracy automatically, the evaluation re-
ported in Table 3 is very conservative. Because
the output contains acronyms, synonyms, and re-
lated words, the overall performance of the sys-
tem is fairly credible.
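The rank-based accuracy figures of the kind shown in Table 3 can be computed along the following lines. The data structures and the toy candidate lists are hypothetical; only the idea (a query counts as correct at rank k if any of its top k candidates matches) is taken from the text.

```python
from typing import Callable, Dict, List, Optional


def accuracy_at_rank(
    ranked: Dict[str, List[str]],
    answers: Dict[str, str],
    k: Optional[int],
    is_match: Callable[[str, str], bool],
) -> float:
    """Fraction of queries with a matching candidate in the top k (k=None means all)."""
    hits = 0
    for query, answer in answers.items():
        candidates = ranked.get(query, [])
        top = candidates if k is None else candidates[:k]
        if any(is_match(candidate, answer) for candidate in top):
            hits += 1
    return hits / len(answers) if answers else 0.0


def exact(output: str, answer: str) -> bool:
    return output == answer


# Hypothetical ranked outputs for two queries, best candidate first.
ranked = {
    "q1": ["nii", "national information infrastructure", "gii"],
    "q2": ["fa", "ohr", "operating cost"],
}
answers = {"q1": "national information infrastructure", "q2": "operating expenses"}

top1 = accuracy_at_rank(ranked, answers, 1, exact)
top2 = accuracy_at_rank(ranked, answers, 2, exact)
overall = accuracy_at_rank(ranked, answers, None, exact)
```

Passing a looser `is_match` function, such as a partial match test, gives the corresponding partial-match accuracies.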
For example, the extracted translations for the
Japanese query corresponding to ‘National Information
Infrastructure’ were as follows, where the second
candidate is the correct answer.
18.721123: nii
13.912146: national information infrastructure
2.137008: gii
1.398144: unii
NII (nii) is the acronym for National Informa-
tion Infrastructure, while GII (gii) and UNII (unii)
stand for Global Information Infrastructure and
Unlicensed National Information Infrastructure,
respectively.
If the query is a chemical substance, its
molecular formula, instead of an acronym, is often
extracted, such as ‘HCOOCH3’ for ‘ギ酸メチル’
(methyl formate).
1.801008: methyl formate
0.840786: hcooch3
0.84: hcooh
As for synonyms, although we took ‘operating
expenses’ to be the correct translation for the
Japanese query term, the following third candidate
‘operating cost’ is also a legitimate translation.
This is counted as ‘partial-2’ because ‘operating’
is a subpart of the correct answer.
1.8: fa
0.606144: ohr
0.6: operating cost
For reference, OHR (Over Head Ratio)
is a management index equal to the operating
cost divided by the gross operating profit. ‘Fa’
happened to be used three times in a tutorial
document on accounting to stand for ‘operating
expenses’, in expressions of the form
“[operating expenses] (Fa) = [cost] (E) * 23%”,
where the bracketed items are the Japanese words
for ‘operating expenses’ and ‘cost’.
The following example is a combination of
acronyms, synonyms and related words, which is,
in a sense, a typical output of the proposed system.
The query is ‘気候研究’, and ‘climate study’
is the translation we assumed to be correct.
10.736611: wcrp
2.282483: wmo
1.220275: no
1.2: wc rp
0.72: igbp
0.6: sparc
0.6: wcp
0.6: applied climatology
0.2784: world climate research programme
‘Climate research’, a subpart of the 9th candidate,
is also a legitimate translation. ‘WCRP’
is the acronym for ‘World Climate Research
Programme’, which is the 9th candidate and is
translated to ‘世界気候研究計画’, which includes the
original Japanese query. ‘WMO’ stands for World
Meteorological Organization, which hosts this
international program.
In short, if you look at the extracted transla-
tions together with the context from which they
are extracted, you can learn a lot about the rele-
vant information of the query term and its trans-
lation candidates. We think this is a useful tool
for human translators, and it could provide a use-
ful resource for statistical machine translation and
cross language information retrieval.
5 Discussion and Related Works
Previous studies on bilingual text mainly focused
on either parallel texts, non-parallel texts, or com-
parable texts, in which a pair of texts are written
in two different languages (Veronis, 2000). How-
ever, except for governmental documents from
Canada (English/French) and Hong Kong (Chi-
nese/English), bilingual texts are usually subject
to such limitations as licensing conditions, us-
age fees, domains, language pairs, etc. One ap-
proach that partially overcomes these limitations
is to collect parallel texts from the web (Nie et al.,
1999; Resnik, 1999).
To provide better coverage with fewer restric-
tions, we focused on partially bilingual text. Con-
sidering the enormous volume of such texts and
the variety of fields covered, we believe they are
the best resource to mine for MT-related applica-
tions that involve English and Asian languages.
The current system for extracting the translation
of a given term is more similar to the
information extraction system for term descriptions
(Fujii and Ishikawa, 2000) than to any machine
translation system. In order to collect
descriptions for a technical term X, such as ‘data
mining’, Fujii and Ishikawa (2000) collected phrases
like “X is Y” and “X is defined as Y” from the
web. As our system used a scoring function based
solely on byte distance, introducing this kind of
pattern matching might improve its accuracy.
Practically speaking, the factor that most
influences the accuracy of the term translation
extractor is the set of documents returned from the
search engine. In order to evaluate the system, we
used a test set in which each term is guaranteed to
have at least one retrieved document containing both
the Japanese term and its English translation; this
is a rather optimistic assumption.
Since the search engine is an uncontrollable
factor, one possible solution is to build one's own
search engine. We are very interested in combining
such ideas as focused crawling (Chakrabarti
et al., 1999) and domain-specific Internet portals
(McCallum et al., 2000) with the proposed term
translation extractor to develop a domain-specific
on-line dictionary service.
6 Conclusion
We investigated the possibility of using the web
as a bilingual dictionary, and reported the prelim-
inary results of an experiment on extracting the
English translations of given Japanese technical
terms from the web.
One interesting approach to extending the cur-
rent system is to introduce a statistical translation
model (Brown et al., 1993) to filter out irrelevant
translation candidates and to extract the most ap-
propriate subpart from a long English sequence
as the translation by locally aligning the Japanese
and English sequences.
Unlike ordinary machine translation, which
generates English sentences from Japanese sentences,
this is a recognition-type application
which identifies whether or not a Japanese term
and an English term are translations of each other.
Considering the fact that what the statistical
translation model provides is the joint probability of
Japanese and English phrases, this could be a
more natural and promising application of the
statistical translation model than sentence-to-sentence
translation.

References
Peter F. Brown, Stephen A. Della Pietra, Vincent
J. Della Pietra, and Robert L. Mercer. 1993. The
mathematics of statistical machine translation: Pa-
rameter estimation. Computational Linguistics,
19(2):263–311.
Soumen Chakrabarti, Martin van den Berg, and Byron
Dom. 1999. Focused crawling: a new approach to
topic-specific web resource discovery. In Proceedings
of the Eighth International World Wide Web Conference,
pages 545–562.
Atsushi Fujii and Tetsuya Ishikawa. 2000. Utilizing
the world wide web as an encyclopedia: Extracting
term descriptions from semi-structured texts.
In Proceedings of the 38th Annual Meeting of the
Association for Computational Linguistics, pages
488–495.
Google. 2001. Google.
http://www.google.com.
kotoba.ne.jp. 2000. Translators’ internet resources (in
Japanese). http://www.kotoba.ne.jp.
Andrew Kachites McCallum, Kamal Nigam, Jason
Rennie, and Kristie Seymore. 2000. Automating
the construction of internet portals with machine
learning. Information Retrieval, 3(2):127–163.
Jian-Yun Nie, Michel Simard, Pierre Isabelle, and
Richard Durand. 1999. Cross-language informa-
tion retrieval based on parallel texts and automatic
mining of parallel texts from the web. In Proceed-
ings of the 22nd Annual International ACM SIGIR
Conference on Research and Development in Infor-
mation Retrieval, pages 74–81.
NOVA Inc. 2000. Technical term dic-
tionary lookup service (in Japanese).
http://wwwd.nova.co.jp/webdic/webdic.html.
Philip Resnik. 1999. Mining the web for bilingual
text. In Proceedings of the 37th Annual Meeting
of the Association for Computational Linguistics,
pages 527–534.
Jean Veronis, editor. 2000. Parallel Text Process-
ing: Alignment and Use of Translation Corpora,
volume 13 of Text, Speech, and Language Technol-
ogy. Kluwer Academic Publishers.
