Translation by Machine of Complex Nominals: Getting it Right
Timothy Baldwin
CSLI
Stanford University
Stanford, CA 94305 USA
tbaldwin@csli.stanford.edu
Takaaki Tanaka
Communication Science Laboratories
Nippon Telegraph and Telephone Corporation
Kyoto, Japan
takaaki@cslab.kecl.ntt.co.jp
Abstract
We present a method for compositionally translating
noun-noun (NN) compounds, using a word-level
bilingual dictionary and syntactic templates for can-
didate generation, and corpus and dictionary statis-
tics for selection. We propose a support vector
learning-based method employing target language
corpus and bilingual dictionary data, and evaluate it
over an English ↔ Japanese machine translation task.
We show the proposed method to be superior to pre-
vious methods and also robust over low-frequency
NN compounds.
1 Introduction
Noun-noun (NN) compounds (e.g. web server, 機
械·翻訳 kikai·hoNyaku “machine translation”,1 the
elements of which we will refer to as N1 and N2 in
linear order of occurrence) are a very real problem
for both machine translation (MT) systems and hu-
man translators due to:
constructional variability in the translations:
機械·翻訳 kikai·hoNyaku “machine transla-
tion” (N-N) vs. 民間·企業 miNkaN·kigyou
“private company” (Adj-N) vs. 関係·改善
kaNkei·kaizeN “improvement in relations” (N
in N);
lexical divergences in Japanese and English:
配布·計画 haifu·keikaku “distribution
schedule” vs. 経済·計画 keizai·keikaku
“economic plan/programme” vs. 主要·計画
shuyou·keikaku “major project”;
semantic underspecification: compounds gener-
ally have multiple interpretations, and can only
be reliably interpreted in context (Levi, 1978);
the existence of non-compositional NN compounds:
井戸端·会議 idobata·kaigi “(lit.) well-side
meeting”, which translates most naturally into
English as “idle gossip”;
high productivity and frequency
In order to quantify the high productivity and
frequency of NN compounds, we carried out a
1With all Japanese NN compound examples, we segment
the compound into its component nouns through the use of the
“·” symbol. No such segmentation boundary is indicated in the
original Japanese.
BNC Reuters Mainichi
Token coverage 2.6% 3.9% 2.9%
Total no. types 265K 166K 889K
Ave. token freq. 4.2 12.7 11.1
Singletons 60.3% 44.9% 45.9%
Table 1: Corpus occurrence of NN compounds
basic study of corpus occurrence in English and
Japanese. For English, we based our analysis
over: (1) the written portion of the British Na-
tional Corpus (BNC, 84M words: Burnard (2000)),
and (2) the Reuters corpus (108M words: Rose et
al. (2002)). For Japanese, we focused exclusively
on the Mainichi Shimbun Corpus (340M words:
Mainichi Newspaper Co. (2001)). We identified
NN compounds in each corpus using the method de-
scribed in §2.2 below, and from this, derived the
statistics of occurrence presented in Table 1. The
token coverage of NN compounds in each corpus
refers to the percentage of words which are con-
tained in NN compounds; based on our corpora, we
estimate this figure to be as high as 3-5%. If we
then look at the average token frequency of each
distinct NN compound type, we see that it is a rel-
atively modest figure given the size of each of the
corpora, the reason for which is seen in the huge
number of distinct NN compound types. Combin-
ing these observations, we see that a translator or
MT system attempting to translate one of these cor-
pora will run across NN compounds with high fre-
quency, but that each individual NN compound will
occur only a few times (with around 45-60% occur-
ring only once). The upshot of this for MT systems
and translators is that NN compounds are too var-
ied to be able to pre-compile an exhaustive list of
translated NN compounds, and must instead be able
to deal with novel NN compounds on the fly. This
claim is supported by Tanaka and Baldwin (2003a),
who found that static bilingual dictionaries had a
type coverage of around 84% and 94% over the top-
250 most frequent English and Japanese NN com-
pounds, respectively, but only 27% and 60%, re-
spectively, over a random sample of NN compounds
occurring more than 10 times in the corpus.
We develop and test a method for translating NN
compounds based on Japanese ↔ English MT. The
method can act as a standalone module in an MT
Second ACL Workshop on Multiword Expressions: Integrating Processing, July 2004, pp. 24-31
system, translating NN compounds according to the
best-scoring translation candidate produced by the
method, and it is primarily in this context that we
present and evaluate the method. This is congruent
with the findings of Koehn and Knight (2003) that,
in the context of statistical MT, overall translation
performance improves when source language noun
phrases are prescriptively translated as noun phrases
in the target language. Alternatively, the proposed
method can be used to generate a list of plausible
translation candidates for each NN compound, for
a human translator or MT system to select between
based on the full translation context.
In the remainder of the paper, we describe the
translation procedure and resources used in this re-
search (§2), and outline the translation candidate se-
lection method, a benchmark selection method and
pre-processors our method relies on (§3). We then
evaluate the method using a variety of data sources
(§4), and finally compare our method to related re-
search (§5).
2 Preliminaries
2.1 Translation procedure
We translate NN compounds by way of a two-phase
procedure, incorporating generation and selection
(similarly to Cao and Li (2002) and Langkilde and
Knight (1998)).
Generation consists of looking up word-level
translations for each word in the NN compound
to be translated, and running them through a set
of constructional translation templates to generate
translation candidates. In order to translate 関係·
改善 kaNkei·kaizeN “improvement in relations”, for
example, possible word-level translations for 関係
are relation, connection and relationship, and trans-
lations for 改善 are improvement and betterment.
Constructional templates are of the form [N^E_2
in N^E_1] (where N^E_i indicates that the word is a
noun (N) in English (E) and corresponds to the i-th-
occurring noun in the original Japanese; see Table 3
for further example templates and Kageura et al.
(2004) for discussion of templates of this type).
Each slot in
the translation template is indexed for part of speech
(POS), and derivational morphology is optionally
used to convert a given word-level translation into
a form appropriate for a given template. Example
translation candidates for 関係·改善, therefore, are
relation improvement, betterment of relationship,
improvement connection and relational betterment.
Generation fails in the instance that we are unable
to find a word-level translation for N1 and/or N2.
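The generation phase described above can be sketched as follows. This is a minimal illustration, not the actual system: the dictionary entries and template strings are toy stand-ins for the ALTDIC/EDICT data and the real template sets.

```python
from itertools import product

# Toy word-level transfer dictionary (illustrative entries only).
WORD_TRANS = {
    "関係": ["relation", "connection", "relationship"],  # kaNkei
    "改善": ["improvement", "betterment"],               # kaizeN
}

# A few constructional templates; {n1}/{n2} stand for translations of
# the first- and second-occurring nouns in the source compound.
TEMPLATES = ["{n1} {n2}", "{n2} {n1}", "{n2} of {n1}", "{n2} in {n1}"]

def generate_candidates(n1, n2):
    """Return all template-expanded translation candidates, or [] if
    either noun lacks a word-level translation (generation failure)."""
    t1, t2 = WORD_TRANS.get(n1, []), WORD_TRANS.get(n2, [])
    return [tpl.format(n1=w1, n2=w2)
            for w1, w2, tpl in product(t1, t2, TEMPLATES)]

# fertility(N1) x fertility(N2) x #templates = 3 x 2 x 4 = 24 candidates,
# mixing good candidates ("improvement in relation") with noise.
cands = generate_candidates("関係", "改善")
```

Note that this sketch omits the POS indexing and derivational morphology mentioned above, which would further filter and inflect the slot fillers.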
Selection consists of selecting the most likely
translation for the original NN compound from the
generated translation candidates. Selection is per-
formed based on a combination of monolingual tar-
get language and crosslingual evidence, obtained
from corpus or web data.
Ignoring the effects of POS constraints for the
moment, the number of generated translations is
O(m × n × t), where m and n are the fertility of Japanese
nouns N^J_1 and N^J_2, respectively, and t is the number
of translation templates. As a result, there is often
a large number of translation candidates to select
between, and the selection method crucially deter-
mines the efficacy of the method.
This translation procedure has the obvious advan-
tage that it can generate a translation for any NN
compound input assuming that there are word-level
translations for each of the component nouns; that
is, it has high coverage. It is based on the assump-
tion that NN compounds translate composition-
ally between Japanese and English, which Tanaka
and Baldwin (2003a) found to be the case 43.1% of
the time for Japanese–English (JE) MT and 48.7%
of the time for English–Japanese (EJ) MT. In this
paper, we focus primarily on selecting the cor-
rect translation for those NN compounds which can
be translated compositionally, but we also inves-
tigate what happens when non-compositional NN
compounds are translated using a compositional
method.
2.2 Translation data
In order to generate English and Japanese NN com-
pound test data, we first extracted all NN bi-
grams from the Reuters Corpus and Mainichi Shim-
bun Corpus. The Reuters Corpus was first tagged
and chunked using fnTBL (Ngai and Florian, 2001),
and lemmatised using morph (Minnen et al., 2001),
while the Mainichi Shimbun was segmented and
tagged using ChaSen (Matsumoto et al., 1999). For
both English and Japanese, we took only those NN
bigrams adjoined by non-nouns to ensure that they
were not part of a larger compound nominal. We ad-
ditionally measured the entropy of the left and right
contexts for each NN type, and filtered out all com-
pounds where either entropy value was below a set threshold.2 This
was done in an attempt to, once again, exclude NNs
which were embedded in larger MWEs, such as ser-
vice department in social service department.
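The context-entropy filter can be sketched as below. The threshold value and the exact stoplists are placeholders; the switch-off behaviour follows the description in footnote 2.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a distribution of context tokens."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def keep_compound(left_ctxs, right_ctxs, threshold=1.0):
    """Keep an NN type unless either context entropy falls below the
    threshold, which suggests the NN is embedded in a larger compound.
    The threshold is switched off when the most probable context is an
    uninformative one (determiner or boundary on the left, punctuation
    or boundary on the right)."""
    for ctxs, off in ((left_ctxs, {"the", "a", "<s>"}),
                      (right_ctxs, {".", ",", "</s>"})):
        counts = Counter(ctxs)
        if counts.most_common(1)[0][0] in off:
            continue  # threshold switched off for this side
        if entropy(counts) < threshold:
            return False
    return True

# "service department" seen only after "social": left entropy 0, filtered.
embedded = keep_compound(["social"] * 10, ["of", "was", "said", "in"])
```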
We next calculated the frequency of occurrence
of each NN compound type identified in the English
and Japanese corpora, and ranked the NN com-
pound types in order of corpus frequency. Based on
this ranking, we split the NN compound types into
three partitions of equal token frequency, and from
each partition, randomly selected 250 NN com-
pounds. In doing so, we produced NN compound
2For the left token entropy, if the most-probable left context
was the, a or a sentence boundary, the threshold was switched
off. Similarly for the right token entropy, if the most-probable
right context was a punctuation mark or sentence boundary, the
threshold was switched off.
Band               English      Japanese
HIGH  Freq. range  346–24,025   336–64,835
      Types        791          4,009
MED   Freq. range  44–345       37–336
      Types        6,576        32,283
LOW   Freq. range  1–44         1–37
      Types        158,215      852,328
Table 2: Frequency bands
data representative of three disjoint frequency bands
of equal token size, as detailed in Table 2. This al-
lows us to analyse the robustness of our method over
data of different frequencies.
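The three-way split into bands of equal token frequency can be sketched as a greedy partition over frequency-ranked types; the toy counts below are illustrative, with the real cut-points being those shown in Table 2.

```python
def split_into_bands(type_freqs, n_bands=3):
    """Partition compound types, ranked by descending corpus frequency,
    into bands of (approximately) equal total token frequency."""
    ranked = sorted(type_freqs.items(), key=lambda kv: -kv[1])
    target = sum(type_freqs.values()) / n_bands
    bands, current, running = [], [], 0
    for nn_type, freq in ranked:
        current.append(nn_type)
        running += freq
        if running >= target and len(bands) < n_bands - 1:
            bands.append(current)
            current, running = [], 0
    bands.append(current)
    return bands

# One very frequent type can balance whole bands of rarer types, which
# is why the LOW band contains vastly more types (cf. Table 2).
bands = split_into_bands({"a": 60, "b": 30, "c": 20, "d": 5, "e": 5})
```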
Our motivation in testing the proposed method
over NN compounds according to the three fre-
quency bands is to empirically determine: (a)
whether there is any difference in translation-
compositionality for NN compounds of different
frequency, and (b) whether our method is robust
over NN compounds of different frequency. We re-
turn to these questions in §4.1.
In order to evaluate basic translation accuracy
over the test data, we generated a unique gold-
standard translation for each NN compound to
represent its optimally-general default translation.
This was done with reference to two bilingual
Japanese-English dictionaries: the ALTDIC dictio-
nary and the on-line EDICT dictionary. The ALT-
DIC dictionary was compiled from the ALT-J/E
MT system (Ikehara et al., 1991), and has approxi-
mately 400,000 entries including more than 200,000
proper nouns; EDICT (Breen, 1995) has approxi-
mately 150,000 entries. The existence of a trans-
lation for a given NN compound in one of the
dictionaries does not guarantee that we used it as
our gold-standard, and 35% of JE translations and
25% of EJ translations were rejected in favour of
a manually-generated translation. In generating the
gold-standard translation data, we checked the va-
lidity of each of the randomly-extracted NN com-
pounds, and rejected a total of 0.5% of the initial
random sample of Japanese strings, and 6.6% of the
English strings, on the grounds of: (1) not being
NN compounds, (2) being proper nouns, or (3) be-
ing part of a larger MWE. In each case, the rejected
string was replaced with an alternate randomly-
selected NN compound.
2.3 Translation templates
The generation phase of translation relies on trans-
lation templates to recast the source language NN
compound into the target language. The transla-
tion templates were obtained by way of word align-
ment over the JE and EJ gold-standard translation
datasets, generating a total of 28 templates for the
JE task and 4 templates for the EJ task. The rea-
son for the large number of templates in the JE task
is that they are used to introduce prepositions and
possessive markers, as well as indicating word class
conversions (see Table 3).
3 Selection methodology
In this section, we describe a benchmark selection
method based on monolingual corpus data, and a
novel selection method combining monolingual cor-
pus data and crosslingual data derived from bilin-
gual dictionaries. Each method takes the list of gen-
erated translation candidates and scores each, re-
turning the highest-scoring translation candidate as
our final translation.
3.1 Benchmark monolingual method
The monolingual selection method we benchmark
ourselves against is the corpus-based transla-
tion quality (CTQ) method of Tanaka and Bald-
win (2003b). It rates a given translation candidate
according to corpus evidence for both the fully-
specified translation and its parts in the context of
the translation template in question. This is calcu-
lated as:3
CTQ(w^T_1, w^T_2, t) = α · P(w^T_1, w^T_2, t) + β · P(w^T_1, t) · P(w^T_2, t)
where w^T_1 and w^T_2 are the word-level translations
of the source language N^S_1 and N^S_2, respectively,
and t is the translation template.4 Each probabil-
ity is calculated according to a maximum likelihood
estimate based on relative corpus occurrence. The
formulation of CTQ is based on linear interpolation
over α and β, where 0 ≤ α, β ≤ 1 and α + β = 1. We
set α to 0.9 and β to 0.1 throughout evaluation.
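A minimal sketch of CTQ scoring, assuming a pre-computed table of corpus counts; the count-key convention and the toy figures are our own, and only the interpolation itself follows the CTQ formulation.

```python
def ctq(w1, w2, t, counts, total, alpha=0.9, beta=0.1):
    """Corpus-based translation quality: interpolate the MLE probability
    of the fully-specified translation with the product of the per-word
    probabilities in the context of template t."""
    p_full = counts.get((w1, w2, t), 0) / total  # P(w1, w2, t)
    p_w1_t = counts.get((w1, t), 0) / total      # P(w1, t)
    p_w2_t = counts.get((w2, t), 0) / total      # P(w2, t)
    return alpha * p_full + beta * p_w1_t * p_w2_t

# Toy counts: "relation to ..." far outnumbers "relation on ..." in the
# corpus, so [N2 to N1] wins for the Bandersnatch example even though
# both fully-specified candidates are unseen (p_full = 0).
counts = {("relation", "[N2 to N1]"): 50,
          ("relation", "[N2 on N1]"): 2,
          ("Bandersnatch", "[N2 to N1]"): 1,
          ("Bandersnatch", "[N2 on N1]"): 1}
to_score = ctq("Bandersnatch", "relation", "[N2 to N1]", counts, 1000)
on_score = ctq("Bandersnatch", "relation", "[N2 on N1]", counts, 1000)
```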
The basic intuition behind decomposing the
translation candidate into its two parts within the
context of the translation template (P(w^T_1, t) and
P(w^T_2, t)) is to capture the subcategorisation prop-
erties of w^T_1 and w^T_2 relative to t. For example,
if w^T_1 and w^T_2 were Bandersnatch and relation,
respectively, and P(w^T_1, w^T_2, t) = 0 for all t, we
would hope to score relation to (the) Bandersnatch
as being more likely than relation on (the) Bander-
snatch. We could hope to achieve this by virtue of
the fact that relation occurs in the form relation to
... much more frequently than relation on ..., mak-
ing the value of P(w^T_2, t) greater for the template
[N^E_2 to N^E_1] than [N^E_2 on N^E_1].
In evaluation, Tanaka and Baldwin (2003b) found
the principal failing of this method to be its treat-
ment of all translations contained in the transfer
dictionary as being equally likely, where in fact
3In the original formulation, the product
P(w^T_1) · P(w^T_2) · P(t) was included as a third term, but
Tanaka and Baldwin (2003b) found it to have negligible impact
on translation accuracy, so we omit it here.
4w^T_1 and w^T_2 are assumed to be POS-compatible with t.
Template (JE)                   Example
[N1 N2]J → [N1 N2]E             市場·経済 shijou·keizai “market economy”
[N1 N2]J → [N2 N1]E             賛成·多数 saNsei·tasuu “majority agreement”
[N1 N2]J → [N2 of (the) N1]E    政権·交代 seikeN·koutai “change of government”
Template (EJ)                   Example
[N1 N2]E → [N1 N2]J             exchange rate → 為替·レート “kawase·reeto”
[N1 N2]E → [N1 teki N2]J        world leader → 世界·的·リーダー “sekai·teki·leader”
[N1 N2]E → [N2 no N1]J          baby girl → 女·の·赤ちゃん “oNna·no·akachaN”
Table 3: Example translation templates (N = noun and Adj = adjective)
there is considerable variability in their applicabil-
ity. One example of this is the simplex 記事 kiji
which is translated as either article or item (in the
sense of a newspaper) in ALTDIC, of which the for-
mer is clearly the more general translation. Lack-
ing knowledge of this conditional probability, the
method considers the two translations to be equally
probable, giving rise to the preferred translation of
related item for 関連·記事 kaNreN·kiji “related ar-
ticle” due to the markedly greater corpus occurrence
of related item over related article. It is this as-
pect of selection that we focus on in our proposed
method.
3.2 Proposed selection method
The proposed method uses the corpus-based mono-
lingual probability terms of CTQ above, but also
mono- and crosslingual terms derived from bilin-
gual dictionary data. In doing so, it attempts to pre-
serve the ability of CTQ to model target language
expressional preferences, while incorporating more
direct translation preferences at various levels of
lexical specification. For ease of feature expandabil-
ity, and to avoid interpolation over excessively many
terms, the backbone of the method is the TinySVM
support vector machine (SVM) learner.5
The way we use TinySVM is to take all source
language inputs where the gold-standard translation
is included among the generated translation candi-
dates, and construct a single feature vector for each
translation candidate. We treat those feature vec-
tors which correspond to the (unique) gold-standard
translation as positive exemplars, and all other fea-
ture vectors as negative exemplars. We then run
TinySVM over the training exemplars using the
ANOVA kernel (the only kernel which was found to
converge). Strictly speaking, SVMs produce a bi-
nary classification, by returning a continuous value
and determining whether it is closest to +1 (the pos-
itive class) or −1 (the negative class). We treat
this value as a translation quality rating, and rank
the translation candidates accordingly. To select the
best translation candidate, we simply take the best-
scoring exemplar, breaking ties through random se-
lection.
5http://chasen.aist-nara.ac.jp/~taku/
software/TinySVM/
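The ranking-and-selection step can be sketched as below. A plain dot product stands in for TinySVM's continuous decision value (the actual system trains an SVM with the ANOVA kernel), and the feature vectors and weights are placeholders.

```python
import random

def select_translation(candidates, weights, seed=0):
    """Score each candidate's feature vector with an SVM-style
    continuous decision value (here a simple dot product), rank the
    candidates, and return a top scorer, breaking ties at random."""
    def score(feats):
        return sum(w * x for w, x in zip(weights, feats))
    scores = {cand: score(feats) for cand, feats in candidates.items()}
    best = max(scores.values())
    tied = [cand for cand, s in scores.items() if s == best]
    return random.Random(seed).choice(tied)

# Two candidates tie on the toy feature vectors; one is picked at random.
choice = select_translation(
    {"relation improvement": [1.0, 0.0],
     "improvement in relations": [1.0, 0.0],
     "relational betterment": [0.0, 1.0]},
    weights=[1.0, -0.5])
```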
The selection method makes use of three basic
feature types in generating a feature vector for each
source language–translation candidate pair: corpus-
based features, bilingual dictionary-based features
and template-based features.
Corpus-based features
Each source language–translation pair is mapped
onto a total of 8 corpus-based feature types, in line
with the CTQ formulation above:
• wfreq(⟨w^T_1, w^T_2, t⟩)
• freq(⟨w^T_1, w^T_2, t⟩)
• freq(⟨w^T_1, t⟩) and freq(⟨w^T_2, t⟩)
• freq(⟨w^T_1⟩), freq(⟨w^T_2⟩) and freq(⟨t⟩)
• mwe(⟨w^T_1, w^T_2, t⟩)
mwe(⟨w^T_1, w^T_2, t⟩) is a normalisation parameter
used to estimate the frequency of occurrence of mul-
tiword expression (MWE) translations from that of
the head. E.g., in generating translations for 不動
産·会社 fudousaN·gaisha “real estate company”,
we get two word-level translations for 不動産: real
estate and real property. In each case, we identify
the final word as the head, and calculate the num-
ber of times the MWEs (i.e. real estate and real
property) occur in the overall corpus as compared
to the head (i.e. estate and property, respectively).
In calculating the values of each of the frequency-
based features involving these translations, we de-
termine the frequency of the head in the given con-
text, and multiply this by the normalisation param-
eter. The reason for doing this is for ease of cal-
culation and, wherever possible, to avoid zero val-
ues for frequencies involving MWEs. The feature
mwe(⟨w^T_1, w^T_2, t⟩) is generated by multiplying the
MWE parameters for each of w^T_1 and w^T_2 (which
are set to 1.0 in the case that the translation is sim-
plex) and intended to model the tendency to pre-
fer simplex translations over MWEs when given a
choice.
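The MWE normalisation and the resulting head-based frequency estimate can be sketched as follows; the frequency table is illustrative, with the head taken to be the final word as described above.

```python
def mwe_param(translation, freq):
    """Ratio of an MWE translation's corpus frequency to that of its
    head (the final word); 1.0 for simplex translations."""
    words = translation.split()
    if len(words) == 1:
        return 1.0
    return freq[translation] / freq[words[-1]]

def estimate_freq(translation, head_freq_in_context, freq):
    """Estimate the frequency of an MWE translation in some context
    from its head's frequency in that context, scaled by mwe_param."""
    return head_freq_in_context * mwe_param(translation, freq)

# Toy counts: "real estate" accounts for a large share of "estate"
# occurrences, "real property" for a tiny share of "property".
freq = {"real estate": 400, "estate": 1000,
        "real property": 20, "property": 2000}
```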
We construct an additional feature from each of
these values, by normalising (by simple division to
generate a value in the range [0, 1]) relative to the
maximum value for that feature among the trans-
lation candidates generated for a given source lan-
guage input. For each corpus, therefore, the total
number of corpus-based features is 8 × 2 = 16.
In EJ translation, the corpus-based feature values
were derived from the Mainichi Shimbun Corpus,
whereas in JE translation, we used the BNC and
Reuters Corpus, and concatenated the feature val-
ues from each.
Bilingual dictionary-based features
Bilingual dictionary data is used to generate 6 fea-
tures:
• freq_dict+trans(⟨w^T_1, w^T_2, t | w^S_1, w^S_2⟩)
• freq_dict+trans(⟨w^T_1, w^T_2, t⟩)
• freq_dict(⟨w^T_1, t | w^S_1⟩) and freq_dict(⟨w^T_2, t | w^S_2⟩)
• freq_dict(⟨w^T_1 | w^S_1⟩) and freq_dict(⟨w^T_2 | w^S_2⟩)
freq_dict+trans(⟨w^T_1, w^T_2, t | w^S_1, w^S_2⟩) is the total
number of times the given translation candidate oc-
curs as a translation for the source language NN
compound across all dictionaries. While this fea-
ture may seem to give our method an unfair ad-
vantage over CTQ, it is important to realise that
only limited numbers of NN compounds are listed
in the dictionaries (12% for English and 28%
for Japanese), and that the gold-standard accuracy
when the dictionary translation is selected is not as
high as one would expect (65% for English and 75%
for Japanese). freq_dict+trans(⟨w^T_1, w^T_2, t⟩) describes
the total occurrences of the translation candidate
across all dictionaries (irrespective of the source
language expression it translates), and is considered
to be an indication of conventionalisation of the can-
didate.
The remaining features are intended to capture
word-level translation probabilities, optionally in
the context of the template used in the translation
candidate. Returning to our 関連·記事 kaNreN·kiji
“related article” example from above, of the transla-
tions article and item for 記事, article occurs as the
translation of 記事 for 42% of NN entries with 記事
as the N2, and within 18% of translations for com-
plex entries involving 記事 (irrespective of the form
or alignment between article and 記事). For item,
the respective statistics are 9% and 4%. From this,
we can conclude that article is the more appropri-
ate translation, particularly for the given translation
template.
As with the corpus-based features, we addition-
ally construct a normalised variant of each fea-
ture value, such that the total number of bilingual
dictionary-based features is 6 × 2 = 12.
In both JE and EJ translation, we derived bilin-
gual dictionary-based features from the EDICT and
ALTDIC dictionaries independently, and concate-
nated the features derived from each.
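The word-level dictionary preference underlying these features can be sketched as a conditional relative frequency over dictionary alignments. The alignment list below is a toy stand-in modelled on the kiji example above, not real ALTDIC/EDICT counts.

```python
from collections import Counter

def dict_translation_prob(alignments, source_word):
    """P(translation | source word), estimated as the relative frequency
    of each aligned translation across dictionary entries."""
    trans = Counter(t for s, t in alignments if s == source_word)
    total = sum(trans.values())
    return {t: c / total for t, c in trans.items()}

# Toy alignments over dictionary entries involving kiji: article is
# clearly the more general translation, item the more marked one.
alignments = ([("kiji", "article")] * 42 + [("kiji", "item")] * 9
              + [("kiji", "report")] * 49)
probs = dict_translation_prob(alignments, "kiji")
```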
Template-based features
We use a total of two template-based features: the
template type and the target language head (N1 or
N2). For template [N1 N2]J → [N2 N1]E (see §2.3),
e.g., the template type is N-N and the target lan-
guage head is N1.
3.3 Corpus data
The corpus frequencies were extracted from the
same three corpora as were described in §1: the
BNC and Reuters Corpus for English, and Mainichi
Shimbun Corpus for Japanese. We chose to use the
BNC and Reuters Corpus because of their comple-
mentary nature: the BNC is a balanced corpus and
hence has a rounded coverage of NN compounds
(see Table 1), whereas the Reuters Corpus contains
newswire data which aligns relatively well in con-
tent with the newspaper articles in the Mainichi
Shimbun Corpus.
We calculated the corpus frequencies based on
the tag and dependency output of RASP (Briscoe
and Carroll, 2002) for English, and CaboCha (Kudo
and Matsumoto, 2002) for Japanese. RASP is a tag
sequence grammar-based stochastic parser which
attempts to exhaustively resolve inter-word depen-
dencies in the input. CaboCha, on the other hand,
chunks the input into head-annotated “bunsetsu” or
base phrases, and resolves only inter-phrase depen-
dencies. We thus independently determined the
intra-phrasal structure from the CaboCha output
based on POS-conditioned templates.
4 Evaluation
We evaluate the method over both JE and EJ trans-
lation selection, using the two sets of 750 NN com-
pounds described in §2.2. In each case, we first
evaluate system performance according to gold-
standard accuracy, i.e. the proportion of inputs
for which the (unique) gold-standard translation is
ranked top amongst the translation candidates. For
the method to have a chance at selecting the gold-
standard translation, we clearly must be able to
generate it. The first step is thus to identify in-
puts which have translation-compositional gold-
standard translations, and generate the translation
candidates for each. The translation-compositional
data has the distribution given in Table 4. The over-
all proportion of translation-compositional inputs
is somewhat lower than suggested by Tanaka and
Baldwin (2003a), although this is conditional on the
coverage of the particular dictionaries we use. The
degree of translation-compositionality appears to be
relatively constant across the three frequency bands,
a somewhat surprising finding as we had expected
the lower frequency NN compounds to be less con-
ventionalised and therefore have more straightfor-
wardly compositional translations.
We use the translation-compositional test data to
evaluate the proposed method (SVM_corp+dict) against
CTQ and a simple baseline derived from CTQ, which
takes the most probable fully-specified translation
JE EJ
ALL 297/750 272/750
HIGH 99/250 108/250
MED 98/250 81/250
LOW 100/250 83/250
Table 4: Analysis of translation compositionality
Baseline  CTQ  SVM_corp  SVM_dict  SVM_corp+dict
JE .317 .367 .390 .382 .434
EJ .400 .416 .441 .296 .514
Table 5: Gold-standard translation accuracies
candidate (i.e. is equivalent to setting α = 1 and
β = 0). We additionally tested the proposed method
using just corpus-based features (SVM_corp) and bilin-
gual dictionary-based features (SVM_dict) to get a bet-
ter sense for the relative impact of each on overall
performance. In the case of the proposed method
and its derivants, evaluation is according to 10-fold
stratified cross-validation, with stratification taking
place across the three frequency bands. The average
number of translations generated for the JE dataset
was 205.6, and that for the EJ dataset was 847.5.
We were unable to generate any translations for 17
(2.3%) and 57 (7.6%) of the NN compounds in the
JE and EJ datasets, respectively, due to there being
no word-level translations for N1 and/or N2 in the
combined ALTDIC/EDICT dictionaries.
The gold-standard accuracies are presented in Ta-
ble 5, with figures in boldface indicating a statis-
tically significant improvement over both CTQ and
the baseline.6 Except for SVM_dict in the EJ task, all
evaluated methods surpass the baseline, and all vari-
ants of SVM surpass CTQ. SVM_corp+dict appears to
successfully consolidate on SVM_corp and SVM_dict, in-
dicating that our modelling of target language cor-
pus and crosslingual data is complementary. Over-
all, the results for the EJ task are higher than those
for the JE task. Part of the reason for this is that
Japanese has less translation variability for a given
pair of word translations, as discussed below.
In looking through the examples where a gold-
standard translation was not returned by the dif-
ferent methods, we often find that the unique-
ness of gold-standard translation has meant that
equally good translations (e.g. dollar note vs. the
gold-standard translation dollar bill for ドル·紙
幣 doru·shihei) or marginally lower-quality but per-
fectly acceptable translations (e.g. territorial issue
vs. the gold-standard translation of territorial dis-
pute for 領土·問題 ryoudo·moNdai) are adjudged
incorrect. To rate the utility of these near-miss
translations, we rated each non-gold-standard first-
ranking translation according to source language-
recoverability (L1-recoverability). L1-recoverable
6Based on the paired t test, p < 0.05.
Baseline  CTQ  SVM_corp  SVM_dict  SVM_corp+dict
JE .616 .721 .764 .693 .839
EJ .621 .654 .721 .419 .783
Table 6: Silver-standard translation accuracies
Band  Training  Baseline     CTQ          SVM_corp+dict
      data      G     S      G     S      G     S
HIGH  All                                 .464  .879
      Local     .425  .789   .445  .806   .462  .857
MED   All                                 .474  .889
      Local     .315  .665   .368  .797   .480  .878
LOW   All                                 .332  .742
      Local     .210  .393   .280  .569   .320  .720
Table 7: JE translation accuracies across different
frequency bands
translations are defined to be syntactically un-
marked, capture the basic semantics of the source
language expression and allow the source language
expression to be recovered with reasonable confi-
dence. While evaluation of L1-recoverability is in-
evitably subjective, we minimise bias towards any
given system by performing the L1-recoverability
annotation for all methods in a single batch, without
giving the annotator any indication of which method
selected which translation. The average number
of English and Japanese L1-recoverable translations
were 1.9 and 0.94, respectively. The principal rea-
son for the English data being more forgiving is the
existence of possessive- and PP-based paraphrases
of NN gold-standard translations (e.g. amendment
of rule(s) as an L1-recoverable paraphrase of rule
amendment).
We combine the gold-standard data and L1-
recoverable translation data together into a sin-
gle silver standard translation dataset, based upon
which we calculate silver-standard translation accu-
racy. The results for the translation-compositional
data are given in Table 6. Once again, we find
that the proposed method is superior to the base-
line and CTQ, and that the combination of crosslin-
gual and target language corpus data is superior
to the individual data sources. SVM_dict fares par-
ticularly badly under silver-standard evaluation as
it is unable to capture the target language lexi-
cal and constructional preferences as are needed to
generate syntactically-unmarked, natural-sounding
translations. Unsurprisingly, the increment between
gold-standard accuracy and silver-standard accu-
racy is greater for English than Japanese.
4.1 Accuracy over each frequency band
We next analyse the breakdown in gold- and silver-
standard accuracies across the three frequency
bands. In doing this, we test the hypothesis that
training over only translation data from the same
frequency band will produce better results than
Band  Training  Baseline     CTQ          SVM_corp+dict
      data      G     S      G     S      G     S
HIGH  All                                 .630  .842
      Local     .451  .641   .463  .657   .657  .850
MED   All                                 .532  .762
      Local     .420  .655   .452  .674   .546  .776
LOW   All                                 .396  .755
      Local     .314  .561   .341  .633   .374  .708
Table 8: EJ translation accuracies across different
frequency bands
      Baseline   CTQ    SVM_TL   SVM_CL   SVM_TL+CL
JE    .358       .515   .490     .308     .549
EJ    .208       .285   .350     .162     .277

Table 9: Silver-standard translation accuracies over
non-translation-compositional data
training over all the translation data. The results
for the JE and EJ translation tasks are presented
in Tables 7 and 8, respectively. The results based
on training over data from all frequency bands are
labelled All and those based on training over data
from only the same frequency band are labelled Lo-
cal; G is the gold-standard accuracy and S is the
silver-standard accuracy.
For each of the methods tested, we find that the
gold- and silver-standard accuracies drop as we go
down through the frequency bands, although the
drop-off is markedly greater for gold-standard ac-
curacy. Indeed, silver-standard accuracy is con-
stant between the high and medium bands for the
JE task, and the medium and low frequency bands
for the EJ task. SVM_TL+CL appears to be robust over
low-frequency data for both tasks: the abso-
lute difference in silver-standard accuracy between
the high and low frequency bands is only around 0.10,
and accuracy never drops below 0.70 for either the EJ or
JE task. There was very little difference between
training over data from all frequency bands as com-
pared to only the local frequency band, suggesting
that there is little to be gained from conditioning
training data on the relative frequency of the NN
compound we are seeking to translate.
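The all-vs-local comparison above can be sketched as a small harness (hypothetical, not the authors' implementation; `train` and `evaluate` stand in for the SVM learner and the gold/silver scorer):

```python
def compare_training_regimes(train_pool, test_pool, train, evaluate):
    """For each frequency band, compare a model trained on all bands
    ("All") against one trained only on the test items' band ("Local")."""
    results = {}
    for band in ("high", "med", "low"):
        test = [x for x in test_pool if x["band"] == band]
        local_pool = [x for x in train_pool if x["band"] == band]
        results[band] = {
            "All": evaluate(train(train_pool), test),
            "Local": evaluate(train(local_pool), test),
        }
    return results
```

With stub `train`/`evaluate` functions this just exercises the partitioning logic; in the experiment reported here, the two regimes produced near-identical accuracies.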
4.2 Accuracy over non-translation-compositional data
Finally, we evaluate the performance of the meth-
ods over the non-translation compositional data. We
are unable to give gold-standard accuracies here
as, by definition, the gold-standard translation is
not amongst the translation candidates generated
for any of the inputs. We are, however, able
to evaluate according to silver-standard accuracy,
constructing L1-recoverable translation data as for
the translation-compositional case described above.
The classifier is learned from all the translation-
compositional data, treating the gold-standard trans-
lations as positive exemplars as before.
The results are presented in Table 9. A large
disparity is observable here between the JE and
EJ accuracies, which is, once again, a direct re-
sult of Japanese being less forgiving when it comes
to L1-recoverable translations. For the translation-
compositional data, the EJ task displayed a simi-
larly diminished accuracy increment when the L1-
recoverable translation data was incorporated, but
this was masked by the higher gold-standard ac-
curacy for the task. The relative results for the
JE task largely mirror those for the translation-
compositional data. In contrast, SVM_TL+CL actually
performs marginally worse than CTQ over the EJ
task, despite SVM_TL performing above CTQ. That
is, the addition of dictionary data diminishes overall
accuracy, a slightly surprising result given the com-
plementarity of corpus and dictionary data in all other
aspects of evaluation. It is possible that we could
get better results by treating both L1-recoverable
and gold-standard translations in the training data
as positive exemplars, which we leave as an item
for future research.
Combining the results from Table 9 with those
from Table 6, the overall silver-standard accuracy
over the JE data is 0.671 for SVM_TL+CL (compared to
0.602 for CTQ), and that over the EJ data is 0.461
(compared to 0.419 for CTQ).
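These combined figures are, in effect, count-weighted averages of the per-dataset accuracies. As a sketch (the item counts below are hypothetical, not the paper's):

```python
def combined_accuracy(parts):
    """parts: iterable of (accuracy, item_count) pairs.
    Returns the accuracy over the pooled items."""
    total_correct = sum(acc * n for acc, n in parts)
    total_items = sum(n for _, n in parts)
    return total_correct / total_items

# Hypothetical counts: 400 translation-compositional items scored at
# 0.70 and 100 non-compositional items scored at 0.55.
print(round(combined_accuracy([(0.70, 400), (0.55, 100)]), 3))  # -> 0.67
```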
In summary, we have shown our method to be su-
perior to both the baseline and CTQ over EJ and JE
translation tasks in terms of both gold- and silver-
standard accuracy. We also demonstrated that the
method successfully combines crosslingual and tar-
get language corpus data, and is relatively robust
over low frequency inputs.
5 Related work
One piece of research relatively closely related to
our method is that of Cao and Li (2002), who use
bilingual bootstrapping over Chinese and English
web data in various forms to translate Chinese NN
compounds into English. While we rely on bilin-
gual dictionaries to determine crosslingual similar-
ity, their method is based on contextual similarity
in the two languages, without assuming parallelism
or comparability in the corpus data. They report an
impressive F-score of 0.73 over a dataset of 1000
instances, although they also cite a prior-based F-
score (equivalent to our Baseline) of 0.70 for the
task, such that the particular data set they are deal-
ing with would appear to be less complex than that
which we have targeted. Having said this, contex-
tual similarity is an orthogonal data source to those
used in this research, and has the potential to further
improve the accuracy of our method.
Nagata et al. (2001) use “partially bilingual” web
pages, that is, web pages which are predominantly
Japanese, say, but are interspersed with English words,
to extract translation pairs. They do this by access-
ing web pages containing a given Japanese expres-
sion, and looking for the English expression which
occurs most reliably in its immediate vicinity. The
method achieves an impressive gold-standard accu-
racy of 0.62, at a recall of 0.68, over a combination
of simplex nouns and compound nominals.
Grefenstette (1999) uses web data to select En-
glish translations for compositional German and
Spanish noun compounds, and achieves an impres-
sive accuracy of 0.86–0.87. The translation task
Grefenstette targets is intrinsically simpler than that
described in this paper, however, in that he consid-
ers only those compounds which translate into NN
compounds in English. It is also possible that the
historical relatedness of languages has an effect on
the difficulty of the translation task, although fur-
ther research would be required to confirm this pre-
diction. Having said this, the successful use of web
data by a variety of researchers suggests an avenue
for future research in comparing our results with
those obtained using web data.
6 Conclusion and future work
We have proposed a method for translating NN
compounds which compositionally generates trans-
lation candidates and selects among them using a
target language model based on corpus statistics and
a translation model based on bilingual dictionaries.
Our SVM-based implementation was shown to out-
perform previous methods and be robust over low-
frequency NN compounds for JE and EJ translation
tasks.
Acknowledgements
This material is based upon work supported by the
National Science Foundation under Grant No. BCS-
0094638 and also the Research Collaboration between
NTT Communication Science Laboratories, Nippon
Telegraph and Telephone Corporation and CSLI, Stan-
ford University. We would like to thank Emily Bender,
Francis Bond, Dan Flickinger, Stephan Oepen, Ivan Sag
and the anonymous reviewers for their valuable input on
this research.

References
Jim Breen. 1995. Building an electronic Japanese-English dic-
tionary. Japanese Studies Association of Australia Confer-
ence.
Ted Briscoe and John Carroll. 2002. Robust accurate statistical
annotation of general text. In Proc. of the 3rd International
Conference on Language Resources and Evaluation (LREC
2002), pages 1499–1504, Las Palmas, Canary Islands.
Lou Burnard. 2000. User Reference Guide for the British Na-
tional Corpus. Technical report, Oxford University Com-
puting Services.
Yunbo Cao and Hang Li. 2002. Base noun phrase transla-
tion using Web data and the EM algorithm. In Proc. of the
19th International Conference on Computational Linguis-
tics (COLING 2002), Taipei, Taiwan.
Gregory Grefenstette. 1999. The World Wide Web as a re-
source for example-based machine translation tasks. In
Translating and the Computer 21: ASLIB’99, London, UK.
Satoru Ikehara, Satoshi Shirai, Akio Yokoo, and Hiromi
Nakaiwa. 1991. Toward an MT system without pre-editing
– effects of new methods in ALT-J/E–. In Proc. of the Third
Machine Translation Summit (MT Summit III), pages 101–
106, Washington DC, USA.
Kyo Kageura, Fuyuki Yoshikane, and Takayuki Nozawa. 2004.
Parallel bilingual paraphrase rules for noun compounds:
Concepts and rules for exploring Web language resources.
In Proc. of the Fourth Workshop on Asian Language Re-
sources, pages 54–61, Sanya, China.
Philipp Koehn and Kevin Knight. 2003. Feature-rich statisti-
cal translation of noun phrases. In Proc. of the 41st Annual
Meeting of the ACL, Sapporo, Japan.
Taku Kudo and Yuji Matsumoto. 2002. Japanese dependency
analysis using cascaded chunking. In Proc. of the 6th
Conference on Natural Language Learning (CoNLL-2002),
pages 63–69, Taipei, Taiwan.
Irene Langkilde and Kevin Knight. 1998. Generation that ex-
ploits corpus-based statistical knowledge. In Proc. of the
36th Annual Meeting of the ACL and 17th International
Conference on Computational Linguistics (COLING/ACL-
98), pages 704–710, Montreal, Canada.
Judith N. Levi. 1978. The Syntax and Semantics of Complex
Nominals. Academic Press, New York, USA.
Mainichi Newspaper Co. 2001. Mainichi Shimbun CD-ROM
2001.
Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, and Yoshi-
taka Hirano. 1999. Japanese Morphological Analysis Sys-
tem ChaSen Version 2.0 Manual. Technical Report NAIST-
IS-TR99009, NAIST.
Guido Minnen, John Carroll, and Darren Pearce. 2001. Applied
morphological processing of English. Natural Language
Engineering, 7(3):207–223.
Masaaki Nagata, Teruka Saito, and Kenji Suzuki. 2001. Using
the Web as a bilingual dictionary. In Proc. of the ACL/EACL
2001 Workshop on Data-Driven Methods in Machine Trans-
lation, pages 95–102, Toulouse, France.
Grace Ngai and Radu Florian. 2001. Transformation-based
learning in the fast lane. In Proc. of the 2nd Annual Meeting
of the North American Chapter of the Association for Compu-
tational Linguistics (NAACL 2001), pages 40–47, Pittsburgh,
USA.
Tony Rose, Mark Stevenson, and Miles Whitehead. 2002. The
Reuters Corpus volume 1 – from yesterday’s news to tomor-
row’s language resources. In Proc. of the 3rd International
Conference on Language Resources and Evaluation (LREC
2002), pages 827–833, Las Palmas, Canary Islands.
Takaaki Tanaka and Timothy Baldwin. 2003a. Noun-noun
compound machine translation: A feasibility study on shal-
low processing. In Proc. of the ACL-2003 Workshop on
Multiword Expressions: Analysis, Acquisition and Treat-
ment, pages 17–24, Sapporo, Japan.
Takaaki Tanaka and Timothy Baldwin. 2003b. Translation
selection for Japanese-English noun-noun compounds. In
Proc. of the Ninth Machine Translation Summit (MT Sum-
mit IX), pages 89–96, New Orleans, USA.
