Multilingual Aligned Parallel Treebank Corpus Reflecting
Contextual Information and Its Applications
Kiyotaka Uchimoto
†
Yujie Zhang
†
Kiyoshi Sudo
‡
Masaki Murata
†
Satoshi Sekine
‡
Hitoshi Isahara
†
†
National Institute of Information and Communications Technology
3-5 Hikari-dai, Seika-cho, Soraku-gun,
Kyoto 619-0289, Japan
{uchimoto,yujie,murata,isahara}@nict.go.jp
‡
New York University
715 Broadway, 7th floor
New York, NY 10003, USA
{sudo,sekine}@cs.nyu.edu
Abstract
This paper describes Japanese-English-Chinese
aligned parallel treebank corpora of newspaper
articles. They have been constructed by trans-
lating each sentence in the Penn Treebank and
the Kyoto University text corpus into a cor-
responding natural sentence in a target lan-
guage. Each sentence is translated so as to
reflect its contextual information and is anno-
tated with morphological and syntactic struc-
tures and phrasal alignment. This paper also
describes the possible applications of the par-
allel corpus and proposes a new framework to
aid in translation. In this framework, paral-
lel translations whose source language sentence
is similar to a given sentence can be semi-
automatically generated. In this paper we show
that the framework can be achieved by using
our aligned parallel treebank corpus.
1 Introduction
Recently, accurate machine translation systems
can be constructed by using parallel corpora
(Och and Ney, 2000; Germann et al., 2001).
However, almost all existing machine transla-
tion systems do not consider the problem of
translating a given sentence into a natural sen-
tence reflecting its contextual information in the
target language. One of the main reasons for
this is that we had many problems that had to
be solved by one-sentence to one-sentence ma-
chine translation before we could solve the con-
textual problem. Another reason is that it was
diﬃcult to simply investigate the influence of
the context on the translation because sentence
correspondences of the existing bilingual doc-
uments are rarely one-to-one, and are usually
one-to-many or many-to-many.
On the other hand, high-quality treebanks
such as the Penn Treebank (Marcus et al., 1993)
and the Kyoto University text corpus (Kuro-
hashi and Nagao, 1997) have contributed to
improving the accuracies of fundamental tech-
niques for natural language processing such as
morphological analysis and syntactic structure
analysis. However, almost all of these high-
quality treebanks are based on monolingual cor-
pora and do not have bilingual or multilin-
gual information. There are few high-quality
bilingual or multilingual treebank corpora be-
cause parallel corpora have mainly been actively
used for machine translation between related
languages such as English and French, there-
fore their syntactic structures are not required
so much for aligning words or phrases. How-
ever, syntactic structures are necessary for ma-
chine translation between languages whose syn-
tactic structures are diﬀerent from each other,
such as in Japanese-English, Japanese-Chinese,
and Chinese-English machine translations, be-
cause it is more diﬃcult to automatically align
words or phrases between two unrelated lan-
guages than between two related languages. Ac-
tually, it has been reported that syntactic struc-
tures contribute to improving the accuracy of
word alignment between Japanese and English
(Yamada and Knight, 2001). Therefore, if we
had a high-quality parallel treebank corpus, the
accuracies of machine translation between lan-
guages whose syntactic structures are diﬀer-
ent from each other would improve. Further-
more, if the parallel treebank corpus had word
or phrase alignment, the accuracy of automatic
word or phrase alignment would increase by
using the parallel treebank corpus as training
data. However, so far, there is no aligned par-
allel treebank corpus whose domain is not re-
stricted. For example, the Japanese Electronics
Industry Development Association’s (JEIDA’s)
bilingual corpus (Isahara and Haruno, 2000)
has sentence, phrase, and proper noun align-
ment. However, it does not have morphologi-
cal and syntactic information, the alignment is
partial, and the target is restricted to a white
paper. The Advance Telecommunications Re-
search dialogue database (ATR, 1992) is a par-
allel treebank corpus between Japanese and En-
glish. However, it does not have word or phrase
alignment, and the target domain is restricted
to travel conversation.
Therefore, we have been constructing aligned
parallel treebank corpora of newspaper articles
between languages whose syntactic structures
are diﬀerent from each other since 2001; they
meet the following conditions.
1. It is easy to investigate the influence of the con-
text on the translation, which means the sen-
tences that come before and after a particular
sentence, and that help us to understand the
meaning of a particular word such as a pro-
noun.
2. The annotated information in the existing
monolingual high-quality treebanks can be uti-
lized.
3. They are open to the public.
To construct parallel corpora that satisfy these
conditions, each sentence in the Penn Tree-
bank (Release 2) and the Kyoto University text
corpus (Version 3.0) has been translated into
a corresponding natural sentence reflecting its
contextual information in a target language by
skilled translators, revised by native speakers,
and each parallel translation has been anno-
tated with morphological and syntactic struc-
tures, and phrasal alignment. Henceforth, we
call the parallel corpus that is constructed by
pursuing the above policy an aligned parallel
treebank corpus reflecting contextual informa-
tion. In this paper, we describe an aligned par-
allel treebank corpus of newspaper articles be-
tween Japanese, English, and Chinese, and its
applications.
2 Construction of Aligned Parallel
Treebank Corpus Reflecting
Contextual Information
2.1 Human Translation of Existing
Monolingual Treebank
The Penn Treebank is a tagged corpus of Wall
Street Journal material, and it is divided into 24
sections. The Kyoto University text corpus is a
tagged corpus of the Mainichi newspaper, which
isdividedinto16sectionsaccordingtothecat-
egories of articles such as the sports section and
the economy section. To maintain the consis-
tency of expressions in translation, a few partic-
ular translators were assigned to translate arti-
cles in a particular section, and the same trans-
lator was assigned to the same section. The
instructions to translators for Japanese-English
translation is basically as follows.
1. One-sentence to one-sentence translation as a
rule
Translate a source sentence into a target sen-
tence. In case the translated sentence becomes
unnatural by pursuing this policy, leave a com-
ment.
2. Natural translation reflecting contextual infor-
mation
Except in the case that the translated sentence
becomes unnatural by pursuing policy 1, trans-
late a source sentence into a target sentence
naturally.
By deletion, replacement, or supplementation,
let the translated sentence be natural in the
context.
In an entire article, the translated sentences
must maintain the same meaning and informa-
tion as those of the original sentences.
3. Translations of proper nouns
Find out the translations of proper nouns by
looking up the nouns in a dictionary or by using
a web search. In case a translation cannot be
found, use a temporary name and report it.
We started the construction of Japanese-
Chinese parallel corpus in 2002. The Japanese
sentences of the Kyoto University text corpus
were also translated into Chinese by human
translators. Then each translated Chinese sen-
tence was revised by a second Chinese native.
The instruction to the translators is the same
as that given in the Japanese-English human
translations.
The breakdown of the parallel corpora is
shown in Table 1. We are planning to trans-
late the remaining 18,714 sentences of the Kyoto
University text corpus and the remaining 30,890
sentences of the Penn Treebank. As for the nat-
uralness of the translated sentences, there are
207 (1%) unnatural English sentences of the
Kyoto University text corpus, and 462 (2.5%)
unnatural Japanese sentences of the Penn Tree-
bank generated by pursuing policy 1.
2.2 Morphological and Syntactic
Annotation
In the following sections, we describe the anno-
tated information of the parallel treebank cor-
pus based on the Kyoto University text corpus.
2.2.1 Morphological and Syntactic
Information of Japanese-English
corpus
Translated English sentences were analyzed by
using the Charniak Parser (Charniak, 1999).
Then, the parsed sentences were manually re-
vised. The definitions of part-of-speech (POS)
categories and syntactic labels follow those of
the Treebank I style (Marcus et al., 1993).
We have finished revising the 10,328 parsed
sentences that appeared from January 1st to
11th. An example of morphological and syn-
tactic structures is shown in Figure 1. In this
figure, “S-ID” means the sentence ID in the
Kyoto University text corpus. EOJ means the
boundary between a Japanese parsed sentence
andanEnglishparsedsentence.Thedefinition
of Japanese morphological and syntactic infor-
mation follows that of the Kyoto University text
corpus (Version 3.0). The syntactic structure is
represented by dependencies between Japanese
phrasal units called bunsetsus. The phrasal
Table 1: Breakdown of the parallel corpora
Original corpus Languages # of parallel sentences
Kyoto University text corpus Japanese-English 19,669 (from Jan. 1st to 17th in 1995)
Japanese-Chinese 38,383 (all)
Penn Treebank Japanese-English 18,318 (from section 0 to 9)
Total Japanese-English 37,987 (Approximately 900,000 English words)
Japanese-Chinese 38,383 (Approximately 900,000 Chinese words)
# S-ID:950104141-008
*02D
Mc��Mc��*��***
*12D
	G�a�OV�O*��
:�**
@^M*
�����
Q��	
:�**

��e�]*
�����
Q��
���**
ww*	�
�	�**
*26D
		 �T�w*������**
ppiQ�*Q����%�;�
zz*�	�:**
*34D
��`m��*���!��**
tt*	��	�**
*45D
tQ�\hQ�tQ���*<;��,�
*56D
>�V��X*������**
��*	��	�**
* 6 -1D
�low\lo����*�;�����%�;�
MMM�
�����
Q
���<;���
�
sMsMsM
���0�
Q	\�
����0�����,�
{{*�	�:**
EOJ
(S1 (S (NP (PRP They))
(VP (VP (VBD were)
(NP (DT all))
(ADJP (NP (QP (RB about)
(CD nineteen))
(NNS years))
(JJ old)))
(CC and)
(VP (VBD had)
(S (NP (DT no)
(NN strength))
(VP (VBN left)
(SBAR (S (VP (ADVP (RB even))
(TO to)
(VP (VB answer)
(NP (NNS questions))))))))))
(. .)))
EOE
Figure 1: Example of morphological and syn-
tactic information.
units or bunsetsus are minimal linguistic units
obtained by segmenting a sentence naturally in
terms of semantics and phonetics, and each of
them consists of one or more morphemes.
2.2.2 Chinese Morphological
Information of Japanese-Chinese
corpus
Chinese sentences are composed of strings of
Hanzi and there are no spaces between words.
The morphological annotation, therefore, in-
cludes providing tags of word boundaries and
POSs of words. We analyzed the Chinese sen-
tences by using the morphological analyzer de-
veloped by Peking University (Zhou and Duan,
1994). There are 39 categories in this POS set.
Then the automatically tagged sentences were
revised by the third native Chinese. In this
pass the Chinese translations were revised again
while the results of word segmentation and POS
tagging were revised. Therefore the Chinese
translations are obtained with a high quality.
We have finished revising the 12,000 tagged sen-
tences. The revision of the remaining sentences
is ongoing. An example of tagged Chinese sen-
tences is shown in Figure 2. The letters shown
Figure 2: Example of morphological informa-
tion of Chinese corpus.
after ’/’ indicate POSs. The Chinese sentence is
the translation of the Japanese sentence in Fig-
ure 1. The Chinese sentences are GB encoded.
The 38,383 translated Chinese sentences have
1,410,892 Hanzi and 926,838 words.
2.3 Phrasal Alignment
This section describes the annotated informa-
tion of 19,669 sentences of the Kyoto University
text corpus.
The minimum alignment unit should be as
small as possible, because bigger units can be
constructed from units of the minimum size.
However, we decided to define a bunsetsu as the
minimum alignment unit. One of the main rea-
sons for this is that the smaller the unit is, the
higher the human annotation cost is. Another
reason is that if we define a word or a morpheme
as a minimum alignment unit, expressions such
as post-positional particles in Japanese and arti-
cles in English often do not have alignments. To
eﬀectively absorb those expressions and to align
as many parts as possible, we found that a big-
ger unit than a word or a morpheme is suitable
as the minimum alignment unit. We call the
minimum alignment based on bunsetsu align-
ment units the bunsetsu unit translation pair.
Bigger pairs than the bunsetsu unit translation
pairs can be automatically extracted based on
the bunsetsu unit translation pairs. We call all
of the pairs, including bunsetsu unit transla-
tion pairs, translation pairs.Thebunsetsu unit
translation pairs for idiomatic expressions often
become unnatural. In this case, two or more
bunsetsu units are combined and handled as a
minimum alignment unit. The breakdown of
the bunsetsu unit translation pairs is shown in
Table 2.
Table 2: Breakdown of the bunsetsu unit trans-
lation pairs.
(1) total # of translation pairs 172,255
(2) # of diﬀerent translation pairs 146,397
(3) # of Japanese expressions 110,284
(4) # of English expressions 111,111
(5) average # of English expressions 1.33
corresponding to a Japanese expression ((2)/(3))
(6) average # of Japanese expressions 1.32
corresponding to a English expression ((2)/(4))
(7) # of ambiguous Japanese expressions 15,699
(8) # of ambiguous English expressions 12,442
(9) # of bunsetsu unit translation pairs 17,719
consisting of two or more bunsetsus
An example of phrasal alignment is shown in
Figure 3. A Japanese sentence is shown from
the line after the S-ID to the EOJ. Each line
indicates a bunsetsu. Each rectangular line in-
dicates a dependency between bunsetsus. The
leftmost number in each line indicates the bun-
setsu ID. The corresponding English sentence is
showninthenextlineafterthatoftheEOJ
(End of Japanese) until the EOE (End of En-
glish). The English expressions corresponding
to each bunsetsu are tagged with the corre-
sponding bunsetsu ID such as <Pid=”bunsetsu
ID”></P>. When there are two or more fig-
ures in the tag id such as id=”1,2”, it means two
or more bunsetsus are combined and handled as
a minimum alignment unit.
For example, we can extract the following
translation pairs from Figure 3.
 (J)�U(yunyuu-ga)/r�^�h(kaikin-sa-reta);
(E)that had been under the ban
 (J)����w(beikoku-san-ringo-no); (E)of apples
imported from the U.S.
 (J)H(U(dai-ichi-bin-ga); (E)The first cargo
 (J)�	Z^�h{(uridasa-reta); (E)was brought to the
market.
 (J)����w(beikoku-san-ringo-no)/H(U
(dai-ichi-bin-ga); (E)The first cargo / of apples im-
ported from the U.S.
# S-ID:950110003-001
1yyy�UGyyyyyyyyy
2yyr�^�hGyyyyyyyy
3y����wGyyyyyyy
4yyyyyH(U77Gyyyy
5yyyyyyy�z7[yyyy
6yyyyyU�	V�[yyyy
7yyyyyyyyy	4Qz888%
8yyyyyyyyy	NMwGy:
9yyyyyG	����srp7_
10yyyyyyyyyyy	s�o_
11yyyyyyyyy�	Z^�h{
EOJ
<P id="4">The first cargo</P> <P id="3">of apples
imported from the U.S.</P> <P id="1,2">that had been
under the ban</P> <P id="7">completed</P> <P id="6">
quarantine</P> <P id="7">and</P> <P id="11">was brought
to the market</P> <P id="10">for the first time</P>
<P id="5">on the 9th</P> <P id="9">at major supermarket
chain stores</P> <P id="8">in the Tokyo metropolitan
area</P> <P id="11">.</P>
EOE
Figure 3: Example of phrasal alignment.
 (J)����w(beikoku-san-ringo-no)/H(U
(dai-ichi-bin-ga)/�	Z^�h{(uridasa-reta); (E)The
first cargo / of apples imported from the U.S. / was
brought to the market.
Here, Japanese and English expressions are
divided by the symbol “;”, and “/” means a
bunsetsu boundary.
An overview of the criteria of the alignment
is as follows. Align as many parts as possible,
except if a certain part is redundant. More de-
tailed criteria will be attached with our corpus
when it is open to the public.
1. Alignment of English grammatical elements
that are not expressed in Japanese
English articles, possessive pronouns, infinitive
to, and auxiliary verbs are joined with nouns
and verbs.
2. Alignment between a noun and its substitute
expression
A noun can be aligned with its substitute ex-
pression such as a pronoun.
3. Alignment of Japanese ellipses
An English expression is joined with its related
elements. For example, the English subject is
joined with its related verb.
4. Alignment of supplementary or explanatory ex-
pression in English
Supplementary or explanatory expressions in
English are joined with their related words.
�Ex.
# S-ID:950104142-003
1y�B�tx7777G
2yy��`M�%yy9
3yy�q�s�qGy9
4yyyyyyyMOG9
5yyyyyyy��U[
6yyyyyyyyK�{
EOJ
<P id="1">The Chinese character used for "ka"</P>
has such meanings as "beautiful" and "splendid."
EOE
~"�B(ka)�tx(niwa)" corresponds to
"The Chinese character used for "ka""
5. Alignment of date and time
When a Japanese noun representing date and
time is adverbial, the English preposition is
joined with the date and time.
6. Alignment of coordinate structures
When English expressions represented by “X
(A + B)” correspond to Japanese expressions
represented by “XA + XB”, the alignment of
Xoverlaps.
�Ex.
# S-ID:950106149-005
1y�@Mpx777777G
2yyy�Z-pyyyy9
3y@��ST�z8%yy9
4yyyy
�G�-pyy9
5yyy���ST�z77[
6yyyyy:�srwG9
7yyyyyyyyyd:U[
8yyyyyyyy��lh{
EOJ
In the Kinki Region, disposal of wastes started
<P id="2"><P id="4"> at offshore sites of</P>
Amagasaki</P> and <P id="4">Izumiotsu</P> from
1989 and 1991 respectively.
EOE
~"�Z-(Amagasaki-oki)p(de)" corresponds to
"at offshore sites of Amagasaki"
~"
�G�-(Izumiotsu-oki)p(de)" corresponds to
"at offshore sites of�Izumiotsu"
3 Applications of Aligned Parallel
Treebank Corpus
3.1 Use for Evaluation of Conventional
Methods
The corpus as described in Section 2 can be
used for the evaluation of English-Japanese and
Japanese-English machine translation. We can
directly compare various methods of machine
translation by using this corpus. It can be sum-
marized as follows in terms of the characteristics
of the corpus.
One-sentence to one-sentence translation
can be simply used for the evaluation of
various methods of machine translation.
Morphological and syntactic information
canbeusedfortheevaluationofmethods
that actively use morphological and syntactic
information, such as methods for example-
based machine translation (Nagao, 1981;
Watanabe et al., 2003), or transfer-based
machine translation (Imamura, 2002).
Phrasal alignment is used for the evaluation of
automatically acquired translation knowledge
(Yamamoto and Matsumoto, 2003).
An actual comparison and evaluation is our
future work.
3.2 Analysis of Translation
One-sentence to one-sentence translation
reflects contextual information. Therefore, it
is suitable to investigate the influence of the
context on the translation. For example, we
can investigate the diﬀerence in the use of
demonstratives and pronouns between English
and Japanese. We can also investigate the
diﬀerence in the use of anaphora.
Morphological and syntactic information
and phrasal alignment canbeusedtoinvesti-
gate the appropriate unit and size of transla-
tion rules and the relationship between syntac-
tic structures and phrasal alignment.
3.3 Use in Conventional Systems
One-sentence to one-sentence translation
can be used for training a statistical translation
model such as GIZA++ (Och and Ney, 2000),
which could be a strong baseline system for
machine translation.
Morphological and syntactic information
and phrasal alignment canbeusedtoacquire
translation knowledge for example-based ma-
chine translation and transfer-based machine
translation.
In order to show what kind of units are help-
ful for example-based machine translation, we
investigated whether the Japanese sentences of
newspaper articles appearing on January 17,
1995, which we call test-set sentences, could be
translated into English sentences by using trans-
lation pairs appearing from January 1st to 16th
as a database. First, we found that only one out
of 1,234 test-set sentences agreed with one out
of 18,435 sentences in the database. Therefore,
a simple sentence search will not work well. On
the other hand, 6,659 bunsetsus out of 12,632
bunsetsus in the test-set sentences agreed with
those in the database. If words in bunsetsusare
expanded into their synonyms, the combination
of the expanded bunsetsus sets in the database
may cover the test-set sentences. Next, there-
fore, we investigated whether the Japanese test-
set sentences could be translated into English
sentences by simply combining translation pairs
appearing in the database. Given a Japanese
sentence, words were extracted from it and
translation pairs that include those words or
their synonyms, which were manually evalu-
ated, were extracted from the database. Then,
the English sentence was manually generated by
just combining English expressions in the ex-
tracted translation pairs. One hundred two rel-
atively short sentences (the average number of
bunsetsus is about 9.8) were selected as inputs.
The number of equivalent translations, which
mean that the translated sentence is grammat-
ical and has the same meaning as the source
sentence, was 9. The number of similar transla-
tions, which mean that the translated sentence
is ungrammatical, or diﬀerent or wrong mean-
ings of words, tenses, and prepositions are used
in the translated sentence, was 83. The num-
ber of other translations, which mean that some
words are missing, or the meaning of the trans-
lated sentence is completely diﬀerent from that
of the original sentence, was 10. For example,
the original parallel translation is as follows:
Japanese:^VUZx�	�qt�Zz
Sf�q�����
�qb��^;��Xt
��b�\q��`h{
English: New Party Sakigake proposed that towards the or-
dinary session, both parties found a council to dis-
cuss policy and Diet management.
Given the Japanese sentence, the translated
sentence was:
Translation:Sakigake Party suggested to set up an organiza-
tion between the two parties towards the regular
session of the Diet to discuss under the theme of
policies and the management of the Diet.
This result shows that only 9% of input sen-
tences can be translated into sentences equiv-
alent to the original ones. However, we found
that approximately 90% of input sentences can
be translated into English sentences that are
equivalent or similar to the original ones.
3.4 Similar Parallel Translation
Generation
The original aim of constructing an aligned par-
allel treebank corpus as described in Section 2 is
to achieve a new framework for translation aid
as described below.
It would be very convenient if multilingual
sentences could be generated by just writing
sentences in our mother language. Today, it
can be formally achieved by using commercial
machine translation systems. However, the au-
tomatically translated sentences are often in-
comprehensible. Therefore, we have to revise
the original and translated sentences by find-
ing and referring to parallel translation whose
source language sentence is similar to the orig-
inal one. In many cases, however, we cannot
find such similar parallel translations to the in-
put sentence. Therefore, it is diﬃcult for users
who do not have enough knowledge of the target
languages to generate comprehensible sentences
in several languages by just searching similar
parallel translations in this way. Therefore, we
propose to generate similar parallel translations
whose source language sentence is similar to
the input sentence. We call this framework for
translation aid similar parallel translation gen-
eration.
We investigated whether the framework can
be achieved by using our aligned parallel tree-
bank corpus. As the first step of this study,
we investigated whether an appropriate parallel
translation can be generated by simply combin-
ing translation pairs extracted from our aligned
parallel treebank corpus in the following steps.
1. Extract each content word with its adjacent
function word in each bunsetsu in a given sen-
tence
2. The extracted content words and their adjacent
function words are expanded into their syn-
onyms and class words whose major and minor
POS categories are the same
3. Find translation pairs including the expanded
content words with their expanded adjacent
function words in the given sentence
4. For each bunsetsu, select a translation pair that
has similar dependency relationship to those in
the given sentence
5. Generate a parallel translation by combining
the selected translation pairs
The input sentences were randomly selected
from 102 sentences described in Section 3.3.
The above steps, except the third step, were
basically conducted manually. The Examples
of the input sentences and generated parallel
translations are shown in Figure 4.
The basic unit of translation pairs in our
aligned parallel treebank corpus is a bunsetsu,
and the basic unit in the selection of transla-
tion pairs is also a bunsetsu. One of the ad-
vantages of using a bunsetsu as a basic unit is
that a Japanese expression represented as one
of various expressions in English, or omitted in
English, such as Japanese post-positional par-
ticles, is paired with a content word. There-
fore, the translation of such an expression is ap-
propriately selected together with the transla-
tion of a content word when a certain trans-
lation pair is selected. If the translation of
such an expression was selected independently
of the translation of a content word, the com-
bination of each translation would be ungram-
matical or unnatural. Another advantage of the
basic unit, bunsetsu, is that we can easily refer
to dependency information between bunsetsus
when we select an appropriate translation pair
because the original treebank has the depen-
dency information between bunsetsus. These
advantages are utilized in the above generation
steps. For example, in the first step, a content
word “q(kokkai, Diet session)” in the sec-
ond example in Figure 4 was extracted from the
bunsetsu “�	�q(tsuujo-kokkai, the ordinary
Diet session)t(ni, case marker)”, and it was
expanded into its class word “q(kai, meeting)”
in the second step. Then, a translation pair
“(J)��r�wVb��q(kokuren-kodomo-
no-kenri-iinkai)t(ni, case marker); (E)the UN
Committee on the Rights of the Child /(J)
0`(taishi); (E)towards” was extracted as a
translation pair in the third step. Since the
dependency between “��r�wVb��q
(kokuren-kodomo-no-kenri-iinkai,theUNCom-
mittee on the Rights of the Child)” and “0`
(taishi, towards)” is similar to that between “
�	�q(tsuujo-kokkai, the ordinary Diet ses-
sion)t(ni, case marker)” and “�Z(muke,to-
wards)” in the input sentence, this translation
pair was selected in the fourth step. Finally,
the bunsetsu “��r�wVb��q(kokuren-
kodomo-no-kenri-iinkai, the UN Committee on
the Rights of the Child)t(ni, case marker)”
and its translation “the UN Committee on the
Rights of the Child” was used for generation of
a parallel translation in the fifth step.
When we use the generated parallel transla-
tion for the exact translation of the input sen-
tence, we should replace “��r�wVb�
�q(kokuren-kodomo-no-kenri-iinkai)” and its
translation “the UN Committee on the Rights
of the Child” with “�	�q(tsuujo-kokkai,the
ordinary Diet session)” and its translation “the
ordinary Diet session” by consulting a bilingual
dictionary. In this example, “fw(sono)” and
“them” should also be replaced with “�X(ry-
oto)” and “both parties”. It is easy to identify
words in the generated translation that should
be replaced with words in the input sentence
because each bunsetsu in translation pairs is al-
ready aligned. In such cases, templates such as
“[q^(kaigi)]t(ni)�Z(muke)” and “towards
[council]” can be automatically generated by
generalizing content words expanded in the sec-
ond step and their translation in the generated
translation. The average number of English ex-
pressions corresponding to a Japanese expres-
sion is 1.3 as shown in Table 2. Even when there
are two or more possible English expressions, an
appropriate English expression can be chosen
by selecting a Japanese expression by referring
to dependencies in extracted translation pairs.
Therefore, in many cases, English sentences can
be generated just by reordering the selected ex-
pressions. The English word order was esti-
mated manually in this experiment. However,
we can automatically estimate English word or-
der by using a language model or an English
surface sentence generator such as FERGUS
(Bangalore and Rambow, 2000). Unnatural or
ungrammatical parallel translations are some-
times generated in the above steps. However,
comprehensible translations can be generated
as shown in Figure 4. The biggest advantage
of this framework is that comprehensible target
sentences can be generated basically by refer-
ring only to source sentences. Although it is
costly to search and select appropriate transla-
tion pairs, we believe that human labor can be
reduced by developing a human interface. For
example, when we use a Japanese text gener-
ation system from keywords (Uchimoto et al.,
2002), users should only select appropriate key-
words.
We are investigating whether or not we can
generate similar parallel translations to all of
the Japanese sentences appearing on January
17, 1995. So far, we found that we can gen-
erate similar parallel translations to 691 out of
840 sentences (the average number of bunsetsus
is about 10.3) including the 102 sentences de-
scribed in Section 3.3. We found that we could
not generate similar parallel translations to 149
out of 840 sentences.
In the proposed framework of similar paral-
lel translation generation, the language appear-
ing in a corpus corresponds to a controlled lan-
guage, and users are allowed to use only the
controlled language to write sentences in the
source language. We believe that high-quality
bilingual or multilingual documents can be gen-
erated by letting us adapt ourselves to the con-
trolled environment in this way.
4Conclusion
This paper described aligned parallel treebank
corpora of newspaper articles between lan-
guages whose syntactic structures are diﬀerent
from each other; they meet the following condi-
tions.
1. It is easy to investigate the influence of the con-
text on the translation.
2. The annotated information in the existing
monolingual high-quality treebanks can be uti-
lized.
3. It is open to the public.
To construct parallel corpora that satisfy
these conditions, each sentence in the existing
monolingual high-quality treebanks has been
translated into a corresponding natural sentence
reflecting its contextual information in a target
language by skilled translators, and each par-
allel translation has been annotated with mor-
phological and syntactic structures and phrasal
alignment.
This paper also described the possible ap-
plications of the parallel corpus and proposed
a similar parallel translation generation frame-
work. In this framework, a parallel translation
whose source language sentence is similar to a
given sentence can be semi-automatically gen-
erated. In this paper we demonstrated that
the framework could be achieved by using our
aligned parallel treebank corpus.
In the near future, the aligned parallel tree-
bank corpora will be open to the public, and
expanded. We are planning to use the corpora
actively for machine translation, as a transla-
tion aid, and for second language learning. We
are also planning to develop automatic or semi-
automatic alignment system and an eﬃcient in-
terface for machine translation aid.
Input sentence
(Japanese only)
�	
�q�
�qwq�x	
��p���z	��	��w	�q�A
R��	�t�_��{(Prime Minister
Murayama and Finance Minister Takemura met in the presidential oﬃce and they exchanged their
opinions, mainly on the issue of the new faction being formed by the New Democratic Union.)
Generated paral-
lel translation
(J)���	
�q�
�xq��	
��p�Vz	��	��w	�XA
R��	�t�_��`h{
(E) Finance Minister Takemura held the meeting at the oﬃcial residence to exchange views about the
formation of the new party of the New Democratic Union.
Input sentence
(Japanese only)
^VUZx�	�qt�Zz
Sf�q������qb��^;��Xt
��b�\q��`h{(New
Party Sakigake proposed that towards the ordinary session, both parties found a council to discuss policy
and Diet management.)
Generated paral-
lel translation
(J)^VUZx��r�wVb��qt�Zz
Sf�q������tz�`�O;�fwt
��b�
\q��`h
(E) Sakigake proposed to set up an organization between them towards the UN Committee on the Rights
of the Child to discuss under the theme of policies and the management of the Diet.
Input sentence
(Japanese only)
q�x	���t��	�XU	�
Xqw��
��p
�pM�\q�Z�
Mb����Klh{(The meeting
was also intended to slow the movement towards the new party by the New Democratic Union, which is
trying to deepen the relationship with the New Frontier Party.)
Generated paral-
lel translation
(J)q�x	���t��	�XU	�
Xqw��
��t
�pM�\q�Z�
M`h��UKlh{
(E) The meeting had meanings to restrict the movement that the new party of New Democratic Union
is progressing to strengthen the coalition with The New Frontier Party.
Input sentence
(Japanese only)
	�
Xw
�za�	:�^�x	G��z�V���qw	�q�A
Rwh�z	G��t�XtmX��	Zb�
\q�>�h{(Lower House Diet Member Tatsuo Kawabata of the New Frontier Party decided on the
16th that he would hand in notification of his secession to the party on the 17th, in order to form a new
faction with Sadao Yamahana’s group.)
Generated paral-
lel translation
(J)	�
Xw
�za�	:�^�xz	G��z1�b��q	�q�A
Rwh�z	G��t	�
\XtmX�x	Z
b�\q�>�h{
(E) On 16th Tatsuo Kawabata, a member of the House of Representatives of the New Frontier Party
decided to submit The notice to leave the party to the Shinsei Party on the 17th in order to establish a
new faction with Yuukichi Amano and others.
Input sentence
(Japanese only)
��q��x�	~���qw�g�oT�>b�{(As for the faction name in the Upper House,
they will decide after they consider how to form a relationship with Democratic Reform Union.)
Generated paral-
lel translation
(J)q��x��qw��`�lo>b�{
(E) The name of the faction will be decided after discussing the relationship with the JTUC.
Figure 4: Example of generated similar parallel translations.
Acknowledgments
We thank the Mainichi Newspapers for permis-
sion to use their data.

References
ATR. 1992. Dialogue Database. http://www.red.atr.co.jp/
database page/taiwa.html.
S. Bangalore and O. Rambow. 2000. Exploiting a Probabilis-
tic Hierarchical Model for Generation. In Proceedings of
the COLING, pages 42–48.
E. Charniak. 1999. A Maximum-Entropy-Inspired Parser.
Technical Report CS-99-12.
U. Germann, M. Jahr, K. Knight, D. Marcu, and K. Yamada
2001. Fast Decoding and Optimal Decoding for Machine
Translation. In Proceedings of the ACL-EACL, pages 228–
235.
K. Imamura. 2002. Application of translation knowledge ac-
quired by hierarchical phrase alignment for pattern-based
MT. In Proceedings of the TMI, pages 74–84.
H. Isahara and M. Haruno. 2000. Japanese-English aligned
bilingual corpora. In Jean Veronis, editor, Parallel Text
Processing - Alignment and Use of Translation Corpora,
pages 313–334. Kluwer Academic Publishers.
S. Kurohashi and M. Nagao. 1997. Building a Japanese
Parsed Corpus while Improving the Parsing System. In
Proceedings of the NLPRS, pages 451–456.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993.
Building a Large Annotated Corpus of English: The Penn
Treebank. Computational Linguistics, 19(2):313–330.
M. Nagao. 1981. A Framework of a Mechanical Translation
between Japanese and English by Analogy Principle. In
Proceedings of the International NATO Symposium on Ar-
tificial and Human Intelligence.
F. J. Och and H. Ney. 2000. Improved Statistical Alignment
Models. In Proceedings of the ACL, pages 440–447.
K. Uchimoto, S. Sekine, and H. Isahara. 2002. Text Gen-
eration from Keywords. In Proceedings of the COLING,
pages 1037–1043.
H. Watanabe, S. Kurohashi, and E. Aramaki. 2003. Finding
Translation Patterns from Paired Source and Target De-
pendency Structures. In Michael Carl and Andy Way, ed-
itors, Recent Advances in Example-Based Machine Trans-
lation, pages 397–420. Kluwer Academic Publishers.
K. Yamada and K. Knight. 2001. A Syntax-based Statistical
Translation Model. In Proceedings of the ACL, pages 523–
530.
K. Yamamoto and Y. Matsumoto. 2003. Extracting Transla-
tion Knowledge from Parallel Corpora. In Michael Carl
and Andy Way, editors, Recent Advances in Example-
Based Machine Translation, pages 365–395. Kluwer Aca-
demic Publishers.
Q. Zhou and H. Duan. 1994. Segmentation and POS Tag-
ging in the Construction of Contemporary Chinese Cor-
pus. Journal of Computer Science of China, Vol.85. (in
Chinese)
