Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 874–881,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Word Alignment for Languages with Scarce Resources 
Using Bilingual Corpora of Other Language Pairs 
 
Haifeng Wang      Hua Wu      Zhanyi Liu 
Toshiba (China) Research and Development Center 
5/F., Tower W2, Oriental Plaza, No.1, East Chang An Ave., Dong Cheng District 
Beijing, 100738, China 
{wanghaifeng, wuhua, liuzhanyi}@rdc.toshiba.com.cn 
 
 
 
Abstract 
This paper proposes an approach to im-
prove word alignment for languages with 
scarce resources using bilingual corpora 
of other language pairs. To perform word 
alignment between languages L1 and L2, 
we introduce a third language L3. Al-
though only small amounts of bilingual 
data are available for the desired lan-
guage pair L1-L2, large-scale bilingual 
corpora in L1-L3 and L2-L3 are available. 
Based on these two additional corpora 
and with L3 as the pivot language, we 
build a word alignment model for L1 and 
L2. This approach can build a word 
alignment model for two languages even 
if no bilingual corpus is available in this 
language pair. In addition, we build an-
other word alignment model for L1 and 
L2 using the small L1-L2 bilingual cor-
pus. Then we interpolate the above two 
models to further improve word align-
ment between L1 and L2. Experimental 
results indicate a relative error rate reduc-
tion of 21.30% as compared with the 
method only using the small bilingual 
corpus in L1 and L2. 
1 Introduction 
Word alignment was first proposed as an inter-
mediate result of statistical machine translation 
(Brown et al., 1993). Many researchers build 
alignment links with bilingual corpora (Wu, 
1997; Och and Ney, 2003; Cherry and Lin, 2003; 
Zhang and Gildea, 2005). In order to achieve 
satisfactory results, all of these methods require a 
large-scale bilingual corpus for training. When 
the large-scale bilingual corpus is unavailable, 
some researchers acquired class-based alignment 
rules with existing dictionaries to improve word 
alignment (Ker and Chang, 1997). Wu et al. 
(2005) used a large-scale bilingual corpus in 
general domain to improve domain-specific word 
alignment when only a small-scale domain-
specific bilingual corpus is available. 
This paper proposes an approach to improve 
word alignment for languages with scarce re-
sources using bilingual corpora of other language 
pairs. To perform word alignment between lan-
guages L1 and L2, we introduce a third language 
L3 as the pivot language. Although only small 
amounts of bilingual data are available for the 
desired language pair L1-L2, large-scale bilin-
gual corpora in L1-L3 and L2-L3 are available. 
Using these two additional bilingual corpora, we 
train two word alignment models for language 
pairs L1-L3 and L2-L3, respectively. And then, 
with L3 as a pivot language, we can build a word 
alignment model for L1 and L2 based on the 
above two models. Here, we call this model an 
induced model. With this induced model, we per-
form word alignment between languages L1 and 
L2 even if no parallel corpus is available for this 
language pair. In addition, using the small bilin-
gual corpus in L1 and L2, we train another word 
alignment model for this language pair. Here, we 
call this model an original model. An interpo-
lated model can be built by interpolating the in-
duced model and the original model. 
As a case study, this paper uses English as the 
pivot language to improve word alignment be-
tween Chinese and Japanese. Experimental re-
sults show that the induced model performs bet-
ter than the original model trained on the small 
Chinese-Japanese corpus. And the interpolated 
model further improves the word alignment re-
sults, achieving a relative error rate reduction of 
874
21.30% as compared with results produced by 
the original model. 
The remainder of this paper is organized as 
follows. Section 2 discusses the related work. 
Section 3 introduces the statistical word align-
ment models. Section 4 describes the parameter 
estimation method using bilingual corpora of 
other language pairs. Section 5 presents the in-
terpolation model. Section 6 reports the experi-
mental results. Finally, we conclude and present 
the future work in section 7. 
2 Related Work 
A shared task on word alignment was organized 
as part of the ACL 2005 Workshop on Building 
and Using Parallel Texts (Martin et al., 2005). 
The focus of the task was on languages with 
scarce resources. Two different subtasks were 
defined: Limited resources and Unlimited re-
sources. The former subtask only allows partici-
pating systems to use the resources provided. 
The latter subtask allows participating systems to 
use any resources in addition to those provided. 
For the subtask of unlimited resources, As-
wani and Gaizauskas (2005) used a multi-feature 
approach for many-to-many word alignment on 
English-Hindi parallel corpora. This approach 
performed local word grouping on Hindi sen-
tences and used other methods such as dictionary 
lookup, transliteration similarity, expected Eng-
lish words, and nearest aligned neighbors. Martin 
et al. (2005) reported that this method resulted in 
absolute improvements of up to 20% as com-
pared with the case of only using limited re-
sources. Tufis et al. (2005) combined two word 
aligners: one is based on the limited resources 
and the other is based on the unlimited resources.  
The unlimited resource consists of a translation 
dictionary extracted from the alignment of Ro-
manian and English WordNet. Lopez and Resnik 
(2005) extended the HMM model by integrating 
a tree distortion model based on a dependency 
parser built on the English side of the parallel 
corpus. The latter two methods produced compa-
rable results with those methods using limited 
resources. All the above three methods use some 
language dependent resources such as dictionary, 
thesaurus, and dependency parser. And some 
methods, such as transliteration similarity, can 
only be used for very similar language pairs. 
In this paper, besides the limited resources for 
the given language pair, we make use of large 
amounts of resources available for other lan-
guage pairs to address the alignment problem for 
languages with scarce resources. Our method 
does not need language-dependent resources or 
deep linguistic processing. Thus, it is easy to 
adapt to any language pair where a pivot lan-
guage and corresponding large-scale bilingual 
corpora are available. 
3 Statistical Word Alignment 
According to the IBM models (Brown et al., 
1993), the statistical word alignment model can 
be generally represented as in equation (1).  
∑
=
a'
c|f,a'
c|fa,
c|fa,
)Pr(
)Pr(
)Pr(  
(1)
Where,  and  represent the source sentence 
and the target sentence, respectively
c f
1
. 
In this paper, we use a simplified IBM model 
4 (Al-Onaizan et al., 1999), which is shown in 
equation (2). This version does not take into ac-
count word classes in Brown et al. (1993). 
))))(()](([            
))()](([(           
)|( )|(             
   
         
)Pr(
0,1
1
0,1
11
11
1
2
0
0
0
00
∏
∏
∏∏
≠=
>
≠=
−
==
−
−⋅≠
+−⋅=
⋅⋅
⋅
⎟
⎟
⎠
⎞
⎜
⎜
⎝
⎛ −
=
m
aj
j
m
aj
ij
m
j
aj
l
i
ii
m
j
j
j
jpjdahj
jdahj
cftcn
pp
m
⊙
φ
φ
φ
φφ
c|fa,
 
(2)
ml,  are the lengths of the source sentence and 
the target sentence respectively. 
j  is the position index of the target word. 
j
a  is the position of the source word aligned to 
the j
th
 target word. 
i
φ  is the fertility of . 
i
c
0
p ,  are the fertility probabilities for , 
and 
1
p
0
c
1
10
=+pp . 
)|
j
aj
ct(f  is the word translation probability. 
)|(
ii
cn φ  is the fertility probability. 
)(
11 −
−
i
jd⊙ is the distortion probability for the 
head word of the cept. 
))((
1
jpjd −
>
 is the distortion probability for 
the non-head words of the cept. 
                                                 
1
 This paper uses c and f to represent a Chinese sentence 
and a Japanese sentence, respectively. And e represents an 
English sentence. 
875
}:{min)(
k
k
aikih == is the head of cept i. 
}:{max)(
kj
jk
aakjp ==
<
. 
i
⊙  is the center of cept i. 
During the training process, IBM model 3 is 
first trained, and then the parameters in model 3 
are employed to train model 4. For convenience, 
we describe model 3 in equation (3). The main 
difference between model 3 and model 4 lies in 
the calculation of distortion probability. 
∏∏
∏∏
≠==
==
−
⋅
⋅⋅
⋅
⎟
⎟
⎠
⎞
⎜
⎜
⎝
⎛ −
=
m
aj
j
m
j
aj
l
i
i
l
i
ii
m
j
j
mlajdcft
cn
pp
m
0,11
11
1
2
0
0
0
),,|()|(                   
!  )|(                   
   
)Pr(
00
φφ
φ
φ
φφ
c|fa,
 
(3)
4 Parameter Estimation Using Bilingual 
Corpora of Other Language Pairs 
As shown in section 3, the word alignment 
model mainly has three kinds of parameters that 
must be specified, including the translation prob-
ability, the fertility probability, and the distortion 
probability. The parameters are usually estimated 
by using bilingual sentence pairs in the desired 
languages, namely Chinese and Japanese here. In 
this section, we describe how to estimate the pa-
rameters without using the Chinese-Japanese 
bilingual corpus. We introduce English as the 
pivot language, and use the Chinese-English and 
English-Japanese bilingual corpora to estimate 
the parameters of the Chinese-Japanese word 
alignment model. With these two corpora, we 
first build Chinese-English and English-Japanese 
word alignment models as described in section 3. 
Then, based on these two models, we estimate 
the parameters of Chinese-Japanese word align-
ment model. The estimated model is named in-
duced model. 
The following subsections describe the 
method to estimate the parameters of Chinese-
Japanese alignment model. For reversed Japa-
nese-Chinese word alignment, the parameters 
can be estimated with the same method. 
4.1  Translation Probability 
Basic Translation Probability  
We use the translation probabilities trained 
with Chinese-English and English-Japanese cor-
pora to estimate the Chinese-Japanese probabil-
ity as shown in equation (4). In (4), we assume 
that the translation probability  is 
independent of the Chinese word . 
),|(
EJ ikj
ceft
i
c
)|()|(     
)|(),|(      
)|(
CEEJ
CEEJ
CJ
ik
e
kj
ik
e
ikj
ij
ceteft
cetceft
cft
k
k
∑
∑
⋅=
⋅=  
(4)
Where  is the translation probability 
for Chinese-Japanese word alignment. 
is the translation probability trained 
using the English-Japanese corpus.  is 
the translation probability trained using the Chi-
nese-English corpus. 
)|(
CJ ij
cft
)|(
EJ kj
eft
)|(
CE ik
cet
Cross-Language Word Similarity 
In any language, there are ambiguous words 
with more than one sense. Thus, some noise may 
be introduced by the ambiguous English word 
when we estimate the Chinese-Japanese transla-
tion probability using English as the pivot lan-
guage. For example, the English word "bank" has 
at least two senses, namely: 
bank1 - a financial organization 
bank2 - the border of a river 
Let us consider the Chinese word: 
河岸 - bank2 (the border of a river) 
And the Japanese word: 
銀行 - bank1 (a financial organization) 
In the Chinese-English corpus, there is high 
probability that the Chinese word "河岸 (bank2)"  
would be translated into the English word "bank". 
And in the English-Japanese corpus, there is also 
high probability that the English word "bank" 
would be translated into the Japanese word "銀
行( bank1)". 
As a result, when we estimate the translation 
probability using equation (4), the translation 
probability of " 銀行(bank1)" given " 河岸
(bank2)" is high. Such a result is not what we 
expect. 
In order to alleviate this problem, we intro-
duce cross-language word similarity to improve 
translation probability estimation in equation (4). 
The cross-language word similarity describes 
how likely a Chinese word is to be translated into 
a Japanese word with an English word as the 
pivot. We make use of both the Chinese-English 
corpus and the English-Japanese corpus to calcu-
late the cross language word similarity between a 
Chinese word c and a Japanese word f given an 
876
Input: An English word e , a Chinese word , and a Japanese word ; c f
The Chinese-English corpus; The English-Japanese corpus. 
(1) Construct Set 1: identify those Chinese-English sentence pairs that include the given Chinese 
word  and English word , and put the English sentences in the pairs into Set 1. c e
(2) Construct Set 2: identify those English-Japanese sentence pairs that include the given English 
word  and Japanese word , and put the English sentences in the pairs into Set 2. e f
(3) Construct the feature vectors  and  of the given English word using all other words as 
context in Set 1 and Set 2, respectively. 
CE
V
EJ
V
>=< ),(, ... ),,(),,(
1122111CE nn
ctectecteV  
>=< ),(, ... ),,(),,(
2222211EJ nn
ctectecteV  
Where  is the frequency of the context word . 
ij
ct
j
e 0=
ij
ct  if  does not occur in Set i . 
j
e
(4) Given the English word e , calculate the cross-language word similarity between the Chinese 
word  and the Japanese word  as in equation (5) c f
∑∑
∑
⋅
⋅
==
j
j
j
j
j
jj
ctct
ctct
VVefcsim
2
2
2
1
21
EJCE
)()(
),cos();,(                                     (5) 
Output: The cross language word similarity  of the Chinese word c and the Japanese 
word given the English word  
);,( efcsim
f e
Figure 1. Similarity Calculation 
English word e. For the ambiguous English word 
e, both the Chinese word c and the Japanese 
word f can be translated into e. The sense of an 
instance of the ambiguous English word e can be 
determined by the context in which the instance 
appears. Thus, the cross-language word similar-
ity between the Chinese word c and the Japanese 
word f can be calculated according to the con-
texts of their English translation e. We use the 
feature vector constructed using the context 
words in the English sentence to represent the 
context. So we can calculate the cross-language 
word similarity using the feature vectors. The 
detailed algorithm is shown in figure 1. This idea 
is similar to translation lexicon extraction via a 
bridge language (Schafer and Yarowsky, 2002). 
For example, the Chinese word "河岸" and its 
English translation "bank" (the border of a river) 
appears in the following Chinese-English sen-
tence pair: 
(a) 他们沿着河岸走回家。  
(b) They walked home along the river bank. 
The Japanese word "銀行 " and its English 
translation "bank" (a financial organization) ap-
pears in the following English-Japanese sentence 
pair: 
(c) He has plenty of money in the bank. 
(d) 彼は銀行預金が相当ある。  
The context words of the English word "bank" in 
sentences (b) and (c) are quite different. The dif-
ference indicates the cross language word simi-
larity of the Chinese word "河岸" and the Japa-
nese word "銀行" is low. So they tend to have 
different senses. 
Translation Probability Embedded with Cross 
Language Word Similarity 
Based on the cross language word similarity 
calculation in equation (5), we re-estimate the 
translation probability as shown in (6). Then we 
normalize it in equation (7). 
The word similarity of the Chinese word "河
岸 (bank2)" and the Japanese word " 銀行
(bank1)" given the word English word "bank" is 
low. Thus, using the updated estimation method, 
the translation probability of " 銀行(bank1)"  
given "河岸(bank2)" becomes low. 
));,()|()|((
)|('
CEEJ
CJ
kjiik
e
kj
ij
efcsimceteft
cft
k
⋅⋅=
∑
 
(6)
∑
=
'
CJ
CJ
CJ
)|'('
)|('
)|(
f
i
ij
ij
cft
cft
cft  
(7)
4.2  Fertility Probability 
The induced fertility probability is calculated as 
shown in (8). Here, we assume that the probabil-
877
ity ),|(
EJ iki
cen φ  is independent of the Chinese 
word . 
i
c
)|()|(
)|(),|(
)|(
CEEJ
CEEJ
CJ
ik
e
ki
ik
e
iki
ii
ceten
cetcen
cn
k
k
⋅=
⋅=
∑
∑
φ
φ
φ
 
(8)
Where, )|(
CJ ii
cn φ  is the fertility probability for 
the Chinese-Japanese alignment. )|(
EJ ki
en φ  is 
the trained fertility probability for the English-
Japanese alignment. 
4.3  Distortion Probability in Model 3 
With the English language as a pivot language, 
we calculate the distortion probability of model 3. 
For this probability, we introduce two additional 
parameters: one is the position of English word 
and the other is the length of English sentence. 
The distortion probability is estimated as shown 
in (9). 
)),,|Pr(),,,|Pr(         
),,,,|(Pr(
),,|,Pr(),,,,|Pr(
),,|,,Pr(
),,|(
,
,
,
CJ
mlinmlink
mlinkj
mlinkmlinkj
mlinkj
mlijd
nk
nk
nk
⋅
⋅=
⋅=
=
∑
∑
∑
 
(9)
Where, is the estimated distortion 
probability.  is the introduced position of an 
English word. n  is the introduced length of an 
English sentence.  
).,|(
CJ
mlijd
k
In the above equation, we assume that the po-
sition probability  is independent 
of the position of the Chinese word and the 
length of the Chinese sentence. And we assume 
that the position probability  is in-
dependent of the length of Japanese sentence. 
Thus, we rewrite these two probabilities as fol-
lows. 
),,,,|Pr( mlinkj
),,,|Pr( mlink
),,|(),,|Pr(),,,,|Pr(
EJ
mnkjdmnkjmlinkj =≈  
),,|(),,|Pr(),,,|Pr(
CE
nlikdnliknmlik =≈  
For the length probability, the English sen-
tence length n  is independent of the word posi-
tions i . And we assume that it is uniformly dis-
tributed. Thus, we take it as a constant, and re-
write it as follows.  
constant),|Pr(),,|Pr( == mlnmlin  
According to the above three assumptions, we 
ignore the length probability . Equa-
tion (9) is rewritten in (10).  
),|Pr( mln
∑
⋅=
nk
nlikdmnkjd
mlijd
,
CEEJ
CJ
),,|(),,|(
).,|(
 
(10)
4.4  Distortion Probability in Model 4 
In model 4, there are two parameters for the dis-
tortion probability: one for head words and the 
other for non-head words.  
Distortion Probability for Head Words 
The distortion probability for head 
words represents the relative position of the head 
word of the i
)(
11 −
−
i
jd⊙
th
 cept and the center of the (i-1)
th
 
cept. Let 
1−
−=Δ
i
jj⊙, then  is independent of 
the absolute position. Thus, we estimate the dis-
tortion probability by introducing another rela-
tive position 
jΔ
'jΔ of English words, which is 
shown in (11).    
∑
Δ
−
ΔΔ⋅Δ=
−=Δ
'
EJCE,1
1CJ,1
)'|(Pr)'(
)(
j
i
jjjd
jjd ⊙
 
(11)
Where, )(
1CJ1, −
−=Δ
i
jjd⊙is the estimated dis-
tortion probability for head words in Chinese-
Japanese alignment. is the distortion 
probability for head word in Chinese-English 
alignment. 
)'(
CE1,
jd Δ
)'|(Pr
EJ
jj ΔΔ  is the translation prob-
ability of relative Japanese position given rela-
tive English position.  
In order to simplify , we introduce 
and  and let 
)'|(Pr
EJ
jj ΔΔ
'j
1'−i
⊙
1'
''
−
−=Δ
i
jj⊙, where  and 
 are positions of English words. We rewrite 
'j
1'−i
⊙
)'|(Pr
EJ
jj ΔΔ  in (12).   
∑
Δ=−
Δ=−
−−
−−
−−
−−
=
−−=
ΔΔ
'':,'
:,
1'1EJ
1'1EJ
EJ
1'1'
11
),'|,(Pr
)'|(Pr
)'|(Pr
jjj
jjj
ii
ii
ii
ii
jj
jj
jj
⊙⊙
⊙⊙
⊙⊙
⊙⊙ 
(12)
The English word in position  is aligned to 
the Japanese word in position , and the English 
word in position  is aligned to the Japanese 
word in position . 
'j
j
1'−i
⊙
1−i
⊙
We assume that  and  are independent, 
 only depends on , and  only depends 
on . Then  can be esti-
mated as shown in (13). 
j
1−i
⊙
j 'j
1−i
⊙
1'−i
⊙),'|,(Pr
1'1EJ −− ii
jj⊙⊙
878
)|(Pr)'|(Pr
),'|,(Pr
1'1EJEJ
1'1EJ
−−
−−
⋅=
ii
ii
jj
jj
⊙⊙
⊙⊙
 
(13)
Both of the two parameters in (13) represent 
the position translation probabilities. Thus, we 
can estimate them from the distortion probability 
in model 3.  is estimated as shown in 
(14).  And  can be estimated in 
the same way. In (14), we also assume that the 
sentence length distribution  is inde-
pendent of the word position and that it is uni-
formly distributed. 
)'|(Pr
EJ
jj
)|(Pr
1'1EJ −− ii
⊙⊙
)'|,Pr( jml
∑
∑
∑
=
⋅=
=
ml
ml
ml
mljjd
jmlmljjd
jmljjj
,
EJ
,
EJ
,
EJEJ
),,'|(
)'|,Pr(),,'|(
)'|,,(Pr)'|(Pr
 
(14)
Distortion Probability for Non-Head Words 
The distortion probability de-
scribes the distribution of the relative position of 
non-head words. In the same way, we introduce 
relative position of English words, and model 
the probability in (15). 
))((
1
jpjd −
>
'jΔ
∑
Δ
>
>
ΔΔ⋅Δ=
−=Δ
'
EJCE,1
CJ,1
)'|(Pr)'(
))((
j
jjjd
jpjjd
 
(15)
))((
CJ1,
jpjjd −=Δ
>
is the estimated distortion 
probability for the non-head words in Chinese-
Japanese alignment.  is the distortion 
probability for non-head words in Chinese-
English alignment. 
)'(
CE1,
jd Δ
>
)'|(Pr
EJ
jj ΔΔ is the translation 
probability of the relative Japanese position 
given the relative English position.  
In fact,  has the same interpreta-
tion as in (12). Thus, we introduce two parame-
ters and  and let , where 
and  are positions of English words. The 
final distortion probability for non-head words 
can be estimated as shown in (16). 
)'|(Pr
EJ
jj ΔΔ
'j )'( jp )'('' jpjj −=Δ
'j )'( jp
)))'(|)((Pr)'|(Pr
)'(())((
')'(':)'(,'
)(:)(,
EJEJ
'
CE1,CJ1,
∑
∑
Δ=−
Δ=−
Δ
>>
⋅
⋅Δ=−=Δ
jjpjjpj
jjpjjpj
j
jpjpjj
jdjpjjd
 
(16)
5 Interpolation Model 
With the Chinese-English and English-Japanese 
corpora, we can build the induced model for Chi-
nese-Japanese word alignment as described in 
section 4. If we have small amounts of Chinese-
Japanese corpora, we can build another word 
alignment model using the method described in 
section 3, which is called the original model here. 
In order to further improve the performance of 
Chinese-Japanese word alignment, we build an 
interpolated model by interpolating the induced 
model and the original model.  
Generally, we can interpolate the induced 
model and the original model as shown in equa-
tion (17). 
)(Pr)1( )(Pr
)Pr(
IO
c|fa,c|fa,
c|fa,
⋅−+⋅= λλ
 
(17)
Where is the original model trained 
from the Chinese-Japanese corpus, and 
 is the induced model trained from the 
Chinese-English and English-Japanese corpora. 
)(Pr
O
c|fa,
)(Pr
I
c|fa,
λ  is an interpolation weight. It can be a constant 
or a function of f  and . c
 In both model 3 and model 4, there are mainly 
three kinds of parameters: translation probability, 
fertility probability and distortion probability. 
These three kinds of parameters have their own 
interpretation in these two models. In order to 
obtain fine-grained interpolation models, we in-
terpolate the three kinds of parameters using dif-
ferent weights, which are obtained in the same 
way as described in Wu et al. (2005). 
t
λ  repre-
sents the weights for translation probability. 
n
λ  
represents the weights for fertility probability. 
d3
λ  and 
d4
λ  represent the weights for distortion 
probability in model 3 and in model 4, respec-
tively. 
d4
λ  is set as the interpolation weight for 
both the head words and the non-head words. 
The above four weights are obtained using a 
manually annotated held-out set. 
6 Experiments 
In this section, we compare different word 
alignment methods for Chinese-Japanese align-
ment. The "Original" method uses the original 
model trained with the small Chinese-Japanese 
corpus.  The "Basic Induced" method uses the 
induced model that employs the basic translation 
probability without introducing cross-language 
word similarity. The "Advanced Induced" 
method uses the induced model that introduces 
the cross-language word similarity into the calcu-
lation of the translation probability. The "Inter-
polated" method uses the interpolation of the 
word alignment models in the "Advanced In-
duced" and "Original" methods. 
879
6.1 Data 
There are three training corpora used in this pa-
per: Chinese-Japanese (CJ) corpus, Chinese-
English (CE) Corpus, and English-Japanese (EJ) 
Corpus. All of these tree corpora are from gen-
eral domain. The Chinese sentences and Japa-
nese sentences in the data are automatically seg-
mented into words. The statistics of these three 
corpora are shown in table 1. "# Source Words" 
and "# Target Words" mean the word number of 
the source and target sentences, respectively. 
Language 
Pairs 
#Sentence 
Pairs 
# Source 
Words 
# Target 
Words 
CJ 21,977 197,072 237,834 
CE 329,350 4,682,103 4,480,034
EJ 160,535 1,460,043 1,685,204
Table 1. Statistics for Training Data 
Besides the training data, we also have held-
out data and testing data. The held-out data in-
cludes 500 Chinese-Japanese sentence pairs, 
which is used to set the interpolated weights de-
scribed in section 5. We use another 1,000 Chi-
nese-Japanese sentence pairs as testing data, 
which is not included in the training data and the 
held-out data. The alignment links in the held-out 
data and the testing data are manually annotated. 
Testing data includes 4,926 alignment links
2
. 
6.2  Evaluation Metrics 
We use the same metrics as described in Wu et al. 
(2005), which is similar to those in (Och and Ney, 
2000). The difference lies in that Wu et al. (2005) 
took all alignment links as sure links. 
If we use  to represent the set of alignment 
links identified by the proposed methods and  
to denote the reference alignment set, the meth-
ods to calculate the precision, recall, f-measure, 
and alignment error rate (AER) are shown in 
equations (18), (19), (20), and (21), respectively. 
It can be seen that the higher the f-measure is, 
the lower the alignment error rate is. Thus, we 
will only show precision, recall and AER scores 
in the evaluation results. 
G
S
C
S
||
||
G
CG
S
SS
precision
∩
=      
(18)
||
 ||
C
CG
S
SS
recall
∩
=  
(19)
                                                 
2
 For a non one-to-one link, if m source words are aligned to 
n target words, we take it as one alignment link instead of 
m∗n alignment links. 
||||
||2
CG
CG
SS
SS
fmeasure
+
∩
=  
(20)
fmeasure
SS
SS
AER −=
+
∩
−= 1
||||
||2
1
CG
CG
 
(21)
6.3 Experimental Results 
We use the held-out data described in section 6.1 
to set the interpolation weights in section 5. 
t
λ  is 
set to 0.3, 
n
λ  is set to 0.1, 
d3
λ  for model 3  is set 
to 0.5, and 
d4
λ  for model 4 is set to 0.1. With 
these parameters, we get the lowest alignment 
error rate on the held-out data. 
For each method described above, we perform 
bi-directional (source to target and target to 
source) word alignment and obtain two align-
ment results. Based on the two results, we get a 
result using "refined" combination as described 
in (Och and Ney, 2000). Thus, all of the results 
reported here describe the results of the "refined" 
combination. For model training, we use the 
GIZA++ toolkit
3
. 
Method Precision Recall AER 
Interpolated 0.6955 0.5802 0.3673
Advanced 
Induced 
0.7382 0.4803 0.4181
Basic  
Induced 
0.6787 0.4602 0.4515
Original 0.6026 0.4783 0.4667
Table 2. Word Alignment Results 
The evaluation results on the testing data are 
shown in table 2.  From the results, it can be seen 
that both of the two induced models perform bet-
ter than the "Original" method that only uses the 
limited Chinese-Japanese sentence pairs. The 
"Advanced Induced" method achieves a relative 
error rate reduction of 10.41% as compared with 
the "Original" method. Thus, with the Chinese-
English corpus and the English-Japanese corpus, 
we can achieve a good word alignment results 
even if no Chinese-Japanese parallel corpus is 
available. After introducing the cross-language 
word similarity into the translation probability, 
the "Advanced Induced" method achieves a rela-
tive error rate reduction of 7.40% as compared 
with the "Basic Induced" method. It indicates 
that cross-language word similarity is effective in 
the calculation of the translation probability. 
Moreover, the "interpolated" method further im-
proves the result, which achieves relative error 
                                                 
3
 It is located at http://www.fjoch.com/ GIZA++.html. 
880
rate reductions of 12.51% and 21.30% as com-
pared with the "Advanced Induced" method and 
the "Original" method. 
7 Conclusion and Future Work 
This paper presented a word alignment approach 
for languages with scarce resources using bilin-
gual corpora of other language pairs. To perform 
word alignment between languages L1 and L2, 
we introduce a pivot language L3 and bilingual 
corpora in L1-L3 and L2-L3. Based on these two 
corpora and with the L3 as a pivot language, we 
proposed an approach to estimate the parameters 
of the statistical word alignment model. This ap-
proach can build a word alignment model for the 
desired language pair even if no bilingual corpus 
is available in this language pair. Experimental 
results indicate a relative error reduction of 
10.41% as compared with the method using the 
small bilingual corpus. 
In addition, we interpolated the above model 
with the model trained on the small L1-L2 bilin-
gual corpus to further improve word alignment 
between L1 and L2. This interpolated model fur-
ther improved the word alignment results by 
achieving a relative error rate reduction of 
12.51% as compared with the method using the 
two corpora in L1-L3 and L3-L2, and a relative 
error rate reduction of 21.30% as compared with 
the method using the small bilingual corpus in 
L1 and L2. 
In future work, we will perform more evalua-
tions. First, we will further investigate the effect 
of the size of corpora on the alignment results. 
Second, we will investigate different parameter 
combination of the induced model and the origi-
nal model. Third, we will also investigate how 
simpler IBM models 1 and 2 perform, in com-
parison with IBM models 3 and 4. Last, we will 
evaluate the word alignment results in a real ma-
chine translation system, to examine whether 
lower word alignment error rate will result in 
higher translation accuracy. 
References 
Yaser Al-Onaizan, Jan Curin, Michael Jahr, Kevin 
Knight, John Lafferty, Dan Melamed, Franz-Josef 
Och, David Purdy, Noah A. Smith, and David 
Yarowsky. 1999. Statistical Machine Translation 
Final Report. Johns Hopkins University Workshop. 
Niraj Aswani and Robert Gaizauskas. 2005. Aligning 
Words in English-Hindi Parallel Corpora. In Proc. 
of the ACL 2005 Workshop on Building and Using 
Parallel Texts: Data-driven Machine Translation 
and Beyond, pages 115-118.  
Peter F. Brown, Stephen A. Della Pietra, Vincent J. 
Della Pietra, and Robert L. Mercer. 1993. The 
Mathematics of Statistical Machine Translation: 
Parameter Estimation. Computational Linguistics, 
19(2): 263-311. 
Colin Cherry and Dekang Lin. 2003. A Probability 
Model to Improve Word Alignment. In Proc. of the 
41
st
  Annual Meeting of the Association for Compu-
tational Linguistics (ACL-2003), pages 88-95. 
Sue J. Ker and Jason S. Chang. 1997. A Class-based 
Approach to Word Alignment. Computational Lin-
guistics, 23(2): 313-343. 
Adam Lopez and Philip Resnik. 2005. Improved 
HMM Alignment Models for Languages with 
Scarce Resources. In Proc. of the ACL-2005 Work-
shop on Building and Using Parallel Texts: Data-
driven Machine Translation and Beyond, pages 83-
86. 
Joel Martin, Rada Mihalcea, and Ted Pedersen. 2005. 
Word Alignment for Languages with Scarce Re-
sources. In Proc. of the ACL-2005 Workshop on 
Building and Using Parallel Texts: Data-driven 
Machine Translation and Beyond, pages 65-74. 
Charles Schafer and David Yarowsky. 2002. Inducing 
Translation Lexicons via Diverse Similarity Meas-
ures and Bridge Languages. In Proc. of the 6
th
 
Conference on Natural Language Learning 2002 
(CoNLL-2002), pages 1-7. 
Dan Tufis, Radu Ion, Alexandru Ceausu, and Dan 
Stefanescu. 2005. Combined Word Alignments. In 
Proc. of the ACL-2005 Workshop on Building and 
Using Parallel Texts: Data-driven Machine Trans-
lation and Beyond, pages 107-110. 
Franz Josef Och and Hermann Ney. 2000. Improved 
Statistical Alignment Models. In Proc. of the 38
th
 
Annual Meeting of the Association for Computa-
tional Linguistics (ACL-2000), pages 440-447. 
Franz Josef Och and Hermann Ney. 2003. A System-
atic Comparison of Various Statistical Alignment 
Models. Computational Linguistics, 29(1):19-51. 
Dekai Wu. 1997. Stochastic Inversion Transduction 
Grammars and Bilingual Parsing of Parallel Cor-
pora. Computational Linguistics, 23(3):377-403. 
Hua Wu, Haifeng Wang, and Zhanyi Liu. 2005. 
Alignment Model Adaptation for Domain-Specific 
Word Alignment. In Proc. of the 43
rd
 Annual Meet-
ing of the Association for Computational Linguis-
tics (ACL-2005), pages 467-474. 
Hao Zhang and Daniel Gildea. 2005. Stochastic Lexi-
calized Inversion Transduction Grammar for 
Alignment. In Proc. of the 43
rd
 Annual Meeting of 
the Association for Computational Linguistics 
(ACL-2005), pages 475-482. 
881
