Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 913–920,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Boosting Statistical Word Alignment Using  
Labeled and Unlabeled Data 
 
Hua Wu      Haifeng Wang      Zhanyi Liu 
Toshiba (China) Research and Development Center 
5/F., Tower W2, Oriental Plaza, No.1, East Chang An Ave., Dong Cheng District 
Beijing, 100738, China 
{wuhua, wanghaifeng, liuzhanyi}@rdc.toshiba.com.cn 
 
 
 
Abstract 
This paper proposes a semi-supervised 
boosting approach to improve statistical 
word alignment with limited labeled data 
and large amounts of unlabeled data. The 
proposed approach modifies the super-
vised boosting algorithm to a semi-
supervised learning algorithm by incor-
porating the unlabeled data. In this algo-
rithm, we build a word aligner by using 
both the labeled data and the unlabeled 
data. Then we build a pseudo reference 
set for the unlabeled data, and calculate 
the error rate of each word aligner using 
only the labeled data. Based on this semi-
supervised boosting algorithm, we inves-
tigate two boosting methods for word 
alignment. In addition, we improve the 
word alignment results by combining the 
results of the two semi-supervised boost-
ing methods. Experimental results on 
word alignment indicate that semi-
supervised boosting achieves relative er-
ror reductions of 28.29% and 19.52% as 
compared with supervised boosting and 
unsupervised boosting, respectively. 
1 Introduction 
Word alignment was first proposed as an intermediate result of statistical machine translation (Brown et al., 1993). In recent years, many researchers have built alignment links with bilingual corpora (Wu, 1997; Och and Ney, 2003; Cherry and Lin, 2003; Wu et al., 2005; Zhang and Gildea, 2005). These methods train the alignment models on unlabeled data in an unsupervised manner.
A natural question is whether we can further improve the performance of word aligners with the available data and the available alignment models. One possible solution is to use the boosting method (Freund and Schapire, 1996), which is one of the ensemble methods (Dietterich, 2000). The underlying idea of boosting is to combine simple "rules" to form an ensemble such that the ensemble performs better than its single members. The AdaBoost (Adaptive Boosting) algorithm by Freund and Schapire (1996) was developed for supervised learning. Applying it to word alignment therefore requires a reference set for the unlabeled data. Wu and Wang (2005) developed an unsupervised AdaBoost algorithm by automatically building a pseudo reference set for the unlabeled data to improve alignment results.
In fact, large amounts of unlabeled data are available without difficulty, while labeled data is costly to obtain. However, labeled data is valuable for improving the performance of learners. Consequently, semi-supervised learning, which combines both labeled and unlabeled data, has been applied to some NLP tasks such as word sense disambiguation (Yarowsky, 1995; Pham et al., 2005), classification (Blum and Mitchell, 1998; Joachims, 1999), clustering (Basu et al., 2004), named entity classification (Collins and Singer, 1999), and parsing (Sarkar, 2001).
In this paper, we propose a semi-supervised boosting method to improve statistical word alignment with both limited labeled data and large amounts of unlabeled data. The proposed approach modifies the supervised AdaBoost algorithm to a semi-supervised learning algorithm by incorporating the unlabeled data. It therefore has to address the following three problems. The first is to build a word alignment model with both labeled and unlabeled data. With the labeled data, we build a supervised model by directly estimating the parameters in the model instead of using the Expectation Maximization (EM) algorithm in Brown et al. (1993). With the unlabeled data, we build an unsupervised model by estimating the parameters with the EM algorithm. Based on these two word alignment models, an interpolated model is built through linear interpolation. This interpolated model is used as a learner in the semi-supervised AdaBoost algorithm. The second is to build a reference set for the unlabeled data. It is automatically built with a modified "refined" combination method as described in Och and Ney (2000). The third is to calculate the error rate on each round. Although we build a reference set for the unlabeled data, it still contains alignment errors. Thus, we use the reference set of the labeled data instead of that of the entire training data to calculate the error rate on each round.
With the interpolated model as a learner in the 
semi-supervised AdaBoost algorithm, we inves-
tigate two boosting methods in this paper to im-
prove statistical word alignment. The first 
method uses the unlabeled data only in the inter-
polated model. During training, it only changes 
the distribution of the labeled data. The second 
method changes the distribution of both the la-
beled data and the unlabeled data during training. 
Experimental results show that both of these two 
methods improve the performance of statistical 
word alignment. 
In addition, we combine the final results of the above two semi-supervised boosting methods. Experimental results indicate that this combination outperforms the unsupervised boosting method described in Wu and Wang (2005), achieving a relative error rate reduction of 19.52%. It also achieves a relative error rate reduction of 28.29% as compared with the supervised boosting method that only uses the labeled data.
The remainder of this paper is organized as follows. Section 2 briefly introduces the statistical word alignment model. Section 3 describes the parameter estimation method using the labeled data. Section 4 presents our semi-supervised boosting method. Section 5 reports the experimental results. Finally, we conclude in section 6.
2 Statistical Word Alignment Model 
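According to the IBM models (Brown et al., 1993), the statistical word alignment model can be generally represented as in equation (1).
$$\Pr(\mathbf{a} \mid \mathbf{f}, \mathbf{e}) = \frac{\Pr(\mathbf{a}, \mathbf{f} \mid \mathbf{e})}{\sum_{\mathbf{a}'} \Pr(\mathbf{a}', \mathbf{f} \mid \mathbf{e})} \tag{1}$$
Where $\mathbf{e}$ and $\mathbf{f}$ represent the source sentence and the target sentence, respectively.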
In this paper, we use a simplified IBM model 
4 (Al-Onaizan et al., 1999), which is shown in 
equation (2). This simplified version does not 
take into account word classes as described in 
Brown et al. (1993). 
$$\Pr(\mathbf{a}, \mathbf{f} \mid \mathbf{e}) = \binom{m - \phi_0}{\phi_0} p_0^{m - 2\phi_0} p_1^{\phi_0} \cdot \prod_{i=1}^{l} n(\phi_i \mid e_i) \cdot \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \cdot \prod_{j=1, a_j \neq 0}^{m} \Big( \big([j = h(a_j)] \cdot d_1(j - c_{\rho_{a_j}})\big) + \big([j \neq h(a_j)] \cdot d_{>1}(j - p(j))\big) \Big) \tag{2}$$
$l, m$ are the lengths of the source sentence and the target sentence, respectively.
$j$ is the position index of the target word.
$a_j$ is the position of the source word aligned to the $j$-th target word.
$\phi_i$ is the number of target words that $e_i$ is aligned to.
$p_0$, $p_1$ are the fertility probabilities for $e_0$, and $p_0 + p_1 = 1$.
$t(f_j \mid e_{a_j})$ is the word translation probability.
$n(\phi_i \mid e_i)$ is the fertility probability.
$d_1(j - c_{\rho_{a_j}})$ is the distortion probability for the head word of cept$^1$ $i$.
$d_{>1}(j - p(j))$ is the distortion probability for the non-head words of cept $i$.
$h(i) = \min_k \{k : i = a_k\}$ is the head of cept $i$.
$p(j) = \max_{k < j} \{k : a_j = a_k\}$.
$\rho_i$ is the first word before $e_i$ with non-zero fertility.
$c_i$ is the center of cept $i$.
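The cept-related quantities defined above can be read off an alignment vector. The following is a minimal Python sketch, not code from the paper: the function name `cept_statistics` and the toy alignment are hypothetical, and the center of a cept is assumed to be the ceiling of the mean position of its target words, following the definition in Brown et al. (1993). The quantity $\rho_i$ is omitted for brevity.

```python
import math
from collections import defaultdict

def cept_statistics(a):
    """Derive cept quantities from an alignment vector.

    a[j - 1] is the source position a_j aligned to target position j,
    with 0 standing for the empty word e_0.
    """
    cepts = defaultdict(list)          # source position i -> its target positions
    for j, i in enumerate(a, start=1):
        cepts[i].append(j)

    # phi_i: number of target words aligned to source position i
    fertility = {i: len(js) for i, js in cepts.items()}
    # h(i): smallest target position in cept i
    head = {i: min(js) for i, js in cepts.items() if i != 0}
    # c_i: assumed to be the ceiling of the average target position in cept i
    center = {i: math.ceil(sum(js) / len(js))
              for i, js in cepts.items() if i != 0}
    # p(j): closest earlier target position in the same cept as j
    previous = {}
    for i, js in cepts.items():
        if i == 0:
            continue
        for earlier, later in zip(js, js[1:]):
            previous[later] = earlier
    return fertility, head, center, previous

# Toy alignment: target words 1 and 5 align to source word 1,
# target words 2 and 3 to source word 2, target word 4 to the empty word.
fertility, head, center, previous = cept_statistics([1, 2, 2, 0, 1])
```

Source positions that never occur in the alignment vector have fertility 0 and simply do not appear in the returned dictionaries.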
3 Parameter Estimation with Labeled Data
With the labeled data, instead of using the EM algorithm, we directly estimate the three main parameters in model 4: translation probability, fertility probability, and distortion probability.

1 A cept is defined as the set of target words connected to a source word (Brown et al., 1993).

3.1 Translation Probability
The translation probability is estimated from the labeled data as described in (3).
$$t(f_j \mid e_i) = \frac{count(e_i, f_j)}{\sum_{f'} count(e_i, f')} \tag{3}$$
Where $count(e_i, f_j)$ is the occurring frequency of $e_i$ aligned to $f_j$ in the labeled data.

3.2 Fertility Probability
The fertility probability $n(\phi_i \mid e_i)$ describes the distribution of the numbers of words that $e_i$ is aligned to. It is estimated as described in (4).
$$n(\phi_i \mid e_i) = \frac{count(\phi_i, e_i)}{\sum_{\phi'} count(\phi', e_i)} \tag{4}$$
Where $count(\phi_i, e_i)$ describes the occurring frequency of word $e_i$ aligned to $\phi_i$ target words in the labeled data.
$p_0$ and $p_1$ describe the fertility probabilities for $e_0$, and $p_0$ and $p_1$ sum to 1. We estimate $p_0$ directly from the labeled data, which is shown in (5).
$$p_0 = \frac{\#\text{Aligned} - \#\text{Null}}{\#\text{Aligned}} \tag{5}$$
Where $\#\text{Aligned}$ is the occurring frequency of the target words that have counterparts in the source language. $\#\text{Null}$ is the occurring frequency of the target words that have no counterparts in the source language.

3.3 Distortion Probability
There are two kinds of distortion probability in model 4: one for head words and the other for non-head words. Both of the distortion probabilities describe the distribution of relative positions. Thus, if we let $\Delta j_1 = j - c_{\rho_i}$ and $\Delta j_{>1} = j - p(j)$, the distortion probabilities for head words and non-head words are estimated in (6) and (7) with the labeled data, respectively.
$$d_1(\Delta j_1) = \frac{\sum_{j, c_{\rho_i}} \delta(\Delta j_1, j - c_{\rho_i})}{\sum_{\Delta j'_1} \sum_{j', c'_{\rho_i}} \delta(\Delta j'_1, j' - c'_{\rho_i})} \tag{6}$$
$$d_{>1}(\Delta j_{>1}) = \frac{\sum_{j, p(j)} \delta(\Delta j_{>1}, j - p(j))}{\sum_{\Delta j'_{>1}} \sum_{j', p(j')} \delta(\Delta j'_{>1}, j' - p(j'))} \tag{7}$$
Where $\delta(x, y) = 1$ if $x = y$. Otherwise, $\delta(x, y) = 0$.

4 Boosting with Labeled Data and Unlabeled Data
In this section, we first propose a semi-supervised AdaBoost algorithm for word alignment, which uses both the labeled data and the unlabeled data. Based on the semi-supervised algorithm, we describe two boosting methods for word alignment. We then develop a method to combine the results of the two boosting methods.

4.1 Semi-Supervised AdaBoost Algorithm for Word Alignment
Figure 1 shows the semi-supervised AdaBoost algorithm for word alignment using labeled and unlabeled data. It differs from the supervised AdaBoost algorithm in five main respects.

Input: A training set $S_T$ including $m$ bilingual sentence pairs;
  The reference set $R_T$ for the training data;
  The reference sets $R_L$ and $R_U$ ($R_L, R_U \subseteq R_T$) for the labeled data $S_L$ and the unlabeled data $S_U$ respectively, where $S_T = S_U \cup S_L$ and $S_U \cap S_L = \text{NULL}$;
  A loop count $L$.
(1) Initialize the weights: $w_1(i) = 1/m, \; i = 1, ..., m$
(2) For $l = 1$ to $L$, execute steps (3) to (9).
(3) For each sentence pair $i$, normalize the weights on the training set: $p_l(i) = w_l(i) / \sum_j w_l(j), \; i = 1, ..., m$
(4) Update the word alignment model $M_l$ based on the weighted training data.
(5) Perform word alignment on the training set with the alignment model $M_l$: $h_l = M_l(p_l)$
(6) Calculate the error of $h_l$ with the reference set $R_L$: $\varepsilon_l = \sum_i p_l(i) \cdot \alpha(i)$
  Where $\alpha(i)$ is calculated as in equation (9).
(7) If $\varepsilon_l > 1/2$, then let $L = l - 1$, and end the training process.
(8) Let $\beta_l = \varepsilon_l / (1 - \varepsilon_l)$.
(9) For all $i$, compute new weights: $w_{l+1}(i) = w_l(i) \cdot (k + (n - k) \cdot \beta_l) / n$
  where $n$ represents $n$ alignment links in the $i$-th sentence pair. $k$ represents the number of error links as compared with $R_T$.
Output: The final word alignment result for a source word $e$:
$$h_F(e) = \arg\max_f RS(e, f) = \arg\max_f \sum_{l=1}^{L} \left(\log \frac{1}{\beta_l}\right) \cdot WT_l(e, f) \cdot \delta(h_l(e), f)$$
Where $\delta(x, y) = 1$ if $x = y$. Otherwise, $\delta(x, y) = 0$. $WT_l(e, f)$ is the weight of the alignment link $(e, f)$ produced by the model $M_l$, which is calculated as described in equation (10).
Figure 1. The Semi-Supervised AdaBoost Algorithm for Word Alignment

Word Alignment Model
The first is the word alignment model, which is taken as a learner in the boosting algorithm. The word alignment model is built using both the labeled data and the unlabeled data. With the labeled data, we train a supervised model by directly estimating the parameters in the IBM model as described in section 3. With the unlabeled data, we train an unsupervised model using the same EM algorithm as in Brown et al. (1993). Then we build an interpolated model by linearly interpolating these two word alignment models, which is shown in (8). This interpolated model is used as the model $M_l$ described in figure 1.
$$\Pr(\mathbf{a}, \mathbf{f} \mid \mathbf{e}) = \lambda \cdot \Pr_S(\mathbf{a}, \mathbf{f} \mid \mathbf{e}) + (1 - \lambda) \cdot \Pr_U(\mathbf{a}, \mathbf{f} \mid \mathbf{e}) \tag{8}$$
Where $\Pr_S(\mathbf{a}, \mathbf{f} \mid \mathbf{e})$ and $\Pr_U(\mathbf{a}, \mathbf{f} \mid \mathbf{e})$ are the trained supervised model and unsupervised model, respectively. $\lambda$ is an interpolation weight. We train the weight in equation (8) in the same way as described in Wu et al. (2005).

Pseudo Reference Set for Unlabeled Data
The second is the reference set for the unlabeled data. For the unlabeled data, we automatically build a pseudo reference set. In order to build a reliable pseudo reference set, we perform bi-directional word alignment on the training data using the interpolated model trained on the first round. Bi-directional word alignment includes alignment in two directions (source to target and target to source) as described in Och and Ney (2000). Thus, we get two sets of alignment results $A_1$ and $A_2$ on the unlabeled data. Based on these two sets, we use a modified "refined" method (Och and Ney, 2000) to construct a pseudo reference set $R_U$.
(1) The intersection $I = A_1 \cap A_2$ is added to the reference set $R_U$.
(2) We add $(e, f) \in A_1 \cup A_2$ to $R_U$ if a) is satisfied or both b) and c) are satisfied.
a) Neither $e$ nor $f$ has an alignment in $R_U$ and $p(f \mid e)$ is greater than a threshold $\delta_1$.
$$p(f \mid e) = \frac{count(e, f)}{\sum_{f'} count(e, f')}$$
Where $count(e, f)$ is the occurring frequency of the alignment link $(e, f)$ in the bi-directional word alignment results.
b) $(e, f)$ has a horizontal or a vertical neighbor that is already in $R_U$.
c) The set $R_U \cup (e, f)$ does not contain alignments with both horizontal and vertical neighbors.
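The construction of the pseudo reference set can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: links are represented as hypothetical `(source_position, target_position)` pairs within one sentence pair, "horizontal" and "vertical" neighbors are taken to be links that differ by one position on the target or source side respectively, candidates are visited in sorted order until nothing more can be added, and the link probability $p(f \mid e)$ is passed in as a precomputed dictionary `prob` keyed by link.

```python
def has_both_neighbors(link, links):
    """True if `link` has a horizontal and a vertical neighbor in `links`."""
    i, j = link
    horizontal = (i, j - 1) in links or (i, j + 1) in links
    vertical = (i - 1, j) in links or (i + 1, j) in links
    return horizontal and vertical

def build_pseudo_reference(a1, a2, prob, threshold):
    """Modified 'refined' combination of two directional alignments."""
    ref = set(a1) & set(a2)                       # step (1): intersection
    candidates = (set(a1) | set(a2)) - ref
    changed = True
    while changed:                                # repeat until no link is added
        changed = False
        for link in sorted(candidates - ref):
            i, j = link
            source_free = all(i != x for x, _ in ref)
            target_free = all(j != y for _, y in ref)
            # condition a): both words unaligned and link probability high enough
            cond_a = source_free and target_free and prob.get(link, 0.0) > threshold
            # condition b): a horizontal or vertical neighbor is already in ref
            cond_b = any(n in ref for n in
                         ((i, j - 1), (i, j + 1), (i - 1, j), (i + 1, j)))
            # condition c): no link ends up with both kinds of neighbors
            trial = ref | {link}
            cond_c = not any(has_both_neighbors(l, trial) for l in trial)
            if cond_a or (cond_b and cond_c):
                ref.add(link)
                changed = True
    return ref

a1 = {(1, 1), (2, 2), (3, 4)}
a2 = {(1, 1), (2, 2), (2, 3), (5, 5)}
prob = {(5, 5): 0.9, (3, 4): 0.1}
pseudo_ref = build_pseudo_reference(a1, a2, prob, threshold=0.5)
```

In this toy case the link (2, 3) is added through conditions b) and c), (5, 5) through condition a), and (3, 4) is rejected because its probability is below the threshold and it has no neighbor in the set.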
Error of Word Aligner
The third is the calculation of the error of the individual word aligner on each round. For word alignment, a sentence pair is taken as a sample. Thus, we calculate the error rate of each sentence pair as described in (9), which is the same as described in Wu and Wang (2005).
$$\alpha(i) = 1 - \frac{2\,|S_W \cap S_R|}{|S_W| + |S_R|} \tag{9}$$
Where $S_W$ represents the set of alignment links of a sentence pair $i$ identified by the individual interpolated model on each round. $S_R$ is the reference alignment set for the sentence pair.
With the error rate of each sentence pair, we calculate the error of the word aligner on each round. Although we build a pseudo reference set $R_U$ for the unlabeled data, it contains alignment errors. Thus, the weighted sum of the error rates of sentence pairs in the labeled data instead of that in the entire training data is used as the error of the word aligner.
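As a worked illustration of equation (9) and of the aligner error in step (6) of figure 1, consider the sketch below. The function names, the toy link sets, and the convention that an empty hypothesis matched against an empty reference has zero error are assumptions made for the example; the restriction of the sum to labeled sentence pairs follows the text above.

```python
def sentence_error(hyp_links, ref_links):
    """Equation (9): alpha(i) = 1 - 2|S_W & S_R| / (|S_W| + |S_R|)."""
    hyp, ref = set(hyp_links), set(ref_links)
    if not hyp and not ref:
        return 0.0          # assumed convention for two empty link sets
    return 1.0 - 2.0 * len(hyp & ref) / (len(hyp) + len(ref))

def aligner_error(p, hyps, refs, labeled_ids):
    """Weighted sum of sentence error rates over the labeled sentence pairs."""
    return sum(p[i] * sentence_error(hyps[i], refs[i]) for i in labeled_ids)

hyps = [{(1, 1), (2, 2), (3, 3)}, {(1, 1)}, {(1, 2)}]
refs = [{(1, 1), (2, 2), (3, 4), (4, 4)}, {(1, 1)}, {(1, 1)}]
p = [0.5, 0.25, 0.25]                      # normalized sentence weights p_l(i)
alpha_0 = sentence_error(hyps[0], refs[0]) # 1 - 2*2/(3+4) = 3/7
epsilon = aligner_error(p, hyps, refs, labeled_ids=[0, 1])
```

Sentence pair 2 is treated as unlabeled here, so its (pseudo) reference does not contribute to `epsilon`.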
 
Weights Update for Sentence Pairs
The fourth is the weight update for sentence pairs according to the error and the reference set. In a sentence pair, there are usually several word alignment links. Some are correct, and others may be incorrect. Thus, we update the weights according to the number of correct and incorrect alignment links as compared with the reference set, which is shown in step (9) in figure 1.
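Steps (8) and (9) of figure 1 can be illustrated with a small numeric sketch. The function names are hypothetical, and leaving the weight unchanged for a sentence pair without any links is an assumption not stated in the paper.

```python
def beta_from_error(epsilon):
    """Step (8): beta_l = epsilon_l / (1 - epsilon_l)."""
    return epsilon / (1.0 - epsilon)

def update_weight(weight, n_links, n_errors, beta):
    """Step (9): w_{l+1}(i) = w_l(i) * (k + (n - k) * beta_l) / n."""
    if n_links == 0:
        return weight       # assumed: nothing to evaluate, keep the weight
    return weight * (n_errors + (n_links - n_errors) * beta) / n_links

beta = beta_from_error(0.2)                       # 0.2 / 0.8 = 0.25
mixed = update_weight(0.1, n_links=4, n_errors=1, beta=beta)
all_wrong = update_weight(0.1, n_links=4, n_errors=4, beta=beta)
all_right = update_weight(0.1, n_links=4, n_errors=0, beta=beta)
```

Because $\beta_l < 1$ whenever $\varepsilon_l < 1/2$, a sentence pair whose links are all correct is scaled down by $\beta_l$, while a pair whose links are all wrong keeps its weight, so after normalization the harder pairs gain relative weight.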
Weights for Word Alignment Links
The fifth is the weights used when we construct the final ensemble. Besides the weight $\log(1/\beta_l)$, which is the confidence measure of the $l$-th word aligner, we also use the weight $WT_l(e, f)$ to measure the confidence of each alignment link produced by the model $M_l$. The weight $WT_l(e, f)$ is calculated as shown in (10). Wu and Wang (2005) showed that adding this weight improved the word alignment results.
$$WT_l(e, f) = \frac{2 \times count(e, f)}{\sum_{f'} count(e, f') + \sum_{e'} count(e', f)} \tag{10}$$
Where $count(e, f)$ is the occurring frequency of the alignment link $(e, f)$ in the word alignment results of the training data produced by the model $M_l$.
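The link weight of equation (10) and the weighted vote in the output step of figure 1 can be sketched together. This is a hedged illustration: the names `link_weights` and `ensemble_choice`, the word tokens, and the representation of each aligner's hypothesis as a dictionary from source word to chosen target word are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def link_weights(links):
    """Equation (10) for one aligner, from its (e, f) links on the training data."""
    pair_count = Counter(links)
    source_total, target_total = Counter(), Counter()
    for (e, f), c in pair_count.items():
        source_total[e] += c      # sum over f' of count(e, f')
        target_total[f] += c      # sum over e' of count(e', f)
    return {(e, f): 2.0 * c / (source_total[e] + target_total[f])
            for (e, f), c in pair_count.items()}

def ensemble_choice(e, hypotheses, betas, weights):
    """Pick the target word f maximizing sum_l log(1/beta_l) * WT_l(e, f)."""
    scores = defaultdict(float)
    for h, beta, wt in zip(hypotheses, betas, weights):
        f = h.get(e)              # delta(h_l(e), f) is 1 only for this f
        if f is not None:
            scores[f] += math.log(1.0 / beta) * wt.get((e, f), 0.0)
    return max(scores, key=scores.get) if scores else None

links = [("e1", "f1")] * 3 + [("e1", "f2"), ("e2", "f1")]
wt = link_weights(links)
hypotheses = [{"e1": "f1"}, {"e1": "f2"}, {"e1": "f1"}]
choice = ensemble_choice("e1", hypotheses, [0.5, 0.25, 0.5], [wt, wt, wt])
```

Here `wt[("e1", "f1")]` is 2·3 / (4 + 4) = 0.75, and the two aligners voting for `f1` outweigh the single, more confident aligner voting for `f2`.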
4.2 Method 1
This method only uses the labeled data as training data. According to the algorithm in figure 1, we obtain $S_T = S_L$ and $R_T = R_L$. Thus, we only change the distribution of the labeled data. However, we build an unsupervised model using the unlabeled data. On each round, we keep this unsupervised model unchanged, and we rebuild the supervised model by estimating the parameters as described in section 3 with the weighted training data. Then we interpolate the supervised model and the unsupervised model to obtain an interpolated model as described in section 4.1. The interpolated model is used as the alignment model $M_l$ in figure 1. Thus, in this interpolated model, we use both the labeled and unlabeled data. On each round, we rebuild the interpolated model using the rebuilt supervised model and the unchanged unsupervised model. This interpolated model is used to align the training data.
According to the reference set of the labeled data, we calculate the error of the word aligner on each round. According to the error and the reference set, we update the weight of each sample in the labeled data.
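One round of model rebuilding in Method 1 might look like the sketch below. It rests on assumptions the paper does not spell out: each labeled link is assumed to contribute its sentence weight to the counts of equation (3), only the translation table is shown, and `interpolate` applies the linear interpolation of equation (8) to any pair of probabilities that the supervised and unsupervised models assign to the same event. All names and numbers are hypothetical.

```python
from collections import defaultdict

def weighted_translation_table(labeled_links, weights):
    """Equation (3) with counts scaled by sentence weights (an assumption)."""
    counts = defaultdict(float)
    totals = defaultdict(float)
    for links, w in zip(labeled_links, weights):
        for e, f in links:
            counts[(e, f)] += w
            totals[e] += w
    return {(e, f): c / totals[e] for (e, f), c in counts.items()}

def interpolate(p_supervised, p_unsupervised, lam):
    """Equation (8): lam * Pr_S + (1 - lam) * Pr_U."""
    return lam * p_supervised + (1.0 - lam) * p_unsupervised

# Two labeled sentence pairs with their current normalized weights.
labeled_links = [[("e1", "f1"), ("e2", "f2")], [("e1", "f3")]]
table = weighted_translation_table(labeled_links, weights=[0.75, 0.25])
mixed = interpolate(table[("e1", "f1")], 0.25, lam=0.4)
```

With uniform weights this reduces to the plain relative-frequency estimate of equation (3); as boosting shifts weight toward hard sentence pairs, their links dominate the rebuilt supervised model.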
4.3 Method 2
This method uses both the labeled data and the unlabeled data as training data. Thus, we set $S_T = S_L \cup S_U$ and $R_T = R_L \cup R_U$ as described in figure 1. With the labeled data, we build a supervised model, which is kept unchanged on each round.$^2$ With the weighted samples in the training data, we rebuild the unsupervised model with the EM algorithm on each round. Based on these two models, we build an interpolated model as described in section 4.1. The interpolated model is used as the alignment model $M_l$ in figure 1. On each round, we rebuild the interpolated model using the unchanged supervised model and the rebuilt unsupervised model. Then the interpolated model is used to align the training data.
Since the training data includes both labeled and unlabeled data, we need to build a pseudo reference set $R_U$ for the unlabeled data using the method described in section 4.1. According to the reference set $R_L$ of the labeled data, we calculate the error of the word aligner on each round. Then, according to the pseudo reference set $R_U$ and the reference set $R_L$, we update the weight of each sentence pair in the unlabeled data and in the labeled data, respectively.
There are four main differences between 
Method 2 and Method 1.  
(1) On each round, Method 2 changes the distri-
bution of both the labeled data and the unla-
beled data, while Method 1 only changes the 
distribution of the labeled data. 
(2) Method 2 rebuilds the unsupervised model, 
while Method 1 rebuilds the supervised 
model.  
(3) Method 2 uses the labeled data instead of the 
entire training data to estimate the error of 
the word aligner on each round. 
(4) Method 2 uses an automatically built pseudo 
reference set to update the weights for the 
sentence pairs in the unlabeled data. 
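The control flow shared by both methods (figure 1) can be sketched as a short loop. This is a simplified illustration, not the authors' system: `train_and_align` is a hypothetical callback standing in for steps (4) and (5), which in Method 1 would rebuild the supervised model and in Method 2 the unsupervised model before aligning; link sets are sets of position pairs; and the small floor on the error is an added guard against a zero error, which figure 1 does not discuss.

```python
def sentence_error(hyp, ref):
    """Equation (9) for one sentence pair."""
    if not hyp and not ref:
        return 0.0
    return 1.0 - 2.0 * len(hyp & ref) / (len(hyp) + len(ref))

def semi_supervised_adaboost(refs, labeled_ids, train_and_align, rounds):
    """Return a list of (beta_l, hypotheses_l), one entry per completed round."""
    m = len(refs)
    w = [1.0 / m] * m                                  # step (1)
    ensemble = []
    for _ in range(rounds):                            # step (2)
        total = sum(w)
        p = [x / total for x in w]                     # step (3)
        hyps = train_and_align(p)                      # steps (4) and (5)
        eps = sum(p[i] * sentence_error(hyps[i], refs[i])
                  for i in labeled_ids)                # step (6): labeled data only
        if eps > 0.5:                                  # step (7)
            break
        eps = max(eps, 1e-10)      # added guard, not part of figure 1
        beta = eps / (1.0 - eps)                       # step (8)
        ensemble.append((beta, hyps))
        for i in range(m):                             # step (9)
            n = len(hyps[i])
            k = len(hyps[i] - refs[i])                 # links absent from R_T
            if n:
                w[i] = w[i] * (k + (n - k) * beta) / n
    return ensemble

# Toy run: pairs 0 and 1 are labeled; pair 2 has only a pseudo reference.
refs = [{(1, 1), (2, 2)}, {(1, 1)}, {(1, 2)}]
fixed_output = [{(1, 1), (2, 3)}, {(1, 1)}, {(1, 2)}]
ensemble = semi_supervised_adaboost(
    refs, labeled_ids=[0, 1],
    train_and_align=lambda p: fixed_output,   # stub learner ignoring the weights
    rounds=2)
betas = [beta for beta, _ in ensemble]
```

For Method 1, `refs` would hold only the labeled sentence pairs and `labeled_ids` would cover all of them; for Method 2, `refs` would also contain the pseudo references of the unlabeled pairs, which drive the weight update in step (9) but not the error in step (6).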
4.4 Combination 
In the above two sections, we described two semi-supervised boosting methods for word alignment. Although we use interpolated models for word alignment in both Method 1 and Method 2, the interpolated models are trained with different weighted data. Thus, they perform differently on word alignment. In order to further improve the word alignment results, we combine the results of the above two methods as described in (11).
$$h_{3,F}(e) = \arg\max_f \big(\lambda_1 \cdot RS_1(e, f) + \lambda_2 \cdot RS_2(e, f)\big) \tag{11}$$
Where $h_{3,F}(e)$ is the combined hypothesis for word alignment. $RS_1(e, f)$ and $RS_2(e, f)$ are the two ensemble results as shown in figure 1 for Method 1 and Method 2, respectively. $\lambda_1$ and $\lambda_2$ are the constant weights.

2 In fact, we can also rebuild the supervised model according to the weighted labeled data. In this case, as we know, the error of the supervised model increases. Thus, we keep the supervised model unchanged in this method.

5 Experiments
In this paper, we take English to Chinese word alignment as a case study.

5.1 Data
We have two kinds of training data from the general domain: Labeled Data (LD) and Unlabeled Data (UD). The Chinese sentences in the data are automatically segmented into words. The statistics for the data are shown in Table 1. The labeled data is manually word aligned, including 156,421 alignment links.

Data  # Sentence Pairs  # English Words  # Chinese Words
LD    31,069            255,504          302,470
UD    329,350           4,682,103        4,480,034
Table 1. Statistics for Training Data

We use 1,000 sentence pairs as the testing set, which are not included in LD or UD. The testing set is also manually word aligned and includes 8,634 alignment links$^3$.

3 For a non one-to-one link, if m source words are aligned to n target words, we take it as one alignment link instead of m∗n alignment links.

5.2 Evaluation Metrics
We use the same evaluation metrics as described in Wu et al. (2005), which are similar to those in (Och and Ney, 2000). The difference lies in that Wu et al. (2005) take all alignment links as sure links.
If we use $S_G$ to represent the set of alignment links identified by the proposed method and $S_C$ to denote the reference alignment set, the methods to calculate the precision, recall, f-measure, and alignment error rate (AER) are shown in equations (12), (13), (14), and (15). It can be seen that the higher the f-measure is, the lower the alignment error rate is.
$$precision = \frac{|S_G \cap S_C|}{|S_G|} \tag{12}$$
$$recall = \frac{|S_G \cap S_C|}{|S_C|} \tag{13}$$
$$fmeasure = \frac{2 \times |S_G \cap S_C|}{|S_G| + |S_C|} \tag{14}$$
$$AER = 1 - \frac{2 \times |S_G \cap S_C|}{|S_G| + |S_C|} = 1 - fmeasure \tag{15}$$

5.3 Experimental Results
With the data in section 5.1, we get the word alignment results shown in table 2. For all of the methods in this table, we perform bi-directional (source to target and target to source) word alignment, and obtain two alignment results on the testing set. Based on the two results, we get the "refined" combination as described in Och and Ney (2000). Thus, the results in table 2 are those of the "refined" combination. For EM training, we use the GIZA++ toolkit$^4$.

4 It is located at http://www.fjoch.com/GIZA++.html.

Method                 Precision  Recall  F-Measure  AER
Labeled+EM             0.6588     0.5210  0.5819     0.4181
Labeled+Direct         0.7269     0.6609  0.6924     0.3076
Labeled+EM+Boost       0.7384     0.5651  0.6402     0.3598
Labeled+Direct+Boost   0.7771     0.6757  0.7229     0.2771
Unlabeled+EM           0.7485     0.6667  0.7052     0.2948
Unlabeled+EM+Boost     0.8056     0.7070  0.7531     0.2469
Interpolated           0.7555     0.7084  0.7312     0.2688
Method 1               0.7986     0.7197  0.7571     0.2429
Method 2               0.8060     0.7388  0.7709     0.2291
Combination            0.8175     0.7858  0.8013     0.1987
Table 2. Word Alignment Results

Results of Supervised Methods
Using the labeled data, we use two methods to estimate the parameters in IBM model 4: one is to use the EM algorithm, and the other is to estimate the parameters directly from the labeled data as described in section 3. In table 2, the method "Labeled+EM" estimates the parameters with the EM algorithm, which is an unsupervised method without boosting. The method "Labeled+Direct" estimates the parameters directly from the labeled data, which is a supervised method without boosting. "Labeled+EM+Boost" and "Labeled+Direct+Boost" represent the two supervised boosting methods for the above two parameter estimation methods.
Our method that directly estimates the parameters in IBM model 4 is better than the one using the EM algorithm. "Labeled+Direct" is better than "Labeled+EM", achieving a relative error rate reduction of 22.97%. And "Labeled+Direct+Boost" is better than "Labeled+EM+Boost", achieving a relative error rate reduction of 22.98%. In addition, the two boosting methods perform better than their corresponding methods without boosting. For example, "Labeled+Direct+Boost" achieves an error rate reduction of 9.92% as compared with "Labeled+Direct".
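The relation between the columns of table 2 and the reported relative error rate reductions can be checked with a few lines of arithmetic. The helper names below are hypothetical. The f-measure is computed from precision and recall as $2PR/(P+R)$, which is algebraically the same as equation (14), and the relative reduction is taken as the drop in AER divided by the baseline AER.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, equal to equation (14)."""
    return 2.0 * precision * recall / (precision + recall)

def aer(precision, recall):
    """Equation (15): AER = 1 - f-measure when all links are sure links."""
    return 1.0 - f_measure(precision, recall)

def relative_error_reduction(baseline_aer, new_aer):
    """Fraction of the baseline error removed by the new method."""
    return (baseline_aer - new_aer) / baseline_aer

# "Labeled+Direct+Boost" row of table 2.
boost_f = round(f_measure(0.7771, 0.6757), 4)
boost_aer = round(aer(0.7771, 0.6757), 4)
# "Labeled+Direct" (AER 0.3076) versus "Labeled+Direct+Boost" (AER 0.2771).
reduction = round(100 * relative_error_reduction(0.3076, 0.2771), 2)
```

The rounded values reproduce the F-measure of 0.7229 and AER of 0.2771 in table 2, and the 9.92% reduction quoted above.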
Results of Unsupervised Methods   
With the unlabeled data, we use the EM algo-
rithm to estimate the parameters in the model. 
The method "Unlabeled+EM" represents an un-
supervised method without boosting. And the 
method "Unlabeled+EM+Boost" uses the same 
unsupervised Adaboost algorithm as described in 
Wu and Wang (2005). 
The boosting method "Unlabeled+EM+Boost" achieves a relative error rate reduction of 16.25% as compared with "Unlabeled+EM". In addition, the unsupervised boosting method "Unlabeled+EM+Boost" performs better than the supervised boosting method "Labeled+Direct+Boost", achieving an error rate reduction of 10.90%. This is because the labeled data set is small and therefore subject to the data sparseness problem.
Results of Semi-Supervised Methods    
By using both the labeled and the unlabeled data, we interpolate the models trained by "Labeled+Direct" and "Unlabeled+EM" to get an interpolated model. Here, we use "Interpolated" to represent it. "Method 1" and "Method 2" represent the semi-supervised boosting methods described in section 4.2 and section 4.3, respectively. "Combination" denotes the method described in section 4.4, which combines "Method 1" and "Method 2". Both of the weights $\lambda_1$ and $\lambda_2$ in equation (11) are set to 0.5.
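The combination of equation (11) with equal weights can be sketched as follows. The function name `combine`, the dictionary representation of the ensemble scores $RS_1$ and $RS_2$ keyed by `(e, f)`, and the toy scores are assumptions made for illustration; a target word missing from one ensemble is assumed to score zero there.

```python
def combine(e, rs1, rs2, lam1=0.5, lam2=0.5):
    """Equation (11): argmax over f of lam1 * RS_1(e, f) + lam2 * RS_2(e, f)."""
    candidates = ({f for (x, f) in rs1 if x == e} |
                  {f for (x, f) in rs2 if x == e})
    if not candidates:
        return None
    return max(sorted(candidates),
               key=lambda f: lam1 * rs1.get((e, f), 0.0)
                             + lam2 * rs2.get((e, f), 0.0))

# Hypothetical ensemble scores from Method 1 and Method 2 for source word "e1".
rs1 = {("e1", "f1"): 1.2, ("e1", "f2"): 0.3}
rs2 = {("e1", "f2"): 1.5, ("e1", "f3"): 0.2}
best = combine("e1", rs1, rs2)
```

Here `f2` wins with a combined score of 0.9 even though Method 1 alone would have chosen `f1`, which is the kind of disagreement the combination is meant to resolve.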
 "Interpolated" performs better than the meth-
ods using only labeled data or unlabeled data. It 
achieves relative error rate reductions of 12.61% 
and 8.82% as compared with "Labeled+Direct" 
and "Unlabeled+EM", respectively. 
Using an interpolation model, the two semi-
supervised boosting methods "Method 1" and 
"Method 2" outperform the supervised boosting 
method "Labeled+Direct+Boost", achieving a 
relative error rate reduction of 12.34% and 
17.32% respectively. In addition, the two semi-
supervised boosting methods perform better than 
the unsupervised boosting method "Unlabeled+ 
EM+Boost". "Method 1" performs slightly better 
than "Unlabeled+EM+Boost". This is because 
we only change the distribution of the labeled 
data in "Method 1". "Method 2" achieves an er-
ror rate reduction of 7.77% as compared with 
"Unlabeled+EM+Boost". This is because we use 
the interpolated model in our semi-supervised 
boosting method, while "Unlabeled+EM+Boost" 
only uses the unsupervised model. 
Moreover, the combination of the two semi-
supervised boosting methods further improves 
the results, achieving relative error rate reduc-
tions of 18.20% and 13.27% as compared with 
"Method 1" and "Method 2", respectively. It also 
outperforms both the supervised boosting 
method "Labeled+Direct+Boost" and the unsu-
pervised boosting method "Unlabeled+EM+ 
Boost", achieving relative error rate reductions of 
28.29% and 19.52% respectively.  
Summary of the Results    
From the above results, it can be seen that all boosting methods perform better than their corresponding methods without boosting. The semi-supervised boosting methods outperform both the supervised boosting method and the unsupervised boosting method.
6 Conclusion and Future Work 
This paper proposed a semi-supervised boosting 
algorithm to improve statistical word alignment 
with limited labeled data and large amounts of 
unlabeled data. In this algorithm, we built an in-
terpolated model by using both the labeled data 
and the unlabeled data. This interpolated model 
was employed as a learner in the algorithm. Then, 
we automatically built a pseudo reference for the 
unlabeled data, and calculated the error rate of 
each word aligner with the labeled data.  Based 
on this algorithm, we investigated two methods 
for word alignment. In addition, we developed a 
method to combine the results of the above two 
semi-supervised boosting methods. 
Experimental results indicate that our semi-supervised boosting method outperforms the unsupervised boosting method described in Wu and Wang (2005), achieving a relative error rate reduction of 19.52%. It also outperforms the supervised boosting method that only uses the labeled data, achieving a relative error rate reduction of 28.29%. Experimental results also show that all boosting methods outperform their corresponding methods without boosting.
In the future, we will evaluate our method with an available standard testing set. We will also evaluate the word alignment results in a machine translation system, to examine whether a lower word alignment error rate results in higher translation accuracy.
References 
Yaser Al-Onaizan, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, and David Yarowsky. 1999. Statistical Machine Translation Final Report. Johns Hopkins University Workshop.
Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney. 2004. Probabilistic Framework for Semi-Supervised Clustering. In Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), pages 59-68.
Avrim Blum and Tom Mitchell. 1998. Combining Labeled and Unlabeled Data with Co-training. In Proc. of the 11th Conference on Computational Learning Theory (COLT-1998), pages 1-10.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2): 263-311.
Colin Cherry and Dekang Lin. 2003. A Probability Model to Improve Word Alignment. In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-2003), pages 88-95.
Michael Collins and Yoram Singer. 1999. Unsupervised Models for Named Entity Classification. In Proc. of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-1999), pages 100-110.
Thomas G. Dietterich. 2000. Ensemble Methods in Machine Learning. In Proc. of the First International Workshop on Multiple Classifier Systems (MCS-2000), pages 1-15.
Yoav Freund and Robert E. Schapire. 1996. Experiments with a New Boosting Algorithm. In Proc. of the 13th International Conference on Machine Learning (ICML-1996), pages 148-156.
Thorsten Joachims. 1999. Transductive Inference for Text Classification Using Support Vector Machines. In Proc. of the 16th International Conference on Machine Learning (ICML-1999), pages 200-209.
Franz Josef Och and Hermann Ney. 2000. Improved Statistical Alignment Models. In Proc. of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), pages 440-447.
Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1): 19-51.
Thanh Phong Pham, Hwee Tou Ng, and Wee Sun Lee. 2005. Word Sense Disambiguation with Semi-Supervised Learning. In Proc. of the 20th National Conference on Artificial Intelligence (AAAI 2005), pages 1093-1098.
Anoop Sarkar. 2001. Applying Co-Training Methods to Statistical Parsing. In Proc. of the 2nd Meeting of the North American Association for Computational Linguistics (NAACL-2001), pages 175-182.
Dekai Wu. 1997. Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora. Computational Linguistics, 23(3): 377-403.
Hua Wu and Haifeng Wang. 2005. Boosting Statistical Word Alignment. In Proc. of the 10th Machine Translation Summit, pages 313-320.
Hua Wu, Haifeng Wang, and Zhanyi Liu. 2005. Alignment Model Adaptation for Domain-Specific Word Alignment. In Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005), pages 467-474.
David Yarowsky. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proc. of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-1995), pages 189-196.
Hao Zhang and Daniel Gildea. 2005. Stochastic Lexicalized Inversion Transduction Grammar for Alignment. In Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005), pages 475-482.
