Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 25–32,
Ann Arbor, June 2005. c©Association for Computational Linguistics, 2005
Bilingual Word Spectral Clustering for Statistical Machine Translation
Bing Zhao† Eric P. Xing† ‡ Alex Waibel†
†Language Technologies Institute
‡Center for Automated Learning and Discovery
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213
{bzhao,epxing,ahw}@cs.cmu.edu
Abstract
In this paper, a variant of a spectral clus-
tering algorithm is proposed for bilingual
word clustering. The proposed algorithm
generates the two sets of clusters for both
languages efficiently with high seman-
tic correlation within monolingual clus-
ters, and high translation quality across
the clusters between two languages. Each
cluster level translation is considered as
a bilingual concept, which generalizes
words in bilingual clusters. This scheme
improves the robustness for statistical ma-
chine translation models. Two HMM-
based translation models are tested to use
these bilingual clusters. Improved per-
plexity, word alignment accuracy, and
translation quality are observed in our ex-
periments.
1 Introduction
Statistical natural language processing usually suf-
fers from the sparse data problem. Comparing to
the available monolingual data, we have much less
training data especially for statistical machine trans-
lation (SMT). For example, in language modelling,
there are more than 1.7 billion words corpora avail-
able: English Gigaword by (Graff, 2003). However,
for machine translation tasks, there are typically less
than 10 million words of training data.
Bilingual word clustering is a process of form-
ing corresponding word clusters suitable for ma-
chine translation. Previous work from (Wang et al.,
1996) showed improvements in perplexity-oriented
measures using mixture-based translation lexicon
(Brown et al., 1993). A later study by (Och,
1999) showed improvements on perplexity of bilin-
gual corpus, and word translation accuracy using a
template-based translation model. Both approaches
are optimizing the maximum likelihood of parallel
corpus, in which a data point is a sentence pair: an
English sentence and its translation in another lan-
guage such as French. These algorithms are es-
sentially the same as monolingual word clusterings
(Kneser and Ney, 1993)—an iterative local search.
In each iteration, a two-level loop over every possi-
ble word-cluster assignment is tested for better like-
lihood change. This kind of approach has two draw-
backs: first it is easily to get stuck in local op-
tima; second, the clustering of English and the other
language are basically two separated optimization
processes, and cluster-level translation is modelled
loosely. These drawbacks make their approaches
generally not very effective in improving translation
models.
In this paper, we propose a variant of the spec-
tral clustering algorithm (Ng et al., 2001) for bilin-
gual word clustering. Given parallel corpus, first, the
word’s bilingual context is used directly as features
- for instance, each English word is represented by
its bilingual word translation candidates. Second,
latent eigenstructure analysis is carried out in this
bilingual feature space, which leads to clusters of
words with similar translations. Essentially an affin-
ity matrix is computed using these cross-lingual fea-
tures. It is then decomposed into two sub-spaces,
which are meaningful for translation tasks: the left
subspace corresponds to the representation of words
in English vocabulary, and the right sub-space cor-
responds to words in French. Each eigenvector is
considered as one bilingual concept, and the bilin-
gual clusters are considered to be its realizations in
two languages. Finally, a general K-means cluster-
25
ing algorithm is used to find out word clusters in the
two sub-spaces.
The remainder of the paper is structured as fol-
lows: in section 2, concepts of translation models
are introduced together with two extended HMMs;
in section 3, our proposed bilingual word cluster-
ing algorithm is explained in detail, and the related
works are analyzed; in section 4, evaluation metrics
are defined and the experimental results are given;
in section 5, the discussions and conclusions.
2 Statistical Machine Translation
The task of translation is to translate one sentence
in some source language F into a target language E.
For example, given a French sentence with J words
denoted as fJ1 = f1f2...fJ, an SMT system auto-
matically translates it into an English sentence with
I words denoted by eI1 = e1e2...eI. The SMT sys-
tem first proposes multiple English hypotheses in its
model space. Among all the hypotheses, the system
selects the one with the highest conditional proba-
bility according to Bayes’s decision rule:
ˆeI1 = argmax
{eI1}
P(eI1|fJ1 ) = argmax
{eI1}
P(fJ1 |eI1)P(eI1),
(1)
where P(fJ1 |eI1) is called translation model, and
P(eI1) is called language model. The translation
model is the key component, which is the focus in
this paper.
2.1 HMM-based Translation Model
HMM is one of the effective translation models (Vo-
gel et al., 1996), which is easily scalable to very
large training corpus.
To model word-to-word translation, we introduce
the mapping j → aj, which assigns a French word
fj in position j to a English word ei in position
i = aj denoted as eaj. Each French word fj is
an observation, and it is generated by a HMM state
defined as [eaj,aj], where the alignment aj for po-
sition j is considered to have a dependency on the
previous alignment aj−1. Thus the first-order HMM
is defined as follows:
P(fJ1 |eI1) =
summationdisplay
aJ1
Jproductdisplay
j=1
P(fj|eaj)P(aj|aj−1), (2)
where P(aj|aj−1) is the transition probability. This
model captures the assumption that words close in
the source sentence are aligned to words close in
the target sentence. An additional pseudo word of
“NULL” is used as the beginning of English sen-
tence for HMM to start with. The (Och and Ney,
2003) model includes other refinements such as spe-
cial treatment of a jump to a Null word, and a uni-
form smoothing prior. The HMM with these refine-
ments is used as our baseline. Motivated by the work
in both (Och and Ney, 2000) and (Toutanova et al.,
2002), we propose the two following simplest ver-
sions of extended HMMs to utilize bilingual word
clusters.
2.2 Extensions to HMM with word clusters
Let F denote the cluster mapping fj → F(fj), which
assigns French word fj to its cluster ID Fj = F(fj).
Similarly E maps English word ei to its cluster ID
of Ei = E(ei). In this paper, we assume each word
belongs to one cluster only.
With bilingual word clusters, we can extend the
HMM model in Eqn. 1 in the following two ways:
P(fJ1 |eI1) = summationtextaJ
1
producttextJ
j=1 P(fj|eaj)·
P(aj|aj−1,E(eaj−1),F(fj−1)),
(3)
where E(eaj−1) and F(fj−1) are non overlapping
word clusters (Eaj−1,Fj−1)for English and French
respectively.
Another explicit way of utilizing bilingual word
clusters can be considered as a two-stream HMM as
follows:
P(fJ1 ,FJ1 |eI1,EI1) =summationtext
aJ1
producttextJ
j=1 P(fj|eaj)P(Fj|Eaj)P(aj|aj−1).
(4)
This model introduces the translation of bilingual
word clusters directly as an extra factor to Eqn. 2.
Intuitively, the role of this factor is to boost the trans-
lation probabilities for words sharing the same con-
cept. This is a more expressive model because it
models both word and the cluster level translation
equivalence. Also, compared with the model in Eqn.
3, this model is easier to train, as it uses a two-
dimension table instead of a four-dimension table.
However, we do not want this P(Fj|Eaj) to dom-
inate the HMM transition structure, and the obser-
26
vation probability of P(fj|eaj) during the EM itera-
tions. Thus a uniform prior P(Fj) = 1/|F| is intro-
duced as a smoothing factor for P(Fj|Eaj):
P(Fj|Eaj) = λP(Fj|Eaj)+(1−λ)P(Fj), (5)
where |F| is the total number of word clusters in
French (we use the same number of clusters for both
languages). λ can be chosen to get optimal perfor-
mance on a development set. In our case, we fix it to
be 0.5 in all our experiments.
3 Bilingual Word Clustering
In bilingual word clustering, the task is to build word
clusters F and E to form partitions of the vocabular-
ies of the two languages respectively. The two par-
titions for the vocabularies of F and E are aimed to
be suitable for machine translation in the sense that
the cluster/partition level translation equivalence is
reliable and focused to handle data sparseness; the
translation model using these clusters explains the
parallel corpus {(fJ1 ,eI1)} better in terms of perplex-
ity or joint likelihood.
3.1 From Monolingual to Bilingual
To infer bilingual word clusters of (F,E), one can
optimize the joint probability of the parallel corpus
{(fJ1 ,eI1)} using the clusters as follows:
(ˆF, ˆE) = argmax
(F,E)
P(fJ1 ,eI1|F,E)
= argmax
(F,E)
P(eI1|E)P(fJ1 |eI1,F,E).(6)
Eqn. 6 separates the optimization process into two
parts: the monolingual part for E, and the bilingual
part for F given fixed E. The monolingual part is
considered as a prior probability:P(eI1|E), and E can
be inferred using corpus bigram statistics in the fol-
lowing equation:
ˆE = argmax
{E}
P(eI1|E)
= argmax
{E}
Iproductdisplay
i=1
P(Ei|Ei−1)P(ei|Ei). (7)
We need to fix the number of clusters beforehand,
otherwise the optimum is reached when each word
is a class of its own. There exists efficient leave-one-
out style algorithm (Kneser and Ney, 1993), which
can automatically determine the number of clusters.
For the bilingual part P(fJ1 |eI1,F,E), we can
slightly modify the same algorithm as in (Kneser
and Ney, 1993). Given the word alignment {aJ1}
between fJ1 and eI1 collected from the Viterbi path
in HMM-based translation model, we can infer ˆF as
follows:
ˆF = argmax
{F}
P(fJ1 |eI1,F,E)
= argmax
{F}
Jproductdisplay
j=1
P(Fj|Eaj)P(fj|Fj). (8)
Overall, this bilingual word clustering algorithm is
essentially a two-step approach. In the first step, E
is inferred by optimizing the monolingual likelihood
of English data, and secondly F is inferred by op-
timizing the bilingual part without changing E. In
this way, the algorithm is easy to implement without
much change from the monolingual correspondent.
This approach was shown to give the best results
in (Och, 1999). We use it as our baseline to compare
with.
3.2 Bilingual Word Spectral Clustering
Instead of using word alignment to bridge the par-
allel sentence pair, and optimize the likelihood in
two separate steps, we develop an alignment-free al-
gorithm using a variant of spectral clustering algo-
rithm. The goal is to build high cluster-level trans-
lation quality suitable for translation modelling, and
at the same time maintain high intra-cluster similar-
ity , and low inter-cluster similarity for monolingual
clusters.
3.2.1 Notations
We define the vocabulary VF as the French vo-
cabulary with a size of |VF|; VE as the English vo-
cabulary with size of |VE|. A co-occurrence matrix
C{F,E} is built with |VF| rows and |VE| columns;
each element represents the co-occurrence counts of
the corresponding French word fj and English word
ei. In this way, each French word forms a row vec-
tor with a dimension of |VE|, and each dimensional-
ity is a co-occurring English word. The elements in
the vector are the co-occurrence counts. We can also
27
view each column as a vector for English word, and
we’ll have similar interpretations as above.
3.2.2 Algorithm
With C{F,E}, we can infer two affinity matrixes
as follows:
AE = CT{F,E}C{F,E}
AF = C{F,E}CT{F,E},
where AE is an |VE|×|VE| affinity matrix for En-
glish words, with rows and columns representing
English words and each element the inner product
between two English words column vectors. Corre-
spondingly, AF is an affinity matrix of size |VF|×
|VF| for French words with similar definitions. Both
AE and AF are symmetric and non-negative. Now
we can compute the eigenstructure for both AE and
AF . In fact, the eigen vectors of the two are corre-
spondingly the right and left sub-spaces of the orig-
inal co-occurrence matrix of C{F,E} respectively.
This can be computed using singular value decom-
position (SVD): C{F,E} = USV T , AE = VS2V T ,
and AF = US2UT , where U is the left sub-space,
and V the right sub-space of the co-occurrence ma-
trix C{F,E}. S is a diagonal matrix, with the singular
values ranked from large to small along the diagonal.
Obviously, the left sub-space U is the eigenstructure
for AF ; the right sub-space V is the eigenstructure
for AE.
By choosing the top K singular values (the square
root of the eigen values for both AE and AF ), the
sub-spaces will be reduced to: U|VF|×K and V|VE|×K
respectively. Based on these subspaces, we can carry
out K-means or other clustering algorithms to in-
fer word clusters for both languages. Our algorithm
goes as follows:
• Initialize bilingual co-occurrence matrix
C{F,E} with rows representing French words,
and columns English words. Cji is the co-
occurrence raw counts of French word fj and
English word ei;
• Form the affinity matrix AE = CT{F,E}C{F,E}
and AF = CT{F,E}C{F,E}. Kernels can also be
applied here such as AE = exp(C{F,E}C
T
{F,E}
σ2 )for English words. Set A
Eii = 0 and AFii = 0,
and normalize each row to be unit length;
• Compute the eigen structure of the normalized
matrix AE, and find the k largest eigen vectors:
v1,v2,...,vk; Similarly, find the k largest eigen
vectors of AF : u1,u2,...,uk;
• Stack the k eigenvectors of v1,v2,...,vk in
the columns of YE, and stack the eigenvectors
u1,u2,...,uk in the columns for YF ; Normalize
rows of both YE and YF to have unit length. YE
is size of |VE|×k and YF is size of |VF|×k;
• Treat each row of YE as a point in R|VE|×k, and
cluster them into K English word clusters us-
ing K-means. Treat each row of YF as a point in
R|VF|×k, and cluster them into K French word
clusters.
• Finally, assign original word ei to cluster Ek
if row i of the matrix YE is clustered as Ek;
similar assignments are for French words.
Here AE and AF are affinity matrixes of pair-wise
inner products between the monolingual words. The
more similar the two words, the larger the value.
In our implementations, we did not apply a kernel
function like the algorithm in (Ng et al., 2001). But
the kernel function such as the exponential func-
tion mentioned above can be applied here to control
how rapidly the similarity falls, using some carefully
chosen scaling parameter.
3.2.3 Related Clustering Algorithms
The above algorithm is very close to the variants
of a big family of the spectral clustering algorithms
introduced in (Meila and Shi, 2000) and studied in
(Ng et al., 2001). Spectral clustering refers to a class
of techniques which rely on the eigenstructure of
a similarity matrix to partition points into disjoint
clusters with high intra-cluster similarity and low
inter-cluster similarity. It’s shown to be computing
the k-way normalized cut: K −trY TD−12AD−12Y
for any matrix Y ∈ RM×N. A is the affinity matrix,
and Y in our algorithm corresponds to the subspaces
of U and V .
Experimentally, it has been observed that using
more eigenvectors and directly computing a k-way
partitioning usually gives better performance. In our
implementations, we used the top 500 eigen vectors
to construct the subspaces of U and V for K-means
clustering.
28
3.2.4 K-means
The K-means here can be considered as a post-
processing step in our proposed bilingual word clus-
tering. For initial centroids, we first compute the
center of the whole data set. The farthest centroid
from the center is then chosen to be the first initial
centroid; and after that, the other K-1 centroids are
chosen one by one to well separate all the previous
chosen centroids.
The stopping criterion is: if the maximal change
of the clusters’ centroids is less than the threshold of
1e-3 between two iterations, the clustering algorithm
then stops.
4 Experiments
To test our algorithm, we applied it to the TIDES
Chinese-English small data track evaluation test set.
After preprocessing, such as English tokenization,
Chinese word segmentation, and parallel sentence
splitting, there are in total 4172 parallel sentence
pairs for training. We manually labeled word align-
ments for 627 test sentence pairs randomly sampled
from the dry-run test data in 2001, which has four
human translations for each Chinese sentence. The
preprocessing for the test data is different from the
above, as it is designed for humans to label word
alignments correctly by removing ambiguities from
tokenization and word segmentation as much as pos-
sible. The data statistics are shown in Table 1.
English Chinese
Train
Sent. Pairs 4172
Words 133598 105331
Voc Size 8359 7984
Test
Sent. Pairs 627
Words 25500 19726
Voc Size 4084 4827
Unseen Voc Size 1278 1888
Alignment Links 14769
Table 1: Training and Test data statistics
4.1 Building Co-occurrence Matrix
Bilingual word co-occurrence counts are collected
from the training data for constructing the matrix
of C{F,E}. Raw counts are collected without word
alignment between the parallel sentences. Practi-
cally, we can use word alignment as used in (Och,
1999). Given an initial word alignment inferred by
HMM, the counts are collected from the aligned
word pair. If the counts are L-1 normalized, then
the co-occurrence matrix is essentially the bilingual
word-to-word translation lexicon such as P(fj|eaj).
We can remove very small entries (P(f|e) ≤ 1e−7),
so that the matrix of C{F,E} is more sparse for eigen-
structure computation. The proposed algorithm is
then carried out to generate the bilingual word clus-
ters for both English and Chinese.
Figure 1 shows the ranked Eigen values for the
co-occurrence matrix of C{F,E}.
0 100 200 300 400 500 600 700 800 900 10000.5
1
1.5
2
2.5
3
3.5 Eigen values of affinity matrices
Top 1000 Eigen Values
Eigen Values
(a) co−occur counts from init word alignment(b) raw co−occur counts from data
Figure 1: Top-1000 Eigen Values of Co-occurrence
Matrix
It is clear, that using the initial HMM word align-
ment for co-occurrence matrix makes a difference.
The top Eigen value using word alignment in plot a.
(the deep blue curve) is 3.1946. The two plateaus
indicate how many top K eigen vectors to choose to
reduce the feature space. The first one indicates that
K is in the range of 50 to 120, and the second plateau
indicates K is in the range of 500 to 800. Plot b. is
inferred from the raw co-occurrence counts with the
top eigen value of 2.7148. There is no clear plateau,
which indicates that the feature space is less struc-
tured than the one built with initial word alignment.
We find 500 top eigen vectors are good enough
for bilingual clustering in terms of efficiency and ef-
fectiveness.
29
4.2 Clustering Results
Clusters built via the two described methods are
compared. The first method bil1 is the two-step op-
timization approach: first optimizing the monolin-
gual clusters for target language (English), and af-
terwards optimizing clusters for the source language
(Chinese). The second method bil2 is our proposed
algorithm to compute the eigenstructure of the co-
occurrence matrix, which builds the left and right
subspaces, and finds clusters in such spaces. Top
500 eigen vectors are used to construct these sub-
spaces. For both methods, 1000 clusters are inferred
for English and Chinese respectively. The number
of clusters is chosen in a way that the final word
alignment accuracy was optimal. Table 2 provides
the clustering examples using the two algorithms.
settings cluster examples
mono-E1 entirely,mainly,merely
mono-E2 10th,13th,14th,16th,17th,18th,19th20th,21st,23rd,24th,26th
mono-E3 drink,anglophobia,carota,giant,gymnasium
bil1-C3 xe0,x64,x81,x98,x96xcb,x79x51,x79
bil2-E1 alcoholic cognac distilled drinkscotch spirits whiskey
bil2-C1 xb8xcb,xcb,x7f,xf4x80,x32,x86xfx79,xaex68,x37,xa0xc2x7d,x36,x85x2cx12,x6bx12
bil2-E2 evrec harmony luxury people sedan sedanstour tourism tourist toward travel
bil2-C2 x97x12xb2xc,x73x89,x10xb4,xfbx28,x1bxb8,x75xb0,x40x71,x40x89,x7c,x7cxcc,x2dx7c
Table 2: Bilingual Cluster Examples
The monolingual word clusters often contain
words with similar syntax functions. This hap-
pens with esp. frequent words (eg. mono-E1 and
mono-E2). The algorithm tends to put rare words
such as “carota, anglophobia” into a very big cluster
(eg. mono-E3). In addition, the words within these
monolingual clusters rarely share similar transla-
tions such as the typical cluster of “week, month,
year”. This indicates that the corresponding Chi-
nese clusters inferred by optimizing Eqn. 7 are not
close in terms of translational similarity. Overall, the
method of bil1 does not give us a good translational
correspondence between clusters of two languages.
The English cluster of mono-E3 and its best aligned
candidate of bil1-C3 are not well correlated either.
Our proposed bilingual cluster algorithm bil2
generates the clusters with stronger semantic mean-
ing within a cluster. The cluster of bil2-E1 relates
to the concept of “wine” in English. The mono-
lingual word clustering tends to scatter those words
into several big noisy clusters. This cluster also has a
good translational correspondent in bil2-C1 in Chi-
nese. The clusters of bil2-E2 and bil2-C2 are also
correlated very well. We noticed that the Chinese
clusters are slightly more noisy than their English
corresponding ones. This comes from the noise in
the parallel corpus, and sometimes from ambiguities
of the word segmentation in the preprocessing steps.
To measure the quality of the bilingual clusters,
we can use the following two kind of metrics:
• Average epsilon1-mirror (Wang et al., 1996): The epsilon1-
mirror of a class Ei is the set of clusters in
Chinese which have a translation probability
greater than epsilon1. In our case, epsilon1 is 0.05, the same
value used in (Och, 1999).
• Perplexity: The perplexity is defined as pro-
portional to the negative log likelihood of the
HMM model Viterbi alignment path for each
sentence pair. We use the bilingual word clus-
ters in two extended HMM models, and mea-
sure the perplexities of the unseen test data af-
ter seven forward-backward training iterations.
The two perplexities are defined as PP1 =
exp(−summationtextJj=1 log(P(fj|eaj)P(aj|aj−1,Eaj−1,
Fj−1))/J) and PP2 = exp(−J−1summationtextJj=1 log(
P(fj|eaj)P(aj|aj−1)P(Fj−1|Eaj−1))) for the
two extended HMM models in Eqn 3 and 4.
Both metrics measure the extent to which the trans-
lation probability is spread out. The smaller the bet-
ter. The following table summarizes the results on
epsilon1-mirror and perplexity using different methods on
the unseen test data.
algorithms epsilon1-mirror HMM-1 Perp HMM-2 Perp
baseline - 1717.82
bil1 3.97 1810.55 352.28
bil2 2.54 1610.86 343.64
The baseline uses no word clusters. bil1 and bil2
are defined as above. It is clear that our proposed
method gives overall lower perplexity: 1611 from
the baseline of 1717 using the extended HMM-1.
If we use HMM-2, the perplexity goes down even
more using bilingual clusters: 352.28 using bil1, and
343.64 using bil2. As stated, the four-dimensional
30
table of P(aj|aj−1,E(eaj−1),F(fj−1)) is easily
subject to overfitting, and usually gives worse per-
plexities.
Average epsilon1-mirror for the two-step bilingual clus-
tering algorithm is 3.97, and for spectral cluster-
ing algorithm is 2.54. This means our proposed al-
gorithm generates more focused clusters of transla-
tional equivalence. Figure 2 shows the histogram for
the cluster pairs (Fj,Ei), of which the cluster level
translation probabilities P(Fj|Ei) ∈ [0.05,1]. The
interval [0.05,1] is divided into 10 bins, with first bin
[0.05,0.1], and 9 bins divides[0.1,1] equally. The
percentage for clusters pairs with P(Fj|Ei) falling
in each bin is drawn.
Histogram of (F,E) pairs with P(F|E) > 0.05 
0
0.1
0.2
0.3
0.4
0.5
0.6
0.05 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Ten bins for P(F|E) ranging from [0.05, 1.0] 
spec-bi-clusteringtwo-step-bi-clustering
Figure 2: Histogram of cluster pairs (Fj,Ei)
Our algorithm generates much better aligned clus-
ter pairs than the two-step optimization algorithm.
There are 120 cluster pairs aligned with P(Fj|Ei) ≥
0.9 using clusters from our algorithm, while there
are only 8 such cluster pairs using the two-step ap-
proach. Figure 3 compares the epsilon1-mirror at different
numbers of clusters using the two approaches. Our
algorithm has a much better epsilon1-mirror than the two-
step approach over different number of clusters.
Overall, the extended HMM-2 is better than
HMM-1 in terms of perplexity, and is easier to train.
4.3 Applications in Word Alignment
We also applied our bilingual word clustering in a
word alignment setting. The training data is the
TIDES small data track. The word alignments are
manually labeled for 627 sentences sampled from
the dryrun test data in 2001. In this manually
aligned data, we include one-to-one, one-to-many,
and many-to-many word alignments. Figure 4 sum-
marizes the word alignment accuracy for different
e-mirror over different settings
00.5
11.5
22.5
33.5
44.5
0 200 400 600 800
100
0
120
0
140
0
160
0
180
0
200
0
number of clusters
e
-
m
i
r
r
o
r
BIL2: Co-occur raw countsBIL2: Co-occur counts from init word-align
BIL1: Two-step optimizationFigure 3: epsilon1-mirror with different settings
methods. The baseline is the standard HMM trans-
lation model defined in Eqn. 2; the HMM1 is de-
fined in Eqn 3, and HMM2 is defined in Eqn 4. The
algorithm is applying our proposed bilingual word
clustering algorithm to infer 1000 clusters for both
languages. As expected, Figure 4 shows that using
F-measure of word alignment
38.00%
39.00%
40.00%
41.00%
42.00%
43.00%
44.00%
45.00%
1 2 3 4 5 6 7HMM Viterbi Iterations
F
-
m
e
a
s
u
r
e
Baseline HMMExtended HMM-1
Extended HMM-2Figure 4: Word Alignment Over Iterations
word clusters is helpful for word alignment. HMM2
gives the best performance in terms of F-measure of
word alignment. One quarter of the words in the test
vocabulary are unseen as shown in Table 1. These
unseen words related alignment links (4778 out of
14769) will be left unaligned by translation models.
Thus the oracle (best possible) recall we could get
is 67.65%. Our standard t-test showed that signifi-
cant interval is 0.82% at the 95% confidence level.
The improvement at the last iteration of HMM is
marginally significant.
4.4 Applications in Phrase-based Translations
Our pilot word alignment on unseen data showed
improvements. However, we find it more effective
in our phrase extraction, in which three key scores
31
are computed: phrase level fertilities, distortions,
and lexicon scores. These scores are used in a lo-
cal greedy search to extract phrase pairs (Zhao and
Vogel, 2005). This phrase extraction is more sen-
sitive to the differences in P(fj|ei) than the HMM
Viterbi word aligner.
The evaluation conditions are defined in NIST
2003 Small track. Around 247K test set (919 Chi-
nese sentences) specific phrase pairs are extracted
with up to 7-gram in source phrase. A trigram
language model is trained using Gigaword XinHua
news part. With a monotone phrase-based decoder,
the translation results are reported in Table 3. The
Eval. Baseline Bil1 Bil2
NIST 6.417 6.507 6.582
BLEU 0.1558 0.1575 0.1644
Table 3: NIST’03 C-E Small Data Track Evaluation
baseline is using the lexicon P(fj|ei) trained from
standard HMM in Eqn. 2, which gives a BLEU
score of 0.1558 +/- 0.0113. Bil1 and Bil2 are using
P(fj|ei) from HMM in Eqn. 4 with 1000 bilingual
word clusters inferred from the two-step algorithm
and the proposed one respectively. Using the clus-
ters from the two-step algorithm gives a BLEU score
of 0.1575, which is close to the baseline. Using clus-
ters from our algorithm, we observe more improve-
ments with BLEU score of 0.1644 and a NIST score
of 6.582.
5 Discussions and Conclusions
In this paper, a new approach for bilingual word
clustering using eigenstructure in bilingual feature
space is proposed. Eigenvectors from this feature
space are considered as bilingual concepts. Bilin-
gual clusters from the subspaces expanded by these
concepts are inferred with high semantic correla-
tions within each cluster, and high translation quali-
ties across clusters from the two languages.
Our empirical study also showed effectiveness of
using bilingual word clusters in extended HMMs for
statistical machine translation. The K-means based
clustering algorithm can be easily extended to do hi-
erarchical clustering. However, extensions of trans-
lation models are needed to leverage the hierarchical
clusters appropriately.

References
P.F. Brown, Stephen A. Della Pietra, Vincent. J.
Della Pietra, and Robert L. Mercer. 1993. The mathe-
matics of statistical machine translation: Parameter es-
timation. In Computational Linguistics, volume 19(2),
pages 263–331.
David Graff. 2003. Ldc gigaword corpora: English gi-
gaword (ldc catalog no: Ldc2003t05). In LDC link:
http://www.ldc.upenn.edu/Catalog/index.jsp.
R. Kneser and Hermann Ney. 1993. Improved clus-
tering techniques for class-based statistical language
modelling. In European Conference on Speech Com-
munication and Technology, pages 973–976.
Marina Meila and Jianbo Shi. 2000. Learning segmenta-
tion by random walks. In Advances in Neural Informa-
tion Processing Systems. (NIPS2000), pages 873–879.
A. Ng, M. Jordan, and Y. Weiss. 2001. On spectral
clustering: Analysis and an algorithm. In Advances in
Neural Information Processing Systems 14: Proceed-
ings of the 2001.
Franz J. Och and Hermann Ney. 2000. A comparison of
alignment models for statistical machine translation.
In COLING’00: The 18th Int. Conf. on Computational
Linguistics, pages 1086–1090, Saarbrucken, Germany,
July.
Franz J. Och and Hermann Ney. 2003. A systematic
comparison of various statistical alignment models. In
Computational Linguistics, volume 29, pages 19–51.
Franz J. Och. 1999. An efficient method for determin-
ing bilingal word classes. In Ninth Conf. of the Europ.
Chapter of the Association for Computational Linguis-
tics (EACL’99), pages 71–76.
Kristina Toutanova, H. Tolga Ilhan, and Christopher D.
Manning. 2002. Extensions to hmm-based statistical
word alignment models. In Proc. of the Conference on
Empirical Methods in Natural Language Processing.
S. Vogel, Hermann Ney, and C. Tillmann. 1996. Hmm
based word alignment in statistical machine transla-
tion. In Proc. The 16th Int. Conf. on Computational
Lingustics, (Coling’96), pages 836–841.
Yeyi Wang, John Lafferty, and Alex Waibel. 1996.
Word clustering with parallel spoken language cor-
pora. In proceedings of the 4th International Con-
ference on Spoken Language Processing (ICSLP’96),
pages 2364–2367.
Bing Zhao and Stephan Vogel. 2005. A generalized
alignment-free phrase extraction algorithm. In ACL
2005 Workshop: Building and Using Parallel Cor-
pora: Data-driven Machine Translation and Beyond,
Ann Arbor, Michigan.
