Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 41–44,
New York, June 2006. c©2006 Association for Computational Linguistics
Spectral Clustering for Example Based Machine Translation
Rashmi Gangadharaiah
LTI
Carnegie Mellon University
Pittsburgh P.A. 15213
rgangadh@andrew.cmu.edu
Ralf Brown
LTI
Carnegie Mellon University
Pittsburgh P.A. 15213
ralf@cs.cmu.edu
Jaime Carbonell
LTI
Carnegie Mellon University
Pittsburgh P.A. 15213
jgc@cs.cmu.edu
Abstract
Prior work has shown that generaliza-
tion of data in an Example Based Ma-
chine Translation (EBMT) system, re-
duces the amount of pre-translated text re-
quired to achieve a certain level of accu-
racy (Brown, 2000). Several word clus-
tering algorithms have been suggested to
perform these generalizations, such as k-
Means clustering or Group Average Clus-
tering. The hypothesis is that better con-
textual clustering can lead to better trans-
lation accuracy with limited training data.
In this paper, we use a form of spectral
clustering to cluster words, and this is
shown to result in as much as 29.08% im-
provement over the baseline EBMT sys-
tem.
1 Introduction
In EBMT, the source sentence to be translated
is matched against the source language sentences
present in a corpus of source-target sentence pairs.
When a partial match is found, the corresponding
target translations are obtained through subsenten-
tial alignment. These partial matches are put to-
gether to obtain the final translation by optimizing
translation and alignment scores and using a statisti-
cal target language model in the decoding process.
Prior work has shown that EBMT requires large
amounts of data (in the order of two to three mil-
lion words) (Brown, 2000) of pre-translated text, to
function reasonably well. Thus, some modification
of the basic EBMT method is required to make it ef-
fective when less data is available. In order to use
the available text efficiently, systems such as, (Veale
and Way, 1997) and (Brown, 1999), convert the ex-
amples in the corpus into templates against which
the new text can be matched. Thus, source-target
sentence pairs are converted to source-target gener-
alized template pairs. An example of such a pair is
shown below:
The session opened at 2p.m
La s´eance est ouverte ´a 2 heures
The <event><verb-past-tense> at <time>
La <event><verb-past-tense> a <time>
This single template can be used to translate differ-
ent source sentences, including for example,
The session adjourned at 6p.m
The seminar opened at 8a.m
if ‘session’ and ‘seminar’ are both generalized to
‘<event>’, ‘opened’ and ‘adjourned’ are both gen-
eralized to ‘<verb-past-tense>’ and finally ‘6p.m’
and ‘8a.m’ are both generalized to ‘<time>’.
The system used by (Brown, 1999) performs
its generalization using both equivalence classes of
words and a production rule grammar. This paper
describes the use of spectral clustering (Ng. et. al.,
2001; Zelnik-Manor and Perona, 2004), for auto-
mated extraction of equivalence classes. Spectral
clustering is seen to be superior to Group Average
Clustering (GAC) (Brown, 2000) both in terms of
semantic similarity of words falling in a single clus-
ter, and overall BLEU score (Papineni. et. al., 2002)
in a large scale EBMT system.
The next section explains the term vectors ex-
tracted for each word, which are then used to cluster
words into equivalence classes and provides an out-
line of the Standard GAC algorithm. Section 3 de-
scribes the spectral clustering algorithm used. Sec-
41
tion 4 lists results obtained in a full evaluation of the
algorithm. Section 5 concludes and discusses direc-
tions for future work.
2 Term vectors for clustering
Using a bilingual dictionary, usually created using
statistical methods such as those of (Brown et. al.,
1990) or (Brown, 1997), and the parallel text, a
rough mapping between source and target words can
be created. This word pair is then treated as an in-
divisible token for future processing. For each such
word pair we then accumulate counts for each to-
ken in the surrounding context of its occurrences
(N words, currently 3, immediately prior to and
N words immediately following). The counts are
weighted with respect to distance from occurrence,
with a linear decay (from 1 to 1/N) to give great-
est importance to the words immediately adjacent to
the word pair being examined. These counts form a
pseudo-document for each pair, which are then con-
verted into term vectors for clustering.
In this paper, we compare our algorithm against
the incremental GAC algorithm(Brown, 2000). This
method examines each word pair in turn, comput-
ing a similarity measure to every existing cluster.
If the best similarity measure is above a predeter-
mined threshold, the new word is placed in the cor-
responding cluster, otherwise a new cluster is cre-
ated if the maximum number of clusters has not yet
been reached.
3 Spectral clustering
Spectral clustering is a general term used to de-
scribe a group of algorithms that cluster points using
the eigenvalues of ‘distance matrices’ obtained from
data. In our case, the algorithm described in (Ng.
et. al., 2001) was performed with certain variations
that were proposed by (Zelnik-Manor and Perona,
2004) to compute the scaling factors automatically
and for the k-Means orthogonal treatment (Verma
and Meila, 2003) during the initialization. These
scaling factors help in self-tuning distances between
points according to the local statistics of the neigh-
borhoods of the points. The algorithm is briefly de-
scribed below.
1. Let S =s1,s2,....sn, denote the term vectors to
be clustered into k classes.
2. Form the affinity matrix A defined by
Aij = exp(−d2(si,sj)/σiσj) for i negationslash= j
Aii = 1
Where, d(si,sj) = 1/(sim(si,sj)+epsilon1)
sim(si,sj) is the Cosine similarity between si
and sj, epsilon1 is used to prevent the ratio from be-
coming infinity
σi is the set of local scaling parameters for si.
σi = d(si,sT) where, sT is the Tth neighbor of
point si for some fixed T (7 for this paper).
3. Define D to be the diagonal matrix given by,
Dii = ΣjAij
4. Compute L = D−1/2AD−1/2
5. Select k eigenvectors corresponding to k
largest eigenvalues (k is presently an externally
set parameter). The eigenvectors are normal-
ized to have unit length. Form matrix U by
stacking all the eigenvectors in columns.
6. Form the matrix Y by normalizing U’s rows,
Yij = Uij/
radicalBig
(ΣjU2ij)
7. Perform k-Means clustering treating each row
of Y as a point in k dimensions. The k-Means
algorithm is initialized either with random cen-
ters or with orthogonal vectors.
8. After clustering, assign the point si to cluster c
if the corresponding row i of the matrix Y was
assigned to cluster c.
9. Sum the distances between the members and
the centroid of each cluster to obtain the classi-
fication cost.
10. Goto step 7, iterate for a fixed number of it-
erations. In this paper, 20 iterations were per-
formed with orthogonal k-Means initialization
and 5 iterations with random k-Means initial-
ization.
11. The clusters obtained from the iteration with
least classification cost are selected as the k
clusters.
4 Preliminary Results
The clusters obtained from the spectral clustering
method are seen by inspection to correspond to more
natural and intuitive word classes than those ob-
tained by GAC. Even though this is subjective and
not guaranteed to lead to improve translation perfor-
mance, it shows that maybe the increased power of
spectral clustering to represent non-convex classes
42
(non-convex in the term vector domain) could be
useful in a real translation experiment. Some ex-
ample classes are shown in Table 1. The first
class in an intuitive sense corresponds to measure-
ment units. We see that in the <units> case,
GAC misses some of the members which are ac-
tually distributed among many different classes and
hence these are not well generalized. In the second
class <months>, spectral clustering has primarily
the months in a single class whereas GAC adds a
number of seemingly unrelated words to the clus-
ter. The classes were all obtained by finding 80
clusters in a 20,000-sentence pair subset of the IBM
Hansard Corpus (Linguistic Data Consortium, 1997)
for spectral clustering. 80 was chosen as the number
of clusters since it gave the highest BLEU score in
the evaluation. For GAC, 300 clusters were used as
this gave the best performance.
To show the effectiveness of the clustering meth-
ods in an actual evaluation, we set up the following
experiment for an English to French translation task
on the Hansard corpus. The training data consists of
three sets of size 10,000 (set1), 20,000 (set2) and
30,000 (set3) sentence pairs chosen from the first
six files of the Hansard Corpus. Only sentences of
length 5 to 21 words were taken. Only words with
frequency of occurrence greater than 9 were chosen
for clustering because more contextual information
would be available when the word occurs frequently
and this would help in obtaining better clusters. The
test data was chosen to be a set of 500 sentences ob-
tained from files 20, 40, 60 and 80 of the Hansard
corpus with 125 sentences from each file. Each of
the methods was run with different number of clus-
ters and results are reported only for the optimal
number of clusters in each case.
The results in Table 2 show that spectral clus-
tering requires moderate amounts of data to get a
large improvement. For small amounts of data it is
slightly worse than GAC, but neither gives much im-
provement over the baseline. For larger amounts of
data, again both methods are very similar, though
spectral clustering is better. Finally, for moderate
amounts of data, when generalization is the most
useful, spectral clustering gives a significant im-
provement over the baseline as well as over GAC.
By looking at the clusters obtained with varying
amounts of data, it can be concluded that high pu-
Table 1: Clusters for <units> and <months>
Spectral clustering GAC
“adjourned” “hre” “adjourned” “hre”
“cent” “%”
“days” “jours”
“families” “familles” “families” “familles”
“hours” “heures”
“million” “millions” “million” “millions”
“minutes” “minutes”
“o’clock” “heures” “o’clock” “heures”
“p.m.” “heures” “p.m.” “heures”
“p.m.” “hre”
“people” “personnes” “people” “personnes”
“per” “%” “per” “%”
“times” “fois” “times” “fois”
“years” “ans”
“august” “aoˆut” “august” “aoˆut”
“december” “d´ecembre” “december” “d´ecembre”
“february” “f´evrier” “february” “f´evrier”
“january” “janvier” “january” “janvier”
“march” “mars” “march” “mars”
“may” “mai” “may” “mai”
“november” “novembre” “november” “novembre”
“october” “octobre” “october” “octobre”
“only” “seulement” “only” “seulement”
“june” “juin” “june” “juin”
“july” “juillet” “july” “juillet”
“april” “avril” “april” “avril”
“september” “septembre” “september” “septembre”
“page” “page”
“per” “$”
“recognize” “parole”
“recognized” “parole”
“recorded” “page”
“section” “article”
“since” “depuis” “since” “depuis”
“took” “s´eance”
“under” “loi”
43
Table 2: % Relative improvement over baseline EBMT
# clus is the number of clusters for best performance
GAC Spectral
% Rel imp #clus % Rel imp #clus
10k 3.33 50 1.37 20
20k 22.47 300 29.08 80
30k 2.88 300 3.88 200
rity clusters can be obtained with even just moderate
amounts of data.
5 Conclusions and future work
From the experimental results we see that spectral
clustering leads to relatively purer and more intu-
itive clusters. These clusters result in an improved
BLEU score in comparison with the clusters ob-
tained through GAC. GAC can only collect clusters
in convex regions in the term vector space, while
spectral clustering is not limited in this regard. The
ability of spectral clustering to represent non-convex
shapes arises due to the projection onto the eigen-
vectors as described in (Ng. et. al., 2001).
As future work, we would like to analyze the
variation in performance as the amount of data in-
creases. It is widely known that increasing the
amount of training data in a generalized EBMT sys-
tem eventually leads to saturation of performance,
where all clustering methods perform about as well
as baseline. Thus, all methods have an operating re-
gion where they are the most useful. We would like
to locate and extend this region for spectral cluster-
ing.
Also, it would be interesting to compare the clus-
ters obtained with spectral clustering and the Part of
Speech tags of the words in the same cluster, espe-
cially for languages such as English where good tag-
gers are available.
Finally, an important direction of research is in
automatically selecting the number of clusters for
the clustering algorithm. To do this, we could use
information from the eigenvalues or the distribution
of points in the clusters.
Acknowledgment
This work was funded by National Business Center
award NBCHC050082.
References
Andrew Ng, Michael Jordan, and Yair Weiss 2001. On
Spectral Clustering: Analysis and an algorithm. In Ad-
vances in Neural Information Processing Systems 14:
Proceeding of the 2001 Conference, pages 849-856,
Vancouver, British Columbia, Canada, December.
Deepak Verma and Marina Meila. 2003. Comparison of
Spectral Clustering Algorithms. http://www.ms.
washington.edu/∼spectral/.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei
Jing Zhu. 2002. BLEU: a method for Automatic
Evaluation of Machine Translation. In Proceedings
of the 40th Annual Meeting of the Association for
Computational Linguistics (ACL 2002), pages 311-
318,Philadelphia, PA, July. http://acl.ldc.
upenn.edu/P/P02
Linguistic Data Consortium. 1997. Hansard Corpus of
Parallel English and French. Linguistic Data Con-
sortium, December. http://www.ldc.upenn.
edu/
L. Zelnik-Manor and P. Perona 2004 Self-Tuning Spec-
tral Clustering. In Advances in Neural Information
Processing Systems 17: Proceeding of the 2004 Con-
ference.
Peter Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F.
Jelinek, J. Lafferty, R. Mercer and P. Roossin. 1990.
A Statistical Approach to Machine Translation. Com-
putational Linguistics, 16:79-85.
Ralf D. Brown. 1997. Automated Dictionary Extrac-
tion for “Knowledge-Free” Example-Based Transla-
tion. In Proceedings of the Seventh International Con-
ference on Theoretical and Methodological Issues in
Machine Translation (TMI-97), pages 111-118, Santa
Fe, New Mexico, July. http://www.cs.cmu.
edu/∼ralf/papers.html
Ralf D. Brown. 1999. Adding Linguistic Knowledge
to a Lexical Example-Based Translation System. In
Proceedings of the Eighth International Conference
on Theoretical and Methodological Issues in Machine
Translation(TMI-99), pages 22-32, August. http:
//www.cs.cmu.edu/∼ralf/papers.html
Ralf. D. Brown. 2000. Automated Generalization of
Translation Examples. In Proceedings of Eighteenth
International Conference on Computational Linguis-
tics (COLING-2000), pages 125-131, Saarbr¨ucken,
Germany.
Tony Veale and Andy Way. 1997. Gaijin: A Template-
Driven Bootstrapping Approach to Example-Based
Machine Translation. In Proceedings of NeMNLP97,
New Methods in Natural Language Processing, Sofia,
Bulgaria, September. http://www.compapp.
dcu.ie/∼tonyv/papers/gaijin.html.
44
