Extracting Word Sequence Correspondences
with Support Vector Machines
Kengo SATO and Hiroaki SAITO
Department of Information and Computer Science
Keio University
3-14-1, Hiyoshi, Kohoku, Yokohama 223-8522, Japan
fsatoken,hxsg@nak.ics.keio.ac.jp
Abstract
This paper proposes a learning and extracting
method of word sequence correspondences from
non-aligned parallel corpora with Support Vector
Machines, which have high ability of the generaliza-
tion, rarely cause over-fit for training samples and
can learn dependencies of features by using a kernel
function. Our method uses features for the trans-
lation model which use the translation dictionary,
the number of words, part-of-speech, constituent
words and neighbor words. Experiment results in
which Japanese and English parallel corpora are
used archived 81.1 % precision rate and 69.0 % re-
call rate of the extracted word sequence correspon-
dences. This demonstrates that our method could
reduce the cost for making translation dictionaries.
1 Introduction
Translation dictionaries used in multilingual natu-
ral language processing such as machine transla-
tion have been made manually, but a great deal of
labor is required for this work and it is di cult
to keep the description of the dictionaries consis-
tent. Therefore, researches of extracting transla-
tion pairs from parallel corpora automatically be-
come active recently (Gale and Church, 1991; Kaji
and Aizono, 1996; Tanaka and Iwasaki, 1996; Kita-
mura and Matsumoto, 1996; Fung, 1997; Melamed,
1997; Sato and Nakanishi, 1998).
This paper proposes a learning and extract-
ing method of bilingual word sequence correspon-
dences from non-aligned parallel corpora with Sup-
port Vector Machines (SVMs) (Vapnik, 1999).
SVMs are ones of large margin classifiers (Smola
et al., 2000) which are based on the strategy where
margins between separating boundary and vectors
of which elements express the features of train-
ing samples is maximized. Therefore, SVMs have
higer ability of the generalization than other learn-
ing models such as the decision trees and rarely
cause over-fit for training samples. In addition, by
using kernel functions, they can learn non-linear
separating boundary and dependencies between the
features. Therefore, SVMs have been recently used
for the natural language processing such as text
categorization (Joachims, 1998; Taira and Haruno,
1999), chunk identification (Kudo and Matsumoto,
2000b), dependency structure analysis (Kudo and
Matsumoto, 2000a).
The method proposed in this paper does not re-
quire aligned parallel corpora which do not exist too
many at present. Therefore, without limiting appli-
cable domains, word sequence correspondences can
been extracted.
2 Support Vector Machines
SVMs are binary classifiers which linearly separate
d dimension vectors to two classes. Each vector rep-
resents the sample which has d features. It is distin-
guished whether given sample ~x = (x1;x2;:::;xd)
belongs toX1 orX2 by equation (1) :
f (~x)=sign(g(~x))=
( 1  ~x2X
1
 
 1  ~x2X2 (1)
where g(~x) is the hyperplain which separates two
classes in which ~w and b are decided by optimiza-
tion.
g(~x)=~w ~x+b (2)
Let supervise signals for the training samples be
expressed as
yi =
( 1  ~x
i 2X1
 
 1  ~xi 2X2 
whereX1 is a set of positive samples andX2 is a set
of negative samples.
If the training samples can be separated linearly,
there could exist two or more pairs of ~w and b that
PSfrag replacements X1
X22=jj~wjj
~w ~x+b=0
~w ~x+b=1
~w ~x+b= 1
Figure 1: A separating hyperplain
satisfy equation (1). Therefore, give the following
constraints :
8i; yi(~w ~xi+b) 1 0 (3)
Figure 1 shows that the hyperplain which sepa-
rates the samples. In this figure, solid line shows
separating hyperplain ~w ~x+b = 0 and two dotted
lines show hyperplains expressed by ~w ~x+b= 1.
The constraints (3) mean that any vectors must not
exist inside two dotted lines. The vectors on dotted
lines are called support vectors and the distance be-
tween dotted lines is called a margin, which equals
to 2=jj~wjj.
The learning algorithm for SVMs could optimize
~w and b which maximize the margin 2=jj~wjjor min-
imize jj~wjj2=2 subject to constraints (3). According
to Lagrange’s theory, the optimization problem is
transformed to minimizing the Lagrangian L :
L= 12jj~wjj2+
nX
i=1
 i yi(~w ~xi+b 1) (4)
where  i  0 (i = 1;:::;n) are the Lagrange mul-
tipliers. By di erentiating with respect to ~w and b,
the following relations are obtained,
@L
@~w =~w 
nX
i=1
 iyi~x=0 (5)
@L
@b =
nX
i=1
 iyi =0 (6)
and substituting equations (5) (6) into equation (4)
to obtain
D= 12
nX
i=1
nX
j=1
 i jyiy j~xi ~x j+
nX
i=1
 i (7)
Consequently, the optimization problem is trans-
formed to maximizing the object function D subject
to Pni=1 iyi = 0 and  i  0. For the optimal pa-
rameters  = arg max D, each training sample ~xi
where  i >0 is corresponding to support vector.
~w can be obtained from equation (5) and b can be
obtained from
b=yi ~w ~xi
where ~xi is an arbitrary support vector. From equa-
tion (2) (5), the optimal hyperplain can be expressed
as the following equation with optimal parameters
  :
g(~x)=
nX
i=1
  i yi~xi ~x+b (8)
The training samples could be allowed in some
degree to enter the inside of the margin by changing
equation (3) to :
8i; yi(~w ~xi+b) 1+ i  0 (9)
where i  0 are called slack variables. At this time,
the maximal margin problem is enhanced as mini-
mizing jj~wjj2=2+CPni=1 i, where C expresses the
weight of errors. As a result, the problem is to max-
imize the object function D subject toPni=1 iyi =0
and 0  i  C.
For the training samples which cannot be sepa-
rated linearly, they might be separated linearly in
higher dimension by mapping them using a non-
linear function:
 : Rd 7!Rd0
A linear separating in Rd0 for (~x) is same as a non-
linear separating in Rd for~x. Let satisfy
K(~x;~x0)= (~x)  (~x0) (10)
where K(~x;~x0) is called kernel function. As a result,
the object function is rewritten to
D= 12
nX
i=1
nX
j=1
 i jyiy jK(~xi;~x j)+
nX
i=1
 i (11)
and the optimal hyperplain is rewritten to
g(~x)=
nX
i=1
  i yiK(~xi;~x)+b (12)
Note that  does not appear in equation (11) (12).
Therefore, we need not calculate in higher dimen-
sion.
The well-known kernel functions are the polyno-
mial kernel function (13) and the Gaussian kernel
function (14).
K(~x;~x0)=(~x ~x0+1)p (13)
K(~x;~x0)=exp
0B
BBB@ jj~x ~x0jj2
2 2
1C
CCCA (14)
A non-linear separating using one of these kernel
functions is corresponding to separating with con-
sideration of the dependencies between the features
in Rd.
3 Extracting Word Sequence
Correspondences with SVMs
3.1 Outline
The method proposed in this paper can obtain word
sequence correspondences (translation pairs) in the
parallel corpora which include Japanese and En-
glish sentences. It consists of the following three
steps:
1. Make training samples which include positive
samples as translation pairs and negative sam-
ples as non-translation pairs from the train-
ing corpora manually, and learn a translation
model from these with SVMs.
2. Make a set of candidates of translation pairs
which are pairs of phrases obtained by pars-
ing both Japanese sentences and English sen-
tences.
3. Extract translation pairs from the candidates by
inputting them to the translation model made in
step 1.
3.2 Features for the Translation Model
To apply SVMs for extracting translation pairs, the
candidates of the translation pairs must be converted
into feature vectors. In our method, they are com-
posed of the following features:
1. Features which use an existing translation dic-
tionary.
(a) Bilingual word pairs in the translation
dictionary which are included in the can-
didates of the translation pairs.
(b) Bilingual word pairs in the translation
dictionary which are co-occurred in the
context in which the candidates appear.
2. Features which use the number of words.
(a) The number of words in Japanese phrases.
(b) The number of words in English phrases.
3. Features which use the part-of-speech.
(a) The ratios of appearance of noun, verb,
adjective and adverb in Japanese phrases.
(b) The ratios of appearance of noun, verb,
adjective and adverb in English phrases.
4. Features which use constituent words.
(a) Constituent words in Japanese phrases.
(b) Constituent words in English phrases.
5. Features which use neighbor words.
(a) Neighbor words which appear in Japanese
phrases just before or after.
(b) Neighbor words which appear in English
phrases just before or after.
Two types of the features which use an existing
translation dictionary are used because the improve-
ment of accuracy can be expected by e ectively us-
ing existing knowledge in the features. For features
(1a), words included in a candidate of the trans-
lation pair are looked up with the translation dic-
tionary and the bilingual word pairs in the candi-
date become features. They are based on the idea
that a translation pair would include many bilingual
word pairs. Each bilingual word pair included in
the dictionary is allocated to the dimension of the
feature vectors. If a bilingual word pair appears in
the candidate of translation pair, the value of the
corresponding dimension of the vector is set to 1,
and otherwise it is set to 0. For features (1b), all
pairs of words which co-occurred with a candidate
of the translation pair are looked up with the trans-
lation dictionary and the bilingual word pairs in the
dictionary become features. They are based on the
idea that the context of the words which appear in
neighborhood looks like each other for the trans-
lation pairs although expressed in the two di erent
languages (Kaji and Aizono, 1996). The candidates
are converted into the feature vectors just like (1a).
Features (2a) (2b) are based on the idea that there
is a correlation in the number of constituent words
of the phrases of both languages in the translation
pair. The number of constituent words of each lan-
guage is used for the feature vector.
Features (3a) (3b) are based on the idea that there
is a correlation in the ratio of content words (noun,
verb, adjective and adverb) which appear in the
phrases of both languages in a translation pair. The
ratios of the numbers of noun, verb, adjective and
adverb to the number of words of the phrases of
each language are used for the feature vector.
For features (4a) (4b), each content word (noun,
verb, adjective and adverb) is allocated to the di-
mension of the feature vectors for each language. If
a word appears in the candidate of translation pair,
the value of the corresponding dimension of the vec-
tor is set to 1, and otherwise it is set to 0.
For features (5a) (5b), each content words (noun,
verb, adjective and adverb) is allocated to the di-
mension of the feature vectors for each language. If
a word appears in the candidate of translation pair
just before or after, the value of the corresponding
dimension of the vector is set to 1, and otherwise it
is set to 0.
3.3 Learning the Translation Model
Training samples which include positive samples as
the translation pairs and negative samples as the
non-translation pairs are made from the training
corpora manually, and are converted into the fea-
ture vectors by the method described in section 3.2.
For supervise signals yi, each positive sample is as-
signed to +1 and each negative sample is assigned
to  1. The translation model is learned from them
by SVMs described in section 2. As a result, the
optimal parameters  for SVMs are obtained.
3.4 Making the Candidate of the Translation
Pairs
A set of candidates of translation pairs is made from
the combinations of phrases which are obtained by
parsing both Japanese and English sentences. How
to make the combinations does not require sen-
tence alignments between both languages. Because
the set grows too big for all the combinations, the
phrases used for the combinations are limited in up-
per bound of the number of constituent words and
only noun phrases and verb phrases.
3.5 Extracting the Translation Pairs
The candidates of the translation pairs are converted
into the feature vectors with the method described
in section 3.2. By inputting them to equation (8)
with the optimal parameters   obtained in section
3.3, +1 or  1 could be obtained as the output for
each vector. If the output is+1, the candidate corre-
sponding to the input vector is the translation pair,
otherwise it is not the translation pair.
4 Experiments
To confirm the e ectiveness of the method de-
scribed in section 3, we did the experiments where
the English Business Letter Example Collection
published from Nihon Keizai Shimbun Inc. are used
as parallel corpora, which include Japanese and En-
glish sentences which are examples of business let-
ters, and are marked up at translation pairs.
As both training and test corpora, 1,000 sentences
were used. The translation pairs which are already
marked up in the corpora were corrected to the form
described in section 3.4 to be used as the positive
samples. Japanese sentences were parsed by KNP 1
and English sentences were parsed by Apple Pie
Parser 2. The negative samples of the same number
as the positive samples were randomly chosen from
combinations of phrases which were made by pars-
ing and of which the numbers of constituent words
were below 8 words. As a result, 2,000 samples
(1,000 positives and 1,000 negatives) for both train-
ing and test were prepared.
The obtained samples must be converted into the
feature vectors by the method described in section
3.2. For features (1a) (1b), 94,511 bilingual word
pairs included in EDICT 3 were prepared. For fea-
tures (4a) (4b) (5a) (5b), 1,009 Japanese words and
890 English words which appeared in the training
corpora above 3 times were used. Therefore, the
number of dimensions for the feature vectors was
94;511 2+1 2+4 2+1;009+890+1;009+890=
192;830.
S V Mlight 4 was used for the learner and the clas-
sifier of SVMs. For the kernel function, the squared
polynomial kernel (p = 2 in equation (13)) was
used, and the error weight C was set to 0:01.
The translation model was learned by the train-
ing samples and the translation pairs were extracted
from the test samples by the method described in
section 3.
1http://www-lab25.kuee.kyoto-u.ac.jp/
nl-resource/knp.html
2http://www.cs.nyu.edu/cs/projects/proteus/
app/
3http://www.csse.monash.edu.au/~jwb/edict.
html
4http://svmlight.joachims.org/
 0
 20
 40
 60
 80
 100
 0  2  4  6  8  10  12  14  16  18  20
rate (%)
the number of the training samples (x1.0e02)
Precision
Recall
Figure 2: Transition in the precision rate and the
recall rate when the number of the training samples
are increased
Table 1 shows the precision rate and the recall
rate of the extracted translation pairs, and table 2
shows examples of the extracted translation pairs.
Table 1: Precision and recall rate
Outputs Corrects Precision Recall
851 690 81.1 % 69.0 %
5 Discussion
Figure 2 shows the transition in the precision rate
and the recall rate when the number of the training
samples are increased from 100 to 2,000 by every
100 samples. The recall rate rose according to the
number of the training samples, and reaching the
level-o in the precision rate since 1,300. There-
fore, it suggests that the recall rate can be improved
without lowering the precision rate too much by in-
creasing the number of the training samples.
Figure 3 shows that the transition in the precision
rate and the recall rate when the number of the bilin-
gual word pairs in the translation dictionary are in-
creased from 0 to 90,000 by every 5,000 pairs. The
precision rate rose almost linearly according to the
number of the pairs, and reaching the level-o in the
recall rate since 30,000. Therefore, it suggests that
the precision rate can be improved without lowering
the recall rate too much by increasing the number of
the bilingual word pairs in the translation dictionary.
Table 3 shows the precision rate and the recall
rate when each kind of features described in section
3.2 was removed. The values in parentheses in the
columns of the precision rate and the recall rate are
 0
 20
 40
 60
 80
 100
 0  10  20  30  40  50  60  70  80  90  100
rate (%)
the size of dictionary (x1.0e03)
Precision
Recall
Figure 3: Transition in the precision rate and the
recall rate when the number of the bilingual word
pairs in the translation dictionary are increased
di erences with the values when all the features are
used. The fall of the precision rate when the features
which use the translation dictionary (1a) (1b) were
removed and the fall of the recall rate when the fea-
tures which use the number of words (2a) (2b) were
removed were especially large.
It is clear that feature (1a) (1b) could restrict
the translation model most strongly in all features.
Therefore, if feature (1a) (1b) were removed, it
causes a good translation model not to be able to
be learned only by the features of the remainder
because of the weak constraints, wrong outputs in-
creased, and the precision rate has fallen.
Only features (2a) (2b) surely appear in all sam-
ples although some other features appeared in the
training samples may not appear in the test samples.
So, in the test samples, the importance of features
(2a) (2b) are increased on the coverage of the sam-
ples relatively. Therefore, if features (2a) (2b) were
removed, it causes the recall rate to fall because of
the low coverage of the samples.
6 Related Works
With di erence from our method, there have been
researches which are based on the assumption of
the sentence alignments for parallel corpora (Gale
and Church, 1991; Kitamura and Matsumoto, 1996;
Melamed, 1997). (Gale and Church, 1991) has used
the  2 statistics as the correspondence level of the
word pairs and has showed that it was more e ective
than the mutual information. (Kitamura and Mat-
sumoto, 1996) has used the Dice coe cient (Kay
and R¨oschesen, 1993) which was weighted by the
logarithm of the frequency of the word pair as the
Table 2: Examples of translation pairs extracted by our method
Japanese English
a0a2a1a2a3a2a4a2a5a2a6a2a7a9a8a10a5a2a6a2a11 chairman of a special program committee
a12a14a13a16a15a2a17a14a18a2a19a10a20a22a21 o cially retired as
a23 a17 a8a2a24a10a1a2a25a14a8a9a26a16a27a2a28a14a29a2a30 a21a16a31a14a32a2a33a22a34 would like to say an o cial farewell
30a35 a18a10a36a14a33a2a37a39a38 a8a10a40a2a41 my thirty years of experience
a42a10a43a14a44 a8a10a45a14a29a10a46a9a47 a37 sharpen up on my golf
Table 3: Precision rate and recall rate when each kind of features is removed
Feature Num. Outputs Corrects Precision (%) Recall (%)
(1a) 94,511 891 686 77:0 ( 4:1) 68:6 ( 0:4)
(1b) 94,511 1,058 719 68:0 ( 13:1) 71:9 (+2:9)
(1) 189,022 1,237 756 61:1 ( 20:0) 75:6 (+6:6)
(2a) 1 742 611 82:3 (+1:3) 61:1 ( 7:9)
(2b) 1 755 600 79:5 ( 1:6) 60:0 ( 9:0)
(2) 2 489 404 82:6 (+1:5) 40:4 ( 28:6)
(3a) 4 846 685 81:0 ( 0:1) 68:5 ( 0:5)
(3b) 4 834 660 79:1 ( 1:9) 66:0 ( 3:0)
(3) 8 840 661 78:7 ( 2:4) 66:1 ( 2:9)
(4a) 1,009 814 668 82:1 (+1:0) 66:8 ( 2:2)
(4b) 890 855 698 81:6 (+0:6) 69:8 (+0:8)
(4) 1,899 838 689 82:2 (+1:1) 68:9 ( 0:1)
(5a) 1,009 844 683 80:9 ( 0:2) 68:3 ( 0:7)
(5b) 890 851 688 80:8 ( 0:3) 68:8 ( 0:2)
(5) 1,899 845 682 80:7 ( 0:4) 68:2 ( 0:8)
All features 192,830 851 690 81.1 69.0
correspondence level of the word pairs. (Melamed,
1997) has proposed the Competitive Linking Algo-
rithm for linking the word pairs and a method which
calculates the optimized correspondence level of the
word pairs by hill climbing.
These methods could archive high accuracy be-
cause of the assumption of the sentence alignments
for parallel corpora, but they have the problem with
narrow applicable domains because there are not too
many parallel corpora with sentence alignments at
present. However, because our method does not
require sentence alignments, it can be applied for
wider applicable domains.
Like our method, researches which are not based
on the assumption of the sentence alignments for
parallel corpora have been done (Kaji and Aizono,
1996; Tanaka and Iwasaki, 1996; Fung, 1997).
They are based on the idea that the context of
the words which appear in neighborhood looks
like each other for the translation pairs although
expressed in two di erent languages. (Kaji and
Aizono, 1996) has proposed the correspondence
level calculated by the size of intersection between
co-occurrence sets with the word included in an ex-
isting translation dictionary. (Tanaka and Iwasaki,
1996) has proposed a method for obtaining the
bilingual word pairs by optimizing the matrix of the
translation probabilities so that the distance of the
matrices of the probabilities of co-occurrences of
words which appeared in each language might be-
come small. (Fung, 1997) has calculated the vectors
in which the weighted mutual information between
the word in the corpora and the word included in an
existing translation dictionary was an element, and
has used these inner products as the correspondence
level of word pairs.
There is a common point between these method
and ours on the idea that the context of the words
which appear in neighborhood looks like each other
for the translation pairs because features (1b) are
based on the same idea. However, since our method
caught extracting the translation pairs as the ap-
proach of the statistical machine learning, it could
be expected to improve the performance by adding
new features to the translation model. In addition,
if learning the translation model for the training
samples is done once with our method, the model
need not be learned again for new samples although
it needs the positive and negative samples for the
training data. However, the methods introduced
above must learn a new model again for new cor-
pora.
(Sato and Nakanishi, 1998) has proposed a
method for learning a probabilistic translation
model with Maximum Entropy (ME) modeling
which was the same approach of the statistical ma-
chine learning as SVMs, in which co-occurrence
information and morphological information were
used as features and has archived 58.25 % accuracy
with 4,119 features. ME modeling might be similar
to SVMs on using features for learning a model, but
feature selection for ME modeling is more di cult
because ME modeling is easier to cause over-fit for
training samples than SVMs. In addition, ME mod-
eling cannot learn dependencies between features,
but SVMs can learn them automatically using a ker-
nel function. Therefore, SVMs could learn more
complex and e ective model than ME modeling.
7 Conclusion
In this paper, we proposed a learning and ex-
tracting method of bilingual word sequence corre-
spondences from non-aligned parallel corpora with
SVMs. Our method used features for the transla-
tion model which use the translation dictionary, the
number of words, the part-of-speech, constituent
words and neighbor words. Experiment results in
which Japanese and English parallel corpora are
used archived 81.1 % precision rate and 69.0 %
recall rate of the extracted translation pairs. This
demonstrates that our method could reduce the cost
for making translation dictionaries.
Acknowledgments
We would like to thank Nihon Keizai Shimbun Inc.
for giving us the research application permission of
the English Business Letter Example Collection.

References

Pascale Fung. 1997. Finding terminology translation
from non-parallel corpora. In Proceeding of the 5th
Workshop on Very Large Corpora, pages 192-202.

William A. Gale and Kenneth W. Church. 1991. Identi-
fying word correspondances in parallel texts. In Pro-
ceedings of the 2nd Speech and Natural Language
Workshop, pages 152-157.

Thorsten Joachims. 1998. Text categorization with sup-
port vector machines: Learning with many relevant
features. In the 10th European Conference on Ma-
chine Learning, pages 137-142.

Hiroyuki Kaji and Toshiko Aizono. 1996. Extracting
word correspondences from bilingual corpora based
on word co-occurrence information. In Proceedings
of the 16th International Conference on Computa-
tional Linguistics, pages 23-28.

Martin Kay and Martin R¨oschesen. 1993. Text-
translation alignment. Computational Linguistics,
19(1):121-142.

Mihoko Kitamura and Yuji Matsumoto. 1996. Auto-
matic extraction of word sequence correspondences in
parallel corpora. In Proceeding of the 4th Workshop
on Very Large Corpora, pages 78-89.

Taku Kudo and Yuji Matsumoto. 2000a. Japanese de-
pendency structure analysis based on support vector
machines. In Proceedings of the 2000 Joint SIGDAT
Conference on Emprical Methods in Natural Lan-
guage Processing and Very Large Corpora, pages 18-
25, Hong Kong, October.

Taku Kudo and Yuji Matsumoto. 2000b. Use of support
vector learning for chunk identification. In Proceed-
ings of the 4th Conference on Computational Natural
Language Learning and the 2nd Learning Language
in Logic Workshop, pages 142-144, Lisbon, Septem-
ber.

I. Dan Melamed. 1997. A word-to-word model of trans-
lation equivalence. In Proceedings of the 35th Annual
Meeting of the Association for Computational Lin-
guistics, pages 490-497.

Kengo Sato and Masakazu Nakanishi. 1998. Maximum
entropy model learning of the translation rules. In
Proceedings of the 36th Annual Meeting of the Asso-
ciation for Computational Linguistics and the 17th In-
ternational Conference on Computational Linguistics,
pages 1171-1175, August.

Alexander J. Smola, Peter J. Bartlett, bernha Sch¨olkopf,
and Dale Schuurmans, editors. 2000. Advances in
Large Margin Classifiers. MIT Press.

Hirotoshi Taira and Masahiko Haruno. 1999. Feature
selection in svm text categorization. In Proceedings
of the 16th National Conference of the American As-
socitation of Artificial Intelligence, pages 480-486,
Florida, July.

Kumiko Tanaka and Hideya Iwasaki. 1996. Extraction
of lexical translatins from non-aligned corpora. In
Proceedings of the 16th International Conference on
Computational Linguistics, pages 580-585.

Vladimir Naumovich Vapnik. 1999. The Nature of Sta-
tistical Learning Theory (Statistics for Engineering
and Information Seience). Springer-Verlag Telos, 2nd
edition, December.
