A Comparative Evaluation of Data-driven Models in Translation
Selection of Machine Translation
–Yu-Seop Kim ? Jeong-Ho Chang ?Byoung-Tak Zhang
– Ewha Institute of Science and Technology, Ewha Woman’s Univ.
Seoul 120-750 Korea, yskim01@ewha.ac.kr⁄
? Schools of Computer Science and Engineering, Seoul National Univ.
Seoul 151-742 Korea, fjhchang, btzhangg@bi.snu.ac.kry
Abstract
We present a comparative evaluation of two
data-driven models used in translation selec-
tionofEnglish-Koreanmachinetranslation. La-
tent semantic analysis(LSA) and probabilistic
latent semantic analysis (PLSA) are applied for
the purpose of implementation of data-driven
models in particular. These models are able to
represent complex semantic structures of given
contexts, like text passages. Grammatical rela-
tionships, stored in dictionaries, are utilized in
translation selection essentially. We have used
k-nearest neighbor (k-NN) learning to select an
appropriate translation of the unseen instances
in the dictionary. The distance of instances in
k-NN is computed by estimating the similar-
ity measured by LSA and PLSA. For experi-
ments, we used TREC data(AP news in 1988)
for constructing latent semantic spaces of two
models and Wall Street Journal corpus for eval-
uating the translation accuracy in each model.
PLSA selected relatively more accurate transla-
tions than LSA in the experiment, irrespective
of the value of k and the types of grammatical
relationship.
1 Introduction
Construction of language associated resources
like thesaurus, annotated corpora, machine-
readable dictionary and etc. requires high de-
gree of cost, since they need much of human-
efiort, which is also dependent heavily upon hu-
man intuition. A data-driven model, however,
does not demand any of human-knowledge,
knowledge bases, semantic thesaurus, syntac-
tic parser or the like. This model represents
⁄ He is supported by Brain Korea 21 project performed
by Ewha Institute of Science and Technology.
y They are supported by Brain Tech. project controlled
by Korean Ministry of Science and Technology.
latent semantic structure of contexts like text
passages. Latentsemanticanalysis(LSA)(Lan-
dauer et al., 1998) and probabilistic latent se-
mantic analysis (PLSA) (Hofmann, 2001) fall
under the model.
LSA is a theory and method for extracting
and representing the contextual-usage meaning
of words. This method has been mainly used
for indexing and relevance estimation in infor-
mation retrieval area (Deerwester et al., 1990).
And LSA could be utilized to measure the co-
herenceoftexts(Foltzetal.,1998). Byapplying
the basic concept, a vector representation and
a cosine computation, to estimate relevance of
a word and/or a text and coherence of texts,
we could also estimate the semantic similarity
between words. It is claimed that LSA repre-
sents words of similar meaning in similar ways
(Landauer et al., 1998).
Probabilistic LSA (PLSA) is based on proba-
bilisticmixturedecompositionwhileLSAisona
linear algebra and singular value decomposition
(SVD) (Hofmann, 1999b). In contrast to LSA,
PLSA’s probabilistic variant has a sound statis-
tical foundation and deflnes a proper generative
model of the data. Both two techniques have
a same idea which is to map high-dimensional
vectors representing text documents, to a lower
dimensional representation, called a latent se-
mantic space (Hofmann, 1999a).
Dagan (Dagan et al., 1999) performed a com-
parative analysis of several similarity measures,
which based mainly on conditional probability
distribution. And the only elements in the dis-
tribution are words, which appeared in texts.
However, LSA and PLSA expressed the latent
semantic structures, called a topic of the con-
text.
In this paper, we comparatively evaluated
these two techniques performed in translation
selection of English-Korean machine transla-
tion. First, we built a dictionary storing tu-
ples representing the grammatical relationship
of two words, like subject-verb, object-verb, and
modifler-modifled. Second, with an input tuple,
inwhichaninputwordwouldbetranslatedand
the other would be used as an argument word,
translation is performed by searching the dic-
tionary with the argument word. Third, if the
argument word is not listed in the dictionary,
we used k-nearest neighbor learning method to
determine which class of translation is appro-
priate for the translation of an input word. The
distance used in discovering the nearest neigh-
bors was computed by estimating the similarity
measured on above latent semantic spaces.
In the experiment, we used 1988 AP news
corpus from TREC-7 data (Voorhees and Har-
man, 1998) for building latent semantic spaces
and Wall Street Journal (WSJ) corpus for con-
structing a dictionary and test sets. We ob-
tained 11-20% accuracy improvement, compar-
ing to a simple dictionary search method. And
PLSAhasshownthatitsabilitytoselectanap-
propriate translation is superior to LSA as an
extent of up to 3%, without regard to the value
of k and grammatical relationship.
In section 2, we discuss two of data-driven
models, LSA and PLSA. Section 3 describes
ways of translation with a grammatical rela-
tion dictionary and k-nearest neighbor learning
method. Experiment is explained in Section 4
andconcludingremarksarepresentedinSection
5.
2 Data-Driven Model
For the data-driven model which does not re-
quire additional human-knowledge in acquiring
information, Latent Semantic Analysis (LSA)
andProbabilisticLSA(PLSA)areappliedtoes-
timate semantic similarity among words. Next
twosubsectionswillexplainhowLSAandPLSA
are to be adopted to measuring semantic simi-
larity.
2.1 Latent Semantic Analysis
ThebasicideaofLSAisthattheaggregateofall
the word contexts in which a given word does
and does not appear provides a set of mutual
constraints that largely determines the similar-
ityofmeaningofwordsandsetsofwordstoeach
other(Landaueretal.,1998)(GotohandRenals,
1997). LSA also extracts and infers relations of
expected contextual usage of words in passages
of discourse. It uses no human-made dictionar-
ies, knowledge bases, semantic thesaurus, syn-
tactic parser or the like. Only raw text parsed
into unique character strings is needed for its
input data.
The flrst step is to represent the text as a
matrix in which each row stands for a unique
word and each column stands for a text passage
or other context. Each cell contains the occur-
rence frequency of a word in the text passage.
Next, LSA applies singular value decomposi-
tion (SVD) to the matrix. SVD is a form of
factor analysis and is deflned as
A = U§V T (1)
,where § is a diagonal matrix composed of
nonzero eigen values of AAT or ATA, and U
and V are the orthogonal eigenvectors associ-
atedwiththe r nonzeroeigenvaluesof AAT and
ATA, respectively. One component matrix (U)
describes the original row entities as vectors of
derivedorthogonalfactorvalue,another(V)de-
scribes the original column entities in the same
way, andthethird(§)isadiagonalmatrixcon-
taining scaling values when the three compo-
nents are matrix-multiplied, the original matrix
is reconstructed.
The singular vectors corresponding to the
k(k • r) largest singular values are then used
to deflne k-dimensional document space. Using
thesevectors,m£kandn£kmatricesUk andVk
mayberedeflnedalongwith k£k singularvalue
matrix Pk. It is known that Ak = Uk§kV Tk is
the closest matrix of rank k to the original ma-
trix.
LSA can represent words of similar meaning
insimilarways. Thiscanbeclaimedbythefact
thatonecompareswordswithsimilarvectorsas
derived from large text corpora. The term-to-
term similarity is based on the inner products
between two row vectors of A, AAT = U§2UT.
One might think of the rows of U§ as deflning
coordinates for terms in the latent space. To
calculate the similarity of coordinates, V1 and
V2, cosine computation is used:
cos` = V1¢V2kV
1 k¢kV2 k
(2)
2.2 Probabilistic Latent Semantic
Analysis
Probabilistic latent semantic analysis (PLSA)
is a statistical technique for the analysis of two-
modeandco-occurrencedata,andhasproduced
some meaningful results in such applications
as language modelling (Gildea and Hofmann,
1999)anddocumentindexingininformationre-
trieval (Hofmann, 1999b). PLSA is based on
aspect model where each observation of the co-
occurrence data is associated with a latent class
variable z 2 Z = fz1;z2;:::;zKg (Hofmann,
1999a). For text documents, the observation is
an occurrence of a word w 2W in a document
d 2 D, and each possible state z of the latent
class represents one semantic topic.
A word-document co-occurrence event,
(d;w), is modelled in a probabilistic way where
it is parameterized as in
P(d;w) = X
z
P(z)P(d;wjz)
= X
z
P(z)P(wjz)P(djz): (3)
Here, w and d are assumed to be condition-
ally independent given a speciflc z. P(wjz) and
P(djz) are topic-speciflc word distribution and
document distribution, respectively. The three-
way decomposition for the co-occurrence data
is similar to that of SVD in LSA. But the ob-
jective function of PLSA, unlike that of LSA,
is the likelihood function of multinomial sam-
pling. And the parameters P(z);P(wjz), and
P(djz) are estimated by maximization of the
log-likelihood function
L = X
d2D
X
w2W
n(d;w)logP(d;w); (4)
and this maximization is performed using the
EM algorithm as for most latent variable mod-
els. Details on the parameter estimation are
referred to (Hofmann, 1999a). To compute
the similarity of w1 and w2, P(zkjw1)P(zkjw2)
should be approximately computed with being
derived from
P(zkjw) = P(zk)P(wjzk)P
zk2Z P(zk)P(wjzk)
(5)
And we can evaluate similarities with the
low-dimensional representation in the semantic
topic space P(zkjw1) and P(zkjw2).
3 Translation with Grammatical
Relationship
3.1 Grammatical Relationship
We used grammatical relations stored in the
form of a dictionary for translation of words.
The structure of the dictionary is as follows
(Kim and Kim, 1998):
T(Si) =
8>
><
>>:
T1 if Cooc(Si; S1)
T2 if Cooc(Si; S2)
:::
Tn otherwise;
(6)
where Cooc(Si;Sj) denotes grammatical co-
occurrenceofsourcewordsSi andSj,whichone
means an input word to be translated and the
other means an argument word to be used in
translation, and Tj is the translation result of
the source word. T(¢) denotes the translation
process.
Table1showsagrammaticalrelationshipdic-
tionary for an English verb Si =‘build’ and its
objectnounsasaninputwordandanargument
word, respectively. The dictionary shows that
the word ‘build’ is translated into flve difierent
translated words in Korean, depending on the
context. For example, ‘build’ is translated into
‘geon-seol-ha-da’ (‘construct’) when its object
noun is a noun ‘plant’ (=‘factory’), into ‘che-
chak-ha-da’ (‘produce’) when co-occurring with
the object noun ‘car’, and into ‘seol-lip-ha-da’
(‘establish’) in the context of object noun ‘com-
pany’ (Table 2).
One of the fundamental di–culties in co-
occurrence-based approaches to word sense dis-
ambiguation (translation selection in this case)
is the problem of data sparseness or unseen
words. For example, for an unregistered object
noun like ‘vehicle’ in the dictionary, the correct
translation of the verb cannot be selected us-
ing the dictionary described above. In the next
subsection, we will present k-nearest neighbor
method that resolves this problem.
3.2 k-Nearest Neighbor Learning for
Translation Selection
The similarity between two words on latent se-
mantic spaces is required when performing k-
NN search to select the translation of a word.
The nearest instance of a given word is decided
by selecting a word with the highest similarity
to the given word.
Table 1: Examples of co-occurrence word lists for a verb ‘build’ in the dictionary
Meaning of ‘build’ in Korean (Tj) Collocated Object Noun (Sj)
‘geon-seol-ha-da’ (= ‘construct’) plant facility network ...
‘geon-chook-ha-da’ (= ‘design’) house center housing ...
‘che-chak-ha-da’ (= ‘produce’) car ship model ...
‘seol-lip-ha-da’ (= ‘establish’) company market empire ...
‘koo-chook-ha-da’ (= ‘develop’) system stake relationship ...
Table 2: Examples of translation of ‘build’
source words translated words (in Korean) sense of the verb
‘build a plant’ ) ‘gong-jang-eul geon-seol-ha-da’ ‘construct’
‘build a car’ ) ‘ja-dong-cha-reul che-chak-ha-da’ ‘produce’
‘build a company’ ) ‘hoi-sa-reul seol-lip-ha-da’ ‘establish’
The k-nearest neighbor learning algorithm
(Cover and Hart, 1967)(Aha et al., 1991) as-
sumes all instances correspond to points in the
n-dimensional space Rn. We mapped the n-
dimensionalspaceintothe n-dimensionalvector
of a word for an instance. The nearest neigh-
bors of an instance are deflned in terms of the
standard Euclidean distance.
Then the distance between two instances xi
and xj, D(xi;xj), is deflned to be
D(xi;xj) =
q
(a(xi)¡a(xj))2 (7)
and a(xi) denotes the value of instance xi, sim-
ilarly to cosine computation between two vec-
tors. Let us consider learning discrete-valued
target functions of the form f : Rn ! V,
where V is the flnite set fv1;:::;vsg. The k-
nearest neighbor algorithm for approximating a
discrete-valued target function is given in Table
3.
The value ^f(xq) returned by this algorithm
as its estimate of f(xq) is just the most com-
mon value of f among the k training examples
nearest to xq.
4 Experiment and Evaluation
4.1 Data for Latent Space and
Dictionary
In the experiment, we used two kinds of cor-
pus data, one for constructing LSA and PLSA
spaces and the other for building a dictionary
containing grammatical relations and a test set.
79,919 texts in 1988 AP news corpus from
TREC-7datawasindexedwithastemmingtool
and19,286wordswiththefrequencyofabove20
Table 3: The k-nearest neighbor learning algo-
rithm.
† Training
{ For each training example hx;f(x)i,
add the example to the list
training examples.
† Classiflcation
{ Given a query instance xq to be clas-
sifled,
⁄ Let x1;:::;xk denote the k in-
stances from training examples
that are nearest to xq.
⁄ Return
^f(xq)ˆargmaxv2V kX
i=1
–(v;f(xi)) ;
where –(a;b) = 1 if a = b and
–(a;b) = 0 otherwise.
are extracted. We built 200 dimensions in SVD
ofLSAand128latentdimensionsofPLSA.The
difierence of the numbers was caused from the
degree of computational complexity in learning
phase. Actually, PLSA of 128 latent factors re-
quired 50-fold time as much as LSA hiring 200
eigen-vectorspaceduringbuildinglatentspaces.
This was caused by 50 iterations which made
the log likelihood maximized. We utilized a sin-
Figure 1: The accuracy ration of verb-object
gle vector lanczos algorithm derived from SVD-
PACK when constructing LSA space. (Berry
et al., 1993). We generated both of LSA and
PLSA spaces, with each word having a vector
of 200 and 128 dimensions, respectively. The
similarity of any two words could be estimated
by performing cosine computation between two
vectors representing coordinates of the words in
the spaces.
Table 4 shows 5 most similar words of ran-
domly selected words from 3,443 examples. We
extracted 3,443 example sentences containing
grammatical relations, like verb-object, subject-
verb and adjective-noun, from Wall Street Jour-
nal corpus of 220,047 sentences and other
newspapers corpus of 41,750 sentences, totally
261,797 sentences. We evaluated the accu-
racy performance of each grammatical rela-
tion. 2,437, 188, and 818 examples were uti-
lized for verb-object, subject-verb, and adjective-
noun, respectively. The selection accuracy was
measured using 5-fold cross validation for each
grammatical relation. Sample sentences of each
grammatical relation were divided into flve dis-
joint samples and each sample became a test
sample once in the experiment and the remain-
ing four samples were combined to make up a
collocation dictionary.
4.2 Experimental Result
Table 5 and flgure 1-3 show the results of
translation selection with respect to the applied
model and to the value of k. As shown in Table
5, similarity based on data-driven model could
improve the selection accuracy up to 20% as
Figure 2: The accuracy ration of subject-verb
Figure3: Theaccuracyrationof adjective-noun
contrasted with the direct matching method.
We could obtain the result that PLSA could
improve the accuracy more than LSA in almost
all cases. The amount of improvement is varied
from -0.12% to 2.96%.
As flgure 1-3 show, the value of k had afiec-
tion to the translation accuracy in PLSA, how-
ever, not in LSA. From this, we could not de-
clare whether the value of k and translation ac-
curacy have relationship of each other or not
in the data-driven models described in this pa-
per. However,wecouldalsoflndthatthedegree
of accuracy was raised in accordance with the
value of k in PLSA. From this, we consequently
inferred that the latent semantic space gener-
atedbyPLSAhadmoresounddistributionwith
re ection of well-structured semantic structure
than LSA. Only one of three grammatical re-
Table 4: Lists of 5 most semantically similar words for randomly selected words generated from
LSA, and PLSA. The words are stems of original words. The flrst row of each selected word stands
for the most similar words in LSA semantic space and the second row stands for those in the PLSA
space.
selected words most similar words
plant westinghous isocyan shutdown zinc manur
radioact hanford irradi tritium biodegrad
car buick oldsmobil chevrolet sedan corolla
highwai volkswagen sedan vehicular vehicle
home parapleg broccoli coconut liverpool jamal
memori baxter hanlei corwin headston
business entrepreneur corpor custom ventur flrm
digit compat softwar blackston zayr
ship vessel sail seamen sank sailor
destroy frogmen maritim skipper vessel
Table5: Translationaccuracyinvariouscase. Theflrstcolumnstandsforeachgrammaticalrelation
and the second column stands for the used models, LSA or PLSA. And other three columns stand
for the accuracy ratio (rm) with respect to the value of k. The numbers in parenthesis of the flrst
column show the translation accuracy ratio of simple dictionary search method (rs). And numbers
in the other parenthesis were obtained by rm ¥rs.
grammatical used k = 1 k = 5 k = 10
relations model
verb-object LSA 84.41(1.17) 83.01(1.16) 84.24(1.17)
(71.85) PLSA 84.53(1.18) 85.35(1.19) 86.05(1.20)
subject-verb LSA 83.99(1.11) 84.62(1.11) 84.31(1.11)
(75.93) PLSA 86.85(1.14) 87.49(1.15) 87.27(1.15)
adjective-noun LSA 80.93(1.15) 80.32(1.14) 80.93(1.15)
(70.54) PLSA 80.81(1.15) 82.27(1.17) 82.76(1.17)
lations, subj-verb, showed an exceptional case,
which seemed to be caused by the small size of
examples, 188.
Selection errors taking place in LSA and
PLSA models were caused mainly by the fol-
lowing reasons. First of all, the size of vocab-
ulary should be limited by computation com-
plexity. In this experiment, we acquired below
20,000 words for the vocabulary, which could
not cover a section of corpus data. Second, the
stemming algorithm was not robust for an in-
dexing. For example, ‘house’ and ‘housing’ are
regarded as a same word as ‘hous’. This fact
broughtabouthardnessinre ectingtheseman-
tic structure more precisely. And flnally, the
meaning of similar word is somewhat varied in
the machine translation fleld and the informa-
tion retrieval fleld. The selectional restriction
tends to depend a little more upon semantic
type like human-being, place and etc., than on
the context in a document.
5 Conclusion
This paper describes a comparative evaluation
of the accuracy performance in translation se-
lection based on data-driven models. LSA and
PLSA were utilized for implementation of the
models, which are mainly used in estimating
similarity between words. And a manually-
built grammatical relation dictionary was used
for the purpose of appropriate translation se-
lection of a word. To break down the data
sparseness problem occurring when the dictio-
nary is used, we utilized similarity measure-
ments schemed out from the models. When an
argumentwordisnotincludedinthedictionary,
themost k similarwordstothewordarediscov-
ered in the dictionary, and then the meaning of
thegrammatically-relatedclassforthemajority
of the k words is selected as the translation of
an input word.
We evaluated the accuracy ratio of LSA and
PLSA comparatively and classifled the exper-
iments with criteria of the values of k and
the grammatical relations. We acquired up to
20% accuracy improvement, compared to direct
matching to a collocation dictionary. PLSA
showed the ability to select translation better
than LSA, up to 3%. The value of k is strongly
related with PLSA in translation accuracy, not
too with LSA. That means the latent semantic
space of PLSA has more sound distribution of
latentsemanticsthanthatofLSA.Eventhough
longer learning time than LSA, PLSA is benefl-
cial in translation accuracy and distributional
soundness. A distributional soundness is ex-
pected to have better performance as the size
of examples is growing.
However, we should resolve several problems
raised during the experiment. First, a robust
stemming tool should be exploited for more ac-
curate morphology analysis. Second, the opti-
mal value of k should be obtained, according to
thesizeofexamples. Finally,weshoulddiscover
more speciflc contextual information suited to
this type of problem. While simple text could
be used properly in IR, MT should require an-
other type of information.
The data-driven models could be applied to
other sub-flelds related with semantics in ma-
chine translation. For example, to-inflnitive
phrase and preposition phrase attachment dis-
ambiguationproblemcanalsoapplythesemod-
els. And syntactic parser could apply the mod-
els for improvement of accurate analysis by us-
ingsemanticinformationgeneratedbythemod-
els.

References

D. Aha, D. Kibler, and M. Albert. 1991.
Instance-based learning algorithms. Machine
Learning, 6:37{66.

M. Berry, T. Do, G. O’Brien, V. Krishna, and
S. Varadhan. 1993. Svdpackc: Version 1.0
user’s guide. Technical Report CS{93{194,
University of Tennessee, Knoxville, TN.

T. Cover and P. Hart. 1967. Nearest neighbor
pattern classiflcation. IEEE Transactions on
Information Theory, 13:21{27.

I. Dagan, L. Lee, and F. Fereira. 1999.
Similarity-basedmodelsofwordcooccurrence
probabilities. Machine Learning, 34:43{69.

S. Deerwester, S. Dumais, G. Furnas, T. Lan-
dauer, and R. Harshman. 1990. Indexing
by latent semantic analysis. Journal of the
American Society for Information Science,
41:391{407.

P. Foltz, W. Kintsch, and T. Landauer. 1998.
Themesurementoftextualcoherencewithla-
tent semantic analysis. Discourse Processes,
25:285{307.

D. Gildea and T. Hofmann. 1999. Topic based
language models using em. In Proceedings of
the 6th European Conference on Speech Com-
munication and Technology (Eurospeech99).

D. Gotoh and S. Renals. 1997. Document
space models using latent semantic analysis.
In Proceedings of Eurospeech-97, pages 1443{
1446.

T. Hofmann. 1999a. Probabilistic latent se-
mantic analysis. In Proceedings of the Fif-
teenth Conference on Uncertainty in Artifl-
cial Intelligence (UAI’99).

T. Hofmann. 1999b. Probabilistic latent se-
mantic indexing. In Proceedings of the 22th
Annual International ACM SIGIR conference
on Research and Developement in Informa-
tion Retrieval (SIGIR99), pages 50{57.

T. Hofmann. 2001. Unsupervised learning by
probabilistic latent semantic analysis. Ma-
chine Learning Journal, 42(1):177{196.

Y.KimandY.Kim. 1998. Semanticimplemen-
tation based on extended idiom for english to
koreanmachinetranslation. The Asia-Paciflc
Association for Machine Translation Journal,
21:23{39.

T. K. Landauer, P. W. Foltz, and D. Laham.
1998. An introduction to latent semantic
analysis. Discourse Processes, 25:259{284.

E.VoorheesandD.Harman. 1998. Overviewof
the seventh text retrieval conference (trec-7).
In Proceedings of the Seventh Text REtrieval
Conference (TREC-7), pages 1{24.
