Structural Feature Selection For English-Korean Statistical 
Machine Translation 
Seonho Kim, Juntae Yoon, Mansuk Song 
{ pobi, j tyoon, mssong} @decemb er.yonsei, ac.kr 
\])ept. of Computer Science, 
Yonsci University, Seoul, Korea 
Abstract 
When aligning texts in very different s such 
as Korean and English, structural features beyond 
word or phrase give useful intbrmation. In this pa- 
per, we present a method for selecting struetm'al 
features of two s, from which we construct 
a model that assigns the conditional probabilities 
to corresponding tag sequences in bilingual English- 
Korean corpora. For tag sequence mapl)ing 1)etween 
two langauges, we first, define a structural feature 
fllnction which represents statistical prol)erties of 
elnpirical distribution of a set of training samples. 
The system, based on maximmn entrol)y coneet)t, se- 
le(:ts only ti;atures that pro(luee high increases in log- 
likelihood of training salnl)les. These structurally 
mat)ped features are more informative knowledge for 
statistical machine translation t)etween English and 
Korean. Also, the inforum.tion can help to reduce the 
1)arameter sl)ace of statisti('al alignment 1)y eliminat- 
ing synta(:tically uiflikely alignmenls. 
1 Introduction 
Aligned texts have been used for derivation of 1)ilin- 
gual dictioimries and terminoh)gy databases which 
are useflfl for nlachine translation and cross lan- 
guages infornmtion retriewfl. Thus, a lot of align- 
ment techniques have been suggested at; the sen- 
tence (Gale et al., 1993), phrase (Shin et al., 1996), 
nomt t)hrase (Kupiec, 1993), word (Brown et al., 
1993; Berger et al., 1996; Melamed, 1997), collo- 
cation (Smadja et al., 1996) and terminology level. 
Seine work has used lexical association measures 
for word alignments. However, the association mea- 
sures could be misled since a word in a source lan- 
guage frequently co-occurs with more titan one word 
in a target . In other work, iterative re- 
estimation techniques have beets emt)loyed. They 
were usually incorporated with the EM algorithm 
mid dynmnic progranmfing. In that case, the prob- 
al)ilities of aligmnents usually served as 1)arameters 
in a model of statistical machine translation. 
In statistical machine translation, IBM 1~5 mod- 
els (Brown et al., 1993) based on the source-chmmel 
model have been widely used and revised for many 
 donmins and applications. It has also 
shortconfing that it needs much iteration time for 
parameter estimation and high decoding complex- 
ity, however. 
Much work has been done to overcome the prob- 
lem. Wu (1996) adopted chammls that eliminate 
syntactically unlikely alignments and Wang et al. 
(1998) presented a model based on structures of two 
la.nguages. Tilhnann et al. (1997) suggested the 
dynanfie programming lmsed search to select the 
best alignment and preprocessed bilingual texts to 
remove word order differences. Sate et al. (1998) 
and Och et al. (1998) proposed a model for learn- 
ing translation rules with morphological information 
mid word category in order to improve statistical 
translation. 
Furthemlore, llla,lly researches assullle(t Olle-to- 
one correspondence due to the coml)lexity and com- 
Imtati(m time of statistical aliglunents. Although 
this assumption Ire'ned out t;o 1)e useful for align- 
ment of close lallguages such as English and French, 
it is not, applicabh~ to very different s, in 
particular, Korean and English where there is rarely 
(:lose corresl)ondence in order at the word level. For 
such s, even phrase level alignment, not to 
mei~tion word aligmnent, does not gives good trans- 
lation due to structural diflbrence. Itence, structural 
features beyond word or t)hrase should t)e consid- 
ered to get t)etter translation 1)etween English and 
Koreml. In addition, the construction of structural 
bilingual texts would be more informative for ex- 
tracting linguistic knowledge. 
In this paper, we suggest a method for structural 
mat)t)ing of bilingual  on the basis of the 
maximum entorl)y and feature induction fl'alnework. 
Our model based on POS tag sequence mapl)ing has 
two advantages: First;, it can reduce a lot of 1)armne- 
ters in statistical machilm translation by eliminating 
syntactically unlikely aligmnents. Second, it: can be 
used as a t)reprocessor for lexical alignments of bilin- 
gual corpora although it; (:an be also exl)loited 1)y it- 
self tbr alignment. In this case, it would serve as the 
first stet) of alignment for reducing the 1)arameter 
sI)ace. 
439 
2 Motivation 
In order to devise parameters for statistical model- 
ing of translation, we started our research from the 
IBM model which has bee:: widely used by :nany 
researches. The IBM model is represented with the 
formula shown in (1) 
l 17t 
v(f, al ) = II I-I t(fJl%)d(jlaj,m, l) 
i=1 j=l (1) 
Here, n is the fertility probability that an English 
word generates n h'end: words, t is tim aligmnent 
probability that the English word c generates the 
French word f, and d is the distortion probability 
that an English word in a certain t)osition will gener- 
ate a lh'ench word in a certain 1)osition. This formula 
is Olm of many ways in which p(f, ale ) can tie writtm. 
as the product of a series of conditional prot)at)ilities. 
In above model, the distortion probability is re-- 
lated with positional preference(word order). Since 
Korean is a free order , the probability is 
not t~asible in English-Korean translation. 
Furthermore, the difference between two lan- 
guages leads to the discordance between words that 
the one-to-one correst)ondence between words gen- 
erally does not keel). The n:odel (1), however, as-- 
sumed that an English word cat: be connected with 
multiple French words, but that each French word 
is connected to exactly one English word inch:ding 
the empty word. hl conclusion, many-to-:nany :nap-- 
pings are not allowed in this model. 
According to our ext)eri:nent, inany-to-nmny 
mappings exceed 40% in English and Korean lexical 
aligninents. Only 25.1% of then: can be explained 
by word for word correspondences. It means that we 
need a statistical model which can lmndle phrasal 
mat) pings. 
In the case of the phrasal mappings, a lot of pa- 
rameters should be searched eve:: if we restrict the 
length of word strings. Moreover, in order to prop-- 
erly estimate t)arameters we need much larger voI-- 
ume of bilingual aligned text than it in word-for- 
word modeling. Even though such a large corpora 
exist sometimes, they do not come up with the lex-- 
ical alignments. 
For this problem, we here consider syntactic fea- 
tures which are importmlt in determining structures. 
A structural feature means here a mapt)ing between 
tag sequences in bilingual parallel sentences. 
If we are concerned with tag sequence alignments, 
it is possible to estimate statistical t)armneters in 
a relatively small size of corpora. As a result, we 
can remarkably reduce the problem space for possi- 
ble lexical alignments, a sort of t probability in (1), 
which improve the complexity of a statistical ma- 
chine translation model. 
If there are similarities between corresponding tag 
sequences in two , tile structural features 
would be easily computed or recognized. However, 
a tag sequence in English can be often translated 
into a completely different tag sequence in Korean 
as follows. 
can/MD -+ ~-, cul/ENTR1 su/NNDE1 'iss/ AJMA 
da/ENTE 
It nmans that similarities of tag features between two 
s are not; kept all the time and it is neces- 
saw to get the most likely tag sequence mappings 
that reflect structural correspondences between two 
s. 
In this paper, the tag sequence mappings are ob- 
taind by automatic feature selection based on the 
maximum entropy model. 
3 Problem Setting 
In tiffs ctlat)ter, we describe how the features are 
related to the training data. Let tc be an English 
tag sequence and tk be a Korean tag sequence. Let 
Ts be the set of all possible tag sequence niapI)ings in 
a aligned sentence, S. We define a feature function 
(or a feature) as follows: 
1 pair(t~,tk) C "\]-s f(t~,tk) = 0 othcrwi.s'c 
It indicates co-occurrence information l)etween 
tags appeared in Ts. f(t¢,tk) expresses the infor- 
mation for predicting that te maps into ta.. A fea- 
ture means a sort of inforination for predicting some- 
thing. In our model, co-occurrence information on 
the same aligned sentence is used for a feature, while 
context is used as a feature in Inost of systems using 
maximum entropy. It can be less informative than 
context. Hence, we considered an initial supervision 
and feature selection. 
Our model starts with initial seed(active) features 
for mapI)ing extracted by SUl)ervision. In the next 
step, thature pool is constructed from training sam- 
ples fro:n filtering and oifly features with a large gain 
to the model are added into active feature set. The 
final outputs of our model are the set of active t'ea- 
tures, their gain values, and conditional probabilities 
of features which maximize the model. Tim results 
can be embedded in parameters of statistical ma- 
chine translation and hell) to construct structural 
bilingual text. 
Most alignment algorithm consists of two steps: 
(1) estimate translation probabilities. 
(2) use these probabilities to search for most t)roba- 
ble alignment path. 
Our study is focused on (1), especially the part of 
tag string alignments. 
Next, we will explain the concept of the model. 
We are concerned with an ot)timal statistical inodel 
which can generate the traiifing samples. Nmnely, 
our task is to construct a stochastic model that pro- 
440 
(1) duces outl)ut tag sequenc0, "~k, given a tag sequence ~+~,-~. 
To The l)roblem of interest is to use Salnt/les of • --J~\,What .... 
tagged sentences to observe the/)charier of the ran- u~,,~ 
(loin t)roeess. 'rile model p estinmtes tile conditional tt'2,Y 
probability that tile process will outlmt t,~, given t~.. ~,~/~ ,o~!! 
It is chosen out of a set of all allowed probability o~,~ ~ 
e}~..¢0,, me (tistributions .... 
The fbllowing steps are emt)loyed for ()tit" model, v~ / 
Input: a set L of POS-labeled bilingual aligned 
sentences. 
I. Make a set ~: of corresl)ondence pairs of tag 
sequences, (t~, tk) from a small portion of L by 
supervision. 
2. Set 2F into a set of active features, A. 
3. Maximization of 1)arameters, A of at:tire fea- 
tures 1)y IIS(hnproved Iterative Sealing) algo- 
rithm. 
4. Create a feature pool set ?9 of all possible align- 
nmnts a(t(,, tk) from tag seqllellces of samples. 
5. Filter 7 ) using frequency and sintilarity with M. 
6. Coml)ute the atit)roximate gains of tkmtm:es in "p. 
7. Select new features(A/') with a large gain vahle, 
and add A. 
Outt)ut: p(tklt~,)whcrc(t(,, t~.) C M and their Ai. 
We I)egan with training samples comi)osed of 
English-Korean aligned sentence t)airs, (e,k). Since 
they included long sentences, w(', 1)roke them into 
shorter ones. The length of training senl;en(:es was 
limited to ml(h',r 14 on the basis of English. It is 
reasona,bh; \])(',(:&llSe we are interested in not lexical 
alignments lint tag sequence aliglmients. The sam- 
ples were tagged using brill's tagger and qVIorany' 
that we iml)lenmnted as a Korean tagger. Figure \] 
shows the POS tags we considered. For simplicity, 
we adjusted some part of Brill's tag set. 
In the, sut)ervision step, 700 aligned sentences were 
used to construct the tag sequences mal)I)ings wlfich 
are referred to as an active feature set A. As Fig- 
ure 2 shows, there are several ways in constructing 
the corresl)ondem;es. We chose the third mapping 
although (1) can be more useflll to explain Korean 
with I)redieate-argunmnt structure. Since a subject 
of a English sentence is always used for a subject 
tbrln in Korean, we exlcuded a subject case fi'onl ar- 
gulnents of a l/redicate. For examl)le, 'they' is only 
used for a subject form, whereas 'me' is used for a 
object form and a dative form. 
II1 tile next step, training events, (t,:, It.) are con- 
structed to make a feature 1)eel froln training sam- 
pies. The event consists of a tag string t,, of a English 
(2) (31 
• ~lm- ~" 
-- + ~'~Whatever 
Figure 2: Tag sequence corresl)ondences at the 
phrase level 
1)OS-tagged sentence and a tag string tL~ of the cor- 
responding Korean POS-tagged sentence and it Call 
be represented with indicator functions fi(t~, tk). 
For a given sequence, the features were drawn 
fl'om all adjacent i)ossible I)airs and sonic interrupted 
pairs. Only features (tci, tfii ) out of the feature pool 
that meet the following conditions are extracted. 
• #(l, ei,t~:i) _> 3, # is count 
• there exist H:.~,, where (t(,i,tt.~.) in A and the 
similarity(sanle tag; colin|;) of lki an(1 tkx _> 0.6 
Table \] shows possible tL'atures, for a given aligned 
sentence , 'take her out - g'mdCOrcul baggcuro 
dcrfleoflara'. 
Since the set of the structural ti;atm'es for align- 
ment modeling is vast, we constructed a maximum 
entrol)y model for p(tkltc) by the iterative model 
growing method. 
4 Maximum Entropy 
To explain our method, we l)riefly des(:ribe the con- 
(:ept of maximum entrol)y. Recently, many al)- 
lnoaches l)ased on the maximum entroi)y lnodel have 
t)een applied to natural  processing (Berger 
eL al., \]994; Berger et al., 1996; Pietra et al., 1997). 
Suppose a model p which assigns a probability to 
a random variable. If we don't have rely knowledge, 
a reasonal)le solution for p is the most unifbrnl dis- 
tribution. As some knowledge to estilnate the model 
p are added, tile solution st)ace of p are more con- 
strained and the model would lie close to the ol\]timal 
probability model. 
For the t)url/ose of getting tile optimal 1)robability 
model, we need to maxi\]nize the unifl)rnlity under 
some constraints we have. ltere, the constraints are 
related with features. A feature, fi is usually rel/re - 
sented with a binary indicator funct, ion. The inlpor- 
tahoe of a feature, fi can be identified by requiring 
that the model accords with it. 
As a (:onstraint, the expected vahle of fi with re- 
spect to tile model P(fi) is supposed to be the same 
as tile exl)ected value of fi with respect to empiri(:al 
distril)ution in training saml)les, P(fi). 
441 
TAG 
cb 
DT 
PW 
JJ 
JJS 
MD 
NNP 
PDT 
PRP 
RB 
RBS 
SYM 
UH VBD 
VBN 
WP$ 
NOT 
BED 
BEG HVD 
DOD 
DESCRIPTION 
comma 
conjunction,coordinating 
determiner foreign word 
adjective, ordinal 
adjective, superlative 
modal auxiliary 
noun, proper, singular 
pre-determiner 
pronoun, personal 
adverb 
adverb, superlative 
symbol 
interjection verb, past tense 
verb, past participle 
WH-pronoun, possessive 
not be verb, past tense 
be verb, present participle have verb, past participle 
do verb, past tense 
TAG 
CD 
EX 
IN 
J JR 
LS 
NN 
NNPS 
POS 
PRP$ 
RBR 
RP TO 
VBP VBG 
WDT 
WR8 
BEP 
BEN 
HVP 
DOP 
DON 
DESCRIPTION 
sentence terminator TAG 
numeral, cardinal existential there NNIN1 
preposition, subordinating NNIN2 
adjective, comparative NNDE1 NNDE2 
list item marker PN 
noun, common NU 
noun, proper, plural VBMA 
genitive marker AJMA pronoun, possessive CO 
adverb, comparative AX 
particle ADCO to or infinitive marker APSE 
verb, present tense CJ 
verb, present participle ANCO 
WH-determiner ANDE 
WH-adverb ANNU be verb. present tense EX 
be verb, past participle LQ 
have verb, present tense RQ 
do verb, present tense SY do verb, past participle 
POS 
proper noun 
common noun 
common-dependent noun unit-dependent noun 
pronoun 
number verb 
adjective 
copula 
auxiliary verb 
constituent adverb 
sentential adverb 
conjunctive adverb 
configurative adnominal 
demonstrative adnominal 
numeral adnominal 
exclamination left quotation mark 
right quotation mark 
symbols 
TAG 
PPCA1 
PPCA2 
PPCA3 
PPCA4 
PPAD 
PPCJ 
PPAU 
ENTE 
ENCO1 
ENCO2 
ENCO3 
ENTRt 
ENTR2 
ENTR3 
ENCM PE 
SF 
PF 
CM SO 
Figure 1: English Tags (left) and Korean Tags (right) 
POS 
nominative postposition 
accusative postposition 
possessive postposition 
vocative postposition 
adverbial postposition 
conjunctive postposition 
auxiliary postposition 
final ending coordinate ending 
subordinate ending 
auxiliary ending 
adnominal ending 
nominal ending 
adverbial ending 
ending+postposition 
pre-ending 
suffix prefix 
comma termination 
~P~II'~ llq llll ll~l'~l~t Ill i i LU i If(i| II IL'II|II~ ;'~ ;~ lil|'tlll I I Ill I LI~M 
\[VBI'+IN\] \[take+out\] \[1+3\] \[wp\] \[tak(q \[t\] 
\[VBP+PI~P\] \[take+her\] \[1+2\] \[W3P+PRP+IN\] \[take+her+out\] \[1+2+3\] 
\[PRP\] \[ho,-\] \[2\] \[IN\] \[out\] \[3\] 
\[I'PCA2+P1)AD-FVBMA\] \[rcul+euro+deryeoga\] \[2+4+5\] 
\[PN\] b .... :/~o\] ill \[PI)AI)+VBIvlA-FENTE\] \[reu/+cure-I- dcrycoga+ra\] \[4+5+6\] 
\[NNIN2\] \[bagg \] \[3\] \[NNIN2+PPAD\] \[bagg+euro\] \[3+41 
\[ENTE\] \[ra\] \[6\] \[P1)AD+VBMA\] \[curo+deryeoga\] \[4+5\] 
\[PPAD+VBMA+ENTE\] \[euro+deryeoga+ra\] \[4+5+6\] \[PPCA2+NNIN2+PPAD+VBMA\] \[reul+bagg+curo+ deryeoga \] \[2+3+4+5\] 
\[PPCA2+NNIN2+PPAI)+VBMA+ENTE\] \[reul+bagg+euro+dcryeoga+ra \] \[2+3+4+5+6\] \[P1)CA2+NNIN2+PPAD+VBMA\] \[renl+deryeoga \] \[2+3+4+5\] 
\[PPCA2+NNIN2+Pt)AD+VBMA+ENTE\] \[reul+deryeoga+ra\] \[2+3+4+5+6\] 
Table 1: possible tag sequences 
In sun1, the maxilnunl entropy fralnework finds 
the model which has highest entropy(most uniform)~ 
given constraints. It is related to the constrained 
optimization. To select a model from a constrained 
set, C of allowed l)rol)ability distributions, the model 
p, C C with maximum entropy H(p) is chosen. 
In general, for the constrained optimization prob- 
lem, Lagrange inultipliers of the number of features 
can be used. However, it was proved that the model 
with maximum entropy is equivalent to the model 
that maximizes the log likelihood of the training 
samples like (2) if we can assume it as an exponential 
model. 
hi (2), the left side is Lagrangian of the condi-. 
tional entropy and the right side is inaxilnlHn log-. 
likelihood. We use the right side equation of (2) to 
select I. for the best model p,. 
~,g,,~..~,(- ~.,,. ~(~)v(yl~)logv(vlx)+~,,(v(f,)-~(/,))) (2) 
:a,'9,,,ax~, ~.,,. ~(x,v)lo.~n,(ylx) 
Since t, cannot be tbund analytically, we use 
the tbllowing improved iterative scaling algorithm to 
colnpute I, of n active features in .4 in total sam-- 
ples. 
1. Start with li = 0 for all i 6 {1,2,...,n} 
2. Do for ca.oh i ~ {\],2,...,n} : 
(a) Let AAi be the solution to the log likeli- 
hood 
(b) Update the value of Ai into li + A,h, 
~. ..... ~(.,,v)A(:~,v) where AAi = log ~,:, ~i.~)v~(?11.~)/~(.~,v) 
px(yl:r) = ~-A--e(~ x'f'("':')) zx (:~.) ' , 
z (x) = E:, c(E, 
3. Stop if not all the Ai have converged, otherwise 
go to step 2 
Tile exponential model is represented as (3). Here, 
li is the weight of feature fi. In ore" model, since 
only one feature is applied to each pair of x and y, 
it can be represented as (4) and fi is the feature 
related with x and y. 
~(ylx) = ~i C'f'(x'Y) (3) 
cAifi(x,Y) 
= (4) 
442 
5 Feature selection 
Only a small subset of features will 1)e emph)yed in 
a model by sele(:ting useflfl feal;m'es from (;tie flmture 
1)ool 7 ). Let 1).,4 lie (;tie optimal mo(lel constrained 
by a set of active features M and A U J'i 1)e ,/lfi. Le(; 
PAf~ be the ot)timal model in the space of l)rol)abil- 
ity distribution C(Afi). The optimal model can be 
tel)resented as (5). Here, the optimal model means 
a maxilmnn entropy nlodd. 
1 v~ :, 
= z,,.(:,;) p'~ (:11,)::' ("'") 
zo,(:.) = ~ v.~(::l:,,)c"S'(*'"> (5) 
Y 
The imi)rovement of l;he model regarding the ad- 
dition of a single feature fi can be estiumted by mea- 
suring the difference of maximmn log-likelihood be- 
tween L(pAf~) and L(pA). We denote the gain of 
t~ature fitiy A(~lfi) an(l it can be r(!t/resented in 
(G). 
A(A.I'd - .,..,:,;,,cAI~(.) 
('A:,(,,,) = J~(>t:,)-- L(v,O 
= _ ~(:~)~,.~(:,/I,.): :'('''') 
x y 
÷'~P(.fi) IS) 
Note that a model PA has a, set of t)arameters A 
which means weights of teatures. The m(idel P.Afl 
contains the l)ara.lnetc.rs an(I the new \[)a.l'a, lllCi;('~r (11 
with l'eSl)ect t() the t'eal;ure fi. W'hen adding a new 
feature to A, the optimal values (if all parame(ers of 
probability (listril)u(,ion change. To make th(; (:om- 
i)utation of feature selection tractal)le, we al)l)roxi- 
mate that the addition of a feature fi affec(;s only 
the single 1)aranxeter a, as shown in (5). 
qShe following a.lgoritlnn is used for (;omputing the 
gain of the model with rest)ect to fi. We referred 
to the studies of (Berger et al., 1996; Pietra e.t al., 
1997). We skip tile detailed contents and 1)root~. 
1. Let 
1 if P(fi) <_ PA(J;) 
r = -1 oth, erwise 
2. Set a0 = 0 
3. Repeat the following until GAff(%,) has con- 
verged : 
Coilll)llte 0@1,+ i frOlll og n llSillg 
a log (1 ! -~6t:' ('"~) ~ ctn+l = (xn + 7" ,. (;~:~(~,,): 
Compute GaV~ (a,~+l) using 
GAA (a) = - Ea,/3(a,') log Z,,(:,:) + ctf)(fi) , 
c'A:, (,~) = ~(k) - Ex ~(~0M(:,.), 
G" :ct~ A:,, , = -- E.~ P(")V2i:, ((fi -- M(;,;))" la;) 
set description # of disjoint total 
features cvtults 
A active feat m'es 1483 4113 
P feature Calldidat, es 3172 63773 
N new f'eaLures 97 5503 
Table 2: Summery of Features Selected 
where (:~ ~ (l~n+ 1 
A f~ = A u f~, 
M(z) - p~f~ (fila-) , 
PP4S, (fil':) --- E. ~'~s, (:,¢l:r)k(:., ~J) 
d. Set ~ AL(Afi) <-- GAS,(ct.) 
This algorittun is iteratively comtmted using Net- 
Well'S method. \¥e cmt recognize the iml)ortance of a 
fl;ature with the gain value. As mentioned above, it 
means how much the feature accords with the model. 
We viewed the feature as tile information that Q. and 
t, occur together. 
6 Experimental results 
The total saml)les consists of 3,000 aligned Sellteiice 
pairs of English-Korean, which were extracted from 
news on the web site of 'Korea Times' and a. maga- 
zine fl)r English learning. 
In the initial step, we manually constucted (;tie 
correspondences of tag sequences with 700 POS- 
tagged sentence I)airs. hi the SUl)ervision step, 
we extracte(t l.,d83 correct tag sequence corresl)on- 
it(miles its shown in Table 2, and it work as active 
features. As a feature I)OOI, 3,172 (lisjoint %a(;ures 
of tag sequence ma.I)pings were retrieved. 1% is very 
important to make atomic thatures. 
We maxinfized A of active features with resl)ect 
to total smnples using improved the iterative scal- 
ing algoritlun. Figure 3 shows Ai of each feature 
.f(Q31,:P+.m,ttO C A. There a.re nlany corresl)on- 
dence 1)atterns with resl)ect to the Englsh tag string, 
'BEP+JJ'. 
Note that p(tt~lQ) is comtmted by the exponential 
model of (4) mid the conditional probability is the 
saine with empirical probal)ility in (7). Since the 
wflue of p(ylx) shows the maxinmm likelihood, it is 
proved that each A was converged correctly. 
# of (.% y) occurs in sam, pie 
P(ylx) - n, um, ber of times of a: (7) 
hi feature selection step, we chose useflll fea- 
tures with the gain threshold of 0.008. Figure 
4 shows some feaures with a large gain. Anion\ 
then1, tag sequences mapping including 'RB' are er- 
roneous. It means that position of adverb in Ko- 
rean is very complicated to handle. Also, proper 
noun in English aligned coInmon nouns in Korean 
443 
English 
BEP+JJ 
BEP+JJ 
BEP+JJ 
BEP+JJ 
BEP+JJ 
BEP+JJ 
BEP+JJ 
BEP+JJ 
BEP+JJ 
BEP+JJ 
BEP+JJ 
Feature(x,y) 
Korean 
VBMA+ENCO3+AX+ENTE 
VBMA 
AJ MA 
AJMA+ENTE 
VBMA+ENTE 
NNIN2+CO 
NNIN2+CO+VBMA 
NNIN2+PPCA1 +VBMA+ENTE 
NNIN2+CO+ENTE 
NNIN2+PPCA2+AX+ENTE 
NNIN2+PPCA1 +VBMA 
1 
10.1369 
8.8520 8.6787 
8.2628 
7.2379 
7.1372 
6.9909 
8.8402 
6.8308 
6.4256 
6.4250 
Figure 3: A 
\[PRP\] you ~/'\[PN\] cJ~ (dangsin) 
"-\[PPAU\] ~-(eun) 
\[RB\] usually \[ADCO\] EttX41_~(dachero) 
HVP\] have, /\[NNIN2\] ~uf~(ilbanseok) 
TO\] to /~ \[PPAD\] 0tl(e) 
\[VBP\] take ;/\[VBMA\] N(anj) 
,\[ENCO3\] O~OIE~(ayaman) 
\[J J\] regular ...... ':-:- \[AX\] "$F(ha) 
\[NN\] seating/' \[ENTE\] L E}(nda) 
Figure 5: Best Lexical alignment 
because of tagging errors. Note that in the case of 
'PN+PPCA2+PPAD+VBMA', it is not an adjacent 
string but an interrupted string. It means ttlat a 
verb in English generally map to a verb taking as 
argument the accusative and adverbial postposition 
in Korean. 
One way of testing usefulness of our method is 
to construct structured aligned bilingual sentences. 
Table 3 shows lexical alignments using tag sequence 
alignments drawn from our algorithm for a given 
sentence, 'you usually have to take regular seating 
- dangsineun dachcw ilbanscokc anjayaman handa' 
and Figure 5 shows the best lexical alignment of the 
sentence. 
We conducted the exi)eriment on 100 sentences 
composed of words in length 14 or less and siln- 
lilY chose the most likely paths. As tim result, the 
accuray was about 71.1~. It shows that we can 
partly use the tag sequence alignments for lexical 
alignments. We will extend tlle structural mapping 
model with consideration to the lexical information. 
The parmneters, the conditional probabilities about 
stuctural mappings will be embedded in a statisti- 
cal model. Table 4 shows conditional probabilities 
of seine features according to 'DT+NN'. In general, 
determiner is translated into NULL or adnominal 
word in Korean. 
7 Conclusion 
When aligning Englist>Korean sentences, the differ- 
ences of word order and word unit require structural 
information. For tiffs reason, we tried structural tag 
c(x,y) 
162.0C 
45.00 
39.00 
25.00 
9.00 
8.00 
7.00 
6.00 6.00 
4.00 
4.00 
P(Ylx) 
0.4247 
0.1180 0.0996 
0.0655 
0.0236 
0.0210 
0.0183 
0.0157 0.0157 
0.0105 
0.0105 
Example 
English 
are+prepared 
are+careful 
am+healthy 
is+new 
am+sure 
am+rich 
is+selfish 
is+patriotic 
is+reasonable 
is+reprehensible 
is+helpful 
Korean 
~kllKl+0~+°~+~ LI El - 0-94 8i 
~J 8t+ ~ LI C\[ 
~X\[+01 0191~+01+~LIEt 
011~;Xt+9\[+~+EF 'NPJ ~ +01 +g~ 
.~N+01 +.El 
of active features in A 
t~ 
I)T-t-NN 
I)Tq-NN 
DT-FNN 
DT+NN 
DT-t-NN 
DT-FNN 
I)T-FNN 
etc 
t~ 
NNIN2 
ANI)E+NNIN2 
ANNUWNNDE2 
NNIN2+PPCA1 
NNIN2+NNIN2 
NNIN2-FPPAU 
ADCO 
eI;c 
p(t~l*~) 0.524131 
0.15161 
0.091036 
0.063515 
0.058322 
0.05768 
0.049622 
Table 4: Conditional Probability 
string mapping using maximum entropy modeling 
and feature selection concept. We devised a nlodel 
that generates a English tag string given a Korean 
tag string. From initial active structural features, 
useful features are extended by feature selection. 
Tile retrieved features and parameters can be em- 
bedded in statistical maclfine translation and reduce 
the complexity of searching. We showed that they 
can helpful to construct structured aligned bilingual 
sentences. 
444 
Feature(x,y) 
X 
VBP+PRP+TO 
BEP+RBR+IN 
DT4-CD 
JJ+IN 
VBG+TO 
BEP 
BEP 
NNP 
NNP 
TO+PRP 
TO+PRP 
MD+FIB 
MD+RB 
MD+BEP 
NNP+NNS 
VBP+TO+VBP 
BED+VSN+IN 
BED+VSN+IN 
Y 
PNYPPCA2+PPAD+VBMA 
PPAD+AJMA 
NU+NNDE2 
PPAD+AJMA+ENTR1 
PPAD+PPCA2+VBMA 
PPCA1 +AJMA 
PPCA1 +AJMA+ENTE 
NNIN2+PPAU 
NNIN2+NNIN2 
PN+PPAD 
PN+PPCA1 
ENTRI+NNDEI+CO 
ENTRI+NNDEI+CO+ENTE 
CO+ENCO2+VBMA 
NNIN2+SF 
PPCA2+VBMA+ENCO2+VBM~ 
PPAD+VBMA+PE+ENTE 
PPAD+VBMA+PE 
r,~; P(ylx) /~ L(Afi) 
9.8687 0.1722 0.0194 
9.6780 0.3265 0.0192 
9.2799 0.2449 0.0190 
9.6343 0.2450 0.0190 
9.9542 0,3269 0.0189 
9.6720 0.2941 0.0188 
9.2724 0.1961 0.0188 
8.7481 0.1225 0.0182 
9.1397 0.1337 0.0180 
9.5634 0.2307 0.0180 
9.2604 0.1730 0.0180 
9.2445 0.1548 0.0177 
9.2564 0.1548 0.0177 
8.5435 0.0934 0.0177 
9.1597 0.1470 0,0176 
8.9928 0.1278 0.0174 
9.1511 0.1704 0.0173 
9.1636 0.1706 0,0173 
English 
send+Nm+to - - 
is+more+than 
the+two 
smarter+than 
serving+to 
is 
are 
IBM 
Harvard 
to+him 
to+her ? 
? 
should+be 
English+books 
request+to+send 
was+thrown+to 
were+sent+to 
Example 
Korean 
21+ N+-0tF+-~ L\]I 
-h20+~ ¥+N 
~H+~PA+N+L ~OII3tI+-~+U~ 
-0I+OA 
-01+~+H 
IBM+~- 
3+OIl)II ~LI +)F 
? 
? 
Ol+OiOI+N 
-N+~Lll+Kf~+5/N bF 
~011?t1+c3 ~1XI+~+FA 
Figure d: Some fbatm'es with a large gain 
Tag alignment (km(litiolml Lexical aliglulw, nt 
l)l{P : PN+I)PAU 0.150109 you : dangsin+cun 
lt\]3 : AI)CO 0.142193 usmtlly : dachero 
II, B : NNIN2+PI'AI) 0.038105 usually : ilbanseol¢-l-e 
IIVP+TO : I~N(JO3-bAX+I"NTE 0,982839 have+to : ayaman+handa 
VBP : P1)AI)+VBMA 0.05022,l take : e+anj 
VBP : VBMA+F, NCO3+AX+I,;NTE 0.011110 take : anjay+aman-Fha+nda 
VI3P : P1)AI)+VI~MA+ENCO3-}-AX-I-ENT\],~ 0,001851 take : e-Fanjayaimm+handa 
VI3P-FJJ : NNIN2+PI)AI)-\]-VBMA 0.057657 take-t-regular : ilbansenk+e+,allj 
.I.J+NN : NNIN2 0.581791 regular+seating : ill)anseok 

References 

Adam L. Berger, Peter F. Brown, Stephen A. 
Della Pietra, Vincent J. Della Pietra, John R. 
Gillett, John D. Lafferty, Robert L. Mercer, Harry 
Printz, and Lubos Ures. 1994. The Calldie sys- 
tem for machine translation, hi Proceedings of the 
ARPA Conference on Human Language Technol- 
ogy, Plainsborough, New Jersey. 

Adam L. Berger, Stephen A. Della Pietra, and Vin- 
cent J. Della Pietra. 1996. A maximum entropy 
approacll to natural  processing. Compu- 
tational Linguistics, 22(1):39-73. 

Peter F. Brown, John Cocke, Stephen A. Della 
Pietra, Vincent J. Della Pietra, Fredrick Jelinek, 
John D. Lafferty, Robert L. Mercer, and Paul S. 
Roossin. 1990. A statistical approach to nlachine 
translation. Computational Linguistics, 16(2):79- 
85 

Peter F. Brown, Stephen A. Della Pietra, Vincent J. 
Della Pietra, Robert L. Mercer. 1993. The math- 
enlatics of statistical machine translation: pa- 
an(, 3: l,exi(:al aligmnents using tag alignments 
rameter estimation. Computational Linguistics, 
19(2):263-311. 

Stanley F. Chert. 1993. Aligning sentences in bilin- 
gual corpora using lexical information. In l'rocccd- 
ings of ACL ,71, 9-16. 

A. P. Dempster, N. M. Laird and 1). 13. l{ubin. 
1976. Maximum likelihood fi'om incomplet,e data 
via the EM algorithm. The Royal ,S'tatistics Soci- 
ety, 39(B) 205-237. 

Williain A. Gale, Kenneth W. Church. 1993. A pro- 
gram fbr aligning senten(:es in bilingual (-orl)ora. 
Coml)utational Linguistics, \]9:75-102. 

Frederick Jelinek. \]997. Statistical Methods for 
Speech Recognition MIT Press. 

Marin Kay, Martin Roscheisen. 1993. Text- 
translation alignment. Computational Linguis- 
tics, 19:121-142. 

Julian Kupiec. 1993. An algorithm tbr finding noun 
phrase corresl)ondenccs in bilingual corl)ola. In 
Proceedings of ACL 31, 17-22. 

Yuji Matsmno~o, Hiroyuki Ishimoto, Takehito Ut- 
sure. 1993. Structural inatching of para.llel texts. 
In Proceedings of ACL 3I, 23-30. 

I. Dan Melame(l. 1997. A word-to-word model of 
translation equivalence. In PTvcccdings of ACL 
35/EACL 8, 1.6-23. 

Frmlz Josef Och mid Ilans Wel)cr. 1.998. hnt)rov- 
ing Statistical Natural Language Translation with 
Categories and Rules. In Procccdings of ACL 
36/COLING, 985-989. 

Stephen A. Della Pietra, Vincent J. Della Pietra, 
John D. La.tl'erty. 1997. llnducing features of ran- 
dora fields. IEEE ~IYansactions on Pattern Anal- 
ysis and Machine Intelligence, 19(4):380-393. 

Frank Smadja, Kathleen R. McKeown, and Vasileios 
Hatziw~ssiloglou. 1996. Translating collocations 
fi)r bilingual lexicons: A statistical approa(:h. 
Computational Linguistics, 22 (1) :1-38. 

Kengo Sate 1998. Maximum Entrol)y Model Learn- 
ing of the Translation Rules. In Procccdinfls of 
ACL 35/COLING, 1171-1175. 

Jung H. Shin, Y(mng S. Han, and Key-Sun 
Choi. 1996. Bilingual knowledge acquisition from 
Korean-English paralM cort)us using aligmnent 
method. In Proceedings of COLING 96. 

C. Tilhnann, S. Vogel, H. Ney, and A. Zubiaga. 
1997. A I)P t)ased sea.rch using monotone a.lign- 
ments in statistical translation. In Procccdings of 
ACL 35/EACL 8, 289-296. 

Ye-Yi Wa.ng and Alex Waibel. 1997. Decoding algo- 
rithm in statistical machine translation. In Pro- 
cccdinfls of ACL 35/EACL 8, 366-372. 

Ye-Yi Wang and Alex Waibel. 1998. Modeling with 
structures in machine translation. In Procccdings 
of ACL 36/COLING 

Dekai Wu 1996. A t)olynonlial-time algorithm for 
statistical machine translation. In Proceeding of 
A CL 34. 
