<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1040"> <Title>Bilingual Knowledge Acquisition from Korean-English Parallel Corpus Using Alignment Method (Korean-English Alignment at Word and Phrase Level)</Title> <Section position="3" start_page="230" end_page="233" type="metho"> <SectionTitle> 2 Korean/English Alignment Model </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="230" end_page="230" type="sub_section"> <SectionTitle> 2.1 English/French alignment model </SectionTitle> <Paragraph position="0"> To define p(f|e), the probability of the French sentence f given the English sentence e, Brown et al. (1991) adopted a translation model in which each word in e acts independently to produce the words in f. When a typical alignment is denoted by a, the probability of f given e can be written as the sum over all possible alignments (Brown et al. 1991): p(f|e) = \sum_{a} p(f, a|e) (1). Given an alignment a between e and f, Brown et al. (1991) have shown that one can estimate p(f, a|e) as the product of the following three terms (Berger et al. 1995).</Paragraph> <Paragraph position="2"> In the above equation, p(n|e) denotes the probability that the English word e generates n French words, and p(f|e) denotes the probability that the English word e generates the French word f. d(f, a|e) represents the distortion probability, which describes how the words are reordered in the French output.</Paragraph> <Paragraph position="3"> In the above methods, only one English word is related to one or n French words. The distortion probabilities are defined on positional relations such as the absolute or relative positions of matching words.</Paragraph> </Section> <Section position="2" start_page="230" end_page="231" type="sub_section"> <SectionTitle> 2.2 Characteristics of Korean/English alignment </SectionTitle> <Paragraph position="0"> Unlike the
case of English-French alignment, Korean and English have different word units to be aligned, for an English sentence consists of words whereas a Korean sentence consists of word-phrases (compound words). Typically a word-phrase is composed of one or more content words and postpositional function words.</Paragraph> <Paragraph position="1"> A Korean word is usually a smaller unit than an English word, and a word-phrase is larger than an English word. For this reason an exact match, as in the English-French pair, is hard to establish for the Korean-English case (Shin et al. 1995). Consequently, word-to-word or word-to-word-phrase alignment between Korean and English will suffer from unit mismatch and low accuracy. The complication of unit mismatch often implies the need for non-functional alignment such as many-to-many mapping. Non-functional mapping may also occur in the English-French case, but with much less frequency.</Paragraph> <Paragraph position="2"> Table 1 shows the degree of mismatch between English words and Korean words as analyzed by our automatic POS tagger and morphological analyzer. When we checked 200 randomly selected sentence pairs by hand, only aa.s% of all pairs had one-to-one correspondences between English words and Korean words.</Paragraph> </Section> <Section position="3" start_page="231" end_page="232" type="sub_section"> <SectionTitle> 2.3 Korean to English Alignment </SectionTitle> <Paragraph position="0"> In this section, we propose a Korean-to-English alignment method that aligns at both the word and phrase levels at the same time. First, we introduce the method for word-to-word alignment, and then extend it to include phrase-to-phrase alignment.</Paragraph> <Paragraph position="1"> By definition, a phrase in this paper refers to a linguistic unit of more general structure than is usually understood by terms such as noun phrase and adverb phrase.
A phrase is any arbitrary sequence of adjacent words in a sentence.</Paragraph> <Paragraph position="2"> Word-to-word correspondences. In the development of our method, we follow the basic idea of statistical translation proposed by Brown et al. (1993). To every pair of sentences e and k, we assign a value p(e|k), the probability that a translator will produce e as the translation of k, where e is a sequence of English words and k is a sequence of Korean words.</Paragraph> <Paragraph position="4"> In equation 3, n and m are the numbers of words in the English sentence e and its corresponding Korean sentence k, respectively, and e_j and k_i are the aligning units between the English sentence e and the Korean sentence k: e_j represents the j-th word in the English sentence and k_i represents the i-th word in the Korean sentence. For example, in Figure 1 the English word &quot;the&quot; is e_1 and the Korean word &quot;ku&quot; is k_1. The base method of word-level alignment is extended with phrase-level alignment, which overcomes the difference in matching units and provides more opportunity for extracting richer linguistic information such as a phrase-level bilingual dictionary. To cope with the data sparseness problem caused by considering all possible phrases, we represent phrases by the tag sequences of their component words.</Paragraph> <Paragraph position="5"> If an English sentence e and its Korean translation k are partitioned into sequences of phrases p_e and p_k, drawn from all possible sequences s(e, k), we can write p(e|k) as in equation 5, where p_e and p_k are phrase sequences and a(p_e, p_k) denotes all possible alignments between p_e and p_k.</Paragraph> <Paragraph position="7"> If we represent the phrase-to-phrase correspondences using the tag sequence of each phrase and the words composing it, equation 5 can be rewritten as equation 6, letting a phrase match be represented by the tag sequences of phrases as well as by words.
In equation 6, k_j^{p_k} is the j-th phrase of p_k, and t(k_j^{p_k}) denotes the tag sequence of the words composing phrase k_j^{p_k}. |p_e| is the number of phrases in a phrase sequence p_e.</Paragraph> <Paragraph position="9"> The likelihood of all alignable cases within a bilingual phrase is defined as in equation 7, where \[e~+l</Paragraph> <Paragraph position="11"> Figure 2 shows how the problem of word unit mismatch can be dealt with in phrase-level alignment. In the example, p_e = (The house) (is gradually disintegrating) (with age), and e_1^{p_e} = (The house), e_{1,1}^{p_e} = The, t(e_1^{p_e}) = (determiner noun), k_1^{p_k} = (ku cip-un), and k_{1,1}^{p_k} = ku, respectively.</Paragraph> </Section> <Section position="4" start_page="232" end_page="232" type="sub_section"> <SectionTitle> 2.4 Parameter reestimation </SectionTitle> <Paragraph position="0"> With the constraint that the sum over all alignments should be 1, the reestimation algorithm can be derived to give equation 8 for the word translation probability and equation 10 for the phrase correspondence probability. This process, when applied repeatedly, must give a locally optimal estimation of the parameters, following the principle of the EM algorithm (Brown et al. 1993; Dempster et al. 1977).</Paragraph> <Paragraph position="1"> p(e|k)_{&lt;condition&gt;} denotes the alignment candidates that satisfy &lt;condition&gt;. For calculating p(e|k), only a constant number t of alignment cases need to be considered in the proposed alignment algorithm, because most alignment candidates have such low probability that they may be ignored.</Paragraph> <Paragraph position="3"> Let us call the expected number of times that k matches with e in the corresponding sentences k and e the count of e given k.
Using the notation c(e|k), the reestimation formula of p(e|k) can be derived as equation 8 using the EM method.</Paragraph> <Paragraph position="4"> c(e|k; \mathbf{e}, \mathbf{k}) = \frac{p(\mathbf{e}|\mathbf{k})_{&lt;k matches e&gt;}}{p(\mathbf{e}|\mathbf{k})} (9) We denote by c(t_e|t_k) the expected number of times that a tag sequence of an English phrase corresponds to a tag sequence of a Korean phrase, as in equation 11. Then the reestimation formula of p(t_e|t_k) is given as in equation 10.</Paragraph> <Paragraph position="5"> p(t_e|t_k) = \frac{\text{expected number of } t_e \text{ given } t_k}{\text{total expected number of } t_e \text{ given } t_k}</Paragraph> <Paragraph position="7"> For the extended method of phrase alignment, the base model is an intermediate stage for the estimation of word-to-word probabilities. The phrase-to-phrase probabilities are reestimated upon the initial values of the word-to-word probabilities.</Paragraph> </Section> <Section position="5" start_page="232" end_page="233" type="sub_section"> <SectionTitle> 2.5 Alignment algorithm </SectionTitle> <Paragraph position="0"> The alignment process of generating Korean phrases and selecting their matching phrases in English can be formulated around the principle of dynamic programming. The probability value defined in equations 6 and 7 is used to compute the matching probability of p(k_{i,a}) and p(e_{j,b}).</Paragraph> <Paragraph position="1"> p(e_{j,b}) stands for the phrase composed of b words starting from the j-th word in a sentence. \phi_i is used to keep the selected phrase sequence up to the i-th word, and \delta_i denotes its score. N and M are the numbers of words in the Korean sentence and the English sentence, respectively. The constant value L is defined as the maximum number of words that a phrase may consist of.</Paragraph> <Paragraph position="2"> \delta_i = \max_{1 \le j \le N,\ 1 \le a \le L,\ 1 \le b \le L} [\delta_{i-a} + \log p(k_{i,a}, e_{j,b})], \quad \phi_i
= (j, a, b) Although the alignment algorithm described above, with a complexity of O(L^2 MN), is simple and efficient, it has a limitation caused by the assumption of dynamic programming. Dynamic programming in the context of alignment assumes that previous selections do not interfere with future decisions. The alignment decision, however, may depend on previous matches to the extent that the results from dynamic programming may not be sufficiently accurate. One popular solution is to maintain the top t best cases instead of just one, as follows, where max-t denotes the t-th max candidate.</Paragraph> <Paragraph position="4"> = \arg\max\text{-}t_{1 \le t' \le T,\ 1 \le j \le N,\ 1 \le a \le L,\ 1 \le b \le L} [\delta_{i-a}(t') + \log p(k_{i,a}, e_{j,b})] As a result, the running complexity of the proposed algorithm becomes O(T L^2 MN). Taking T and L as constants, the order of complexity becomes O(MN).</Paragraph> <Paragraph position="5"> As another way to relax the problem of decision dependency on previous matches, a preemptive scheme for finding the maximum matching of phrase k_{i,a} is adopted. In preemptive alignment, a previous selection can be rematched with a better selection found by a later decision.</Paragraph> <Paragraph position="6"> In the following algorithm, \mu(k_{i,a}, n) denotes the e_{j,b} that has the n-th highest matching value with Korean phrase k_{i,a} among all possible matching phrases, and \nu(k_{i,a}, n) carries the weight for that matching. \mu_{j,b} indicates the Korean phrase currently matched with e_{j,b}, and \nu_{j,b} denotes their matching weight.
A matching established in a previous stage can be changed when another matching with a higher matching weight is identified in this algorithm.</Paragraph> <Paragraph position="8"> Although the proposed algorithm cannot cover all possible alignment cases, it produces reasonably accurate alignment results efficiently, as is demonstrated in the following section.</Paragraph> </Section> <Section position="6" start_page="233" end_page="233" type="sub_section"> <SectionTitle> 2.6 Experiments </SectionTitle> <Paragraph position="0"> The total training corpus for our experiments consists of 254,100 English words and 178,300 Korean word-phrases. The content of the training corpus is summarized in Table 2.</Paragraph> <Paragraph position="1"> An HMM part-of-speech tagger is used to tag words before alignment. For Korean sentences, we use an accurate HMM designed by the authors that takes into account the fact that a Korean sentence is a sequence of word-phrases (Shin et al. 1995).</Paragraph> <Paragraph position="2"> The Penn Treebank POS tagset, which is composed of 48 tags, and a Korean tagset of 52 tags are used in the tagging. Errors generated by morphological analysis and tagging cause many of the alignment errors.</Paragraph> <Paragraph position="3"> To avoid the noise due to insufficient bilingual sentences, we adopted two significance filtering methods introduced by Wu and Xia (1994). First, only the Korean sentences consisting of words with more than 5 occurrences in the corpus are considered in the experiment.</Paragraph> <Paragraph position="4"> Second, we selected the English words that account for the top 0.80 of the translation probability density given a Korean word.</Paragraph> <Paragraph position="5"> When we selected 200 sentence pairs at random and manually checked the aligned results, we obtained 68.7% precision at the phrase level and 89.2% precision for the bilingual dictionary induced from the alignment.
Tables 3 and 4 illustrate the bilingual knowledge acquired from the aligned results. The information in Table 4 is the unique product of phrase-level alignment.</Paragraph> </Section> </Section> </Paper>