<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2175"> <Title>Bilingual Text Matching using Bilingual Dictionary and Statistics</Title> <Section position="3" start_page="1076" end_page="1076" type="metho"> <SectionTitle> 2 The Framework of Bilingual </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1076" end_page="1076" type="sub_section"> <SectionTitle> Text Matching </SectionTitle> <Paragraph position="0"> The overall framework of bilingual text matching is depicted in Fig. 1. Although our framework is implemented for Japanese and English, it is language independent.</Paragraph> <Paragraph position="1"> First, bilingual texts are aligned at sentence level using word correspondence information, which is available in bilingual dictionaries or estimated by statistical techniques. &quot;Statistical estimation&quot; at text level indicates that length-based statistical techniques are applied if necessary. (At present, they are not implemented.) &quot;Statistical estimation&quot; at sentence level indicates that word-to-word correspondences are estimated by statistical techniques. Then, each monolingual sentence is parsed into a disjunctive dependency structure and structurally matched using word correspondence information.</Paragraph> <Paragraph position="2"> In the course of structural matching, lexical and syntactic ambiguities of monolingual sentences are resolved. Finally, from the matching results, monolingual lexical knowledge and translation patterns are acquired.</Paragraph> <Paragraph position="3"> So far, we have implemented the following: sentence alignment based on word correspondence information, word correspondence estimation by co-occurrence-frequency-based methods in Gale and Church (1991) and Kay and Röscheisen (1993), structural matching of parallel sentences (Matsumoto et al., 1993), and case frame acquisition of Japanese verbs (Utsuro et al., 1993). 
In the remainder of this paper, we describe the specifications of sentence alignment and word correspondence estimation in sections 3 and 4, then report the results of small experiments and evaluate our framework in section 5.</Paragraph> </Section> </Section> <Section position="4" start_page="1076" end_page="1078" type="metho"> <SectionTitle> 3 Sentence Alignment </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1076" end_page="1076" type="sub_section"> <SectionTitle> 3.1 Bilingual Sentence Alignment Problem </SectionTitle> <Paragraph position="0"> In this section, we formally define the problem of bilingual sentence alignment. 1 Let S be a text of n sentences in one language and T be a text of m sentences in another language, and suppose that S and T are translations of each other: $S = s_1, s_2, \ldots, s_n$ and $T = t_1, t_2, \ldots, t_m$.</Paragraph> <Paragraph position="2"> Let p be a pair of minimal corresponding segments in texts S and T. Suppose that p consists of x sentences $s_{a-x+1}, \ldots, s_a$ in S and y sentences $t_{b-y+1}, \ldots, t_b$ in T and is denoted by $p = (a, x; b, y)$.</Paragraph> <Paragraph position="4"> Note that x and y could be 0. In this paper, we call the pair of minimal corresponding segments in bilingual texts a sentence bead. 2 Then, sentences in the bilingual texts S and T are aligned into a sequence P of sentence beads.</Paragraph> <Paragraph position="6"> We put some restrictions on the possibilities of sentence alignment. We assume that each sentence belongs to only one sentence bead and that order constraints must be preserved in sentence alignment. 
Supposing $p_i = (a_i, x_i; b_i, y_i)$, those constraints are expressed as $a_i = a_{i-1} + x_i$ and $b_i = b_{i-1} + y_i$.</Paragraph> <Paragraph position="8"> Suppose that a scoring function h can be defined for estimating the validity of each sentence bead $p_i$.</Paragraph> <Paragraph position="9"> Then, the bilingual sentence alignment problem can be defined as an optimization problem that finds a sequence P of sentence beads which optimizes the total score H of the sequence P:</Paragraph> <Paragraph position="11"/> </Section> <Section position="2" start_page="1076" end_page="1078" type="sub_section"> <SectionTitle> 3.2 Bilingual Sentence Alignment using Word Correspondence Information </SectionTitle> <Paragraph position="0"> In this section, we describe the specification of our sentence alignment method based on word correspondence information. 3 Footnote 1: In this paper, we do not describe the paragraph alignment process. For the moment, our paragraph alignment program is not reliable enough, and the results of sentence alignment are better without paragraph alignment than with it. Since the bilingual texts in our corpus are not very long, the computational cost of sentence alignment is not a serious problem even without paragraph alignment.</Paragraph> <Paragraph position="1"> in Murao (1991) was done under the supervision of Prof. M. Nagao and Prof. S. Sato (JAIST, East).</Paragraph> <Paragraph position="2"> Before aligning sentences in bilingual texts, content words are extracted from each sentence (after each sentence is morphologically analyzed if necessary), and word correspondences are found using both bilingual dictionaries and statistical information sources for word correspondence. Then, using this word correspondence information, the score h of a sentence bead p is calculated as follows.</Paragraph> <Paragraph position="3"> First, suppose $p = (a, x; b, y)$, and let $n_s(a, x)$ and $n_t(b, y)$ be the numbers of content words in the sequences of sentences $s_{a-x+1}, \ldots, s_a$ and $t_{b-y+1}, \ldots, t_b$ respectively, and let $n_{st}(p)$ be the number of corresponding word pairs in p. Then, the score h of p is defined as the ratio of $n_{st}(p)$ to the sum of $n_s(a, x)$ and $n_t(b, y)$: $h(p) = \frac{n_{st}(p)}{n_s(a, x) + n_t(b, y)}$. Let $P_i$ be the sequence of sentence beads from the beginning of the bilingual text up to the bead $p_i$:</Paragraph> <Paragraph position="5"> Then, we assume that the score $H(P_i)$ of $P_i$ follows the recursion equation below:</Paragraph> <Paragraph position="7"> Let $H_m(a_i, b_i)$ be the maximum score of aligning a part of S (from the beginning up to the $a_i$-th sentence, where $a_i = a_{i-1} + x_i$) and a part of T (from the beginning up to the $b_i$-th sentence, where $b_i = b_{i-1} + y_i$). Then, Equation 1 is transformed into a recurrence for $H_m(a_i, b_i)$, where the initial condition is:</Paragraph> <Paragraph position="9"> We limit the pair $(x_i, y_i)$ of the numbers of sentences in a sentence bead to some probable ones. For the moment, we allow only 1-1, 1-2, 1-3, 1-4, and 2-2 as pairs of the numbers of sentences: $(x_i, y_i) \in \{(1,1), (1,2), (2,1), (1,3), (3,1), (1,4), (4,1), (2,2)\}$. This optimization problem is solvable as a standard problem in dynamic programming. Dynamic programming is applied to bilingual sentence alignment in most previous work (Brown et al., 1991; Gale and Church, 1993; Chen, 1993).</Paragraph> </Section> </Section> <Section position="5" start_page="1078" end_page="1078" type="metho"> <SectionTitle> 4 Word Correspondence Estimation </SectionTitle> <Paragraph position="0"> In this section, first we describe estimation functions based on co-occurrence frequencies. Then, we show how to incorporate word correspondence information available in bilingual dictionaries and to estimate word correspondences not included in bilingual dictionaries. 
Finally, we describe the threshold function for extracting corresponding word pairs.</Paragraph> <Section position="1" start_page="1078" end_page="1078" type="sub_section"> <SectionTitle> 4.1 Estimation Function </SectionTitle> <Paragraph position="0"> In the following, we assume that sentences in the bilingual text are already aligned.</Paragraph> <Paragraph position="1"> Let $w_s$ and $w_t$ be words in the texts S and T respectively. We define the following frequencies: $\mathrm{freq}(w_s, w_t)$ = (frequency of $w_s$ and $w_t$ co-occurring in a sentence bead)</Paragraph> <Paragraph position="3"> N = (total number of sentence beads) Then, the estimation functions of Gale's method (Gale and Church, 1991) and Kay's method (Kay and Röscheisen, 1993) are given as below.</Paragraph> <Paragraph position="4"> Let a through d be as follows:</Paragraph> <Paragraph position="6"> Then, the validity of the word correspondence between $w_s$ and $w_t$ is estimated by the following value:</Paragraph> <Paragraph position="8"> The validity of the word correspondence between $w_s$ and $w_t$ is estimated by the following value:</Paragraph> <Paragraph position="10"/> </Section> <Section position="2" start_page="1078" end_page="1078" type="sub_section"> <SectionTitle> 4.2 Incorporating Bilingual Dictionary </SectionTitle> <Paragraph position="0"> By incorporating word correspondence information available in bilingual dictionaries, it becomes easier to estimate word correspondences not included in bilingual dictionaries.</Paragraph> <Paragraph position="1"> Let $w_s$ be a word in the text S and $w_t$, $w'_t$ be words in the text T. Suppose that the correspondence of $w_s$ and $w_t$ is included in bilingual dictionaries, while the correspondence of $w_s$ and $w'_t$ is not included. Then the problem is to estimate the validity of the word correspondence between $w_s$ and $w'_t$.</Paragraph> <Paragraph position="2"> Let $\mathrm{freq}(w_s, w_t)$, $\mathrm{freq}(w_s, w'_t)$, $\mathrm{freq}(w_s)$, $\mathrm{freq}(w_t)$, and $\mathrm{freq}(w'_t)$ be the same as above, and let $\mathrm{freq}(w_s, w_t, w'_t)$ be the frequency of $w_s$, $w_t$, and $w'_t$ co-occurring in a sentence bead. 
Then, we solve the problem above by defining $\mathrm{freq}'(w_s, w'_t)$, $\mathrm{freq}'(w_s)$, $\mathrm{freq}'(w'_t)$, and $N'$, which become the inputs to Gale's method or Kay's method. We describe two different ways of defining those values.</Paragraph> <Paragraph position="3"> Estimation I: One is to estimate all the word correspondences equally, except that the co-occurrence of $w_s$ and $w_t$ is preferred to that of $w_s$ and $w'_t$. $\mathrm{freq}'(w_s, w'_t)$, $\mathrm{freq}'(w_s)$, $\mathrm{freq}'(w'_t)$, and $N'$ are given below: 4</Paragraph> <Paragraph position="5"> When $w_s$, $w_t$, and $w'_t$ co-occur in a sentence bead, the co-occurrence of $w_s$ and $w_t$ is preferred and that of $w_s$ and $w'_t$ is ignored. Thus, $\mathrm{freq}'(w_s, w'_t)$ is obtained by subtracting the frequency of all those cases from the real co-occurrence frequency of $w_s$ and $w'_t$. However, $\mathrm{freq}'(w_s)$ and $\mathrm{freq}'(w'_t)$ are the same as the real frequencies, and the estimated word correspondences reflect the real co-occurrence frequencies in the input text. (Compare with Estimation II.) The validities of word correspondences are estimated equally, whether or not they are included in bilingual dictionaries. Estimation II: The other is to remove from the input text all the co-occurrences of word pairs included in bilingual dictionaries. $\mathrm{freq}'(w_s, w'_t)$, $\mathrm{freq}'(w_s)$, $\mathrm{freq}'(w'_t)$, and $N'$ are given below: Footnote 4: It can happen that, within a sentence bead, one word of a language has more than one corresponding word in the opposite language and all the correspondences are included in bilingual dictionaries. In that case, the formalizations in this section need some modifications.</Paragraph> <Paragraph position="7"> With this option, after all the co-occurrences of word pairs included in bilingual dictionaries are removed from the input text, the validities of word correspondences not included in bilingual dictionaries are estimated.</Paragraph> <Paragraph position="8"> In the following sections, we temporarily adopt Estimation I for estimating word correspondences not included in bilingual dictionaries. 
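The counting behind Estimation I can be sketched roughly as follows. This is only an illustration under stated assumptions, not the AlignCO implementation: the function name estimation_one_counts, the representation of a sentence bead as a pair of word sets, and the choice to leave the number of beads unchanged are all hypothetical. The sketch follows the prose description above: word frequencies stay as the real ones, and a co-occurrence of a non-dictionary pair is dropped whenever the source word also co-occurs with a dictionary partner in the same bead.

```python
from collections import Counter
from itertools import product


def estimation_one_counts(beads, dictionary):
    """Return (pair_freq, src_freq, tgt_freq, n_beads) in the spirit of Estimation I.

    beads: list of (source_words, target_words) pairs of sets, one per sentence bead.
    dictionary: set of (source_word, target_word) pairs listed in the bilingual dictionary.
    """
    pair_freq, src_freq, tgt_freq = Counter(), Counter(), Counter()
    for src_words, tgt_words in beads:
        # Single-word frequencies are kept as the real frequencies.
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        for ws, wt in product(src_words, tgt_words):
            if (ws, wt) in dictionary:
                continue  # dictionary pairs are known and need no estimation
            # If ws co-occurs with a dictionary partner in this bead, that
            # correspondence is preferred and this co-occurrence is ignored.
            if any((ws, other) in dictionary for other in tgt_words):
                continue
            pair_freq[(ws, wt)] += 1
    # Assumption: the number of beads is passed on unchanged.
    return pair_freq, src_freq, tgt_freq, len(beads)


# Toy data (hypothetical): "a" and "y" co-occur in two beads, but in the
# first bead "a" also co-occurs with its dictionary partner "x".
beads = [({"a", "b"}, {"x", "y"}), ({"a"}, {"y"})]
dictionary = {("a", "x")}
pair_freq, src_freq, tgt_freq, n_beads = estimation_one_counts(beads, dictionary)
# pair_freq[("a", "y")] is 1: two real co-occurrences minus the one shared with ("a", "x").
```

The returned counts would then serve as inputs to whichever estimation function is used; the symmetric case, where the target word has a dictionary partner on the source side, is deliberately not handled because the text does not describe it.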
It is necessary to further investigate and compare the two estimation methods with large-scale experiments.</Paragraph> </Section> <Section position="3" start_page="1078" end_page="1078" type="sub_section"> <SectionTitle> 4.3 Threshold Function </SectionTitle> <Paragraph position="0"> As a threshold function for extracting appropriate corresponding word pairs, we use a hyperbolic function of word frequency and estimated value for word correspondence.</Paragraph> <Paragraph position="1"> At first, we define the following variables and constants: 5</Paragraph> <Paragraph position="3"> a = (constant for eliminating low-frequency words; 1.0 for both $h_g$ and $h_k$), b = (constant for eliminating words with low estimated value; 0.1 for $h_g$ and 0.3 for $h_k$), c = (lower bound of word frequency; 2.5 for both $h_g$ and $h_k$). Then, the threshold function $g(x, y)$ is defined as below:</Paragraph> <Paragraph position="5"> And the condition for extracting corresponding word pairs is given below: $g(x, y) > 1$, $x > c$. When using extracted word correspondences in sentence alignment and structural matching, at present we ignore the estimated values and treat estimated word correspondences and word correspondences in bilingual dictionaries equally.</Paragraph> <Paragraph position="6"> Footnote 5: Note that the values of the constants are determined temporarily and need further investigation with large-scale experiments. 
In particular, constants related to word frequency have to be tuned to the length of texts.</Paragraph> </Section> </Section> <Section position="6" start_page="1078" end_page="1078" type="metho"> <SectionTitle> 5 Experiment and Evaluation </SectionTitle> <Paragraph position="0"> In this section, we report the results of a small experiment on aligning sentences in bilingual texts and statistically estimating word correspondences.</Paragraph> <Paragraph position="1"> The sentence alignment program and the word correspondence estimation program are named AlignCO.</Paragraph> <Paragraph position="2"> The processing steps of AlignCO are as follows: 1. Given a bilingual text, content words are extracted from each sentence.</Paragraph> <Paragraph position="3"> 2. A Japanese-English dictionary of about 50,000 entries is consulted and word correspondence information is extracted for content words of each sentence. 3. The sentence alignment program named AlignCO/A aligns sentences in the input text by the method stated in section 3.2.</Paragraph> <Paragraph position="4"> 4. Given the aligned sentences in the bilingual text, the word correspondence estimation program named AlignCO/C estimates word correspondences which are not included in the Japanese-English dictionary with option Estimation I in section 4.2.</Paragraph> </Section> <Section position="7" start_page="1078" end_page="7080" type="metho"> <SectionTitle> 5. 
Combining word correspondence information </SectionTitle> <Paragraph position="0"> available in the Japanese-English dictionary and estimated by AlignCO/C, sentences in the input text are realigned.</Paragraph> <Paragraph position="1"> As input Japanese-English bilingual texts, we use two short texts of different lengths: 1) &quot;The Dilemma of National Development and Democracy&quot; (305 Japanese sentences and 300 English sentences, henceforth &quot;dilemma&quot;), and 2) &quot;Pacific Asia in the Post-Cold-War World&quot; (134 Japanese sentences and 123 English sentences, henceforth &quot;cold-war&quot;). Since the results of Gale's method and Kay's method did not differ much, we show the results of Gale's method only.</Paragraph> <Section position="1" start_page="1078" end_page="7080" type="sub_section"> <SectionTitle> 5.1 Sentence Alignment </SectionTitle> <Paragraph position="0"> The following are the five best results of sentence alignment before and after estimating word correspondences not included in the Japanese-English dictionary. The results improve after estimating word correspondences not included in the bilingual dictionary. [Table: &quot;dilemma&quot;: number of errors (five best solutions), average error rate]</Paragraph> </Section> <Section position="2" start_page="7080" end_page="7080" type="sub_section"> <SectionTitle> 5.2 Word Correspondence Estimation </SectionTitle> <Paragraph position="0"> We classify the estimated word correspondences into three categories: &quot;correct&quot;, &quot;part of phrase&quot;, and &quot;wrong&quot;. &quot;Part of phrase&quot; means that the estimated word correspondence can be considered a part of corresponding phrases. &quot;Error rate&quot; is the ratio of the number of &quot;wrong&quot; word correspondences to the total number. 
[Table: &quot;dilemma&quot;: total 87, correct 53, part of phrase 30, wrong 4, error rate 4.6%] [Table: &quot;cold-war&quot;: total, correct, part of phrase, wrong, error rate] The result of &quot;dilemma&quot; is better than that of &quot;cold-war&quot;. This is because the former is longer than the latter.</Paragraph> <Paragraph position="1"> The following are example word correspondences of each category, where $f_s$, $f_t$, and $f_{st}$ are $\mathrm{freq}(w_s)$, $\mathrm{freq}(w_t)$, and $\mathrm{freq}(w_s, w_t)$ respectively. The parenthesized correspondence is not extracted by the threshold function.</Paragraph> <Paragraph position="2"> [Table: columns $w_s$, $w_t$, $h_g$, $f_s$, $f_t$, $f_{st}$; rows: 意味 - does, 0.49, 6, 3, 3; 太平洋 - and, 0.41, 47, 62, 5] Most of the &quot;correct&quot; correspondences are proper names like &quot;[Japanese word] - sultan&quot;, or those which have different parts of speech, like &quot;[Japanese word] (noun) - liberal (adjective)&quot; and &quot;[Japanese word] (noun) - economic (adjective)&quot;, or those which can be considered translation equivalents but are not included in the Japanese-English dictionary, like &quot;[Japanese word] (news) - press&quot;.</Paragraph> <Paragraph position="3"> The examples of &quot;part of phrase&quot; form a phrase correspondence &quot;[Japanese phrase] - civilian supremacy&quot;. The former &quot;wrong&quot; correspondence &quot;意味 (meaning) - does&quot; comes from the correspondence of the long-distance dependent phrases &quot;[Japanese phrase] - does mean&quot;. The latter &quot;wrong&quot; correspondence &quot;太平洋 (Pacific Ocean) - and&quot; is extracted by Gale's method because both freq(太平洋) and freq(and) are high and close to the total number of sentence beads. This correspondence is not extracted by Kay's method.</Paragraph> <Paragraph position="4"> Then, in Fig. 2, we illustrate the relation between the estimated value $h_g(w_s, w_t)$ of Gale's method and the co-occurrence frequency $\mathrm{freq}(w_s, w_t)$ for the text &quot;dilemma&quot;. 
The threshold function seems optimized so that it extracts as many word correspondences of the categories &quot;correct&quot; and &quot;part of phrase&quot; as possible, and as few word correspondences of the category &quot;wrong&quot; as possible.</Paragraph> </Section> </Section> </Paper>