<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1009">
  <Title>BUILDING AN MT DICTIONARY FROM PARALLEL TEXTS BASED ON LINGUISTIC AND STATISTICAL INFORMATION</Title>
  <Section position="4" start_page="0" end_page="76" type="metho">
    <SectionTitle>
2 APPROACH TO BUILDING AN MT DICTIONARY
</SectionTitle>
    <Paragraph position="0"> Our goal in building an MT dictionary from parallel texts is to develop a robust method which enables highly accurate extraction of translation pairs from a relatively small amount of parallel texts, as well as from parallel texts containing severe distortions.</Paragraph>
    <Paragraph position="1"> In real-world applications, it is generally extremely difficult, especially for MT users, to obtain a large amount of high-quality parallel texts in one specific domain. If the source and target languages do not belong to the same linguistic family, as with Japanese and English, the situation becomes grave.</Paragraph>
    <Paragraph position="2"> As one typical example of MT dictionary compilation, we have selected Japanese and English patent documents, which contain many state-of-the-art technical terms. Although these documents are not culturally biased, in many cases the organization of the Japanese and English texts greatly differs, and extensive changes are made in translating from Japanese to English and vice versa. Hence the difficulty of word extraction from patents.</Paragraph>
    <Paragraph position="3"> To solve this problem, we explored a method that appropriately integrates linguistic information and statistical information. Linguistic information is useful in making an intelligent judgment about correspondence between two languages even from partial texts because of its lexical, syntactic, and semantic knowledge; statistical information is characterized by its robustness against noise because it can transform many actual examples into an abstract form.</Paragraph>
    <Paragraph position="4"> Below is the flow of our method, illustrated in Fig. 1: (1) Unit Extraction: Parts of documents ("units") are extracted from both Japanese and English texts.</Paragraph>
    <Paragraph position="5"> (2) Unit Mapping: Each Japanese unit is mapped into English units. (3) Term Extraction: Japanese term candidates are extracted by the NP recognizer.</Paragraph>
    <Paragraph position="6"> (4) Translation Candidate Generation: English translation candidates for Japanese terms are extracted from English units.</Paragraph>
    <Paragraph position="7"> (5) English Translation Estimation: The translation candidates are evaluated to obtain the best one. The subsequent sections give the details of each processing step; a schematic sketch of the overall flow is given below.</Paragraph>
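    <Paragraph> As an illustration only (not the authors' implementation), a minimal Python sketch of this five-step flow; the component functions are passed in as parameters because their names and interfaces are hypothetical stand-ins for the processes described in the following sections.
      def build_mt_dictionary(japanese_text, english_text, bilingual_dict,
                              extract_units, map_units, extract_japanese_terms,
                              generate_candidates, estimate_translation):
          # (1) Unit Extraction: split both texts into units (sentences here).
          j_units = extract_units(japanese_text)
          e_units = extract_units(english_text)
          # (2) Unit Mapping: build the corresponding unit table with likelihoods.
          unit_table = map_units(j_units, e_units, bilingual_dict)
          # (3) Term Extraction: Japanese term candidates from the NP recognizer.
          terms = extract_japanese_terms(j_units)
          translations = {}
          for term in terms:
              # (4) Translation Candidate Generation: n-grams from mapped units.
              candidates = generate_candidates(term, unit_table)
              # (5) English Translation Estimation: keep the best-scoring candidate.
              translations[term] = estimate_translation(term, candidates, bilingual_dict)
          return translations
    </Paragraph>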
  </Section>
  <Section position="5" start_page="76" end_page="77" type="metho">
    <SectionTitle>
3 FORMING UNIT CORRESPONDENCES
</SectionTitle>
    <Paragraph position="0"> The plausible hypothesis that parallel sentences contain corresponding linguistic expressions is the major premise in Kupiec (1993). This type of information should be widely used. The problem is that the alignment method based on the sentence bead model (Brown, 1991) is not applicable to patent documents due to their severe distortions in document structures and sentence correspondences. Consequently, we have introduced a concept called "unit," which corresponds to a part of a sentence, and adopted a new method to extract corresponding units by using linguistic knowledge as a primary source of information.</Paragraph>
    <Section position="1" start_page="76" end_page="76" type="sub_section">
      <SectionTitle>
3.1 Extraction of Units
</SectionTitle>
      <Paragraph position="0"> First, units are extracted from parallel texts.</Paragraph>
      <Paragraph position="1"> The unit corresponds to sentences or phrases in the text. Terms which should be extracted can be found within a unit. The rest of the words in the unit are called contextual information for the extracted term. The size of units determines the effectiveness of the succeeding unit mapping process. For example, if we set noun phrases (entry words in a dictionary) as a unit, no contextual information is available, and thus the probability that corresponding relations hold decreases. In our present implementation, we set sentences as units as a first approximation.</Paragraph>
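      <Paragraph> A minimal sketch of sentence-level unit extraction, assuming plain text input; the delimiter set (including the Japanese full stop) is an assumed choice, not taken from the paper.
        import re

        # Sentence-final delimiters; this set is an assumption, not from the paper.
        SENTENCE_END = re.compile(r"[.!?\u3002]+")

        def extract_units(text):
            # Units are sentences in the first approximation described above.
            parts = SENTENCE_END.split(text)
            return [p.strip() for p in parts if p.strip()]
      </Paragraph>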
    </Section>
    <Section position="2" start_page="76" end_page="77" type="sub_section">
      <SectionTitle>
3.2 Mapping of Units
</SectionTitle>
      <Paragraph position="0"> Next, the unit mapping process creates a corresponding unit table from the Japanese and English units. This table stores the correspondence relationship between units and its likelihood. The likelihood is calculated based on the linguistic information in an MT bilingual dictionary. Our unit mapping algorithm is given below: (1) Let J be the set of all content words in the Japanese unit JU (m is the number of words): J = {J1, J2, ..., Jm}. (2) Let E be the set of all content words in the English unit EU (n is the number of words): E = {E1, E2, ..., En}. (3) x is the number of Ji's whose translation candidate list includes some Ej in E.</Paragraph>
      <Paragraph position="1"> (4) y is the number of Ej's which are included in the translation candidate list of some Ji in J.</Paragraph>
      <Paragraph position="2"> (5) The correspondence likelihood CL(JU, EU) is calculated from x, y, m, and n.</Paragraph>
      <Paragraph position="4"> (6) The pairs (JU, EU) with the highest CL(JU, EU) are stored in the corresponding unit table.</Paragraph>
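      <Paragraph> As an illustration (not taken from the paper), a minimal Python sketch of the mapping step, assuming a bilingual dictionary represented as a mapping from each Japanese content word to a list of English translation candidates, and assuming the simple normalized form CL = (x + y) / (m + n), which is consistent with the definitions of x, y, m, and n above.
        def correspondence_likelihood(j_words, e_words, bilingual_dict):
            # j_words: content words of the Japanese unit JU (m words)
            # e_words: content words of the English unit EU (n words)
            # bilingual_dict: Japanese word -> list of English translation candidates
            m, n = len(j_words), len(e_words)
            e_set = set(e_words)
            # x: Japanese words whose translation candidate list hits some Ej in EU.
            x = sum(1 for j in j_words if e_set.intersection(bilingual_dict.get(j, [])))
            # y: English words included in the translation candidates of some Ji.
            covered = set()
            for j in j_words:
                covered.update(e_set.intersection(bilingual_dict.get(j, [])))
            y = len(covered)
            # Assumed normalized overlap score.
            return (x + y) / (m + n) if (m + n) else 0.0
      </Paragraph>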
    </Section>
  </Section>
  <Section position="6" start_page="77" end_page="77" type="metho">
    <SectionTitle>
4 GENERATING TRANSLATION
CANDIDATES
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="77" end_page="77" type="sub_section">
      <SectionTitle>
4.1 Extraction of Japanese Terms
</SectionTitle>
      <Paragraph position="0"> Errors in the extraction of terms and phrases from parallel texts eventually lead to a failure in acquiring the correct term/phrase correspondences.</Paragraph>
      <Paragraph position="1"> In Kupiec (1993) and Yamamoto (1993), term and phrase extraction is applied to both of the parallel texts. In contrast, we extract only Japanese terms from the units, thereby reducing the errors caused by the term/phrase recognizer. Japanese NP's can be recognized more accurately than English NP's because Japanese has considerably fewer multi-category words.</Paragraph>
      <Paragraph position="2"> In the current implementation, the following two types of term candidates are extracted by the NP recognizer. Our NP recognizer utilizes the sentence analyzer of a practical MT system. The word dictionary includes approximately 70,000 Japanese entries.</Paragraph>
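      <Paragraph> The NP recognizer itself is the sentence analyzer of a practical MT system and is not detailed here; purely as an illustration of noun-phrase term candidate extraction, a sketch that collects maximal noun runs from a morphologically analyzed (word, part-of-speech) sequence. The POS tag set and the run heuristic are assumptions, not the paper's recognizer.
        def np_term_candidates(tagged_words):
            # tagged_words: list of (surface, pos) pairs from a morphological analyzer.
            # Collect maximal runs of nouns as term candidates (assumed heuristic).
            candidates, run = [], []
            for surface, pos in tagged_words:
                if pos == "NOUN":
                    run.append(surface)
                else:
                    if run:
                        candidates.append("".join(run))  # Japanese has no spaces
                    run = []
            if run:
                candidates.append("".join(run))
            return candidates
      </Paragraph>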
    </Section>
    <Section position="2" start_page="77" end_page="77" type="sub_section">
      <SectionTitle>
4.2 Finding Translation Candidates
</SectionTitle>
      <Paragraph position="0"> Generation of English translation candidates for a Japanese term is essentially based on the following hypothesis. Hypothesis 1: The English translation of an extracted term in a Japanese unit is contained in the corresponding English unit.</Paragraph>
      <Paragraph position="1"> Now an arbitrary word sequence in the corresponding units can be a translation candidate of the Japanese term. We extract English translation candidates as follows. When the extracted term appears in N Japanese units, N x M English units will be stored in the corresponding unit table with their correspondence likelihood. The N highest corresponding units within the N x M combinations are extracted; when N is less than M, the M highest combinations are selected. Suppose that the correct English translation of the Japanese term JW is EW, and that the number of Japanese units in which JW appears is FJU(JW) (= N). From Hypothesis 1, which states that the translation is contained in the corresponding units EU1, EU2, ..., EUFJU(JW), EW should be a word sequence which often appears in those corresponding units. In order to obtain such an EW, we use n-gram data.</Paragraph>
      <Paragraph position="2"> The frequency of each n-gram (1 &lt;= n &lt;= 2 x (the number of component words in JW)) in the FJU(JW) English units is calculated, and the EW candidates are then ranked by frequency as EWC1, EWC2, ..., EWCj. Because an EWC with a low frequency in the corresponding units is unlikely to be the correct translation, candidates with a frequency less than FJU(JW)/4 are heuristically excluded. Candidates containing a be-verb and candidates which start or end with a preposition or an article are also excluded.</Paragraph>
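      <Paragraph> A sketch of candidate generation under Hypothesis 1, assuming the corresponding English units are given as tokenized word lists; the FJU(JW)/4 cutoff and the be-verb / preposition / article filters follow the description above, while the concrete stop-word lists are assumptions.
        from collections import Counter

        # Assumed filter lists; the paper names be-verbs, prepositions, and articles
        # without listing them, so these sets are illustrative only.
        BE_VERBS = {"be", "is", "are", "was", "were", "been", "being"}
        PREPS_ARTICLES = {"a", "an", "the", "of", "in", "on", "for", "to", "with", "by"}

        def english_candidates(corresponding_units, jw_length):
            # corresponding_units: tokenized English units mapped to the term's units
            # jw_length: number of component words in the Japanese term JW
            max_n = 2 * jw_length
            counts = Counter()
            for unit in corresponding_units:
                for n in range(1, max_n + 1):
                    for i in range(len(unit) - n + 1):
                        counts[tuple(unit[i:i + n])] += 1
            # Heuristic cutoff: drop candidates seen fewer than FJU(JW)/4 times.
            threshold = len(corresponding_units) / 4
            kept = []
            for ngram, freq in counts.items():
                if freq >= threshold and not BE_VERBS.intersection(ngram) \
                        and ngram[0] not in PREPS_ARTICLES and ngram[-1] not in PREPS_ARTICLES:
                    kept.append((" ".join(ngram), freq))
            # Rank candidates EWC1, EWC2, ... by frequency.
            return sorted(kept, key=lambda item: item[1], reverse=True)
      </Paragraph>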
    </Section>
  </Section>
  <Section position="7" start_page="77" end_page="79" type="metho">
    <SectionTitle>
5 ESTIMATING ENGLISH TRANSLATIONS
</SectionTitle>
    <Paragraph position="0"> The translation likelihood (TL) of one translation candidate EWCi for the term JW is defined as: TL(JW, EWCi) = F(TLS(JW, EWCi), TLL(JW, EWCi)), where TLS(JW, EWCi) is the "Translation Likelihood based on Statistical information" and TLL(JW, EWCi) the "Translation Likelihood based on Linguistic information."</Paragraph>
    <Section position="1" start_page="77" end_page="78" type="sub_section">
      <SectionTitle>
5.1 Statistical Information
</SectionTitle>
      <Paragraph position="0"> TLS(JW, EWCi) is the frequency score based on the statistical information from Hypothesis 1: a word sequence which appears in the corresponding units as often as JW appears in the Japanese units is more likely to be EW. It is quantitatively defined as the probability with which the translation candidate appears in the corresponding units. That is, TLS(JW, EWCi) = FEU(EWCi) / FJU(JW),</Paragraph>
      <Paragraph position="2"> where FEU(EWCi) is the number of corresponding units in which EWCi appears.</Paragraph>
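      <Paragraph> A small sketch of TLS, assuming each corresponding unit is a tokenized word list and that "appears in" means the candidate occurs as a contiguous word sequence in the unit.
        def contains_subsequence(unit, phrase):
            # True if the word list `phrase` occurs contiguously inside `unit`.
            k = len(phrase)
            return any(unit[i:i + k] == phrase for i in range(len(unit) - k + 1))

        def tls(candidate_words, corresponding_units):
            # TLS(JW, EWCi) = FEU(EWCi) / FJU(JW), with FJU(JW) the number of
            # corresponding units retrieved for JW.
            fju = len(corresponding_units)
            feu = sum(1 for unit in corresponding_units
                      if contains_subsequence(unit, candidate_words))
            return feu / fju if fju else 0.0
      </Paragraph>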
    </Section>
    <Section position="2" start_page="78" end_page="78" type="sub_section">
      <SectionTitle>
5.2 Linguistic Information
</SectionTitle>
      <Paragraph position="0"> TLL(JW, EWCi) is the word similarity score based on the accuracy of the correspondence between the term JW and the translation candidate EWCi, obtained by using the linguistic information in the MT bilingual dictionary. Suppose one translation candidate of the term JW = wj1, wj2, ..., wjk is EWCi = we1, we2, ..., wel.</Paragraph>
      <Paragraph position="1"> Then we use the following hypothesis.</Paragraph>
    </Section>
    <Section position="3" start_page="78" end_page="78" type="sub_section">
      <SectionTitle>
Hypothesis 2
</SectionTitle>
      <Paragraph position="0"> (a) If the length of EWCi is close to the length of JW, JW and EWCi are likely to correspond to each other.</Paragraph>
      <Paragraph position="1"> (b) JW and EWCi with more word translation correspondences are likely to correspond to each other.</Paragraph>
      <Paragraph position="2"> Under this hypothesis, the following correspondence relation (1) is the best: term JW and translation candidate EWCi have the same length k (= l), and all of their component words correspond in the dictionary. wji => wei indicates that wei is included in wji's translation candidates in the MT bilingual dictionary. (1) wj1 => we1, wj2 => we2, ..., wjk => wek. More generally, the relation of each word (wj) in term JW and each word (we) in translation candidate EWCi is classified into the following four classes:</Paragraph>
      <Paragraph position="4"> i) is a pair wj => we whose correspondence is described in the bilingual dictionary; ii) shows a pair whose correspondence is not described in the bilingual dictionary; iii) and iv) indicate that the corresponding word for wj or we is missing. In iii), JW is longer than EWCi, and vice versa in iv).</Paragraph>
      <Paragraph position="5"> In order to estimate the correspondence between JW and EWCi, i) and ii) are scored by their similarity to the virtual translation which holds relation (1). When the number of words is the same, a score Q (constant) is given; αQ (α &gt; 0) is added to Q when there is a translation relation, to reflect the higher reliability of i). Therefore Q + αQ = (1+α)Q is given to a word pair of class i), and Q to a word pair of class ii).</Paragraph>
      <Paragraph position="6"> Now, since we disregard the word order of a term, JW and EWCi are represented as sets of words: JW = {wj1, wj2, ..., wjk} and EWCi = {we1, we2, ..., wel}.</Paragraph>
      <Paragraph position="8"> The number of words with a lexical correspondence relation in wj and we, the number of words in wj without a relation, and the number of words in we without a relation are counted as x, y, and z respectively. That is, x + y = k and x + z = l.</Paragraph>
      <Paragraph position="9"> TLL(JW, EWCi) is given as the ratio of the score of the correspondence between JW and EWCi to the score of the virtual translation which holds relation (1).</Paragraph>
      <Paragraph position="11"> The value of α is determined as 2 by evaluating sample translation pairs.</Paragraph>
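      <Paragraph> A sketch of one TLL scoring scheme that follows Hypothesis 2 and the Q / (1+α)Q weighting above: x class-i pairs score (1+α)Q, up to min(y, z) remaining pairs score Q as class ii, unmatched words score nothing, and the virtual translation of relation (1) is taken to have max(k, l) fully corresponding pairs. This exact form is an assumption rather than the paper's formula, although with α = 2 it reproduces the TLL values in Table 1.
        def tll(x, y, z, alpha=2.0, q=1.0):
            # x: word pairs with a dictionary correspondence (class i)
            # y: words of JW left without a correspondence
            # z: words of EWCi left without a correspondence
            # k = x + y (length of JW), l = x + z (length of EWCi)
            k = x + y
            l = x + z
            # Assumed scoring: class-i pairs get (1 + alpha) * q, min(y, z)
            # class-ii pairs get q, and unmatched words (classes iii/iv) get 0.
            score = x * (1.0 + alpha) * q + min(y, z) * q
            # Virtual translation of relation (1): max(k, l) corresponding pairs.
            virtual = max(k, l) * (1.0 + alpha) * q
            return score / virtual if virtual else 0.0
        For example, tll(3, 1, 1) = 10/12 ≈ 0.83 and tll(3, 1, 0) = 9/12 = 0.75, matching the "open bit line configuration" and "open bit line" rows of Table 1.
      </Paragraph>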
      <Paragraph position="12"> The following are the TLL's of three EWC's for JW, the Japanese term corresponding to "open bit line configuration" in Table 1, which consists of four component words (k = 4), the first being glossed "open."</Paragraph>
      <Paragraph position="14"/>
    </Section>
    <Section position="4" start_page="78" end_page="79" type="sub_section">
      <SectionTitle>
5.3 Combination of Statistical and Linguistic Information
</SectionTitle>
      <Paragraph position="0"> We define the translation likelihood TL(JW, EWCi) as below: TL(JW, EWCi) = (m TLS(JW, EWCi) + n TLL(JW, EWCi)) / (m + n). Examining the value with the ratio n/m held constant, a low value of TLS(JW, EWCi) adversely affects the total score, especially when the frequency FJU(JW) is 5 or less. This shows that TLS(JW, EWCi) should be weighted heavily for JW's which appear often, but not for JW's with a low frequency. Therefore we tentatively define β = n/m as a function of the frequency FJU(JW), because β should be higher when FJU(JW) is low.</Paragraph>
      <Paragraph position="2"> β = p / ({FJU(JW)}^q - r) + s, where r is a possible minimum frequency, and s is the limit of β as the word frequency becomes high enough.</Paragraph>
      <Paragraph position="3"> The values p=4, q=1, r=1, and s=0.5 are used in the following experiments. By introducing β, F is rewritten as: F(TLS(JW, EWCi), TLL(JW, EWCi)) = (TLS(JW, EWCi) + β TLL(JW, EWCi)) / (1 + β). In case {FJU(JW)}^q is equal to or less than r, β is meaningless; for such JW's, TL(JW, EWCi) is redefined as simply TL(JW, EWCi) = TLL(JW, EWCi).</Paragraph>
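      <Paragraph> A sketch of the combination step using the parameter values stated above (p=4, q=1, r=1, s=0.5) and the normalized form of F as reconstructed above, which reproduces the scores in Table 1.
        def beta(fju, p=4.0, q=1.0, r=1.0, s=0.5):
            # Weight of linguistic vs. statistical information; higher for rare terms.
            denom = fju ** q - r
            if denom > 0:
                return p / denom + s
            return None  # beta is meaningless when FJU^q is r or below

        def translation_likelihood(tls_score, tll_score, fju):
            b = beta(fju)
            if b is None:
                # For very rare terms, fall back to the linguistic score only.
                return tll_score
            return (tls_score + b * tll_score) / (1.0 + b)
        For FJU(JW) = 19 this gives β = 4/18 + 0.5 ≈ 0.72, as in the example below.
      </Paragraph>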
      <Paragraph position="4"> Finally, the translation candidate EWCi with the largest value of TL(JW, EWCi) is assumed to be the correct English translation.</Paragraph>
      <Paragraph position="5"> Table 1 shows the translation candidates for JW, the Japanese term for "open bit line configuration," with the best three TL's. Its frequency in the Japanese text is FJU(JW) = 19 (β = 4/(19 - 1) + 0.5 = 0.72). Consequently, the correct translation "open bit line configuration" obtains the highest TL. Table 1 (EWC | FEU | TLS | TLL | TL): bit line configuration | 19 | 1.00 | 0.58 | 0.82; open bit line | 18 | 0.95 | 0.75 | 0.86; open bit line configuration | 18 | 0.95 | 0.83 | 0.90.</Paragraph>
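      <Paragraph> As a worked check of the combination formula above with β = 0.72: (1.00 + 0.72 x 0.58) / 1.72 ≈ 0.82, (0.95 + 0.72 x 0.75) / 1.72 ≈ 0.86, and (0.95 + 0.72 x 0.83) / 1.72 ≈ 0.90, matching the TL column of Table 1.</Paragraph>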
    </Section>
  </Section>
</Paper>