<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1009">
  <Title>A Robust Cross-Style Bilingual Sentences Alignment Model</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Statistical Sentence Alignment Model
</SectionTitle>
    <Paragraph position="0"> Since an English-Chinese bilingual corpus will be adopted in our experiments, we will denote the source text with m sentences as ESm1 , and its corresponding target text, with n sentences, as CSn1. Let Mi = {typei,1,***,typei,Ni} denote the i-th possible alignment-candidate, consisting of Ni Alignment-Passages of typei,j, j = 1,***,Ni; where typei,j is the matching type (e.g., 1[?]1, 0[?]1, 1[?]0, etc.) of the j-th Alignment-Passage in the i-th alignment-candidate, and Ni denotes the number of the total Alignment-Passages in the i-th alignmentcandidate. Then the statistical alignment model is to find the Bayesian estimate M[?] among all possible alignment candidates, shown in the following</Paragraph>
    <Paragraph position="2"> According to the Bayesian rule, the maximization problem in (2.1) is equivalent to solving the following maximization equation</Paragraph>
    <Paragraph position="4"> where Aligned-Pairi,j, j = 1,***,Ni, denotes the j-th aligned English-Chinese bilingual sentence groups pair in the i-th alignment candidate.</Paragraph>
    <Paragraph position="5"> Assume that</Paragraph>
    <Paragraph position="7"> and different typei,j in the i-th alignment candidate are statistically independent2, then the above maximization problem can be approached by searching for</Paragraph>
    <Paragraph position="9"> where ^M denotes the desired candidate.</Paragraph>
    <Paragraph position="10"> 2A more reasonable one should be the first-order Markov model (i.e., Type-Bigram model); however, it will significantly increase the searching time and thus is not adopted in this paper.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2.1 Baseline Model
</SectionTitle>
    <Paragraph position="0"> To make the above model feasible, Aligned-Pairi,j should be first transformed into an appropriate feature space. The baseline model will use both the length of sentence [Brown et al., 91; Gale and Church, 93] and English cognates [Wu, 94], and is shown as follows:</Paragraph>
    <Paragraph position="2"> where dc and dw denote the normalized differences of characters and words as explained in the following; dc is defined to be (ltc [?] clsc)/radicalbiglscs2c, where lsc and ltc are the character numbers of the aligned bilingual portions of source text and target text, respectively, under consideration; c denotes the proportional constant for target-character-count and s2c denotes the corresponding target-character-count variance per source-character. Similarly, dw is defined to be (ltw [?] wlsw)/radicalbiglsws2w, where lsw and ltw are the word numbers of the aligned bilingual portions of source text and target text, respectively; w denotes the proportional constant for target-word-count and s2w denotes the corresponding target-word-count variance per sourceword. Also, the random variables dc and dw are assumed to have bivariate normal distribution and each possesses a standard normal distribution with mean 0 and variance 1. Furthermore, dcognate denotes (Number of English cognates found in the given Chinese sentences[?]Number of corresponding English cognates found in the given English sentences), and is Poisson3 distributed independent of its associated matching-type; also assume that dcognate is independent of other features (i.e., character-count and word-count).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Proposed Transfer Lexicon Model
</SectionTitle>
      <Paragraph position="0"> Since transfer-lexicons are usually regarded as more reliable cues for aligning sentences when the alignment task is performed by human, the above baseline model is further enhanced by adding 3Since almost all those English cognates found in the given Chinese sentences can be found in the corresponding English sentences, dcognate had better to be modeled as a Poisson distribution for a rare event (rather than Normal distribution as some papers did).</Paragraph>
      <Paragraph position="1"> those associated transfer lexicons to it. Those translated Chinese words, which are derived from each English word (contained in given English sentences) by looking up some kinds of dictionaries, can be viewed as transfer-lexicons because they are very likely to appear in the translated Chinese sentence. However, as the distribution of various possible translations (for each English lexicon) found in our bilingual corpus is far more diversified4 compared with those transfer-lexicons obtained from the dictionary, only a small number of transfer-lexicons can be matched if the exact-match is specified. Therefore, each Chinese-Lexicon obtained from the dictionary is first augmented with its associated Chinese characters, and then the augmented transfer-lexicons set are matched with the target Chinese sentence(s). Once an element of the augmented transfer-lexicons set is matched in the target Chinese sentence, it is counted as being matched. So we compute the Normalized-Transfer-Lexicon-Matching-Measure, dTransfer[?]Lexicons which denotes [(Number of augmented transfer-lexicons matched[?]Number of augmented transfer-lexicons unmatched)/ Total Number of augmented transfer-lexicons sets], and add it to the original model as another additional feature.</Paragraph>
      <Paragraph position="2"> Assume follows normal distribution and the associated parameters are estimated from the training set, Equation (2.5) is then replaced by</Paragraph>
      <Paragraph position="4"/>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Implementation
</SectionTitle>
    <Paragraph position="0"> The best bilingual sentence alignment in those above models can be found by utilizing a dynamic programming algorithm, which is similar to the dynamic time warping algorithm used in speech recognition [Rabiner and Juang, 93]. Currently, the  translated into EK, K, PK, E K, EPK, P }, * * * etc., for a specific sense in the given corpus; however, the transfer entries listed in the dictionary are EK and P } only.</Paragraph>
    <Paragraph position="1"> Case I (Length-Type Error) (E1) Compared to this, modern people have relatively better nutrition and mature faster, working women marry later, and there has been a great decrease in frequency of births, so that the number of periods in a lifetime correspondingly increases, so it is not strange that the number of people afflicted with endometriosis increases greatly. (C1) oui, UHADossSTXoeoBEL, &lt;vThC/wu, Th&gt;YbyuxA, oTh2~%VaYbo$O&lt;, ac143qc142aeP66OO6.--NJ7SOH (E2) The problem is not confined to women.</Paragraph>
    <Paragraph position="2"> (E3) Sperm activity also noticeably decreases in men over forty, says Taipei Medical College urologist Chang Han-sheng. (C2) .EuTh4,ETXa4EuSOHuJ(, ETBY=oc1416}peAy'OEOTC&gt;&gt;cI+SUBETX3L+-C;zSOH Case II (Length&amp;Lexicon-Type Error) (E1) Second, the United States as well as Japan have provided lucrative export markets for countries in this region. (E2) The U.S. was particularly generous in the postwar years, keeping its markets open to products from Asia and giving nascent industries in the region a chance to catch up.</Paragraph>
    <Paragraph position="3"> (C1) wY, D(i1AerC[a&amp;quot;=1ss1, U=1E-ihEAdDELoe}eEOTSOH  maximum number of either source sentences or target sentences allowed in each alignment unit is set to be 4 (i.e., we will not consider those matching-types of 5[?]1, 5[?]2, 1[?]5, etc).</Paragraph>
    <Paragraph position="4"> Let {s1,***,sm} and {t1,***,tn} be the parallel bilingual source and target sentences, and let S(m,n) be the maximum accumulated score between {s1,***,sm} and {t1,***,tn} under the best alignment path. Then S(m,n) can be evaluated recursively with the initial condition of S(0,0) = 0 in the following way:</Paragraph>
    <Paragraph position="6"> where score(h,k) denotes the local scoring function to evaluate the local passage of matching type h[?]k.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Performance Evaluation
</SectionTitle>
    <Paragraph position="0"> In the experiments, a training set consisting of 7,331 pairs of bilingual sentences, and a testing set with 1,514 pairs of bilingual sentences are extracted from the Caterpillar User Manual which is mainly about machinery. The cross-style testing set contains 274 pairs of bilingual sentences selected from the Sinorama Magazine, which is a general magazine (for introducing Taiwan to foreign visitors) with its topics covering law, politics, education, technology, science, etc. Figure 1 is an illustration of bilingual Sinorama Magazine texts.</Paragraph>
    <Paragraph position="1"> For comparing the performance of alignment, both precision rate (p) and recall rate (r), defined as follows, are measured; however, only their associated F-measure5 is reported for saving space.</Paragraph>
    <Paragraph position="3"> A Sequential-Forward-Selection (SFS) procedure [Devijver, 82], based on the performance measured from the Caterpillar User Manual, is then adopted to rank different features. Among them, the Chinese transfer lexicon feature (abbreviated as CTL in the table), which only adopts Normalized-Transfer-Lexicon-Matching-Measure and matching-type priori distribution (i.e., P(typei,j)), is first selected, then CL feature (which adopts character-length), WL feature (using word-length) and EC feature (using English cognate) follow in sequence, as reported in Table 4.1.</Paragraph>
    <Paragraph position="4"> The selection sequence verifies our previous supposition that the transfer-lexicon is a more reliable feature and contributes most to the aligning task. Table 4.1 clearly shows that the proposed robust model achieves a 60% F-measure error reduction (from 14.4% to 5.8%) compared with the baseline model (i.e., improving the cross-style performance from 85.6% to 94.2% in F-measure). The  are still useful, even though they are relatively unreliable. null</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Error Analysis
</SectionTitle>
    <Paragraph position="0"> In order to understand more about the behavior of the various features, we classify all errors which occurs in aligning Sinorama Magazine in Table 5.1; the error dominated by the prior distribution of matching type is called matching-type error, the error dominated by length feature is called lengthtype error, and the error caused from both length features and lexical-related features (either one is not dominant) is called length&amp;lexicon-type error6.</Paragraph>
    <Paragraph position="1"> From Table 5.1, it is found that the matching-type errors dominate in the baseline model. To investigate the matching-type error, the prior distributions of matching-types under training set [Caterpillar User Manual] and testing set II [Sinorama Magazine] are given in Table 5.2. The comparison clearly shows that the matching-type distribution varies significantly across different domains, and that explains why the baseline model (which only considers length-based features and matching-type distribution) fails to achieve the similar performance in the cross-style test. However, as the 1-1 matching-type always dominates in both texts, the matching-type distribution still provide useful information for aligning sentences when it is jointly considered with the lexical-related feature. For those Length-Type errors generated from the base-line model in Table 5.1, different statistical characteristics across different styles are listed in Table  tical characteristics of those length-based features vary significantly across different styles. Furthermore, although English-cognates are reliable cues for aligning bi-lingual sentences and occurs quite a few times in the technical manual (such as company names: IBM, HP, etc., and some special technical terms such as RS-232, etc), they almost never occur in a general magazine such as the one that we test. Therefore, they provide no help for aligning corpus in such domains.</Paragraph>
    <Paragraph position="2"> Table 5.1 also shows that errors distribute differently in the proposed robust model. The lengthtype, instead of matching-type, now dominates errors, which implies that the mismatching effect resulting from different distributions of matching types has been diluted by the transfer-lexicon feature. Furthermore, the score of erroneous lexicontype assignment never dominates any error found in the proposed robust model, which verifies our supposition that transfer-lexicons are more reliable cues for aligning sentences.</Paragraph>
    <Paragraph position="3"> To further investigate those remaining errors generated from the proposed robust model, two error examples are given in Figure 1. The first case shows an example of Length-Type Error, in which the short sentence (E2) is erroneously merged with the long sentence (E1) and results in an erroneous alignment [E1, E2 : C1] and [E3 : C2]. (The correct alignment should be [E1 : C1] and [E2, E3 : C2].) Generally speaking, if a short source sentence is enclosed by two long source sentences in both sides, and they are jointly translated into two long target sentences, then it is error prone compared with other cases. The main reason is that this short source sentence would contain only a few words and thus its associated transfer- null lexicons are not sufficient enough to override the wrong preference given by the length-based feature (which would assign similar score to both mergedirections). null The second case shows an example of Length&amp;Lexicon-Type Error, in which the source sentence (E1) is erroneously deleted and results in an erroneous alignment [E1: Delete] and [E2 : C1]. (The correct alignment should be [E1, E2 : C1].) The main reason is that the meaning of sentence (E1) is similar to that of (E2) but stated in different words, and the translator has merged the redundant information in his/her translation.</Paragraph>
    <Paragraph position="4"> Therefore, the length-feature prefers to delete the first source sentence. On the other hand, since most of those associated transfer-lexicons in the source sentence E1 cannot be found in the corresponding target sentence C1, the Transfer-Lexicon feature also prefers to delete the first source sentence E1. It seems that this kind of errors would require further knowledge from language understanding to solve them, and is beyond the scope of this paper.</Paragraph>
    <Paragraph position="5"> 7The occurrence rate is defined as Number of sentences that contained congates/Total number of sentences</Paragraph>
  </Section>
class="xml-element"></Paper>