<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0315"> <Title>Efficient Optimization for Bilingual Sentence Alignment Based on Linear Regression</Title>
<Section position="3" start_page="0" end_page="11" type="metho"> <SectionTitle> 2 System of Mining Parallel Text </SectionTitle>
<Paragraph position="0"> One crucial component of a statistical machine translation (SMT) system is mining parallel text from the Internet. Several processing modules are applied to collect, extract, convert, and clean the text.</Paragraph>
<Paragraph position="1"> The components in our system include:
* A web crawler, which collects potentially parallel HTML documents based on link information, following (Resnik, 1999);
* A bilingual HTML parser (based on flex for efficiency), designed for both Chinese and English HTML documents. Paragraph boundaries within the HTML structure are preserved;
* A character encoding detector, which determines whether a Chinese HTML document is encoded in GB2312 or BIG5;
* An encoding converter, which converts BIG5 documents to GB2312;
* A language identifier (van Noord's implementation of (Cavnar and Trenkle, 1994)), which ensures that source and target documents are both in the proper language;
* A Chinese word segmenter, which parses Chinese character strings into words;
* A document alignment program, which judges whether a document pair is a close translation candidate and filters out non-translation pairs;
* A sentence boundary detector, based on punctuation and capitalized characters;
* And the key component, a sentence alignment program, which aligns and extracts potential parallel sentence pairs from the candidate document pairs.
After sentence alignment, each candidate parallel sentence pair is re-scored by the regression models (described in section 5). These scores are used to judge the quality of the aligned sentences, so that one can select the sentence pairs with high alignment quality scores to re-estimate the system's parameters.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Sentence Alignment </SectionTitle>
<Paragraph position="0"> Our sentence alignment program uses the IBM Model-1 based perplexity (section 2.2) to calculate the similarity of each sentence pair. Dynamic programming is applied to find the Viterbi path over sentence alignments of a bilingual comparable document pair; a sketch of this search follows the list below. Our implementation allows for seven alignment types between English and Chinese sentences:
* 1:1 - exact match, where one sentence is the translation of the other;
* 2:2 - the break point between two sentences in the source document differs from the segmentation in the target document, e.g., part of the first source sentence may be translated as part of the second target sentence;
* 2:1, 1:2, and 3:1 - similar to the previous case: they handle differences in how a text is split into sentences. The case 1:3 was not used in the final configuration of the system, as this type did not occur in any significant number;
* 1:0 (deletion) and 0:1 (insertion) - a sentence in the source document is missing from the translation, or vice versa.</Paragraph>
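<Paragraph> The dynamic-programming search over these seven alignment types can be sketched as follows. This is a minimal illustration under assumptions, not the system's exact implementation: pair_score is a stand-in for the Model-1 based perplexity of section 2.2, and the flat insertion/deletion cost is a hypothetical placeholder.
```python
# The seven alignment types as (source sentences, target sentences).
ALIGN_TYPES = [(1, 1), (2, 2), (2, 1), (1, 2), (3, 1), (1, 0), (0, 1)]

def pair_score(src_chunk, tgt_chunk):
    """Similarity cost of a candidate pair (lower is better); a
    stand-in for the Model-1 based perplexity of section 2.2.
    The flat insertion/deletion cost is an assumed placeholder."""
    if not src_chunk or not tgt_chunk:
        return 5.0
    return 0.1 * abs(sum(map(len, src_chunk)) - sum(map(len, tgt_chunk)))

def align(src, tgt):
    """Find the Viterbi path over sentence alignments by dynamic
    programming on the (i, j) grid of sentence prefixes."""
    INF = float("inf")
    n, m = len(src), len(tgt)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj in ALIGN_TYPES:
                if i + di <= n and j + dj <= m:
                    c = cost[i][j] + pair_score(src[i:i + di], tgt[j:j + dj])
                    if c < cost[i + di][j + dj]:
                        cost[i + di][j + dj] = c
                        back[i + di][j + dj] = (di, dj)
    path, i, j = [], n, m          # backtrace the best path
    while i > 0 or j > 0:
        di, dj = back[i][j]
        path.append(((di, dj), src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    return list(reversed(path))
```
</Paragraph>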
<Paragraph position="1"> The deletion and insertion types are discarded, and the remaining types are extracted as potential parallel data. In general, one Chinese sentence corresponds to several English sentences. In (Zhao and Vogel, 2002), experiments on a 10-year XinHua news story collection from the Linguistic Data Consortium (LDC) show that alignment types like 2:1 and 3:1 are common, and this 7-type alignment scheme is shown to be reliable for English-Chinese sentence alignment. However, only a small part of the whole 10-year collection was pre-aligned (Ma, 1999) and extracted for sentence alignment.</Paragraph>
<Paragraph position="2"> The picture can be very different when mining data directly from the Internet. Due to the mismatch between the training data and the data collected from the Internet, the vocabulary coverage can be very low; the data is very noisy; and the aligned data is not strictly parallel. The percentage of insertion (0:1) and deletion (1:0) alignment types becomes very high, as shown in section 5.</Paragraph>
<Paragraph position="3"> The aligned sentence pairs are therefore subject to many alignment errors. These errors are not desired in the re-training of the system and need to be removed. Though the sentence alignment outputs a score from the Viterbi path for each aligned sentence pair, this score is only a rough estimate of the alignment quality. A more reliable re-scoring of the data is desirable: estimating the alignment quality in a post-processing step filters the errors and noise out of the aligned data.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="11" type="sub_section"> <SectionTitle> 2.2 Statistical Translation Lexicon </SectionTitle>
<Paragraph position="0"> We use the statistical translation lexicon known as IBM Model-1 (Brown et al., 1993) for both efficiency and simplicity. In our approach, Model-1 provides the conditional probability t(f|e) that a word f in the source language is the translation of a word e in the target language. This probability can be reliably estimated with the expectation-maximization (EM) algorithm (Brown et al., 1993), given training data consisting of parallel sentences $\{(F_i, E_i),\; i = 1 \ldots N\}$.</Paragraph>
<Paragraph position="1"> With the conditional probability t(f|e), the probability of an alignment of a foreign string $F = f_1 \ldots f_m$ given an English string $E = e_1 \ldots e_l$ (with the empty word $e_0$) is:
$$P(F|E) = \frac{1}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j|e_i) \quad (1)$$
The probability $P(F|E)$ is shown to achieve the global maximum under this EM framework, as stated in (Brown et al., 1993).</Paragraph>
<Paragraph position="2"> In our approach, equation (1) is further normalized so that the probabilities for different lengths of F are comparable at the word level:
$$\tilde{P}(F|E) = P(F|E)^{1/m} \quad (2)$$
</Paragraph>
<Paragraph position="3"> The alignment models described in (Brown et al., 1993) are all based on the notion that an alignment maps each source word to exactly one target word, which makes this type of alignment model asymmetric. Thus, by using the conditional probability t(e|f) of a translation lexicon trained from English (source) to Chinese (target), different aspects of the bilingual lexical information can be captured. A probability similar to (2) can be defined based on this reverse translation lexicon:
$$\tilde{P}(E|F) = P(E|F)^{1/l} \quad (3)$$
where $P(E|F)$ is defined analogously to equation (1) using t(e|f).</Paragraph>
<Paragraph position="4"> Starting from the Hong Kong news corpora provided by the LDC, we trained the translation lexicons used in the parallel sentence alignment. Each sentence pair has a perplexity, which is calculated as the minus log of the normalized probability, e.g., equation (2); a small computational sketch follows.</Paragraph>
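<Paragraph> To make equations (1)-(3) concrete, the following sketch computes this Model-1 based perplexity from a lexicon t(f|e). The dictionary representation of the lexicon and the smoothing floor for unseen word pairs are assumptions of the sketch, not details taken from the system.
```python
import math

def model1_perplexity(f_words, e_words, t, floor=1e-12):
    """Perplexity of a sentence pair under IBM Model-1: the minus log
    of the length-normalized probability of equation (2), i.e.
    -(1/m) * log P(F|E) with P(F|E) from equation (1).
    `t` maps a pair (f, e) to the lexicon probability t(f|e);
    `floor` smooths unseen pairs (an assumed value)."""
    e_with_null = ["NULL"] + list(e_words)   # Model-1 empty word e_0
    l, m = len(e_words), len(f_words)
    log_p = -m * math.log(l + 1)             # the 1/(l+1)^m factor
    for f in f_words:
        s = sum(t.get((f, e), 0.0) for e in e_with_null)
        log_p += math.log(max(s, floor))
    return -log_p / m

# PP2 of equation (3) is the same computation with the reverse
# lexicon t(e|f): model1_perplexity(e_words, f_words, t_reverse).
```
</Paragraph>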
</Section> </Section>
<Section position="4" start_page="11" end_page="11" type="metho"> <SectionTitle> 3 Alignment Models </SectionTitle>
<Paragraph position="0"> The alignment model aims to automatically predict the alignment quality scores of a bilingual sentence alignment program. By scoring the alignment quality of the sentence pairs, we can filter out mis-aligned sentence pairs and save our SMT system from being corrupted by mis-aligned data.</Paragraph>
<Section position="1" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.1 Lexicon Based Models </SectionTitle>
<Paragraph position="0"> It is necessary to include lexical features in the evaluation of alignment quality. One way is to use the translation lexicon based perplexity, as in our sentence alignment program. For each aligned sentence pair, the sentence alignment generates a score based solely on equation (2). Using this score alone, we can do a simple filtering by setting a perplexity threshold: sentence pairs with a perplexity higher than the threshold are removed. However, the perplexity based on (2) is not discriminative enough by itself to evaluate the quality of aligned sentence pairs.</Paragraph>
<Paragraph position="1"> Our experiments showed that the perplexity based on (3) has more discriminative power in judging the quality of aligned sentence pairs for Chinese-English sentence alignment; it is possible that equation (2) is more suitable for other language pairs. Both (2) and (3) are applied in our judgment of sentence alignment quality, as explained in section 4.</Paragraph>
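<Paragraph> A minimal sketch of such threshold filtering follows; the threshold values are hypothetical and would in practice be tuned on human-judged pairs.
```python
def filter_pairs(pairs, max_pp1=8.0, max_pp2=8.0):
    """Keep aligned pairs whose perplexities fall below thresholds.
    `pairs` is a list of (src, tgt, pp1, pp2) tuples; the threshold
    values here are hypothetical, not the system's settings."""
    return [(s, t) for (s, t, pp1, pp2) in pairs
            if pp1 <= max_pp1 and pp2 <= max_pp2]
```
</Paragraph>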
</Section>
<Section position="2" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.2 Sentence Length Models </SectionTitle>
<Paragraph position="0"> As shown in the sentence alignment literature (Gale and Church, 1993), the sentence length ratio is a very good indicator of alignment for languages from a similar family, such as French and English. For language pairs from very different families, such as Chinese and English, our experiments show that the sentence length ratio is still a good indicator of alignment quality.</Paragraph>
<Paragraph position="1"> For the Chinese-English language pair, sentence length can be defined in several ways. A Chinese sentence generally carries no word boundary information, so one way to define Chinese sentence length is to count the number of bytes in the sentence. Another way is to first segment the Chinese sentence into words (section 3.2.2) and count the number of words. For English sentences, we can similarly define the length in bytes or in words. The length ratio is assumed to follow a Gaussian distribution; its mean and variance are calculated from the parallel training corpus, which in our case is the Hong Kong parallel corpus with 290K parallel sentence pairs.</Paragraph>
<Paragraph position="2"> The Chinese word segmenter parses a Chinese string into words. Different word segmenters can generate different numbers of words for the same Chinese sentence, and many word segmenters are publicly available. In our experiments, we applied a two-pass strategy that segments words according to the LDC Chinese-English bilingual dictionary. The two passes run first from left to right and then from right to left, computing the maximum word frequency and selecting the single best segmentation path; a sketch of this procedure follows.</Paragraph>
<Paragraph position="3"> In general, sentence length is not sensitive to the segmenter used. For reliability, however, we want each segmented word to have an English translation, so we used the LDC bilingual dictionary as the reference word list for segmentation.</Paragraph>
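<Paragraph> The sketch below illustrates such a two-pass, dictionary-driven segmenter. The greedy maximal matching and the assumed maximum word length are simplifications of the frequency-scored path selection described above, not the exact procedure.
```python
def forward_match(sent, lexicon, max_len=4):
    """Greedy left-to-right maximal matching against the dictionary."""
    words, i = [], 0
    while i < len(sent):
        for k in range(min(max_len, len(sent) - i), 0, -1):
            if k == 1 or sent[i:i + k] in lexicon:
                words.append(sent[i:i + k])
                i += k
                break
    return words

def backward_match(sent, lexicon, max_len=4):
    """Greedy right-to-left maximal matching against the dictionary."""
    words, j = [], len(sent)
    while j > 0:
        for k in range(min(max_len, j), 0, -1):
            if k == 1 or sent[j - k:j] in lexicon:
                words.append(sent[j - k:j])
                j -= k
                break
    return list(reversed(words))

def segment(sent, lexicon, freq):
    """Two-pass segmentation: keep the path with the higher total
    word frequency (`freq` maps word -> corpus count)."""
    fwd, bwd = forward_match(sent, lexicon), backward_match(sent, lexicon)
    score = lambda ws: sum(freq.get(w, 0) for w in ws)
    return fwd if score(fwd) >= score(bwd) else bwd
```
</Paragraph>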
</Section>
<Section position="3" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.2.3 Sentence Length Model </SectionTitle>
<Paragraph position="0"> Assume that the alignment probability $P(A|s,t)$ depends only on the lengths of the source sentence s and the target sentence t:
$$P(A|s,t) = P(A \,|\, |s|, |t|)$$
where $|s|$ and $|t|$ are the sentence lengths of s and t. The length difference $d(|s|,|t|)$ is assumed to follow a Gaussian distribution (Gale and Church, 1993) and can be normalized as follows:
$$d(|s|,|t|) = \frac{|t| - c\,|s|}{\sqrt{|s|\,\sigma^2}}$$
where c is a constant indicating the mean length ratio between source and target sentences and $\sigma^2$ is the variance of the length ratios.</Paragraph>
<Paragraph position="1"> In our case, we applied the three length models described in Table 1:

Table 1: The three sentence length models
  L-1: English and Chinese sentences are both measured in bytes
  L-2: English and Chinese sentences are both measured in words
  L-3: the English sentence is measured in words and the Chinese sentence is measured in bytes

The mean and variance of the length ratios for each length model are calculated from the Hong Kong news parallel corpus. The statistics of the three sentence length models are shown in Table 2.

Table 2: Statistics of the three sentence length models
  Model   Mean   Variance
  L-1     1.59   3.82
  L-2     1.01   0.79
  L-3     0.33   0.71

In general, the smaller the variance, the better the sentence length model. From Table 2 we observe that the byte-based length ratio model L-1 has a significantly larger variance (3.82) than the other two models (L-2: 0.79, L-3: 0.71), which means L-1 is not as reliable as L-2 and L-3. L-2 and L-3 have similar variances, which indicates that measuring English sentences in words yields the smaller variance, while measuring Chinese sentences in bytes or words makes only a slight difference. This also indicates that the length model is not very sensitive to the Chinese word segmenter applied.</Paragraph>
<Paragraph position="2"> L-1, L-2, and L-3 capture the length relationship of parallel sentences from different views. Their modeling power overlaps, but they also complement each other in capturing the characteristics of good translations. A combination of these models can potentially bring further improvement, as shown in our experiments in section 6.</Paragraph> </Section> </Section>
<Section position="5" start_page="11" end_page="11" type="metho"> <SectionTitle> 4 Regression Model </SectionTitle>
<Paragraph position="0"> Rather than making a binary decision (classification) that an aligned sentence pair is either good or bad, regression gives a confidence score indicating how good the alignment is, thus offering more flexibility in decisions. Predicting alignment quality from the candidate models is treated as a regression problem in which the different scores are combined. There are many ways to combine the candidate models, such as genetic programming; regression is one of the most straightforward and efficient, so in this work we explored linear regression.</Paragraph>
<Section position="1" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.1 Candidate Models </SectionTitle>
<Paragraph position="0"> We have the five candidate models described in section 3:
* PP1, the perplexity based on the word pair conditional probability t(f|e) in equation (2);
* PP2, the perplexity based on the reverse word pair conditional probability t(e|f) in equation (3);
* L-1, the length ratio model measured in bytes (mean=1.59, var=3.82);
* L-2, the length ratio model measured in words (mean=1.01, var=0.79);
* L-3, the length ratio model in which the English sentence is measured in words and the Chinese sentence in bytes (mean=0.33, var=0.71).
These five models capture different aspects of the alignment quality of a sentence pair. The idea is to combine them to obtain a better prediction of alignment quality. Linear regression is applied to combine the five models; it is trained from the observations of the five models together with human quality judgments on a training set.</Paragraph> </Section>
<Section position="2" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.2 Regression Model Training </SectionTitle>
<Paragraph position="0"> A linear regression model finds the equation of the line that most nearly fits the given data (Hastie et al., 2001); that linear equation is then used to predict values for new data. Given human judgments of the translation quality of aligned sentence pairs, we can train a regression model over the five models described in section 4.1 under a least-squares objective. The human evaluation measures the translation quality of aligned pairs on a discrete 6-point scale from 0 to 5, where 1 means very bad and 5 a perfect translation; the score 0 was reserved for alignments that were not genuine translations, e.g., where both sentences were from the same language. We will use n for the total number of sentence pairs labeled by humans and used in training.</Paragraph>
<Paragraph position="1"> Let A = [PP1, PP2, L-1, L-2, L-3] be the machine-generated scores for the sentence pairs; in our case, A is a 5×n matrix. Let H = [Human-Judgment-Score] be the human evaluation of the sentence pairs on the 6-point scale; in our case, H is a 1×n matrix. In linear regression modeling, the linear transformation matrix W should satisfy the least-squares error criterion:
$$W^* = \arg\min_{W} \| H - WA \|^2$$
which we solve via singular value decomposition (SVD).</Paragraph>
<Paragraph position="2"> After W is calculated, the predicted score of the regression model is:
$$H' = WA$$
where H' is the final predicted alignment quality score of the regression model. We can also view H' as a weighted sum of the five models of section 4.1; the calculation of H' reduces to a linear weighted summation, which is very efficient to compute.</Paragraph>
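<Paragraph> Putting sections 3.2.3 and 4.2 together, the sketch below computes the normalized length difference, solves the least-squares criterion with an SVD-based pseudo-inverse, and predicts H' = WA. The NumPy calls are one way to realize this; the feature values and human scores in the usage example are invented for illustration, not the paper's data.
```python
import numpy as np

def length_feature(src_len, tgt_len, c, var):
    """Normalized length difference d(|s|,|t|) of section 3.2.3;
    used to produce the rows L-1, L-2, L-3 of the feature matrix."""
    return (tgt_len - c * src_len) / np.sqrt(src_len * var)

def train_regression(A, H):
    """Least-squares W minimizing ||H - W A||^2 for A (5 x n) and
    H (1 x n); np.linalg.pinv solves it via singular value
    decomposition."""
    return H @ np.linalg.pinv(A)

# Illustrative usage with toy feature rows [PP1, PP2, L-1, L-2, L-3]
# for n = 3 sentence pairs and toy human scores on the 0-5 scale:
A = np.array([[3.1, 7.9, 2.4],    # PP1
              [2.8, 8.4, 2.9],    # PP2
              [0.2, 2.5, 0.4],    # L-1, e.g. length_feature(.., 1.59, 3.82)
              [0.1, 1.9, 0.3],    # L-2
              [0.2, 2.2, 0.1]])   # L-3
H = np.array([[5.0, 1.0, 4.0]])
W = train_regression(A, H)        # W is 1 x 5, one weight per model
H_pred = W @ A                    # H' = WA, predicted quality scores
```
</Paragraph>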
</Section> </Section> </Paper>