<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0315">
  <Title>Efficient Optimization for Bilingual Sentence Alignment Based on Linear Regression</Title>
  <Section position="6" start_page="11" end_page="11" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> 1500 pairs of comparable html document pairs were obtained from bilingual web pages crawled from Internet. After preprocessing, filtering, and sentence alignment, the alignment types were distributed as shown in  From Table 3, we see the data is very noisy, containing a large portion of insertions (23.7%) and deletions (41.9%). This is very different from the LDC XinHua pre-aligned collection provided by LDC, which is relatively clean.</Paragraph>
    <Paragraph position="1"> For this set of English-Chinese bilingual sentences, we randomly selected 200 sentence pairs, focusing on Viterbi alignment scores below 12.0 from sentence alignment, which was an empirically determined threshold (The alignment scores here were purely reflecting the Model-1 parameters using equation (2)). Three human subjects then had to score the 'translation quality' of every sentence pair, using a 6 point scale described in section 4.2. We further excluded very short sentences from consideration and evaluated 168 remaining sentences.</Paragraph>
    <Paragraph position="2"> Pearson R correlation is applied to calculate the magnitude of the association between two variables (humanhuman or human-machine in our case) that are on an interval or ratio scale. The correlation coefficients (Pearson R) between human subjects were in Table 4 (all are statistically significant):  Overall, more than 2/3 of the human scores are identical or differ by only 1 (between subjects).</Paragraph>
    <Paragraph position="3"> For the automatic score prediction, the five component scores described in section 4.1 are used, which are then combined using a standard Linear Regression as described in section 4.2. Table 5 shows the correlation between alignment scores based on Model X and human subjects' predicted quality scores:  The data we used in our training of the lexicon is Hong Kong news parallel data from LDC. There are 290K parallel sentence pairs, with 7 million words of English and 7.3 million Chinese words after segmentation. The IBM Model-1 for PP-1 and PP-2 are both trained using 5 EM iterations. The other three length models are also calculated from the same 290K sentence pairs. Punctuation is removed before the calculation of all automatic score prediction models.</Paragraph>
    <Paragraph position="4"> The regression model here is the standard linear regression using the observations from three human subjects as described in section 4.1. The average performance of the regression model is shown in the bottom line of the above Table 5. The average correla- null tion varies from 0.53 upto 0.72, which shows that the regression model has a very strong positive correlation with the human judgment.</Paragraph>
    <Paragraph position="5"> Also from Table 5, we see both lexicon based models: PP-1 and PP-2 are better than the length models in term of correlation with human scorer. Model PP-2 has the largest correlation, and is slightly better than PP-1. PP-2 is based on the conditional probability of p(e|f), which models the generation of an English word from a Chinese word. The vocabulary size of Chinese is usually smaller than English vocabulary size, so this model can be more reliably estimated than the reverse direction of p(f|e). This explains why PP-2 is slightly better than PP-1.</Paragraph>
    <Paragraph position="6"> For sentence length models, we see L-2, for which the lengths of both the English sentence and the Chinese sentence are measured in words, has the best performance among the three settings of a sentence length model. This indicates that the length model measured in words is more reliable.</Paragraph>
    <Paragraph position="7"> Also shown in Table 5, the naive interpolation of these different models, i.e. just using each model with equal weight, resulted in lower correlation than the best single alignment model.</Paragraph>
    <Paragraph position="8"> We also performed correlation experiments with varied numbers of training sentences from either Human-1/Human-2/Human-3 or from all of the three human subjects. We picked the first 30/60/90/120 labeled sentence pairs for training and saved the last 48 sentence pairs for testing. The average performance of the regression model is as follows:  The average correlation of the regression models showed here increased noticeably when the training set was increased from 30 sentence pairs to 90 sentence pairs. More sentence pairs caused no or only marginal improvements (esp. for the third human subject).</Paragraph>
    <Paragraph position="9"> Figure 1 shows a scatter plot, which illustrates a good correlation (here: Pearson R=0.74) between our regression model predictors and the human scorers.</Paragraph>
  </Section>
class="xml-element"></Paper>