
<?xml version="1.0" standalone="yes"?>
<Paper uid="J93-1004">
  <Title>A Program for Aligning Sentences in Bilingual Corpora</Title>
  <Section position="6" start_page="83" end_page="88" type="evalu">
    <SectionTitle>
6. Evaluation
</SectionTitle>
    <Paragraph position="0"> To evaluate align, its results were compared with a human alignment. All of the UBS sentences were aligned by a primary judge, a native speaker of English with a reading knowledge of French and German. Two additional judges, a native speaker of French and a native speaker of German, respectively, were used to check the primary judge on 43 of the more difficult paragraphs having 230 sentences (out of 118 total paragraphs with 725 sentences). Both of the additional judges were also fluent in English, having spent the last few years living and working in the United States, though they were both more comfortable with their native language than with English.</Paragraph>
    <Paragraph position="1"> The materials were prepared in order to make the task somewhat less tedious for the judges. Each paragraph was printed in three columns, one for each of the three languages: English, French, and German. Blank lines were inserted between sentences.</Paragraph>
    <Paragraph position="2"> The judges were asked to draw lines between matching sentences. The judges were also permitted to draw a line between a sentence and &amp;quot;null&amp;quot; if they thought that the sentence was not translated. For the purposes of this evaluation, two sentences were defined to &amp;quot;match&amp;quot; if they shared a common clause. (In a few cases, a pair of sentences shared only a phrase or a word, rather than a clause; these sentences did not count as a &amp;quot;match&amp;quot; for the purposes of this experiment.) After checking the primary judge with the other two judges, it was decided that the primary judge's results were sufficiently reliable that they could be used as a standard for evaluating the program. The primary judge made only two mistakes on the 43 hard paragraphs (one French mistake and one German mistake), whereas the program made 44 errors on the same materials. Since the primary judge's error rate is so much lower than that of the program, it was decided that we needn't be concerned with the primary judge's error rate. If the program and the judge disagree, we can assume that the program is probably wrong.</Paragraph>
    <Paragraph position="3"> The 43 &amp;quot;hard&amp;quot; paragraphs were selected by looking for sentences that mapped to something other than themselves after going through both German and French.</Paragraph>
    <Paragraph position="4">  Specifically, for each English sentence, we attempted to find the corresponding German sentences, and then for each of them, we attempted to find the corresponding French sentences, and then we attempted to find the corresponding English sentences, which should hopefully get us back to where we started. The 43 paragraphs included all sentences in which this process could not be completed around the loop. This relatively small group of paragraphs (23% of all paragraphs) contained a relatively large fraction of the program's errors (82%). Thus, there seems to be some verification that this trilingual criterion does in fact succeed in distinguishing more difficult paragraphs from less difficult ones.</Paragraph>
    <Paragraph position="5"> There are three pairs of languages: English-German, English-French, and French-German. We will report on just the first two. (The third pair is probably dependent on the first two.) Errors are reported with respect to the judge's responses. That is, for each of the &amp;quot;matches&amp;quot; that the primary judge found, we report the program as correct if it found the &amp;quot;match&amp;quot; and incorrect if it didn't. This procedure is better than comparing on the basis of alignments proposed by the algorithm for two reasons.</Paragraph>
    <Paragraph position="6"> First, it makes the trial &amp;quot;blind,&amp;quot; that is, the judge does not know the algorithm's result when judging. Second, it allows comparison of results for different algorithms on a common basis.</Paragraph>
    <Paragraph position="7"> The program made 36 errors out of 621 total alignments (5.8%) for English-French and 19 errors out of 695 (2.7%) alignments for English-German. Overall, there were 55 errors out of a total of 1316 alignments (4.2%). The higher error rate for English-French alignments may result from the German being the original, so that the English and German differ by one translation, while the English and French differ by two translations.</Paragraph>
    <Paragraph position="8"> Table 6 breaks down the errors by category, illustrating that complex matches are more difficult. 1-1 alignments are by far the easiest. The 2-1 alignments, which come next, have four times the error rate for 1-1. The 2-2 alignments are harder still, but a majority of the alignments are found. The 3-1 and 3-2 alignments are not even considered by the algorithm, so naturally all three instances of these are counted as errors. The most embarrassing category is 1-0, which was never handled correctly. In addition, when the algorithm assigns a sentence to the 1-0 category, it is also always wrong. Clearly, more work is needed to deal with the 1-0 category. It may be necessary to consider language-specific methods in order to deal adequately with this case.</Paragraph>
    <Paragraph position="9"> Since the algorithm achieves substantially better performance on the 1-1 regions, one interpretation of these results is that the overall low error rate is due to the high frequency of 1-1 alignments in English-French and English-German translations.</Paragraph>
    <Paragraph position="10">  Translations to linguistically more different languages, such as Hebrew or Japanese, might encounter a higher proportion of hard matches.</Paragraph>
    <Paragraph position="11"> We investigated the possible dependence of the error rate on four variables:  1. Sentence Length 2. Paragraph Length 3. Category Type 4. Distance Measure.</Paragraph>
    <Paragraph position="12">  We used logistic regression (Hosmer and Lemeshow 1989) to see how well each of the four variables predicted the errors. The coefficients and their standard deviations are shown in Table 7. Apparently, the distance measure is the most useful predictor, as indicated by the last column. In fact, none of the other three factors was found to contribute significantly beyond the effect of the distance measure, indicating that the distance measure is already doing an excellent job, and we should not expect much improvement if we were to try to augment the measure to take these additional factors into account.</Paragraph>
    <Paragraph position="13"> The fact that the score is such a good predictor of performance can be used to extract a large subcorpus that has a much smaller error rate. By selecting the best scoring 80% of the alignments, the error rate can be reduced from 4% to 0.7%. In general, we can trade off the size of the subcorpus and the accuracy by setting a threshold, and rejecting alignments with a score above this threshold. Figure 4 examines this trade-off in more detail.</Paragraph>
    <Paragraph position="14"> Less formal tests of the error rate in the Hansards suggest that the overall error rate is about 2%, while the error rate for the easy 80% of the sentences is about 0.4%. Apparently the Hansard translations are more literal than the UBS reports. It took 20 hours of real time on a sun 4 to align 367 days of Hansards, or 3.3 minutes per Hansard-day. The 367 days of Hansards contained about 890,000 sentences or about 37 million &amp;quot;words&amp;quot; (tokens). About half of the computer time is spent identifying tokens, sentences, and paragraphs, and about half of the time is spent in the align program itself.</Paragraph>
    <Paragraph position="15"> The overall error, 4.2%, that we get on the UBS corpus is considerably higher than the 0.6% error reported by Brown, Lai, and Mercer (1991). However, a direct comparison is misleading because of the differences in corpora and the differences in sampling. We have observed that the Hansards are much easier than the UBS. Our error rate drops by about 50% in that case. Aligning the UBS French and English texts is more difficult than aligning the English and German, because the French and English  Extracting a subcorpus with lower error rate. The fact that the score is such a good predictor of performance can be used to extract a large subcorpus that has a much smaller error rate. In general, we can trade off the size of the subcorpus and the accuracy by setting a threshold and rejecting alignments with a score above this threshold. The horizontal axis shows the size of the subcorpus, and the vertical axis shows the corresponding error rate. An error rate of about 2/3% can be obtained by selecting a threshold that would retain approximately 80% of the corpus.</Paragraph>
    <Paragraph position="16"> versions are separated by two translations, both being translations of the German original. In addition, IBM samples only the 1-1 alignments, which are much easier than any other category, as one can see from Table 6.</Paragraph>
    <Paragraph position="17"> Given these differences in testing methodology as well as the differences in the algorithms, we find the methods giving broadly similar results. Both methods give results with sufficient accuracy to use the resulting alignments, or selected portions thereof, for acquisition of lexical information. And neither method achieves human accuracy on the task. (Note that one difference between their method and ours is that they never find 2-2 alignments. This would give their method a minimum overall error rate of 1.4% on the UBS corpus, three times the human error rate on hard paragraphs.) We conclude that a sentence alignment method that achieves human accuracy will  need to have lexical information available to it.</Paragraph>
    <Paragraph position="18"> 7. Variations and Extensions</Paragraph>
    <Section position="1" start_page="85" end_page="87" type="sub_section">
      <SectionTitle>
7.1 Measuring Length in Terms Of Words Rather than Characters
</SectionTitle>
      <Paragraph position="0"> It is interesting to consider what happens if we change our definition of length to count words rather than characters. It might seem that a word is a more natural linguistic unit than a character. However, we have found that words do not perform as well as  Computational Linguistics Volume 19, Number 1 characters. In fact, the &amp;quot;words&amp;quot; variation increases the number of errors dramatically (from 36 to 50 for English-French and from 19 to 35 for English-German). The total errors were thereby increased from 55 to 85, or from 4.2% to 6.5%.</Paragraph>
      <Paragraph position="1"> We believe that characters are better because there are more of them, and therefore there is less uncertainty. On the average, there are 117 characters per sentence (including white space) and only 17 words per sentence. Recall that we have modeled variance as proportional to sentence length, V(I) = s21. Using the character data, we found previously that s 2 ~ 6.5. The same argument applied to words yields s 2 ~ 1.9. For comparison's sake, it is useful to consider the ratio of x/-V~/m (or equivalently, s/x/-m), where m is the mean sentence length. We obtain x/V(m)/m ratios of 0.22 for characters and 0.33 for words, indicating that characters are less noisy than words, and are therefore more suitable for use in align.</Paragraph>
      <Paragraph position="2"> Although Brown, Lai, and Mercer (1991) used lengths measured in words, comparisons of error rates between our work and theirs will not test whether characters or words are more useful. As set out in the previous section, there are numerous differences in testing methodology and materials. Furthermore, there are apparently many differences between the IBM algorithm and ours other than the units of measurement, which could also account for any difference on performance. Appropriate methodology is to compare methods with only one factor varying, as we do here.</Paragraph>
    </Section>
    <Section position="2" start_page="87" end_page="87" type="sub_section">
      <SectionTitle>
7.2 Ignoring Paragraph Boundaries
</SectionTitle>
      <Paragraph position="0"> Recall that align is a two-step process. First, paragraph boundaries are identified and then sentences are aligned within paragraphs. We considered eliminating the first step and found a threefold degradation in performance. The English-French errors were increased from 36 to 84, and the English-German errors from 19 to 86. The overall errors were increased from 55 to 170. Thus the two-step approach reduces errors by a factor of three. It is possible that performance might be improved further still by introducing additional alignment steps at the clause and/or phrase levels, but testing this hypothesis would require access to robust parsing technology.</Paragraph>
    </Section>
    <Section position="3" start_page="87" end_page="87" type="sub_section">
      <SectionTitle>
7.3 Adding a 2-2 Category
</SectionTitle>
      <Paragraph position="0"> The original version of the program did not consider the category of 2-2 alignments.</Paragraph>
      <Paragraph position="1"> Table 6 shows that the program was right on 10 of 15 actual 2-2 alignments. This was achieved at the cost of introducing 2 spurious 2-2 alignments. Thus in 12 tries, the program was right 10 times, wrong 2 times. This is significantly better than chance, since there is less than 1% chance of getting 10 or more heads out of 12 flips of a fair coin. Thus it is worthwhile to include the 2-2 alignment possibility.</Paragraph>
    </Section>
    <Section position="4" start_page="87" end_page="88" type="sub_section">
      <SectionTitle>
7.4 Using More Accurate Parameter Estimates
</SectionTitle>
      <Paragraph position="0"> When we discussed the estimation of the model parameters, c and s 2, we mentioned that it is possible to fit the parameters more accurately if we estimate different values for each language pair, but that doing so did not seem to increase performance by very much. In fact, we found exactly the same total number of errors, although the errors are slightly different. Changing the parameters resulted in four changes to the output for English-French (two right and two wrong), and two changes to the output for English-German (one right and one wrong). Since it is more convenient to use language-independent parameter values, and doing so doesn't seem to hurt performance very much (if at all), we have decided to adopt the language-independent values.</Paragraph>
      <Paragraph position="1">  William A. Gale and Kenneth W. Church Program for Aligning Sentences</Paragraph>
    </Section>
    <Section position="5" start_page="88" end_page="88" type="sub_section">
      <SectionTitle>
7.5 Extensions
</SectionTitle>
      <Paragraph position="0"> 7.5.1 Hard and Soft Boundaries. Recall that we rejected one of the French documents because one paragraph was omitted and two paragraphs were duplicated. We could have handled this case if we had employed a more powerful paragraph alignment algorithm. In fact, in aligning the Canadian Hansards, we found that it was necessary to do something more elaborate than we did for the UBS data. We decided to use more or less the same procedure for aligning paragraphs within a document as the procedure that we used for aligning sentences within a paragraph. Let us introduce the distinction between hard and soft delimiters. The alignment program is defined to move soft delimiters as necessary within the constraints of the hard delimiters.</Paragraph>
      <Paragraph position="1"> Hard delimiters cannot be modified, and there must be equal numbers of them. When aligning sentences within a paragraph, the program considers paragraph boundaries to be &amp;quot;hard&amp;quot; and sentence boundaries to be &amp;quot;soft.&amp;quot; When aligning paragraphs within a document, the program considers document boundaries to be &amp;quot;hard&amp;quot; and paragraph boundaries to be &amp;quot;soft.&amp;quot; This entension has been incorporated into the implementation presented in the appendix.</Paragraph>
      <Paragraph position="2"> 7.5.2 Augmenting the Dictionary Function to Consider Words. Many alternative alignment procedures such as Kay and R6scheisen (unpublished) make use of words. It ought to help to know that the English string &amp;quot;house&amp;quot; and the French string &amp;quot;maison&amp;quot; are likely to correspond. Dates and numbers are perhaps an even more extreme example. It really ought to help to know that the English string &amp;quot;1988&amp;quot; and the French string &amp;quot;1988&amp;quot; are likely to correspond. We are currently exploring ways to integrate these kinds of clues into the framework described above. However, at present, the algorithm does not have access to lexical constraints, which are clearly very important. We expect that once these clues are properly integrated, the program will achieve performance comparable to that of the primary judge. However, we are still not convinced that it is necessary to process these lexical clues, since the current performance is sufficient for many applications, such as building a probabilistic dictionary. It is remarkable just how well we can do without lexical constraints. Adding lexical constraints might slow down the program and make it less useful as a first pass.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>