<?xml version="1.0" standalone="yes"?> <Paper uid="P91-1023"> <Title>A PROGRAM FOR ALIGNING SENTENCES IN BILINGUAL CORPORA</Title> <Section position="5" start_page="180" end_page="183" type="evalu"> <SectionTitle> 5. Evaluation </SectionTitle>
<Paragraph position="0"> To evaluate align, its results were compared with a human alignment. All of the UBS sentences were aligned by a primary judge, a native speaker of English with a reading knowledge of French and German. Two additional judges, a native speaker of French and a native speaker of German, respectively, were used to check the primary judge on 43 of the more difficult paragraphs containing 230 sentences (out of 118 total paragraphs with 725 sentences). Both of the additional judges were also fluent in English, having spent the last few years living and working in the United States, though they were both more comfortable with their native language than with English.</Paragraph>
<Paragraph position="1"> The materials were prepared in order to make the task somewhat less tedious for the judges. Each paragraph was printed in three columns, one for each of the three languages: English, French and German. Blank lines were inserted between sentences. The judges were asked to draw lines between matching sentences. The judges were also permitted to draw a line between a sentence and &quot;null&quot; if they thought that the sentence was not translated. For the purposes of this evaluation, two sentences were defined to &quot;match&quot; if they shared a common clause. (In a few cases, a pair of sentences shared only a phrase or a word, rather than a clause; these sentences did not count as a &quot;match&quot; for the purposes of this experiment.) After checking the primary judge against the other two judges, it was decided that the primary judge's results were sufficiently reliable that they could be used as a standard for evaluating the program. The primary judge made only two mistakes on the 43 hard paragraphs (one French mistake and one German mistake), whereas the program made 44 errors on the same materials.</Paragraph>
<Paragraph position="2"> Since the primary judge's error rate is so much lower than that of the program, we needn't be concerned with the primary judge's error rate. If the program and the judge disagree, we can assume that the program is probably wrong.</Paragraph>
<Paragraph position="3"> The 43 &quot;hard&quot; paragraphs were selected by looking for sentences that mapped to something other than themselves after going through both German and French. Specifically, for each English sentence we attempted to find the corresponding German sentences, then for each of them the corresponding French sentences, and then the corresponding English sentences, which should get us back to where we started. The 43 paragraphs included all sentences for which this process could not be completed around the loop. This relatively small group of paragraphs (23 percent of all paragraphs) contained a relatively large fraction of the program's errors (82 percent). Thus, this trilingual criterion does seem to succeed in distinguishing more difficult paragraphs from less difficult ones. There are three pairs of languages: English-German, English-French and French-German. We will report just the first two. (The third pair is probably dependent on the first two.)</Paragraph>
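The loop test just described can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: it assumes the three pairwise alignments are available as hypothetical dictionaries (e2g, g2f, f2e) mapping a sentence index in one language to the set of sentence indices it is aligned with in the other, and the function names are invented for this example.

    def maps_back_to_itself(e_idx, e2g, g2f, f2e):
        """Follow one English sentence through the German and then the French
        alignment and test whether the loop returns to the same sentence.
        Each of e2g, g2f, f2e maps a sentence index to the set of aligned
        indices (a hypothetical representation of the pairwise alignments)."""
        german = e2g.get(e_idx, set())
        french = set().union(*(g2f.get(g, set()) for g in german))
        english = set().union(*(f2e.get(f, set()) for f in french))
        # One reading of "mapped to something other than themselves":
        # the loop must lead back to exactly the starting sentence.
        return english == {e_idx}

    def is_hard(paragraph_sentence_indices, e2g, g2f, f2e):
        """A paragraph is flagged as hard if any of its English sentences
        fails the round-trip test."""
        return any(not maps_back_to_itself(i, e2g, g2f, f2e)
                   for i in paragraph_sentence_indices)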
<Paragraph position="4"> Errors are reported with respect to the judge's responses. That is, for each of the &quot;matches&quot; that the primary judge found, we report the program as correct if it found the &quot;match&quot; and incorrect if it didn't. This convention allows us to compare performance across different algorithms in a straightforward fashion.</Paragraph>
<Paragraph position="5"> The program made 36 errors out of 621 total alignments (5.8%) for English-French, and 19 errors out of 695 alignments (2.7%) for English-German. Overall, there were 55 errors out of a total of 1316 alignments (4.2%).</Paragraph>
<Paragraph position="6"> Table 6 breaks down the errors by category, illustrating that complex matches are more difficult. 1-1 alignments are by far the easiest. The 2-1 alignments, which come next, have four times the error rate of 1-1. The 2-2 alignments are harder still, but a majority of them are found. The 3-1 and 3-2 alignments are not even considered by the algorithm, so naturally all three are counted as errors. The most embarrassing category is 1-0, which was never handled correctly. In addition, when the algorithm assigns a sentence to the 1-0 category, it is also always wrong. Clearly, more work is needed to deal with the 1-0 category. It may be necessary to consider language-specific methods in order to deal adequately with this case.</Paragraph>
<Paragraph position="7"> We observe that the score is a good predictor of performance, and therefore the score can be used to extract a large subcorpus which has a much smaller error rate. By selecting the best-scoring 80% of the alignments, the error rate can be reduced from 4% to 0.7%. In general, we can trade off the size of the subcorpus against accuracy by setting a threshold and rejecting alignments with a score above this threshold. Figure 2 examines this trade-off in more detail: the horizontal axis shows the size of the subcorpus, and the vertical axis shows the corresponding error rate. An error rate of about 2/3% can be obtained by selecting a threshold that would retain approximately 80% of the corpus.</Paragraph>
<Paragraph position="8"> Less formal tests of the error rate in the Hansards suggest that the overall error rate is about 2%, while the error rate for the easy 80% of the sentences is about 0.4%. Apparently the Hansard translations are more literal than the UBS reports. It took 20 hours of real time on a Sun 4 to align 367 days of Hansards, or 3.3 minutes per Hansard-day. The 367 days of Hansards contain about 890,000 sentences, or about 37 million &quot;words&quot; (tokens). About half of the computer time is spent identifying tokens, sentences, and paragraphs, while the other half is spent in the align program itself.</Paragraph>
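As a concrete illustration of the score-based selection described above, the following sketch (again, not the authors' code) keeps the best-scoring fraction of proposed alignments. It assumes each alignment comes with the score produced by align and that lower scores are better, as implied by the rule of rejecting alignments whose score is above the threshold; the function name and interface are invented for this example.

    def select_subcorpus(alignments, scores, keep_fraction=0.8):
        """Keep the best-scoring fraction of alignments (lower score = better).
        Returns the retained alignments and the implied score threshold."""
        order = sorted(range(len(alignments)), key=lambda i: scores[i])
        n_keep = int(len(order) * keep_fraction)
        kept = [alignments[i] for i in order[:n_keep]]
        threshold = scores[order[n_keep - 1]] if n_keep else float("-inf")
        return kept, threshold

Retaining roughly the best-scoring 80% of the corpus in this way is what reduces the observed error rate from about 4% to about 0.7% on the UBS data.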
<Paragraph position="9"> 6. Measuring Length In Terms Of Words Rather than Characters
It is interesting to consider what happens if we change our definition of length to count words rather than characters. It might seem that words are a more natural linguistic unit than characters (Brown, Lai and Mercer, 1991). However, we have found that words do not perform nearly as well as characters. In fact, the &quot;words&quot; variation increases the number of errors dramatically (from 36 to 50 for English-French and from 19 to 35 for English-German). The total number of errors thereby increased from 55 to 85, or from 4.2% to 6.5%.</Paragraph>
<Paragraph position="10"> We believe that characters are better because there are more of them, and therefore there is less uncertainty. On the average, there are 117 characters per sentence (including white space) and only 17 words per sentence. Recall that we have modeled the variance as proportional to sentence length, V = s² l. Using the character data, we found previously that s² = 6.5. The same argument applied to words yields s² = 1.9. For comparison's sake, it is useful to consider the ratio √(V(m))/m (or, equivalently, s/√m), where m is the mean sentence length. We obtain √(V(m))/m ratios of 0.22 for characters and 0.33 for words, indicating that characters are less noisy than words, and are therefore more suitable for use in align.</Paragraph> </Section></Paper>
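The character-versus-word comparison above can be checked numerically. The following small calculation uses the rounded parameters quoted in the text (s² = 6.5 with m = 117 for characters; s² = 1.9 with m = 17 for words); because these inputs are rounded, the computed ratios may differ slightly from the published figures of 0.22 and 0.33.

    import math

    # Rounded parameters from the text: variance slope s^2 and mean
    # sentence length m under the two definitions of "length".
    params = {
        "characters": (6.5, 117.0),
        "words": (1.9, 17.0),
    }

    for unit, (s2, m) in params.items():
        # V(m) = s^2 * m, so sqrt(V(m)) / m = s / sqrt(m)
        ratio = math.sqrt(s2 / m)
        print(f"{unit}: s/sqrt(m) = {ratio:.2f}")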