<?xml version="1.0" standalone="yes"?>
<Paper uid="J93-1004">
  <Title>A Program for Aligning Sentences in Bilingual Corpora</Title>
  <Section position="7" start_page="88" end_page="89" type="concl">
    <SectionTitle>
8. Conclusions
</SectionTitle>
    <Paragraph position="0"> This paper has proposed a method for aligning sentences in a bilingual corpus, based on a simple probabilistic model, described in Section 3. The model was motivated by the observation that longer regions of text tend to have longer translations, and that shorter regions of text tend to have shorter translations. In particular, we found that the correlation between the length of a paragraph in characters and the length of its translation was extremely high (0.991). This high correlation suggests that length might be a strong clue for sentence alignment.</Paragraph>
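The correlation mentioned above can be illustrated with a minimal sketch. The character counts below are hypothetical values chosen for illustration, not the paper's data, and the helper name pearson is an assumption of this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical paragraph lengths in characters (source text and translation).
source_lengths = [120, 340, 85, 410, 230]
target_lengths = [131, 372, 90, 455, 248]

r = pearson(source_lengths, target_lengths)
print(f"correlation of paragraph lengths: {r:.3f}")
```

A value close to 1 on real paragraph pairs would support using length as an alignment clue, which is the observation the model rests on.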
    <Paragraph position="1"> Although this method is extremely simple, it is also quite accurate. Overall, there was a 4.2% error rate on 1316 alignments, averaged over both English-French and English-German data. In addition, we find that the probability score is a good predictor of accuracy, and consequently, it is possible to select a subset of 80% of the alignments with a much smaller error rate of only 0.7%.</Paragraph>
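Selecting a more reliable subset by score might look like the following sketch. It assumes that each alignment carries a numeric score for which smaller means more confident, and that a correctness judgment is available for evaluation; the data and the function names are hypothetical.

```python
def error_rate(alignments):
    """Fraction of (score, is_correct) alignments judged incorrect."""
    wrong = sum(1 for _, is_correct in alignments if not is_correct)
    return wrong / len(alignments)

def select_most_confident(alignments, fraction):
    """Keep the given fraction of alignments with the smallest scores."""
    ranked = sorted(alignments, key=lambda pair: pair[0])
    keep = int(len(ranked) * fraction)
    return ranked[:keep]

# Hypothetical (score, is_correct) pairs; smaller score = more confident.
scored = [
    (0.2, True), (0.5, True), (1.1, True), (1.4, True), (2.0, True),
    (2.3, True), (3.1, True), (4.0, True), (7.5, False), (9.2, True),
]

subset = select_most_confident(scored, 0.8)
print("error rate, all alignments:", error_rate(scored))
print("error rate, best 80%:", error_rate(subset))
```

If the score is a good predictor of accuracy, most errors fall among the alignments that are discarded, so the retained subset has a lower error rate than the full set.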
    <Paragraph position="2"> The method is also fairly language-independent. Both English-French and English-German data were processed using the same parameters. If necessary, it is possible to fit the six parameters in the model with language-specific values, though, thus far, we have not found it necessary to do so.</Paragraph>
    <Paragraph position="3">  Computational Linguistics Volume 19, Number 1 We have examined a number of variations. In particular, we found that it is better to use characters rather than words in counting sentence length. Apparently, the performance is better with characters because there is less variability in the differences of sentence lengths so measured. Using words as units increases the error rate by half, from 4.2% to 6.5%.</Paragraph>
    <Paragraph position="4"> In the future, we would hope to extend the method to make use of lexical constraints. However, it is remarkable just how well we can do without such constraints. We might advocate our simple character alignment procedure as a first pass, even to those who advocate the use of lexical constraints. Our procedure would complement a lexical approach quite well. Our method is quick but makes a few percent errors; a lexical approach is probably slower, though possibly more accurate. One might go with our approach when the scores are small, and back off to a lexical-based approach as necessary.</Paragraph>
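The two-pass strategy suggested above could be sketched as follows. The threshold value, the function names, and the lexical aligner are placeholders assumed for illustration; the paper does not specify them.

```python
def align_with_backoff(candidates, threshold, lexical_aligner):
    """Keep length-based alignments with small scores; re-align the rest.

    candidates: list of (score, alignment) pairs from the length-based pass.
    lexical_aligner: hypothetical callable standing in for a slower,
    lexically informed method.
    """
    results = []
    for score, alignment in candidates:
        if score > threshold:
            # Low confidence: back off to the lexical approach.
            results.append(("lexical", lexical_aligner(alignment)))
        else:
            # High confidence: accept the quick length-based result.
            results.append(("length", alignment))
    return results

# Hypothetical usage with a stand-in aligner that returns its input unchanged.
candidates = [(0.4, "s1:t1"), (8.9, "s2:t2,t3"), (1.2, "s3:t4")]
routed = align_with_backoff(candidates, threshold=5.0,
                            lexical_aligner=lambda alignment: alignment)
print(routed)
```

Under this arrangement the slower method runs only on the minority of regions where the length-based score signals doubt.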
  </Section>
</Paper>