File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/91/p91-1023_concl.xml
Size: 2,715 bytes
Last Modified: 2025-10-06 13:56:45
<?xml version="1.0" standalone="yes"?> <Paper uid="P91-1023"> <Title>A PROGRAM FOR ALIGNING SENTENCES IN BILINGUAL CORPORA</Title> <Section position="6" start_page="183" end_page="183" type="concl"> <SectionTitle> 7. Conclusions </SectionTitle> <Paragraph position="0"> This paper has proposed a method for aligning sentences in a bilingual corpus, based on a simple probabilistic model, described in Section 3. The model was motivated by the observation that longer regions of text tend to have longer translations, and that shorter regions of text tend to have shorter translations. In particular, we found that the correlation between the length of a paragraph in characters and the length of its translation was extremely high (0.991). This high correlation suggests that length might be a strong clue for sentence alignment.</Paragraph> <Paragraph position="1"> Although this method is extremely simple, it is also quite accurate. Overall, there was a 4.2% error rate on 1316 alignments, averaged over both English-French and English-German data. In addition, we find that the probability score is a good predictor of accuracy, and consequently, it is possible to select a subset of 80% of the alignments with a much smaller error rate of only 0.7%.</Paragraph> <Paragraph position="2"> The method is also fairly language-independent. Both English-French and English-German data were processed using the same parameters. If necessary, it is possible to fit the six parameters in the model with language-specific values, though, thus far, we have not found it necessary (or even helpful) to do so.</Paragraph> <Paragraph position="3"> We have examined a number of variations. In particular, we found that it is better to use characters rather than words in counting sentence length. Apparently, the performance is better with characters because there is less variability in the ratios of sentence lengths so measured. 
Using words as units increases the error rate by half, from 4.2% to 6.5%.</Paragraph> <Paragraph position="4"> In the future, we would hope to extend the method to make use of lexical constraints. However, it is remarkable just how well we can do without such constraints. We might advocate the simple character length alignment procedure as a useful first pass, even to those who advocate the use of lexical constraints. The character length procedure might complement a lexical constraint approach quite well, since it is quick but has some errors while a lexical approach is probably slower, though possibly more accurate. One might go with the character length procedure when the distance scores are small, and back off to a lexical approach as necessary.</Paragraph> </Section></Paper>