<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0315"> <Title>Efficient Optimization for Bilingual Sentence Alignment Based on Linear Regression</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In many instances, multilingual natural language systems such as machine translation systems are developed and trained on parallel corpora. When faced with a different, unseen text genre, however, translation performance usually drops noticeably. One way to remedy this situation is to adapt and retrain the system parameters based on bilingual data from the same source or at least a closely related source. A bilingual sentence alignment program (Gale and Church, 1991; Brown et al., 1991) is the crucial part of this adaptation procedure: it collects bilingual document pairs from the Internet and identifies sentence pairs that have a high likelihood of being correct translations of each other. The set of identified bilingual parallel sentence pairs is then added to the training set for parameter reestimation. As is well known, text mined from the Internet is very noisy. Even after careful HTML parsing and filtering for text size and language, the text from comparable HTML page pairs still contains mismatches of content or non-parallel junk text, and the sentence order can differ too much for the sentences to be aligned. Together with a large mismatch of vocabulary, this means that the aligned sentence pairs extracted from these comparable HTML page pairs contain a number of alignments of low translation quality. These need to be removed before the MT system is retrained.</Paragraph> <Paragraph position="1"> In this paper, we present an approach to automatically optimizing the alignment scores of such a bilingual sentence alignment program. 
The alignment score is a combination (by linear regression) of two word translation lexicon scores and three sentence length scores, and it predicts the translation quality scores assigned by a set of human annotators. We also present experiments analyzing how many human scorers are needed for good prediction and how many sentence pairs each annotator should score.</Paragraph> <Paragraph position="2"> The paper is structured as follows: in section 2, the text mining system is briefly described. In section 3, five sentence alignment models based on lexical information and sentence length are explained. In section 4, a regression model is proposed to combine the five models to get further improvement in predicting alignment quality. We describe alignment experiments in section 5, focusing on the correlation between the alignment scores predicted by the sentence alignment models and by humans. Conclusions are given in section 6.</Paragraph> </Section> </Paper>
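<!--
Illustrative note: the introduction describes the alignment score as a linear
regression over five model scores (two lexicon scores, three sentence length
scores) fitted to human quality judgments. The sketch below shows one way such
a combination could be fitted with ordinary least squares. It is not the
authors' implementation: the function names, the synthetic data, the assumed
weights, and the choice to solve the normal equations directly are all
assumptions made for illustration.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] without mutating the inputs.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_regression(features, targets):
    """Return [bias, w1, ..., wk] minimizing the squared prediction error."""
    rows = [[1.0] + list(f) for f in features]  # prepend a bias column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, feature_vector):
    """Combine the model scores into one predicted quality score."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], feature_vector))


# Synthetic stand-in data (hypothetical): 30 sentence pairs, each with five
# model scores, and noisy "human" scores generated from assumed weights.
rng = random.Random(0)
assumed_weights = [0.5, 1.0, -2.0, 0.25, 3.0, -0.75]
model_scores = [[rng.random() for _ in range(5)] for _ in range(30)]
human_scores = [predict(assumed_weights, s) + rng.gauss(0.0, 0.05)
                for s in model_scores]

weights = fit_linear_regression(model_scores, human_scores)
print("fitted weights:", [round(w, 3) for w in weights])
```

In practice a library least squares routine would usually be preferred over
hand-written elimination; the explicit version is used here only to keep the
sketch free of external dependencies.
-->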