
<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1009">
  <Title>A Robust Cross-Style Bilingual Sentences Alignment Model</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Since bilingual corpora are a valuable resource for training statistical language models [Dagan, 91; Su et al., 95; Su and Chang, 99] and sentence alignment is the first step in most such tasks, many alignment approaches have been proposed in the literature [Brown, 91; Gale and Church, 93; Wu, 94; Vogel et al., 96; Och and Ney, 2000]. Most of these approaches use sentence length as the main feature for performing the alignment task.</Paragraph>
    <Paragraph position="1"> For example, Brown et al. (91) used the number of words as the alignment feature, while Gale and Church (93) claimed that better performance (a 5.8% error rate on an English-French corpus) can be achieved if the number of characters is adopted instead. As cognates are reliable cues for language pairs derived from the same family, Church (93) attacked this problem by additionally considering cognates. Because most of the reported work was carried out on Indo-European language pairs, Wu (94) tried both length and cognate features on the Hong Kong Hansard English-Chinese corpus to test performance on a non-Indo-European language, and a 7.9% error rate was reported. In addition, sentence alignment can be achieved indirectly via more complicated word-correspondence models [Brown et al., 93; Vogel et al., 96; Och and Ney, 2000]. Since those word-correspondence models achieve similar performance but are more complicated and run relatively slowly, they seem to be overkill for the task of aligning sentences and will not be discussed in this paper.</Paragraph>
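The length-based approaches surveyed above share a common skeleton: a dynamic-programming search over alignment "beads" (1-1, 1-0, 0-1, 2-1, 1-2) scored by how well the paired sentence lengths match. The sketch below is a minimal illustration in that spirit; it is not the published Gale-Church model, and the bead priors and cost function are invented stand-ins.

```python
import math

# Allowed alignment "beads": (source sentences consumed, target sentences
# consumed) -> a prior penalty. Values here are illustrative, not fitted.
BEADS = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 2.0, (1, 2): 2.0}

def length_cost(src_lens, tgt_lens, c=1.0):
    """Penalty for pairing chunks with these character lengths.

    c is the expected target/source length ratio (1.0 here for simplicity).
    """
    s, t = sum(src_lens), sum(tgt_lens)
    if s == 0 or t == 0:
        return 5.0  # flat cost for a deletion/insertion bead
    return abs(math.log(t / (c * s)))  # deviation from the expected ratio

def align(src, tgt):
    """Align two lists of sentence lengths; return the best bead sequence."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for (di, dj), prior in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = best[i][j] + prior + length_cost(src[i:ni], tgt[j:nj])
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (di, dj)
    beads, i, j = [], n, m  # trace the lowest-cost path back to (0, 0)
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((di, dj))
        i, j = i - di, j - dj
    return beads[::-1]
```

With three source sentences of lengths [20, 10, 10] and two target sentences of lengths [21, 19], the cheapest path merges the last two source sentences into a 2-1 bead, which is the kind of decision these models make from length alone.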
    <Paragraph position="2"> Although the length-based approaches mentioned above are simple and achieve good performance, they are usually trained and tested on text of the same style; they are therefore style-dependent approaches. Since performing supervised training for each style is not feasible in many applications, it would be interesting to know whether those length-based approaches can still achieve similar performance when tested on text whose style differs from that of the training corpus. An experiment was thus conducted in which the parameters were trained on a machinery technical manual and the performance was then tested on a general magazine (one introducing Taiwan to foreign visitors). It shows that the testing-set performance of the length-based model (with cognates considered) drops from 98.2% F-measure (tested in the same technical domain) to 85.6% (tested on the new general magazine). An investigation of those errors found that the length distribution and alignment-type distribution (used by those length-based approaches) vary significantly across texts of different styles (as shown in Tables 5.2 and 5.3), and that for non-Indo-European languages the cognate frequency1 drops greatly from a technical manual to a general magazine (as shown in Table 5.3).</Paragraph>
    <Paragraph position="3"> On the other hand, humans seldom use sentence length to align bilingual sentences; they do not align sentence pairs by counting the characters (or words) they contain. Instead, since a large percentage of the content words in the source text are translated into their translation-duals to preserve the meaning in the target text, transfer-lexicons are what humans usually rely on when performing the alignment task. To enhance robustness across different styles, transfer-lexicons are therefore integrated into the traditional sentence-length-based model in the robust statistical model proposed below. After integrating transfer-lexicons into the model, a 60% F-measure error reduction (from 14.4% to 5.8%) has been observed, which corresponds to improving the cross-style performance from 85.6% to 94.2% in F-measure.</Paragraph>
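To make the transfer-lexicon idea concrete, the sketch below scores a candidate sentence pair by mixing a length-ratio penalty with the fraction of source words whose translation-dual (looked up in a toy lexicon) occurs in the target sentence. The lexicon entries, weights, and scoring form are invented for illustration; they are not the paper's actual probabilistic model.

```python
import math

# Toy English->French transfer lexicon; entries are purely illustrative.
TRANSFER_LEXICON = {
    "model": {"modèle"},
    "sentence": {"phrase"},
    "alignment": {"alignement"},
    "robust": {"robuste"},
}

def match_rate(src_words, tgt_words):
    """Fraction of source words whose translation-dual occurs in the target."""
    tgt_set = set(tgt_words)
    hits = sum(1 for w in src_words if TRANSFER_LEXICON.get(w, set()) & tgt_set)
    return hits / max(len(src_words), 1)

def pair_score(src_words, tgt_words, w_len=1.0, w_lex=2.0):
    """Lower is better: a length-mismatch penalty minus a lexical reward.

    w_len and w_lex are illustrative weights; a real model would estimate
    how to combine the two features from training data.
    """
    s = len(" ".join(src_words))
    t = len(" ".join(tgt_words))
    length_pen = abs(math.log((t + 1) / (s + 1)))  # +1 guards empty chunks
    return w_len * length_pen - w_lex * match_rate(src_words, tgt_words)
```

Under this scoring, pairing ["robust", "alignment", "model"] with ["modèle", "alignement", "robuste"] beats pairing it with an unrelated short sentence: the lengths are similar in both features' favor, and every source word finds its translation-dual in the target.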
    <Paragraph position="4"> The details of the proposed robust model, the associated features extracted from the bilingual corpora, and the probabilistic scoring function are given in Section 2. Section 3 briefly mentions some implementation issues. The associated performance evaluation is given in Section 4, and Section 5 addresses error analysis and discusses the limitations of the proposed statistical model. Finally, the concluding remarks are given in the last section.
1 ... nouns (such as the company names IBM and HP, or technical terms such as IEEE-1394, etc.) that appear in the Chinese text. As they are most likely to be copied directly from the English sentence into the corresponding Chinese one, they are reliable cues.</Paragraph>
  </Section>
</Paper>