
<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1006">
  <Title>Improved Word Alignment Using a Symmetric Lexicon Model</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Word-aligned bilingual corpora are an important knowledge source for many tasks in natural language processing. Obvious applications are the extraction of bilingual word or phrase lexica (Melamed, 2000; Och and Ney, 2000). These applications depend heavily on the quality of the word alignment (Och and Ney, 2000). Word alignment models were first introduced in statistical machine translation (Brown et al., 1993). The alignment describes the mapping from source sentence words to target sentence words.</Paragraph>
    <Paragraph position="1"> Using the IBM translation models IBM-1 to IBM-5 (Brown et al., 1993), as well as the Hidden-Markov alignment model (Vogel et al., 1996), we can produce alignments of good quality. In (Och and Ney, 2003), it is shown that the statistical approach performs very well compared to alternative approaches, e.g. based on the Dice coefficient or the competitive linking algorithm (Melamed, 2000).</Paragraph>
    <Paragraph position="2"> A central component of the statistical translation models is the lexicon. It models the word translation probabilities. The standard training procedure of the statistical models uses the EM algorithm. Typically, the models are trained for one translation direction only.</Paragraph>
    <Paragraph position="3"> Here, we will perform a simultaneous training of both translation directions, source-to-target and target-to-source. After each iteration of the EM algorithm, we combine the two lexica to a symmetric lexicon. This symmetric lexicon is then used in the next iteration of the EM algorithm for both translation directions.</Paragraph>
    <Paragraph position="4"> We will propose and justify linear and loglinear interpolation methods.</Paragraph>
    <Paragraph position="5"> Statistical methods often suffer from the data sparseness problem. In our case, many words in the bilingual sentence-aligned texts are singletons, i.e. they occur only once. This is especially true for the highly inflected languages such as German. It is hard to obtain reliable estimations of the translation probabilities for these rarely occurring words. To overcome this problem (at least partially), we will smooth the lexicon probabilities of the full-form words using a probability distribution that is estimated using the word base forms. Thus, we exploit that multiple full-form words share the same base form and have similar meanings and translations.</Paragraph>
    <Paragraph position="6"> We will evaluate these methods on the German-English Verbmobil task and the French-English Canadian Hansards task. We will show statistically significant improvements compared to state-of-the-art results in (Och and Ney, 2003). On the Canadian Hansards task, the symmetrization methods will result in an improvement of more than 30% relative.</Paragraph>
  </Section>
</Paper>