<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1005">
  <Title>Improving Statistical Word Alignment with a Rule-Based Machine Translation System</Title>
  <Section position="2" start_page="0" end_page="21" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Bilingual word alignment was first introduced as an intermediate result in statistical machine translation (SMT) (Brown et al. 1993). Besides being used in SMT, it is also used in translation lexicon building (Melamed 1996), transfer rule learning (Menezes and Richardson 2001), example-based machine translation (Somers 1999), etc. In previous alignment methods, some researchers modeled the alignments as hidden parameters in a statistical translation model (Brown et al. 1993; Och and Ney 2000) or modeled them directly given the sentence pairs (Cherry and Lin 2003).</Paragraph>
    <Paragraph position="1"> Some researchers used similarity and association measures to build alignment links (Ahrenberg et al. 1998; Tufis and Barbu 2002). In addition, Wu (1997) used a stochastic inversion transduction grammar to simultaneously parse the sentence pairs to get the word or phrase alignments.</Paragraph>
    <Paragraph position="2"> Generally speaking, there are four cases in word alignment: word to word alignment, word to multi-word alignment, multi-word to word alignment, and multi-word to multi-word alignment. One of the most difficult tasks in word alignment is to find the alignments that include multi-word units. For example, the statistical word alignment in IBM translation models (Brown et al. 1993) can only handle word to word and multi-word to word alignments.</Paragraph>
    <Paragraph position="3"> Some studies have been made to tackle this problem. Och and Ney (2000) performed translation in both directions (source to target and target to source) to extend word alignments. Their results showed that this method improved precision without loss of recall in English to German alignments. However, if the same unit is aligned to two different target units, this method is unlikely to make a selection. Some researchers used preprocessing steps to identify multi-word units for word alignment (Ahrenberg et al. 1998; Tiedemann 1999; Melamed 2000). These methods obtained multi-word candidates based on continuous N-gram statistics. Their main limitation is that they cannot handle separated phrases and low-frequency multi-word units. In order to handle all four cases in word alignment, our approach uses both the alignment information in statistical translation models and the translation information in a rule-based machine translation system. It includes three steps. (1) A statistical translation model is employed to perform word alignment in two directions (English to Chinese, Chinese to English). (2) A rule-based English to Chinese translation system is employed to obtain Chinese translations for each English word or phrase in the source language. (3) The translation information in step (2) is used to improve the word alignment results in step (1).</Paragraph>
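The three steps above can be sketched as a simple filtering rule: pool the candidate links produced by the two alignment directions and keep a link only when the rule-based system's translation of the English word matches the aligned Chinese word. This is an illustrative toy, not the paper's actual algorithm; the function and variable names are ours.

```python
# Toy sketch of using RBMT translations to select among candidate
# alignment links from the two directions (names are illustrative).

def refine_alignments(links_ec, links_ce, translations):
    """Keep a candidate (english, chinese) link only when the rule-based
    system's translation of the English word matches the aligned
    Chinese word."""
    candidates = links_ec | links_ce      # pool links from both directions
    kept = set()
    for en, zh in candidates:
        if zh in translations.get(en, set()):
            kept.add((en, zh))            # RBMT confirms this link
    return kept
```

Note how a link supported by only one direction survives if, and only if, the rule-based translation confirms it, which is the intuition behind using the RBMT output as a selection criterion.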
    <Paragraph position="4"> A critical reader may pose the question &amp;quot;why not use a translation dictionary to improve statistical word alignment?&amp;quot; (We use English-Chinese word alignment as a case study.) Compared with a translation dictionary, the advantages of a rule-based machine translation system lie in two aspects: (1) It can recognize multi-word units, particularly separated phrases, in the source language. Thus, our method is able to handle multi-word alignments with higher accuracy, as described in our experiments. (2) It can perform word sense disambiguation and select appropriate translations, while a translation dictionary can only list all translations for each word or phrase. Experimental results show that our approach improves word alignments in both precision and recall as compared with state-of-the-art technologies.</Paragraph>
    <Section position="1" start_page="2" end_page="21" type="sub_section">
      <SectionTitle>
Statistical Word Alignment
</SectionTitle>
      <Paragraph position="0"> Statistical translation models (Brown et al. 1993) only allow word to word and multi-word to word alignments. Thus, some multi-word units cannot be correctly aligned. In order to tackle this problem, we perform translation in two directions (English to Chinese and Chinese to English) as described in Och and Ney (2000). The GIZA++ toolkit is used to perform statistical alignment.</Paragraph>
      <Paragraph position="1"> Thus, for each sentence pair, we can get two alignment results. We use two sets to represent them: one with English as the source language and Chinese as the target language, and the other vice versa. For alignment links in both sets, we use i for English words and j for Chinese words.</Paragraph>
      <Paragraph position="2">  Here, the alignment value of a target position x records the index position(s) of the source word(s) aligned to the target word in position x. For example, if a Chinese word in position j is connected to an English word in position i, then its alignment value is i. If a Chinese word in position j is connected to English words in positions i and k, then its alignment value is the set of those positions.</Paragraph>
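The alignment values just described can be illustrated with a small example: for each Chinese position j we collect the set of English positions aligned to it. This is a toy illustration in our own notation, not the paper's.

```python
# Toy illustration of alignment values: for each Chinese position j,
# a[j] collects the English position(s) aligned to it.
links = [(0, 0), (1, 2), (1, 3)]   # (Chinese position j, English position i)

a = {}
for j, i in links:
    a.setdefault(j, set()).add(i)
# Chinese word 0 aligns to English word 0; Chinese word 1 aligns to
# English words 2 and 3 (a multi-word alignment).
```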
      <Paragraph position="4"> We call an element in the alignment set an alignment link. If the link includes a word that has no translation, we call it a null link. If k words have null links, we treat them as k different null links, not just one link.</Paragraph>
      <Paragraph position="5">  In the remainder of this paper, we will use the position number of a word to refer to the word. Based on the two alignment sets, we obtain their intersection set, union set and subtraction set.</Paragraph>
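Treating each direction's alignment links as a set of position pairs, the intersection, union and subtraction sets fall out of ordinary set operations. A minimal sketch with made-up link sets (the variable names are ours):

```python
# Toy example of combining the two direction-specific alignment sets;
# each link is an (English position i, Chinese position j) pair.
s_ec = {(0, 0), (1, 2), (2, 3)}   # English-to-Chinese alignment links
s_ce = {(0, 0), (1, 1), (2, 3)}   # Chinese-to-English alignment links

intersection = s_ec & s_ce        # links both directions agree on
union = s_ec | s_ce               # links found in either direction
sub_ec = s_ec - s_ce              # "subtraction" sets: links found in
sub_ce = s_ce - s_ec              # only one direction
```

The intersection is typically high-precision (both models agree), while the direction-specific remainders are the candidates that need further evidence, such as the RBMT translations used in this paper.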
    </Section>
  </Section>
</Paper>