<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1039">
  <Title>Chunk-based Statistical Translation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Word Alignment Based Statistical
Translation
</SectionTitle>
    <Paragraph position="0"> Word alignment based statistical translation represents bilingual correspondence by the notion of word alignment A, allowing one-to-many generation from each source word. Figure 1 illustrates an example of English and Japanese sentences, E and J, with sample word alignments. In this example, &amp;quot;show  Under this word alignment assumption, the translation model P(J|E) can be further decomposed without approximation.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2.1 IBM Model
</SectionTitle>
    <Paragraph position="0"> During the generation process from E to J, P(J,A|E) is assumed to be structured with a couple of processes, such as insertion, deletion and reorder. A scenario for the word alignment based translation model defined by Brown et al. (1993), for instance IBM Model 4, goes as follows (refer to Figure 2).</Paragraph>
    <Paragraph position="1">  1. Choose the number of words to generate for each source word according to the Fertility Model. For example, &amp;quot;show&amp;quot; was increased to 2 words, while &amp;quot;me&amp;quot; was deleted.</Paragraph>
    <Paragraph position="2"> 2. Insert NULLs at appropriate positions by the NULL Generation Model. Two NULLs were inserted after each &amp;quot;show&amp;quot; in Figure 2. 3. Translate word-by-word for each generated word by looking up the Lexicon Model. One of the two &amp;quot;show&amp;quot; words was translated to &amp;quot;mise.&amp;quot; 4. Reorder the translated words by referring to the Distortion Model. The word &amp;quot;mise&amp;quot; was re- null ordered to the 5th position, and &amp;quot;uindo&amp;quot; was reordered to the 1st position. Positioning is determined by the previous word's alignment to capture phrasal constraints.</Paragraph>
    <Paragraph position="3"> For the meanings of each symbol in each model, refer to Brown et al. (1993).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Problems of Word Alignment Based
Translation Model
</SectionTitle>
      <Paragraph position="0"> The strategy for the word alignment based translation model is to translate each word by generating multiple single words (a bag of words) and to determine the position of each translated word. Although show</Paragraph>
      <Paragraph position="2"> this procedure is sufficient to capture the bilingual correspondence for similar language pairs, some issues remain for drastically different pairs: Insertion/Deletion Modeling Although deletion was modeled in the Fertility Model, it merely assigns zero to each deleted word without considering context. Similarly, inserted words are selected by the Lexical Model parameter and inserted at the positions determined by a binomial distribution.</Paragraph>
      <Paragraph position="3"> This insertion/deletion scheme contributed to the simplicity of this representation of the translation processes, allowing a sophisticated application to run on an enormous bilingual sentence collection.</Paragraph>
      <Paragraph position="4"> However, it is apparent that the weak modeling of those phenomena will lead to inferior performance for language pairs such as Japanese and English.</Paragraph>
      <Paragraph position="5"> Local Alignment Modeling The IBM Model 4 (and 5) simulates phrasal constraints, although there were implicitly implemented as its Distortion Model parameters. In addition, the entire reordering is determined by a collection of local reorderings insufficient to capture the long-distance phrasal constraints. null The next section introduces an alternative modeling, chunk-based statistical translation, which was intended to resolve the above two issues.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="4" type="metho">
    <SectionTitle>
3 Chunk-based Statistical Translation
</SectionTitle>
    <Paragraph position="0"> Chunk-based statistical translation models the process of chunking for both the source and target sentences, E and J,</Paragraph>
    <Paragraph position="2"> where J and E are the chunked sentences for J and E, respectively, defined as two-dimentional ar- null rays. For instance, J i,j represents the jth word of the ith chunk. The number of chunks for source and target is assumed to be equal, |J |= |E|, so that each chunk can convey a unit of meaning without added/subtracted information. The term P(J,J,E|E) is further decomposed by chunk alignment A and word alignment for each chunk translationA. null  The notion of alignment A is the same as those found in the word alignment based translation model, which assigns a source chunk index for each target chunk. Ais a two-dimensional array which assigns a source word index for each target word per chunk. For example, Figure 3 shows two-level alignments taken from the example in Figure 1. The target chunk at position 3,J  , &amp;quot;mise tekudasai&amp;quot; is aligned to the first position (A  = 1), and both the words &amp;quot;mise&amp;quot; and &amp;quot;tekudasai&amp;quot; are aligned to the first position of the source sentence (A</Paragraph>
    <Paragraph position="4"> =1).</Paragraph>
    <Section position="1" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
3.1 Translation Model Structure
</SectionTitle>
      <Paragraph position="0"> The term P(J,J,A,A,E|E) is further decomposed with approximation according to the scenario de- null scribed below (refer to Figure 4).</Paragraph>
      <Paragraph position="1"> 1. Perform chunking for source sentence E by P(E|E). For instance, chunks of &amp;quot;show me&amp;quot; and &amp;quot;the one&amp;quot; were derived. The process is modeled by two steps: (a) Selection of chunk size (Head Model).</Paragraph>
      <Paragraph position="3"> with chunk size more than 0 (ph</Paragraph>
      <Paragraph position="5"> treated as a head word, otherwise a non-head (refer to the words in bold in Figure  4).</Paragraph>
      <Paragraph position="6"> (b) Associate each non-head word to a head word (Chunk Model). Each non-head</Paragraph>
      <Paragraph position="8"> )), where h is the position of a head word and c(E) is a function to map a word E to its word class (i.e. POS). For instance, &amp;quot;the  &amp;quot;is associated with the head word &amp;quot;one  &amp;quot; located at 4[?]3=+1.</Paragraph>
      <Paragraph position="9"> 2. Select words to be translated with Deletion and Fertility Model.</Paragraph>
      <Paragraph position="10"> (a) Select the number of head words. For each head word E h (ph h &gt;0), choose fertilityph h according to the Fertility Modeln(ph</Paragraph>
      <Paragraph position="12"> We assume that the head word must be translated, therefore ph</Paragraph>
      <Paragraph position="14"> one of them is selected as a head word at target position using a uniform distribu-</Paragraph>
      <Paragraph position="16"> (b) Delete some non-head words. For each non-head word E</Paragraph>
      <Paragraph position="18"> delete it according to the Deletion Model</Paragraph>
      <Paragraph position="20"> is the head word in the same chunk and d</Paragraph>
      <Paragraph position="22"> is deleted, otherwise 0.</Paragraph>
      <Paragraph position="23"> 3. Insert some words. In Figure 4, NULLs were inserted for two chunks. For each chunk E</Paragraph>
      <Paragraph position="25"> including spurious words, is translated to J</Paragraph>
      <Paragraph position="27"> 5. Reorder words. Each word in a chunk is reordered according to the Reorder Model</Paragraph>
      <Paragraph position="29"> ). The chunk reordering is taken after the Distortion Model of IBM Model 4, where the position is determined by the relative position from the head word,  6. Reorder chunks. All of the chunks are  reordered according to the Chunk Reorder Model, P(A|E,J). The chunk reordering is also similar to the Distortion Model, where the positioning is determined by the relative position from the previous alignment  that the reordering is dependent on head words. To summarize, the chunk-based translation model can be formulated as</Paragraph>
      <Paragraph position="31"/>
    </Section>
    <Section position="2" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.2 Characteristics of chunk-based Translation
Model
</SectionTitle>
      <Paragraph position="0"> The main difference to the word alignment based translation model is the treatment of the bag of word translations. The word alignment based translation model generates a bag of words for each source word, while the chunk-based model constructs a set of target words from a set of source words. The behavior is modeled as a chunking procedure by first associating words to the head word of its chunk and then performing chunk-wise translation/insertion/deletion. null The complicated word alignment is handled by the determination of word positions in two stages: translation of chunk and chunk reordering. The former structures local orderings while the latter constitutes global orderings. In addition, the concept of head associated with each chunk plays the central role in constraining different levels of the reordering by the relative positions from heads.</Paragraph>
    </Section>
    <Section position="3" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.3 Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> The parameter estimation for the chunk-based translation model relies on the EM-algorithm (Dempster et al., 1977). Given a large bilingual corpus the conditional probability of P(J,A,A,E|J,E) =</Paragraph>
      <Paragraph position="2"> first estimated for each pair of J and E (E-step), then each model parameters is computed based on the estimated conditional probability (M-step).</Paragraph>
      <Paragraph position="3"> The above procedure is iterated until the set of parameters converge.</Paragraph>
      <Paragraph position="4"> However, this naive algorithm will suffer from severe computational problems. The enumeration of all possible chunkingsJ andEtogether with word alignmentAand chunk alignment A requires a significant amount of computation. Therefore, we have introduced a variation of the Inside-Outside algorithm as seen in (Yamada and Knight, 2001) for E-step computation. The details of the procedure are described in Appendix A.</Paragraph>
      <Paragraph position="5"> In addition to the computational problem, there exists a local-maximum problem, where the EM-Algorithm converges to a maximum solution but does not guarantee finding the global maximum. In order to solve this problem and to make the parameters converge quickly, IBM Model 4 parameters were used as the initial parameters for training. We directly applied the Lexicon Model and Fertility Model to the chunk-based translation model but set other parameters as uniform.</Paragraph>
    </Section>
    <Section position="4" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.4 Decoding
</SectionTitle>
      <Paragraph position="0"> The decoding algorithm employed for this chunk-based statistical translation is based on the beam search algorithm for word alignment statistical translation presented in (Tillmann and Ney, 2000), which generates outputs in left-to-right order by consuming input in an arbitrary order.</Paragraph>
      <Paragraph position="1"> The decoder consists of two stages:  1. Generate possible output chunks for all possible input chunks.</Paragraph>
      <Paragraph position="2"> 2. Generate hypothesized output by consuming  input chunks in arbitrary order and combining possible output chunks in left-to-right order. The generation of possible output chunks is estimated through an inverted lexicon model and sequences of inserted strings (Tillmann and Ney, 2000). In addition, an example-based method is also introduced, which generates candidate chunks by looking up the viterbi chunking and alignment from a training corpus.</Paragraph>
      <Paragraph position="3"> Since the combination of all possible chunks is computationally very expensive, we have introduced the following pruning and scoring strategies. beam pruning: Since the search space is enormous, we have set up a size threshold to maintain partial hypotheses for both of the above two stages. We also incorporated a threshold for scoring, which allows partial hypotheses with a certain score to be processed.</Paragraph>
      <Paragraph position="4"> example-based scoring: Input/output chunk pairs that appeared in a training corpus are &amp;quot;rewarded&amp;quot; so that they are more likely kept in the beam. During the decoding process, when a pair of chunks appeared in the first stage, the score is boosted by using this formula in the log  appearing in the training corpus, and weight is a tuning parameter.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>