<?xml version="1.0" standalone="yes"?> <Paper uid="J99-4005"> <Title>Squibs and Discussions Decoding Complexity in Word-Replacement Translation Models</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. Part-of-Speech Tagging </SectionTitle> <Paragraph position="0"> The prototype source-channel application in natural language is part-of-speech tagging (Church 1988). We review it here for purposes of comparison with machine translation.</Paragraph> <Paragraph position="1"> Source strings comprise sequences of part-of-speech tags like noun, verb, etc. A simple source model assigns a probability to a tag sequence t1...tm based on the probabilities of the tag pairs inside it. Target strings are English sentences, e.g., w1...wm. The channel model assumes each tag is probabilistically replaced by a word (e.g., noun by dog) without considering context. More concretely, we have a bigram table b(ti|ti-1) for the source model and a substitution table s(wi|ti) for the channel model. We can assign parts of speech to a previously unseen word sequence w1...wm by finding the sequence t1...tm that maximizes P(t1...tm|w1...wm). By Bayes' rule, we can equivalently maximize P(t1...tm) · P(w1...wm|t1...tm), which we can calculate directly from the b and s tables above.</Paragraph> <Paragraph position="2"> Three interesting complexity problems in the source-channel framework are: (1) given annotated training data, can the model tables be built efficiently? (2) given a trained model and a target string, can the best source string be found efficiently? (3) given only unannotated training data, can the channel parameters be estimated efficiently? The first problem is solved in O(m) time for part-of-speech tagging--we simply count tag pairs and word/tag pairs, then normalize. The second problem seems to require enumerating all O(v^m) potential source sequences to find the best, but can actually be solved in O(mv^2) time with dynamic programming. We turn to the third problem in the context of another application: cryptanalysis.</Paragraph> </Section> <Section position="4" start_page="0" end_page="608" type="metho"> <SectionTitle> 3.
Substitution Ciphers </SectionTitle> <Paragraph position="0"> In a substitution cipher, a plaintext message like HELLO WORLD is transformed into a ciphertext message like EOPPX YXAPF via a fixed letter-substitution table. As with tagging, we can assume an alphabet of v source tokens, a bigram source model, a substitution channel model, and an m-token coded text.</Paragraph> <Paragraph position="1"> If the coded text is annotated with corresponding English, then building source and channel models is trivially O(m). Comparing the situation to part-of-speech tagging: * (Bad news.) Cryptanalysts rarely get such coded/decoded text pairs and must employ &quot;ciphertext-only&quot; attacks using unannotated training data. * (Good news.) It is easy to train a source model separately, on raw unannotated English text that is unconnected to the ciphertext.</Paragraph> <Paragraph position="2"> Then the problem becomes one of acquiring a channel model, i.e., a table s(f|e) with an entry for each code-letter/plaintext-letter pair. Starting with an initially uniform table, we can use the expectation-maximization (EM) algorithm to iteratively revise s(f|e) so as to increase the probability of the observed corpus P(f). Figure 1 shows a naive EM implementation that runs in O(mv^m) time. There is an efficient O(mv^2) EM implementation based on dynamic programming that accomplishes the same thing.</Paragraph> <Paragraph position="3"> Once the s(f|e) table has been learned, there is a similar O(mv^2) algorithm for optimal decoding. Such methods can break English letter-substitution ciphers of moderate size.</Paragraph> <Section position="1" start_page="608" end_page="608" type="sub_section"> <SectionTitle> Knight Decoding Complexity </SectionTitle> <Paragraph position="0"> Given coded text f of length m, a plaintext vocabulary of v tokens, and a source model b: 1. set the s(f|e) table initially to be uniform 2.
for several iterations do: a. set up a count table c(f|e) with zero entries b. compute P(f) as the sum, over all possible source texts e1...em (each ei drawn from the plaintext vocabulary), of P(e1...em) · the product for j = 1..m of s(fj|ej) c. for all source texts e of length m: compute P(e|f) = P(e) · P(f|e) / P(f), and for j = 1 to m, add P(e|f) to c(fj|ej) d. normalize the c(f|e) table to create a revised s(f|e)</Paragraph> <Paragraph position="1"> Figure 1 A naive application of the EM algorithm to break a substitution cipher. It runs in O(mv^m) time.</Paragraph> </Section> </Section> <Section position="5" start_page="608" end_page="612" type="metho"> <SectionTitle> 4. Machine Translation </SectionTitle> <Paragraph position="0"> In our discussion of substitution ciphers, we were on relatively sure ground: the channel model we assumed in decoding is actually the same one used by the cipher writer for encoding. That is, we know that plaintext is converted to ciphertext, letter by letter, according to some table. We have no such clear conception about how English gets converted to French, although many theories exist. Brown et al. (1993) recently cast some simple theories into a source-channel framework, using the bilingual Canadian parliament proceedings as training data. We may assume: * v total English words.</Paragraph> <Paragraph position="1"> * A bigram source model with v^2 parameters.</Paragraph> <Paragraph position="2"> * Various substitution/permutation channel models.</Paragraph> <Paragraph position="3"> * A collection of bilingual sentence pairs (sentence lengths <= m).</Paragraph> <Paragraph position="4"> * A collection of monolingual French sentences (sentence lengths <= m). Bilingual texts seem to exhibit English words getting substituted with French ones, though not one-for-one and not without changing their order.
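Sections 2 and 3 both invoke an O(mv^2) dynamic-programming decoder over a bigram source model b and a substitution channel s. A minimal Python sketch of that Viterbi-style decoder follows; the table layout (dicts keyed by token pairs) and the BOUNDARY token are our illustrative assumptions, not the paper's notation:

```python
# Viterbi decoding for a source-channel model: bigram source table b maps
# (previous_token, token) -> probability; channel table s maps
# (observed_symbol, source_token) -> probability. Runs in O(m * v^2) time.

def viterbi_decode(observed, vocab, b, s, boundary="BOUNDARY"):
    """Return the source sequence maximizing P(e) * P(f|e), plus its score."""
    # best[t] = (score of best sequence ending in t, that sequence)
    best = {t: (b.get((boundary, t), 0.0) * s.get((observed[0], t), 0.0), [t])
            for t in vocab}
    for f in observed[1:]:
        new_best = {}
        for t in vocab:                               # O(v) current tokens
            emit = s.get((f, t), 0.0)
            score, path = max(
                ((best[p][0] * b.get((p, t), 0.0), best[p][1]) for p in vocab),
                key=lambda x: x[0])                   # O(v) predecessors
            new_best[t] = (score * emit, path + [t])
        best = new_best
    # close off with the boundary transition
    score, path = max(
        ((best[t][0] * b.get((t, boundary), 0.0), best[t][1]) for t in vocab),
        key=lambda x: x[0])
    return path, score
```

The same routine serves part-of-speech tagging (tags as source tokens) and cipher decoding (plaintext letters as source tokens), since both use a bigram source model and a context-free substitution channel.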
These are important departures from the two applications discussed earlier.</Paragraph> <Paragraph position="5"> In the main channel model of Brown et al. (1993), each English word token ei in a source sentence is assigned a &quot;fertility&quot; φi, which dictates how many French words it will produce. These assignments are made stochastically according to a table n(φ|e). Then actual French words are produced according to s(f|e) and permuted into new positions according to a distortion table d(j|i, m, l). Here, j and i are absolute target/source word positions within a sentence, and m and l are target/source sentence lengths.</Paragraph> <Paragraph position="6"> Inducing n, s, and d parameter estimates is easy if we are given annotations in the form of word alignments. An alignment is a set of connections between English and French words in a sentence pair. In Brown et al. (1993), alignments are asymmetric: each French word is connected to exactly one English word.</Paragraph> <Paragraph position="7"> Computational Linguistics Volume 25, Number 4 Given a collection of sentence pairs: 1. collect estimates for the ε(m|l) table directly from the data 2. set the s(f|e) table initially to be uniform 3. for several iterations do: a. set up a count table c(f|e) with zero entries b. for each given sentence pair e, f with respective lengths l, m: for a1 = 1 to l, for a2 = 1 to l, ..., for am = 1 to l /* select connections for a word alignment */: compute P(a1...am|e, f) = P(f, a1...am|e) / P(f|e), and for j = 1 to m, add P(a1...am|e, f) to c(fj|e_aj) c. normalize the c(f|e) table to create a revised s(f|e)</Paragraph> <Paragraph position="12"> Figure 2 Naive EM training for the Model 1 channel model.</Paragraph> <Paragraph position="13"> Word-aligned data is usually not available, but large sets of unaligned bilingual sentence pairs do sometimes exist. A single sentence pair will have l^m possible alignments--for each French word position 1...
m, there is a choice of l English positions to connect to. A naive EM implementation will collect n, s, and d counts by considering each alignment, but this is expensive. (By contrast, part-of-speech tagging involves a single alignment, leading to O(m) training.) Lacking a polynomial reformulation, Brown et al. (1993) decided to collect counts only over a subset of likely alignments. To bootstrap, they required some initial idea of what alignments are reasonable, so they began with several iterations of a simpler channel model (called Model 1) that has nicer computational properties.</Paragraph> <Paragraph position="14"> In the following description of Model 1, we represent an alignment formally as a vector a1, ..., am, with values aj ranging over English word positions 1...l.</Paragraph> <Paragraph position="15"> Model 1 Channel Parameters: ε(m|l) and s(f|e).</Paragraph> <Paragraph position="16"> Given a source sentence e of length l: 1. choose a target sentence length m according to ε(m|l) 2. for j = 1 to m, choose an English word position aj according to the uniform distribution over 1...l 3. for j = 1 to m, choose a French word fj according to s(fj|e_aj) 4. read off f1...fm as the target sentence Because the same e may produce the same f by means of many different alignments, we must sum over all of them to obtain P(f|e): P(f|e) = ε(m|l) · (1/l^m) · the sum for a1 = 1..l, for a2 = 1..l, ..., for am = 1..l of the product for j = 1..m of s(fj|e_aj) Figure 2 illustrates naive EM training for Model 1. If we compute P(f|e) once per iteration, outside the &quot;for a&quot; loops, then the complexity is O(m·l^m) per sentence pair, per iteration.</Paragraph> <Section position="1" start_page="610" end_page="611" type="sub_section"> <SectionTitle> Knight Decoding Complexity </SectionTitle> <Paragraph position="0"> More efficient O(lm) training was devised by Brown et al. (1993). Instead of processing each alignment separately, they modified the algorithm in Figure 2 as follows: b.
for each given sentence pair e, f of respective lengths l, m: for j = 1 to m, for i = 1 to l, add s(fj|ei) / (the sum for k = 1..l of s(fj|ek)) to c(fj|ei)</Paragraph> <Paragraph position="2"> This works because of the algebraic trick that the portion of P(f|e) we originally wrote as the sum for a1 = 1..l, ..., for am = 1..l of the product for j = 1..m of s(fj|e_aj) can be rewritten as the product for j = 1..m of the sum for i = 1..l of s(fj|ei). We next consider decoding. We seek a string e that maximizes P(e|f), or equivalently maximizes P(e) · P(f|e). A naive algorithm would evaluate all possible source strings, whose lengths are potentially unbounded. If we limit our search to strings at most twice the length m of our observed French, then we have a naive O(m^2·v^(2m)) method: Given a string f of length m 1. for all source strings e of length l <= 2m: a. compute P(e) = b(e1|boundary) · b(boundary|el) · the product for i = 2..l of b(ei|ei-1) b. compute P(f|e) = ε(m|l) · (1/l^m) · the product for j = 1..m of the sum for i = 1..l of s(fj|ei) c. compute P(e|f) ~ P(e) · P(f|e) d. if P(e|f) is the best so far, remember it 2. print best e We may now hope to find a way of reorganizing this computation, using tricks like the ones above. Unfortunately, we are unlikely to succeed, as we now show. For proof purposes, we define our optimization problem with an associated yes-no decision problem: Definition: M1-OPTIMIZE Given a string f of length m and a set of parameter tables (b, ε, s), return a string e of length l <= 2m that maximizes P(e|f), or equivalently maximizes P(e) · P(f|e).</Paragraph> <Paragraph position="4"> Definition: M1-DECIDE Given a string f of length m, a set of parameter tables (b, ε, s), and a real number k, does there exist a string e of length l <= 2m such that P(e) · P(f|e) > k? We will leave the relationship between these two problems somewhat open and intuitive, noting only that M1-DECIDE's intractability does not bode well for M1-OPTIMIZE. To show inclusion in NP, we need only nondeterministically choose e for any problem instance and verify that it has the requisite P(e) · P(f|e) in O(m^2) time.
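The algebraic rewriting that makes O(lm) training possible — the sum over all l^m alignments of a product turning into a product of per-position sums — can be checked numerically on a small random instance. This is an illustrative check of the identity, not code from the paper:

```python
import itertools
import math
import random

# Verify: sum over alignments a1..am of prod_j s(fj|e_aj)
#      == prod_j sum_{i=1..l} s(fj|e_i)

random.seed(0)
l, m = 3, 4
# s[j][i] plays the role of s(fj|e_i)
s = [[random.random() for _ in range(l)] for _ in range(m)]

# Left-hand side: enumerate all l^m alignments explicitly -- O(m * l^m)
brute = sum(
    math.prod(s[j][a[j]] for j in range(m))
    for a in itertools.product(range(l), repeat=m))

# Right-hand side: one sum per French position -- O(l * m)
fast = math.prod(sum(s[j][i] for i in range(l)) for j in range(m))

assert abs(brute - fast) < 1e-9
```

The identity holds because each factor's alignment variable ranges independently, so the big sum of products factors position by position; this is exactly why Model 1 training and P(f|e) evaluation escape the exponential alignment enumeration, while decoding (where e itself is unknown) does not.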
Next we give separate polynomial-time reductions from two NP-complete problems. Each reduction highlights a different source of complexity.</Paragraph> </Section> <Section position="2" start_page="611" end_page="612" type="sub_section"> <SectionTitle> 4.1 Reduction 1 (from Hamilton Circuit Problem) </SectionTitle> <Paragraph position="0"> The Hamilton Circuit Problem asks: given a directed graph G with vertices labeled 0,...,n, does G have a path that visits each vertex exactly once and returns to its starting point? We transform any Hamilton Circuit instance into an M1-DECIDE instance as follows. First, we create a French vocabulary f1,...,fn, associating word fi with vertex i in the graph. We create a slightly larger English vocabulary e0,...,en, with e0 serving as the &quot;boundary&quot; word for source model scoring. Ultimately, we will ask M1-DECIDE to decode the string f1...fn.</Paragraph> <Paragraph position="1"> We create channel model tables as follows:</Paragraph> <Paragraph position="2"> s(fi|ej) = 1 if i = j, and 0 otherwise; ε(m|l) = 1 if m = l, and 0 otherwise.</Paragraph> <Paragraph position="3"> These tables ensure that any decoding e of f1...fn will contain the n words e1,...,en (in some order). We now create a source model. For every pair (i, j) such that 0 <= i, j <= n: b(ej|ei) = 1/n if graph G contains an edge from vertex i to vertex j, and 0 otherwise. Finally, we set k to zero. To solve a Hamilton Circuit Problem, we transform it as above (in quadratic time), then invoke M1-DECIDE with inputs b, ε, s, k, and f1...fn. If M1-DECIDE returns yes, then there must be some string e with both P(e) and P(f|e) nonzero. The channel model lets us conclude that if P(f|e) is nonzero, then e contains the n words e1,...,en in some order. If P(e) is nonzero, then every bigram in e (including the two boundary bigrams involving e0) has nonzero probability. Because each English word in e corresponds to a unique vertex, we can use the order of words in e to produce an ordering of vertices in G.
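The Reduction 1 construction can be sketched concretely. The function and table names below are ours, and the decision procedure is a brute-force stand-in for M1-DECIDE (only orderings of e1...en can score above k = 0, so it suffices to test permutations):

```python
import itertools
import math

def hamilton_to_m1decide(n, edges):
    """Build M1-DECIDE tables from a directed graph on vertices 0..n.

    edges: set of (i, j) pairs; vertex 0 corresponds to the boundary word e0.
    """
    french = [f"f{i}" for i in range(1, n + 1)]           # one French word per vertex
    s = {(f"f{i}", f"e{i}"): 1.0 for i in range(1, n + 1)}  # fi translates only ei
    # bigram allowed (prob 1/n) iff the graph has the corresponding edge
    b = {(f"e{i}", f"e{j}"): (1.0 / n if (i, j) in edges else 0.0)
         for i in range(n + 1) for j in range(n + 1)}
    return french, b, s, 0.0                               # threshold k = 0

def brute_force_decide(n, edges):
    """Answer the constructed M1-DECIDE instance by exhaustive search."""
    _, b, _, k = hamilton_to_m1decide(n, edges)
    for perm in itertools.permutations(range(1, n + 1)):
        circuit = [0] + list(perm) + [0]                   # boundary bigrams via e0
        p = math.prod(b[(f"e{u}", f"e{v}")] for u, v in zip(circuit, circuit[1:]))
        if p > k:                                          # nonzero P(e) found
            return True
    return False
```

A yes answer corresponds exactly to a vertex ordering whose consecutive pairs (including the wrap-around through vertex 0) are all edges of G, i.e., a Hamilton Circuit.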
We append vertex 0 to the beginning and end of this list to produce a Hamilton Circuit. The source model construction guarantees an edge between each vertex and the next.</Paragraph> <Paragraph position="4"> If M1-DECIDE returns no, then we know that every string e includes at least one zero value in the computation of either P(e) or P(f|e). From any proposed Hamilton Circuit--i.e., some ordering of vertices in G--we can construct a string e using the same ordering. This e will have P(f|e) nonzero according to the channel model; since the answer was no, we must have P(e) = 0. By the source model, this can only happen if the proposed &quot;circuit&quot; is actually broken somewhere. So no Hamilton Circuit exists.</Paragraph> <Paragraph position="5"> Figure 3 illustrates the intuitive correspondence between selecting a good word order and finding a Hamilton Circuit. We note that Brew (1992) discusses the NP-completeness of a related problem, that of finding some permutation of a string that is acceptable to a given context-free grammar. Both of these results deal with decision problems. Figure 3 Selecting a good source word order is like solving the Hamilton Circuit Problem. If we assume that the channel model offers deterministic, word-for-word translations, then the bigram source model takes responsibility for ordering them. Some word pairs in the source language may be illegal. In that case, finding a legal word ordering is like finding a complete circuit in a graph. (In the graph shown above, a sample circuit is boundary -> this -> year -> comma -> my -> birthday -> falls -> on -> a -> Thursday -> boundary.) If word pairs have probabilities attached to them, then word ordering resembles finding the least-cost circuit, also known as the Traveling Salesman Problem.</Paragraph> <Paragraph position="6"> Returning to optimization, we recall another circuit task, the Traveling Salesman Problem. It introduces edge costs d_ij and seeks a minimum-cost circuit.
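The correspondence between maximizing P(e) and finding a minimum-cost circuit can be checked on a toy bigram table: taking each edge cost as the negative log of a bigram probability, the most probable word order and the cheapest circuit coincide. The table values and the BOUNDARY token below are illustrative assumptions:

```python
import itertools
import math

# Toy bigram source model over two words plus a boundary token.
b = {("BOUNDARY", "this"): 0.9, ("this", "year"): 0.8, ("year", "BOUNDARY"): 0.7,
     ("BOUNDARY", "year"): 0.1, ("year", "this"): 0.2, ("this", "BOUNDARY"): 0.3}
words = ["this", "year"]

def prob(order):
    """P(e) for a word order, with boundary bigrams on both ends."""
    seq = ["BOUNDARY"] + list(order) + ["BOUNDARY"]
    return math.prod(b[(x, y)] for x, y in zip(seq, seq[1:]))

def cost(order):
    """Circuit cost under edge costs -log b(.|.)."""
    seq = ["BOUNDARY"] + list(order) + ["BOUNDARY"]
    return sum(-math.log(b[(x, y)]) for x, y in zip(seq, seq[1:]))

best_by_prob = max(itertools.permutations(words), key=prob)
best_by_cost = min(itertools.permutations(words), key=cost)
assert best_by_prob == best_by_cost
```

Since -log is monotone decreasing, maximizing a product of probabilities is identical to minimizing the sum of negative logs; this is the sense in which source word ordering in Model 1 decoding is a Traveling Salesman instance.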
By viewing edge costs as negative log probabilities, we can cast the Traveling Salesman Problem as one of optimizing P(e), that is, of finding the best source word order in Model 1 decoding.</Paragraph> </Section> <Section position="3" start_page="612" end_page="612" type="sub_section"> <SectionTitle> 4.2 Reduction 2 (from Minimum Set Cover Problem) </SectionTitle> <Paragraph position="0"> The Minimum Set Cover Problem asks: given a collection C of subsets of finite set S, and an integer n, does C contain a cover for S of size <= n, i.e., a subcollection whose union is S? We now transform any instance of Minimum Set Cover into an instance of M1-DECIDE, using polynomial time. This time, we assume a rather neutral source model in which all strings of a given length are equally likely, but we construct a more complex channel.</Paragraph> <Paragraph position="1"> We first create a source word ei for each subset in C, and let gi be the size of that subset. We create a table b(ei|ej) with values set uniformly to the reciprocal of the source vocabulary size (i.e., the number of subsets in C).</Paragraph> <Paragraph position="2"> Assuming S has m elements, we next create target words f1,...,fm corresponding to each of those elements, and set up channel model tables as follows: s(fj|ei) = 1/gi if the element in S corresponding to fj is also in the subset corresponding to ei, and 0 otherwise; the ε(m|l) table is nonzero only when l <= n. Figure 4 Selecting a concise set of source words is like solving the Minimum Set Cover Problem. A channel model with overlapping, one-to-many dictionary entries will typically license many decodings. The source model may prefer short decodings over long ones. Searching for a decoding of length <= n is difficult, resembling the problem of covering a finite set with a small collection of subsets. In the example shown above, the smallest acceptable set of source words is {and, cooked, however, left, comma, period}.</Paragraph> <Paragraph position="3"> If M1-DECIDE returns yes, then some decoding e with P(e) · P(f|e) > 0 must exist.
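The Reduction 2 construction can also be sketched concretely. As before, the names are ours and the decision procedure is a brute-force stand-in for M1-DECIDE (repeated source words never cover new elements, so it suffices to test combinations of distinct subsets up to size n):

```python
import itertools

def setcover_channel(subsets):
    """Build the s(fj|ei) table: one English word per subset of C, one French
    word per element of S; s(fj|ei) = 1/gi exactly when fj's element is in
    ei's subset (gi = |subset i|)."""
    return {(f"f{x}", f"e{i}"): 1.0 / len(sub)
            for i, sub in enumerate(subsets) for x in sub}

def brute_force_decide(subsets, universe, n):
    """M1-DECIDE with k = 0: is there an e of length <= n with P(f|e) > 0?"""
    s = setcover_channel(subsets)
    for size in range(1, n + 1):
        for combo in itertools.combinations(range(len(subsets)), size):
            # P(f|e) > 0 iff every fj has a nonzero s(fj|ei) for some chosen ei
            if all(any((f"f{x}", f"e{i}") in s for i in combo) for x in universe):
                return True
    return False
```

A yes answer corresponds exactly to a subcollection of at most n subsets whose union is S, which is the set-cover question being reduced.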
We know that e must contain n or fewer words--otherwise P(f|e) = 0 by the ε table. Furthermore, the s table tells us that every word fj is covered by at least one English word in e. Through the one-to-one correspondence between elements of e and C, we produce a set cover of size <= n for S.</Paragraph> <Paragraph position="4"> Likewise, if M1-DECIDE returns no, then all decodings have P(e) · P(f|e) = 0. Because there are no zeroes in the source table b, every e has P(f|e) = 0. Therefore either (1) the length of e exceeds n, or (2) some fj is left uncovered by the words in e. Because source words cover target words in exactly the same fashion as elements of C cover S, we conclude that there is no set cover of size <= n for S. Figure 4 illustrates the intuitive correspondence between source word selection and minimum set covering.</Paragraph> </Section> </Section> </Paper>