<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1002">
  <Title>Statistical Machine Translation Using Coercive Two-Level Syntactic Transduction</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Resources
</SectionTitle>
    <Paragraph position="0"> As in other SMT approaches, the primary training resource is a sentence-aligned parallel bilingual corpus.</Paragraph>
    <Paragraph position="1"> We further require that each side of the corpus be part-of-speech (POS) tagged and phrase chunked; our lab has previously developed techniques for rapid training of such tools (Cucerzan and Yarowsky, 2002). Our translation experiments were carried out on two languages: Arabic and French. The Arabic training corpus was a subset of the United Nations (UN) parallel corpus which is being made available by the Linguistic Data Consortium. For French-English training, we used a portion of the Canadian Hansards. Both corpora utilized sentence-level alignments publicly distributed by the Linguistic Data Consortium.</Paragraph>
    <Paragraph position="2"> POS tagging and phrase chunking in English were done using the trained systems provided with the fnTBL Toolkit (Ngai and Florian, 2001); both were trained from the annotated Penn Treebank corpus (Marcus et al., 1993). French POS tagging was done using the trained French lexical tagger also provided with the fnTBL software. For Arabic, we used a colleague's POS tagger and tokenizer (clitic separation was also performed prior to POS tagging), which was rapidly developed in our laboratory. Simple regular-expression-based phrase chunkers were developed by the authors for both Arabic and French, requiring less than a person-day each using existing multilingual learning tools.</Paragraph>
    <Paragraph position="3"> A further input to our system is a set of word alignment links on the parallel corpus. These are used to compute word translation probabilities and phrasal alignments.</Paragraph>
    <Paragraph position="4"> The word alignments can in principle come from any source: a dictionary, a specialized alignment program, or another SMT system. We used alignments generated by Giza++ (Och and Ney, 2000) by running it in both directions (e.g., Arabic ! English and English ! Arabic) on our parallel corpora. The union of these bidirectional alignments was used to compute cross-language phrase correspondences by simple majority voting, and for purposes of estimating word translation probabilities, each link in this union was treated as an independent instance of word translation.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Translation Model
</SectionTitle>
    <Paragraph position="0"> Now we turn to a detailed description of the proposed translation model. The exposition will give a formal specification and also will follow a running example throughout, using one of the actual Arabic test set sentences. This example, its gloss, system translation and reference human translation are shown in Table 1.</Paragraph>
    <Paragraph position="1"> The translation model (TM) we describe is trained directly from counts in the data, and is a direct model, not a noisy channel model. It consists of three nested components: (1) a sentence-level model of phrase correspondence and reordering, (2) a model of intra-phrase translation, and (3) models of lexical transfer, or word translation. We make a key assumption in our construction that translation at each of these three levels is independent of the others.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Sentence Translation
</SectionTitle>
      <Paragraph position="0"> As mentioned, both the foreign language and English corpora are input with &amp;quot;hard&amp;quot; phrase bracketings and labeled with &amp;quot;hard&amp;quot; phrase types (e.g., NP, VP1, PPNP2, etc.) as given by the output of the phrase chunker. These are denoted in the top-level model presentation in Table 2(1). Given word alignment links, as described in Section 2, we compute phrasal alignments on training data.</Paragraph>
      <Paragraph position="1"> We contrain these to have cardinality (foreign)N ! 1(English). Next, we collect counts over aligned phrase sequences and use the relative frequencies to estimate the probability distribution in Table 2(2). Particularly for smaller training corpora, unseen foreign-language phrase sequences are a problem, so we implemented a simple backoff method which assigns probability to translations of unseen foreign-language phrase sequences. Table 2(3) encapsulates the remainder of the translation model, which is described below.</Paragraph>
      <Paragraph position="2"> As an example, Table 3 shows the most probable aligned English phrase sequence generations given an Arabic simple sentence having the canonical VSO ordering. Also, note that all probabilities in the following  Arabic Example Sentence From Test Set (ARABIC) twSy Al- ljnp Al- sAdsp Al- jmEyp Al- EAmp b- AEtmAd m$rwE Al- mqrr Al- tAly : (PHR.-BRACKETED AR.) [twSy] [Al- ljnp Al- sAdsp] [Al- jmEyp Al- EAmp] [b- AEtmAd m$rwE Al- mqrr Al- tAly] [:] (AN ENG. GLOSS) [recommends] [the committee the sixth] [the assembly the general] [to adoption draft the decision the following] [:] (ENG. MT OUTPUT) [the sixth committee] [recommends] [the general assembly] [in the adoption of the following draft resolution] [:] (REFERENCE TRANS.) the sixth committee recommends to the general assembly the adoption of the following draft decision :  strings in this paper are rendered in the reversible Buckwalter transliteration. In addition, all words or symbols referring to Arabic and French in this paper are italicized.</Paragraph>
      <Paragraph position="3"> figures and tables are from the actual Arabic and French trained systems.</Paragraph>
      <Paragraph position="4">  bic, for canonical Arabic simple sentence structure VP (verb) NP (subject) NP (object). Subscripts in English phrase sequence are alignments to positions in the corresponding Arabic phrase sequence.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Phrase Translation
</SectionTitle>
      <Paragraph position="0"> Given an Arabic test sentence, a distribution of aligned English phrase sequences is proposed by the sentence-level model described in the previous section and in Table 2. Each proposed English phrase in each of the phrase sequence possibilities, therefore, comes to the middle level of the translation model with access to the identity of the French phrases aligned to it. Phrase translation is implemented as shown in Table 4. The phrase translation model is structured with several levels of backoff: if no observations exist from training data for a particular level, the model backs off to the next-more-general level.</Paragraph>
      <Paragraph position="1"> In all cases, generation of an English phrase is conditioned on the foreign phrase as well as the type (NP, VP, etc.) of the English phrase.</Paragraph>
      <Paragraph position="2"> Table 4 (1) describes the initial phrase translation model. It comes into play if the precise sequence of foreign words has been observed aligning to an English phrase of the appropriate type. In the example, we are trying to generate an NP given the Arabic word string &amp;quot;Al- ljnp Al- sAdsp&amp;quot; (literally: &amp;quot;the committee the sixth&amp;quot;). If this has been observed in data, then that relative frequency distribution serves as the translation probability distribution. Table 11 contains examples of some of these literal phrase translations from the French data.</Paragraph>
      <Paragraph position="3"> The next stage of backoff from the above, literal level is a model that generates aligned English POS tag sequences given foreign POS tag sequences: details and an example can be found in Table 4(2). The sequence alignments determine the position in English phrase and the part-of-speech into which we translate the foreign word. Again, translation is also conditioned on the English phrase type. Table 5 and Table 6 show the most probable aligned English sequence generations for two of the phrases in the example sentence.</Paragraph>
      <Paragraph position="4"> If there were no counts for (foreign-POS-sequence, english-phrase-type) then we back off to counts collected over (foreign-coarse-POS-sequence, englishphrase-type), where a coarse POS is, for example, N instead of NOUN-SG. This is shown in Table 4(3).</Paragraph>
      <Paragraph position="5"> In case further backoff is needed, as shown in Table 4(4), we begin stripping POS-tags off the &amp;quot;less significant&amp;quot; (non-head) end of the foreign POS-sequence until we are left with a phrase sequence that has been seen in training, and from this a corresponding English phrase distribution is observable. We define the &amp;quot;less significant&amp;quot; end of a phrase to be the end if it is head-initial, or the beginning if it is head-final, and at this point ignore issues such as nested structure in French and Arabic NP's.</Paragraph>
      <Paragraph position="6"> Aligned English POS-tag Sequence Translation Probabilities (conditioned on Arabic POS-tag sequence from NP in example)</Paragraph>
      <Paragraph position="8"> NP generations given an Arabic phrase DET NOUN-SG DET ADJ. Note: ; denotes a null alignment (generation from null). Generation from a null alignment is allowed for specified parts of speech, such as determiners and prepositions.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Lexical Transfer
4.3.1 The Basic Model
</SectionTitle>
      <Paragraph position="0"> In the basic model of word generation, phrases may be translated directly as single atomic entities (as in Table 4(1)), or via phrasal decomposition to individual words translated independently, conditioned only on the source word and target POS. Word translation in the latter case  sentences. (1) is the direct, lexical translation level. (2) - (4) constitute the backoff path to handle detailed phenomena unseen in the training set. (2) is a model of fine POS-tag reordering and lexical generation; (3) is similar, but conditions generation on coarse POS-tag sequences in the foreign language. (4) is a model for progressively stripping off POS-tags from the &amp;quot;less significant&amp;quot; end of a foreign sequence. The idea is to do this until we reach a subsequence that has been seen in training data, and which we therefore have a distribution of valid generatons for. The term i in (2) - (4) is a position alignment matrix. At all times, we generate not just an English POS-tag sequence, but rather an aligned sequence. Similarly, in the lexical transfer probabilities shown in this table, there is a function i() which takes an English sequence position index and returns the (unique) foreign word position to which it is aligned4. Aligned English POS-tag Sequence Translation Probabilities (conditioned on Arabic POS-tag sequence from VP in example)</Paragraph>
      <Paragraph position="2"> erations given an Arabic phrase VERB-IMP.</Paragraph>
      <Paragraph position="3"> is done in the context that the model has already proposed a sequence of POS tags for the phrase. Thus we know the English POS of the word we are trying to generate in addition to the foreign word that is generating it. Consequently, we condition translation on English POS as well as the foreign word. Table 7 describes the backoff path for basic lexical transfer and presents a motivating example in the French word droit. Translation probabilities for one of the words in the example Arabic sentence can be found in Table 8.</Paragraph>
      <Paragraph position="4"> 4.3.2 Generation via a Lemma Model To counter sparse data problems in estimating word translation probabilities, we also implemented a lemma- null els of backoff in the lexical transfer model. The example shows translations for the French word droit (&amp;quot;right&amp;quot;) conditioned on decreasingly specific values. The progressively lower ranking of the correct translation as we move from fine, to coarse, to no POS, illustrates the benefit of conditioning generation on the English part of speech.</Paragraph>
      <Paragraph position="5">  noun ljnp, &amp;quot;committee&amp;quot;.</Paragraph>
      <Paragraph position="6"> based model for word translation. Under this model, translation distributions are estimated by counting word alignment links between foreign and English lemmas, assuming a lemmatization of both sides of the parallel corpus as input. The form of the model is illustrated below:</Paragraph>
      <Paragraph position="8"> First, note that P( lemmaF j WF , TcoarseF ) is very simply a hard lemma assignment by the foreign language lemmatizer. Second, English word generation from English lemma and coarse POS (P( WE j lemmaE , TfineE )) is programmatic, and can be handled by means of rules in conjunction with a lookup table for irregular forms. The only distribution here that must be estimated from data is P( lemmaE j lemmaF , TcoarseE ). This is done as described above. Furthermore, given an electronic translation dictionary, even this distribution can be pre-loaded: indeed, we expect this to be an advantage of the lemma model, and an example of a good opportunity for integrating compiled human knowledge about language into an SMT system. Some examples of the lemma model combating sparse data problems inherent in the basic word-to-word models can be found in Table 9.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3.3 Coercion
</SectionTitle>
      <Paragraph position="0"> Lexical coercion is a phenomenon that sometimes occurs when we condition translation of a foreign word on the word and the English part-of-speech. We find that the system we have described frequently learns this behavior: specifically, the model learns in some cases how to generate, for instance, a nominal form with similar meaning from a French adjective, or an adjectival realization of a French verb's meaning; some examples of this phenomenon are shown in Table 10. We find this coercion effect to be of interest because it identifies interesting associations of meaning. For example, in Table 10 &amp;quot;willing&amp;quot; and &amp;quot;ready&amp;quot; are both sensible ways to realize the meaning of the action &amp;quot;to accept&amp;quot; in a passive, descriptive mode. droit behaves similarly. Though the English verb &amp;quot;to right' or &amp;quot;to be righted&amp;quot; does not have the philosophical/judicial entitlement sense of the noun &amp;quot;right&amp;quot;, we see that the model has learned to realize the meaning in an active, verbal form: e.g., VBG 'receiving&amp;quot; and VB &amp;quot;qualify&amp;quot;.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="493" type="metho">
    <SectionTitle>
5 Decoding
</SectionTitle>
    <Paragraph position="0"> Decoding was implemented by constructing finite-state machines (FSMs) per evaluation sentence to encode relevant portions (for the individual sentence in question) of the component translation distributions described above. Operations on these FSMs are performed using the AT&amp;T FSM Toolkit (Mohri et al., 1997). The FSM constructed for a test sentence is subsequently composed with a FSM trigram language model created via the SRI Language Modeling Toolkit (Stolcke, 2002). Thus we use the trigram language model to implement rescoring of the (direct) translation probabilities for the English word sequences in the translation model lattice.</Paragraph>
    <Paragraph position="1"> We found that using the finite-state framework and the general-purpose AT&amp;T toolkit greatly facilitates decoder development by freeing the implementation from details of machine composition and best-path searching, etc.</Paragraph>
    <Paragraph position="2"> The structure of the translation model finite-state machines is as illustrated in Figure 1. The sentence-level (aligned phrase sequence generation) and phrase-level (aligned intra-phrase sequence generation) translation probabilities are encoded on epsilon arcs in the machines. Word translation probabilities are placed onto arcs emitting the word as an output symbol (in the figure, note the arcs emitting &amp;quot;committee&amp;quot;, &amp;quot;the&amp;quot;, etc.). The FSM in Figure 1 corresponds to the Arabic example sentence used throughout this paper. In the portion of the machine shown, the (best) path which generated the example sentence is drawn in bold. Finally, Figure 2 is a rendering of the actual FSM (aggressively pruned for display purposes) that generated the example Arabic sentence; although labels and details are not visible, it may provide a visual aid for better understanding the structure of the FSM lattices generated here.</Paragraph>
    <Paragraph position="3"> As a practical matter in decoding, during translation model FSM construction we modified arc costs for output words in the following way: a fixed bonus was assigned for generating a &amp;quot;content&amp;quot; word translating to a &amp;quot;content&amp;quot; word. Determining what qualifies as a content word was done on the basis of a list of content POS tags for each language. For example, all types of nouns, verbs and adjectives were listed as content tags; determiners, prepositions, and most other closed-class parts of speech were not. This implements a reasonable penalty on undesirable output sentence lengths. Without such a penalty, translation outputs tend to be very short: long sentence hypotheses are penalized de facto merely by containing many word translation probabilities. An additional trick in decoding is to use only the N-best translation options for sentence-level, phrase-level, and word-level translation. We found empirically (and very consistently) in dev-test experiments that restricting the syntactic transductions to a 30-best list and word translations to a 15-best list had no negative impact on Bleu score. The benefit, of course, is that the translation lattices are dramatically reduced in size, speeding up composition and search operations.</Paragraph>
    <Paragraph position="5"/>
    <Paragraph position="7"/>
    <Paragraph position="9"/>
    <Paragraph position="11"/>
    <Paragraph position="13"/>
    <Paragraph position="15"/>
    <Paragraph position="17"/>
    <Paragraph position="19"/>
    <Paragraph position="21"/>
    <Paragraph position="23"/>
    <Paragraph position="25"/>
    <Paragraph position="27"> tence, compacted and aggressively pruned by path probability for display purposes.</Paragraph>
  </Section>
class="xml-element"></Paper>