<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1122">
  <Title>Modelling lexical redundancy for machine translation</Title>
  <Section position="4" start_page="969" end_page="969" type="metho">
    <SectionTitle>
2 Lexical redundancy between languages
</SectionTitle>
    <Paragraph position="0"> In statistical MT, the source and target lexicons are usually defined as the sets of distinct types observed in the parallel training corpus for each language. Such models may not be optimal for certain language pairs and training regimes.</Paragraph>
    <Paragraph position="1"> A word-level statistical translation model approximates the probabilityPr(E|F) that a source type indexed by F will be translated as a target type indexed by E. Standard models, e.g. Brown et al. (1993), consist of discrete probability distributions with separate parameters for each unique pairing of a source and target types; no attempt is made to leverage structure within the event spaces E and F during parameter estimation. This results in a large number of parameters that must be estimated from limited amounts of parallel corpora.</Paragraph>
    <Paragraph position="2"> We refer to distinctions made between lexical types in one language that do not result in different distributions over types in the other language as lexically redundant for the language pair. Since the role of the translation model is to determine a distribution over target types given a source type, when the corresponding target distributions do not vary significantly over a set of source types, the model gains nothing by maintaining a distinct set of parameters for each member of this set.</Paragraph>
    <Paragraph position="3"> Lexical redundancy may arise when languages differ in the specificity with which they refer to the same concepts. For instance, colours of the spectrum may be partitioned differently (e.g. blue in English v.s. sinii and goluboi in Russian). It will also arisewhen languages explicitlyencode different information in the lexicon. For example, translating from French to English, a standard model would treat the following pairs of source and target types as distinct events with entirely unrelated parameters: (vert,green), (verte,green), (verts,green) and (vertes,green). Here the French types differ only in their final suffixes due to adjectival agreement. Since there is no equivalent mechanism in English, these distinctions are redundant with respect to this target language.</Paragraph>
    <Paragraph position="4"> Distinctions that are redundant in the source lexicon when translating into one language may, however, be significant when translating into another. For instance, the French adjectival number agreement (the addition of an s) may be significant when translating to Russian which also marks adjectives for number (the inflexion to -ye).</Paragraph>
    <Paragraph position="5"> We can remove redundancy from the translation model by conflating redundant types, e.g. vert .= {vert,verte,verts,vertes}, and averaging bilingual statistics associated with these events.</Paragraph>
  </Section>
  <Section position="5" start_page="969" end_page="972" type="metho">
    <SectionTitle>
3 Eliminating redundancy in the model
</SectionTitle>
    <Paragraph position="0"> Redundancy in the translation model can be viewed as unwarranted model complexity. A cluster-based translation model defined via a hard-clustering of the lexicon can reduce this complexitybyintroducingadditionalindependenceas- null sumptions: given the source cluster label, cj, the target type,ei, is assumed to be independent of the exact source type, fj, observed, i.e., p(ei|fj) [?] p(ei|cj). Optimising the model for lexical redundancy can be viewed as model selection over a set of such cluster-based translation models.</Paragraph>
    <Paragraph position="1"> We formulate model search as a maximum a posteriori optimisation: the data-dependent term, p(D|C), quantifiesevidenceprovidedforamodel, C, by bilingual training data, D, while the prior, p(C), can assert a preference for a particular model structure (clustering of the source lexicon) on the basis of monolingual features. Both terms have parameters that are estimated from data. Formally, we search forC[?],</Paragraph>
    <Paragraph position="3"> Evaluating the data-dependent term, p(D|C), for different partitions of the source lexicon, we can compare how well different models predict the target tokens aligned in a parallel corpus. This term willprefermodelsthatgrouptogethersourcetypes with similar distributions over target types. By using the marginal likelihood (integrating out the parameters of the translation model) to calculate  p(D|C), we can account explicitly for the complexity of the translation model and compare models with different numbers of clusters as well as different assignments of types to clusters.</Paragraph>
    <Paragraph position="4"> In addition to an implicit uniform prior over cluster labels as in k-means clustering (e.g. Chou (1991)), we also consider a Markov random field (MRF) parameterisation of the p(C) term to capture monolingual regularities in the lexicon. The MRF induces dependencies between clustering decisions in different parts of the lexicon via a monolingual feature space biasing the search towards models that exhibit monolingual regularities. Rather than assuming a priori knowledge of redundant distinctions in the source language, we use an EM algorithm to update parameters for features defined over sets of source types on the basis of existing cluster assignments. While initially the model search will be guided only by information from the bilingual statistics in p(D|C), monolingual regularities in the lexicon, such as inflexion patterns, may gradually be propagated through the model as p(C) becomes informative. Our experiments suggest that the MRF prior enables more robust model selection.</Paragraph>
    <Paragraph position="5"> As stated, the model selection procedure accounts for redundancy in the source lexicon using the target distributions. The target lexicon can be optimised analogously. Clustering target types allows the implementation of independence assumptions asserting that the exact specification of a target type is independent of the source type givenknowledgeofthetargetclusterlabel. Forexample, when translating an English adjective into French it may be more efficient to use the translation model to specify only that the translation lieswithinacertainsetofFrenchadjectives, correspondingtoasinglelemma, andhavethelanguage model select the exact form. Our experiments suggest that itcan be useful toaccount for redundancy in both languages in this way; this can be incorporated simply within our optimisation procedure. In Section 3.1 we describe the bilingual marginal likelihood, p(D|C), clustering procedure; in Section 3.2 we introduce the MRF parameterisation of the prior, p(C), over model structure; and in Section 3.3, we describe algorithmic approximations.</Paragraph>
    <Section position="1" start_page="970" end_page="971" type="sub_section">
      <SectionTitle>
3.1 Bilingual model selection
</SectionTitle>
      <Paragraph position="0"> Assume we are optimising the source lexicon (the target lexicon is optimised analogously). A clustering of the lexicon is a unique mapping CF : F - CF defined for allf [?] F where, in addition to all source types observed in the parallel training corpus, F may include items seen in other mono-lingual corpora (and, in the case of the source lexicon only, the development and test data). The standard SMT lexicon can be viewed as a clustering with each type observed in the parallel training corpus assigned to a distinct cluster and all other types assigned to a single 'unknown word' cluster.</Paragraph>
      <Paragraph position="1"> We optimise a conditional model of target tokens from word-aligned parallel corpora, D = {Dc0,...,DcN}, where Dci represents the set of target words that were aligned to the set of source types in cluster ci. We assume that each target token in the corpus is generated conditionally i.i.d.</Paragraph>
      <Paragraph position="2"> given the cluster label of the source type to which it is aligned. Sufficient statistics for this model consist of co-occurrence counts of source and target types summed across each source cluster,</Paragraph>
      <Paragraph position="4"> Maximising the likelihood of the data under this model would require us to specify the number of clusters (the size of the lexicon) in advance. Instead we place a Dirichlet prior parameterised by a1 over the translation model parameters of each cluster, ucf,e, defining the conditional distributions over target types. Given a clustering, the Dirichlet prior, and independent parameters, the distribution over data and parameters factorises,</Paragraph>
      <Paragraph position="6"> We optimise cluster assignments with respect to the marginal likelihood which averages the likelihood of the set of counts assigned to a cluster, Dcf, under the current model over the prior,</Paragraph>
      <Paragraph position="8"> p(ucf|a)p(Dcf|ucf,cf)ducf.</Paragraph>
      <Paragraph position="9"> This can be evaluated analytically for a Dirichlet prior with multinomial parameters.</Paragraph>
      <Paragraph position="10"> Assuming a (fixed) uniform prior over model structure, p(C), model selection involves iteratively re-assigning source types to clusters such as to maximise the marginal likelihood. Reassignments may alter the total number of clusters  at any point. Updates can be calculated locally, for instance, given the sets of target tokens Dci and Dcj aligned to source types currently in clusters ci and cj, the change in log marginal likelihood if clustersci andcj are merged into cluster -cis,</Paragraph>
      <Paragraph position="12"> which is a Bayes factor in favour of the hypothesis that Dci and Dcj were sampled from the same distribution (Wolpert, 1995). Unlike its equivalent in maximum likelihood clustering, Eq.(3) may assume positive values favouring a smaller number of clusters when the data does not support a more complex hypothesis. The more complex model, with ci and cj modelled separately, is penalised for being able to model a wider range of data sets.</Paragraph>
      <Paragraph position="13"> The hyperparameter, a, is tied across clusters and taken to be proportional to the marginal (the 'background') distribution over target types in the corpus. Under this prior, source types aligned to the same target types, will be clustered together more readily if these target types are less frequent in the corpus as a whole.</Paragraph>
    </Section>
    <Section position="2" start_page="971" end_page="972" type="sub_section">
      <SectionTitle>
3.2 Markov random field model prior
</SectionTitle>
      <Paragraph position="0"> As described above we consider a Markov random field (MRF) parameterisation of the prior over model structure, p(C). This defines a distribution over cluster assignments of the source lexicon as a whole based solely on monolingual characteristics of the lexical types and the relations between their respective cluster assignments.</Paragraph>
      <Paragraph position="1"> Viewed as graph, each variable in the MRF is modelled as conditionally independent of all other variables given the values of its neighbours (the Markov property; (Geman and Geman, 1984)).</Paragraph>
      <Paragraph position="2"> Each variable in the MRF prior corresponds to a lexical source type and its cluster assignment. Fig. 1 shows a section of the complete model including the MRF prior for a Welsh source lexicon; shading denotes cluster assignments and English target tokens are shown as directed nodes.2 From the Markov property it follows that this prior decomposes over neighbourhoods,  that we learn from the data; these are tied across the graph. b is a free parameter used to control the overall contribution of the prior in Eq. (1). Here features are defined over pairs of types but higher-order interactions can also be modelled. We only consider 'positive' prior knowledge that is indicative of redundancy among source types. Hence all features are non-zero only when their arguments are assigned to the same cluster.</Paragraph>
      <Paragraph position="3"> Features can be defined over any aspects of the lexicon; in our experiments we use binary features over constrained string edits between types. The following feature would be 1, for instance, if the Welsh types cymru and gymru (see Fig. 1), were assigned to the same cluster.3</Paragraph>
      <Paragraph position="5"> Setting the parameters of the MRF prior over this feature space by hand would require a priori knowledge of redundancies for the language pair.</Paragraph>
      <Paragraph position="6"> In the absence of such knowledge, we use an iterative EM algorithm to update the parameters on the basis of the previous solution to the bilingual clustering procedure. EM parameter estimation forces the cluster assignments of the MRF prior to agree with those obtained on the basis of bilingual data using monolingual features alone. Since features are tied across the MRF, patterns that characterise redundant relations between types will be re-enforced across the model. For instance (see Fig. 1), if cymru and gymru are clustered together, the parameter for featureps1, shown above, may increase. This induces a prior preference for car and gar to form a cluster on subsequent iterations. A similar feature defined for mar and gar in the a priori string edit feature space, on the other hand, may remain uninformative if not observed frequently on pairs of types assigned to the same clusters. In this way, the model learns to 3Here[?]matches a common substring of both arguments.</Paragraph>
      <Paragraph position="7">  generalise language-specific redundancy patterns from a large a priori feature space. Changes in the prior due to re-assignments can be calculated locally and combined with the marginal likelihood.</Paragraph>
    </Section>
    <Section position="3" start_page="972" end_page="972" type="sub_section">
      <SectionTitle>
3.3 Algorithmic approximations
</SectionTitle>
      <Paragraph position="0"> The model selection procedure is an EM algorithm. Each source type is initially assigned to its own cluster and the MRF parameters, li, are initialised to zero. A greedy E-step iteratively reassigns each source type to the cluster that maximises Eq. (1); cluster statistics are updated after any re-assignment. To reduce computation, we only consider re-assignments that would cause at least one (non-zero) feature in the MRF to fire, or to clusters containing types sharing target word-alignments with the current type; types may also be re-assigned to a cluster of their own at any iteration. When clustering both languages simultaneously, we average 'target' statistics over the number of events in each 'target' cluster in Eq. (2).</Paragraph>
      <Paragraph position="1"> We re-estimate the MRF parameters after each pass through the vocabulary. These are updated according to MLE using a pseudolikelihood approximation (Besag, 1986). Since MRF parameters can only be non-zero for features observed on types clustered together during an E-step, we use lazy instantiation to work with a large implicit feature set defined by a constrained string edit.</Paragraph>
      <Paragraph position="2"> The algorithm has two free parameters: adetermining the strength of the Dirichlet prior used in the marginal likelihood,p(D|C), andb which determines the contribution ofpMRF(C) to Eq. (1).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="972" end_page="973" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> Phrase-based SMT systems have been shown to outperform word-based approaches (Koehn et al., 2003). We evaluate the effects of lexicon model selection on translation quality by considering two applications within a phrase-based SMT system.</Paragraph>
    <Section position="1" start_page="972" end_page="972" type="sub_section">
      <SectionTitle>
4.1 Applications to phrase-based SMT
</SectionTitle>
      <Paragraph position="0"> A phrase-based translation model can be estimated in two stages: first a parallel corpus is aligned at the word level and then phrase pairs are extracted (Koehn et al., 2003). Aligning tokens in parallel sentences using the IBM Models (Brown et al., 1993; Och and Ney, 2003) may require less information than full-blown translation since the task is constrained by the source and target tokens present in each sentence pair. In the phrase-level translation table, however, the model must assign probabilities to a potentially unconstrained set of target phrases. We anticipate the optimal model sizes to be different for these two tasks.</Paragraph>
      <Paragraph position="1"> We can incorporate an optimised lexicon at the word-alignment stage by mapping tokens in the training corpus to their cluster labels. The mapping will not change the number of tokens in a sentence, hence the word-alignments can be associated with the original corpus (see Exp. 1).</Paragraph>
      <Paragraph position="2"> To extrapolate a mapping over phrases from our type-level models we can map each type within a phrase to its corresponding cluster label. This, however, results in a large number of distinct phrases being collapsed down to a single 'clustered phrase'. Using these directly may spread probability mass too widely. Instead we use them to smooth the phrase translation model (see Exp. 2). Here we consider a simple interpolation scheme; they could also be used within a backoff model (Yang and Kirchhoff, 2006).</Paragraph>
    </Section>
    <Section position="2" start_page="972" end_page="972" type="sub_section">
      <SectionTitle>
4.2 Experimental set-up
</SectionTitle>
      <Paragraph position="0"> The system we use is described in (Koehn, 2004). The phrase-based translation model includes phrase-level and lexical weightings in both directions. We use the decoder's default behaviour for unknown words copying them verbatim to the output. Smoothedtrigramlanguagemodelsareestimated on training sections of the parallel corpus. We used the parallel sections of the Prague Treebank (Cmejrek et al., 2004), French and English sections of the Europarl corpus (Koehn, 2005) and parallel text from the Welsh Assembly4 (see Table1). The source languages, Czech, French and Welsh, were chosen on the basis that they may exhibit different degrees of redundancy with respect to English and that they differ morphologically. Only the Czech corpus has explicit morphological annotation.</Paragraph>
    </Section>
    <Section position="3" start_page="972" end_page="973" type="sub_section">
      <SectionTitle>
4.3 Models
</SectionTitle>
      <Paragraph position="0"> All models used in the experiments are defined as mappings of the source and target vocabularies.</Paragraph>
      <Paragraph position="1"> The target vocabulary includes all distinct types  seen in the training corpus; the source vocabulary also includes types seen only in development and test data. Free parameters were set to max- null standard corresponds to the standard SMT lexicon. max-pref and min-freq are both simple stemming algorithms that can be applied to raw text. These mappings result in models defined over fewer distinctevents thatwill havehigher frequencies; min-freq optimises the latter directly. We optimise over (possibly different) values of n for source and target languages. The lemmatize mapping which maps types to their lemmas was only applicable to the Czech corpus.</Paragraph>
      <Paragraph position="2"> The optimised lexicon models define mappings directly via their clusterings of the vocabulary. We consider the following four models: * src: clustered source lexicon; * src+mrf: as src with MRF prior; * src+trg: clustered source and target lexicons; * src+trg+mrf: as src+trg with MRF priors.</Paragraph>
      <Paragraph position="3"> In each case we optimise overa(a single value for both languages) and, when using the MRF prior, overb (a single value for both languages).</Paragraph>
    </Section>
    <Section position="4" start_page="973" end_page="973" type="sub_section">
      <SectionTitle>
4.4 Experiments
</SectionTitle>
      <Paragraph position="0"> The two sets of experiments evaluate the base-line models and optimised lexicon models during word-alignment and phrase-level translation model estimation respectively.</Paragraph>
      <Paragraph position="1"> * Exp. 1: map the parallel corpus, perform word-alignment; estimate the phrase translation model using the original corpus.</Paragraph>
      <Paragraph position="2"> * Exp. 2: smooth the phrase translation model,</Paragraph>
      <Paragraph position="4"> Here e, f and ce, cf are phrases mapped under the standard model and the model being tested respectively; g is set once for all experiments on development data. Word-alignments were generated using the optimal max-pref mapping for each training set.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML