<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1009">
  <Title>Discriminative Word Alignment with Conditional Random Fields</Title>
  <Section position="4" start_page="65" end_page="65" type="metho">
    <SectionTitle>
2 Conditional random fields
</SectionTitle>
    <Paragraph position="0"> CRFs are undirected graphical models which define a conditional distribution over a label sequence given an observation sequence. We use a CRF to model many-to-one word alignments, where each source word is aligned with zero or one target words, and therefore each target word can be aligned with many source words. Each source word is labelled with the index of its aligned target, or the special value null, denoting no alignment. An example word alignment is shown in Figure 1, where the hollow squares and circles indicate the correct alignments. In this example the French words une and autre would both be assigned the index 24 - for the English word another - when French is the source language. When the source language is English, another could be assigned either index 25 or 26; in these ambiguous situations we take the first index.</Paragraph>
    <Paragraph position="1"> The joint probability density of the alignment, a (a vector of target indices), conditioned on the source and target sentences, e and f, is given by:</Paragraph>
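The equation introduced by this sentence is missing from this copy. A reconstruction consistent with the symbols defined in the following paragraph (weights $\lambda_k$, feature functions $h_k$, partition function $Z_\Lambda$) would be the standard linear-chain CRF form; the argument order of $h_k$ is an assumption:

```latex
p_\Lambda(a \mid e, f) = \frac{\exp \sum_t \sum_k \lambda_k \, h_k(t, a_{t-1}, a_t, e, f)}{Z_\Lambda(e, f)} \tag{1}
```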
    <Paragraph position="3"> where we make a first-order Markov assumption over the alignment sequence. [Figure 1 caption, partially recovered: "... Hansards test set. Hollow squares represent gold standard sure alignments, circles are gold possible alignments, and filled squares are predicted alignments."] Here t ranges over the indices of the source sentence (f), k ranges over the model's features, and $\Lambda = \{\lambda_k\}$ are the model parameters (weights for their corresponding features). The feature functions $h_k$ are pre-defined real-valued functions over the source and target sentences coupled with the alignment labels over adjacent times (source sentence locations), t. These feature functions are unconstrained, and may represent overlapping and non-independent features of the data. The distribution is globally normalised by the partition function, $Z_\Lambda(e,f)$, which sums out the numerator in (1) for every possible alignment:</Paragraph>
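The equation this sentence introduces is missing from this copy. Read literally, summing the numerator of the conditional distribution over every possible alignment gives the sketch below; the argument order of the feature functions is an assumption:

```latex
Z_\Lambda(e, f) = \sum_{a} \exp \sum_t \sum_k \lambda_k \, h_k(t, a_{t-1}, a_t, e, f)
```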
    <Paragraph position="5"> We use a linear chain CRF, which is encoded in the feature functions of (1).</Paragraph>
    <Paragraph position="6"> The parameters of the CRF are usually estimated from a fully observed training sample (word aligned), by maximising the likelihood of these data, i.e. $\Lambda_{ML} = \arg\max_\Lambda p_\Lambda(D)$, where $D = \{(a,e,f)\}$ are the training data. Because maximum likelihood estimators for log-linear models have a tendency to overfit the training sample (Chen and Rosenfeld, 1999), we define a prior distribution over the model parameters and derive a maximum a posteriori (MAP) estimate, $\Lambda_{MAP} = \arg\max_\Lambda p_\Lambda(D)\,p(\Lambda)$. We use a zero-mean Gaussian prior with variance $\sigma^2$, with the probability density function $p_0(\lambda_k) \propto \exp\left(-\frac{\lambda_k^2}{2\sigma^2}\right)$.</Paragraph>
    <Paragraph position="8"> In order to train the model, we maximize (2).</Paragraph>
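Equation (2) itself is not present in this copy. Given the MAP estimate and the zero-mean Gaussian prior described above, a plausible form is the penalised log-likelihood sketched below. The symbol $\sigma^2$ for the prior variance and the name $\mathcal{L}$ for the objective are assumptions, and the constant collects terms that do not depend on $\Lambda$:

```latex
\mathcal{L}(\Lambda) = \sum_{(a,e,f) \in D} \log p_\Lambda(a \mid e, f) + \sum_k \log p_0(\lambda_k) \tag{2}

\sum_k \log p_0(\lambda_k) = -\sum_k \frac{\lambda_k^2}{2\sigma^2} + \text{const}
```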
    <Paragraph position="9"> While the log-likelihood cannot be maximised for the parameters, $\Lambda$, in closed form, it is a concave function (its negation is convex), and thus we resort to numerical optimisation to find the globally optimal parameters. We use L-BFGS, an iterative quasi-Newton optimisation method, which performs well for training log-linear models (Malouf, 2002; Sha and Pereira, 2003). Each L-BFGS iteration requires the objective value and its gradient with respect to the model parameters. These are calculated using forward-backward inference, which yields the partition function, $Z_\Lambda(e,f)$, required for the log-likelihood, and the pair-wise marginals, $p_\Lambda(a_{t-1}, a_t \mid e, f)$, required for its derivatives.</Paragraph>
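To make the role of the forward pass concrete, the following is a minimal sketch of how the log partition function of a first-order chain can be computed, with a brute-force enumeration for checking. The callable `score`, the sentinel `START` and both function names are hypothetical; `score(t, prev, cur)` stands in for the weighted feature sum at source position t, and this is not the authors' implementation.

```python
import itertools
import math

START = "START"  # hypothetical sentinel for "no previous label" at t = 0


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_partition(score, n_positions, labels):
    """Forward recursion for log Z over a first-order chain.

    score(t, prev, cur) is an assumed callable returning the weighted
    feature sum at source position t for the label pair (prev, cur).
    """
    alpha = {cur: score(0, START, cur) for cur in labels}
    for t in range(1, n_positions):
        alpha = {
            cur: log_sum_exp([alpha[prev] + score(t, prev, cur) for prev in labels])
            for cur in labels
        }
    return log_sum_exp(list(alpha.values()))


def log_partition_brute_force(score, n_positions, labels):
    """Same quantity by explicit enumeration; exponential, for checking only."""
    totals = []
    for a in itertools.product(labels, repeat=n_positions):
        total = score(0, START, a[0])
        total += sum(score(t, a[t - 1], a[t]) for t in range(1, n_positions))
        totals.append(total)
    return log_sum_exp(totals)
```

The recursion visits every label pair once per source position, whereas the enumeration grows exponentially with sentence length, which is why the Markov assumption matters for training.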
    <Paragraph position="10"> The Viterbi algorithm is used to find the maximum posterior probability alignment for test sentences, $a^* = \arg\max_a p_\Lambda(a \mid e, f)$. Both the forward-backward and Viterbi algorithms are dynamic programs which make use of the Markov assumption to perform exact inference efficiently: the former yields the marginal distributions, the latter the highest-scoring alignment.</Paragraph>
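Decoding can be sketched in the same style. The function below is an illustrative Viterbi pass over a first-order chain; `score(t, prev, cur)` is again a hypothetical callable for the weighted feature sum, and the `start` sentinel is an assumption of this sketch.

```python
def viterbi(score, n_positions, labels, start="START"):
    """Return (best_score, best_alignment) for a first-order chain model."""
    best = {cur: score(0, start, cur) for cur in labels}
    back_pointers = []
    for t in range(1, n_positions):
        new_best, pointers = {}, {}
        for cur in labels:
            # Best predecessor for this label at position t.
            prev = max(labels, key=lambda p: best[p] + score(t, p, cur))
            new_best[cur] = best[prev] + score(t, prev, cur)
            pointers[cur] = prev
        best = new_best
        back_pointers.append(pointers)
    # Follow the back pointers from the best final label.
    last = max(labels, key=lambda a: best[a])
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path
```

Because the partition function does not depend on the alignment, maximising the unnormalised score is enough to find the most probable alignment.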
  </Section>
  <Section position="5" start_page="65" end_page="68" type="metho">
    <SectionTitle>
3 The alignment model
</SectionTitle>
    <Paragraph position="0"> Before we can apply our CRF alignment model, we must first specify the feature set - the functions $h_k$ in (1). Typically CRFs use binary indicator functions as features; these functions are only active when the observations meet some criteria and the label $a_t$ (or label pair, $(a_{t-1}, a_t)$) matches a pre-specified label (pair). However, in our model the labellings are word indices in the target sentence and cannot be compared readily to labellings at other sites in the same sentence, or in other sentences with a different length. Such naive features would only be active for one labelling, so the model would suffer from serious sparse data problems.</Paragraph>
    <Paragraph position="1"> We instead define features which are functions of the source-target word match implied by a labelling, rather than the labelling itself. For example, from the sentence in Figure 1 for the labelling of $f_{24}$ = de with $a_{24} = 16$ (for $e_{16}$ = of) we might detect the following feature:</Paragraph>
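The feature definition that followed this sentence is missing from this copy. Based on the example just given, it was presumably an indicator on the word pair rather than on the index; the sketch below assumes that reading and an assumed argument order for h:

```latex
h(t, a_{t-1}, a_t, e, f) =
\begin{cases}
1, \text{ if } e_{a_t} = \text{``of''} \text{ and } f_t = \text{``de''} \\
0, \text{ otherwise}
\end{cases}
```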
    <Paragraph position="3"> Note that it is the target word indexed by $a_t$, rather than the index itself, which determines whether the feature is active, and thus the sparsity of the index label set is not an issue.</Paragraph>
    <Section position="1" start_page="65" end_page="68" type="sub_section">
      <SectionTitle>
3.1 Features
</SectionTitle>
      <Paragraph position="0"> One of the main advantages of using a conditional model is the ability to explore a diverse range of features engineered for a specific task. In our CRF model we employ two main types of features: those defined on a candidate aligned pair of words; and Markov features defined on the alignment sequence predicted by the model.</Paragraph>
      <Paragraph position="1"> Dice and Model 1. As we have access to only a small amount of word aligned data, we wish to be able to incorporate information about word association from any sentence aligned data available. A common measure of word association is the Dice coefficient (Dice, 1945):</Paragraph>
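The formula is missing from this copy. The usual definition of the Dice coefficient, written with the count symbols defined in the next paragraph, is:

```latex
\mathrm{Dice}(e, f) = \frac{2 \, C_{EF}(e, f)}{C_E(e) + C_F(f)}
```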
      <Paragraph position="3"> where $C_E$ and $C_F$ are counts of the occurrences of the words e and f in the corpus, while $C_{EF}$ is their co-occurrence count. We treat these Dice values as translation scores: a high (low) value indicates that the word pair is a good (poor) candidate translation.</Paragraph>
      <Paragraph position="4"> However, the Dice score often over-estimates the association between common words. For instance, the words the and of both score highly when combined with either le or de, simply because these common words frequently co-occur.</Paragraph>
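As an illustration, Dice scores could be gathered from a sentence-aligned corpus roughly as follows. Counting each word type once per sentence pair is an assumption of this sketch (the text does not say how occurrences are counted), and the function name is hypothetical.

```python
from collections import Counter


def dice_scores(sentence_pairs):
    """Dice coefficient for every co-occurring (e, f) word pair.

    sentence_pairs is an iterable of (e_words, f_words) token lists.
    Assumption: a word type counts once per sentence pair.
    """
    count_e, count_f, count_ef = Counter(), Counter(), Counter()
    for e_words, f_words in sentence_pairs:
        e_types, f_types = set(e_words), set(f_words)
        count_e.update(e_types)
        count_f.update(f_types)
        count_ef.update((e, f) for e in e_types for f in f_types)
    return {
        (e, f): 2.0 * c / (count_e[e] + count_f[f])
        for (e, f), c in count_ef.items()
    }
```

On a toy corpus this also reproduces the over-estimation described above: a frequent word such as "the" scores highly with any frequent word it shares sentences with.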
      <Paragraph position="5"> The GIZA++ models can be used to provide better translation scores, as they enforce competition for alignment between the words. For this reason, we used the translation probability distribution from Model 1 in addition to the Dice scores. Model 1 is a simple position independent model which can be trained quickly and is often used to bootstrap parameters for more complex models. It models the conditional probability distribution:</Paragraph>
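The Model 1 equation is missing from this copy. In its common textbook form, with the length term written as a constant $\epsilon$ (an assumption here; the original may use different notation for the length term), it reads:

```latex
p(f, a \mid e) = \frac{\epsilon}{(|e| + 1)^{|f|}} \prod_{j=1}^{|f|} p(f_j \mid e_{a_j})
```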
      <Paragraph position="7"> where p(f|e) are the word translation probabilities. We use both the Dice value and the Model 1 translation probability as real-valued features for each candidate pair, as well as a normalised score over all possible candidate alignments for each target word. We derive a feature from both the Dice and Model 1 translation scores to allow competition between source words for a particular target alignment. This feature indicates whether a given alignment has the highest translation score of all the candidate alignments for a given target word. For the example in Figure 1, the words la, de and une all receive a high translation score when paired with the. To discourage all of these French words from aligning with the, the best of these (la) is flagged as the best candidate. This allows for competition between source words which would otherwise not occur.</Paragraph>
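The best-candidate flag described above could be computed along these lines. This is a sketch under assumptions: `score` is a hypothetical callable returning a Dice or Model 1 value for a word pair, and ties are broken in favour of the earliest source word, which the text does not specify.

```python
def best_candidates(source_words, target_words, score):
    """Return the set of (source_index, target_index) pairs that are flagged.

    For each target word, exactly one source word is flagged: the one with
    the highest translation score (earliest index wins ties; an assumption).
    """
    flagged = set()
    for j, target in enumerate(target_words):
        scores = [score(source, target) for source in source_words]
        if scores:
            i = max(range(len(scores)), key=scores.__getitem__)
            flagged.add((i, j))
    return flagged
```

With scores that favour la over de and une for the target word the, only the la pairing is flagged, giving the model a signal to prefer it.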
      <Paragraph position="8"> Orthographic features. Features based on string overlap allow our model to recognise cognates and orthographically similar translation pairs, which are particularly common between European languages. Here we employ a number of string matching features inspired by similar features in Taskar et al. (2005). We use an indicator feature for every possible source-target word pair in the training data. In addition, we include indicator features for an exact string match, both with and without vowels, and the edit-distance between the source and target words as a real-valued feature. We also used indicator features to test for matching prefixes and suffixes of length three. As stated earlier, the Dice translation score often erroneously rewards alignments with common words. In order to address this problem, we include the absolute difference in word length as a real-valued feature and an indicator feature testing whether both words are shorter than 4 characters. Together these features allow the model to disprefer alignments between words with very different lengths - i.e. aligning rare (long) words with frequent (short) determiners, verbs etc.</Paragraph>
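A minimal sketch of these string features follows. The feature names, the lowercasing step and the plain a/e/i/o/u vowel set (which ignores accented vowels) are assumptions of this example rather than details given in the text.

```python
VOWELS = set("aeiou")  # assumption: unaccented vowels only


def edit_distance(a, b):
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def strip_vowels(word):
    return "".join(ch for ch in word if ch not in VOWELS)


def orthographic_features(source, target):
    """Real-valued and indicator string features for one candidate word pair."""
    s, t = source.lower(), target.lower()
    long_enough = len(s) >= 3 and len(t) >= 3
    return {
        "exact_match": float(s == t),
        "exact_match_no_vowels": float(strip_vowels(s) == strip_vowels(t)),
        "edit_distance": float(edit_distance(s, t)),
        "prefix3_match": float(long_enough and s[:3] == t[:3]),
        "suffix3_match": float(long_enough and s[-3:] == t[-3:]),
        "length_difference": float(abs(len(s) - len(t))),
        "both_short": float(4 > len(s) and 4 > len(t)),
    }
```

For a cognate-like pair such as canadien and Canadian, the vowel-stripped match fires even though the exact match does not.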
      <Paragraph position="9"> POS tags. Part-of-speech tags are an effective method for addressing the sparsity of the lexical features. Observe in Figure 2 that the adjective-noun pair Canadian experts aligns with the noun-adjective pair spécialistes canadiens: the alignment exactly matches the parts-of-speech.</Paragraph>
      <Paragraph position="10"> Access to the words' POS tags will allow simple modelling of such effects. POS can also be useful for less closely related language pairs, such as English and Japanese where English determiners are never aligned; nor are Japanese case markers.</Paragraph>
      <Paragraph position="11"> For our French-English language pair we POS tagged the source and target sentences with TreeTagger. We created indicator features over the POS tags of each candidate source and target word pair, as well as over the source word and target POS (and vice-versa). As we did not have access to a Romanian POS tagger, these features were not used for the Romanian-English language pair.</Paragraph>
      <Paragraph position="12"> Bilingual dictionary. Dictionaries are another source of information for word alignment. We use a single indicator feature which detects when the source and target words appear in an entry of the dictionary. For the English-French dictionary we used FreeDict, which contains 8,799 English words. For Romanian-English we used a dictionary compiled by Rada Mihalcea, which contains approximately 38,000 entries.</Paragraph>
      <Paragraph position="13"> Markov features. Features defined over adjacent alignment labels allow our model to reflect the tendency for monotonic alignments between European languages. We define a real-valued alignment index jump width feature: $\mathrm{jump\_width}(t-1, t) = \mathrm{abs}(a_t - a_{t-1} - 1)$. This feature has a value of 0 if the alignment labels follow the downward sloping diagonal, and is positive otherwise. This differs from the GIZA++ hidden Markov model, which has individual parameters for each different jump width (Och and Ney, 2003; Vogel et al., 1996): we found a single feature (and thus parameter) to be more effective.</Paragraph>
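The jump width feature is simple enough to state directly in code. In this sketch a null label is represented by None and contributes a jump width of 0; that handling is an assumption, since the text treats null transitions through separate indicator features.

```python
def jump_width(prev_label, cur_label):
    """abs(a_t - a_{t-1} - 1); 0 means the alignment stays on the diagonal.

    Assumption: transitions involving a null label (None) contribute 0 here.
    """
    if prev_label is None or cur_label is None:
        return 0
    return abs(cur_label - prev_label - 1)
```

Moving from target index 3 to 4 gives 0, staying on 3 gives 1, and jumping back from 5 to 2 gives 4.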
      <Paragraph position="14"> We also defined three indicator features over null transitions to allow the modelling of the probability of transition between, to and from null labels. Relative sentence position. A feature for the absolute difference in relative sentence position ($\mathrm{abs}(a_t/|e| - t/|f|)$) allows the model to learn a preference for aligning words close to the alignment matrix diagonal. We also included two conjunction features for the relative sentence position multiplied by the Dice and Model 1 translation scores.</Paragraph>
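The relative position feature can be sketched in the same way. The function and parameter names are hypothetical; `source_length` corresponds to |f| and `target_length` to |e|.

```python
def relative_position_difference(t, a_t, source_length, target_length):
    """abs(a_t / |e| - t / |f|) for source position t aligned to target index a_t."""
    return abs(a_t / target_length - t / source_length)
```

A pair on the diagonal of the alignment matrix scores 0, and the value grows as the aligned target word moves away from the relative position of the source word.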
      <Paragraph position="15"> Null. We use a number of variants on the above features for alignments between a source word and the null target. The maximum translation score between the source and one of the target words is used as a feature to represent whether there is a strong alignment candidate. The sum of these scores is also used as a feature. Each source word and POS tag pair are used as indicator features which allow the model to learn particular words or tags which tend to commonly (or rarely) align.</Paragraph>
    </Section>
    <Section position="2" start_page="68" end_page="68" type="sub_section">
      <SectionTitle>
3.2 Symmetrisation
</SectionTitle>
      <Paragraph position="0"> In order to produce many-to-many alignments we combine the outputs of two models, one for each translation direction. We use the refined method from Och and Ney (2003) which starts from the intersection of the two models' predictions and 'grows' the predicted alignments to neighbouring alignments which only appear in the output of one of the models.</Paragraph>
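The general shape of this kind of symmetrisation can be sketched as below. This is a deliberately simplified grow heuristic, not the exact refined method of Och and Ney (2003), which imposes additional conditions on which links may be added; the function name and the link representation as (source_index, target_index) pairs are assumptions.

```python
def grow_from_intersection(forward, backward):
    """Start from the intersection and add union links adjacent to accepted ones.

    Simplified illustration only: a candidate link is accepted when it is a
    horizontal or vertical neighbour of an already accepted link.
    """
    alignment = set(forward).intersection(backward)
    candidates = set(forward).union(backward) - alignment
    changed = True
    while changed:
        changed = False
        for i, j in sorted(candidates):
            neighbours = {(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)}
            if neighbours.intersection(alignment):
                alignment.add((i, j))
                candidates.remove((i, j))
                changed = True
    return alignment
```

Links that appear in only one direction and are isolated from the agreed links are left out, which keeps the precision of the intersection while recovering some recall.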
    </Section>
  </Section>
  <Section position="6" start_page="68" end_page="70" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We have applied our model to two publicly available word aligned corpora. The first is the English-French Hansards corpus, which consists of 1.1 million aligned sentences and 484 word-aligned sentences. This data set was used for the 2003 NAACL shared task (Mihalcea and Pedersen, 2003), where the word-aligned sentences were split into a 37 sentence trial set and a 447 sentence testing set. Unlike the unsupervised entrants in the 2003 task, we require word-aligned training data, and therefore must cannibalise the test set for this purpose. We follow Taskar et al. (2005) by using the first 100 test sentences for training and the remaining 347 for testing. This means that our results should not be directly compared to those entrants, other than in an approximate manner. We used the original 37 sentence trial set for feature engineering and for fitting a Gaussian prior.</Paragraph>
    <Paragraph position="1"> The word aligned data are annotated with both sure (S) and possible (P) alignments ($S \subseteq P$; Och and Ney, 2003), where the possible alignments indicate ambiguous or idiomatic alignments. We measure the performance of our model using alignment error rate (AER), which is defined as: $\mathrm{AER}(A,S,P) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}$, where A is the set of predicted alignments.</Paragraph>
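The AER definition translates directly into a few lines of code. The function name and the representation of links as (source_index, target_index) pairs are assumptions; the sketch adds the sure links to the possible set so that S is always a subset of P.

```python
def alignment_error_rate(predicted, sure, possible):
    """AER(A, S, P) = 1 - (|A n S| + |A n P|) / (|A| + |S|).

    Assumes at least one predicted or sure link, otherwise the ratio is undefined.
    """
    a, s = set(predicted), set(sure)
    p = set(possible).union(s)  # enforce S as a subset of P
    matched = len(a.intersection(s)) + len(a.intersection(p))
    return 1.0 - matched / (len(a) + len(s))
```

When only sure links are annotated (P equal to S), the value equals one minus the balanced F-score of the predicted links against S.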
    <Paragraph position="2"> The second data set is the Romanian-English parallel corpus from the 2005 ACL shared task (Martin et al., 2005). This consists of approximately 50,000 aligned sentences and 448 word-aligned sentences, which are split into a 248 sentence trial set and a 200 sentence test set. We used these as our training and test sets, respectively. For parameter tuning, we used the 17 sentence trial set from the Romanian-English corpus in the 2003 NAACL task (Mihalcea and Pedersen, 2003). For this task we have used the same test data as the competition entrants, and therefore can directly compare our results. The word alignments in this corpus were only annotated with sure (S) alignments, and therefore the AER is equivalent to one minus the F1 score. In the shared task it was found that models which were trained on only the first four letters of each word obtained superior results to those using the full words (Martin et al., 2005).</Paragraph>
    <Paragraph position="3"> We observed the same result with our model on the trial set and thus have only used the first four letters when training the Dice and Model 1 translation probabilities.</Paragraph>
    <Paragraph position="4"> Tables 1 and 2 show the results when all feature types are employed on both language pairs. We report the results for both translation directions and when combined using the refined and intersection methods. The Model 4 results are from GIZA++ with the default parameters and the training data lowercased. For Romanian, Model 4 was trained using the first four letters of each word.</Paragraph>
    <Paragraph position="5"> The Romanian results are close to the best reported result of 26.10 from the ACL shared task (Martin et al., 2005). This result was from a system based on Model 4 plus additional parameters such as a dictionary. The standard Model 4 implementation in the shared task achieved a result of 31.65, while when only the first 4 letters of each word were used it achieved 28.80.</Paragraph>
    <Section position="1" start_page="69" end_page="70" type="sub_section">
      <Paragraph position="0"> Table 3 shows the effect of removing each of the feature types in turn from the full model. The most useful features are the Dice and Model 1 values, which allow the model to incorporate translation probabilities from the large sentence aligned corpora. This is to be expected as the amount of word aligned data is extremely small, and therefore the model can estimate translation probabilities for only a fraction of the lexicon. We would expect the dependence on sentence aligned data to decrease as more word aligned data becomes available. The effect of removing the Markov features can be seen from comparing Figures 2 (a) and (b). The model has learnt to prefer alignments that follow the diagonal, thus alignments such as 3 - three and prestation - provision are found, and misalignments such as de - of, which lie well off the diagonal, are avoided.</Paragraph>
      <Paragraph position="1"> The differing utility of the alignment word pair feature between the two tasks is probably a result of the different proportions of word- to sentence-aligned data. For the French data, where a very large lexicon can be estimated from the million sentence alignments, the sparse word pairs learnt on the word aligned sentences appear to lead to overfitting. In contrast, for Romanian, where more word alignments are used to learn the translation pair features and much less sentence aligned data are available, these features have a significant impact on the model. Surprisingly, the orthographic features actually worsen the performance in the tasks (incidentally, these features help on the trial set). Our explanation is that the other features (e.g. Model 1) already adequately model these correspondences, and therefore the orthographic features do not add much additional modelling power. [Table caption, partially recovered: "... groups of features from the full model."]</Paragraph>
      <Paragraph position="3"> We expect that with further careful feature engineering, and a larger trial set, these orthographic features could be much improved.</Paragraph>
      <Paragraph position="4"> The Romanian-English language pair appears to offer a more difficult modelling problem than the French-English pair. With both the translation score features (Dice and Model 1) removed - the sentence aligned data are not used - the AER of the Romanian is more than twice that of the French, despite employing more word aligned data. This could be caused by the lack of possible (P) alignment markup in the Romanian data; on the French data set, the possible alignments improve AER by rewarding what would otherwise be considered errors. Interestingly, without any features derived from the sentence aligned corpus, our model achieves performance equivalent to Model 3 trained on the full corpus (Och and Ney, 2003).</Paragraph>
      <Paragraph position="5"> This is a particularly strong result, indicating that this method is ideal for data-impoverished alignment tasks.</Paragraph>
    </Section>
    <Section position="2" start_page="70" end_page="70" type="sub_section">
      <SectionTitle>
4.1 Training with possible alignments
</SectionTitle>
      <Paragraph position="0"> Up to this point our Hansards model has been trained using only the sure (S) alignments. As the data set contains many possible (P) alignments, we would like to use these to improve our model.</Paragraph>
      <Paragraph position="1"> Most of the possible alignments flag blocks of ambiguous or idiomatic (or just difficult) phrase level alignments. These many-to-many alignments cannot be modelled with our many-to-one setup. However, a number of possibles flag one-to-one or many-to-one alignments: for this experiment we used these possibles in training to investigate their effect on recall. Using these additional alignments our refined precision decreased from 95.7 to 93.5, while recall increased from 89.2 to 92.4. This resulted in an overall decrease in AER to 6.99. We found no benefit from using many-to-many possible alignments as they added a significant amount of noise to the data.</Paragraph>
    </Section>
    <Section position="3" start_page="70" end_page="70" type="sub_section">
      <SectionTitle>
4.2 Model 4 as a feature
</SectionTitle>
      <Paragraph position="0"> Previous work (Taskar et al., 2005) has demonstrated that by including the output of Model 4 as a feature, it is possible to achieve a significant decrease in AER. We trained Model 4 in both directions on the two language pairs. We added two indicator features (one for each direction) to our CRF which were active if a given word pair was aligned in the Model 4 output. Table 4 displays the results on both language pairs when these additional features are used with the refined model.</Paragraph>
      <Paragraph position="1"> This produces a large increase in performance, and when including the possibles, produces AERs of 5.29 and 25.8, both well below that of Model 4 alone (shown in Tables 1 and 2).</Paragraph>
    </Section>
    <Section position="4" start_page="70" end_page="70" type="sub_section">
      <SectionTitle>
4.3 Cross-validation
</SectionTitle>
      <Paragraph position="0"> Using 10-fold cross-validation we are able to generate results on the whole of the Hansards test data which are comparable to previously published results. As the sentences in the test set were randomly chosen from the training corpus we can expect cross-validation to give an unbiased estimate of generalisation performance. These results are displayed in Table 5, using the possible (P) alignments for training. As the training set for each fold is roughly four times as big as the previous training set, we see a small improvement in AER.</Paragraph>
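The fold arithmetic can be checked with a short sketch. Splitting into contiguous folds is an assumption made for illustration (the text does not say how folds were formed), and the function name is hypothetical. With 447 sentences and 10 folds, each training portion holds 402 or 403 sentences, about four times the 100 used earlier.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    bounds = [(i * n_items) // k for i in range(k + 1)]
    for fold in range(k):
        lo, hi = bounds[fold], bounds[fold + 1]
        test = list(range(lo, hi))
        train = list(range(0, lo)) + list(range(hi, n_items))
        yield train, test
```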
      <Paragraph position="1"> The final results of 6.47 and 5.19, without and with Model 4 features respectively, both exceed the performance of Model 4 alone. However, the unsupervised Model 4 did not have access to the word-alignments in our training set. Callison-Burch et al. (2004) demonstrated that the GIZA++ models could be trained in a semi-supervised manner, leading to a slight decrease in error. To our knowledge, our AER of 5.19 is the best reported result, generative or discriminative, on this data set. [Table residue: column headers model, precision, recall, f-score, AER; caption fragment "...out Model 4 features."]</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="70" end_page="71" type="metho">
    <SectionTitle>
5 Related work
</SectionTitle>
    <Paragraph position="0"> Recently, a number of discriminative word alignment models have been proposed; however, these early models are typically very complicated, with many posing intractable problems which require heuristics for approximate inference (Liu et al., 2005; Moore, 2005).</Paragraph>
    <Paragraph position="1"> An exception is Taskar et al. (2005), who presented a word matching model for discriminative alignment which they were able to solve optimally. However, their model is limited to only providing one-to-one alignments. Also, no features were defined on label sequences, which reduced the model's ability to capture the strong monotonic relationships present between European language pairs. On the French-English Hansards task, using the same training/testing setup as our work, they achieve an AER of 5.4 with Model 4 features, and 10.7 without (compared to 5.29 and 6.99 for our CRF). One of the strengths of the CRF MAP estimation is the powerful smoothing offered by the prior, which allows us to avoid heuristics such as early stopping and hand weighted loss-functions that were needed for the maximum-margin model.</Paragraph>
    <Paragraph position="2"> Liu et al. (2005) used a conditional log-linear model with similar features to those we have employed. They formulated a global model, without making a Markovian assumption, leading to the need for sub-optimal heuristic search strategies.</Paragraph>
    <Paragraph position="3"> Ittycheriah and Roukos (2005) trained a discriminative model on a corpus of ten thousand word aligned Arabic-English sentence pairs that outperformed a GIZA++ baseline. As with other approaches, they proposed a model which did not allow a tractably optimal solution and thus had to resort to a heuristic beam search. They employed a log-linear model to learn the observation probabilities, while using a fixed transition distribution. Our CRF model allows both the observation and transition components of the model to be jointly optimised from the corpus.</Paragraph>
  </Section>
  <Section position="8" start_page="71" end_page="71" type="metho">
    <SectionTitle>
6 Further work
</SectionTitle>
    <Paragraph position="0"> The results presented in this paper were evaluated in terms of AER. While a low AER can be expected to improve end-to-end translation quality, this may not necessarily be the case. Therefore, we plan to assess how the recall and precision characteristics of our model affect translation quality. The tradeoff between recall and precision may affect the quality and number of phrases extracted for a phrase translation table.</Paragraph>
  </Section>
</Paper>