<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1015"> <Title>Word Alignment via Quadratic Assignment</Title> <Section position="5" start_page="116" end_page="118" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We applied our algorithms to word-level alignment using the English-French Hansards data from the 2003 NAACL shared task (Mihalcea and Pedersen, 2003). This corpus consists of 1.1M automatically aligned sentences, and comes with a validation set of 37 sentence pairs and a test set of 447 sentences. The validation and test sentences have been hand-aligned (see Och and Ney (2003)) and are marked with both sure and possible alignments. Using these alignments, alignment error rate (AER) is calculated as:</Paragraph> <Paragraph position="2"> AER(A, S, P) = (1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)) x 100%.</Paragraph> <Paragraph position="3"> Here, A is the set of proposed index pairs, S is the set of sure gold pairs, and P is the set of possible gold pairs. For example, in Figure 4, proposed alignments are shown against gold alignments, with open squares for sure alignments, rounded open squares for possible alignments, and filled black squares for proposed alignments.</Paragraph> <Paragraph position="4"> The input to our algorithm is a small number of labeled examples. In order to make our results more comparable with Moore (2005), we split the original set into 200 training examples and 247 test examples. We also trained on only the first 100 to make our results more comparable with the experiments of Och and Ney (2003), in which IBM model 4 was tuned using 100 sentences. In all our experiments, we used a structured loss function that penalized false negatives 10 times more than false positives, where the value of 10 was picked using a validation set. 
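As a quick illustration of the AER metric defined above, here is a minimal Python sketch; representing A, S, and P as sets of index pairs is our assumption for the example:

```python
def aer(A, S, P):
    """Alignment error rate: A = proposed pairs, S = sure gold pairs,
    P = possible gold pairs (with S a subset of P)."""
    return (1 - (len(A & S) + len(A & P)) / (len(A) + len(S))) * 100

S = {(1, 1), (2, 2)}
P = {(1, 1), (2, 2), (2, 3)}
aer({(1, 1), (2, 2)}, S, P)  # proposing exactly the sure pairs gives 0.0
aer({(1, 1), (2, 3)}, S, P)  # missing a sure pair gives 25.0
```

Note that proposing a possible-but-not-sure pair is not penalized as heavily as a pair outside P, which is the point of the sure/possible distinction.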
The regularization parameter γ was also chosen using the validation set.</Paragraph> <Section position="1" start_page="116" end_page="118" type="sub_section"> <SectionTitle> 4.1 Features and results </SectionTitle> <Paragraph position="0"> We parameterized all scoring functions s_jk, s_j*, s_*k, and s_jklm as weighted linear combinations of feature sets. The features were computed from the large unlabeled corpus of 1.1M automatically aligned sentences.</Paragraph> <Paragraph position="1"> In the remainder of this section we describe how model performance improves as various features are added. One of the most useful features for the basic matching model is, of course, the set of predictions of IBM model 4. However, computing these features is very expensive, and we would like to build a competitive model that does not require them. Instead, we made significant use of IBM model 2 as a source of features. Although not very accurate as a predictive model, it is simple and cheap to construct and provides a useful source of features.</Paragraph> <Paragraph position="2"> The Basic Matching Model: Edge Features In the basic matching model of Taskar et al. (2005), called M here, one can only specify features on pairs of word tokens, i.e., alignment edges. These features include word association, orthography, proximity, etc., and are documented in Taskar et al. (2005). We also augmented those features with the predictions of IBM Model 2 run on the training and test sentences.</Paragraph> <Paragraph position="3"> We provided features for model 2 trained in each direction, as well as the intersected predictions, on each edge. By including the IBM Model 2 features, the performance of the model described in Taskar et al. 
(2005) on our test set (trained on 200 sentences) improves from 10.0 AER to 8.2 AER, outperforming unsymmetrized IBM Model 4 (but not intersected model 4).</Paragraph> <Paragraph position="4"> As an example of the kinds of errors the baseline M system makes, see Figure 2 (where multiple fertility cannot be predicted), Figure 3 (where a preference for monotonicity cannot be modeled), and Figure 4 (which shows several multi-fertile cases).</Paragraph> <Paragraph position="5"> The Fertility Model: Node Features To address errors like those shown in Figure 2, we increased the maximum fertility to two using the parameterized fertility model of Section 2.1. The model learns costs on the second flow arc for each word via features not of edges but of single words. The score of taking a second match for a word w was based on the following features: a bias feature, the proportion of times w's type was aligned to two or more words by IBM model 2, and the bucketed frequency of the word type. This model was called M+F. We also included a lexicalized feature for words that were common in our training set: whether w was ever seen in a multiple fertility alignment (more on this feature later). This enabled the system to learn that certain words, such as the English not and French verbs like aurait, commonly participate in multiple fertility configurations.</Paragraph> <Paragraph position="6"> Figure 5 shows the results using the fertility extension. Adding fertility lowered AER from 8.5 to 8.1, though fertility was even more effective in conjunction with the quadratic features below. 
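The node features for the second flow arc can be sketched as follows; the feature names, the bucketing scheme, and the function signature are illustrative assumptions, not the paper's exact implementation:

```python
import math

def second_match_features(ibm2_multi_frac, corpus_count, seen_multi_fertility):
    """Features scoring a second match (fertility two) for a word type:
    a bias, the fraction of times IBM model 2 aligned the type to two or
    more words, a bucketed frequency, and the lexicalized multi-fertility
    indicator for common words. All names here are illustrative."""
    return {
        "bias": 1.0,
        "ibm2_multi_frac": ibm2_multi_frac,
        # bucket raw corpus counts on a log10 scale, capped at 5
        "freq_bucket": min(int(math.log10(corpus_count + 1)), 5),
        "seen_multi_fertility": 1.0 if seen_multi_fertility else 0.0,
    }

second_match_features(0.31, 12000, True)  # e.g. the English "not"
```

A learned weight vector over such features then gives the cost of the second flow arc for each word.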
The M+F setting was even able to correctly learn some multiple fertility instances which were not seen in the training data, such as those shown in Figure 2.</Paragraph> <Paragraph position="7"> The First-Order Model: Quadratic Features With or without the fertility model, the model makes mistakes such as those shown in Figure 3, where atypical translations of common words are not chosen despite their local support from adjacent edges. In the quadratic model, we can associate features with pairs of edges. We began with features that identify each specific pattern, enabling trends of monotonicity (or inversion) to be captured. We also added to each edge pair the fraction of times that pair's pattern (monotonic, inverted, one-to-two) occurred according to each version of IBM model 2 (forward, backward, intersected).</Paragraph> <Paragraph position="8"> Figure 5 shows the results of adding the quadratic model. M+Q reduces error over M from 8.5 to 6.7 (and fixes the errors shown in Figure 3). When both the fertility and quadratic extensions were added, AER dropped further, to 6.2. This final model is even able to capture the diamond pattern in Figure 4; the adjacent cycle of alignments is reinforced by the quadratic features which boost adjacency. The example in Figure 4 shows another interesting phenomenon: the multi-fertile alignments for not and député are learned even without lexical fertility features (Figure 4b), because the Dice coefficients of those words with their two alignees are both high.</Paragraph> <Paragraph position="9"> However, the surface association of aurait with have is much higher than with would. If, however, lexical features are added, would is correctly aligned as well (Figure 4c), since it is observed in similar periphrastic constructions in the training set.</Paragraph> <Paragraph position="10"> We have avoided using expensive-to-compute features like IBM model 4 predictions up to this point. 
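A rough sketch of how an adjacent edge pair might be mapped to one of these patterns; the classification rule and labels are our illustrative reading of the text, not the paper's code:

```python
def edge_pair_pattern(edge1, edge2):
    """Classify a pair of alignment edges (j, k): a shared endpoint
    indicates a one-to-two pattern; otherwise the relative direction
    of the two edges gives a monotonic or inverted pattern."""
    (j1, k1), (j2, k2) = edge1, edge2
    if j1 == j2 or k1 == k2:
        return "one-to-two"
    return "monotonic" if (j2 - j1) * (k2 - k1) > 0 else "inverted"

edge_pair_pattern((1, 1), (2, 2))  # "monotonic"
edge_pair_pattern((1, 2), (2, 1))  # "inverted"
edge_pair_pattern((3, 4), (3, 5))  # "one-to-two"
```

Each pattern's indicator, together with its frequency under the forward, backward, and intersected model 2 predictions, would then form the quadratic feature vector for that edge pair.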
However, if these are available, our model can improve further. By adding model 4 predictions to the edge features, we get a relative AER reduction of 27%, from 6.2 to 4.5. By also including as features the posteriors of the model of Liang et al. (2006), we achieve an AER of 3.8, and 96.7/95.5 precision/recall.</Paragraph> <Paragraph position="11"> It is comforting to note that in practice, the burden of running an integer linear program at test time can be avoided. We experimented with using just the LP relaxation and found that on the test set, only about 20% of sentences have fractional solutions and only 0.2% of all edges are fractional. Simple rounding3 of each edge value in the LP solution achieves the same AER as the integer LP solution, while using about a third of the computation time on average.</Paragraph> <Paragraph position="12"> 3 We slightly bias the system on the recall side by rounding 0.5 up, but this does not yield a noticeable difference in the results.</Paragraph> </Section> </Section> </Paper>