<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1019"> <Title>A Comparison of Syntactically Motivated Word Alignment Spaces</Title> <Section position="5" start_page="149" end_page="151" type="evalu"> <SectionTitle> 4 Experiments and Results </SectionTitle> <Paragraph position="0"> We compare the alignment spaces described in this paper under two criteria. First, we test the guidance provided by a space, or its capacity to stop an aligner from selecting bad alignments. We also test expressiveness, or how often a space allows an aligner to select the best alignment.</Paragraph> <Paragraph position="1"> In all cases, we report our results in terms of alignment quality, using the standard word alignment error metrics: precision, recall, F-measure and alignment error rate (Och and Ney, 2003). Our test set is the 500 manually aligned sentence pairs created by Franz Och and Hermann Ney (2003).</Paragraph> <Paragraph position="2"> These English-French pairs are drawn from the Canadian Hansards. English dependency trees are supplied by Minipar (Lin, 1994).</Paragraph> <Section position="1" start_page="149" end_page="149" type="sub_section"> <SectionTitle> 4.1 Objective Function </SectionTitle> <Paragraph position="0"> In our experiments, we hold all variables constant except for the alignment space being searched, and in the case of imperfect searches, the search method. In particular, all of the methods we test will use the same objective function to select the &quot;best&quot; alignment from their space. Let A be an alignment for an English, Foreign sentence pair, (E,F). A is represented as a set of links, where each link is a pair of English and Foreign positions, (i,j), that are connected by the alignment.</Paragraph> <Paragraph position="1"> The score of a proposed alignment is:</Paragraph> <Paragraph position="2"> f_align(A) = Σ_{(i,j) ∈ A} f_link(i, j) </Paragraph> <Paragraph position="3"> Note that this objective function evaluates each link independently, unaware of the other links selected. 
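As a concrete illustration, a perfect search of permutation space under this summed-link-score objective can be sketched as follows. This is a toy, brute-force stand-in for the weighted maximum matching algorithm, not the authors' implementation; the function names are ours.

```python
# Toy sketch: score an alignment by summing independent link scores, and
# find the best one-to-one alignment by exhaustive search over permutations.
# Brute force stands in for weighted maximum matching (the Match method).
from itertools import permutations

def f_align(alignment, link_scores):
    # The objective scores each link (i, j) independently.
    return sum(link_scores[i][j] for i, j in alignment)

def match(link_scores):
    # Try every one-to-one assignment of English to Foreign positions
    # and keep the highest-scoring one. Suitable for toy inputs only.
    n = len(link_scores)
    best = max(permutations(range(n)),
               key=lambda p: f_align(list(enumerate(p)), link_scores))
    return sorted(enumerate(best))

scores = [[0.9, 0.1, 0.0],
          [0.2, 0.8, 0.1],
          [0.0, 0.3, 0.7]]
print(match(scores))   # prints [(0, 0), (1, 1), (2, 2)]
```

A real system would use the Hungarian algorithm rather than enumeration, but the objective being maximized is the same.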
Taskar et al. (2005) have shown that with a strong f_link, one can achieve state-of-the-art results using this objective function and the maximum matching algorithm. Our two experiments will vary the definition of f_link to test different aspects of alignment spaces.</Paragraph> <Paragraph position="4"> All of the methods will create only one-to-one alignments. Phrasal alignment would introduce unnecessary complications that could mask some of the differences in the re-orderings defined by these spaces.</Paragraph> </Section> <Section position="2" start_page="149" end_page="149" type="sub_section"> <SectionTitle> 4.2 Search methods tested </SectionTitle> <Paragraph position="0"> We test seven methods, one for each of the four syntactic spaces described in this paper, and three variations of search in permutation space: Greedy: A greedy search of permutation space.</Paragraph> <Paragraph position="1"> Links are added in the order of their link scores. This corresponds to the competitive linking algorithm (Melamed, 2000).</Paragraph> <Paragraph position="2"> Beam: A beam search of permutation space, where links are added to a growing alignment, biased by their link scores. Beam width is 2 and agenda size is 40.</Paragraph> <Paragraph position="3"> Match: The weighted maximum matching algorithm (West, 2001). This is a perfect search of permutation space.</Paragraph> <Paragraph position="4"> ITG: The alignment resulting from ITG parsing with the canonical grammar in (2). This is a perfect search of ITG space.</Paragraph> <Paragraph position="5"> Dep: A beam search of the dependency space.</Paragraph> <Paragraph position="6"> This is equivalent to Beam plus a dependency constraint.</Paragraph> <Paragraph position="7"> D-ITG: The result of ITG parsing as described in Section 3.2. This is a perfect search of the intersection of the ITG and dependency spaces. 
HD-ITG: The D-ITG method with an added head constraint, as described in Section 3.3.</Paragraph> </Section> <Section position="3" start_page="149" end_page="150" type="sub_section"> <SectionTitle> 4.3 Learned objective function </SectionTitle> <Paragraph position="0"> The link score f_link is usually imperfect, because it is learned from data. Appropriately defined alignment spaces may rule out bad links even if they are assigned high f_link values, based on other links in the alignment. We define the following simple link score to test the guidance provided by different alignment spaces:</Paragraph> <Paragraph position="1"> f_link(a) = φ2(e_i, f_j) − C · |i − j| </Paragraph> <Paragraph position="2"> Here, a = (i,j) is a link and φ2(e_i, f_j) returns the φ2 correlation metric (Gale and Church, 1991) between the English token at i and the Foreign token at j. The φ2 scores were obtained using co-occurrence counts from 50k sentence pairs of Hansard data. The second term is an absolute position penalty. C is a small constant selected to be just large enough to break ties in favor of similar positions. Links to null are given a flat score of 0, while token pairs with no value in our φ2 table are assigned −1.</Paragraph> <Paragraph position="3"> The results of maximizing f_align on our test set are shown in Table 1. The first thing to note is that our f_link is not artificially weak. Our function takes into account token pairs and position, making it roughly equivalent to IBM Model 2.</Paragraph> <Paragraph position="4"> Our weakest method outperforms Model 2, which scores an AER of 22.0 on this test set when trained with roughly twice as many sentence pairs (Och and Ney, 2003).</Paragraph> <Paragraph position="5"> The various search methods fall into three categories in terms of alignment accuracy. The searches through permutation space all have AERs of roughly 20, with the more complete searches scoring better. The ITG method scores an AER of 17.4, a 10% reduction in error rate from maximum matching. 
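The φ2 correlation above can be computed from 2x2 co-occurrence counts over the training sentence pairs. A minimal sketch, with argument names of our own choosing:

```python
# Minimal sketch of the phi-squared association score (Gale and Church)
# computed from co-occurrence counts over sentence pairs. Names are ours.
def phi2(n_ef, n_e, n_f, n_total):
    a = n_ef                          # sentence pairs containing both tokens
    b = n_e - n_ef                    # English token without the Foreign one
    c = n_f - n_ef                    # Foreign token without the English one
    d = n_total - n_e - n_f + n_ef    # pairs containing neither token
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return ((a * d - b * c) ** 2) / denom if denom else 0.0

print(phi2(10, 10, 10, 20))   # perfectly correlated pair scores 1.0
```

The score ranges from 0 (no association) to 1 (tokens always co-occur), which makes it a convenient plug-in for f_link.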
This indicates that the constraints established by ITG space are beneficial, even before adding an outside parse. The three dependency tree-guided methods all have AERs of around 13.3. This is a 31% improvement over maximum matching. One should also note that, with the exception of the HD-ITG, recall goes up as smaller spaces are searched. In a one-to-one alignment, enhancing precision can also enhance recall, as every error of commission avoided presents two new opportunities to avoid an error of omission.</Paragraph> <Paragraph position="6"> The small gap between the beam search and maximum matching indicates that for this f_link, the beam search is a good approximation to complete enumeration of a space. This is important, as the only method we have available to search dependency space is also a beam search.</Paragraph> <Paragraph position="7"> The error rates for the three dependency-based methods are similar; no one method provides much more guidance than the others. Enforcing head constraints produces only a small improvement over the D-ITG. Assuming our beam search is approximating a complete search, these results also indicate that D-ITG space and dependency space have very similar properties with respect to alignment.</Paragraph> </Section> <Section position="4" start_page="150" end_page="151" type="sub_section"> <SectionTitle> 4.4 Oracle objective function </SectionTitle> <Paragraph position="0"> Any time we limit an alignment space, we risk ruling out correct alignments. We now test the expressiveness of an alignment space according to the best alignments that can be found there when given an oracle link score. This is similar to the experiments in Fox (2002), but instead of counting crossings, we count how many links a maximal alignment misses when confined to the space.</Paragraph> <Paragraph position="1"> We create a tailored f_link for each sentence pair, based on the gold standard alignment for that pair. 
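The oracle experiment amounts to scoring 1 for each gold link and 0 otherwise, then counting how many gold links the best alignment a space permits still misses. A toy illustration, with names and data of our own invention:

```python
# Toy sketch of the oracle evaluation: with an oracle link score of 1 for
# gold links and 0 otherwise, the best alignment a space allows is the one
# recovering the most gold links. We count the gold links it still misses.
def oracle_missed(gold_links, candidate_alignments):
    recovered = lambda a: sum(1 for link in a if link in gold_links)
    best = max(candidate_alignments, key=recovered)
    return len(gold_links) - recovered(best)

gold = {(0, 0), (1, 2), (2, 1)}
# A hypothetical space allowing only the monotone alignment cannot
# reproduce the (1, 2) / (2, 1) swap, so it misses two gold links:
monotone_only = [{(0, 0), (1, 1), (2, 2)}]
print(oracle_missed(gold, monotone_only))   # prints 2
```

An unconstrained space would enumerate every one-to-one alignment as a candidate and miss nothing here; the interesting quantity is how the count grows as the space shrinks.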
Gold standard links are broken up into two categories in Och and Ney's (2003) evaluation framework. S links are used when the annotators agree and are certain, while P links are meant to handle ambiguity. Since only S links are used to calculate recall, we define our f_link to mirror the S links in the gold standard:</Paragraph> <Paragraph position="2"> f_link(i, j) = 1 if (i, j) ∈ S, and 0 otherwise. </Paragraph> <Paragraph position="3"> Table 2 shows the results of maximizing summed f_link values in our various alignment spaces.</Paragraph> <Paragraph position="4"> The two imperfect permutation searches were left out, as they are simply approximating maximum matching. The precision column was left out, as it is trivially 100 in all cases. A new column has been added to count missed links.</Paragraph> <Paragraph position="5"> Maximum matching sets the upper bound for this task, with a recall of 96.4. It does not achieve perfect recall due to the one-to-one constraint.</Paragraph> <Paragraph position="6"> Note that its error rate is not a lower bound on the AER of a one-to-one aligner, as systems can score better by including P links.</Paragraph> <Paragraph position="7"> Of the constrained systems, ITG fares the best, showing only a tiny reduction in recall, due to 3 missed links throughout the entire test set. Considering the non-trivial amount of guidance provided by the ITG in Section 4.3, this small drop in expressiveness is quite impressive. For the most part, the ITG constraints appear to rule out only incorrect alignments.</Paragraph> <Paragraph position="8"> The D-ITG has the next highest recall, doing noticeably better than the two other dependency-based searches, but worse than the ITG. The 1.5% drop in expressiveness may or may not be worth the increased guidance shown in Section 4.3, depending on the task. It may be surprising to see D-ITG outperforming Dep, as the alignment space of Dep is larger than that of D-ITG. 
The heuristic nature of Dep's search means that its alignment space is only partially explored.</Paragraph> <Paragraph position="9"> The HD-ITG makes 26 fewer correct links than the D-ITG, each corresponding to a single missed link in a different sentence pair. These misses occur in cases where two modifiers switch position with respect to their head during translation. Surprisingly, there are regularly occurring, systematic constructs that violate the head constraints. An example of such a construct is when an English noun has both adjective and noun modifiers. Cases like &quot;Canadian Wheat Board&quot; are translated as &quot;Board Canadian of Wheat&quot;, switching the modifiers' relative positions. These switches correspond to discontinuous constituents (Melamed, 2003) in general bitext parsing. The D-ITG can handle discontinuities by freely grouping constituents to create continuity, but the HD-ITG, with its fixed head and modifiers, cannot. Given that the HD-ITG provides only slightly more guidance than the D-ITG, we recommend that this type of head information be included only as a soft constraint.</Paragraph> </Section> </Section> </Paper>