<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0306">
  <Title>Word Alignment Baselines</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Simple baselines provide insights into the value of scoring functions and give starting points for measuring the performance improvements of technological advances.</Paragraph>
    <Paragraph position="1"> This paper presents baseline unsupervised techniques for performing word alignment based on geometric and word edit distances as well as supervised fusion of the results of these techniques using the nearest neighbor rule.</Paragraph>
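As an illustrative sketch (not the paper's implementation), a word-edit-distance baseline might score each LHS/RHS token pair by normalized Levenshtein distance and mark pairs below a threshold as aligned; the threshold value here is an assumption chosen for illustration:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

def align_by_edit_distance(lhs, rhs, threshold=0.3):
    """Mark (i, j) as aligned when normalized edit distance is small.

    `threshold` is an illustrative choice, not a value from the paper.
    """
    pairs = []
    for i, s in enumerate(lhs):
        for j, t in enumerate(rhs):
            d = edit_distance(s, t) / max(len(s), len(t), 1)
            if d <= threshold:
                pairs.append((i, j))
    return pairs
```

Such string-similarity baselines exploit cognates and shared tokens (names, numbers, punctuation) across the two segments.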
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Alignment as binary classification
</SectionTitle>
    <Paragraph position="0"> One model for the task of aligning words in a left-hand-side (LHS) segment with those in a right-hand-side (RHS) segment is to consider each pair of tokens as a potential alignment and to build a binary classifier that discriminates between correctly and incorrectly aligned pairs. Any of the n source-language words may align with any of the m target-language words, yielding 2^(nm) possible alignment configurations. This approach allows well-understood binary classification tools to be applied to the problem. However, it assumes that the alignments are independent and identically distributed (IID). This assumption is false, but the same assumption is made by the alignment evaluation metrics. The approach also makes it difficult to incorporate knowledge of the adjacency of aligned pairs, and HMM approaches to word alignment show that this knowledge is important (Och and Ney, 2000).</Paragraph>
    <Paragraph position="1"> All of the techniques presented in this work approach the problem as a binary classification task.</Paragraph>
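A minimal sketch of this framing (illustrative, not the paper's code): each of the n x m token pairs becomes one classification instance, and since each pair is independently marked aligned or not, there are 2^(nm) possible alignment configurations:

```python
from itertools import product

def candidate_pairs(lhs_tokens, rhs_tokens):
    """Every (i, j) token pair is one binary-classification instance."""
    return list(product(range(len(lhs_tokens)), range(len(rhs_tokens))))

def num_alignment_configurations(n, m):
    """Each of the n*m pairs is independently aligned or not aligned."""
    return 2 ** (n * m)
```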
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Random baseline
</SectionTitle>
      <Paragraph position="0"> A randomized baseline was created that flips a coin to mark alignments. The bias of the coin is chosen to maximize the F-measure on the trial dataset, and the resulting performance gives insight into the inherent difficulty of the task. If the categorization task were balanced, with exactly half of the paired tokens marked as aligned, then the precision, recall, and F-measure of the coin with the best bias would all have been 50%. The preponderance of non-aligned tokens shifted the F-measure away from 50% to the 5-10% range, suggesting that only about 10% of the pairs were aligned. An aligner performing worse than this baseline could do better simply by inverting its predictions.</Paragraph>
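To make the bias tuning concrete, here is an illustrative simulation (not from the paper): for a coin with bias b on data where a fraction p of candidate pairs are truly aligned, expected precision is p and expected recall is b, giving F = 2pb/(p+b); for example, p = b = 0.5 yields F = 0.5, matching the balanced-task intuition above:

```python
import random

def coin_baseline_f1(gold, bias, seed=0):
    """F-measure of a coin marking each pair aligned with probability `bias`.

    `gold` is a list of 0/1 labels over candidate pairs.
    """
    rng = random.Random(seed)
    pred = [1 if rng.random() < bias else 0 for _ in gold]
    tp = sum(p and g for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision = tp / max(sum(pred), 1)
    recall = tp / max(sum(gold), 1)
    return 2 * precision * recall / (precision + recall)

def expected_f1(p, bias):
    """Closed form: expected precision -> p, expected recall -> bias."""
    return 2 * p * bias / (p + bias)
```

In practice one would sweep `bias` over a grid on the trial set and keep the value with the highest F-measure.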
    </Section>
  </Section>
</Paper>