<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0306">
  <Title>Word Alignment Baselines</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Unsupervised methods
</SectionTitle>
    <Paragraph position="0"> There are a number of alignment techniques that can be used to align texts when one lacks the benefit of a large aligned corpus. These unsupervised techniques take advantage of general knowledge of the language pair to be aligned. Their relative simplicity and speed allow them to be used in places where timeliness is of utmost importance, as well as to be quickly tuned on a small dataset.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Final punctuation
</SectionTitle>
      <Paragraph position="0"> Many LHS segments end in a punctuation mark that is aligned with the final punctuation of the corresponding RHS. A high precision aligner that marks only that alignment is useful for debugging the larger alignment system.</Paragraph>
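A high-precision final-punctuation aligner can be sketched as follows (the function name, token representation, and punctuation set are illustrative assumptions, not from the paper):

```python
# Illustrative sketch: emit a single alignment linking the final
# punctuation marks of the two segments, if both end in punctuation.
PUNCTUATION = {".", "!", "?", ";", ":", ","}

def align_final_punctuation(lhs_tokens, rhs_tokens):
    """Return [(i, j)] linking final punctuation marks, else []."""
    if not lhs_tokens or not rhs_tokens:
        return []
    i, j = len(lhs_tokens) - 1, len(rhs_tokens) - 1
    if lhs_tokens[i] in PUNCTUATION and rhs_tokens[j] in PUNCTUATION:
        return [(i, j)]
    return []
```

Because the rule fires only on the final token pair, it proposes very few alignments, which is what makes it useful as a high-precision debugging baseline.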
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Length ratios
</SectionTitle>
      <Paragraph position="0"> Short words such as stop words tend to align with short words, and long words such as names tend to align with long words. This weak hypothesis is worth pursuing because a similar hypothesis was useful for aligning sentences (Gale and Church, 1991; Brown et al., 1991). The observation can be codified as a distance between the word at position i on the LHS and the word at position j on the RHS</Paragraph>
      <Paragraph position="2"> where L(li) is the length of the token at position i on the LHS. Note that Dlen is similar to a normalized harmonic mean, ranging from 0 to 1.0, with the minimum achieved when the lengths are the same. A threshold on Dlen is used to turn this distance metric into a classification rule.</Paragraph>
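The equation itself is not reproduced above, but a Python sketch consistent with the stated properties (range from 0 to 1.0, minimum when the lengths are equal) might look like this; the exact formula and the threshold value are assumptions:

```python
def d_len(len_l, len_r):
    """Length-ratio distance between two tokens.

    One plausible form matching the stated properties: 0 when the
    lengths are equal, approaching 1.0 as they diverge.  The exact
    equation in the paper may differ.
    """
    return abs(len_l - len_r) / (len_l + len_r)

def aligned_by_length(word_l, word_r, threshold=0.5):
    # The threshold turns the distance metric into a classification rule.
    return d_len(len(word_l), len(word_r)) <= threshold
```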
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Edit distances
</SectionTitle>
      <Paragraph position="0"> The language pairs in the experiments were drawn from Western languages, filled with cognates and names. An obvious way to start finding cognates in languages that share character sets is by comparing the edit distance between words.</Paragraph>
      <Paragraph position="1"> Three word edit distances were investigated, and thresholds were tuned to turn them into classification rules. Dexact indicates an exact match with a zero distance and a mismatch with a value of 1.0. Dwedit is the minimum number of character edits (insertions, deletions, substitutions) required to transform one word into another, normalized by the lengths. It can be interpreted as an edit distance rate, in edits per character:</Paragraph>
      <Paragraph position="3"> Dlcedit is the same as Dwedit, except both arguments are lower-cased prior to the edit distance calculation.</Paragraph>
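The three edit distances can be sketched in Python. Levenshtein distance with insertions, deletions, and substitutions is standard; the normalizer for Dwedit ("by the lengths") is not fully specified above, so dividing by the longer length is an assumption here:

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def d_exact(a, b):
    return 0.0 if a == b else 1.0

def d_wedit(a, b):
    # Edits per character; normalizing by the longer length is one
    # plausible reading of "normalized by the lengths".
    return edit_distance(a, b) / max(len(a), len(b))

def d_lcedit(a, b):
    return d_wedit(a.lower(), b.lower())
```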
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Dotplot geometry
</SectionTitle>
      <Paragraph position="0"> Geometric approaches to bilingual alignment have been used with great success in both finding anchor points and aligning sentences (Fung and McKeown, 1994; Melamed, 1996). Three distance metrics were created to incorporate the knowledge that all of the aligned pairs use roughly the same word order. In every case, the distance of the pair of words from a diagonal in the dotplot was used.</Paragraph>
      <Paragraph position="1"> In the metrics below, the L1 norm distance from a point (i,j) to a line from (0,0) to (I,J) is</Paragraph>
      <Paragraph position="3"> The first metric, Dwdiag, is a normalized distance of the (i,j) pair of tokens to the diagonal on the word dotplot.</Paragraph>
      <Paragraph position="5"> where Lw(l) is the length of the LHS in words.</Paragraph>
      <Paragraph position="6"> The next two distances are character based, comparing the box containing aligned characters from the words at position (i,j) with the diagonal line on the character dotplot. Let Lc(li) be the number of characters preceding the ith word in the LHS.</Paragraph>
      <Paragraph position="7"> Let the left edge of the box be bl = Lc(li), the right edge of the box be br = Lc(li+1), the bottom edge of the box be bb = Lc(rj), and the top edge of the box be bt = Lc(rj+1). The center of the box formed by the words at (i,j) is the midpoint of these edges. One character metric is the distance from the center of the character box to the diagonal line of the character dotplot, where Lc(l) is the character length of the entire LHS segment.</Paragraph>
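The box construction follows directly from the definitions above and can be sketched as (function names are illustrative):

```python
def char_offsets(tokens):
    """Lc: number of characters preceding each word position."""
    offsets = [0]
    for t in tokens:
        offsets.append(offsets[-1] + len(t))
    return offsets

def box_center(lhs_tokens, rhs_tokens, i, j):
    """Center of the character box for the word pair (i, j)."""
    lc, rc = char_offsets(lhs_tokens), char_offsets(rhs_tokens)
    bl, br = lc[i], lc[i + 1]   # left and right edges
    bb, bt = rc[j], rc[j + 1]   # bottom and top edges
    return ((bl + br) / 2, (bb + bt) / 2)
```

The distance from this center to the character-dotplot diagonal then plays the same role as Dwdiag, but at character rather than word granularity.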
      <Paragraph position="9"> The distance of the box to the diagonal line is the second character metric.</Paragraph>
      <Paragraph position="11"/>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Data-driven and supervised methods
</SectionTitle>
    <Paragraph position="0"> The distance metrics and associated classifiers described above were all optimized on the trial data, but they required optimization of at most one parameter, a threshold on the distance. Four metrics were investigated that used the larger dataset to estimate larger models, with parameters for every pair of collocated words in the training dataset.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Likelihoods
</SectionTitle>
      <Paragraph position="0"> Three likelihood-based distance metrics were investigated, and the first is the relative likelihood of the aligned pairs of words. c(li,LHS) is the number of times the word li was seen in the LHS of the aligned corpus.</Paragraph>
      <Paragraph position="2"> The next two are conditional probabilities of seeing one of the words given that the other word from the pair was seen in an aligned sentence. Here RHSx means the right-hand-side of aligned pair number x in the parallel corpus.</Paragraph>
      <Paragraph position="4"> Note that neither of these is satisfactory as a probabilistic lexicon because they give stop words such as determiners high probability for every conditioning token.</Paragraph>
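The conditional probabilities can be estimated from co-occurrence counts over the aligned corpus. A sketch, where the counting scheme (counting each word once per segment) is an assumption:

```python
from collections import Counter

def cooccurrence_counts(pairs):
    """pairs: list of (lhs_tokens, rhs_tokens) aligned segment pairs."""
    c_l, c_r, c_lr = Counter(), Counter(), Counter()
    for lhs, rhs in pairs:
        for wl in set(lhs):
            c_l[wl] += 1
        for wr in set(rhs):
            c_r[wr] += 1
        for wl in set(lhs):
            for wr in set(rhs):
                c_lr[(wl, wr)] += 1
    return c_l, c_r, c_lr

def p_l_given_r(wl, wr, c_r, c_lr):
    # Probability of seeing wl in a segment whose aligned RHS contains wr.
    return c_lr[(wl, wr)] / c_r[wr] if c_r[wr] else 0.0
```

The stop-word problem noted above is visible here: a determiner that appears in nearly every segment co-occurs with nearly every conditioning token, so its conditional probability is high regardless of the token it is conditioned on.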
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Bag-of-segments distance
</SectionTitle>
      <Paragraph position="0"> The final data-driven measure that was investigated considers the bag of segments (bos) in which the words appear. The result of the calculation is the Tanimoto distance between the bag of segments that word li appears in and the bag of segments that word rj appears in.</Paragraph>
      <Paragraph position="2"/>
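Treating each word's occurrences as a multiset of segment indices, the Tanimoto distance is one minus the ratio of multiset intersection to multiset union. A sketch (the segment representation is an assumption):

```python
from collections import Counter

def bag_of_segments(word, segments):
    """Multiset of segment indices in which `word` occurs."""
    return Counter(idx for idx, seg in enumerate(segments) if word in seg)

def tanimoto_distance(bag_a, bag_b):
    # Multiset intersection (&) takes the min of counts, union (|) the max.
    inter = sum((bag_a & bag_b).values())
    union = sum((bag_a | bag_b).values())
    return 1.0 - inter / union if union else 1.0
```

Words that appear in exactly the same segments get distance 0; words with disjoint segment sets get distance 1.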
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Nearest neighbor rule
</SectionTitle>
    <Paragraph position="0"> The nearest neighbor rule is a well-known classification algorithm whose error rate provably converges to at most twice the Bayes Error Rate of a classification task as dataset size grows (Duda et al., 2001). The distance metrics described above were used to train a nearest neighbor rule classifier, each metric providing distance in one dimension. To provide comparability of distances in the different dimensions, the distribution of points in each dimension was normalized to have zero mean and unit variance (μ = 0, σ = 1). The L2 norm, Euclidean distance, was used to compute distance between points.</Paragraph>
    <Paragraph position="1"> Two versions of the nearest neighbor rule were explored. In the first, the binary decisions of the classifiers were used as features, and in the second the distances provided by the classifiers were used as features.</Paragraph>
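The normalization and 1-NN classification steps can be sketched as follows (the paper does not specify k or tie-breaking; this assumes k = 1):

```python
import math

def z_normalize(columns):
    """Normalize each feature column to zero mean, unit variance."""
    normed = []
    for col in columns:
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        std = math.sqrt(var) or 1.0  # guard against zero-variance columns
        normed.append([(x - mean) / std for x in col])
    return normed

def nearest_neighbor_label(train_points, train_labels, query):
    """1-NN classification under the Euclidean (L2) norm."""
    best = min(range(len(train_points)),
               key=lambda k: math.dist(train_points[k], query))
    return train_labels[best]
```

The two variants described above differ only in what fills the feature vectors: binary classifier decisions in the first, raw distances in the second.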
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"> Two datasets of different language pairs were used to evaluate these measures: Romanian-English and English-French. The measures were optimized on a trial dataset and then evaluated blind on a test set. The Romanian-English trial data was 17 sentences long and the English-French trial dataset was 37 sentences. Additionally, approximately 1.1 million aligned English-French sentences and 48,000 Romanian-English sentences were used for the set of supervised experiments.</Paragraph>
    <Paragraph position="1"> Four measures were used to evaluate the classifiers: precision, recall, F-measure, and alignment error rate (AER). Precision and recall are the ratios of matching aligned pairs to the number of predicted pairs and the number of reference pairs, respectively. F-measure is the harmonic mean of precision and recall. AER differentiates between &quot;sure&quot; and &quot;possible&quot; aligned pairs in the reference, requiring hypotheses to match those that are &quot;sure&quot; and permitting them to match those that are &quot;possible&quot; (Och and Ney, 2000).</Paragraph>
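These measures can be computed over sets of (i, j) alignment pairs. The AER formula follows Och and Ney (2000), under the standard convention that the possible set includes the sure set:

```python
def precision_recall_f(predicted, reference):
    """Precision, recall, and their harmonic mean (F-measure)."""
    a, r = set(predicted), set(reference)
    match = len(a & r)
    precision = match / len(a) if a else 0.0
    recall = match / len(r) if r else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|)."""
    a, s, p = set(predicted), set(sure), set(possible)
    return 1 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```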
  </Section>
</Paper>