<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1031">
  <Title>Word to word alignment strategies</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Alignment strategies
</SectionTitle>
    <Paragraph position="0"> A clue matrix summarizes information from various sources that can be used for the identification of translation relations. However, there is no obvious way to utilize this information for word alignment as we explicitly include multi-word units (MWUs) in our approach. The clue matrix in figure 2 has been obtained for a bi-text segment from our English-Swedish test corpus (the Bellow corpus) using a set of weighted declarative and estimated clues.</Paragraph>
    <Paragraph position="1"> There are many ways of &amp;quot;clustering&amp;quot; words together and there is no obvious maximization procedure for finding the alignment optimum when MWUs are involved. The alignment proingen visar s&amp;quot;arskilt mycket t@alamod  cedure depends very much on the definition of an optimal alignment. The best alignment for our example would probably be the set of the following links: links = braceleftBigg no one ingen is patient visar t@alamod very s&amp;quot;arskilt mycket bracerightBigg A typical procedure for automatic word alignment is to start with one-to-one word links. Links that have common source or target language words are called overlapping links. Sets of overlapping links, which do not overlap with any other link outside the set, are called link clusters (LC). Aligning words one by one often produces overlaps and in this way implicitly creates aligned multi-word-units as part of link clusters. A general word-to-word alignment L for a given bitext segment with N source language words (s1s2...sN) and M target language words (t1t2...tM) can be formally described as a set of links L = {L1,L2,...,Lx} with Lx = [sx1,tx2],x1 [?] {1..N},x2 [?] {1..M}. This general definition allows varying numbers of links (0 [?] x [?] N [?] M) within possible alignments L. It is not straightforward how to find the optimal alignment as L may include different numbers of links.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Directional alignment models
</SectionTitle>
      <Paragraph position="0"> One word-to-word alignment approach is to assume a directional word alignment model similar to the models in statistical machine translation. The directional alignment model assumes that there is at most one link for each source language word. Using alignment clues, this can be expressed as the following optimization problem: ^LD = argmaxLD producttextNn=1 C(LDn ) where LD = {LD1 ,LD2 ,..,LDN} is a set of links</Paragraph>
      <Paragraph position="2"> is the combined clue value for the linked items sn and taDn . In other words, word alignment is the search for the best link for each source language word. Directional models do not allow multiple links from one item to several target items. However, target items can be linked to multiple source language words as they can be aligned to the same target language word.</Paragraph>
      <Paragraph position="3"> The direction of alignment can easily be reversed, which leads to the inverse directional  alignment: ^LI = argmaxLI producttextMm=1 C(LIm) with links LIm = bracketleftBig saIm,tm bracketrightBig and aIm [?] {1..N}. In the  inverse directional alignment, source language words can be linked to multiple words but not the other way around. The following figure illustrates directional alignment models applied to the example in figure 2:</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Combined directional alignment
</SectionTitle>
      <Paragraph position="0"> Directional link sets can be combined in several ways. The union of link sets (^L[?] = ^LD [?] ^LI) usually causes many overlaps and, hence, very large link clusters. On the other hand, an intersection of link sets (^L[?] = ^LD [?] ^LI) removes all overlaps and leaves only highly confident one-to-one word links behind. Using the same example from above we obtain the following alignments:  The union and the intersection of links do not produce satisfactory results as seen in the example. Another alignment strategy is a refined combination of link sets (^LR = {^LD [?] ^LI} [?] {LR1 ,...,LRr }) as suggested by (Och and Ney, 2000b). In this approach, the intersection of links is iteratively extended by additional links LRr which pass one of the following two constraints: null  the new link is either vertically or horizontally adjacent to an existing link and the new link does not cause any link to be adjacent to other links in both dimensions (horizontally and vertically).</Paragraph>
      <Paragraph position="1"> Applying this approach to the example, we get:</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Competitive linking
</SectionTitle>
      <Paragraph position="0"> Another alignment approach is the competitive linking approach proposed by Melamed (Melamed, 1996). In this approach, one assumes that there are only one-to-one word links. The alignment is done in a greedy &amp;quot;best-first&amp;quot; search manner where links with the highest association scores are aligned first, and the aligned items are then immediately removed from the search space. This process is repeated until no more links can be found. In this way, the optimal alignment (^LC) for non-overlapping one-to-one links is found. The number of possible links in an alignment is reduced to min(N,M). Using competitive linking with our example we yield:</Paragraph>
      <Paragraph position="2"/>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Constrained best-first alignment
</SectionTitle>
      <Paragraph position="0"> Another iterative alignment approach has been proposed in (Tiedemann, 2003). In this approach, the link LBx = [sx1,tx2] with the highest score in the clue matrix ^C(sx1,tx2) = maxsi,tj(C(si,tj)) is added to the set of link clusters if it fulfills certain constraints. The top score is removed from the matrix (i.e. set to zero) and the link search is repeated until no more links can be found. This is basically a constrained best-first search. Several constraints are possible. In (Tiedemann, 2003) an adjacency check is suggested, i.e. overlapping links are accepted only if they are adjacent to other links in one and only one existing link cluster.</Paragraph>
      <Paragraph position="1"> Non-overlapping links are always accepted (i.e.</Paragraph>
      <Paragraph position="2"> a non-overlapping link creates a new link cluster). Other possible constraints are clue value thresholds, thresholds for clue score differences between adjacent links, or syntactic constraints (e.g. that link clusters may not cross phrase boundaries). Using a best-first search strategy with the adjacency constraint we obtain the following alignment:</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Summary
</SectionTitle>
      <Paragraph position="0"> None of the alignment approaches described above produces the preferred reference alignment in our example using the given clue matrix. However, simple iterative procedures come very close to the reference and produce acceptable alignments even for multi-word units, which is promising for an automatic clue alignment system. Directional alignment models depend very much on the relation between the source and the target language. One direction usually works better than the other, e.g.</Paragraph>
      <Paragraph position="1"> an alignment from English to Swedish is better than Swedish to English because in English terms and concepts are often split into several words whereas Swedish tends to contain many compositional compounds. Symmetric approaches to word alignment are certainly more appropriate for general alignment systems than directional ones.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="65" type="metho">
    <SectionTitle>
4 Evaluation methodology
</SectionTitle>
    <Paragraph position="0"> Word alignment quality is usually measured in terms of precision and recall. Often, previously created gold standards are used as reference data in order to simplify automatic tests of alignment attempts. Gold standards can be re-used for additional test runs which is important when examining different parameter settings. However, recall and precision derived from information retrieval have to be adjusted for the task of word alignment. The main difficulty with these measures in connection with word alignment arises with links between MWUs that cause partially correct alignments. It is not straightforward how to judge such links in order to compute precision and recall. In order to account for partiality we use a slightly modified version of the partiality score Q proposed in (Ahrenberg et al., 2000)2:</Paragraph>
    <Paragraph position="2"> The set of algxsrc includes all source language words of all proposed links if at least one of them is partially correct with respect to the reference link x from the gold standard. Similarly, algxtrg refers to all the proposed target language words. corrxsrc and corrxtrg refer to the sets of source and target language words in link x of the gold standard. Using the partiality value Q, we can define the recall and precision metrics as follows:</Paragraph>
    <Paragraph position="4"> A balanced F-score can be used to combine both, precision and recall:</Paragraph>
    <Paragraph position="6"> giza+pp. Alignment strategies: directional (LD), inverse directional (LI), union (L[?]), intersection (L[?]), refined (LR), competitive linking (LC), and constrained best-first (LB).</Paragraph>
    <Paragraph position="7"> Alternative measures for the evaluation of one-to-one word links have been proposed in (Och and Ney, 2000a; Och and Ney, 2003).</Paragraph>
    <Paragraph position="8"> However, these measures require completely aligned bitext segments as reference data. Our gold standards include random samples from the corpus instead (Ahrenberg et al., 2000).</Paragraph>
    <Paragraph position="9"> Furthermore, we do not split MWU links as proposed by (Och and Ney, 2000a). Therefore, the measures proposed above are a natural choice for our evaluations.</Paragraph>
  </Section>
class="xml-element"></Paper>