<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1004">
  <Title>A Simple Hybrid Aligner for Generating Lexical Correspondences in Parallel Texts</Title>
  <Section position="2" start_page="0" end_page="29" type="metho">
    <SectionTitle>
1. The Problem
</SectionTitle>
    <Paragraph position="0"> The problem we consider is how to find word and phrase alignments for a bitext that is already aligned at the sentence level. Results should be delivered in a form that could easily be checked and corrected by a human user.</Paragraph>
    <Paragraph position="1"> Although we primarily use the system for  bitexts with an English and a Scandinavian half, the system should preferably be useful for many different language pairs. Thus we don~ rely on the existence of POS-taggers or lemmatizers for the languages involved, but wish to provide mechanisms that a user can easily adapt to new languages.</Paragraph>
    <Paragraph position="2"> The organisation of the paper is as follows: In section 2 we relate this approach to previous work, in section 3 we motivate and spell out our assumptions about the behaviour of lexical units in translation, in section 4 we present the basic features of the algorithm, and in section 5 we present results from an evaluation and try to compare these to the results of others.</Paragraph>
  </Section>
  <Section position="3" start_page="29" end_page="29" type="metho">
    <SectionTitle>
2. Previous work
</SectionTitle>
    <Paragraph position="0"> Most algorithms for bilingual word alignment to date have been based on the probabilistic translation models first proposed by Brown et al.</Paragraph>
    <Paragraph position="1"> (1988, 1990), especially Model I and Model 2.</Paragraph>
    <Paragraph position="2"> These models explicitly exclude multi-word units from consideration 1. Melamed (1997b), however, proposes a method for the recognition of multi-word compounds in bitexts that is based on the predictive value of a translation model. A trial translation model that treat certain multi-word sequences as units is compared with a base translation model that treats the same sequences as multiple single-word units.</Paragraph>
    <Paragraph position="3"> A drawback with Melamed's method is that compounds are defined relative to a given translation and not with respect to language-internal criteria. Thus, if the method is used to construct a bilingual concordance, there is a risk that compounds and idioms that translate compositionally will not be found. Moreover, it is computationally expensive and, since it constructs compounds incrementally, adding one word at a time, requires many iterations and much processing to find linguistic units of the proper size.</Paragraph>
    <Paragraph position="4"> Kitamura and Matsumoto (1996) present results from aligning multi-word and single word expressions with a recall of 80 per cent if partially correct translations were included. Their method is iterative and is based on the use of the Dice coefficient. Smadja et. al (1996) also use the Dice Model 3-5 includes multi-word units in one direction. coefficient as their basis for aligning collocations between English and French. Their evaluation show results of 73 per cent accuracy (precision) on average.</Paragraph>
  </Section>
  <Section position="4" start_page="29" end_page="31" type="metho">
    <SectionTitle>
3. Underlying assumptions
</SectionTitle>
    <Paragraph position="0"> As Fung and Church (1994) we wish to estimate the bilingual lexicon directly. Unlike Fung and Church our texts are already aligned at sentence level and the lexicon is viewed, not merely as word associations, but as associations between lexical units of the two languages.</Paragraph>
    <Paragraph position="1"> We assume that texts have structure at many different levels. At the most concrete level a text is simply a sequence of characters. At the next level a text is a sequence of word tokens, where word tokens are defined as sequences of alphanumeric character strings that are separated from one another by a finite set of delimiters such as spaces and punctuation marks. While many characters can be used either as word delimiters or as nondelimiters, we prefer to uphold a consistent difference between delimiters and non-delimiters, for the ease of implementation that it allows. At the same time, however, the tokenizer recognizes common abbreviations with internal punctuation marks and regularizes clitics to words (e.g. can't is regularized to can not).</Paragraph>
    <Paragraph position="2"> At the next level up a text can be viewed as a partially ordered bag of lexical units. It is a bag because the same unit can occur several times in a single sentence. It is partially ordered because a lexical unit may extend across other lexical units, as in He turned the offer down.</Paragraph>
    <Paragraph position="3"> Tabs were kept on him.</Paragraph>
    <Paragraph position="4"> We say that words express lexical units, and that units are expressed by words. A unit may be expressed by a multi-word sequence, while a given word can express at most one lexical unit. 2 It is often hard to tell the difference between a lexical unit and a lexical complex. We assume that 2 This latter assumption is actually too strict for Germanic languages where morphological compounding is a productive process, but we make it nevertheless, as we have no means too identify compounds reliably. Moreover, the borderline between a lexicalized compound and a compositional compound is hard to draw consistently, anyway.</Paragraph>
    <Paragraph position="5">  recurrent collocations that pass certain structural and contextual tests are candidate expressions for lexical units. If such collocations are found to correspond to something in the other half of the bitext on the basis of co-occurrence measures, they are regarded as expressions of lexical units. This will include compound names such as New York&amp;quot; ~enry Kissinger' and ~World War II&amp;quot; and compound terms such as 'network server directory'. Thus, as with the compositional compounds just discussed, we prefer high recall to high precision in identifying multi-word units.</Paragraph>
    <Paragraph position="6"> The expressions of a lexical unit form an equivalence class. An equivalence class for a single-word unit includes its morphological variants. An equivalence class for a multi-word unit should include syntactic variants as well. For instance, the lexical unit turn down should include p ~urned down' ~urning down' as well as expressions where the particle is separated from the verb by some appropriate phrase, as in the example above. The current system, though, does not provide for syntactic variants.</Paragraph>
    <Paragraph position="7"> Our aim is to establish relations not only between corresponding words and word sequences in the bitext, but also between corresponding lexical units. A problem is then that the algorithm cannot recognize lexical units directly, but only their expressions. It helps to include lexical units in the underlying model, however, as they have explanatory value. Moreover, the algorithm can be made to deliver its output in the form of correspondences between equivalence classes of expressions belonging to the same lexical unit.</Paragraph>
    <Paragraph position="8"> For the purpose of generating the alignment and the dictionary we divide the lexical units into three classes:  1. irrelevant units, 2. closed class units, 3. open class units  The same categories apply to expressions.</Paragraph>
    <Paragraph position="9"> Irrelevant units are simply those that we don~t want to include. They have to be listed explicitly. The reason for not including some items may vary with the purpose of alignment. Even if we wish the alignment to be as complete as possible, it might be useful to exclude certain units that we suspect may confuse the algorithm. For instance, the do-support found in English usually has no counterpart in other languages. Thus, the different forms of 'do' may be excluded from consideration from the start.</Paragraph>
    <Paragraph position="10"> As for the translation relation we make the following assumptions: 1. A lexical unit in one half of the bitext corresponds to at most one lexical unit in the other half. This can be seen as a generalization of the one-to-one assumption for word-to-word translation used by Melamed (1997a) and is exploited for the same purpose, i.e. to exclude large numbers of candidate alignments, when good initial alignments have been found.</Paragraph>
    <Paragraph position="11">  2. Open class and closed class lexical units are usually translated and there are a limited number of lexical units in the other language that are commonly used to translate them. While deliberately vague this assumption is what  motivates our search for frequent pairs &lt;source expression, target expression&gt; with high mutual information. It also motivates our choice of regarding additions and deletions of lexical units in translation as haphazard apart from the case of a restricted set of irrelevant units that we assume can  be known in advance.</Paragraph>
    <Paragraph position="12"> 3. Open class units can only be aligned with open class units, and closed class units can only be  aligned with closed class units. This assumption seems generally correct and has the effect of reducing the number of candidate alignments significantly. Closed class units have to be listed explicitly. The assumption is that we know the two languages sufficiently well to be able to come up with an appropriate list of closed class units and expressions. Multi-word closed class units are listed separately. Closed class units can be further classified for the purposes of alignment (see below).</Paragraph>
    <Paragraph position="13"> 4. If some expression for the lexical unit Us is found corresponding to some expression for the lexical unit UT, then assume that any expression of Us may correspond to any expression of UT. This assumption is in accordance with the often made observation that morphological properties are not invariants in translation. It is used to make the algorithm more greedy by accepting infrequent alignments that are morphological variants of high-rating ones.</Paragraph>
    <Paragraph position="14"> 5. If one half of an aligned sentence pair is the expression of a single lexical unit, then assume that the other half is also. This is definitely a heuristic, but it has been shown to be very useful  for technical texts involving English and Scandinavian, where terms are often found in lists or table cells (Tiedemann 1997). This heuristic is useful for finding alignments regardless of frequencies.</Paragraph>
    <Paragraph position="15"> Similarly, if there is only one non-aligned (relevant open class) word left in a partially aligned sentence, assume that it corresponds to the remaining (relevant open class) words of the corresponding sentence.</Paragraph>
    <Paragraph position="16"> 6. Position matters, i.e. while word order is not an invariant of translation it is not random either. We implement the contribution of position as a distribution of weights over the candidate pairs of expressions drawn from a given pair of sentences. Expressions that are close in relative position receive higher weights, while expressions that are far from each other receive lower weights.</Paragraph>
  </Section>
  <Section position="5" start_page="31" end_page="33" type="metho">
    <SectionTitle>
4. The Approach
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
4.1 Input
</SectionTitle>
      <Paragraph position="0"> A bitext aligned at the sentence level.</Paragraph>
    </Section>
    <Section position="2" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
4.2 Output
</SectionTitle>
      <Paragraph position="0"> There are two types of output data: (i) a table of link types in the form of a bilingual dictionary where each entry has the form &lt;&lt;sf .... t&amp;quot;&gt;, s being the source expression type and t I .... t n the target expression types that were found to correspond to s; and (ii) a table of link instances &lt;&lt;s,t&gt;&lt;i,j&gt;&gt; sorted by sentence pairs, where s is some expression from the source text, t is an expression from the translated text, and i and j are the (withinsentence) positions of the first word of s and t, respectively.</Paragraph>
    </Section>
    <Section position="3" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
4.3 Preprocessing
</SectionTitle>
      <Paragraph position="0"> Both halves of the bitext are regularized.</Paragraph>
      <Paragraph position="1"> When open class multi-word units are to be included, they are generated in a preprocessing stage for both the source and target texts and assembled in a table. For this purpose, we use the phrase extracting program described in Merkel et al. (1994).</Paragraph>
    </Section>
    <Section position="4" start_page="31" end_page="32" type="sub_section">
      <SectionTitle>
4.4 Basic operation
</SectionTitle>
      <Paragraph position="0"> The basic algorithm combines the K-vec approach, described by Fung and Church (1993), with the greedy word-to-word algorithm of Melamed (1997a). In addition, open class expressions are handled separately from closed class expressions, and sentences consisting of a single expression are handled in the manner of Tiedemann (1997).</Paragraph>
      <Paragraph position="1"> The algorithm is iterative, repeating the same process of generating translation pairs from the bitext, and then reducing the bitext by removing the pairs that have been found before the next iteration starts. The algorithm will stop when no more pairs can be generated, or when a given number of iterations have been completed.</Paragraph>
      <Paragraph position="2"> In each iteration, the following operations are performed: (i) For each open class expression in the source half of the bitext (with frequency higher than 3), the open class expressions in corresponding sentences of the other half are ranked according to their likelihood as translations of the given source expression.</Paragraph>
      <Paragraph position="3"> We estimate the probability that a candidate target expression is a translation by counting co-occurrences of the expressions within sentence pairs and overall occurrences in the bitext as a whole. Then the t-score, used by Fung and Church, is calculated, and the candidates are ranked on the basis of this value: In our case K is the number of sentence pairs in prob(V~,Vt) - prob(V~) prob(V,) t-the bitext. The target expression giving the highest t-score is selected as a translation provided the following two conditions are met: (a) this t-score is higher than a given threshold, and (b) the overall frequency of the pair is sufficiently high. (These are the same conditions that are used by Fung and Church.) This operation yields a list of translation pairs involving open class expressions.</Paragraph>
      <Paragraph position="4"> (ii) The same as in (i) but this time with the closed class expressions. A difference from the previous stage is that only target candidates of the proper sub-category or sub-categories for the source expression are considered. Conjunctions and personal pronouns are for example specified for both the target and the source languages. This strategy helps to limit the search space when closed-class expressions are linked.</Paragraph>
      <Paragraph position="5">  (iii) Open class expressions that constitute a sentence on their own (not counting irrelevant word tokens) generate translation pairs with the open class expressions of the corresponding sentence.</Paragraph>
      <Paragraph position="6"> (iv) When all (relevant) source expressions have been tried in this manner, a number of translation pairs have been obtained that are entered in the output table and then removed from the bitext.</Paragraph>
      <Paragraph position="7"> This will affect t-scores by reducing mariginal frequencies and will also cause fewer candidate pairs to be considered in the sequel. The reduced bitext is input for the next iteration.</Paragraph>
    </Section>
    <Section position="5" start_page="32" end_page="33" type="sub_section">
      <SectionTitle>
4.5 Variants
</SectionTitle>
      <Paragraph position="0"> The basic algorithm is enhanced by a number of modules that can be combined freely by the user.</Paragraph>
      <Paragraph position="1"> These modules are
* a morphological module that groups expressions that are identical modulo specified sets of suffixes;
* a weight module that affects the likelihood of a candidate translation according to its position in the sentence;
* a phrase module that includes multi-word expressions generated in the pre-processing stage as candidate expressions for alignment.
4.5.1 The morphological module
The morphological module collects open class translation pairs that are similar to the ones found by the basic algorithm. More precisely, if the pair (X, Y) has been generated as a translation pair in some iteration, other candidate pairs with X as the first element are searched. A pair (X, Z) is considered to be a translation pair iff there exist strings W, F and G such that</Paragraph>
      <Paragraph position="2"> Y = WF and Z = WG, and F and G have been defined as different suffixes of the same paradigm.</Paragraph>
      <Paragraph position="4"> The data needed for this module consists of simple suffix lists for regular paradigms of the languages involved. For example, [0, s, ed, ing] is a suffix list for regular English verbs. They have to be defined by the user in advance.</Paragraph>
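A sketch of the suffix-paradigm test from 4.5.1, assuming user-defined suffix lists of the kind just described; the paradigms shown are only examples:

```python
# Illustrative paradigms; '' stands for the zero suffix (written 0 in the paper).
PARADIGMS = [["", "s", "ed", "ing"]]       # regular English verbs

def same_paradigm_variants(y: str, z: str) -> bool:
    """True if y and z share a stem W and differ only by suffixes F, G of one paradigm."""
    for paradigm in PARADIGMS:
        for f in paradigm:
            for g in paradigm:
                if f == g:
                    continue
                if y.endswith(f) and z.endswith(g):
                    stem_y = y[:len(y) - len(f)]           # strip suffix F
                    stem_z = z[:len(z) - len(g)]           # strip suffix G
                    if stem_y and stem_y == stem_z:        # non-empty shared stem W
                        return True
    return False

# If (X, 'align') has been accepted as a translation pair, (X, 'aligned') also passes:
print(same_paradigm_variants("align", "aligned"))          # True
```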
      <Paragraph position="5"> When the morphological module is used, it is possible to reverse the direction of the linking process at a certain stage. After each iteration of linking expressions from source to target, the different inflectional variants of the target word are used as input data and these candidates are then linked from target to source. This strategy makes it possible to link low-frequency source expressions belonging to the same suffix paradigm.</Paragraph>
      <Paragraph position="6"> 4.5.2 The weight module
The weight module distributes weights over the target expressions depending on their position relative to the given source expression. The weights must be provided by the user in the form of lists of numbers (greater than or equal to 0). The weight for a pair is calculated as the sum of the weights for the instances of that pair. This weight is then used to adjust the co-occurrence probabilities, by using the weight instead of the co-occurrence frequency as input to the t-score formula. The threshold used is adjusted accordingly; in the current configuration of weights, the threshold is increased by 1. In the weight module it is possible to specify the maximal distance between a source and a target expression, measured as their relative position in the sentences.</Paragraph>
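A sketch of how the position weights could enter the calculation; the weight list, the maximal distance, and the indexing by absolute position difference are illustrative assumptions about the user-defined configuration:

```python
# Illustrative user-supplied weights, indexed by |i - j| (relative position difference).
POSITION_WEIGHTS = [1.0, 0.8, 0.6, 0.4, 0.2]   # expressions further apart weigh less
MAX_DISTANCE = 4                               # candidates beyond this are ignored

def pair_weight(instances: list[tuple[int, int]]) -> float:
    """Sum the position weights over all instances (i, j) of one candidate pair."""
    total = 0.0
    for i, j in instances:
        d = abs(i - j)
        if d <= MAX_DISTANCE:
            total += POSITION_WEIGHTS[d]
    return total

# This weight then replaces the raw co-occurrence count in the t-score formula,
# and the acceptance threshold is raised accordingly (by 1 in the current setup).
```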
      <Paragraph position="7"> 4.5.3 The phrase module
When the phrase module is invoked, multi-word expressions are also considered as potential elements of translation pairs. The multi-word expressions to be considered are generated in a special pre-processing phase and stored in a phrase table.</Paragraph>
      <Paragraph position="8"> T-scores for candidate translation pairs involving multi-word expressions are calculated in the same way as for single words. When weights are used the weight of a multi-word expression is considered equal to that of its first word.</Paragraph>
      <Paragraph position="9"> It can happen that the t-scores for two pairs &lt;s,t1&gt; and &lt;s,t2&gt;, where t1 is a multi-word expression and t2 is a word that is part of t1, are identical or almost identical. In this case we prefer the multi-word expression over the single-word candidate if it has a t-value over the threshold and is one of the top six target candidates. When a multi-word expression is found to be an element of a translation pair, the expressions that overlap with it, whether multi-word or single-word expressions, are removed from the current agenda and not considered until the next iteration.</Paragraph>
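A sketch of the preference rule just described; the near-identity tolerance and the way multi-word expressions are detected are our assumptions:

```python
def prefer_multiword(candidates, threshold, tolerance=0.05):
    """candidates: (t_score, expression) pairs sorted best-first for one source expression.

    If a multi-word candidate among the top six scores (almost) the same as the
    best-ranked single word that is part of it, prefer the multi-word expression,
    provided its t-score exceeds the threshold."""
    best_score, best = candidates[0]
    for score, expr in candidates[:6]:
        if (" " in expr and score > threshold
                and abs(score - best_score) <= tolerance
                and best in expr.split()):
            return expr
    return best
```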
    </Section>
  </Section>
class="xml-element"></Paper>