<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1052">
  <Title>Using Similarity Scoring To Improve the Bilingual Dictionary for Word Alignment</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Competitive Linking in our work
</SectionTitle>
    <Paragraph position="0"> We implemented the basic Competitive Linking algorithm as described above. For each pair of parallel sentences, we construct a ranked list of possible links: each word in the source language is paired with each word in the target language. Then for each word pair the score is looked up in the dictionary, and the pairs are ranked from highest to lowest score. If a word pair does not appear in the dictionary, it is not ranked. The algorithm then recursively links the word pair with the highest cooccurrence, then the next one, etc. In our implementation, linking is performed on a sentence basis, i.e. the list of possible links is constructed only for one sentence pair at a time.</Paragraph>
    <Paragraph position="1"> Our version allows for more than one link per word, i.e. we do not assume one-to-one or zero-to-one alignments between words. Furthermore, our implementation contains a threshold that specifies how high the cooccurrence score must be for the two words in order for this pair to be considered for a link.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The baseline dictionary
</SectionTitle>
    <Paragraph position="0"> In our experiments, we used a baseline dictionary, rebuilt the dictionary with our approach, and compared the performance of the alignment algorithm between the baseline and the rebuilt dictionary. The dictionary that was used as a baseline and as a basis for rebuilding is derived from bilingual sentence-aligned text using a count-and-filter algorithm: a0 Count: for each source word type, count the number of times each target word type cooccurs in the same sentence pair, as well as the total number of occurrences of each source and target type.</Paragraph>
    <Paragraph position="1"> a0 Filter: after counting all cooccurrences, retain only those word pairs whose cooccurrence probability is above a defined threshold. To be retained, a word pair a1a3a2 ,a1a5a4 must satisfy</Paragraph>
    <Paragraph position="3"> a1a16a2a42a21a23a1a5a4a43a19 is the number of times the two words cooccurred.</Paragraph>
    <Paragraph position="4"> By making the threshold vary with frequency, one can control the tendency for infrequent words to be included in the dictionary as a result of chance collocations. The 50% cooccurrence probability of a pair of words with frequency 2 and a single co-occurrence is probably due to chance, while a 10% cooccurrence probability of words with frequency 5000 is most likely the result of the two words being translations of each other. In our experiments, we varied the threshold from 0.005 to 0.01 and 0.02.</Paragraph>
    <Paragraph position="5"> It should be noted that there are many possible algorithms that could be used to derive the baseline dictionary, e.g. a44a46a45 , pointwise mutual information, etc. An overview of such approaches can be found in (Kilgarriff, 1996). In our work, we preferred to use the above-described method, because it this method is utilized in the example-based MT system being developed in our group (Brown, 1997). It has proven useful in this context.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The problem of derivational and inflectional morphology
</SectionTitle>
    <Paragraph position="0"> As the scores in the dictionary are based on surface-form words, statistical alignment algorithms such as Competitive Linking face the problem of inflected and derived terms. For instance, the English word liberty can be translated into French as a noun (liberté), as an adjective (libre), as the same adjective in the plural (libres), etc. This happens quite frequently, as sentences are often restructured in translation. In such cases, liberté, libre, libres, and all the other translations of liberty in a sense share their cooccurrence scores with liberty. This causes problems especially because some words are frequent overall in one language (here, French) and receive a high cooccurrence count regardless of the word in the other language (here, English). If the cooccurrence score between liberty and an unrelated but frequent word such as le is higher than the score between liberty and libres, then the algorithm will prefer a link between liberty and le over a link between liberty and libres, even though the latter is correct.</Paragraph>
    <Paragraph position="1"> As for a concrete example from the training data used in this study, consider the English word oil.</Paragraph>
    <Paragraph position="2"> This word is quite frequent in the training data and thus cooccurs at high counts with many target language words 1. In this case, the target language is French. The cooccurrence dictionary contains the following entries for oil among other entries:</Paragraph>
    <Paragraph position="4"> It can be seen that words such as et and dans receive higher coccurrence scores with oil than some correct translations of oil, such as p'etroli`ere, and p'etroli`eres, and, in the case of et, also p'etrole. This will cause the Competitive Linking algorithm to favor a link e.g. between oil and et over a link between oil and p'etrole.</Paragraph>
    <Paragraph position="5"> In particular, word variations can be due to inflectional morphology (e.g. adjective endings) and derivational morphology (e.g. a noun being trans- null tails.</Paragraph>
    <Paragraph position="6"> lated as an adjective due to sentence restructuring). Both inflectional and derivational morphology will result in words that are similar, but not identical, so that cooccurrence counts will score them separately. Below we describe an approach that addresses these two problems. In principle, we cluster similar words and assign them a new dictionary score that is higher than the scores of the individual words. In this way, the dictionary is rebuilt. This will influence the ranked list that is produced by the algorithm and thus the final alignments.</Paragraph>
    <Paragraph position="7"> 5 Rebuilding the dictionary based on similarity scores Rebuilding the dictionary is based largely on similarities between words. We have implemented an algorithm that assigns a similarity score to a pair of words a3 a1a3a2a4a2 a3 a4 . The score is higher for a pair of similar words, while it favors neither shorter nor longer words. The algorithm finds the number of matching characters between the words, while allowing for insertions, deletions, and substitutions. The concept is thus very closely related to the Edit distance, with the difference that our algorithm counts the matching characters rather than the non-matching ones. The length of the matching substring (which is not necessarily continguous) is denoted by Match-StringLength). At each step, a character from a3 a1 is compared to a character from a3 a4 . If the characters are identical, the count for the MatchStringLength is incremented. Then the algorithm checks for reduplication of the character in one or both of the words. Reduplication also results in an incremented Match-StringLength. If the characters do not match, the algorithm skips one or more characters in either word. Then the longest common substring is put in relation to the length of the two words. This is done so as to not favor longer words that would result in a higher MatchStringLength than shorter words. The similarity score of a3 a1 and  a4 is then computed using the following formula:</Paragraph>
    <Paragraph position="9"> This similarity scoring provides the basis for our newly built dictionary. The algorithm proceeds as follows: For any given source language word a0 a1 , there are a9 target language words a3a39a38</Paragraph>
    <Paragraph position="11"> Note that in most cases a9 is much smaller than the size of the target language vocabulary, but also much greater than a0 . For the words a3a39a38 a0a1a0a1a0 a3a41a40 , the algorithm computes the similarity score for each word</Paragraph>
    <Paragraph position="13"> that this computation is potentially very complex.</Paragraph>
    <Paragraph position="14"> The number of word pairs grows exponentially as a9 grows. This problem is addressed by excluding word pairs whose cooccurrence scores are low, as will be discussed in more detail later.</Paragraph>
    <Paragraph position="15"> In the following, we use a greedy bottom-up clustering algorithm (Manning and Sch&amp;quot;utze, 1999) to cluster those words that have high similarity scores.</Paragraph>
    <Paragraph position="16"> The clustering algorithm is initialized to a9 clusters, where each cluster contains exactly one of the</Paragraph>
    <Paragraph position="18"> ters the pair of words with the maximum similarity score. The new cluster also stores a similarity score a11 a7a10a6a13a12a15a14a17a16 a11a6a18a16 a28a20a19 a16a22a21 a19 , which in this case is the similarity score of the two clustered words. In the following steps, the algorithm again merges those two clusters that have the highest similarity score</Paragraph>
    <Paragraph position="20"> a19 . The clustering can occur in one of three ways: 1. Merge two clusters that each contain one word. Then the similarity score a11 a7a10a6a13a12a15a14a17a16a24a23a26a25a28a27a30a29a7a25a32a31 of the merged cluster will be the similarity score of the word pair.</Paragraph>
    <Paragraph position="21"> 2. Merge a cluster a42 a1 that contains a single word a3 a1 and a cluster a42</Paragraph>
    <Paragraph position="23"> ilarity score of the merged cluster is the average similarity score of the a6 -word cluster, averaged with the similarity scores between the single word and all a6 words in the cluster. This means that the algorithm computes the similarity score between the single word a3  3. Merge two clusters that each contain more than a single word. In this case, the algo- null rithm proceeds as in the second case, but averages the added similarity score over all word pairs. Suppose there exists a cluster a42 a1 with a55  Clustering proceeds until a threshold, a6a8a7a10a9 a0 a7a10a6 , is exhausted. If none of the possible merges would result in a new cluster whose average similarity score</Paragraph>
    <Paragraph position="25"> tering stops. Then the dictionary entries are modified as follows: suppose that words a3a9a6</Paragraph>
    <Paragraph position="27"> source language word a0 a1 . Furthermore, denote the cooccurrence score of the word pair a0 a1 and a3a7a6 by</Paragraph>
    <Paragraph position="29"> Not all words are considered for clustering. First, we compiled a stop list of target language words that are never clustered, regardless of their similarity and cooccurrence scores with other words. The words on the stop list are the 20 most frequent words in the target language training data. Section a77 argues why this exclusion makes sense: one of the goals of clustering is to enable variations of a word to receive a higher dictionary score than words that are very common overall.</Paragraph>
    <Paragraph position="30"> Furthermore, we have decided to exclude words from clustering that account for only few of the cooccurrences of a0a2a1 . In particular, a separate thresh-</Paragraph>
    <Paragraph position="32"> a43 , controls how high the cooccurrence score with a0a2a1 has to be in relation to all other scores between a0a2a1 and a target language word. a42a39a43a44a43a44a42</Paragraph>
    <Paragraph position="34"> that cooccur with source language word a0 a1 .</Paragraph>
    <Paragraph position="35"> Similarly to the most frequent words, dictionary scores for word pairs that are too rare for clustering remain unchanged.</Paragraph>
    <Paragraph position="36"> This exclusion makes sense because words that cooccur infrequently are likely not translations of each other, so it is undesirable to boost their score by clustering. Furthermore, this threshold helps keep the complexity of the operation under control. The fewer words qualify for clustering, the fewer similarity scores for pairs of words have to be computed.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML