Using Similarity Scoring To Improve the Bilingual Dictionary for Word
Alignment
Katharina Probst
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA, USA, 15213
kathrin@cs.cmu.edu
Ralf Brown
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA, USA, 15213
ralf@cs.cmu.edu
Abstract
We describe an approach to improve the
bilingual cooccurrence dictionary that is
used for word alignment, and evaluate the
improved dictionary using a version of
the Competitive Linking algorithm. We
demonstrate a problem faced by the Com-
petitive Linking algorithm and present an
approach to ameliorate it. In particular, we
rebuild the bilingual dictionary by cluster-
ing similar words in a language and as-
signing them a higher cooccurrence score
with a given word in the other language
than each single word would have other-
wise. Experimental results show a signifi-
cant improvement in precision and recall
for word alignment when the improved
dictionary is used.
1 Introduction and Related Work
Word alignment is a well-studied problem in Natu-
ral Language Computing. This is hardly surprising
given its significance in many applications: word-
aligned data is crucial for example-based machine
translation and statistical machine translation, as
well as for other applications such as cross-lingual
information
retrieval. Since it is a hard and time-consuming task
to hand-align bilingual data, the automation of this
task receives a fair amount of attention. In this pa-
per, we present an approach to improve the bilin-
gual dictionary that is used by word alignment al-
gorithms. Our method is based on similarity scores
between words, which in effect results in the clus-
tering of morphological variants.
One line of related work is research in clustering
based on word similarities. This problem is an area
of active research in the Information Retrieval com-
munity. For instance, Xu and Croft (1998) present
an algorithm that first clusters what are assumedly
variants of the same word, then further refines the
clusters using a cooccurrence related measure. Word
variants are found via a stemmer or by clustering all
words that begin with the same three letters. An-
other technique uses similarity scores based on N-
grams (e.g. (Kosinov, 2001)). The similarity of two
words is measured using the number of N-grams that
their occurrences have in common. As in our ap-
proach, similar words are then clustered into equiv-
alence classes.
Other related work falls in the category of word
alignment, where much research has been done. A
number of algorithms have been proposed and eval-
uated for the task. As Melamed (2000) points out,
most of these algorithms are based on word cooccur-
rences in sentence-aligned bilingual data. A source
language word s and a target language word t are
said to cooccur if s occurs in a source language sen-
tence and t occurs in the corresponding target lan-
guage sentence. Cooccurrence scores are then
counts for all word pairs (s_i, t_j), where s_i is in
the source language vocabulary and t_j is in the tar-
get language vocabulary. Often, the scores also take
into account the marginal probabilities of each word
and sometimes also the conditional probabilities of
one word given the other.
[Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 409-416.]

Aside from the classic statistical approach of
(Brown et al., 1990; Brown et al., 1993), a number
of other algorithms have been developed. Ahren-
berg et al. (1998) use morphological information on
both the source and the target languages. This infor-
mation serves to build equivalence classes of words
based on suffices. A different approach was pro-
posed by Gaussier (1998). This approach models
word alignments as flow networks. Determining the
word alignments then amounts to solving the net-
work, for which there are known algorithms. Brown
(1998) describes an algorithm that starts with ‘an-
chors’, words that are unambiguous translations of
each other. From these anchors, alignments are ex-
panded in both directions, so that entire segments
can be aligned.
The algorithm that this work was based on is the
Competitive Linking algorithm. We used it to test
our improved dictionary. Competitive Linking was
described by Melamed (1997; 1998; 2000). It com-
putes all possible word alignments in parallel data,
and ranks them by their cooccurrence or by a similar
score. Then links between words (i.e. alignments)
are chosen from the top of the list until no more links
can be assigned. There is a limit on the number of
links a word can have. In its basic form the Compet-
itive Linking algorithm (Melamed, 1997) allows for
only up to one link per word. However, this one-to-
one/zero-to-one assumption is relaxed by redefining
the notion of a word.
2 Competitive Linking in our work
We implemented the basic Competitive Linking al-
gorithm as described above. For each pair of paral-
lel sentences, we construct a ranked list of possible
links: each word in the source language is paired
with each word in the target language. Then for
each word pair the score is looked up in the dictio-
nary, and the pairs are ranked from highest to lowest
score. If a word pair does not appear in the dictio-
nary, it is not ranked. The algorithm then recursively
links the word pair with the highest cooccurrence,
then the next one, etc. In our implementation, link-
ing is performed on a sentence basis, i.e. the list of
possible links is constructed only for one sentence
pair at a time.
Our version allows for more than one link per
word, i.e. we do not assume one-to-one or zero-to-
one alignments between words. Furthermore, our
implementation contains a threshold that specifies
how high the cooccurrence score must be for the two
words in order for this pair to be considered for a
link.
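As a concrete illustration, the per-sentence-pair linking loop described above might be sketched as follows; function and variable names are illustrative, not taken from the authors' implementation.

```python
# Sketch of the Competitive Linking variant described above:
# per-sentence-pair linking, optional multiple links per word,
# and a minimum-score threshold.

def competitive_linking(src_words, tgt_words, dictionary,
                        max_links=1, min_score=0.0):
    """Greedily link word pairs in one sentence pair, best score first."""
    # Rank every source/target pair that appears in the dictionary.
    candidates = []
    for i, s in enumerate(src_words):
        for j, t in enumerate(tgt_words):
            score = dictionary.get((s, t))
            if score is not None and score >= min_score:
                candidates.append((score, i, j))
    candidates.sort(reverse=True)  # highest cooccurrence score first

    links, link_count = [], {}
    for score, i, j in candidates:
        # Each token may participate in at most max_links links.
        if (link_count.get(('s', i), 0) < max_links
                and link_count.get(('t', j), 0) < max_links):
            links.append((i, j))
            link_count[('s', i)] = link_count.get(('s', i), 0) + 1
            link_count[('t', j)] = link_count.get(('t', j), 0) + 1
    return links

# Example: with max_links=1, each word keeps only its best link.
d = {('oil', 'pétrole'): 434, ('oil', 'et'): 262, ('the', 'et'): 900}
print(competitive_linking(['the', 'oil'], ['et', 'pétrole'], d))
```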
3 The baseline dictionary
In our experiments, we used a baseline dictionary,
rebuilt the dictionary with our approach, and com-
pared the performance of the alignment algorithm
between the baseline and the rebuilt dictionary. The
dictionary that was used as a baseline and as a ba-
sis for rebuilding is derived from bilingual sentence-
aligned text using a count-and-filter algorithm:
- Count: for each source word type, count the
number of times each target word type cooc-
curs in the same sentence pair, as well as the
total number of occurrences of each source and
target type.
- Filter: after counting all cooccurrences, re-
tain only those word pairs whose cooccurrence
probability is above a defined threshold. To be
retained, a word pair w_s, w_t must satisfy

  min(P(w_s | w_t), P(w_t | w_s)) >= T(n(w_s, w_t))

where n(w_s, w_t) is the number of times the
two words cooccurred and the threshold T is
allowed to vary with that count.
By making the threshold vary with frequency, one
can control the tendency for infrequent words to be
included in the dictionary as a result of chance col-
locations. The 50% cooccurrence probability of a
pair of words with frequency 2 and a single co-
occurrence is probably due to chance, while a 10%
cooccurrence probability of words with frequency
5000 is most likely the result of the two words being
translations of each other. In our experiments, we
used thresholds of 0.005, 0.01, and 0.02.
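As a sketch, the count-and-filter construction can be written as follows. Two simplifications are assumed here: each word type is counted at most once per sentence pair, and the threshold is a single fixed value rather than a frequency-dependent one; all names are illustrative.

```python
# Sketch of the count-and-filter dictionary construction.
# Assumptions: type-level counting per sentence pair, fixed threshold.
from collections import Counter

def build_dictionary(sentence_pairs, threshold=0.01):
    cooc = Counter()      # cooccurrence counts n(ws, wt)
    src_freq = Counter()  # source word type frequencies
    tgt_freq = Counter()  # target word type frequencies
    for src_sent, tgt_sent in sentence_pairs:
        for ws in set(src_sent):
            src_freq[ws] += 1
            for wt in set(tgt_sent):
                cooc[(ws, wt)] += 1
        for wt in set(tgt_sent):
            tgt_freq[wt] += 1
    # Filter: keep pairs whose conditional cooccurrence probability,
    # in both directions, clears the threshold.
    return {(ws, wt): n for (ws, wt), n in cooc.items()
            if min(n / src_freq[ws], n / tgt_freq[wt]) >= threshold}

pairs = [(['oil', 'prices'], ['prix', 'pétrole']),
         (['oil'], ['pétrole'])]
d = build_dictionary(pairs, threshold=0.5)
print(d[('oil', 'pétrole')])  # 'oil' and 'pétrole' cooccur twice
```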
It should be noted that there are many possible
algorithms that could be used to derive the baseline
dictionary, e.g. χ², pointwise mutual information,
etc. An overview of such approaches can be found in
(Kilgarriff, 1996). In our work, we preferred to use
the above-described method, because this method
is utilized in the example-based MT system being
developed in our group (Brown, 1997). It has proven
useful in this context.
4 The problem of derivational and
inflectional morphology
As the scores in the dictionary are based on surface
form words, statistical alignment algorithms such as
Competitive Linking face the problem of inflected
and derived terms. For instance, the English word
liberty can be translated into French as a noun (lib-
erté), or else as an adjective (libre), the same adjec-
tive in the plural (libres), etc. This happens quite fre-
quently, as sentences are often restructured in trans-
lation. In such a case, liberté, libre, libres, and all
the other translations of liberty in a sense share their
cooccurrence scores with liberty. This can cause
problems especially because there are words that are
overall frequent in one language (here, French), and
that receive a high cooccurrence count regardless of
the word in the other language (here, English). If
the cooccurrence score between liberty and an un-
related but frequent word is higher than libres, then
the algorithm will prefer a link between liberty and
le over a link between liberty and libres, even if the
latter is correct.
As for a concrete example from the training data
used in this study, consider the English word oil.
This word is quite frequent in the training data and
thus cooccurs at high counts with many target lan-
guage words.¹ In this case, the target language is
French. The cooccurrence dictionary contains the
following entries for oil among other entries:
oil - et 543
oil - dans 118
...
oil - pétrole 259
oil - pétrolière 61
oil - pétrolières 61
It can be seen that words such as et and dans re-
ceive higher cooccurrence scores with oil than some
correct translations of oil, such as pétrolière and
pétrolières, and, in the case of et, also pétrole. This
will cause the Competitive Linking algorithm to fa-
vor a link e.g. between oil and et over a link between
oil and pétrole.
In particular, word variations can be due to in-
flectional morphology (e.g. adjective endings) and
derivational morphology (e.g. a noun being trans-
¹We used Hansards data; see the evaluation section for de-
tails.
lated as an adjective due to sentence restructuring).
Both inflectional and derivational morphology will
result in words that are similar, but not identical, so
that cooccurrence counts will score them separately.
Below we describe an approach that addresses these
two problems. In principle, we cluster similar words
and assign them a new dictionary score that is higher
than the scores of the individual words. In this way,
the dictionary is rebuilt. This will influence the
ranked list that is produced by the algorithm and thus
the final alignments.
5 Rebuilding the dictionary based on
similarity scores
Rebuilding the dictionary is based largely on sim-
ilarities between words. We have implemented an
algorithm that assigns a similarity score to a pair of
words (t_1, t_2). The score is higher for a pair of sim-
ilar words, while it favors neither shorter nor longer
words. The algorithm finds the number of match-
ing characters between the words, while allowing
for insertions, deletions, and substitutions. The con-
cept is thus very closely related to the Edit distance,
with the difference that our algorithm counts the
matching characters rather than the non-matching
ones. The length of the matching substring (which
is not necessarily contiguous) is denoted by Match-
StringLength. At each step, a character from t_1 is
compared to a character from t_2. If the characters
are identical, the count for the MatchStringLength is
incremented. Then the algorithm checks for redupli-
cation of the character in one or both of the words.
Reduplication also results in an incremented Match-
StringLength. If the characters do not match, the al-
gorithm skips one or more characters in either word.
Then the longest common substring is put in re-
lation to the length of the two words. This is done
so as to not favor longer words that would result in a
higher MatchStringLength than shorter words. The
similarity score of t_1 and t_2 is then computed using
the following formula:

  sim(t_1, t_2) = 2 × MatchStringLength / (length(t_1) + length(t_2))
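As a sketch, the scoring can be implemented as follows. The greedy character matcher with reduplication handling is approximated here by the longest common subsequence, which likewise counts matching characters while allowing insertions, deletions, and substitutions; names are illustrative.

```python
# Sketch of the similarity score; MatchStringLength is approximated
# by the longest common subsequence (an assumption).

def match_string_length(t1, t2):
    """Length of the longest (not necessarily contiguous) match."""
    prev = [0] * (len(t2) + 1)
    for c1 in t1:
        cur = [0]
        for j, c2 in enumerate(t2):
            cur.append(prev[j] + 1 if c1 == c2 else max(prev[j + 1], cur[j]))
        prev = cur
    return prev[-1]

def similarity(t1, t2):
    """Normalize by both lengths so neither short nor long words win."""
    return 2 * match_string_length(t1, t2) / (len(t1) + len(t2))

print(round(similarity('pétrolière', 'pétrolières'), 2))  # 0.95
```

Normalizing by the sum of both lengths keeps the score in [0, 1] and, as the text notes, favors neither shorter nor longer words.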
This similarity scoring provides the basis for our
newly built dictionary. The algorithm proceeds as
follows: for any given source language word s, there
are n target language words t_1 ... t_n such that the
cooccurrence score cooc(s, t_j) is greater than 0.
Note that in most cases n is much smaller than the
size of the target language vocabulary, but also much
greater than one. For the words t_1 ... t_n, the algo-
rithm computes the similarity score for each word
pair (t_j, t_k), where 1 ≤ j < k ≤ n. Note
that this computation is potentially very complex,
as the number of word pairs grows quadratically
with n. This problem is addressed by excluding
word pairs whose cooccurrence scores are low, as
will be discussed in more detail later.
In the following, we use a greedy bottom-up clus-
tering algorithm (Manning and Schütze, 1999) to
cluster those words that have high similarity scores.
The clustering algorithm is initialized to n clus-
ters, where each cluster contains exactly one of the
words t_1 ... t_n. In the first step, the algorithm clus-
ters the pair of words with the maximum similar-
ity score. The new cluster also stores a similarity
score sim(c_new), which in this case is the
similarity score of the two clustered words. In the
following steps, the algorithm again merges those
two clusters that have the highest similarity score
sim(c_new). The clustering can occur in one
of three ways:
1. Merge two clusters that each contain one word.
Then the similarity score sim(c_new) of the
merged cluster will be the similarity score of
the word pair.
2. Merge a cluster c_1 that contains a single word t_0
and a cluster c_2 that contains m words t_1 ... t_m
and has score sim(c_2). Then the sim-
ilarity score of the merged cluster is the aver-
age similarity score of the m-word cluster, av-
eraged with the similarity scores between the
single word and all m words in the cluster. This
means that the algorithm computes the similar-
ity score between the single word t_0 in cluster
c_1 and each of the m words in cluster c_2, and
averages them with sim(c_2):

  sim(c_new) = (sim(c_2) + Σ_{j=1..m} sim(t_0, t_j)) / (m + 1)
3. Merge two clusters that each contain more
than a single word. In this case, the algo-
rithm proceeds as in the second case, but av-
erages the added similarity score over all word
pairs. Suppose there exists a cluster c_1 with l
words t_1 ... t_l and score sim(c_1), and a cluster
c_2 with m words t'_1 ... t'_m and score sim(c_2).
Then sim(c_new) is computed as follows:

  sim(c_new) = (sim(c_1) + sim(c_2) + Σ_{j=1..l} Σ_{k=1..m} sim(t_j, t'_k)) / (l × m + 2)
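A minimal sketch of this greedy bottom-up clustering follows, under the reading that all three merge cases average the cross-cluster pair similarities with any existing (non-singleton) cluster scores. The exact bookkeeping is an assumption, and the character-overlap similarity used in the example is a toy stand-in; all names are illustrative.

```python
# Sketch of greedy bottom-up clustering over candidate target words.
# Assumption: merged-cluster score averages cross-pair similarities
# with the scores of any non-singleton input clusters.

def cluster_words(words, sim, min_sim):
    """Greedily merge clusters until no merge scores at least min_sim."""
    clusters = [([w], 1.0) for w in words]  # (members, cluster score)
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                wa, sa = clusters[a]
                wb, sb = clusters[b]
                # Cross-pair similarities plus existing cluster scores;
                # singleton clusters contribute no score term.
                cross = [sim(x, y) for x in wa for y in wb]
                terms = cross + ([sa] if len(wa) > 1 else []) \
                              + ([sb] if len(wb) > 1 else [])
                score = sum(terms) / len(terms)
                if best is None or score > best[0]:
                    best = (score, a, b)
        score, a, b = best
        if score < min_sim:
            break  # no merge reaches the min_sim threshold
        merged = (clusters[a][0] + clusters[b][0], score)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return [members for members, _ in clusters]

def char_overlap(x, y):
    """Toy similarity: Jaccard overlap of character sets."""
    return len(set(x) & set(y)) / len(set(x) | set(y))

print(cluster_words(['pétrole', 'pétrolière', 'et'], char_overlap, 0.5))
```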
Clustering proceeds until a threshold, minsim, is
exhausted. If none of the possible merges would re-
sult in a new cluster whose average similarity score
sim(c_new) would be at least minsim, clus-
tering stops. Then the dictionary entries are mod-
ified as follows: suppose that words t_1 ... t_k are
clustered, where all words t_1 ... t_k cooccur with
source language word s. Furthermore, denote the
cooccurrence score of the word pair s and t_j by
cooc(s, t_j). Then in the rebuilt dictionary the en-
try cooc(s, t_j) will be replaced with

  cooc(s, t_1) + cooc(s, t_2) + ... + cooc(s, t_k)

if t_j is one of t_1 ... t_k.
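The rebuilding step itself can be sketched as follows, assuming each clustered entry is replaced by the sum of the cluster members' scores with s (consistent with the description above; in the oil example of section 4 the actual cluster presumably contained further variants, since the three listed scores alone do not account for the rebuilt value). Names are illustrative.

```python
# Sketch of the dictionary-rebuilding step: every clustered target
# word's entry for source word s gets the summed score of the cluster.

def rebuild_entries(dictionary, s, cluster):
    total = sum(dictionary.get((s, t), 0) for t in cluster)
    for t in cluster:
        if (s, t) in dictionary:
            dictionary[(s, t)] = total
    return dictionary

d = {('oil', 'pétrole'): 259, ('oil', 'pétrolière'): 61,
     ('oil', 'pétrolières'): 61}
rebuild_entries(d, 'oil', ['pétrole', 'pétrolière', 'pétrolières'])
print(d[('oil', 'pétrolière')])  # 259 + 61 + 61 = 381
```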
Not all words are considered for clustering. First,
we compiled a stop list of target language words that
are never clustered, regardless of their similarity and
cooccurrence scores with other words. The words
on the stop list are the 20 most frequent words in
the target language training data. Section 4 argues
why this exclusion makes sense: one of the goals of
clustering is to enable variations of a word to receive
a higher dictionary score than words that are very
common overall.
Furthermore, we have decided to exclude from
clustering those words that account for only few of
the cooccurrences of s. In particular, a separate
threshold, coocsratio, controls how high the cooc-
currence score with s has to be in relation to all
other scores between s and a target language word.
coocsratio is applied as follows: a word t_j qualifies
for clustering if

  cooc(s, t_j) / (cooc(s, t_1) + ... + cooc(s, t_n)) > coocsratio

As before, t_1 ... t_n are all the target language words
that cooccur with source language word s.
Similarly to the most frequent words, dictionary
scores for word pairs that are too rare for clustering
remain unchanged.
This exclusion makes sense because words that
cooccur infrequently are likely not translations of
each other, so it is undesirable to boost their score by
clustering. Furthermore, this threshold helps keep
the complexity of the operation under control. The
fewer words qualify for clustering, the fewer simi-
larity scores for pairs of words have to be computed.
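As a sketch, the qualification test, combining the stop list and the coocsratio share threshold, might look like this (names are illustrative, not from the authors' implementation):

```python
# Sketch of the coocsratio filter: a target word qualifies for
# clustering only if it accounts for a large enough share of all
# cooccurrences of the source word s and is not on the stop list.

def qualifying_words(dictionary, s, coocs_ratio=0.003, stoplist=()):
    entries = {t: n for (src, t), n in dictionary.items() if src == s}
    total = sum(entries.values())
    return [t for t, n in entries.items()
            if n / total > coocs_ratio and t not in stoplist]

d = {('oil', 'et'): 543, ('oil', 'pétrole'): 259, ('oil', 'rare'): 1}
# 'et' is on the stop list; 'rare' falls below the share threshold.
print(qualifying_words(d, 'oil', coocs_ratio=0.01, stoplist={'et'}))
```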
6 Evaluation
We trained three basic dictionaries using part of the
Hansard data, around five megabytes of data (around
20k sentence pairs and 850k words). The basic dic-
tionaries were built using the algorithm described
in section 3, with three different thresholds: 0.005,
0.01, and 0.02. In the following, we will refer to
these dictionaries as Dict0.005, Dict0.01, and
Dict0.02.
50 sentences were held back for testing. These
sentences were hand-aligned by a fluent speaker of
French. No one-to-one assumption was enforced. A
word could thus align to zero or more words, where
no upper limit was enforced (although there is a nat-
ural upper limit).
The Competitive Linking algorithm was then run
with multiple parameter settings. In one setting, we
varied the maximum number of links allowed per
word, maxlinks. For example, if the maximum
number is 2, then a word can align to 0, 1, or 2 words
in the parallel sentence. In other settings, we en-
forced a minimum score in the bilingual dictionary
for a link to be accepted, minscore. This means that
two words cannot be aligned if their score is below
minscore. In the rebuilt dictionaries, minscore is
applied in the same way.
The dictionary was also rebuilt using a number
of different parameter settings. The two parameters
that can be varied when rebuilding the dictionary
are the similarity threshold minsim and the cooc-
currence threshold coocsratio. minsim enforces
that all words within one cluster must have an av-
erage similarity score of at least minsim. The sec-
ond threshold, coocsratio, enforces that only certain
words are considered for clustering. Those words
that are considered for clustering should account
for more than (100 × coocsratio)% of the cooccur-
rences of the source language word with any tar-
get language word. If a word falls below threshold
coocsratio, its entry in the dictionary remains un-
changed, and it is not clustered with any other word.
Below we summarize the values each parameter was
set to.
- maxlinks Used in Competitive Linking algo-
rithm: Maximum number of words any word
can be aligned with. Set to: 1, 2, 3.
- minscore Used in Competitive Linking algo-
rithm: Minimum score of a word pair in the
dictionary to be considered as a possible link.
Set to: 1, 2, 4, 6, 8, 10, 20, 30, 40, 50.
- minsim Used in rebuilding dictionary: Mini-
mum average similarity score of the words in
a cluster. Set to: 0.6, 0.7, 0.8.
- coocsratio Used in rebuilding dictionary:
(100 × coocsratio)% is the minimum percent-
age of all cooccurrences of a source language
word with any target language word that are ac-
counted for by one target language word. Set
to: 0.003.
Thus varying the parameters, we have constructed
various dictionaries by rebuilding the three baseline
dictionaries. Here, we report on results on three dic-
tionaries where minsim was set to 0.7 and coocsra-
tio was set to 0.003. For these parameter settings,
we observed robust results, although other parame-
ter settings also yielded positive results.
Precision and recall were measured using the hand-
aligned 50 sentences. Precision was defined as
the percentage of links that were correctly pro-
posed by our algorithm out of all links that were
proposed. Recall is defined as the percentage of
links that were found by our algorithm out of all
links that should have been found. In both cases,
the hand-aligned data was used as a gold standard.
The F-measure combines precision and recall:

  F-measure = (2 × Precision × Recall) / (Precision + Recall)
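The metrics can be sketched as follows, with the hand-aligned links as the gold standard (names are illustrative):

```python
# Sketch of the evaluation metrics over sets of proposed vs. gold links.

def precision_recall_f(proposed, gold):
    proposed, gold = set(proposed), set(gold)
    correct = len(proposed & gold)
    p = correct / len(proposed) if proposed else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Links are (source position, target position) pairs.
p, r, f = precision_recall_f({(0, 0), (1, 2)}, {(0, 0), (1, 1), (2, 2)})
print(round(p, 2), round(r, 2), round(f, 2))  # 0.5 0.33 0.4
```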
The following figures and tables illustrate that the
Competitive Linking algorithm performs favorably
when a rebuilt dictionary is used. Table 1 lists the
improvement in precision and recall for each of the
dictionaries. The table shows the values when the
minscore threshold is set to 50 and up to one link
was allowed per word. Furthermore, the p-values of
a 1-tailed t-test are listed, indicating that these per-
formance boosts are mostly highly statistically sig-
nificant for these parameter settings, where some of
the best results were observed.

              Dict0.005  Dict0.01  Dict0.02
P Improvement   0.060      0.067     0.057
P p-value       0.0003     0.0042    0.0126
R Improvement   0.094      0.11      0.087
R p-value       0.0026     0.0008    0.0037

Table 1: Percent improvement and p-value for recall
and precision, comparing baseline and rebuilt dictio-
naries at minscore 50 and maxlinks 1.
The following figures (figures 1-9) serve to illus-
trate the impact of the algorithm in greater detail. All
figures plot the precision, recall, and f-measure per-
formance against different minscore settings, com-
paring rebuilt dictionaries to their baselines. For
each dictionary, three plots are given, one for each
maxlinks setting, i.e. the maximum number of links
allowed per word. The curve names indicate the
type of the curve (Precision, Recall, or F-measure),
the maximum number of links allowed per word (1,
2, or 3), the dictionary used (Dict0.005, Dict0.01,
or Dict0.02), and whether the run used the base-
line dictionary or the rebuilt dictionary (Baseline or
Cog7.3).
It can be seen that our algorithm leads to sta-
ble improvement across parameter settings. In a few
cases, it drops below the baseline when minscore is
low. Overall, however, our algorithm is robust: it
improves alignment regardless of how many links
are allowed per word and which baseline dictionary
is used, and it boosts both precision and recall, and
thus also the f-measure.
To return briefly to the example cited in section
4, we can now show how the dictionary rebuild has
affected these entries. In the rebuilt dictionary, they
now look as follows:
oil - et 262
oil - dans 118
...
oil - pétrole 434
oil - pétrolière 434
oil - pétrolières 434
The fact that pétrole, pétrolière, and pétrolières
now receive higher scores than et and dans is what
causes the alignment performance to increase.
Figure 1: Performance of dictionaries Dict0.005 for
up to one link per word
Figure 2: Performance of dictionaries Dict0.005 for
up to two links per word
7 Conclusions and Future Work
We have demonstrated how rebuilding a dictionary
can improve the performance (both precision and re-
call) of a word alignment algorithm. The algorithm
proved robust across baseline dictionaries and vari-
ous different parameter settings. Although a small
test set was used, the improvements are statistically
significant for various parameter settings. We have
shown that computing similarity scores of pairs of
words can be used to cluster morphological variants
of words in an inflected language such as French.
It will be interesting to see how the similarity
and clustering method will work in conjunction with
other word alignment algorithms, as the dictionary
rebuilding algorithm is independent of the actual
word alignment method used.

Figure 3: Performance of dictionaries Dict0.005 for
up to three links per word

Figure 4: Performance of dictionaries Dict0.01 for
up to one link per word
Furthermore, we plan to explore ways to improve
the similarity scoring algorithm. For instance, we
can assign lower match scores when the characters
are not identical, but members of the same equiva-
lence class. The equivalence classes will depend on
the target language at hand. For instance, in Ger-
man, a and ä will be assigned to the same equiva-
lence class, because some inflections cause a to be-
come ä. An improved similarity scoring algorithm
may in turn result in improved word alignments.
In general, we hope to move automated dictio-
nary extraction away from pure surface form statis-
tics and toward dictionaries that are more linguisti-
cally motivated.

Figure 5: Performance of dictionaries Dict0.01 for
up to two links per word

Figure 6: Performance of dictionaries Dict0.01 for
up to three links per word
References
Lars Ahrenberg, M. Andersson, and M. Merkel. 1998. A
simple hybrid aligner for generating lexical correspon-
dences in parallel texts. In Proceedings of COLING-
ACL’98.
Peter Brown, J. Cocke, S.D. Pietra, V.D. Pietra, F. Je-
linek, J. Lafferty, R. Mercer, and P. Roossin. 1990. A statis-
tical approach to Machine Translation. Computational
Linguistics, 16(2):79–85.
Peter Brown, S.D. Pietra, V.D. Pietra, and R. Mercer.
1993. The mathematics of statistical Machine Trans-
lation: Parameter estimation. Computational Linguis-
tics.
Figure 7: Performance of dictionaries Dict0.02 for
up to one link per word
Figure 8: Performance of dictionaries Dict0.02 for
up to two links per word
Ralf Brown. 1997. Automated dictionary extraction for
‘knowledge-free’ example-based translation. In Pro-
ceedings of TMI 1997, pages 111–118.
Ralf Brown. 1998. Automatically-extracted thesauri for
cross-language IR: When better is worse. In Proceed-
ings of COMPUTERM’98.
Eric Gaussier. 1998. Flow network models for word
alignment and terminology extraction from bilingual
corpora. In Proceedings of COLING-ACL’98.
Adam Kilgarriff. 1996. Which words are particularly
characteristic of a text? A survey of statistical ap-
proaches. In Proceedings of AISB Workshop on Lan-
guage Engineering for Document Analysis and Recog-
nition.
Serhiy Kosinov. 2001. Evaluation of N-grams confla-
tion approach in text-based Information Retrieval. In
Proceedings of International Workshop on Informa-
tion Retrieval IR'01.

Figure 9: Performance of dictionaries Dict0.02 for
up to three links per word
Christopher D. Manning and Hinrich Schütze, 1999.
Foundations of Statistical Natural Language Process-
ing, chapter 14. MIT Press.
Dan I. Melamed. 1997. A word-to-word model of trans-
lation equivalence. In Proceedings of ACL’97.
Dan I. Melamed. 1998. Empirical methods for MT lexi-
con development. In Proceedings of AMTA’98.
Dan I. Melamed. 2000. Models of translational equiv-
alence among words. Computational Linguistics,
26(2):221–249.
Jinxi Xu and W. Bruce Croft. 1998. Corpus-based stem-
ming using co-occurrence of word variants. ACM
Transactions on Information Systems, 16(1):61–81.
