Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 817–824,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Weakly Supervised Named Entity Transliteration and Discovery from
Multilingual Comparable Corpora
Alexandre Klementiev Dan Roth
Dept. of Computer Science
University of Illinois
Urbana, IL 61801
a0 klementi,danr
a1 @uiuc.edu
Abstract
Named Entity recognition (NER) is an
important part of many natural language
processing tasks. Current approaches of-
ten employ machine learning techniques
and require supervised data. However,
many languages lack such resources. This
paper presents an (almost) unsupervised
learning algorithm for automatic discov-
ery of Named Entities (NEs) in a resource
free language, given a bilingual corpora in
which it is weakly temporally aligned with
a resource rich language. NEs have similar
time distributions across such corpora, and
often some of the tokens in a multi-word
NE are transliterated. We develop an algo-
rithm that exploits both observations itera-
tively. The algorithm makes use of a new,
frequency based, metric for time distribu-
tions and a resource free discriminative ap-
proach to transliteration. Seeded with a
small number of transliteration pairs, our
algorithm discovers multi-word NEs, and
takes advantage of a dictionary (if one ex-
ists) to account for translated or partially
translated NEs. We evaluate the algorithm
on an English-Russian corpus, and show
high level of NEs discovery in Russian.
1 Introduction
Named Entity recognition has been getting much
attention in NLP research in recent years, since it
is seen as significant component of higher level
NLP tasks such as information distillation and
question answering. Most successful approaches
to NER employ machine learning techniques,
which require supervised training data. However,
for many languages, these resources do not ex-
ist. Moreover, it is often difficult to find experts
in these languages both for the expensive anno-
tation effort and even for language specific clues.
On the other hand, comparable multilingual data
(such as multilingual news streams) are becoming
increasingly available (see section 4).
In this work, we make two independent obser-
vations about Named Entities encountered in such
corpora, and use them to develop an algorithm that
extracts pairs of NEs across languages. Specifi-
cally, given a bilingual corpora that is weakly tem-
porally aligned, and a capability to annotate the
text in one of the languages with NEs, our algo-
rithm identifies the corresponding NEs in the sec-
ond language text, and annotates them with the ap-
propriate type, as in the source text.
The first observation is that NEs in one language
in such corpora tend to co-occur with their coun-
terparts in the other. E.g., Figure 1 shows a his-
togram of the number of occurrences of the word
Hussein and its Russian transliteration in our bilin-
gual news corpus spanning years 2001 through
late 2005. One can see several common peaks
in the two histograms, largest one being around
the time of the beginning of the war in Iraq. The
word Russia, on the other hand, has a distinctly
different temporal signature. We can exploit such
weak synchronicity of NEs across languages to
associate them. In order to score a pair of enti-
ties across languages, we compute the similarity
of their time distributions.
The second observation is that NEs often con-
tain or are entirely made up of words that are pho-
netically transliterated or have a common etymo-
logical origin across languages (e.g. parliament in
English and a2a4a3a6a5a8a7a9a3a11a10a13a12a15a14a4a16 , its Russian translation),
and thus are phonetically similar. Figure 2 shows
817
 0
 5
 10
 15
 20
’hussein’ (English)
 0
 5
 10
 15
 20
’hussein’ (Russian)
 0
 5
 10
 15
 20
01/01/01 10/05/05
Number of Occurences
Time
’russia’ (English)
Figure 1: Temporal histograms for Hussein (top),
its Russian transliteration (middle), and of the
word Russia (bottom).
an example list of NEs and their possible Russian
transliterations.
Approaches that attempt to use these two
characteristics separately to identify NEs across
languages would have significant shortcomings.
Transliteration based approaches require a good
model, typically handcrafted or trained on a clean
set of transliteration pairs. On the other hand, time
sequence similarity based approaches would in-
correctly match words which happen to have sim-
ilar time signatures (e.g., Taliban and Afghanistan
in recent news).
We introduce an algorithm we call co-ranking
which exploits these observations simultaneously
to match NEs on one side of the bilingual cor-
pus to their counterparts on the other. We use a
Discrete Fourier Transform (Arfken, 1985) based
metric for computing similarity of time distribu-
tions, and show that it has significant advantages
over other metrics traditionally used. We score
NEs similarity with a linear transliteration model.
We first train a transliteration model on single-
word NEs. During training, for a given NE in one
language, the current model chooses a list of top
ranked transliteration candidates in another lan-
guage. Time sequence scoring is then used to re-
rank the list and choose the candidate best tem-
porally aligned with the NE. Pairs of NEs and the
best candidates are then used to iteratively train the
a0a2a1a4a3a6a5a8a7a8a9a11a10a13a12a14a0 a15a17a16a4a9a11a9a18a7a20a19a21a1a22a12a14a0
a23a25a24a26a23a26a24a26a27 a28a30a29a31a28a32a29a21a33
a34a30a35a37a36 a27a39a38 a35a41a40 a42 a28a30a43a18a44a45a33a21a43a41a46
a47 a40a39a48a50a49a52a51a54a53a55a40a56a49 a57 a46a30a58a56a59 a42a61a60 a46a55a59
a24a26a62 a48 a47 a35 a23 a29a21a63a11a64 a57 a43a41a28a30a65
a38 a53a45a66a68a67a69a48a45a70a21a70 a71a41a60a72a42a74a73 a64a45a75
a76 a48a77a36 a38 a67a78a48a50a70a30a49a31a79 a80 a64a45a44 a73 a64a50a75a52a59a32a81
Figure 2: Example English NEs and their translit-
erated Russian counterparts.
transliteration model.
Once the model is trained, NE discovery pro-
ceeds as follows. For a given NE, transliteration
model selects a candidate list for each constituent
word. If a dictionary is available, each candidate
list is augmented with translations (if they exist).
Translations will be the correct choice for some
NE words (e.g. for queen in Queen Victoria),
and transliterations for others (e.g. Bush in Steven
Bush). We expect temporal sequence alignment to
resolve many of such ambiguities. It is used to
select the best translation/transliteration candidate
from each word’s candidate set, which are then
merged into a possible NE in the other language.
Finally, we verify that the NE is actually contained
in the target corpus.
A major challenge inherent in discovering
transliterated NEs is the fact that a single en-
tity may be represented by multiple transliteration
strings. One reason is language morphology. For
example, in Russian, depending on a case being
used, the same noun may appear with various end-
ings. Another reason is the lack of translitera-
tion standards. Again, in Russian, several possible
transliterations of an English entity may be accept-
able, as long as they are phonetically similar to the
source.
Thus, in order to rely on the time sequences we
obtain, we need to be able to group variants of
the same NE into an equivalence class, and collect
their aggregate mention counts. We would then
score time sequences of these equivalence classes.
For instance, we would like to count the aggregate
number of occurrences of a82 Herzegovina, Herce-
govinaa83 on the English side in order to map it ac-
curately to the equivalence class of that NE’s vari-
ants we may see on the Russian side of our cor-
pus (e.g. a82a72a84 a12 a5a86a85a9a12a88a87a41a89a21a90a6a91 a14 a3a52a92a93a84 a12 a5a94a85a9a12a88a87a41a89a30a90a6a91 a14a4a95a37a92a93a84 a12 a5a86a85 a12a77a96
a87a41a89a30a90a32a91 a14a98a97a86a92a98a84 a12 a5a86a85 a12a77a87a41a89a21a90a6a91 a14a99a89a101a100a91a32a83 ).
One of the objectives for this work was to use as
818
little of the knowledge of both languages as pos-
sible. In order to effectively rely on the quality of
time sequence scoring, we used a simple, knowl-
edge poor approach to group NE variants for the
languages of our corpus (see 3.2.1).
In the rest of the paper, whenever we refer to a
Named Entity or an NE constituent word, we im-
ply its equivalence class. Note that although we
expect that better use of language specific knowl-
edge would improve the results, it would defeat
one of the goals of this work.
2 Previous work
There has been other work to automati-
cally discover NE with minimal supervision.
Both (Cucerzan and Yarowsky, 1999) and (Collins
and Singer, 1999) present algorithms to obtain
NEs from untagged corpora. However, they focus
on the classification stage of already segmented
entities, and make use of contextual and mor-
phological clues that require knowledge of the
language beyond the level we want to assume
with respect to the target language.
The use of similarity of time distributions for
information extraction, in general, and NE extrac-
tion, in particular, is not new. (Hetland, 2004)
surveys recent methods for scoring time sequences
for similarity. (Shinyama and Sekine, 2004) used
the idea to discover NEs, but in a single language,
English, across two news sources.
A large amount of previous work exists on
transliteration models. Most are generative and
consider the task of producing an appropriate
transliteration for a given word, and thus require
considerable knowledge of the languages. For
example, (AbdulJaleel and Larkey, 2003; Jung
et al., 2000) train English-Arabic and English-
Korean generative transliteration models, respec-
tively. (Knight and Graehl, 1997) build a gen-
erative model for backward transliteration from
Japanese to English.
While generative models are often robust, they
tend to make independence assumptions that do
not hold in data. The discriminative learning
framework argued for in (Roth, 1998; Roth, 1999)
as an alternative to generative models is now used
widely in NLP, even in the context of word align-
ment (Taskar et al., 2005; Moore, 2005). We
make use of it here too, to learn a discriminative
transliteration model that requires little knowledge
of the target language.
We extend our preliminary work in (Kle-
mentiev and Roth, 2006) to discover multi-word
Named Entities and to take advantage of a dictio-
nary (if one exists) to handle NEs which are par-
tially or entirely translated. We take advantage of
dynamically growing feature space to reduce the
number of supervised training examples.
3 Co-Ranking: An Algorithm for NE
Discovery
3.1 The algorithm
In essence, the algorithm we present uses tem-
poral alignment as a supervision signal to itera-
tively train a transliteration model. On each iter-
ation, it selects a list of top ranked transliteration
candidates for each NE according to the current
model (line 6). It then uses temporal alignment
(with thresholding) to re-rank the list and select
the best transliteration candidate for the next round
of training (lines 8, and 9).
Once the training is complete, lines 4 through
10 are executed without thresholding for each con-
stituent NE word. If a dictionary is available,
transliteration candidate lists a0a2a1a4a3 on line 6 are
augmented with translations. We then combine
the best candidates (as chosen on line 8, without
thresholding) into complete target language NE.
Finally, we discard transliterations which do not
actually appear in the target corpus.
Input: Bilingual, comparable corpus (a5 , a6 ), set of
named entities a7a9a8a11a10 from a5 , threshold a12
Output: Transliteration model a13
Initialize a13 ;1 a14
a8a16a15a17a7a9a8 a10 , collect time distribution a18a20a19 a10 ;2
repeat3 a21a23a22a25a24
;4
for each a8 a10 a15a26a7a9a8 a10 do5
Use a13 to collect a list of candidates a7a9a8a28a27a29a15a30a66
with high transliteration scores;a14
a8a31a15a26a7a9a8a11a27 collect time distribution a18a20a19a32a27 ;7
Select candidate a8 a27 a15a26a7a9a8 a27 with the best8 a33a35a34a37a36a39a38a41a40a43a42a43a44a46a45
a18a20a19 a10a48a47 a18a20a19a32a27a50a49 ;
if
a33
exceeds a12 , add tuple
a45
a8 a10a51a47 a8a11a27a52a49 to
a21
;9
end10
Use
a21
to train a13 ;11
until D stops changing between iterations ;12
Algorithm 1: Iterative transliteration model
training.
819
3.2 Time sequence generation and matching
In order to generate time sequence for a word, we
divide the corpus into a sequence of temporal bins,
and count the number of occurrences of the word
in each bin. We then normalize the sequence.
We use a method called the F-index (Hetland,
2004) to implement the a0a2a1a4a3a6a5a8a7 similarity function
on line 8 of the algorithm. We first run a Discrete
Fourier Transform on a time sequence to extract its
Fourier expansion coefficients. The score of a pair
of time sequences is then computed as a Euclidean
distance between their expansion coefficient vec-
tors.
3.2.1 Equivalence Classes
As we mentioned in the introduction, an NE
may map to more than one transliteration in an-
other language. Identification of the entity’s
equivalence class of transliterations is important
for obtaining its accurate time sequence.
In order to keep to our objective of requiring as
little language knowledge as possible, we took a
rather simplistic approach for both languages of
our corpus. For Russian, two words were consid-
ered variants of the same NE if they share a prefix
of size five or longer. Each unique word had its
own equivalence class for the English side of the
corpus, although, in principal, ideas such as in (Li
et al., 2004) could be incorporated.
A cumulative distribution was then collected
for such equivalence classes.
3.3 Transliteration model
Unlike most of the previous work considering gen-
erative transliteration models, we take the discrim-
inative approach. We train a linear model to decide
whether a word a1 a3a10a9a12a11 is a transliteration of an
NE a1a14a13a15a9a17a16 . The words in the pair are partitioned
into a set of substrings a0a18a13 and a0 a3 up to a particular
length (including the empty string ). Couplings of
the substrings a19a20a0 a13a22a21 a0 a3a24a23 from both sets produce fea-
tures we use for training. Note that couplings with
the empty string represent insertions/omissions.
Consider the following example: (a1a25a13 , a1a51a3 ) =
(powell, pauel). We build a feature vector from
this example in the following manner:
a26 First, we split both words into all possible
substrings of up to size two:
a1a14a13a28a27 a82 a21a30a29a31a21 a3 a21a33a32a34a21 a7 a21a36a35a37a21a36a35a37a21a30a29 a3 a21 a3 a32a34a21a33a32 a7 a21 a7 a35a37a21a38a35a30a35 a83
a1 a3 a27 a82 a21a30a29a31a21a36a39a40a21a33a41a31a21 a7 a21a36a35a42a21a30a29a43a39a44a21a36a39a45a41a31a21a33a41 a7 a21 a7 a35 a83
a26 We build a feature vector by coupling sub-
strings from the two sets:
a19a33a19 a29a46a21 a23a47a21 a19 a29a46a21a36a39a48a23a47a21a50a49a51a49a51a49 a19 a32a34a21a36a39a52a41a53a23a47a21a50a49a51a49a51a49 a19a20a7 a35a54a21 a7 a35a55a23a47a21a56a49a51a49a51a49 a19 a35a57a35a37a21 a7 a35a58a23a33a23
We use the observation that transliteration tends
to preserve phonetic sequence to limit the number
of couplings. For example, we can disallow the
coupling of substrings whose starting positions are
too far apart: thus, we might not consider a pairing
a19 a29 a3 a21a33a41 a7 a23 in the above example. In our experiments,
we paired substrings if their positions in their re-
spective words differed by -1, 0, or 1.
We use the perceptron (Rosenblatt, 1958) algo-
rithm to train the model. The model activation
provides the score we use to select best translit-
erations on line 6. Our version of perceptron takes
variable number of features in its examples; each
example is a subset of all features seen so far that
are active in the input. As the iterative algorithm
observes more data, it discovers and makes use of
more features. This model is called the infinite at-
tribute model (Blum, 1992) and it follows the per-
ceptron version of SNoW (Roth, 1998).
Positive examples used for iterative training are
pairs of NEs and their best temporally aligned
(thresholded) transliteration candidates. Negative
examples are English non-NEs paired with ran-
dom Russian words.
4 Experimental Study
We ran experiments using a bilingual comparable
English-Russian news corpus we built by crawl-
ing a Russian news web site (www.lenta.ru).
The site provides loose translations of (and
pointers to) the original English texts. We col-
lected pairs of articles spanning from 1/1/2001
through 10/05/2005. The corpus consists of
2,327 documents, with 0-8 documents per day.
The corpus is available on our web page at
http://L2R.cs.uiuc.edu/a59 cogcomp/.
The English side was tagged with a publicly
available NER system based on the SNoW learn-
ing architecture (Roth, 1998), that is available
on the same site. This set of English NEs was
hand-pruned to remove incorrectly classified
words to obtain 978 single word NEs.
In order to reduce running time, some lim-
ited pre-processing was done on the Russian side.
All classes, whose temporal distributions were
close to uniform (i.e. words with a similar like-
lihood of occurrence throughout the corpus) were
820
 0
 10
 20
 30
 40
 50
 60
 70
 80
 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19
Accuracy (%)
Iteration
Complete Algorithm
Transliteration Model Only
Temporal Sequence Only
Figure 3: Proportion of correctly discovered NE
pairs vs. training iteration. Complete algorithm
outperforms both transliteration model and tempo-
ral sequence matching when used on their own.
deemed common and not considered as NE can-
didates. Unique words were thus grouped into
14,781 equivalence classes.
Unless mentioned otherwise, the transliteration
model was initialized with a set of 20 pairs of En-
glish NEs and their Russian transliterations. Nega-
tive examples here and during the rest of the train-
ing were pairs of randomly selected non-NE En-
glish and Russian words.
New features were discovered throughout train-
ing; all but top 3000 features from positive and
3000 from negative examples were pruned based
on the number of their occurrences so far. Fea-
tures remaining at the end of training were used
for NE discovery.
Insertions/omissions features were not used in
the experiments as they provided no tangible ben-
efit for the languages of our corpus.
In each iteration, we used the current transliter-
ation model to find a list of 30 best transliteration
equivalence classes for each NE. We then com-
puted time sequence similarity score between NE
and each class from its list to find the one with
the best matching time sequence. If its similar-
ity score surpassed a set threshold, it was added
to the list of positive examples for the next round
of training. Positive examples were constructed
by pairing an NE with the common stem of its
transliteration equivalence class. We used the
same number of positive and negative examples.
 0
 10
 20
 30
 40
 50
 60
 70
 80
 0  1  2  3  4  5
Accuracy (%)
Iteration
5 examples
20 examples
80 examples
Figure 4: Proportion of correctly discovered NE
pairs vs. the initial example set size. As long as
the size is large enough, decreasing the number of
examples does not have a significant impact on the
performance of the later iterations.
We used the Mueller English-Russian dictio-
nary to obtain translations in our multi-word NE
experiments. We only considered the first dictio-
nary definition as a candidate.
For evaluation, random 727 of the total of 978
NEs were matched to correct transliterations by a
language expert (partly due to the fact that some of
the English NEs were not mentioned in the Rus-
sian side of the corpus). Accuracy was computed
as the percentage of NEs correctly identified by
the algorithm.
In the multi-word NE experiment, 282 random
multi-word (2 or more) NEs and their translit-
erations/translations discovered by the algorithm
were verified by a language expert.
4.1 NE discovery
Figure 3 shows the proportion of correctly dis-
covered NE transliteration equivalence classes
throughout the training stage. The figure also
shows the accuracy if transliterations are selected
according to the current transliteration model (top
scoring candidate) and temporal sequence match-
ing alone.
The transliteration model alone achieves an ac-
curacy of about 38%, while the time sequence
alone gets about 41%. The combined algorithm
achieves about 63%, giving a significant improve-
ment.
821
a0a2a1a4a3 a0a5a1a7a6 a0a2a1a9a8
Cosine 41.3 5.8 1.7
Pearson 41.1 5.8 1.7
DFT 41.0 12.4 4.8
Table 1: Proportion of correctly discovered NEs
vs. corpus misalignment (a0 ) for each of the three
measures. DFT based measure provides signifi-
cant advantages over commonly used metrics for
weakly aligned corpora.
a32
a1a4a3
a32
a1a7a10
a32
a1a7a6
Cosine 5.8 13.5 18.4
Pearson 5.8 13.5 18.2
DFT 12.4 20.6 27.9
Table 2: Proportion of correctly discovered NEs
vs. sliding window size (a32 ) for each of the three
measures.
In order to understand what happens to the
transliteration model as the training proceeds, let
us consider the following example. Figure 5 shows
parts of transliteration lists for NE forsyth for two
iterations of the algorithm. The weak translitera-
tion model selects the correct transliteration (ital-
icized) as the 24th best transliteration in the first
iteration. Time sequence scoring function chooses
it to be one of the training examples for the next
round of training of the model. By the eighth iter-
ation, the model has improved to select it as a best
transliteration.
Not all correct transliterations make it to the top
of the candidates list (transliteration model by it-
self is never as accurate as the complete algorithm
on Figure 3). That is not required, however, as the
model only needs to be good enough to place the
correct transliteration anywhere in the candidate
list.
Not surprisingly, some of the top translitera-
tion candidates start sounding like the NE itself,
as training progresses. On Figure 5, candidates for
forsyth on iteration 7 include fross and fossett.
Once the transliteration model was trained, we
ran the algorithm to discover multi-word NEs,
augmenting candidate sets of dictionary words
with their translations as described in Section 3.1.
We achieved the accuracy of about 66%. The
correctly discovered Russian NEs included en-
tirely transliterated, partially translated, and en-
tirely translated NEs. Some of them are shown on
Figure 6.
4.2 Initial example set size
We ran a series of experiments to see how the size
of the initial training set affects the accuracy of the
model as training progresses (Figure 4). Although
the performance of the early iterations is signif-
icantly affected by the size of the initial training
example set, the algorithm quickly improves its
performance. As we decrease the size from 80 to
20, the accuracy of the first iteration drops by over
20%, but a few iterations later the two have sim-
ilar performance. However, when initialized with
the set of size 5, the algorithm never manages to
improve.
The intuition is the following. The few ex-
amples in the initial training set produce features
corresponding to substring pairs characteristic for
English-Russian transliterations. Model trained
on these (few) examples chooses other transliter-
ations containing these same substring pairs. In
turn, the chosen positive examples contain other
characteristic substring pairs, which will be used
by the model to select more positive examples on
the next round, and so on. On the other hand, if
the initial set is too small, too few of the character-
istic transliteration features are extracted to select
a clean enough training set on the next round of
training.
In general, one would expect the size of the
training set necessary for the algorithm to improve
to depend on the level of temporal alignment of
the two sides of the corpus. Indeed, the weaker the
temporal supervision the more we need to endow
the model so that it can select cleaner candidates
in the early iterations.
4.3 Comparison of time sequence scoring
functions
We compared the performance of the DFT-based
time sequence similarity scoring function we use
in this paper to the commonly used cosine (Salton
and McGill, 1986) and Pearson’s correlation mea-
sures.
We perturbed the Russian side of the corpus
in the following way. Articles from each day
were randomly moved (with uniform probabil-
ity) within a a0 -day window. We ran single word
NE temporal sequence matching alone on the per-
turbed corpora using each of the three measures
(Table 1).
Some accuracy drop due to misalignment could
be accommodated for by using a larger temporal
822
a0a2a1a4a3a6a5a8a7a9a1a4a10a12a11a14a13a16a15 a0a2a1a17a3a6a5a8a7a9a1a4a10a18a11a19a13a21a20
a22 a23a25a24a27a26a29a28a31a30a33a32a35a34a36a30a6a37a29a34a36a38a39a37a31a34a36a38a31a40a33a30a25a41a17a26a42a37a8a34a18a38a31a40a33a43a31a38a45a44 a22 a46a48a47a50a49a52a51a54a53a8a55a14a56a58a57a35a59a60a53a8a61a62a59a36a61a63a59a65a64a25a66
a67 a26a6a68a69a26a29a28a42a70a71a32a35a34a18a72a42a30a17a73a31a26a42a37a74a34a12a72a19a30a25a73a31a43a42a43a39a37a27a34a18a43a31a72a75a37a27a34a18a43a42a76a29a77a31a44 a67 a26a74a68a69a26a74a28a42a70a78a32a35a34a18a72a42a30a17a73a31a26a42a37a27a34a18a72a42a30a17a73a31a43a31a43a39a37a74a34a18a43a31a72a75a37a31a34a36a43a31a76a29a77a42a44
a79 a24a27a26a29a24a42a28a31a80a25a38a42a73a81a32a25a34a36a82a42a37a29a34a50a44 a79 a83a31a28a31a26a29a28a42a84a85a32a25a34a18a86a74a26a74a70a87a37a74a34a18a86a6a82a31a37a74a34a12a72a19a43a39a37a31a34a36a76a29a84a89a88a45a37a27a34a36a86a29a84a69a37a45a90a17a90a25a90a25a44
a91 a68a92a72a19a26a29a28a31a30a33a32a35a34a36a73a42a23a74a37a29a34a36a73a31a93a39a37a42a34a50a37a27a34a36a73a31a93a31a43a42a43a45a44 a91 a68a69a28a31a26a74a23a17a23
a94 a95
a68a69a26a74a23a17a23a35a30a17a76a33a32a35a34a36a76a8a37a74a34a18a76a8a82a42a37a27a34a18a76a74a96a74a37a74a34a36a82a42a37a27a34a18a96a14a44
a97 a98
a67 a91 a46a48a47a50a49a45a51a54a53a8a55a14a56a99a57a25a59a60a53a100a61a63a59a36a61a101a59a65a64a25a66 a102
a103 a104
Figure 5: Transliteration lists for forsyth for two iterations of the algorithm. As transliteration model
improves, the correct transliteration moves up the list.
bin for collecting occurrence counts. We tried var-
ious (sliding) window size a32 for a perturbed cor-
pus with a0a2a1a7a6 (Table 2).
DFT metric outperforms the other measures sig-
nificantly in most cases. NEs tend to have dis-
tributions with few pronounced peaks. If two
such distributions are not well aligned, we expect
both Pearson and Cosine measures to produce low
scores, whereas the DFT metric should catch their
similarities in the frequency domain.
5 Conclusions
We have proposed a novel algorithm for cross
lingual multi-word NE discovery in a bilingual
weakly temporally aligned corpus. We have
demonstrated that using two independent sources
of information (transliteration and temporal simi-
larity) together to guide NE extraction gives better
performance than using either of them alone (see
Figure 3).
We developed a linear discriminative transliter-
ation model, and presented a method to automati-
cally generate features. For time sequence match-
ing, we used a scoring metric novel in this domain.
We provided experimental evidence that this met-
ric outperforms other scoring metrics traditionally
used.
In keeping with our objective to provide as lit-
tle language knowledge as possible, we introduced
a simplistic approach to identifying transliteration
equivalence classes, which sometimes produced
erroneous groupings (e.g. an equivalence class
for NE congolese in Russian included both congo
and congolese on Figure 6). We expect that more
language specific knowledge used to discover ac-
curate equivalence classes would result in perfor-
mance improvements.
Other type of supervision was in the form of a
a105a107a106a109a108a111a110a92a112a113a105 a114a116a115a109a117a74a110a118a112a113a105a120a119a27a121a45a115a111a122a124a123a63a110a69a125a29a126a124a127a42a117a25a117
a128a35a129a29a130a132a131a133a129a135a134a31a136a35a131a39a137a31a138a74a139a9a140a132a136 a141a29a142a29a143a74a144a14a142a9a145a35a146a54a147a100a146a36a148a9a144a39a149a151a150a19a152a35a144a42a153a69a154a31a155a29a156a42a157a8a152
a158 a129a8a130a54a128a118a134a31a159a9a140a132a130a50a138a29a159a9a160 a161a63a142a8a143a42a141a116a150a14a162a163a157a8a143a31a164
a137a42a129a8a139a42a165a29a166a31a138a74a159a31a130a132a139a42a136 a154a42a142a29a156a31a167a35a168a74a164a6a143a42a156
a169a170a159a42a137a31a130a132a136 a158 a136a92a128a35a138a29a159a31a139a19a128a172a171a173a131 a174a29a152a25a143a6a175a74a155a31a145a25a146a18a174a74a156a31a176a87a148a45a149a107a177a35a155a29a174a74a152a35a157a9a145a6a178a179a178a173a178a180a149
a128a172a138a74a139a31a165a29a138a74a131a179a136a17a169a170a136 a141a27a155a29a156a31a167a25a155a31a145a25a146a50a147a6a146a18a144a42a152a35a181a17a177a172a141a27a155a29a148a45a149
a139a31a138a29a130a132a140a132a182a113a128a25a129a8a130a50a138a29a131a173a171a179a139a42a129 a177a172a152a25a174a29a152a25a143a45a145a74a178a173a178a179a178a179a149a48a141a29a142a29a143a31a155a4a144a39a145a35a146a36a183a31a156a42a142a42a147a179a178a173a178a173a178a180a149
a184
a159a31a139a31a171a133a128a172a182a31a171a173a130a50a138a118a185a27a138a74a171a179a186a25a159 a158 a171 a150a19a181a25a162a163a156a31a183a31a157a29a183a31a143a31a155a187a141a27a155a29a183a27a150a14a181a35a164a9a161a62a183
a130a132a136a25a182 a158 a129a8a139 a143a31a152a17a161a63a142a8a156a45a145a25a146a50a147a27a146a65a142a9a149
Figure 6: Example of correct transliterations dis-
covered by the algorithm.
very small bootstrapping transliteration set.
6 Future Work
The algorithm can be naturally extended to com-
parable corpora of more than two languages.
Pair-wise time sequence scoring and translitera-
tion models should give better confidence in NE
matches.
The ultimate goal of this work is to automati-
cally tag NEs so that they can be used for training
of an NER system for a new language. To this end,
we would like to compare the performance of an
NER system trained on a corpus tagged using this
approach to one trained on a hand-tagged corpus.
7 Acknowledgments
We thank Richard Sproat, ChengXiang Zhai, and
Kevin Small for their useful feedback during this
work, and the anonymous referees for their help-
ful comments. This research is supported by
the Advanced Research and Development Activity
(ARDA)’s Advanced Question Answering for In-
telligence (AQUAINT) Program and a DOI grant
under the Reflex program.
823
References
Nasreen AbdulJaleel and Leah S. Larkey. 2003. Statis-
tical transliteration for english-arabic cross language
information retrieval. In Proceedings of CIKM,
pages 139–146, New York, NY, USA.
George Arfken. 1985. Mathematical Methods for
Physicists. Academic Press.
Avrim Blum. 1992. Learning boolean functions
in an infinite attribute space. Machine Learning,
9(4):373–386.
Michael Collins and Yoram Singer. 1999. Unsuper-
vised models for named entity classification. In
Proc. of the Conference on Empirical Methods for
Natural Language Processing (EMNLP).
Silviu Cucerzan and David Yarowsky. 1999. Lan-
guage independent named entity recognition com-
bining morphological and contextual evidence. In
Proc. of the Conference on Empirical Methods for
Natural Language Processing (EMNLP).
Magnus Lie Hetland, 2004. Data Mining in Time Se-
ries Databases, chapter A Survey of Recent Meth-
ods for Efficient Retrieval of Similar Time Se-
quences. World Scientific.
Sung Young Jung, SungLim Hong, and Eunok Paek.
2000. An english to korean transliteration model
of extended markov window. In Proc. the Inter-
national Conference on Computational Linguistics
(COLING), pages 383–389.
Alexandre Klementiev and Dan Roth. 2006. Named
entity transliteration and discovery from multilin-
gual comparable corpora. In Proc. of the Annual
Meeting of the North American Association of Com-
putational Linguistics (NAACL).
Kevin Knight and Jonathan Graehl. 1997. Machine
transliteration. In Proc. of the Meeting of the Eu-
ropean Association of Computational Linguistics,
pages 128–135.
Xin Li, Paul Morie, and Dan Roth. 2004. Identifi-
cation and tracing of ambiguous names: Discrimi-
native and generative approaches. In Proceedings
of the National Conference on Artificial Intelligence
(AAAI), pages 419–424.
Robert C. Moore. 2005. A discriminative framework
for bilingual word alignment. In Proc. of the Con-
ference on Empirical Methods for Natural Language
Processing (EMNLP), pages 81–88.
Frank Rosenblatt. 1958. The perceptron: A probabilis-
tic model for information storage and organization in
the brain. Psychological Review, 65.
Dan Roth. 1998. Learning to resolve natural language
ambiguities: A unified approach. In Proceedings
of the National Conference on Artificial Intelligence
(AAAI), pages 806–813.
Dan Roth. 1999. Learning in natural language. In
Proc. of the International Joint Conference on Arti-
ficial Intelligence (IJCAI), pages 898–904.
Gerard Salton and Michael J. McGill. 1986. Intro-
duction to Modern Information Retrieval. McGraw-
Hill, Inc., New York, NY, USA.
Yusuke Shinyama and Satoshi Sekine. 2004. Named
entity discovery using comparable news articles. In
Proc. the International Conference on Computa-
tional Linguistics (COLING), pages 848–853.
Ben Taskar, Simon Lacoste-Julien, and Michael Jor-
dan. 2005. Structured prediction via the extra-
gradient method. In The Conference on Advances
in Neural Information Processing Systems (NIPS).
MIT Press.
824
