Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 82–88,
New York, June 2006. c©2006 Association for Computational Linguistics
Named Entity Transliteration and Discovery from Multilingual Comparable
Corpora
Alexandre Klementiev Dan Roth
Department of Computer Science
University of Illinois
Urbana, IL 61801
a0 klementi,danr
a1 @uiuc.edu
Abstract
Named Entity recognition (NER) is an im-
portant part of many natural language pro-
cessing tasks. Most current approaches
employ machine learning techniques and
require supervised data. However, many
languages lack such resources. This paper
presents an algorithm to automatically dis-
cover Named Entities (NEs) in a resource
free language, given a bilingual corpora
in which it is weakly temporally aligned
with a resource rich language. We ob-
serve that NEs have similar time distribu-
tions across such corpora, and that they
are often transliterated, and develop an al-
gorithm that exploits both iteratively. The
algorithm makes use of a new, frequency
based, metric for time distributions and a
resource free discriminative approach to
transliteration. We evaluate the algorithm
on an English-Russian corpus, and show
high level of NEs discovery in Russian.
1 Introduction
Named Entity recognition has been getting much
attention in NLP research in recent years, since it
is seen as a significant component of higher level
NLP tasks such as information distillation and ques-
tion answering, and an enabling technology for bet-
ter information access. Most successful approaches
to NER employ machine learning techniques, which
require supervised training data. However, for many
languages, these resources do not exist. Moreover,
it is often difficult to find experts in these languages
both for the expensive annotation effort and even for
language specific clues. On the other hand, compa-
rable multilingual data (such as multilingual news
streams) are increasingly available (see section 4).
In this work, we make two independent observa-
tions about Named Entities encountered in such cor-
pora, and use them to develop an algorithm that ex-
tracts pairs of NEs across languages. Specifically,
given a bilingual corpora that is weakly temporally
aligned, and a capability to annotate the text in one
of the languages with NEs, our algorithm identifies
the corresponding NEs in the second language text,
and annotates them with the appropriate type, as in
the source text.
The first observation is that NEs in one language
in such corpora tend to co-occur with their coun-
terparts in the other. E.g., Figure 1 shows a his-
togram of the number of occurrences of the word
Hussein and its Russian transliteration in our bilin-
gual news corpus spanning years 2001 through late
2005. One can see several common peaks in the two
histograms, largest one being around the time of the
beginning of the war in Iraq. The word Russia, on
the other hand, has a distinctly different temporal
signature. We can exploit such weak synchronicity
of NEs across languages as a way to associate them.
In order to score a pair of entities across languages,
we compute the similarity of their time distributions.
The second observation is that NEs are often
transliterated or have a common etymological origin
across languages, and thus are phonetically similar.
Figure 2 shows an example list of NEs and their pos-
82
 0
 5
 10
 15
 20
’hussein’ (English)
 0
 5
 10
 15
 20
’hussein’ (Russian)
 0
 5
 10
 15
 20
01/01/01 10/05/05
Number of Occurences
Time
’russia’ (English)
Figure 1: Temporal histograms for Hussein (top),
its Russian transliteration (middle), and of the word
Russia (bottom).
sible Russian transliterations.
Approaches that attempt to use these two charac-
teristics separately to identify NEs across languages
would have significant shortcomings. Translitera-
tion based approaches require a good model, typi-
cally handcrafted or trained on a clean set of translit-
eration pairs. On the other hand, time sequence sim-
ilarity based approaches would incorrectly match
words which happen to have similar time signatures
(e.g. Taliban and Afghanistan in recent news).
We introduce an algorithm we call co-ranking
which exploits these observations simultaneously to
match NEs on one side of the bilingual corpus to
their counterparts on the other. We use a Discrete
Fourier Transform (Arfken, 1985) based metric for
computing similarity of time distributions, and we
score NEs similarity with a linear transliteration
model. For a given NE in one language, the translit-
eration model chooses a top ranked list of candidates
in another language. Time sequence scoring is then
used to re-rank the candidates and choose the one
best temporally aligned with the NE. That is, we at-
tempt to choose a candidate which is both a good
transliteration (according to the current model) and
is well aligned with the NE. Finally, pairs of NEs
a0a2a1a4a3a6a5a8a7a8a9a11a10a13a12a14a0 a15a17a16a4a9a11a9a18a7a20a19a21a1a22a12a14a0
a23a25a24a26a23a26a24a26a27 a28a30a29a31a28a32a29a21a33
a34a30a35a37a36 a27a39a38 a35a41a40 a42 a28a30a43a18a44a45a33a21a43a41a46
a47 a40a39a48a50a49a52a51a54a53a55a40a56a49 a57 a46a30a58a56a59 a42a61a60 a46a55a59
a24a26a62 a48 a47 a35 a23 a29a21a63a11a64 a57 a43a41a28a30a65
a38 a53a45a66a68a67a69a48a45a70a21a70 a71a41a60a72a42a74a73 a64a45a75
a76 a48a77a36 a38 a67a78a48a50a70a30a49a31a79 a80 a64a45a44 a73 a64a50a75a52a59a32a81
Figure 2: Example English NEs and their transliter-
ated Russian counterparts.
and the best candidates are used to iteratively train
the transliteration model.
A major challenge inherent in discovering
transliterated NEs is the fact that a single entity may
be represented by multiple transliteration strings.
One reason is language morphology. For example,
in Russian, depending on a case being used, the
same noun may appear with various endings. An-
other reason is the lack of transliteration standards.
Again, in Russian, several possible transliterations
of an English entity may be acceptable, as long as
they are phonetically similar to the source.
Thus, in order to rely on the time sequences we
obtain, we need to be able to group variants of
the same NE into an equivalence class, and col-
lect their aggregate mention counts. We would then
score time sequences of these equivalence classes.
For instance, we would like to count the aggregate
number of occurrences of a0 Herzegovina, Hercegov-
ina a1 on the English side in order to map it accu-
rately to the equivalence class of that NE’s vari-
ants we may see on the Russian side of our cor-
pus (e.g. a0a72a82a84a83a56a85a87a86a88a83a77a89a41a90a21a91a6a92a68a93a95a94a31a96a97a82a84a83a56a85a87a86a88a83a77a89a41a90a21a91a6a92a68a93a99a98a41a96a14a82a100a83a101a85a87a86a88a83a77a102
a89a41a90a21a91a6a92a68a93a84a103a87a96a100a82a84a83a56a85a87a86a88a83a77a89a41a90a21a91a6a92a68a93a95a90a100a104a92
a1 ).
One of the objectives for this work was to use as
little of the knowledge of both languages as possible.
In order to effectively rely on the quality of time se-
quence scoring, we used a simple, knowledge poor
approach to group NE variants for Russian.
In the rest of the paper, whenever we refer to a
Named Entity, we imply an NE equivalence class.
Note that although we expect that better use of lan-
guage specific knowledge would improve the re-
sults, it would defeat one of the goals of this work.
83
2 Previous Work
There has been other work to automatically discover
NE with minimal supervision. Both (Cucerzan and
Yarowsky, 1999) and (Collins and Singer, 1999)
present algorithms to obtain NEs from untagged cor-
pora. However, they focus on the classification stage
of already segmented entities, and make use of con-
textual and morphological clues that require knowl-
edge of the language beyond the level we want to
assume with respect to the target language.
The use of similarity of time distributions for in-
formation extraction, in general, and NE extraction,
in particular, is not new. (Hetland, 2004) surveys
recent methods for scoring time sequences for sim-
ilarity. (Shinyama and Sekine, 2004) used the idea
to discover NEs, but in a single language, English,
across two news sources.
A large amount of previous work exists on
transliteration models. Most are generative and con-
sider the task of producing an appropriate translit-
eration for a given word, and thus require consid-
erable knowledge of the languages. For example,
(AbdulJaleel and Larkey, 2003; Jung et al., 2000)
train English-Arabic and English-Korean generative
transliteration models, respectively. (Knight and
Graehl, 1997) build a generative model for back-
ward transliteration from Japanese to English.
While generative models are often robust, they
tend to make independence assumptions that do not
hold in data. The discriminative learning framework
argued for in (Roth, 1998; Roth, 1999) as an alter-
native to generative models is now used widely in
NLP, even in the context of word alignment (Taskar
et al., 2005; Moore, 2005). We make use of it here
too, to learn a discriminative transliteration model
that requires little knowledge of the target language.
3 Co-ranking: An Algorithm for NE
Discovery
In essence, the algorithm we present uses temporal
alignment as a supervision signal to iteratively train
a discriminative transliteration model, which can be
viewed as a distance metric between and English NE
and a potential transliteration. On each iteration, it
selects a set of transliteration candidates for each NE
according to the current model (line 6). It then uses
temporal alignment (with thresholding) to select the
best transliteration candidate for the next round of
training (lines 8, and 9).
Once the training is complete, lines 4 through 10
are executed without thresholding for each NE ina0
to discover its counterpart ina1 .
3.1 Time Sequence Generation and Matching
In order to generate time sequence for a word, we
divide the corpus into a sequence of temporal bins,
and count the number of occurrences of the word in
each bin. We then normalize the sequence.
We use a method called the F-index (Hetland,
2004) to implement the a2a4a3a6a5a8a7a10a9 similarity function
on line 8 of the algorithm. We first run a Discrete
Fourier Transform on a time sequence to extract its
Fourier expansion coefficients. The score of a pair of
time sequences is then computed as a Euclidian dis-
tance between their expansion coefficient vectors.
.
Input: Bilingual, comparable corpus (a11 ,a12 ), set of
named entitiesa13a15a14a17a16 froma11 , thresholda18
Output: Transliteration modela19
Initializea19 ;1 a20
a14a22a21a23a13a24a14a16 , collect time distributiona25a27a26a16 ;2
repeat3 a28a30a29a32a31
;4
for eacha14a16 a21a33a13a24a14a16 do5
Usea19 to collect a set of candidatesa13a24a14a35a34a36a21a37a126
with high transliteration scores;a20
a14a22a21a23a13a24a14a17a34 collect time distributiona25a27a26a38a34 ;7
Select candidatea14a34 a21a23a13a24a14a34 with the best8 a39a41a40a43a42a45a44a47a46a49a48a6a50a52a51
a25a53a26a16a55a54a25a53a26a38a34a57a56;
if
a39
exceedsa18 , add tuple
a51
a14a17a16 a54a14a34 a56 to
a28
;9
end10
Use
a28
to traina19 ;11
until D stops changing between iterations ;12
Algorithm 1: Co-ranking: an algorithm for it-
erative cross lingual NE discovery.
3.1.1 Equivalence Classes
As we mentioned earlier, an NE in one language
may map to multiple morphological variants and
transliterations in another. Identification of the en-
tity’s equivalence class of transliterations is impor-
tant for obtaining its accurate time sequence.
In order to keep to our objective of requiring as lit-
tle language knowledge as possible, we took a rather
simplistic approach to take into account morpholog-
84
ical ambiguities of NEs in Russian. Two words were
considered variants of the same NE if they share a
prefix of size five or longer. At this point, our al-
gorithm takes a simplistic approach also for the En-
glish side of the corpus – each unique word had its
own equivalence class although, in principle, we can
incorporate works such as (Li et al., 2004) into the
algorithm. A cumulative distribution was then col-
lected for such equivalence classes.
3.2 Transliteration Model
Unlike most of the previous work to transliteration,
that consider generative transliteration models, we
take a discriminative approach. We train a linear
model to decide whether a word a0a2a1a4a3a43a1 is a translit-
eration of an NE a0a6a5a7a3 a0 . The words in the pair
are partitioned into a set of substrings a2a8a5 and a2a9a1
up to a particular length (including the empty string
). Couplings of the substrings a10a2a11a5a13a12a49a2a14a1a16a15 from both
sets produce features we use for training. Note
that couplings with the empty string represent inser-
tions/omissions.
Consider the following example: (a0a17a5 , a0a6a1 ) =
(powell, pauel). We build a feature vector from this
example in the following manner:
a18 First, we split both words into all possible sub-
strings of up to size two:
a0a19a5a21a20
a0
a12a23a22a13a12 a5a19a12a25a24a26a12a49a9a27a12a29a28a30a12a29a28a30a12a23a22 a5a19a12 a5a9a24a26a12a25a24a37a9a31a12a49a9a32a28a30a12a33a28a23a28 a1
a0a6a1a34a20
a0
a12a23a22a13a12a29a35a36a12a25a37a13a12a49a9a31a12a29a28a38a12a23a22a39a35a2a12a29a35a40a37a13a12a25a37 a9a27a12a49a9a32a28 a1
a18 We build a feature vector by coupling sub-
strings from the two sets:
a10a25a10a41a22a42a12 a15a43a12a14a10a41a22a13a12a29a35a44a15a43a12a46a45a47a45a47a45a48a10a49a24a50a12a29a35a40a37a51a15a43a12a46a45a47a45a47a45a48a10a9a9a28a30a12a49a9a32a28a52a15a43a12a46a45a47a45a53a45a48a10a23a28a23a28a30a12a49a9a9a28a54a15a25a15
We use the observation that transliteration tends
to preserve phonetic sequence to limit the number
of couplings. For example, we can disallow the
coupling of substrings whose starting positions are
too far apart: thus, we might not consider a pair-
ing a10a41a22 a5a19a12a25a37a57a9a9a15 in the above example. In our experi-
ments, we paired substrings if their positions in their
respective words differed by -1, 0, or 1.
We use the perceptron (Rosenblatt, 1958) algo-
rithm to train the model. The model activation pro-
vides the score we use to select best transliterations
on line 6. Our version of perceptron takes exam-
ples with a variable number of features; each ex-
ample is a set of all features seen so far that are
active in the input. As the iterative algorithm ob-
serves more data, it discovers and makes use of more
features. This model is called the infinite attribute
model (Blum, 1992) and it follows the perceptron
version in SNoW (Roth, 1998).
Positive examples used for iterative training are
pairs of NEs and their best temporally aligned
(thresholded) transliteration candidates. Negative
examples are English non-NEs paired with random
Russian words.
4 Experimental Study
We ran experiments using a bilingual comparable
English-Russian news corpus we built by crawling
a Russian news web site (www.lenta.ru).
The site provides loose translations of (and
pointers to) the original English texts. We col-
lected pairs of articles spanning from 1/1/2001
through 12/24/2004. The corpus consists of
2,022 documents with 0-8 documents per day.
The corpus is available on our web page at
http://L2R.cs.uiuc.edu/a55 cogcomp/.
The English side was tagged with a publicly
available NER system based on the SNoW learning
architecture (Roth, 1998), that is available at the
same site. This set of English NEs was hand-pruned
to remove incorrectly classified words to obtain 978
single word NEs.
In order to reduce running time, some limited
preprocessing was done on the Russian side. All
classes, whose temporal distributions were close
to uniform (i.e. words with a similar likelihood
of occurrence throughout the corpus) were deemed
common and not considered as NE candidates.
Unique words were grouped into 15,594 equivalence
classes, and 1,605 of those classes were discarded
using this method.
Insertions/omissions features were not used in the
experiments as they provided no tangible benefit for
the languages of our corpus.
Unless mentioned otherwise, the transliteration
model was initialized with a subset of 254 pairs
of NEs and their transliteration equivalence classes.
Negative examples here and during the rest of the
training were pairs of randomly selected non-NE
English and Russian words.
In each iteration, we used the current transliter-
85
 20
 30
 40
 50
 60
 70
 80
 0  1  2  3  4  5  6  7  8  9
Accuracy (%)
Iteration
Complete Algorithm
Transliteration Model Only
Sequence Only
Figure 3: Proportion of correctly discovered NE
pairs vs. iteration. Complete algorithm outperforms
both transliteration model and temporal sequence
matching when used on their own.
ation model to find a list of 30 best transliteration
equivalence classes for each NE. We then computed
time sequence similarity score between NE and each
class from its list to find the one with the best match-
ing time sequence. If its similarity score surpassed
a set threshold, it was added to the list of positive
examples for the next round of training. Positive ex-
amples were constructed by pairing each English NE
with each of the transliterations from the best equiv-
alence class that surpasses the threshold. We used
the same number of positive and negative examples.
For evaluation, random 727 of the total of 978 NE
pairs matched by the algorithm were selected and
checked by a language expert. Accuracy was com-
puted as the percentage of those NEs correctly dis-
covered by the algorithm.
4.1 NE Discovery
Figure 3 shows the proportion of correctly discov-
ered NE transliteration equivalence classes through-
out the run of the algorithm. The figure also shows
the accuracy if transliterations are selected accord-
ing to the current transliteration model (top scor-
ing candidate) and sequence matching alone. The
transliteration model alone achieves an accuracy of
about 47%, while the time sequence alone gets about
 30
 40
 50
 60
 70
 0  1  2  3  4  5  6  7  8  9
Accuracy (%)
Iteration
254 examples
127 examples
85 examples
Figure 5: Proportion of the correctly discovered NE
pairs for various initial example set sizes. Decreas-
ing the size does not have a significant effect of the
performance on later iterations.
41%. The combined algorithm achieves about 66%,
giving a significant improvement.
In order to understand what happens to the
transliteration model as the algorithm proceeds, let
us consider the following example. Figure 4 shows
parts of transliteration lists for NE forsyth for two
iterations of the algorithm. The weak transliteration
model selects the correct transliteration (italicized)
as the 24th best transliteration in the first iteration.
Time sequence scoring function chooses it to be one
of the training examples for the next round of train-
ing of the model. By the eighth iteration, the model
has improved to select it as a best transliteration.
Not all correct transliterations make it to the top of
the candidates list (transliteration model by itself is
never as accurate as the complete algorithm on Fig-
ure 3). That is not required, however, as the model
only needs to be good enough to place the correct
transliteration anywhere in the candidate list.
Not surprisingly, some of the top transliteration
candidates start sounding like the NE itself, as train-
ing progresses. On Figure 4, candidates for forsyth
on iteration 7 include fross and fossett.
86
a0a2a1a4a3a6a5a8a7a9a1a4a10a12a11a14a13a16a15 a0a2a1a17a3a6a5a8a7a9a1a4a10a18a11a19a13a21a20
a22 a23a25a24a27a26a29a28a31a30a33a32a35a34a36a30a6a37a29a34a36a38a39a37a31a34a36a38a31a40a33a30a25a41a17a26a42a37a8a34a18a38a31a40a33a43a31a38a45a44 a22 a46a48a47a50a49a52a51a54a53a8a55a14a56a58a57a35a59a60a53a8a61a62a59a36a61a63a59a65a64a25a66
a67 a26a6a68a69a26a29a28a42a70a71a32a35a34a18a72a42a30a17a73a31a26a42a37a74a34a12a72a19a30a25a73a31a43a42a43a39a37a27a34a18a43a31a72a75a37a27a34a18a43a42a76a29a77a31a44 a67 a26a74a68a69a26a74a28a42a70a78a32a35a34a18a72a42a30a17a73a31a26a42a37a27a34a18a72a42a30a17a73a31a43a31a43a39a37a74a34a18a43a31a72a75a37a31a34a36a43a31a76a29a77a42a44
a79 a24a27a26a29a24a42a28a31a80a25a38a42a73a81a32a25a34a36a82a42a37a29a34a50a44 a79 a83a31a28a31a26a29a28a42a84a85a32a25a34a18a86a74a26a74a70a87a37a74a34a18a86a6a82a31a37a74a34a12a72a19a43a39a37a31a34a36a76a29a84a89a88a45a37a27a34a36a86a29a84a69a37a45a90a17a90a25a90a25a44
a91 a68a92a72a19a26a29a28a31a30a33a32a35a34a36a73a42a23a74a37a29a34a36a73a31a93a39a37a42a34a50a37a27a34a36a73a31a93a31a43a42a43a45a44 a91 a68a69a28a31a26a74a23a17a23
a94 a95
a68a69a26a74a23a17a23a35a30a17a76a33a32a35a34a36a76a8a37a74a34a18a76a8a82a42a37a27a34a18a76a74a96a74a37a74a34a36a82a42a37a27a34a18a96a14a44
a97 a98
a67 a91 a46a48a47a50a49a45a51a54a53a8a55a14a56a99a57a25a59a60a53a100a61a63a59a36a61a101a59a65a64a25a66 a102
a103 a104
Figure 4: Transliteration lists for forsyth for two iterations of the algorithm ranked by the current transliter-
ation model. As the model improves, the correct transliteration moves up the list.
4.2 Rate of Improvement vs. Initial Example
Set Size
We ran a series of experiments to see how the size
of the initial training set affects the accuracy of the
model as training progresses (Figure 5). Although
the performance of the early iterations is signifi-
cantly affected by the size of the initial training ex-
ample set, the algorithm quickly improves its perfor-
mance. As we decrease the size from 254, to 127, to
85 examples, the accuracy of the first iteration drops
by roughly 10% each time. However, starting at the
6th iteration, the three are with 3% of one another.
These numbers suggest that we only need a few
initial positive examples to bootstrap the translitera-
tion model. The intuition is the following: the few
examples in the initial training set produce features
corresponding to substring pairs characteristic for
English-Russian transliterations. Model trained on
these (few) examples chooses other transliterations
containing these same substring pairs. In turn, the
chosen positive examples contain other characteris-
tic substring pairs, which will be used by the model
to select more positive examples on the next round,
and so on.
5 Conclusions
We have proposed a novel algorithm for cross lin-
gual NE discovery in a bilingual weakly temporally
aligned corpus. We have demonstrated that using
two independent sources of information (transliter-
ation and temporal similarity) together to guide NE
extraction gives better performance than using either
of them alone (see Figure 3).
We developed a linear discriminative translitera-
a105a107a106a109a108a14a110a112a111a112a113a17a114a116a115a117a105 a118a120a119a121a113a17a113a25a111a12a122a31a106a123a115a117a105a125a124a6a126a45a119a109a111a128a127a63a129a69a130a74a110a112a122a31a113a17a113
a131a133a132a135a134a42a136a35a137a29a131a135a134 a138a42a139a42a140a31a141a27a142a4a138a42a143a29a140a81a144a35a145a65a146a31a147a27a145a54a147a27a145a18a148a149a139a31a150a45a151
a137a29a152a50a153a35a154a74a137a29a134 a142a74a150a31a155a35a156a25a142a29a140a81a144a35a145a65a157a158a141a27a142a29a156a25a142a42a147a6a145a18a155a29a151
a134a31a132a135a159a35a134a42a160a29a134a42a161a163a162a27a164 a140a31a139a42a165a35a140a42a146a29a140a42a157a158a141a31a139
a166
a152
a166
a154
a166
a160a29a164 a167a27a150a31a167a27a156a35a168a6a146a33a144a35a145a36a169a42a157a158a141a27a142a74a155a29a147a100a145a36a169a31a170a31a171a69a147a74a145a12a172a39a147a29a145a18a169a39a151
a152a173a137a6a161a163a132a135a134a31a154 a150a31a142a74a165a35a139a31a140a42a156a35a142a74a174
a154a29a152
a166a31a175
a160a29a134 a156a25a150a31a167a27a176a74a146a8a140a81a144a35a145a36a142a74a174a89a147a27a145a50a151
a177
a153a35a132a135a178a87a153a25a161 a174a63a146a29a169a31a168a29a155a17a157a179a144a25a145a36a146a42a147a27a145a18a167a14a151
a152a173a137a6a161a163a153a35a180a163a180a54a160 a150a31a142a74a165a35a155a25a181a8a181a29a157a158a141a27a142a74a156a35a142
a153a25a136
a166
a160a29a182a9a137a74a152 a183a25a141a31a168a74a146a100a184a19a142a29a150a81a144a35a145a65a146a31a147a6a145a50a151
a131a135a160a8a185
a177
a160a29a134 a138a19a146a29a141a31a148a149a174a63a146a29a140
a186 a152a173a132a135a153a25a182a31a152a173a132a187a136a158a188 a189a92a150a31a139a9a184a19a150a31a139a9a190a81a144a35a145a54a147a27a145a36a146a31a151
a136a158a188a42a160a74a182 a191a42a146a100a184a14a146
Figure 6: Example of correct transliterations discov-
ered by the algorithm.
tion model, and presented a method to automatically
generate features. For time sequence matching, we
used a scoring metric novel in this domain. As sup-
ported by our own experiments, this method outper-
forms other scoring metrics traditionally used (such
as cosine (Salton and McGill, 1986)) when corpora
are not well temporally aligned.
In keeping with our objective to provide as lit-
tle language knowledge as possible, we introduced
a simplistic approach to identifying transliteration
equivalence classes, which sometimes produced er-
roneous groupings (e.g. an equivalence class for
NE lincoln in Russian included both lincoln and lin-
colnshire on Figure 6). This approach is specific
to Russian morphology, and would have to be al-
tered for other languages. For example, for Arabic,
a small set of prefixes can be used to group most NE
variants. We expect that language specific knowl-
87
edge used to discover accurate equivalence classes
would result in performance improvements.
6 Future Work
In this work, we only consider single word Named
Entities. A subject of future work is to extend the
algorithm to the multi-word setting. Many of the
multi-word NEs are translated as well as transliter-
ated. For example, Mount in Mount Rainier will
probably be translated, and Rainier - transliterated.
If a dictionary exists for the two languages, it can be
consulted first, and, if a match is found, translitera-
tion model can be bypassed.
The algorithm can be naturally extended to com-
parable corpora of more than two languages. Pair-
wise time sequence scoring and transliteration mod-
els should give better confidence in NE matches.
It seems plausible to suppose that phonetic fea-
tures (if available) would help learning our translit-
eration model. We would like to verify if this is in-
deed the case.
The ultimate goal of this work is to automatically
tag NEs so that they can be used for training of an
NER system for a new language. To this end, we
would like to compare the performance of an NER
system trained on a corpus tagged using this ap-
proach to one trained on a hand-tagged corpus.
7 Acknowledgments
We thank Richard Sproat, ChengXiang Zhai, and
Kevin Small for their useful feedback during this
work, and the anonymous referees for their help-
ful comments. This research is supported by
the Advanced Research and Development Activity
(ARDA)’s Advanced Question Answering for Intel-
ligence (AQUAINT) Program and a DOI grant under
the Reflex program.
References
Nasreen AbdulJaleel and Leah S. Larkey. 2003. Statistical
transliteration for english-arabic cross language information
retrieval. In Proceedings of CIKM, pages 139–146, New
York, NY, USA.
George Arfken. 1985. Mathematical Methods for Physicists.
Academic Press.
Avrim Blum. 1992. Learning boolean functions in an infinite
attribute space. Machine Learning, 9(4):373–386.
Michael Collins and Yoram Singer. 1999. Unsupervised mod-
els for named entity classification. In Proc. of the Confer-
ence on Empirical Methods for Natural Language Process-
ing (EMNLP).
Silviu Cucerzan and David Yarowsky. 1999. Language in-
dependent named entity recognition combining morpholog-
ical and contextual evidence. In Proc. of the Conference
on Empirical Methods for Natural Language Processing
(EMNLP).
Magnus Lie Hetland, 2004. Data Mining in Time Series
Databases, chapter A Survey of Recent Methods for Effi-
cient Retrieval of Similar Time Sequences. World Scientific.
Sung Young Jung, SungLim Hong, and Eunok Paek. 2000. An
english to korean transliteration model of extended markov
window. In Proc. the International Conference on Compu-
tational Linguistics (COLING), pages 383–389.
Kevin Knight and Jonathan Graehl. 1997. Machine translitera-
tion. In Proc. of the Meeting of the European Association of
Computational Linguistics, pages 128–135.
Xin Li, Paul Morie, and Dan Roth. 2004. Identification and
tracing of ambiguous names: Discriminative and generative
approaches. In Proceedings of the National Conference on
Artificial Intelligence (AAAI), pages 419–424.
Robert C. Moore. 2005. A discriminative framework for bilin-
gual word alignment. In Proc. of the Conference on Empir-
ical Methods for Natural Language Processing (EMNLP),
pages 81–88.
Frank Rosenblatt. 1958. The perceptron: A probabilistic model
for information storage and organization in the brain. Psy-
chological Review, 65.
Dan Roth. 1998. Learning to resolve natural language am-
biguities: A unified approach. In Proceedings of the Na-
tional Conference on Artificial Intelligence (AAAI), pages
806–813.
Dan Roth. 1999. Learning in natural language. In Proc. of
the International Joint Conference on Artificial Intelligence
(IJCAI), pages 898–904.
Gerard Salton and Michael J. McGill. 1986. Introduction to
Modern Information Retrieval. McGraw-Hill, Inc., New
York, NY, USA.
Yusuke Shinyama and Satoshi Sekine. 2004. Named entity dis-
covery using comparable news articles. In Proc. the Interna-
tional Conference on Computational Linguistics (COLING),
pages 848–853.
Ben Taskar, Simon Lacoste-Julien, and Michael Jordan. 2005.
Structured prediction via the extragradient method. In The
Conference on Advances in Neural Information Processing
Systems (NIPS). MIT Press.
88
