Inducing Multilingual Text Analysis Tools
via Robust Projection across Aligned Corpora
David Yarowsky
Dept. of Computer Science
Johns Hopkins University
Baltimore, MD 21218 USA
yarowsky@cs.jhu.edu
Grace Ngai
Dept. of Computer Science
Johns Hopkins University
Baltimore, MD 21218 USA
gyn@cs.jhu.edu
Richard Wicentowski
Dept. of Computer Science
Johns Hopkins University
Baltimore, MD 21218 USA
richardw@cs.jhu.edu
ABSTRACT
This paper describes a system and set of algorithms for automati-
cally inducing stand-alone monolingual part-of-speech taggers, base
noun-phrase bracketers, named-entity taggers and morphological
analyzers for an arbitrary foreign language. Case studies include
French, Chinese, Czech and Spanish.
Existing text analysis tools for English are applied to bilingual
text corpora and their output projected onto the second language
via statistically derived word alignments. Simple direct annotation
projection is quite noisy, however, even with optimal alignments.
Thus this paper presents noise-robust tagger, bracketer and lemma-
tizer training procedures capable of accurate system bootstrapping
from noisy and incomplete initial projections.
Performance of the induced stand-alone part-of-speech tagger
applied to French achieves 96% core part-of-speech (POS) tag ac-
curacy, and the corresponding induced noun-phrase bracketer ex-
ceeds 91% F-measure. The induced morphological analyzer achieves
over 99% lemmatization accuracy on the complete French verbal
system.
This achievement is particularly noteworthy in that it required
absolutely no hand-annotated training data in the given language,
and virtually no language-specific knowledge or resources beyond
raw text. Performance also significantly exceeds that obtained by
direct annotation projection.
Keywords
multilingual, text analysis, part-of-speech tagging, noun phrase brac-
keting, named entity, morphology, lemmatization, parallel corpora
1. TASK OVERVIEW
A fundamental roadblock to developing statistical taggers, brack-
eters and other analyzers for many of the world’s 200+ major lan-
guages is the shortage or absence of annotated training data for the
large majority of these languages. Ideally, one would like to lever-
Figure 1: Projecting part-of-speech tags, named-entity tags and
noun-phrase structure from English to Chinese and French.
Figure 2: French morphological analysis via English
age the large existing investments in annotated data and tools for
resource-rich languages (such as English and Japanese) to over-
come the annotated resource shortage in other languages.
To show the broad potential of our approach and methods, this
paper will investigate four fundamental language analysis tasks:
POS tagging, base noun phrase (baseNP) bracketing, named en-
tity tagging, and inflectional morphological analysis, as illustrated
in Figures 1 and 2. These bedrock tools are important components
of the language analysis pipelines for many applications, and their
low cost extension to new languages, as described here, can serve
as a broadly useful enabling resource.
2. BACKGROUND
Previous research on the word alignment of parallel corpora has
tended to focus on their use in translation model training for MT
rather than on monolingual applications. One exception is bilin-
gual parsing. Wu (1995, 1997) investigated the use of concurrent
parsing of parallel corpora in a transduction inversion framework,
helping to resolve attachment ambiguities in one language by the
coupled parsing state in the second language. Jones and Havrilla
(1998) utilized similar joint parsing techniques (twisted-pair gram-
mars) for word reordering in target language generation.
However, with these exceptions in the field of parsing, to our
knowledge no one has previously used linguistic annotation pro-
jection via aligned bilingual corpora to induce traditional stand-
alone monolingual text analyzers in other languages. Thus both
our proposed projection and induction methods and their applica-
tion to multilingual POS tagging, named-entity classification and
morphological analysis induction appear to be highly novel.
3. DATA RESOURCES
The data sets used in these experiments included the English-
French Canadian Hansards, the English-Chinese Hong Kong Han-
sards, and a parallel Czech-English Reader's Digest collection. In
addition, multiple versions of the Bible were used, including the
French Douay-Rheims Bible, Spanish Reina Valera Bible, and three
English Bible Versions (King James, New International and Re-
vised Standard), automatically verse-aligned in multiple pairings.
All corpora were automatically word-aligned by the now publicly
available EGYPT system (Al-Onaizan et al., 1999), based on IBM’s
Model 3 statistical MT formalism (Brown et al., 1990). The tag-
ging and bracketing tasks utilized approximately 2 million words
in each language, with the sample sizes for morphology induc-
tion given in Table 3. All word alignments utilized strictly raw-
word-based model variants for English/French/Spanish/Czech and
character-based model variants for Chinese, with no use of mor-
phological analysis or stemming, POS-tagging, bracketing or dic-
tionary resources.
4. PART-OF-SPEECH TAGGER INDUCTION
Part-of-speech tagging is the first of four applications covered in
this paper. The goal of this work is to project POS analysis capabil-
ities from one language to another via word-aligned parallel bilin-
gual corpora. To do so, we use an existing POS tagger (e.g. Brill,
1995) to annotate the English side of the parallel corpus. Then,
as illustrated in Figure 1 for Chinese and French, the raw tags are
transferred via the word alignments, yielding an extremely noisy
initial training set for the 2nd language. The third crucial step is to
generalize from these noisy projected annotations in a robust way,
yielding a stand-alone POS tagger for the new language that is con-
siderably more accurate than the initial projected tags.
Additional details of this algorithm are given in Yarowsky and
Ngai (2001). Due to lack of space, the following sections will serve
primarily as an overview of the algorithm and its salient issues.
4.1 Part-of-speech Projection Issues
First, because of considerable cross-language differences in fine-
grained tag set inventories, this work focuses on accurately assign-
ing core POS categories (e.g. noun, verb, adverb, adjective, etc.),
with additional distinctions in verb tense, noun number and pro-
noun type as captured in the English tagset inventory. Although
impoverished relative to some languages, and incapable of resolv-
ing details such as grammatical gender, this Brown-corpus-based
tagset granularity is sufficient for many applications. Furthermore,
many finer-grained part-of-speech distinctions are resolved primar-
ily by morphology, as handled in Section 7. Finally, if one desires
to induce a finer-grained tagging capability for case, for example,
one should project from a reference language such as Czech, where
case is lexically marked.
Figure 3 illustrates six scenarios encountered when projecting
POS tags from English to a language such as French. The first
two show straightforward 1-to-1 projections, which are encoun-
tered in roughly two-thirds of English words. Phrasal (1-to-N)
alignments offer greater challenges, as typically only a subset of
the aligned words accept the English tag. To distinguish these
cases, we initially assign position-sensitive phrasal parts-of-speech
via subscripting (e.g. Les/NNS_1 lois/NNS_2), and subsequently learn
a probabilistic mapping to core, non-phrasal parts of speech (e.g.
P(DT | NNS_1)) that is used along with tag sequence and lexical prior
models to re-tag these phrasal POS projections.
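The tag-transfer step above can be sketched as follows (a minimal illustration; the alignment representation as (english_index, foreign_index) pairs and the subscript notation are our assumptions, not the system's actual implementation):

```python
from collections import defaultdict

def project_tags(eng_tags, alignment):
    """Project English POS tags onto foreign words via word alignments.

    eng_tags: one tag per English token.
    alignment: list of (eng_idx, for_idx) links.
    1-to-1 links copy the tag directly; phrasal 1-to-N links receive
    position-subscripted phrasal tags (e.g. NNS_1, NNS_2) for later
    probabilistic re-tagging to core parts of speech.
    """
    links = defaultdict(list)            # eng_idx -> aligned foreign indices
    for e, f in alignment:
        links[e].append(f)
    projected = {}
    for e, fs in links.items():
        fs = sorted(fs)
        if len(fs) == 1:                 # straightforward 1-to-1 projection
            projected[fs[0]] = eng_tags[e]
        else:                            # phrasal 1-to-N projection
            for i, f in enumerate(fs, 1):
                projected[f] = f"{eng_tags[e]}_{i}"
    return projected

# "laws"/NNS aligned to both "Les" and "lois":
tags = project_tags(["DT", "NNS"], [(1, 0), (1, 1)])
# -> {0: "NNS_1", 1: "NNS_2"}
```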
Figure 3: French POS tag projection scenarios
4.2 Noise-robust POS Tagger Training
Even at the relatively low tagset granularity of English, direct
projection of core POS tags onto French achieves only 76% ac-
curacy using EGYPT’s automatic word alignments (as shown in
Table 1). Part of this deficiency is due to word-alignment error;
when word alignments were manually corrected, direct projection
core-tag accuracy increased to 85%. Also, standard bigram taggers
trained on the automatically projected data achieve only modest
success at generalization (86% when reapplied to the noisy train-
ing data). More highly lexicalized learning algorithms exhibit even
greater potential for overmodeling the specific projection errors of
this data.
Thus our research has focused on noise-robust techniques for
distilling a conservative but effective tagger from this challenging
raw projection data. In particular, we modify standard n-gram mod-
eling to separate the training of the tag sequence model P(t_i | t_{i-1}, ...)
from the lexical prior model P(t|w), and apply different confidence
weighting and signal amplification techniques to both.
4.2.1 Lexical Prior Estimation
Figure 4 illustrates the process of hierarchically smoothing the
lexical prior model P(t|w). One motivating empirical observation
is that words in French, English and Czech have a strong tendency
to exhibit only a single core POS tag (e.g. N or V), and very rarely
have more than 2. In English, with relatively high P(POS|w) am-
biguity, only 0.37% of the tokens in the Brown Corpus are not cov-
ered by a word type's two most frequent core tags, and in French
the percentage of tokens is only 0.03%. Thus we employ an ag-
                                                     Evaluate on          Evaluate on Unseen
                                                     E-F Aligned French   Monolingual French
Model                                                Core     Eng-Eqv     Core     Eng-Eqv
                                                     Tagset   Tagset      Tagset   Tagset
(a) Direct transfer (on auto-aligned data)            .76      .69         N/A      N/A
(b) Direct transfer (on hand-aligned data)            .85      .78         N/A      N/A
(c) Standard bigram model (on auto-aligned data)      .86      .82         .82      .68
(d) Noise-robust bigram induction (auto-aligned)      .96      .93         .94      .91
(e) Fully supervised bigram training (goldstandard)   .97      .96         .98      .97
Table 1: Evaluation of 5 POS tagger induction models on 2 French datasets and 2 tagset granularities
gressive re-estimation in favor of this bias, amplifying the model
probability of the majority POS tag, and reducing or zeroing the
model probability of 2nd or lower ranked core tags proportional
to their relative frequency with respect to the majority tag. This
process is then applied recursively, similarly amplifying the proba-
bility of the majority subtags within each core tag. Further details,
including the handling of 1-to-N phrasal alignment projections, are
given in Yarowsky and Ngai (2001).
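The aggressive re-estimation described above can be realized in several ways; the following is one illustrative formulation (exponent-based sharpening with a relative-frequency floor), not the paper's exact re-estimation formula:

```python
def amplify_priors(tag_counts, power=2.0, floor=0.10):
    """Aggressively re-estimate P(tag|word) toward the majority tag.

    tag_counts: raw projected tag counts for one word type.
    Tags whose count relative to the majority tag falls below `floor`
    are zeroed; the survivors are sharpened by exponentiating their
    relative frequencies and renormalizing.  (Illustrative sketch.)
    """
    total = sum(tag_counts.values())
    if total == 0:
        return {}
    top = max(tag_counts.values())
    kept = {t: c for t, c in tag_counts.items() if c / top >= floor}
    sharpened = {t: (c / total) ** power for t, c in kept.items()}
    z = sum(sharpened.values())
    return {t: p / z for t, p in sharpened.items()}

# 'cadres': mostly N with noise-projected J and V; the majority tag N
# is amplified from .83 toward ~.96 and the zero-count V is dropped.
priors = amplify_priors({"N": 5, "J": 1, "V": 0})
```

The same sharpening can then be applied recursively to the subtag distribution (e.g. NN vs. NNS) within each surviving core tag.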
Word      Directly Projected Tag (counts)      Smoothed P(t|w)
          J    N    V    R    I        N    V      NN   NNS  VBN  VBG
achat     0   62   48    0    1       .76  .24    .73  .03  .03  .21
cadre     2   35    7    1    1       .90  .10    .86  .04  .03  .00
cadres    1    5    0    0    0       .94  .00    .04  .90  .00  .00
prévu     1   11   48    0    0       .09  .91    .08  .01  .86  .00
Figure 4: Hierarchical smoothing of P(t|w) tag probabilities
4.2.2 Tag Sequence Model Estimation
In contrast, the training of the tag sequence model P(t_i | t_{i-1}, ...)
focuses on confidence weighting and filtering of projected training
subsequences. The contribution of each candidate training sentence
is weighted proportionally with both its EGYPT/GIZA sentence-
level alignment score and an agreement measure between the pro-
jected tags and the 1st iteration lexical priors, a rough measure
of alignment reasonableness. Given the observed bursty distri-
bution of alignment errors in the corpus, this downweighting of
low-confidence alignment regions substantially improves sequence
model quality with tolerable reduction in training volume.
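The sentence-level confidence weighting can be sketched as below. The particular combination (product of alignment score and mean tag/prior agreement) and the cutoff threshold are illustrative assumptions; the paper's exact weighting scheme is not specified here:

```python
def sentence_weight(align_score, words, tags, lex_prior, min_score=0.5):
    """Confidence weight for one projected training sentence (sketch).

    Combines the sentence-level alignment score with how well the
    projected tags agree with 1st-iteration lexical priors P(t|w).
    Sentences below `min_score` (bursty low-confidence alignment
    regions) are effectively filtered by receiving weight 0.
    """
    if align_score < min_score:
        return 0.0
    agreement = sum(
        lex_prior.get(w, {}).get(t, 0.0) for w, t in zip(words, tags)
    ) / max(len(words), 1)
    return align_score * agreement

prior = {"les": {"DT": 0.95}, "lois": {"NNS": 0.90}}
w = sentence_weight(0.8, ["les", "lois"], ["DT", "NNS"], prior)
# 0.8 * (0.95 + 0.90) / 2 = 0.74
```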
4.3 Evaluation of POS Tagger Induction
As shown in Table 1, performance is evaluated on two evalua-
tion data sets, including an independent 200K-word hand-tagged
French dataset provided by Université de Montréal, which is used
to gauge stand-alone tagger performance. Signal amplification and
noise reduction techniques yield a 71% error reduction, achieving a
core tagset accuracy of 96%, closely approaching the upper-bound
97% performance of an equivalent bigram model trained directly
on an 80% subset of the hand-tagged evaluation set (using 5-fold
cross-validation). Thus robust training on 500K words of very
noisy but automatically-derived tag projections can approach the
performance obtained by fully supervised learning on 80K words
of hand-tagged training data.
5. NOUN PHRASE BRACKETER
INDUCTION
Our empirical studies show that there is a very strong tendency
for noun phrases to cohere as a unit when translated between lan-
guages, even when undergoing significant internal re-ordering. This
strong noun-phrase cohesion even tends to hold for relatively free
word order languages such as Czech, where both native speakers
and parallel corpus data indicate that nominal modifiers tend to re-
main in the same contiguous chunk as the nouns they modify. This
property allows collective word alignments to serve as a reliable
basis for bracket projection as well.
5.1 BaseNP Projection Methodology
The projection process begins by automatically tagging and brac-
keting the English data, using Brill (1995) and Ramshaw & Marcus
(1994), respectively.
As illustrated in Figure 5, each word within an English noun
phrase is then subscripted with the number of its NP in the sentence,
and this subscript is projected onto the aligned French (or Chinese)
words. In the most common case, the corresponding French/Chinese
noun phrase is simply the maximal span of the projected subscript.
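The subscript-projection step for the common case can be sketched as follows (illustrative data structures; NP spans as half-open (start, end) index pairs are an assumption of this sketch):

```python
def project_np_spans(eng_nps, alignment):
    """Project baseNP brackets via NP-index subscripts (sketch).

    eng_nps: list of (start, end) English NP spans, end exclusive.
    alignment: (eng_idx, for_idx) word-alignment links.
    Each foreign word inherits the NP number of its aligned English
    word; each projected NP is then taken as the maximal span of
    foreign positions carrying that subscript.
    """
    np_id = {}
    for k, (s, e) in enumerate(eng_nps):
        for i in range(s, e):
            np_id[i] = k
    f_positions = {}
    for e, f in alignment:
        if e in np_id:                       # word inside an English NP
            f_positions.setdefault(np_id[e], []).append(f)
    return [
        (min(fs), max(fs) + 1)               # maximal span of subscript k
        for k, fs in sorted(f_positions.items())
    ]

# "National laws" (NP 0) -> "lois nationales", with internal reordering:
spans = project_np_spans([(0, 2)], [(0, 1), (1, 0)])
# -> [(0, 2)]
```

Interwoven projections (Figure 6) would yield overlapping spans here; the system instead resolves those cases with a strong NP-cohesion bias.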
Figure 6 shows some of the projection challenges encountered.
Nearly all such cases of interwoven projected NPs are due to align-
ment errors, and a strong inductive bias towards NP cohesion was
utilized to resolve these incompatible projections.
Figure 5: Standard NP projection scenarios.
Figure 6: Problematic NP projection scenarios.
5.2 BaseNP Training Algorithm
For stand-alone tool development, the Ramshaw & Marcus IOB
bracketing framework and a fast transformation-based learning sys-
tem (Ngai and Florian, 2001) were applied to the noisy baseNP-
projected data described above.
As with POS tagger induction, bracketer induction is improved
by focusing training on the highest quality projected data and ex-
cluding regions with the strongest indications of word-alignment
error. Thus sentences with the lowest 25% of model-3 alignment
scores were excluded from training, as were sentences where pro-
jected bracketings overlapped and conflicted (also an indicator of
alignment errors). Data with lower-confidence POS tagging were
not filtered, however, as this filtering reduces robustness when the
stand-alone bracketers are applied to noisy tagger output. Addi-
tional details are provided in Yarowsky and Ngai (2001).
Current efforts to further improve the quality of the training data
include use of iterative EM bootstrapping techniques. Separate pro-
jection of bracketings from aligned parallel data with a 3rd lan-
guage also shows promise for providing independent supervision,
which can further help distinguish consensus signal from noise.
5.3 BaseNP Projection Evaluation
Because no bracketed evaluation data were available to us for
French or Chinese, a third party fluent in these languages hand-
bracketed a small, held-out 40-sentence evaluation set in both lan-
guages, using a set of bracketing conventions that they felt were
appropriate for the languages. Table 2 shows the performance rela-
tive to these evaluation sets, as measured by exact-match bracketing
precision (Pr), recall (R) and F-measure (F).
Exact Match Acceptable Match
Method Pr R F Pr R F
Chinese:
Direct (auto) .26 .58 .36 .48 .58 .51
Direct (hand) .47 .61 .53 .86 .86 .86
French:
Direct (auto) .43 .48 .45 .60 .58 .59
Direct (hand) .56 .51 .53 .74 .70 .72
FTBL (auto) .82 .81 .81 .91 .91 .91
Table 2: Performance of BaseNP induction models
It is important to note, however, that many decisions regarding
BaseNP bracketing conventions are essentially arbitrary, and agree-
ment rates between additional human judges on these data were
measured at 64% and 80% for French and Chinese respectively.
Since the translingual projections are essentially unsupervised and
have no data on which to mimic arbitrary conventions, it is also rea-
sonable to evaluate the degree to which the induced bracketings are
deemed acceptable and consistent with the arbitrary goldstandard
(e.g. no crossing brackets). To this end, an additional pool of 3
judges were asked to further adjudicate the differences between the
goldstandard and the projection output, annotating such situations
as either acceptable/compatible or unacceptable/incompatible.
Overall, these translingual projection results are quite encourag-
ing. For Chinese, they are similar to Wu's 78% precision re-
sult for translingual-grammar-based NP bracketing, and especially
promising given that no word segmentation (only raw characters)
was used. For French, the increase from 59% to 91% F-measure
for the stand-alone induced bracketer shows that the training algo-
rithm is able to generalize successfully from the noisy raw projec-
tion data, distilling a reasonably accurate (and transferable) model
of baseNP structure from this high degree of noise.
6. NAMED ENTITY TAGGER INDUCTION
Multilingual named entity tagger induction is based on the ex-
tended combination of the part-of-speech and noun-phrase brac-
keting frameworks. The entity class tags used for this study were
FNAME, LNAME, PLACE and OTHER (other entities including or-
ganizations). They were derived from an anonymously donated
MUC-6 named entity tagger applied to the English side of the French-
English Canadian Hansards data.
Initial classification proceeds on a per-word basis, using an ag-
gressively smoothed transitive projection model similar to those de-
scribed in Section 7. For a given second-language word FW and all
English words EW_i aligned to it:

  P(NEclass | FW) = Σ_i P(NEclass | EW_i) P_a(EW_i | FW)

e.g.  P(PLACE | Corée) = P(PLACE | Korea) P_a(Korea | Corée) + ...
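This transitive projection can be sketched directly from the formula (the dictionary shapes are illustrative; the system additionally applies aggressive smoothing to these estimates):

```python
def project_ne_class(fw, eng_class_probs, align_probs):
    """Transitive projection of named-entity class probabilities.

    Implements P(class|FW) = sum_i P(class|EW_i) * P_a(EW_i|FW),
    summing over the English words EW_i aligned to foreign word FW.
    eng_class_probs: EW -> {class: P(class|EW)}
    align_probs:     FW -> {EW: P_a(EW|FW)}      (illustrative shapes)
    """
    result = {}
    for ew, p_align in align_probs.get(fw, {}).items():
        for cls, p_cls in eng_class_probs.get(ew, {}).items():
            result[cls] = result.get(cls, 0.0) + p_cls * p_align
    return result

probs = project_ne_class(
    "Corée",
    {"Korea": {"PLACE": 0.9, "OTHER": 0.1}},
    {"Corée": {"Korea": 0.8}},
)
# probs["PLACE"] == 0.72, probs["OTHER"] == 0.08
```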
The co-training-based algorithm given in Cucerzan and Yarowsky
(1999) was then used to train a stand-alone named entity tagger
from the projected data. Seed words for this algorithm were those
French words that were both POS-tagged as proper nouns and had
an above-threshold entity-class confidence from the lexical projec-
tion models.
Performance was measured in terms of per-word entity-type clas-
sification accuracy on the French Hansard test data, using the 4-
class inventory listed above. Classification accuracy of raw tag
projections was only 64% (based on automatic word alignment).
In contrast, the stand-alone co-training-based tagger trained on the
projections achieved 85% classification accuracy, illustrating its ef-
fectiveness at generalization in the face of projection noise. Notably,
most of its observed errors can be traced to entity classification er-
rors from the original English tagger. In fact, when evaluated on the
English translation of the French test data set, the English tagger
only achieved 86% classification accuracy on this directly compa-
rable data set. It appears that the projection-induced French tagger
achieves performance nearly as high as its original training source.
Thus further improvements should be expected from higher quality
English training sources.
7. MORPHOLOGICAL ANALYSIS
INDUCTION
Bilingual corpora can also serve as a very successful bridge for
aligning complex inflected word forms in a new language with their
root forms, even when the surface forms are quite dissimilar or
highly irregular.
Figure 7: Direct-bridge French inflection/root alignment
As illustrated in Figure 7, the association between a French ver-
bal inflection (croyant) and its correct root (croire), rather than a
similar competitor (croître), can be identified by a single-step tran-
sitive association via an English bridge word (believing). However,
in the case of morphology induction, such direct associations are
relatively rare given that inflections in a second language tend to as-
sociate with similar tenses in English while the singular/infinitive
forms tend to associate with analogous singular/infinitive forms,
and thus croyaient (believed) and its root croire have no direct En-
glish link in our aligned corpus.
However, Figure 2 (first page) illustrates that an existing invest-
ment in a lemmatizer for English can help bridge this gap by joining
a multi-step transitive association croyaient → believed → believe →
croire. Figure 8 illustrates how this transitive linkage via English
Figure 8: Multi-bridge French inflection/root alignment
lemmatization can be potentially utilized for all other English lem-
mas (such as THINK) with which croyaient and croire also asso-
ciate, offering greater potential coverage and robustness via multi-
ple bridges.
Formally, these multiple transitive linkages can be modeled as
shown below, by summing over all English lemmas (Elemma) with
which either a candidate foreign inflection (finfl) or its root (froot)
exhibits an alignment in the parallel corpus:

  P_mp(froot | finfl) = Σ_i P_a(froot | Elemma_i) P_a(Elemma_i | finfl)

For example:

  P_mp(croire | croyaient) = P_a(croire | BELIEVE) P_a(BELIEVE | croyaient)
                           + P_a(croire | THINK) P_a(THINK | croyaient) + ...
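A minimal sketch of this bridged similarity, assuming the two alignment directions are available as dictionaries (a_fr2en[f][L] ≈ P_a(L|f), a_en2fr[L][f] ≈ P_a(f|L); these structures are illustrative, not the system's):

```python
def mproj_score(root, infl, a_fr2en, a_en2fr):
    """P_mp(root|infl) = sum over English bridge lemmas L of
    P_a(root|L) * P_a(L|infl)   -- the Section 7 multi-bridge formula.
    """
    return sum(
        a_en2fr.get(lemma, {}).get(root, 0.0) * p_lemma_given_infl
        for lemma, p_lemma_given_infl in a_fr2en.get(infl, {}).items()
    )

score = mproj_score(
    "croire", "croyaient",
    {"croyaient": {"BELIEVE": 0.7, "THINK": 0.2}},
    {"BELIEVE": {"croire": 0.6}, "THINK": {"croire": 0.5}},
)
# 0.7*0.6 + 0.2*0.5 = 0.52
```

A competitor root such as croître would accumulate mass only from lemmas like GROW, with which croyaient rarely aligns, so the correct root wins the comparison.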
This projection/bridge-based similarity measure P_mp(froot | finfl)
can be quite effective on its own, as shown in the MProj only entries
in Table 3 (for multiple parallel corpora in 3 different languages),
especially when restricted to the highest-confidence subset of the
vocabulary (5.2% to 77.9% in these data) for which the association
exceeds simple fixed probability and frequency thresholds. When
estimated using a 1.2 million word subset of the French Hansards,
for example, the MProj measure alone achieves 98.5% precision
on 32.7% of the inflected French verbs in the corpus (constitut-
ing 97.6% of the tokens in the corpus). Unlike traditional string-
transduction-based morphology induction methods where irregular
verbs pose the greatest challenges, these typically high-frequency
words are often the best-modeled data in the vocabulary, making
these multilingual projection techniques a natural complement to
existing models.
7.1 Trie-based Morphology Models
The high precision on the MProj-covered subset also makes these
partial pairings effective training data for robust supervised algo-
rithms that can generalize the string transformation behavior to the
remaining uncovered vocabulary. While any supervised morpho-
logical analysis technique is possible here, we employ a trie-based
modeling technique where the probability of a given stem-change
(from the inventory observed in the MProj-paired training data) is
modeled hierarchically using variable suffix context, as described
in Yarowsky and Wicentowski (2000):
  P(root | inflection) = P(αγ | αβ) = P(β→γ | αβ)
                       = Σ_i λ_i P(β→γ | h_i),   for h_i = suffix(i, αβ)

For example:

  P(commencer | commença) = P(ça→cer | commença)
    = λ4 P(ça→cer) + λ3 P(ça→cer | a) + λ2 P(ça→cer | ça)
    + λ1 P(ça→cer | nça) + λ0 P(ça→cer | ença) + ...

  P(ployer | ploie) = P(ie→yer | ploie)
    = λ4 P(ie→yer) + λ3 P(ie→yer | e) + λ2 P(ie→yer | ie)
    + λ1 P(ie→yer | oie) + λ0 P(ie→yer | loie) + ...
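The suffix-history backoff can be sketched as follows (the mixture weights and the flat (change, history) → probability table are illustrative assumptions; the actual trie stores these probabilities hierarchically at history nodes):

```python
def stem_change_prob(change, inflection, hist_probs, lambdas):
    """Hierarchically smoothed P(change | suffix history) -- a sketch.

    Mixes estimates conditioned on progressively longer suffix
    histories of the inflection:
        P = sum_i lambda_i * P(change | h_i),  h_i = suffix(i, inflection)
    lambdas[i] weights the history of length i (i=0 is the empty,
    unconditioned history); unseen histories contribute 0.
    """
    total = 0.0
    for i, lam in enumerate(lambdas):
        h = inflection[-i:] if i else ""     # last i characters
        total += lam * hist_probs.get((change, h), 0.0)
    return total

# ie->yer for 'ploie': the long history 'loie' dominates the estimate
hist_probs = {
    ("ie->yer", ""): 0.07,   ("ie->yer", "e"): 0.03,
    ("ie->yer", "ie"): 0.005, ("ie->yer", "oie"): 0.0008,
    ("ie->yer", "loie"): 0.5,
}
p = stem_change_prob("ie->yer", "ploie", hist_probs, [0.1, 0.1, 0.2, 0.2, 0.4])
```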
An important property of the trie-based models is their effective-
ness at clustering words that exhibit similar morphological behav-
ior, both reducing model size and facilitating generalization to pre-
viously unseen examples. This property is illustrated in Figure 9,
showing a sample (inflection a36 root) trie branch for French verbal
inflections, with suffix histories a127 =’oie’, a127 =’noie’, a127 =’roie’, etc.
At each history node, the hierarchically smoothed probabilities of
several a121
a36
a119 (inflectiona36 root) changes are given. Note that the
relative probabilities of the competing analyses iea36 ir and iea36 yer
differ substantially for diffent suffix histories, and that there are
subexceptions that tend to cluster by affix history. This allows for
the successful analysis of 8 of the 9 italicized test words that had
not been seen in the bilingual projection data or where the MProj
model yielded no root candidate above threshold.
Figure 9: Example of a French MTrie branch, showing inflec-
tion→root probabilities (P(β→γ | h_i)) for variable-length
suffix histories (h_i). MTrie analyses on test data are given in
italics.
Table 3 illustrates the performance of a variety of morphology
induction models. When using the projection-based MProj and
trie-based MTrie models together (with the latter extending cover-
age to words that may not even appear in the parallel corpus), full
verb lemmatization precision on the 1.2M word Hansard subset ex-
ceeds 99.5% (by type) and 99.9% (by token) with 95.8% coverage
by type and 99.8% coverage by token. A backoff model based on
Levenshtein-distance and distributional context similarity handles
the relatively small percentage of cases where MProj and MTrie
together are not sufficiently confident, bringing the system to 100%
coverage with a small drop in precision to 97.9%
(by type) and 99.8% (by token) on the unrestricted space of in-
flected verbs observed in the full French Hansards. As shown in
Section 7.3, performance is strongly correlated with size of the ini-
tial aligned bilingual corpus, with a larger Hansard subset of 12M
words yielding 99.4% precision (by type) and 99.9% precision (by
token). Performance on Czech is discussed in Section 7.3.
Precision Coverage
Model Typ Tok Typ Tok
FRENCH Verbal Morphology Induction
French Hansards (12M words):
MProj only .992 .999 .779 .994
MProj+MTrie .998 .999 .988 .999
MProj+MTrie+BKM .994 .999 1.00 1.00
French Hansards (1.2M words):
MProj only .985 .998 .327 .976
MProj+MTrie .995 .999 .958 .998
MProj+MTrie+BKM .979 .998 1.00 1.00
French Hansards (120K words):
MProj only .962 .931 .095 .901
MProj+MTrie .984 .993 .916 .994
MProj+MTrie+BKM .932 .989 1.00 1.00
French Bible (300K words) via 1 English Bible:
MProj only 1.00 1.00 .052 .747
MProj+MTrie .991 .998 .918 .992
MProj+MTrie+BKM .954 .994 1.00 1.00
French Bible (300K words) via 3 English Bibles:
MProj only .928 .975 .100 .820
MProj+MTrie .981 .991 .931 .990
MProj+MTrie+BKM .964 .991 1.00 1.00
CZECH Verbal Morphology Induction
Czech Reader’s Digest (500K words):
MProj only .915 .993 .152 .805
MProj+MTrie .916 .917 .893 .975
MProj+MTrie+BKM .878 .913 1.00 1.00
SPANISH Verbal Morphology Induction
Spanish Bible (300K words) via 1 English Bible:
MProj only .973 .935 .264 .351
MProj+MTrie .988 .998 .971 .967
MProj+MTrie+BKM .966 .985 1.00 1.00
Spanish Bible (300K words) via French Bible:
MProj only .980 .935 .722 .765
MProj+MTrie .983 .974 .986 .993
MProj+MTrie+BKM .974 .968 1.00 1.00
Spanish Bible (300K words) via 3 English Bibles:
MProj only .964 .948 .468 .551
MProj+MTrie .990 .998 .978 .987
MProj+MTrie+BKM .976 .987 1.00 1.00
Table 3: Performance of full verbal morphological analysis,
including precision/coverage by type/token
7.2 Morphology Induction via Aligned Bibles
Performance using even small parallel corpora (e.g. a 120K sub-
set of the French Hansards) still yields a respectable 93.2% (type)
and 98.9% (token) precision on the verb-lemmatization test set for
the full Hansards. Given that the Bible is actually larger (approxi-
mately 300K words, depending on version and language) and avail-
able on-line or via OCR for virtually all languages (Resnik et al.,
2000), we also conducted several experiments on Bible-based mor-
phology induction, further detailed in Table 3.
7.2.1 Boosting Performance via Multiple
Parallel Translations
Even though at most one translation of the Bible is typically
available in a given foreign language, numerous English Bible ver-
sions are freely available and a performance increase can be achie-
ved by simultaneously utilizing alignments to each English version.
As illustrated in Figure 10, different aligned Bible pairs may exhibit
(or be missing) different full or partial bridge links for a given word
(due both to different lexical usage and poor textual parallelism in
some text-regions or version pairs). However, P(root|BELIEVE)
and P(BELIEVE|infl) need not be estimated from the same Bible
pair. Even if one has only one Bible in a given source language,
each alignment with a distinct English version gives new bridging
opportunities with no additional resources needed on the source
language side. The baseline approach (evaluated here) is simply to
concatenate the different aligned versions together. While word-
pair instances translated the same way in each version will be re-
peated, this rather reasonably reflects the increased confidence in
this particular alignment. An alternate model would weight version
pairs differently based on the otherwise-measured translation faith-
fulness and alignment quality between the version pairs. Doing so
would help decrease noise. Increasing from 1 to 3 English versions
reduces the type error rate (at full coverage) by 22% on French and
28% on Spanish with no increase in the source language resources.
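The concatenation baseline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the (foreign word, English lemma) pair format are invented, and the toy alignments stand in for word-aligned Bible versions. Pooling the versions means that a word pair translated consistently across versions is counted repeatedly, raising its relative weight.

```python
from collections import Counter

def bridge_lemma_dist(aligned_pairs):
    """Estimate P(bridge_lemma | foreign_word) from word-aligned
    (foreign_word, english_lemma) pairs pooled over all parallel
    versions.  Concatenating versions repeats consistent alignments,
    which raises their counts and hence the model's confidence."""
    counts = {}
    for f, e in aligned_pairs:
        counts.setdefault(f, Counter())[e] += 1
    return {f: {e: n / sum(c.values()) for e, n in c.items()}
            for f, c in counts.items()}

# Toy alignments pooled from three hypothetical English versions:
# all three align "croyaient" to BELIEVE; one noisy pair adds TRUST.
pairs = [("croyaient", "BELIEVE")] * 3 + [("croyaient", "TRUST")]
dist = bridge_lemma_dist(pairs)
print(dist["croyaient"]["BELIEVE"])  # 0.75
```

A weighted variant, as suggested in the text, would multiply each version pair's counts by a measured translation-faithfulness score before normalizing.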
[Figure 10 diagram: the French inflections croyaient and croyant
are mapped through the English bridge lemma BELIEVE (surface forms
believe, believed, believing, aligned from SOURCE 1-3, e.g. KJV,
NIV, RSV) to the candidate French roots croire and croître.]
Figure 10: Use of multiple parallel Bible translations
7.2.2 Boosting Performance via Multiple
Bridge Languages
Once lemmatization capabilities have been successfully projected
to a new language (such as French), this language can then serve as
an additional bridging source for morphology induction in a third
language (such as Spanish), as illustrated in Figure 11. This can
be particularly effective if the two languages are very similar (as
in Spanish-French) or if their available Bible versions are a close
translation of a common source (e.g. the Latin Vulgate Bible). As
shown in Table 3, using the previously analyzed French Bible as a
bridge for Spanish achieves performance (97.4% precision) com-
parable to the use of 3 parallel English Bible versions.
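The bridging arithmetic can be sketched as a sum over bridge lemmas, which may be drawn from several languages at once (here English and French bridges for a Spanish inflection). The distributions and probability values below are invented for illustration only.

```python
def chain_bridges(p_bridge_given_infl, p_root_given_bridge):
    """Score candidate roots for an inflection by summing
    P(root | bridge) * P(bridge | inflection) over all bridge
    lemmas, regardless of which bridge language supplied them.
    Returns the best-scoring root and the full score table."""
    scores = {}
    for bridge, p_b in p_bridge_given_infl.items():
        for root, p_r in p_root_given_bridge.get(bridge, {}).items():
            scores[root] = scores.get(root, 0.0) + p_b * p_r
    return max(scores, key=scores.get), scores

# Hypothetical distributions for the Spanish inflection "creyeron":
p_b = {"BELIEVE": 0.7, "CROIRE": 0.3}        # English + French bridges
p_r = {"BELIEVE": {"creer": 0.9,  "crear": 0.1},
       "CROIRE":  {"creer": 0.95, "crear": 0.05}}
best, scores = chain_bridges(p_b, p_r)
print(best)  # creer
```

Note that both bridge languages here reinforce the same root, which is the typical benefit of adding a closely related bridge such as French for Spanish.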
[Figure 11 diagram: the Spanish inflections creyeron and creía are
mapped through both English bridge lemmas (BELIEVE: believe,
believed, believing) and French bridge lemmas (CROIRE: croire,
croyaient) to the candidate Spanish roots creer and crear.]
Figure 11: Use of bridges in multiple languages.
7.3 Morphology Induction: Observations
This section includes additional detail regarding the morphol-
ogy induction experiments, supplementing the previous details and
analyses given in Section 7 and Table 3.
• Performance of induction using the French Bible as the bridge
source is evaluated using the full test verb set extracted from
the French Hansards. The strong performance when trained
only using the Bible illustrates that even a small single text
in a very different genre can provide effective transfer to
modern (conversational) French. While the observed genre
and topic-sensitive vocabulary differs substantially between
the Bible and Hansards, the observed inventories of stem
changes and suffixation actually have large overlap, as do the
set of observed high-frequency irregular verbs. Thus the
inventory of morphological phenomena seems to translate better
across genre than do lexical choice and collocation models.
• Over 60% of errors are due to gaps in the candidate rootlists.
Currently the candidate rootlists are derived automatically by
applying the projected POS models and selecting any word
with the probability of being an uninflected verb greater than
a generous threshold and also ending in a canonical verb suf-
fix. False positives are easily tolerated (less than 5% of errors
are due to spurious non-root competitors), but with missing
roots the algorithms are forced either to propose previously
unseen roots or align to the closest previously observed root
candidate. Thus while no non-English dictionary was used
in the computation of these results, it would substantially
improve performance to have a dictionary-based inventory
of potential roots, increasing coverage and decreasing noise
from competing non-roots and spelling errors.
• Performance in all languages has been significantly hindered
by low-accuracy parallel-corpus word-alignments using the
original Model-3 GIZA tools. Use of Och and Ney’s re-
cently released and enhanced GIZA++ word-alignment mod-
els (Och and Ney, 2000) should improve performance for all
of the applications studied in this paper, as would iterative re-
alignments using richer alignment features (including lemma
and part-of-speech) derived from this research.
• The current somewhat lower performance on Czech is due
to several factors: (a) very low accuracy initial word-alignments,
due both to the often non-parallel translations of the
Reader’s Digest sample and to the failure of the initial
word-alignment models to handle the highly inflected Czech
morphology; (b) the small size of the Czech parallel corpus
(less than twice the length of the Bible); and (c) the common
occurrence in Czech of two very similar perfective and
non-perfective root variants (e.g. odolávat and odolat, both
of which mean to resist). A simple monolingual dictionary-
derived list of canonical roots would resolve ambiguity re-
garding which is the appropriate target.
• Many of the errors are due to all (or most) inflections of a sin-
gle verb mapping to the same incorrect root. But for many
applications where the function of lemmatization is to cluster
equivalent words (e.g. stemming for information retrieval),
the choice of label for the lemma is less important than cor-
rectly linking the members of the lemma.
[Figure 12 plot: lemmatization error rate (log scale, .001-.10)
versus size of aligned corpus (120K-12M words), with curves for
the French Bible and French Hansards, each by type and by token.]
Figure 12: Learning Curves for French Morphology
• The learning curves in Figure 12 show the strong correlation
between performance and size of the aligned corpus. Given
that large quantities of parallel text currently exist in trans-
lation bureau archives and OCR-able books, not to mention
the increasing online availability of bitext on the web, the
natural growth of available bitext quantities should continue
to support performance improvement.
• The system analysis examples shown in Table 4 are repre-
sentative of model performance and are selected to illustrate
the range of encountered phenomena. All system evaluation
is based on the task of selecting the correct root for a given
inflection (which has a long lexicography-based consensus
regarding the “truth”). In contrast, the descriptive analysis of
any such pairing is very theory dependent without standard
consensus. The “TopBridge” column shows the strongest
English bridge lemma utilized in mapping (typically one of
many potential bridge lemmas).
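The candidate-rootlist construction discussed in the bullets above (selecting any word whose projected probability of being an uninflected verb exceeds a generous threshold and which ends in a canonical verb suffix) can be sketched as follows. The threshold value and Spanish-style suffix list are illustrative assumptions, not the paper's settings.

```python
def candidate_roots(pos_probs, suffixes=("ar", "er", "ir"),
                    threshold=0.3):
    """Select candidate verb roots from projected POS probabilities:
    keep any word with P(uninflected verb) above a generous threshold
    that also ends in a canonical verb suffix.  False positives are
    cheap (few errors come from spurious non-root competitors);
    missing roots are the costly case, forcing alignment to a wrong
    or unseen root."""
    return sorted(w for w, p in pos_probs.items()
                  if p > threshold and w.endswith(suffixes))

# Toy projected probabilities of being an uninflected verb:
probs = {"andar": 0.8, "anduvo": 0.4, "mesa": 0.9, "creer": 0.6}
print(candidate_roots(probs))  # ['andar', 'creer']
```

As the text notes, a dictionary-based inventory of potential roots would replace this heuristic filter, improving coverage and reducing noise from competing non-roots.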
These results are quite impressive in that they are based on essen-
tially no language-specific knowledge of French, Spanish or Czech.
In addition, the multilingual bridge algorithm is surface-form in-
dependent, and can just as readily handle obscure infixational or
reduplicative morphological processes.
8. CONCLUSION
This paper has presented a detailed survey of original algorithms
for cross-language annotation projection and noise-robust tagger
induction, evaluated on four diverse applications. It shows how
previous major investments in English annotated corpora and tool
development can be effectively leveraged across languages, achiev-
ing accurate stand-alone tool development in other languages with-
out comparable human annotation efforts. Collectively this work is
the most comprehensive existing exploration of a very promising
new paradigm for cross-language resource projection.
Acknowledgements
This research has been partially supported by NSF grant IIS-
9985033 and ONR/MURI contract N00014-01-1-0685. The au-
thors thank Silviu Cucerzan, Radu Florian, Jan Hajic, Gideon Mann
and Charles Schafer for their valuable contributions and feedback.
9. REFERENCES
[1] Y. Al-Onaizan, J. Curin, M. Jahr, K. Knight, J. Lafferty, D.
Melamed, F.J. Och, D. Purdy, N. Smith and D. Yarowsky. 1999.
Statistical Machine Translation (tech report). Johns Hopkins
University.
[2] E. Brill. 1995. Transformation-based error-driven learning and
natural language processing: A case study in part of speech
tagging. Computational Linguistics, 21(4):543–565.
[3] P. Brown, J. Cocke, S. DellaPietra, V. DellaPietra, F. Jelinek,
J. Lafferty, R. Mercer, and P. Rossin. 1990. A statistical
approach to machine translation. Computational Linguistics,
16(2):79–85.
[4] S. Cucerzan and D. Yarowsky. 1999. Language independent
named entity recognition combining morphological and
contextual evidence. In Proceedings of the 1999 Joint SIGDAT
Conference on Empirical Methods in NLP and Very Large
Corpora, pp. 90-99.
[5] P. Fung and K. Church. 1994. K-vec: a new approach for
aligning parallel texts. In Proceedings of COLING-94, pp.
1096–1102.
[6] P. Fung and K. McKeown. 1994. Aligning noisy parallel
corpora across language groups: Word pair feature matching by
dynamic warping. In Proceedings of AMTA-94, pp. 81–88.
[7] D. Jones and R. Havrilla. 1998. Twisted pair grammar:
Support for rapid development of machine translation for low
density languages. In Proceedings of AMTA-98, pp. 318–332.
[8] D. Melamed. 1999. Bitext maps and alignment via pattern
recognition. Computational Linguistics, 25(1):107–130.
[9] G. Ngai and R. Florian. 2001. Transformation-based learning
in the fast lane. In Proceedings of NAACL-2001, pp. 40-47.
[10] F.J. Och and H. Ney. 2000. Improved statistical alignment
models. In Proceedings of ACL-2000, pp. 440-447.
[11] L. Ramshaw and M. Marcus. 1999. Text chunking using
transformation-based learning. In Armstrong et al. (Eds.),
Natural Language Processing Using Very Large Corpora.
Kluwer, pp. 157-176.
[12] P. Resnik, M. Olsen, and M. Diab. 2000. The Bible as a
parallel corpus: annotating the ‘Book of 2000 Tongues’.
Computers and the Humanities, 33(1-2):129–153.
[13] D. Wu. 1995. An algorithm for simultaneously bracketing
parallel texts. In Proc. of ACL-95, pp. 244–251.
[14] D. Wu. 1997. Stochastic inversion transduction grammars and
bilingual parsing of parallel corpora. Computational
Linguistics, 23(3):377–404.
[15] D. Yarowsky and G. Ngai. 2001. Inducing multilingual POS
taggers and NP bracketers via robust projection across aligned
corpora. In Proceedings of NAACL-2001.
[16] D. Yarowsky and R. Wicentowski. 2000. Minimally
supervised morphological analysis by multimodal alignment. In
Proceedings of ACL-2000, pp. 207-216.
Induced Morphological Analyses for CZECH
Inflection   Root       Analysis       TopBridge
bral         brát       al→át          marry
brala        brát       ala→át         accept
brali        brát       ali→át         marry
byl          být        yl→ýt          be
byli         být        yli→ýt         be
bylo         být        ylo→ýt         be
chovala      chovat     la→t           behave
chová        chovat     á→at           behave
chováme      chovat     áme→at         behave
chodila      chodit     la→t           walk
chodí        chodit     í→it           walk
choďte       chodit     ďte→dit        swim
chránila     chránit    la→t           protect
chrání       chránit    í→it           protect
couval       couvat     l→t            back
chce         chtít      ce→tít         want
chcete       chtít      cete→tít       want
chceš        chtít      ceš→tít        want
chci         chtít      ci→tít         want
chtějí       chtít      ějí→ít         want
chtěli       chtít      ěli→ít         want
chtělo       chtít      ělo→ít         want
Induced Morphological Analyses for SPANISH
Inflection   Root       Analysis       TopBridge
aborreció    aborrecer  ió→er          hate
aborrecía    aborrecer  ía→er          hate
aborrezco    aborrecer  zco→cer        hate
abrace       abrazar    ce→zar         embrace
abrazado     abrazar    ado→ar         embrace
adquiere     adquirir   ere→rir        get
andamos      andar      amos→ar        walk
andando      andar      ando→ar        walk
andarán      andar      arán→ar        wander
andarás      andar      arás→ar        wander
andemos      andar      emos→ar        walk
anden        andar      en→ar          walk
anduvo       andar      uvo→ar         walk
buscáis      buscar     áis→ar         seek
buscó        buscar     ó→ar           seek
busque       buscar     que→car        seek
busqué       buscar     qué→car        seek
Induced Morphological Analyses for FRENCH
Inflection   Root       Analysis       TopBridge
abrège       abréger    ège→éger       shorten
abrègent     abréger    ègent→éger     shorten
abrégerai    abréger    erai→er        curtail
achète       acheter    ète→eter       buy
achètent     acheter    ètent→eter     buy
achètera     acheter    ètera→eter     buy
advenait     advenir    ait→ir         happen
advenu       advenir    u→ir           happen
adviendrait  advenir    iendrait→enir  happen
advient      advenir    ient→enir      happen
aliène       aliéner    ène→éner       alienate
aliènent     aliéner    ènent→éner     alienate
conçu        concevoir  çu→cevoir      conceive
crois        croire     s→re           believe
croyaient    croire     yaient→ire     believe
Table 4: Sample of induced morphological analyses
