Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 851–858, Vancouver, October 2005. ©2005 Association for Computational Linguistics
A Backoff Model for Bootstrapping Resources
for Non-English Languages∗
Chenhai Xi and Rebecca Hwa
Department of Computer Science
University of Pittsburgh
{chenhai,hwa}@cs.pitt.edu
Abstract
The lack of annotated data is an ob-
stacle to the development of many
natural language processing applica-
tions; the problem is especially severe
when the data is non-English. Pre-
vious studies suggested the possibility
of acquiring resources for non-English
languages by bootstrapping from high
quality English NLP tools and paral-
lel corpora; however, the success of
these approaches seems limited for dis-
similar language pairs. In this paper,
we propose a novel approach of com-
bining a bootstrapped resource with a
small amount of manually annotated
data. We compare the proposed ap-
proach with other bootstrapping meth-
ods in the context of training a Chinese
Part-of-Speech tagger. Experimental
results show that our proposed ap-
proach achieves a significant improve-
ment over EM and self-training and
systems that are only trained on man-
ual annotations.
∗ We thank Stephen Clark, Roger Levy, Carol Nichols, and the three
anonymous reviewers for their helpful comments.
1 Introduction
Natural language applications that use supervised learning methods
require annotated training data, but annotated data is scarce for many
non-English languages. It has been suggested
that annotated data for these languages might
be automatically created by leveraging paral-
lel corpora and high-accuracy English systems
(Yarowsky and Ngai, 2001; Diab and Resnik,
2002). These studies center on the
assumption that linguistic analyses for English
(e.g., Part-of-Speech tags, word senses, grammatical
dependency relationships) are also valid analyses for the translation
of the English text. For example, in the English
noun phrase the red apples, red modifies apples;
the same modifier relationship also exists in
its French translation les pommes rouges, even
though the word orders differ. To the extent
that the assumption is true, annotated data in
the non-English language can be created by pro-
jecting English analyses across a word aligned
parallel corpus. The resulting projected data
can then serve as (albeit noisy) training exam-
ples to develop applications in the non-English
language.
The projection approach faces both a theo-
retical and a practical challenge. Theoretically,
it is well-known that two languages often do
not express the same meaning in the same way
(Dorr, 1994). Practically, the projection frame-
work is sensitive to component errors. In partic-
ular, poor word alignments significantly degrade
the accuracy of the projected annotations. Pre-
vious research on resource projection attempts
to address these problems by redistributing the
parameter values (Yarowsky and Ngai, 2001) or
by applying transformation rules (Hwa et al.,
2002). Their experimental results suggest that
while these techniques can overcome some er-
rors, they are not sufficient for projected data
that are very noisy.
In this work, we tackle the same problems by
relaxing the zero manual annotation constraint.
The main question we address is: how can we
make the most out of a small set of manually labeled
data (on the non-English side)? Following
the work of Yarowsky and Ngai (2001) we focus
on the task of training a Part-of-Speech (POS)
tagger, but we conduct our experiments with
the more dissimilar language pair of English-
Chinese instead of English-French. Through
empirical studies, we show that when the word
alignment quality is sufficiently poor, the er-
ror correction techniques proposed by Yarowsky
and Ngai are unable to remove enough mistakes
in the projected data. We propose an alternative
approach that is inspired by backoff language
modeling techniques in which the parameters of
two tagging models (one trained on manually la-
beled data; the other trained on projected data)
are combined to achieve a more accurate final
model.
2 Background
The idea of trying to squeeze more out of an-
notated training examples has been explored in
a number of ways in the past. Most popular
is the family of bootstrapping algorithms, in
which a model is seeded with a small amount of
labeled data and iteratively improved as more
unlabeled data are folded into the training set,
typically through unsupervised learning. Another
approach is active learning (Cohn et al.,
1996), in which the model is also iteratively im-
proved but the training examples are chosen by
the learning model, and the learning process is
supervised. Finally, the work that is the closest
to ours in spirit is the idea of joint estimation
(Smith and Smith, 2004).
Of the bootstrapping methods, perhaps the
most well-known is the Expectation Maximiza-
tion (EM) algorithm. This approach has been
explored in the context of many NLP applica-
tions; one example is text classification (Nigam
et al., 1999). Another bootstrapping approach
reminiscent of EM is self-training. Yarowsky
(1995) used this method for word sense disam-
biguation. In self-training, annotated examples
are used as seeds to train an initial classifier
with any supervised learning method. This ini-
tial classifier is then used to automatically an-
notate data from a large pool of unlabeled ex-
amples. Of these newly labeled data, the ones
labeled with the highest confidence are used as
examples to train a new classifier. Yarowsky
showed that repeated application of this pro-
cess resulted in a series of word sense classi-
fiers with improved accuracy and coverage. Also
related is the co-training algorithm (Blum and
Mitchell, 1998) in which the bootstrapping pro-
cess requires multiple learners that have differ-
ent views of the problem. The key to co-training
is that the views should be conditionally inde-
pendent given the label. The strong indepen-
dence requirement on the views is difficult to
satisfy. For practical applications, different feature
sets or models (that are not conditionally
independent) have been used as an approxima-
tion for different views. Co-training has been ap-
plied to a number of NLP applications, includ-
ing POS-tagging (Clark et al., 2003), parsing
(Sarkar, 2001), word sense disambiguation (Mi-
halcea, 2004), and base noun phrase detection
(Pierce and Cardie, 2001). Because the view
independence assumption is relaxed in practice, most empirical
studies report only marginal improvements.
The common thread between EM, self-training,
and co-training is that they all bootstrap off
of unannotated data. In this work, we explore
an alternative to “pure” unannotated data; our
data have been automatically annotated with
projected labels from English. Although the
projected labels are error-prone, they provide us
with more information than automatically pre-
dicted labels used in bootstrapping methods.
With a somewhat different goal in mind, ac-
tive learning addresses the problem of choosing
the most informative data for annotators to la-
bel so that the model would achieve the greatest
improvement. Active learning also has been ap-
plied to many NLP applications, including POS
tagging (Engelson and Dagan, 1996) and pars-
ing (Baldridge and Osborne, 2003). The draw-
back of an active learning approach is that it
assumes that a staff of annotators is waiting on
call, ready to label the examples chosen by the
system at every iteration. In practice, it is more
likely that one could only afford to staff anno-
tators for a limited period of time. Although
active learning is not a focus in this paper, we
owe some ideas to active learning in choosing a
small initial set of training examples; we discuss
these ideas in section 3.2.
More recently, Smith and Smith (2004) pro-
posed to merge an English parser, a word align-
ment model, and a Korean PCFG parser trained
from a small number of Korean parse trees un-
der a unified log linear model. Their results sug-
gest that a joint model produces somewhat more
accurate Korean parses than a PCFG Korean
parser trained on a small amount of annotated
Korean parse trees alone. Their motivation is
similar to the starting point of our work: that a
word aligned parallel corpus and a small amount
of annotated data in the foreign language side
offer information that might be exploited. Our
approach differs from theirs in that we do not
optimize the three models jointly. One concern
is that joint optimization might not result in op-
timal parameter settings for the individual com-
ponents. Because our focus is primarily on ac-
quiring non-English language resources, we only
use the parallel corpus as a means of projecting
resources from English.
3 Our Approach
This work explores developing a Chinese POS
tagger without a large manually annotated cor-
pus. Our approach is to train two separate
models from two different data sources: a large
corpus of automatically tagged data (projected
from English) and a small corpus of manually
tagged data; the two models are then combined
into one via the Witten-Bell backoff language
model.
3.1 Projected Data
One method of acquiring a large corpus of au-
tomatically POS tagged Chinese data is by
projection (Yarowsky and Ngai, 2001). This
approach requires a sentence-aligned English-
Chinese corpus, a high-quality English tagger,
and a method of aligning English and Chinese
words that share the same meaning. Given the
parallel corpus, we tagged the English words
with a publicly available maximum entropy tag-
ger (Ratnaparkhi, 1996), and we used an im-
plementation of the IBM translation model (Al-
Onaizan et al., 1999) to align the words. The
Chinese words in the parallel corpus would then
receive the same POS tags as the English words
to which they are aligned. Next, the basic pro-
jection algorithm is modified to accommodate
two complicating factors. First, word align-
ments are not always one-to-one. To compen-
sate, we assign a default tag to unaligned Chi-
nese words; in the case of one-Chinese-to-many-
English, the Chinese word would receive the tag
of the final English word. Second, English and
Chinese do not share the same tag set. Fol-
lowing Yarowsky and Ngai (2001), we define 12
equivalence classes over the 47 Penn-English-
Treebank POS tags. We refer to them as Core
Tags. With the help of 15 hand-coded rules and
a Naive Bayes model trained on a small amount
of manually annotated data, the Core Tags can
be expanded to the granularity of the 33 Penn-
Chinese-Treebank POS tags (which we refer to
as Full Tags).
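The basic projection step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `project_tags`, the `(english_index, chinese_index)` alignment format, and the default tag value are hypothetical, and "the final English word" is read here as the aligned English word with the largest position.

```python
from typing import Dict, List, Tuple

# Hypothetical default tag; the paper does not say which tag is used.
DEFAULT_TAG = "N"


def project_tags(
    english_tags: List[str],
    num_chinese_words: int,
    alignment: List[Tuple[int, int]],
    default_tag: str = DEFAULT_TAG,
) -> List[str]:
    """Project English POS tags onto Chinese words through word alignments.

    `alignment` holds (english_index, chinese_index) pairs (assumed format).
    """
    # Collect, for each Chinese position, the English positions aligned to it.
    aligned: Dict[int, List[int]] = {}
    for e_idx, c_idx in alignment:
        aligned.setdefault(c_idx, []).append(e_idx)

    projected: List[str] = []
    for c_idx in range(num_chinese_words):
        e_indices = aligned.get(c_idx)
        if not e_indices:
            # Unaligned Chinese word: fall back to the default tag.
            projected.append(default_tag)
        else:
            # One-Chinese-to-many-English: take the tag of the final
            # (right-most) aligned English word.
            projected.append(english_tags[max(e_indices)])
    return projected
```

For instance, with English tags `["DET", "ADJ", "N"]`, three Chinese words, and links `[(0, 0), (1, 0), (2, 1)]`, the first Chinese word takes the tag of the second English word, the second takes `N`, and the third receives the default tag.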
3.2 Manually Annotated Data
Since the amount of manual annotation is lim-
ited, we must decide what type of data to anno-
tate. In the spirit of active learning, we aim to
select sentences that may bring about the great-
est improvements in the accuracy of our model.
Because it is well known that handling unknown
words is a serious problem for POS taggers, our
strategy for selecting sentences for manual anno-
tation is to maximize the word coverage of the
initial model. That is, we wish to find a small
set of sentences that would lead to the greatest
reduction in currently unknown words. Finding
these sentences is an NP-hard problem because
the 0/1 knapsack problem can be reduced to
this problem in polynomial time (Gurari, 1989).
Thus, we developed an approximation algorithm
for finding sentences with the maximum word
coverage of unknown words (MWC). This algorithm
is described in Figure 1.
M : number of tokens to be annotated.
S = {s_1, s_2, ..., s_n} : the unannotated corpus.
S_sel : set of selected sentences in S.
S_unsel : set of unselected sentences in S.
|S_sel| : number of tokens in S_sel.
TYPE(S_sel) : number of types in S_sel.
MWC:
    randomly choose S_sel ⊂ S
    such that |S_sel| ≤ M.
    For each sentence s_i in S_sel
        find a sentence r_j in S_unsel
        which maximizes swap_score(s_i, r_j).
        if swap_score(s_i, r_j) > 0
        {
            S_sel = (S_sel − s_i) ∪ r_j;
            S_unsel = (S_unsel − r_j) ∪ s_i;
        }
swap_score(s_i, r_j)
{
    S_sel_new = (S_sel − s_i) ∪ r_j;
    if ( |S_sel_new| > M ) return -1;
    else return TYPE(S_sel_new) − TYPE(S_sel);
}
Figure 1: The pseudo-code for the MWC algorithm.
The inputs are M and S; the output is S_sel.
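The swap-based selection in Figure 1 can be sketched in Python as below. This is an illustrative reading rather than the authors' code. Three things are assumptions: sentences are represented as lists of tokens, the random initial set is built by greedily filling the token budget in shuffled order, and the `seed` parameter exists only to make the sketch reproducible.

```python
import random
from typing import List, Sequence, Set


def num_tokens(sentences: Sequence[Sequence[str]], idxs: Set[int]) -> int:
    """|S_sel|: total number of tokens in the chosen sentences."""
    return sum(len(sentences[i]) for i in idxs)


def num_types(sentences: Sequence[Sequence[str]], idxs: Set[int]) -> int:
    """TYPE(S_sel): number of distinct word types in the chosen sentences."""
    return len({w for i in idxs for w in sentences[i]})


def mwc_select(sentences: List[List[str]], max_tokens: int, seed: int = 0) -> Set[int]:
    """Return indices of selected sentences, using at most `max_tokens` tokens."""
    rng = random.Random(seed)
    order = list(range(len(sentences)))
    rng.shuffle(order)

    # Random initial S_sel with |S_sel| <= M (greedy fill is an assumption).
    selected: Set[int] = set()
    budget = 0
    for i in order:
        if budget + len(sentences[i]) <= max_tokens:
            selected.add(i)
            budget += len(sentences[i])
    unselected = set(order) - selected

    # One pass over the initial selection, swapping when coverage improves.
    for s in list(selected):
        base_types = num_types(sentences, selected)
        best_r, best_score = None, 0
        for r in unselected:
            candidate = (selected - {s}) | {r}
            if num_tokens(sentences, candidate) > max_tokens:
                continue  # swap_score would be -1
            score = num_types(sentences, candidate) - base_types
            if score > best_score:
                best_r, best_score = r, score
        if best_r is not None:
            selected.remove(s)
            selected.add(best_r)
            unselected.remove(best_r)
            unselected.add(s)
    return selected
```

Each candidate swap recomputes the type count from scratch, which keeps the sketch close to the pseudo-code at the cost of speed; an incremental count would be preferable on a corpus of realistic size.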
3.3 Basic POS Tagging Model
It is well known that a POS tagger can be
trained with an HMM (Weischedel et al., 1993).
Given a trained model, the most likely tag sequence
$\hat{T} = \{t_1, t_2, \ldots, t_n\}$ is computed for the
input sentence $W = \{w_1, w_2, \ldots, w_n\}$:
$$\hat{T} = \arg\max_{T} P(T \mid W) = \arg\max_{T} P(W \mid T)\, P(T)$$
The transition probability P(T) is approxi-
mated by a trigram model:
$$P(T) \approx p(t_1)\, p(t_2 \mid t_1) \prod_{i=3}^{n} p(t_i \mid t_{i-1}, t_{i-2}),$$
and the observation probability P(W|T) is com-
puted by
$$P(W \mid T) \approx \prod_{i=1}^{n} p(w_i \mid t_i).$$
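To make the factorization concrete, the following Python sketch scores a tagged sentence under the trigram HMM by summing the logs of the transition and observation terms. The dictionary-based parameter tables and the function names are assumptions made for illustration. The brute-force decoder enumerates every tag sequence and is only practical for toy inputs; a real tagger would use Viterbi search instead.

```python
import itertools
import math
from typing import Dict, Sequence, Tuple


def log_joint(
    words: Sequence[str],
    tags: Sequence[str],
    p_init: Dict[str, float],                    # p(t1)
    p_bigram: Dict[Tuple[str, str], float],      # p(t2 | t1), keyed (t1, t2)
    p_trigram: Dict[Tuple[str, str, str], float],  # keyed (t_{i-2}, t_{i-1}, t_i)
    p_emit: Dict[Tuple[str, str], float],        # p(w | t), keyed (w, t)
) -> float:
    """Return log[P(W|T) P(T)]; -inf if any factor is zero."""
    total = 0.0
    for i, (w, t) in enumerate(zip(words, tags)):
        if i == 0:
            trans = p_init.get(t, 0.0)
        elif i == 1:
            trans = p_bigram.get((tags[0], t), 0.0)
        else:
            trans = p_trigram.get((tags[i - 2], tags[i - 1], t), 0.0)
        emit = p_emit.get((w, t), 0.0)
        if trans == 0.0 or emit == 0.0:
            return float("-inf")
        total += math.log(trans) + math.log(emit)
    return total


def decode_brute_force(words, tagset, p_init, p_bigram, p_trigram, p_emit):
    """Arg max over all tag sequences (exponential; illustration only)."""
    candidates = itertools.product(tagset, repeat=len(words))
    return max(
        candidates,
        key=lambda tags: log_joint(words, tags, p_init, p_bigram, p_trigram, p_emit),
    )
```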
3.4 Combined Models
From the two data sources, two separate trigram
taggers have been trained (Tanno from manually
annotated data and Tproj from projected data).
This section considers ways of combining them
into a single tagger. The key insight that drives
our approach is based on reducing the effect of
unknown words. We see the two data sources as
complementary in that the large projected data
source has better word coverage while the man-
ually labeled one is good at providing tag-to-tag
transitions. Based on this principle, one way of
merging these two taggers into a single HMM
(denoted as Tinterp) is to use interpolation:
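$$p_{\text{interp}}(w \mid t) = \lambda \times p_{\text{anno}}(w \mid t) + (1-\lambda) \times p_{\text{proj}}(w \mid t)$$
$$p_{\text{interp}}(t_i \mid t_{i-1}, t_{i-2}) = p_{\text{anno}}(t_i \mid t_{i-1}, t_{i-2})$$
where λ is a tunable weighting parameter¹ of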
the merged tagger. This approach may be prob-
lematic because it forces the model to always
include some fraction of poor parameter values.
Therefore, we propose to estimate the observa-
tion probabilities using backoff. The parameters
of Tback are estimated as follows:
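$$p_{\text{back}}(w \mid t) = \begin{cases} \alpha(t) \times p_{\text{anno}}(w \mid t) & \text{if } p_{\text{anno}}(w \mid t) > 0 \\ \beta(t) \times p_{\text{proj}}(w \mid t) & \text{if } p_{\text{anno}}(w \mid t) = 0 \end{cases}$$
$$p_{\text{back}}(t_i \mid t_{i-1}, t_{i-2}) = p_{\text{anno}}(t_i \mid t_{i-1}, t_{i-2})$$
where α(t) is a discounting coefficient and β(t)
is set to satisfy $\sum_{\text{all words } w} p_{\text{back}}(w \mid t) = 1$.
The discounting coefficient is computed using
the Witten-Bell discounting method:
$$\alpha(t) = \frac{C_{\text{anno}}(t)}{C_{\text{anno}}(t) + S_{\text{anno}}(t)},$$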
where $C_{\text{anno}}(t)$ is the count of tokens whose
tag is t in the manually annotated corpus and
$S_{\text{anno}}(t)$ is the number of distinct word types seen with tag t.
In other words, we trust the parameter estimates
from the model trained on manual annotation by
default unless they are based on unreliable statistics.
¹ In our experiments, the value of λ is set to 0.8 based
on held-out data.
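The backoff observation model can be sketched in Python as below. The normalization constraint determines β(t): since the α(t) branch contributes total mass α(t), the remaining mass 1 − α(t) must be spread over words unseen with tag t in the manual data, giving $\beta(t) = (1 - \alpha(t)) / (1 - \sum_{w:\, p_{\text{anno}}(w \mid t) > 0} p_{\text{proj}}(w \mid t))$. This closed form is derived here from the stated definitions and is not quoted from the paper. The function name, the list-of-pairs input format, and the handling of tags that never occur in the manual data are assumptions.

```python
from collections import Counter
from typing import Callable, List, Tuple

TaggedToken = Tuple[str, str]  # (word, tag)


def build_backoff_emission(
    anno: List[TaggedToken], proj: List[TaggedToken]
) -> Callable[[str, str], float]:
    """Return a function p_back(w, t) built from manual and projected data."""
    anno_pairs, proj_pairs = Counter(anno), Counter(proj)
    anno_tags = Counter(t for _, t in anno)      # C_anno(t)
    proj_tags = Counter(t for _, t in proj)
    anno_types = Counter(t for _, t in anno_pairs)  # S_anno(t)

    def p_anno(w: str, t: str) -> float:
        return anno_pairs[(w, t)] / anno_tags[t] if anno_tags[t] else 0.0

    def p_proj(w: str, t: str) -> float:
        return proj_pairs[(w, t)] / proj_tags[t] if proj_tags[t] else 0.0

    def p_back(w: str, t: str) -> float:
        c, s = anno_tags[t], anno_types[t]
        if c == 0:
            # Assumption: a tag absent from the manual data uses p_proj alone.
            return p_proj(w, t)
        alpha = c / (c + s)  # Witten-Bell discount
        if anno_pairs[(w, t)] > 0:
            return alpha * p_anno(w, t)
        # Mass that p_proj(.|t) places on words already covered manually.
        covered = sum(p_proj(v, tag) for (v, tag) in anno_pairs if tag == t)
        leftover = 1.0 - covered
        if leftover <= 0.0:
            return 0.0  # projected data offers no new words for this tag
        beta = (1.0 - alpha) / leftover
        return beta * p_proj(w, t)

    return p_back
```

As a small check of the arithmetic: with manual tokens a, a, b tagged N and projected tokens a, c, d, d tagged N, α(N) = 3/5, so a and b receive 0.4 and 0.2, and the remaining 0.4 is divided between c and d in proportion to their projected probabilities.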
4 Experiments
We conducted a suite of experiments to inves-
tigate the effect of allowing a small amount of
manually annotated data in conjunction with
using annotations projected from English. We
first establish a baseline of training on projected
data alone in Section 4.1. It is an adaptation of
the approach described by Yarowsky and Ngai
(2001). Next, we consider the case of using
manually annotated data alone in Section 4.2.
We show that there is an increase in accuracy
when the MWC active learning strategy is used.
In Section 4.3, we show that with an appro-
priate merging strategy, a tagger trained from
both data sources achieves higher accuracy. Fi-
nally, in Section 4.4, we evaluate our approach
against other semi-supervised methods to ver-
ify that the projected annotations, though noisy,
contain useful information.
We use an English-Chinese Foreign Broadcast
Information Service (FBIS) corpus as the data
source for the projected annotation. We sim-
ulated the manual annotation process by using
the POS tags provided by the Chinese Treebank
version 4 (CHTB). We used about a thousand
sentences from CHTB as held-out data. The re-
maining sentences are split into ten-fold cross
validation sets. Each test set contains 1400 sen-
tences. Training data are selected (using MWC)
from the remaining 12600 sentences. The re-
ported results are the average of the ten trials.
One tagger is considered to be better than an-
other if, according to the paired t-test, we are
at least 95% confident that their difference in
accuracy is non-zero. Performance is measured
in terms of the percentage of correctly tagged
tokens in the test data. For comparability with
Tproj (which assumes no availability of manu-
ally annotated data), most experimental results
are reported with respect to the reduced Core
Tag gold standard; evaluation against the full
33 CHTB tag gold standard is reported in Sec-
tion 4.4.
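The evaluation protocol above (token accuracy per fold, then a paired t-test over the ten folds) can be sketched with the Python standard library. The function names are hypothetical. The constant 2.262 is the two-sided 95% critical value of Student's t distribution with 9 degrees of freedom, so the default threshold only applies when exactly ten paired trials are compared.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence

# Two-sided 95% critical value for Student's t with 9 degrees of freedom
# (ten paired trials); other fold counts need a different value.
T_CRIT_95_DF9 = 2.262


def token_accuracy(gold: Sequence[Sequence[str]], predicted: Sequence[Sequence[str]]) -> float:
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        for g, p in zip(gold_sent, pred_sent):
            correct += int(g == p)
            total += 1
    return correct / total


def paired_t_statistic(acc_a: List[float], acc_b: List[float]) -> float:
    """t statistic for the per-fold accuracy differences of two taggers."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    sd = stdev(diffs)
    if sd == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


def significantly_different(
    acc_a: List[float], acc_b: List[float], t_crit: float = T_CRIT_95_DF9
) -> bool:
    """True if the mean accuracy difference is non-zero at the given level."""
    return abs(paired_t_statistic(acc_a, acc_b)) > t_crit
```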
4.1 Tagger Trained from Projected
Data
To determine the quality of Tproj for Chinese,
we replicate the POS-tagging experiment in
Yarowsky and Ngai (2001). Trained on all pro-
jected data, the tagger has an accuracy of 58.2%
on test sentences. The low accuracy rate sug-
gests that the projected data is indeed very
noisy. To reduce the noise in the projected data,
Yarowsky and Ngai developed a re-estimation
technique based on the observation that words
in French, English and Czech have a strong ten-
dency to exhibit only a single core POS tag
and very rarely have more than two. Apply-
ing the same re-estimation technique that favors
this bias to the projected Chinese data raises
the final tagger accuracy to only 59.1%. That
re-estimation helped so little for English-Chinese projection
suggests that the dissimilarity between the
two languages is an important factor. A related
reason for the low accuracy is poor
word alignment in the English-Chinese corpus.
As a further noise reduction step, we automat-
ically filter out sentence pairs that were poorly
aligned (i.e., the sentence pairs had too many
unaligned words or too many one-to-many align-
ments). This results in a corpus of about 9000
FBIS sentences. A tagger trained on the filtered
data has an improved accuracy of 64.5%. We
take this to be Tproj used in later experiments.
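The filtering step can be pictured with a short Python sketch. The paper does not give the thresholds or say on which side the counts are taken, so everything specific here is an assumption: the function name, the two ratio thresholds of 0.3, and the choice to count unaligned and multiply aligned words on the Chinese side.

```python
from collections import Counter
from typing import List, Tuple


def keep_sentence_pair(
    num_chinese_words: int,
    alignment: List[Tuple[int, int]],
    max_unaligned_ratio: float = 0.3,    # hypothetical threshold
    max_one_to_many_ratio: float = 0.3,  # hypothetical threshold
) -> bool:
    """Decide whether a sentence pair is aligned well enough to keep.

    `alignment` holds (english_index, chinese_index) pairs (assumed format).
    """
    if num_chinese_words == 0:
        return False
    links_per_chinese = Counter(c_idx for _, c_idx in alignment)
    positions = range(num_chinese_words)
    unaligned = sum(1 for c in positions if links_per_chinese[c] == 0)
    one_to_many = sum(1 for c in positions if links_per_chinese[c] > 1)
    return (
        unaligned / num_chinese_words <= max_unaligned_ratio
        and one_to_many / num_chinese_words <= max_one_to_many_ratio
    )
```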
4.2 Taggers Trained from Manually
Labeled Data
This experiment verifies that the Maximum
Word Coverage (MWC) selection scheme pre-
sented in Section 3.2 is helpful in selecting data
for training Tanno. We compare it against random
selection. Figure 2 plots the taggers’ performance
on test sentences as the number of
manually annotated tokens increases from 100 to
30,000. We see that the taggers trained on data
selected by MWC outperform those trained on
randomly selected data. Thus, in the main ex-
periments, we always use MWC to select the set
of manually tagged data for training Tanno.
[Figure 2 plot: accuracy (y-axis, 0.35 to 0.9) against number of annotated tokens (x-axis, 100 to 30,000); curves: Random, MWC.]
Figure 2: A comparison between MWC and random selection.
4.3 Evaluation of the Combined
Taggers
[Figure 3 plot: accuracy (y-axis, 0.4 to 0.9) against number of annotated tokens (x-axis, 100 to 30,000); curves: proj, anno, concat, interp, back.]
Figure 3: A comparison of the proposed backoff approach against alternative methods of combining Tproj and Tanno.
To investigate how Tanno and Tproj might be
merged to form a higher quality tagger, we con-
duct an experiment to evaluate the different
alternatives described in Section 3.4: Tinterp
and Tback. They are compared against three
baselines: Tanno, Tproj, and Tconcat, a tagger
trained from the concatenation of the two data
sources. To determine the effect of manual an-
notation, we vary the size of the training set for
Tanno from 100 tokens (fewer than 10 sentences)
to 30,000 tokens (about 1000 sentences). The
learning curves are plotted in Figure 3. The re-
sult suggests that Tback successfully incorporates
information from both the manually annotated
data and the projected data. The improvement
over training on manually annotated data alone
(Tanno) is especially high when fewer than 10,000
manually annotated tokens are available. As expected,
Tinterp and Tconcat perform worse than
Tanno because they are not as effective at dis-
counting the erroneous projected annotations.
4.4 Comparisons with Other
Semi-Supervised Approaches
This experiment evaluates the proposed back-
off approach against two other semi-supervised
approaches: self-training (denoted as Tself) and
EM (denoted as Tem). Both start with a fully su-
pervised model (Tanno) and iteratively improve
it by seeing more unannotated data.² As discussed
earlier, a major difference between our
proposed approach and the bootstrapping meth-
ods is that our approach makes use of anno-
tations projected from English while the boot-
strapping methods rely on unannotated data
alone. To investigate the effect of leveraging
English resources, we use the Chinese portion
of the FBIS parallel corpus (the same 9000
sentences as the training corpus of Tproj but
without the projected tags) as the unannotated
data source for the bootstrapping methods.
Figure 4 compares the four learning curves.
We have evaluated them both in terms of the
Core Tag gold standard and in terms of Full
Tag gold standard. Although all three ap-
proaches produce taggers with higher accuracies
than that of Tanno, our backoff approach outper-
forms both self-training and EM. The difference
is especially prominent when manual annota-
tion is severely limited. When more manual an-
notations are made available, the gap narrows;
however, the differences are still statistically sig-
nificant at 30,000 manually annotated tokens.
These results suggest that projected data have
more useful information than unannotated data.
² In our implementation of self-training, the top 10%
of the unannotated sentences with the highest confidence
scores are selected. The confidence score is computed as
$\log P(T \mid W) / n$, where n is the length of the sentence.
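The selection rule in the footnote can be sketched in Python as follows. The tuple layout `(tokens, predicted_tags, log_prob)` and the function name are assumptions, and the log probability is taken as already computed by the current tagger. Keeping at least one sentence when the pool is small is also an assumption of this sketch.

```python
from typing import List, Sequence, Tuple

# (tokens, predicted tags, log P(T|W)) for one automatically tagged sentence
Scored = Tuple[Sequence[str], Sequence[str], float]


def select_confident(scored: List[Scored], fraction: float = 0.10) -> List[Scored]:
    """Keep the top `fraction` of sentences by length-normalized log probability."""
    non_empty = [item for item in scored if len(item[0]) > 0]
    if not non_empty:
        return []
    ranked = sorted(
        non_empty,
        key=lambda item: item[2] / len(item[0]),  # log P(T|W) / sentence length
        reverse=True,
    )
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

Dividing by sentence length keeps long sentences from being penalized simply for containing more factors in the product.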
[Figure 4 plots: two panels, (a) and (b), each showing accuracy (y-axis, 0.4 to 0.9) against number of annotated tokens (x-axis, 100 to 30,000); curves: anno, self, em, back.]
Figure 4: A comparison of Backoff against self-training and EM. (a) Evaluation against the Core
Tag gold standard. (b) Evaluation against the Full Tag gold standard.
5 Discussion
While the experimental results support our intu-
ition that Tback is effective in making use of both
data sources, there are still two questions worth
addressing. First, there may be other ways of
estimating the parameters of a merged HMM
from the parameters of Tanno and Tproj. For ex-
ample, a natural way of merging the two taggers
into a single HMM (denoted as Tmerge) is to use
the values of the observation probabilities from
Tproj and the values of the transition probabili-
ties from Tanno:
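$$p_{\text{merge}}(w \mid t) = p_{\text{proj}}(w \mid t),$$
$$p_{\text{merge}}(t_i \mid t_{i-1}, t_{i-2}) = p_{\text{anno}}(t_i \mid t_{i-1}, t_{i-2}).$$
Another is the reverse of Tmerge (denoted as Trev_merge):
$$p_{\text{rev\_merge}}(w \mid t) = p_{\text{anno}}(w \mid t),$$
$$p_{\text{rev\_merge}}(t_i \mid t_{i-1}, t_{i-2}) = p_{\text{proj}}(t_i \mid t_{i-1}, t_{i-2}).$$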
Tmerge is problematic because it ignores all manual
word-tag annotations. Trev_merge, for its part, has a
learning curve that is nearly identical to that of Tanno
(graph not shown): it does not take advantage
of the broader word coverage of the
projected data, so it does not perform as well
as Tback. Trev_merge outperforms Tmerge when
trained from more than 2000 manually anno-
tated tokens. We make two observations from
this finding. One is that the differences between
pproj(ti|ti−1,ti−2) and panno(ti|ti−1,ti−2) are not
large. Another is that the success of the merged
HMM tagger hinges on the goodness of the ob-
servation probabilities, p(w|t). This is in accord
with our motivation in improving the reliability
of p(w|t) through backoff.
Second, while our experimental results sug-
gest that Tback outperforms self-training and
EM, these approaches are not incompatible with
one another. Because Tback is partially esti-
mated from the noisy corpus of projected an-
notations, it might be further improved by
applying a bootstrapping algorithm over the
noisy corpus (with the projected tags removed).
To test our hypothesis, we initialized the self-
training algorithm with a backoff tagger that
used 3000 manually annotated tokens. This led
to a slight but statistically significant improve-
ment, from 74.3% to 74.9%.
6 Conclusion and Future Work
In summary, we have shown that backoff is an ef-
fective technique for combining manually anno-
tated data with a large but noisy set of automat-
ically annotated data (from projection). Our ap-
proach is the most useful when a small amount
of annotated tokens is available. In our exper-
iments, the best results were achieved when we
used 3000 manually annotated tokens (approxi-
mately 100 sentences).
The current study points us to several direc-
tions for future work. One is to explore ways of
applying the proposed approach to other learn-
ing models. Another is to compare against other
methods of combining evidence from multiple
learners. Finally, we will investigate whether
the proposed approach can be adapted to more
complex tasks in which the output is not a class
label but a structure (e.g. parsing).
References
Yaser Al-Onaizan, Jan Curin, Michael Jahr, Kevin
Knight, John Lafferty, I. Dan Melamed, Franz-
Josef Och, David Purdy, Noah A. Smith, and
David Yarowsky. 1999. Statistical machine transla-
tion. Technical report, JHU. citeseer.nj.nec.com/al-
onaizan99statistical.html.
Jason Baldridge and Miles Osborne. 2003. Active learn-
ing for HPSG parse selection. In Proceedings of the 7th
Conference on Natural Language Learning, Edmonton,
Canada, June.
Avrim Blum and Tom Mitchell. 1998. Combining labeled
and unlabeled data with co-training. In Proceedings of
the 1998 Conference on Computational Learning The-
ory, pages 92–100, Madison, WI.
Stephen Clark, James Curran, and Miles Osborne. 2003.
Bootstrapping POS taggers using unlabelled data. In
Proc. of the Computational Natural Language Learn-
ing Conference, pages 164–167, Edmonton, Canada,
June.
David A. Cohn, Zoubin Ghahramani, and Michael I. Jor-
dan. 1996. Active learning with statistical models.
Journal of Artificial Intelligence Research, 4:129–145.
Mona Diab and Philip Resnik. 2002. An unsupervised
method for word sense tagging using parallel corpora.
In Proceedings of the 40th Annual Meeting of the As-
sociation for Computational Linguistics, Philadelphia,
PA.
Bonnie J. Dorr. 1994. Machine translation divergences:
A formal description and proposed solution. Compu-
tational Linguistics, 20(4):597–635.
Sean P. Engelson and Ido Dagan. 1996. Minimizing manual
annotation cost in supervised training from corpora.
In Proceedings of the 34th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 319–326,
Santa Cruz, CA.
Eitan Gurari. 1989. An Introduction to the Theory of
Computation. Ohio State University Computer Sci-
ence Press.
Rebecca Hwa, Philip S. Resnik, Amy Weinberg, and
Okan Kolak. 2002. Evaluating translational corre-
spondence using annotation projection. In Proceed-
ings of the 40th Annual Meeting of the Association for
Computational Linguistics.
Rada Mihalcea. 2004. Co-training and self-training for
word sense disambiguation. In Proceedings of the Con-
ference on Computational Natural Language Learning
(CoNLL-2004).
Kamal Nigam, Andrew McCallum, Sebastian Thrun, and
Tom Mitchell. 1999. Text Classification from Labeled
and Unlabeled Documents using EM. Machine Learn-
ing, 1(34).
David Pierce and Claire Cardie. 2001. Limitations of
co-training for natural language learning from large
datasets. In Proceedings of the 2001 Conference on
Empirical Methods in Natural Language Processing
(EMNLP-01), pages 1–9, Pittsburgh, PA.
Adwait Ratnaparkhi. 1996. A maximum entropy model
for part-of-speech tagging. In Eric Brill and Kenneth
Church, editors, Proceedings of the Conference on Em-
pirical Methods in Natural Language Processing, pages
133–142. Association for Computational Linguistics,
Somerset, New Jersey.
Anoop Sarkar. 2001. Applying co-training methods to
statistical parsing. In Proceedings of the Second Meet-
ing of the North American Association for Compu-
tational Linguistics, pages 175–182, Pittsburgh, PA,
June.
David A. Smith and Noah A. Smith. 2004. Bilingual
parsing with factored estimation: Using English to
parse Korean. In Proceedings of the 2004 Conference
on Empirical Methods in Natural Language Processing
(EMNLP-04).
Ralph Weischedel, Richard Schwartz, Jeff Palmucci,
Marie Meteer, and Lance Ramshaw. 1993. Coping
with ambiguity and unknown words through proba-
bilistic models. Comput. Linguist., 19(2):361–382.
David Yarowsky and Grace Ngai. 2001. Inducing mul-
tilingual POS taggers and NP bracketers via robust
projection across aligned corpora. In Proceedings of
the Second Meeting of the North American Associa-
tion for Computational Linguistics, pages 200–207.
David Yarowsky. 1995. Unsupervised word sense
disambiguation rivaling supervised methods. In Proceedings
of the 33rd Annual Meeting of the Association
for Computational Linguistics, pages 189–196,
Cambridge, MA.
