In: Proceedings of CoNLL-2000 and LLL-PO00, pages 1-6, Lisbon, Portugal, 2000. 
Learning in Natural Language: Theory and Algorithmic 
Approaches* 
Dan Roth 
Department of Computer Science 
University of Illinois at Urbana-Champaign 
1304 W Springfield Ave., Urbana, IL 61801 
danr@cs, uiuc. edu 
Abstract 
This article summarizes work on developing a 
learning theory account for the major learning 
and statistics based approaches used in natural 
 processing. It shows that these ap- 
proaches can all be explained using a single dis- 
tribution free inductive principle related to the 
pac model of learning. Furthermore, they all 
make predictions using the same simple knowl- 
edge representation - a linear representation 
over a common feature space. This is signifi- 
cant both to explaining the generalization and 
robustness properties of these methods and to 
understanding how these methods might be ex- 
tended to learn from more structured, knowl- 
edge intensive examples, as part of a learning 
centered approach to higher level natural lan- 
guage inferences. 
1 Introduction 
Many important natural  inferences 
can be viewed as problems of resolving phonetic, 
syntactic, semantics or pragmatics ambiguities, 
based on properties of the surrounding context. 
It is generally accepted that a learning compo- 
nent must have a central role in resolving these 
context sensitive ambiguities, and a significant 
amount of work has been devoted in the last few 
years to developing learning methods for these 
tasks, with considerable success. Yet, our un- 
derstanding of when and why learning works in 
this domain and how it can be used to support 
increasingly higher level tasks is still lacking. 
This article summarizes work on developing a 
learning theory account for the major learning 
approaches used in NL. 
While the major statistics based methods 
used in NLP are typically developed with a 
* This research is supported by NSF grants IIS-9801638, 
SBR-9873450 and IIS-9984168. 
Bayesian view in mind, the Bayesian principle 
cannot directly explain the success and robust- 
ness of these methods, since their probabilistic 
assumptions typically do not hold in the data. 
Instead, we provide this explanation using a sin- 
gle, distribution free inductive principle related 
to the pac model of learning. We describe the 
unified learning framework and show that, in 
addition to explaining the success and robust- 
ness of the statistics based methods, it also ap- 
plies to other machine learning methods, such 
as rule based and memory based methods. 
An important component of the view devel- 
oped is the observation that most methods use 
the same simple knowledge representation. This 
is a linear representation over a new feature 
space - a transformation of the original instance 
space to a higher dimensional and more expres- 
sive space. Methods vary mostly algorithmicly, 
in ways they derive weights for features in this 
space. This is significant both to explaining 
the generalization properties of these methods 
and to developing an understanding for how and 
when can these methods be extended to learn 
from more structured, knowledge intensive ex- 
amples, perhaps hierarchically. These issues are 
briefly discussed and we emphasize the impor- 
tance of studying knowledge representation and 
inference in developing a learning centered ap- 
proach to NL inferences. 
2 Learning Frameworks 
Generative probability models provide a princi- 
pled way to the study of statistical classification 
in complex domains such as NL. It is common 
to assume a generative model for such data, es- 
timate its parameters from training data and 
then use Bayes rule to obtain a classifier for 
this model. In the context of NL most clas- 
sifters are derived from probabilistic  
models which estimate the probability of a sen- 
tence 8 using Bayes rule, and then decompose 
this probability into a product of conditional 
probabilities according to the generative model. 
Pr(s) = Pr(wl, W2,... Wn) ---- 
= H~=lPr(wilwl,...wi-1) = H~=lPr(wilhi) 
where hi is the relevant history when predicting 
wi, and s is any sequence of tokens, words, part- 
of-speech (pos) tags or other terms. 
This general scheme has been used to de- 
rive classifiers for a variety of natural lan- 
guage applications including speech applica- 
tions (Rab89), pos tagging (Kup92; Sch95), 
word-sense ambiguation (GCY93) and context- 
sensitive spelling correction (Go195). While the 
use of Bayes rule is harmless, most of the work 
in statistical  modeling and ambiguity 
resolution is devoted to estimating terms of the 
form Pr(wlh ). The generative models used to 
estimate these terms typically make Markov or 
other independence assumptions. It is evident 
from studying  data that these assump- 
tions are often patently false and that there 
are significant global dependencies both within 
and across sentences. For example, when using 
(Hidden) Markov Model (HMM) as a generative 
model for pos tagging, estimating the probabil- 
ity of a sequence of tags involves assuming that 
the pos tag ti of the word wi is independent of 
other words in the sentence, given the preced- 
ing tag ti-1. It is not surprising therefore that 
this results in a poor estimate of the probabil- 
ity density function. However, classifiers built 
based on these false assumptions nevertheless 
seem to behave quite robustly in many cases. 
A different, distribution free inductive princi- 
ple that is related to the pac model of learning 
is the basis for the account developed here. 
In an instance of the agnostic variant of pac 
learning (Val84; Hau92; KSS94), a learner is 
given data elements (x, l) that are sampled ac- 
cording to some fixed but arbitrary distribu- 
tion D on X x {0, 1}. X is the instance space 
and I E {0, 1} is the label 1. D may simply re- 
flect the distribution of the data as it occurs 
"in nature" (including contradictions) without 
assuming that the labels are generated accord- 
ing to some "rule". Given a sample, the goal 
of the learning algorithm is to eventually out- 
put a hypothesis h from some hypothesis class 
7/ that closely approximates the data. The 
1The model can be extended to deal with any discrete 
or continuous range of the labels. 
true error of the hypothesis h is defined to 
be errorD(h) = Pr(x,O~D\[h(x) 7~ If, and the 
goal of the (agnostic) pac learner is to com- 
pute, for any distribution D, with high prob- 
ability (> 1 -5), a hypothesis h E 7/ with 
true error no larger than ~ + inffhenerrorD(h). 
In practice, one cannot compute the true er- 
ror errorD(h). Instead, the input to the learn- 
ing algorithm is a sample S = {(x i,l)}i=li m of 
m labeled examples and the learner tries to 
find a hypothesis h with a small empirical er- 
ror errors(h) = I{x e Slh(x) ¢ l}l/ISl, and 
hopes that it behaves well on future examples. 
The hope that a classifier learned from a train- 
ing set will perform well on previously unseen 
examples is based on the basic inductive prin- 
ciple underlying learning theory (Val84; Vap95) 
which, stated informally, guarantees that if the 
training and the test data are sampled from the 
same distribution, good performance on large 
enough training sample guarantees good per- 
formance on the test data (i.e., good "true" er- 
ror). Moreover, the quality of the generalization 
is inversely proportional to the expressivity of 
the class 7-/. Equivalently, for a fixed sample 
size IsI, the quantified version of this princi- 
ple (e.g. (Hau92)) indicates how much can one 
count on a hypothesis selected according to its 
performance on S. Finally, notice the underly- 
ing assumption that the training and test data 
are sampled from the same distribution; this 
framework addresses this issue. (See (GR99).) 
In our discussion functions learned over the 
instance space X are not defined directly over 
the raw instances but rather over a transforma- 
tion of it to a feature space. A feature is an in- 
dicator function X : X ~ {0, 1} which defines a 
subset of the instance space - all those elements 
in X which are mapped to 1 by X- X denotes 
a class of such functions and can be viewed as 
a transformation of the instance space; each ex- 
ample (Xl,... xn) E X is mapped to an example 
(Xi,...Xlxl) in the new space. We sometimes 
view a feature as an indicator function over the 
labeled instance space X x {0, 1} and say that 
X(x, l) = 1 for examples x E x(X) with label l. 
3 Explaining Probabilistic Methods 
Using the abovementioned inductive principle 
we describe a learning theory account that ex- 
plains the success and robustness of statistics 
based classifiers (Rot99a). A variety of meth- 
ods used for learning in NL are shown to make 
their prediction using Linear Statistical Queries 
(LSQ) hypotheses. This is a family of linear 
predictors over a set of features which are di- 
rectly related to the independence assumptions 
of the probabilistic model assumed. The success 
of these classification methods is then shown to 
be due to the combination of two factors: 
• Low expressive power of the derived classifier. 
• Robustness properties shared by all linear sta- 
tistical queries hypotheses. 
Since the hypotheses are computed over a fea- 
ture space chosen so that they perform well on 
training data, learning theory implies that they 
perform well on previously unseen data, irre- 
spective of whether the underlying probabilistic 
assumptions hold. 
3.1 Robust Learning 
This section defines a learning algorithm and 
a class of hypotheses with some generaliza- 
tion properties, that capture many probabilis- 
tic learning methods used in NLP. The learn- 
ing algorithm is a Statistical Queries(SQ) algo- 
rithm (Kea93). An SQ algorithm can be viewed 
as a learning algorithm that interacts with its 
environment in a restricted way. Rather than 
viewing examples, the algorithm only requests 
the values of various statistics on the distribu- 
tion of the examples to construct its hypothesis. 
(E.g. "What is the probability that a randomly 
chosen example (x, l) has xi = 0 and l = 1"?) 
A statistical query has the form IX, l, 7-\], where 
X 6 X is a feature, l 6 {0, 1} is a further (op- 
tional) restriction imposed on the query and ~" 
is an error parameter. A call to the SQ oracle 
returns an estimate ~n of \[x,z,~\] 
P,\[xDj\] = PrD{ (X, i)lx(x) = 1 A i = l} 
which satisfies \]15x - Px\] < T. (We usually omit 
T and/or l from this notation.) A statistical 
queries algorithm is a learning algorithm that 
constructs its hypothesis only using information 
received from an SQ oracle. An algorithm is 
said to use a query space X if it only makes 
queries of the form \[X, l, T\] where X 6 A'. An 
SQ algorithm is said to be a good learning al- 
gorithm if, with high probability, it outputs a 
hypothesis h with small error, using sample size 
that is polynomial in the relevant parameters. 
Given a query \[X, l, T\] the SQ oracle is sim- 
ulated by drawing a large sample S of labeled 
examples (x, l) according to D and evaluating 
Prs\[x(x, l)\] = I{(x, l) : X(x, l) = ll}/ISl. 
Chernoff bounds guarantee that the nUmber of 
examples required to achieve tolerance T with 
probability at least 1 - 5 is polynomial in 1/T 
and log 1/5. (See (Zea93; Dec93; AD95)). 
Let X be a class of features and f : {0, 1} 
a function that depends only on the values 
~D for E X. Given x 6 X, a Linear Statis- \[x,~\] X 
tical Queries (LSQ) hypothesis predicts 
l argmaxte{o,1} ~xeX ^ D = f\[xj\] ({P\[x,z\] })" X(x). 
Clearly, the LSQ is a linear discriminator over 
the feature space A', with coefficients f that 
are computed given (potentially all) the values 
^D P\[x,t\]" The definition generalizes naturally to 
non-binary classifiers; in this case, the discrim- 
inator between predicting l and other values is 
linear. A learning algorithm that outputs an 
LSQ hypothesis is called an LSQ algorithm. 
Example 3.1 The naive Bayes predictor 
(DH73) is derived using the assumption that 
given the label l E L the features' values are 
statistically independent. Consequently, the 
Bayes optimal prediction is given by: 
h(x) = argmaxteLH~n=l Pr(xill)Pr(1), 
where Pr(1) denotes the prior probability of l 
(the fraction of examples labeled l) and Pr(xill) 
are the conditional feature probabilities (the 
fraction of the examples labeled l in which the 
ith feature has value xi). Therefore, we get: 
Claim: The naive Bayes algorithm is an LSQ 
algorithm over a set ,.~ which consists of n + 1 
features: X0 --- 1, Xi -- xi for i = 1,...,n 
and where f\[1J\]O = log/5\[~z\],, and f\[x,J\]O = 
^D ^D logP\[x,,l\]/P\[1,l\], i = 1,... ,n. 
The observation that the LSQ hypothesis 
is linear over X' yields the first generalization 
property of LSQ. VC theory implies that the 
VC dimension of the class of LSQ hypothe- 
ses is bounded above by IXI. Moreover, if 
the LSQ hypothesis is sparse and does not 
make use of unobserved features in X (as in 
Ex. 3.1) it is bounded by the number of features 
used (Rot00). Together with the basic general- 
ization property described above this implies: 
Corollary 3.1 For LSQ, the number of train- 
ing examples required in order to maintain a 
specific generalization performance guarantee 
scales linearly with the number o/features used. 
3 
The robustness property of LSQ can be cast 
for the case in which the hypothesis is learned 
using a training set sampled according to a dis- 
tribution D, but tested over a sample from D ~. 
It still performs well as long as the distributional 
distance d(D, D') is controlled (Rot99a; Rot00). 
Theorem 3.2 Let .A be an SQ(T, X) learning 
algorithm for a function class ~ over the distri- 
bution D and assume that d(D, D I) < V (for V 
inversely polynomial in T). Then .A is also an 
SQ(T, ,~') learning algorithm for ~ over D I. 
Finally, we mention that the robustness of the 
algorithm to different distributions depends on 
the sample size and the richness of the feature 
class 2¢ plays an important role here. Therefore, 
for a given size sample, the use of simpler fea- 
tures in the LSQ representation provides better 
robustness. This, in turn, can be traded off with 
the ability to express the learned function with 
an LSQ over a simpler set of features. 
3.2 Additional Examples 
In addition to the naive Bayes (NB) classifier 
described above several other widely used prob- 
abilistic classifiers can be cast as LSQ hypothe- 
ses. This property is maintained even if the NB 
predictor is generalized in several ways, by al- 
lowing hidden variables (GR00) or by assuming 
a more involved independence structure around 
the target variable. When the structure is mod- 
eled using a general Bayesian network (since we 
care only about predicting a value for a single 
variable having observed the others) the Bayes 
optimal predictor is an LSQ hypothesis over fea- 
tures that are polynomials X = IIxilxi2 • .. xik of 
degree that depends on the number of neighbors 
of the target variable. A specific case of great 
interest to NLP is that of hidden Markov Mod- 
els. In this case there are two types of variables, 
state variables S and observed ones, O (Rab89). 
The task of predicting the value of a state vari- 
able given values of the others can be cast as an 
LSQ, where X C {S, O, 1} × {S, O, 1}, a suitably 
defined set of singletons and pairs of observables 
and state variables (Rot99a). 
Finally, Maximum Entropy (ME) models 
(Jay82; Rat97) are also LSQ models. In this 
framework, constrains correspond to features; 
the distribution (and the induced classifier) are 
defined in terms of the expected value of the fea- 
tures over the training set. The induced clas- 
sifter is a linear classifier whose weights are de- 
rived from these expectations; the weights axe 
computed iteratively (DR72) since no closed 
form solution is known for the optimal values. 
4 Learning Linear Classifiers 
It was shown in (Rot98) that several other 
learning approaches widely used in NL work 
also make their predictions by utilizing a lin- 
ear representation. The SNoW learning archi- 
tecture (Rot98; CCRR99; MPRZ99) is explic- 
itly presented this way, but this holds also for 
methods that are presented in different ways, 
and some effort is required to cast them this 
way. These include Brill's transformation based 
method (Bri95) 2, decision lists (Yax94) and 
back-off estimation (Kat87; CB95). 
Moreover, even memory-based methods 
(ZD97; DBZ99) can be cast in a similar 
way (Rot99b). They can be reformulated as 
feature-based algorithms, with features of types 
that are commonly used by other features-based 
learning algorithms in NLP; the prediction com- 
puted by MBL can be computed by a linear 
function over this set of features. 
Some other methods that have been recently 
used in  related applications, such as 
Boosting (CS99) and support vector machines 
are also making use of the same representation. 
At a conceptual level all learning methods 
are therefore quite similar. They transform the 
original input (e.g., sentence, sentence+pos in- 
formation) to a new, high dimensional, feature 
space, whose coordinates are typically small 
conjunctions (n-grams) over the original input. 
In this new space they search for a linear func- 
tion that best separates the training data, and 
rely on the inductive principle mentioned to 
yield good behavior on future data. Viewed this 
way, methods are easy to compare and analyze 
for their suitability to NL applications and fu- 
ture extensions, as we sketch below. 
The goal of blowing up the instance space to a 
high dimensional space is to increase the expres- 
sivity of the classifier so that a linear function 
could represent the target concepts. Within this 
space, probabilistic methods are the most lim- 
ited since they do not actually search in the 
eThis holds only in cases in which the TBL condi- 
tions do not depend on the labels, as in Context Sensi- 
tive Spelling (MB97) and Prepositional Phrase Attach- 
ment (BR94) and not in the general case. 
4 
space of linear functions. Given the feature 
space they directly compute the classifier. In 
general, even when a simple linear function gen- 
erates the training data, these methods are not 
guaranteed to be consistent with it (Rot99a). 
However, if the feature space is chosen so that 
they are, the robustness properties shown above 
become significant. Decision lists and MBL 
methods have advantages in their ability to rep- 
resent exceptions and small areas in the feature 
space. MBL, by using long and very specialized 
conjunctions (DBZ99) and decision lists, due to 
their functional form - a linear function with 
exponentially decreasing weights - at the cost 
of predicting with a single feature, rather than 
a combination (Go195). Learning methods that 
attempt to find the best linear function (relative 
to some loss function) are typically more flexi- 
ble. Of these, we highlight here the SNoW ar- 
chitecture, which has some specific advantages 
that favor NLP-like domains. 
SNoW determines the features' weights using 
an on-line algorithm that attempts to minimize 
the number of mistakes on the training data us- 
ing a multiplicative weight update rule (Lit88). 
The weight update rule is driven by the maxi- 
mum entropy principle (KW95). The main im- 
plication is that SNoW has significant advan- 
tages in sparse spaces, those in which a few of 
the features are actually relevant to the tar- 
get concept, as is typical in NLP. In domains 
with these characteristics, for a given number 
of training examples, SNoW generalizes better 
than additive update methods like perceptron 
and its close relative SVMs (Ros58; FS98) (and 
in general,it has better learning curves). 
Furthermore, although in SNoW the transfor- 
mation to a large dimensional space needs to be 
done explicitly (rather than via kernel functions 
as is possible in perceptron and SVMs) its use of 
variable size examples nevertheless gives it com- 
putational advantages, due to the sparse feature 
space in NLP applications. It is also significant 
for extensions to relational domain mentioned 
later. Finally, SNoW is a multi-class classifier. 
5 Future Research Issues 
Research on learning in NLP needs to be inte- 
grated with work on knowledge representation 
and inference to enable studying higher level NL 
tasks. We mention two important directions the 
implications on the learning issues. 
The unified view presented reveals that all 
methods blow up the dimensionality of the orig- 
inal space in essentially the same way; they gen- 
erate conjunctive features over the linear struc- 
ture of the sentence (i.e., n-gram like features in 
the word and/or pos space). 
This does not seem to be expressive enough. 
Expressing complex concepts and relations 
necessary for higher level inferences will re- 
quire more involved intermediate representa- 
tions ("features") over the input; higher order 
structural and semantic properties, long term 
dependencies and relational predicates need to 
be represented. Learning will stay manageable 
if done in terms of these intermediate represen- 
tations as done today, using functionally simple 
representations (perhaps cascaded). 
Inductive logic programming (MDR94; 
Coh95) is a natural paradigm for this. How- 
ever, computational limitations that include 
both learnability and subsumption render this 
approach inadequate for large scale knowledge 
intensive problems (KRV99; CR00). 
In (CR00) we suggest an approach that ad- 
dresses the generation of complex and relational 
intermediate representations and supports effi- 
cient learning on top of those. It allows the 
generation and use of structured examples which 
could encode relational information and long 
term functional dependencies. This is done us- 
ing a construct that defines "types" of (poten- 
tially, relational) features the learning process 
might use. These represent infinitely many fea- 
tures, and are not generated explicitly; only 
those present in the data are generated, on the 
fly, as part of the learning process. Thus it 
yields hypotheses that are as expressive as re- 
lational learners in a scalable fashion. This ap- 
proach, however, makes some requirements on 
the learning process. Most importantly, the 
learning approach needs to be able to process 
variable size examples. And, it has to be feature 
efficient in that its complexity depends mostly 
on the number of relevant features. This seems 
to favor the SNoW approach over other algo- 
rithms that learn the same representation. 
Eventually, we would like to perform infer- 
ences that depend on the outcomes of several 
different classifiers; together these might need to 
coherently satisfy some constrains arising from 
the sequential nature of the data or task and do- 
main specific issues. There is a need to study, 
along with learning and knowledge representa- 
tion, inference methods that suit this frame- 
work (KR97). Work in this direction requires a 
consistent semantics of the learners (Val99) and 
will have implications on the knowledge repre- 
sentations and learning methods used. Prelim- 
inary work in (PRO0) suggests several ways to 
formalize this problem and is evaluated in the 
context of identifying phrase structure. 

References 
J. A. Aslam and S. E. Decatur. Specification and simu- 
lation of statistical query algorithms for efficiency and 
noise tolerance. In COLT 1995, pages 437-446. 
E. Brill and P. Resnik. A rule-based approach to prepo- 
sitional phrase attachment disambiguation. In Proc. 
of COLING, 1994. 
E. Brill. Transformation-based error-driven learning 
and natural  processing: A case study in 
part of speech tagging. Computational Linguistics, 
21(4):543-565, 1995. 
M. Collins and J Brooks. Prepositional phrase attach- 
ment through a backed-off model. In WVLC 1995. 
A. Carleson, C. Cumby, J. Rosen, and D. Roth. 
The SNoW learning architecture. Technical Report 
UIUCDCS-R-99-2101, UIUC CS, May 1999. 
W. Cohen. PAC-learning recursive logic programs: Effi- 
cient algorithms. JAIR, 2:501-539, 1995. 
C. Cumby and D. Roth. Relational representations that 
facilitate learning. In KR 2000, pages 425-434. 
M. Collins and Y. Singer. Unsupervised models for name 
entity classification. In EMNLP- VLC'99. 
W. Daelemans, A. van den Bosch, and J. Zavrel. Forget- 
ting exceptions is harmful in  learning. Ma- 
chine Learning, 34(1-3):11-43, 1999. 
S. E. Decatur. Statistical queries and faulty PAC oracles. 
In COLT 1993, pages 262-268. 
R. O. Duda and P. E. Hart. Pattern Classification and 
Scene Analysis. Wiley, 1973. 
J. N. Darroch and D. Ratcliff. Generalized iterative 
scaling for log-linear models. Annals of Mathemati- 
cal Statistics, 43(5):1470-1480, 1972. 
Y. Freund and R. Schapire. Large margin classification 
using the Perceptron algorithm. In COLT 1998. 
W. Gale, K. Church, and D. Yarowsky. A method for 
disambiguating word senses in a large corpus. Com- 
puters and the Humanities, 26:415-439~ 1993. 
A. R. Golding. A Bayesian hybrid method for context- 
sensitive spelling correction. In Proceedings of the 3rd 
workshop on very large corpora, ACL-95, 1995. 
A. R. Golding and D. Roth. A Winnow based ap- 
proach to context-sensitive spelling correction. Ma- 
chine Learning, 34(1-3):107-130, 1999. 
A. Grove and D. Roth. Linear concepts and hidden vari- 
ables. Machine Learning, 2000. To Appear. 
D. Haussler. Decision theoretic generalizations of the 
PAC model for neural net and other learning appli- 
cations. Inform. Comput., 100(1):78-150, 1992. 
E. T. Jaynes. On the rationale of maximum-entropy 
methods. Proc. of the IEEE, 70(9):939-952, 1982. 
S. M. Katz. Estimation of probabilities from sparse data 
for the  model component of a speech recog- 
nizer. IEEE Transactions on Acoustics, speech, and 
Signal Processing, 35(3):400-401, 1987. 
M. Kearns. Efficient noise-tolerant learning from statis- 
tical queries. In COLT 1993, pages 392-401. 
R. Khardon and D. Roth. Learning to reason. Journal 
of the ACM, 44(5):697-725, Sept. 1997. 
R. Khardon, D. Roth, and L. G. Valiant. Relational 
learning for NLP using linear threshold elements. In 
IJCAI 1999, pages 911-917. 
M. J. Kearns, R. E. Schapire, and L. M. Sellie. To- 
ward efficient agnostic learning. Machine Learning, 
17(2/3):115-142, 1994. 
J. Kupiec. Robust part-of-speech tagging using a hid- 
den Markov model. Computer Speech and Language, 
6:225-242, 1992. 
J. Kivinen and M. K. Warmuth. Exponentiated gradi- 
ent versus gradient descent for linear predictors. In 
STOC, 1995. 
N. Littlestone. Learning quickly when irrelevant at- 
tributes abound: A new linear-threshold algorithm. 
Machine Learning, 2:285-318, 1988. 
L. Mangu and E. Brill. Automatic rule acquisition for 
spelling correction. In ICML 1997, pages 734-741. 
S. Muggleton and L. De Raedt. Inductive logic program- 
ming: Theory and methods. Journal of Logic Pro- 
gramming, 20:629-679, 1994. 
M. Munoz, V. Punyakanok, D. Roth, and D. Zimak. 
A learning approach to shallow parsing. In EMNLP- 
VLC'99, pages 168-178. 
V. Punyakanok and D. Roth. Inference with classifiers. 
Technical Report UIUCDCS-R-2000-2181, UIUC CS. 
L. R. Rabiner. A tutorial on hidden Markov models and 
selected applications in speech recognition. Proceed- 
ings of the IEEE, 77(2):257-285, 1989. 
A. Ratnaparkhi. A linear observed time statistical parser 
based on maximum entropy models. In EMNLP-97. 
F. Rosenblatt. The perceptron: A probabilistic model 
for information storage and organization in the brain. 
Psychological Review, 65:386-407, 1958. 
D. Roth. Learning to resolve natural  ambigui- 
ties: A unified approach. In AAAI'98, pages 806-813. 
D. Roth. Learning in natural . In IJCAI 1999, 
pages 898-904. 
D. Roth. Memory based learning in NLP. Technical Re- 
port UIUCDCS-R-99-2125, UIUC CS, March 1999. 
D. Roth. Learning in natural . Technical Re- 
port UIUCDCS-R-2000-2180, UIUC CS July 2000. 
H. Schfitze. Distributional part-of-speech tagging. In 
EACL 1995. 
L. G. Valiant. A theory of the learnable. Communica- 
tions of the ACM, 27(11):1134-1142, November 1984. 
L. G. Valiant. Robust logic. In STOC 1999. 
V. N. Vapnik. The Nature of Statistical Learning Theory. 
Springer-Verlag, New York, 1995. 
D. Yarowsky. Decision lists for lexical ambiguity resolu- 
tion: application to accent restoration in Spanish and 
French. In ACL 1994, pages 88-95. 
J. Zavrel and W. Daelemans. Memory-based learning: 
Using similarity for smoothing. In ACL, 1997. 
