Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL-X),
pages 1–5, New York City, June 2006. ©2006 Association for Computational Linguistics
A Mission for Computational Natural Language Learning

Walter Daelemans
CNTS Language Technology Group
University of Antwerp
Belgium
walter.daelemans@ua.ac.be
Abstract
In this presentation, I will look back at 10 years of CoNLL conferences and the state of the art of machine learning of language that is evident from this decade of research. My conclusion, intended to provoke discussion, will be that we currently lack a clear motivation or “mission” to survive as a discipline. I will suggest that a new mission for the field could be found in a renewed interest in theoretical work (which learning algorithms have a bias that matches the properties of language? what is the psycholinguistic relevance of learner design issues?), in more sophisticated comparative methodology, and in solving the problem of transfer, reusability, and adaptation of learned knowledge.
1 Introduction
When looking at ten years of CoNLL conferences, it is clear that the impact and the size of the conference have grown enormously over time. The technical papers you will find in these proceedings are now comparable in quality and impact to those of other distinguished conferences like the Conference on Empirical Methods in Natural Language Processing, or even the main conferences of ACL, EACL, and NAACL themselves. An important factor in the success of CoNLL has been the continued series of shared tasks (notice we don't use terms like challenges or competitions), which has produced a useful set of benchmarks for comparing learning methods, and which has gained wide interest in the field.
It should also be noted, however, that the success of the conferences is inversely proportional to the degree to which the original topics that motivated the conference are present in the programme.
Originally, the people driving CoNLL wanted it to be promiscuous (i) in the selection of partners (we wanted to associate with Machine Learning, Linguistics, and Cognitive Science conferences as well as with Computational Linguistics conferences) and (ii) in the range of topics to be presented. We wanted to encourage linguistically and psycholinguistically relevant machine learning work, and biologically inspired and innovative symbolic learning methods, and to present this work alongside the statistical and learning approaches that were at that time only gradually starting to become the mainstream in Computational Linguistics. It has turned out differently, and we should reflect on whether we have become too much of a mainstream computational linguistics conference ourselves, a back-off for the good papers that haven't made it into EMNLP or ACL because of the crazy rejection rates there (with EMNLP in its turn a back-off for good papers that haven't made it into ACL). Some of the work targeted by CoNLL has found a forum in meetings like the workshop on Psycho-computational Models of Human Language Acquisition, the International Colloquium on Grammatical Inference, the workshop on Morphological and Phonological Learning, etc. We should ask ourselves why we don't see more of this type of work in CoNLL. In the first part of the presentation I will sketch very briefly the history of SIGNLL and CoNLL and try to initiate some discussion on what a conference on Computational Language Learning should be doing in 2007 and after.
2 State of the Art in Computational Natural Language Learning
The second part of my presentation will be a discussion of the state of the art as it can be found in CoNLL (and EMNLP and the ACL conferences). The field can be divided into theoretical, methodological, and engineering work. There has been progress in theory and methodology, but perhaps not enough. I will argue that most progress has been made in engineering, most often resulting in incremental progress on specific tasks rather than in an increased understanding of how language can be learned from data.
Machine Learning of Natural Language (MLNL), or Computational Natural Language Learning (CoNLL), is a research area lying at the intersection of computational linguistics and machine learning. I would suggest that Statistical Natural Language Processing (SNLP) should be treated as part of MLNL, or perhaps even as a synonym. Symbolic machine learning methods belong to the same part of the ontology as statistical methods, but offer different solutions for specific problems: e.g., Inductive Logic Programming allows the elegant addition of background knowledge, memory-based learning has implicit similarity-based smoothing, etc.
There is no need here to explain the success of inductive methods in Computational Linguistics and why we are all such avid users of the technology: availability of data, fast production of systems with good accuracy, robustness and coverage, cheaper than linguistic labor. There is also no need here to explain that many of these arguments in favor of learning in NLP are bogus. Getting statistical and machine learning systems to work involves design, optimization, and smoothing issues that are something of a black art. For many problems, getting sufficient annotated data is expensive and difficult, our annotators don't sufficiently agree, and our trained systems are not really that good. My favorite example of the latter is part-of-speech tagging, which is considered a solved problem but still has error rates of 20-30% for the ambiguities that count, like verb-noun ambiguity. We are doing better than hand-crafted linguistic knowledge-based approaches, but from the point of view of the goal of robust language understanding, unfortunately, not that significantly better. Twice better than very bad is not necessarily any good. We have also implicitly redefined the goals of the field of Computational Linguistics, forgetting for example about quantification, modality, tense, inference, and a large number of other sentence and discourse semantics issues which do not fit the default classification-based supervised learning framework very well, or for which we don't have annotated data readily available. As a final irony, one of the reasons why learning methods have become so prevalent in NLP is their success in speech recognition. Yet there, too, this success is relative; the goal of spontaneous speaker-independent recognition is still far away.
2.1 Theory
There has been a lot of progress recently in theoretical machine learning (Vapnik, 1995; Jordan, 1999). Statistical Learning Theory and progress in Graphical Models theory have provided us with a well-defined framework in which we can relate different approaches like kernel methods, Naive Bayes, Markov models, maximum entropy approaches (logistic regression), perceptrons, and CRFs. Insight into the differences between generative and discriminative learning approaches has clarified the relations between different learning algorithms considerably.
However, this work does not tell us anything general about machine learning of language. Theoretical issues that should be studied in MLNL include, for example, which classes of learning algorithms are best suited for which type of language processing task, how much training data a given task requires, and which information sources are necessary and sufficient for learning a particular language processing task. These fundamental questions all relate to learning algorithm bias. Learning is a search process in a hypothesis space. Heuristic limitations on the search process and restrictions on the representations allowed for input and hypothesis representations together define this bias. There is not a lot of work on matching properties of learning algorithms with properties of language processing tasks, or more specifically on how the bias of particular (families of) learning algorithms relates to the hypothesis spaces of particular (types of) language processing tasks.
As an example of such a unifying approach, Roth (2000) shows that several different algorithms (memory-based learning, TBL, SNoW, decision lists, various statistical learners, ...) use the same type of knowledge representation: a linear representation over a feature space based on a transformation of the original instance space. However, the only relation to language here is the rather negative claim that this bias is not sufficient for learning higher-level language processing tasks.
As another example of this type of work, Memory-Based Learning (MBL) (Daelemans and van den Bosch, 2005), with its implicit similarity-based smoothing, storage of all training evidence, and uniform modeling of regularities, subregularities, and exceptions, has been proposed as having the right bias for language processing tasks. Language processing tasks are mostly governed by Zipfian distributions and high disjunctivity, which makes it difficult to draw a principled distinction between noise and exceptions, and which would put eager learning methods (i.e., most learning methods apart from MBL and kernel methods) at a disadvantage.
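To make this bias concrete, the following minimal sketch (not the actual MBL implementation; the features, labels, and data are invented for illustration) shows the core of a memory-based learner: it stores every training instance, so even a single low-frequency "exception" survives and can decide the label of matching test items, where an eager learner might have pruned it as noise:

```python
from collections import Counter

def mbl_classify(instance, memory, k=1):
    """Label an instance by its k nearest stored training examples.

    Distance is the overlap metric: the number of mismatching
    nominal feature values. Because all training evidence is kept,
    stored exceptions are never abstracted away.
    """
    def overlap(a, b):
        return sum(1 for x, y in zip(a, b) if x != y)

    neighbours = sorted(memory, key=lambda ex: overlap(instance, ex[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Toy inflection-like task: features are (final letter, gender),
# labels are invented suffixes. One exceptional instance is stored
# alongside the regulars rather than being discarded as noise.
memory = [
    (("e", "f"), "-n"),
    (("e", "f"), "-n"),
    (("e", "m"), "-n"),
    (("s", "n"), "-er"),  # a single stored exception
]
print(mbl_classify(("e", "f"), memory))  # regular case: "-n"
print(mbl_classify(("s", "n"), memory))  # the kept exception decides: "-er"
```

The point of the sketch is the absence of any training-time abstraction step: "learning" is just storage, and smoothing happens implicitly at classification time through similarity.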
More theoretical work in this area should make it possible to relate machine learner bias to properties of language processing tasks in a more fine-grained way, providing more insight into both language and learning. An avenue that has remained largely unexplored in this respect is the use of artificial data emulating properties of language processing tasks, making possible a much more fine-grained study of the influence of learner bias. However, research in this area will not be able to ignore the “no free lunch” theorem (Wolpert and Macready, 1995). Referring back to the problem of induction (Hume, 1739), this theorem can be interpreted as stating that no inductive algorithm is universally better than any other: the generalization performance of any inductive algorithm is zero when averaged over a uniform distribution of all possible classification problems (i.e., assuming a random universe). This means that the only way to test hypotheses about bias and necessary information sources in language learning is to perform empirical research, which makes a reliable experimental methodology necessary.
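As a hint of what such artificial data might look like, here is a minimal, hypothetical generator (the labeling rule, exception rate, and parameters are all invented for illustration) that produces classification instances whose feature-value frequencies follow a Zipf-like power law, so that learner bias could be probed under controlled conditions:

```python
import random

def zipf_weights(n, s=1.0):
    """Unnormalized Zipf-like weights: the value of rank r has weight 1/r**s."""
    return [1.0 / (r ** s) for r in range(1, n + 1)]

def make_instances(n_instances, n_values=50, exception_rate=0.05, seed=0):
    """Generate artificial instances with Zipfian-distributed feature values.

    The 'regular' label is the parity of the feature value; a small
    fraction of value types are designated exceptions and get the
    flipped label, mimicking the subregularities of linguistic data.
    """
    rng = random.Random(seed)
    values = list(range(n_values))
    exceptions = set(rng.sample(values, int(n_values * exception_rate)))
    data = []
    for v in rng.choices(values, weights=zipf_weights(n_values), k=n_instances):
        label = v % 2 if v not in exceptions else 1 - (v % 2)
        data.append((v, label))
    return data

sample = make_instances(10_000)
```

Because both the distribution and the rule/exception structure are under experimental control, one can vary them independently and measure how each learner's accuracy degrades, something no natural corpus allows.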
2.2 Methodology
Eithertoinvestigate theroleofdifferentinformation
sources in learning a task, or to investigate whether
the bias of some learning algorithm fits the proper-
ties of natural language processing tasks better than
alternative learning algorithms, comparative experi-
mentsarenecessary. Asanexampleofthelatter, we
may be interested in investigating whether part-of-
speech tagging improvestheaccuracy ofaBayesian
text classification system or not. As an example of
the former, we may be interested to know whether
a relational learner is better suited than a propo-
sitional learner to learn semantic function associa-
tion. This can be achieved by comparing the accu-
racy of the learner withand without the information
sourceordifferentlearnersonthesametask. Crucial
for objectively comparing algorithm bias and rele-
vance of information sources is a methodology to
reliably measure differences and compute their sta-
tistical significance. A detailed methodology has
been developed for this involving approaches like
k-fold cross-validation to estimate classifier quality
(in terms of measures derived from a confusion ma-
trix like accuracy, precision, recall, F-score, ROC,
AUC,etc.),aswellasstatistical techniques likeMc-
Nemar and paired cross-validation t-tests for deter-
mining the statistical significance of differences be-
tweenalgorithms orbetweenpresence orabsence of
information sources. This methodology is generally
accepted and used both in machine learning and in
most workin inductive NLP.
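As a concrete illustration of the paired-comparison part of this methodology, here is a minimal sketch of McNemar's test (the disagreement counts in the example are invented for illustration):

```python
def mcnemar_chi2(b, c):
    """McNemar's test statistic for two classifiers on the same test set.

    b = number of items classifier A got right and classifier B got wrong;
    c = number of items A got wrong and B got right.
    Items both got right or both got wrong are ignored: only the
    disagreements carry evidence about which classifier is better.
    Returns the chi-square statistic (1 degree of freedom) with the
    usual continuity correction; values above 3.84 correspond to p < 0.05.
    """
    if b + c == 0:
        return 0.0  # the classifiers never disagree
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical disagreement counts between two taggers on one test set:
stat = mcnemar_chi2(b=48, c=27)  # (21 - 1)**2 / 75 ≈ 5.33 > 3.84, significant
```

Note that, unlike a comparison of raw accuracies, the test conditions on the items where the two systems actually disagree, which is what makes it appropriate for paired outputs on a single test set.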
CoNLL has contributed a lot to this comparative work by producing a successful series of shared tasks, which has provided the community with a rich set of benchmark language processing tasks. Other competitive research evaluations like Senseval, the PASCAL challenges, and the NIST competitions have similarly tuned the field toward comparative learning experiments. In a typical comparative machine learning experiment, two or more algorithms are compared for a fixed sample selection, feature selection, feature representation, and (default) algorithm parameter setting over a number of trials (cross-validation), and if the measured differences are statistically significant, conclusions are drawn about which algorithm is better suited to the problem being studied and why (mostly in terms of algorithm bias). Sometimes different sample sizes are used to provide a learning curve, and sometimes parameters of (some of) the algorithms are optimized on training data, or heuristic feature selection is attempted, but this is the exception rather than common practice in comparative experiments.
Yet everyone knows that many factors potentially play a role in the outcome of a (comparative) machine learning experiment: the data used (the sample selection and the sample size), the information sources used (the features selected) and their representation (e.g., as nominal or binary features), the class representation (error coding, binarization of classes), and the algorithm parameter settings (most ML algorithms have various parameters that can be tuned). Moreover, all these factors are known to interact. For example, Banko and Brill (2001) demonstrated that for confusion set disambiguation, a prototypical disambiguation-in-context problem, the amount of data used dominates the effect of the bias of the learning method employed. The effect of training data size on the relevance of POS-tag information on top of lexical information in relation finding was studied by van den Bosch and Buchholz (2001): the positive effect of POS tags disappears with sufficient data. Daelemans et al. (2003) show that the joint optimization of feature selection and algorithm parameters significantly improves accuracy compared to sequential optimization. Results from comparative experiments may therefore not be reliable. I will suggest an approach to improve the reliability of this methodology.
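The joint-versus-sequential contrast can be sketched as follows (a toy exhaustive version, not the method of Daelemans et al. (2003), who use heuristic search to keep the cross-product tractable; the score table and `toy_eval` function are invented to exhibit an interaction between the two factors):

```python
from itertools import combinations, product

def all_subsets(features):
    """All non-empty feature subsets, as frozensets."""
    return [frozenset(c) for r in range(1, len(features) + 1)
            for c in combinations(features, r)]

def joint_search(features, param_grid, evaluate):
    """Joint optimization: score every (feature subset, parameter) pair."""
    return max(product(all_subsets(features), param_grid),
               key=lambda pair: evaluate(*pair))

def sequential_search(features, param_grid, evaluate, default):
    """Sequential optimization: select features under the default parameter,
    then tune the parameter for the chosen subset."""
    best_fs = max(all_subsets(features), key=lambda fs: evaluate(fs, default))
    best_p = max(param_grid, key=lambda p: evaluate(best_fs, p))
    return best_fs, best_p

# Hypothetical cross-validated accuracies: feature "b" only helps when the
# parameter is 5, so feature selection and parameter choice interact.
SCORES = {
    (frozenset("a"), 1): 0.80, (frozenset("a"), 5): 0.78,
    (frozenset("b"), 1): 0.70, (frozenset("b"), 5): 0.75,
    (frozenset("ab"), 1): 0.79, (frozenset("ab"), 5): 0.90,
}
def toy_eval(fs, p):
    return SCORES[(frozenset(fs), p)]

best_joint = joint_search("ab", [1, 5], toy_eval)             # finds {a, b} with 5 (0.90)
best_seq = sequential_search("ab", [1, 5], toy_eval, default=1)  # stuck at {a} with 1 (0.80)
```

Because the sequential search fixes the parameter while choosing features, it can never discover combinations whose value only shows up under a non-default setting, which is exactly the interaction effect the comparative-methodology literature warns about.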
2.3 Engineering
Whereas comparative machine learning work can potentially provide useful theoretical insights and results, there is a distinct feeling that it also leads to an exaggerated focus on accuracy on the dataset. Given the limited transfer and reusability of learned modules when they are used on different domains, corpora, etc., this may not be very relevant. If a WSJ-trained statistical parser loses 20% accuracy on a comparable newspaper test corpus, it doesn't really matter a lot that system A does 1% better than system B on the default WSJ-corpus partition.
In order to win shared tasks and perform best on some language processing task, various clever architectural and algorithmic variations have been proposed, sometimes with the single goal of getting higher accuracy (ensemble methods, classifier combination in general, ...), sometimes with the goal of solving manual annotation bottlenecks (active learning, co-training, semi-supervised methods, ...).
This work is entirely valid from the point of view of computational linguistics researchers looking for any old method that can boost performance and get benchmark natural language processing problems or applications solved. But from the point of view of a SIG on computational natural language learning, this work is probably too theory-independent and doesn't teach us enough about language learning.
However, engineering work like this can suddenly become theoretically important when it is motivated not by a few decimal points more accuracy but rather by (psycho)linguistic plausibility. For example, the current trend of combining local classifiers with holistic inference may be a cognitively relevant principle rather than a neat engineering trick.
3 Conclusion
The field of computational natural language learning is in need of a renewed mission. With its two parent fields dominated by, respectively, good engineering use of machine learning in language processing and interesting developments in computational language learning, our field should focus more on theory. More research should address the question of what we can learn about language from comparative machine learning experiments, and should address, or at least acknowledge, the methodological problems involved.
4 Acknowledgements
There are many people who have influenced me; most of my students and colleagues have done so at some point. But I would like to single out David Powers and Antal van den Bosch, and thank them for making this strange field of computational language learning such an interesting and pleasant playground.

References
Michele Banko and Eric Brill. 2001. Mitigating the paucity-of-data problem: exploring the effect of training corpus size on classifier performance for natural language processing. In HLT '01: Proceedings of the First International Conference on Human Language Technology Research, pages 1–5, Morristown, NJ, USA. Association for Computational Linguistics.

Walter Daelemans and Antal van den Bosch. 2005. Memory-Based Language Processing. Cambridge University Press, Cambridge, UK.

Walter Daelemans, Véronique Hoste, Fien De Meulder, and Bart Naudts. 2003. Combined optimization of feature selection and algorithm parameter interaction in machine learning of language. In Proceedings of the 14th European Conference on Machine Learning (ECML-2003), Lecture Notes in Computer Science 2837, pages 84–95, Cavtat-Dubrovnik, Croatia. Springer-Verlag.

David Hume. 1739. A Treatise of Human Nature.

M. I. Jordan, editor. 1999. Learning in Graphical Models. MIT Press, Cambridge, MA, USA.

D. Roth. 2000. Learning in natural language: Theory and algorithmic approaches. In Proceedings of the Annual Conference on Computational Natural Language Learning (CoNLL), pages 1–6, Lisbon, Portugal.

Antal van den Bosch and Sabine Buchholz. 2001. Shallow parsing on the basis of words only: a case study. In ACL '02: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 433–440, Morristown, NJ, USA. Association for Computational Linguistics.

Vladimir N. Vapnik. 1995. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA.

David H. Wolpert and William G. Macready. 1995. No free lunch theorems for search. Technical Report SFI-TR-95-02-010, Santa Fe Institute, Santa Fe, NM.
