A Mission for Computational Natural Language Learning
Walter Daelemans

2 State of the Art in Computational Natural Language Learning

The second part of my presentation will be a discussion of the state of the art as it can be found in CoNLL (and in EMNLP and the ACL conferences). The field can be divided into theoretical, methodological, and engineering work. There has been progress in theory and methodology, but perhaps not enough. I will argue that most progress has been made in engineering, with the result most often being incremental progress on specific tasks rather than an increased understanding of how language can be learned from data.

Machine Learning of Natural Language (MLNL), or Computational Natural Language Learning (CoNLL), is a research area lying at the intersection of computational linguistics and machine learning. I would suggest that Statistical Natural Language Processing (SNLP) should be treated as part of MLNL, or perhaps even as a synonym. Symbolic machine learning methods belong to the same part of the ontology as statistical methods, but offer different solutions for specific problems: Inductive Logic Programming allows elegant addition of background knowledge, memory-based learning has implicit similarity-based smoothing, and so on.

There is no need here to explain the success of inductive methods in Computational Linguistics and why we are all such avid users of the technology: availability of data, fast production of systems with good accuracy, robustness and coverage, and lower cost than linguistic labor. There is also no need here to explain that many of these arguments in favor of learning in NLP are bogus. Getting statistical and machine learning systems to work involves design, optimization, and smoothing issues that are something of a black art. For many problems, getting sufficient annotated data is expensive and difficult, our annotators do not agree sufficiently, and our trained systems are not really that good. My favorite example of the latter is part-of-speech tagging, which is considered a solved problem but still has error rates of 20-30% for the ambiguities that count, such as verb-noun ambiguity. We are doing better than hand-crafted, linguistic-knowledge-based approaches, but from the point of view of the goal of robust language understanding, unfortunately not that significantly better. Twice better than very bad is not necessarily any good. We have also implicitly redefined the goals of the field of Computational Linguistics, forgetting for example about quantification, modality, tense, inference, and a large number of other sentence- and discourse-semantics issues that do not fit the default classification-based supervised learning framework very well, or for which we do not have annotated data readily available. As a final irony, one of the reasons why learning methods have become so prevalent in NLP is their success in speech recognition.
Yet there, too, this success is relative; the goal of spontaneous, speaker-independent recognition is still far away.

2.1 Theory

There has recently been a lot of progress in theoretical machine learning (Vapnik, 1995; Jordan, 1999). Statistical Learning Theory and progress in Graphical Models theory have provided us with a well-defined framework in which we can relate different approaches such as kernel methods, Naive Bayes, Markov models, maximum-entropy approaches (logistic regression), perceptrons, and CRFs. Insight into the differences between generative and discriminative learning approaches has considerably clarified the relations between different learning algorithms.

However, this work does not tell us anything general about the machine learning of language. Theoretical issues that should be studied in MLNL include, for example, which classes of learning algorithms are best suited to which types of language processing task, how much training data is needed for a given task, and which information sources are necessary and sufficient for learning a particular language processing task. These fundamental questions all relate to learning algorithm bias: learning is a search process in a hypothesis space, and heuristic limitations on the search process together with restrictions on the representations allowed for inputs and hypotheses define this bias. There is not a lot of work on matching properties of learning algorithms with properties of language processing tasks, or, more specifically, on how the bias of particular (families of) learning algorithms relates to the hypothesis spaces of particular (types of) language processing tasks.

As an example of such a unifying approach, (Roth, 2000) shows that several different algorithms (memory-based learning, TBL, SNoW, decision lists, various statistical learners, ...) use the same type of knowledge representation: a linear representation over a feature space based on a transformation of the original instance space. However, the only relation to language here is rather negative, namely the claim that this bias is not sufficient for learning higher-level language processing tasks.

As another example of this type of work, Memory-Based Learning (MBL) (Daelemans and van den Bosch, 2005), with its implicit similarity-based smoothing, storage of all training evidence, and uniform modeling of regularities, subregularities, and exceptions, has been proposed as having the right bias for language processing tasks. Language processing tasks are mostly governed by Zipfian distributions and high disjunctivity, which makes a principled distinction between noise and exceptions difficult, and which would put eager learning methods (i.e., most learning methods apart from MBL and kernel methods) at a disadvantage.

More theoretical work in this area should make it possible to relate machine learner bias to properties of language processing tasks in a more fine-grained way, providing more insight into both language and learning. An avenue that has remained largely unexplored in this respect is the use of artificial data emulating properties of language processing tasks, which would make possible a much more fine-grained study of the influence of learner bias.
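To make this concrete, here is a minimal sketch (my own illustration with invented task parameters, not an experiment from the literature) of what such artificial-data work could look like: we generate a task whose features follow a Zipfian distribution and whose concept is highly disjunctive, then compare a lazy learner that keeps all training evidence (1-nearest-neighbor, standing in for MBL) with an eager learner that abstracts away low-frequency "exceptions" (a tree pruned via a minimum leaf size).

```python
# Hypothetical artificial-data probe of learner bias; all task parameters
# below (50 feature types, 4 features, the disjunctive target rule) are
# invented for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5000

# Zipfian distribution over 50 feature "types", as in many language tasks.
ranks = np.arange(1, 51)
zipf_p = (1.0 / ranks) / (1.0 / ranks).sum()
X = rng.choice(50, size=(n, 4), p=zipf_p).astype(float)

# Disjunctive target: the label depends on small, scattered feature
# conjunctions, so rare "exceptions" matter and look much like noise.
y = (((X[:, 0] % 7 == 0) & (X[:, 1] < 5)) | (X[:, 2] == X[:, 3])).astype(int)

for name, clf in [("lazy 1-NN (MBL-like)", KNeighborsClassifier(n_neighbors=1)),
                  ("eager pruned tree", DecisionTreeClassifier(min_samples_leaf=25))]:
    acc = cross_val_score(clf, X, y, cv=10).mean()
    print(f"{name}: {acc:.3f}")
```

Because the generating process is fully known, the Zipfian skew and the degree of disjunctivity can be varied systematically, which is exactly the fine-grained control that natural corpora do not offer.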
However, research in this area will not be able to ignore the "no free lunch" theorem (Wolpert and Macready, 1995). Referring back to the problem of induction (Hume, 1710), this theorem can be interpreted as saying that no inductive algorithm is universally better than any other: the generalization performance of any inductive algorithm is zero when averaged over a uniform distribution of all possible classification problems (i.e., assuming a random universe). This means that the only way to test hypotheses about bias and necessary information sources in language learning is to perform empirical research, which makes a reliable experimental methodology necessary.

2.2 Methodology

Whether we want to investigate the role of different information sources in learning a task, or to investigate whether the bias of some learning algorithm fits the properties of natural language processing tasks better than that of alternative learning algorithms, comparative experiments are necessary. As an example of the former, we may be interested in whether part-of-speech tagging improves the accuracy of a Bayesian text classification system; as an example of the latter, we may want to know whether a relational learner is better suited than a propositional learner to learning semantic function association. Both questions can be addressed by comparing the accuracy of a learner with and without the information source, or of different learners on the same task. Crucial for objectively comparing algorithm bias and the relevance of information sources is a methodology to reliably measure differences and compute their statistical significance. A detailed methodology has been developed for this, involving approaches like k-fold cross-validation to estimate classifier quality (in terms of measures derived from a confusion matrix, such as accuracy, precision, recall, F-score, ROC, and AUC), as well as statistical techniques like the McNemar test and paired cross-validation t-tests for determining the statistical significance of differences between algorithms or between the presence and absence of information sources. This methodology is generally accepted and used both in machine learning and in most work in inductive NLP.

CoNLL has contributed a lot to this comparative work by producing a successful series of shared tasks, which has provided the community with a rich set of benchmark language processing tasks. Other competitive research evaluations, like Senseval, the PASCAL challenges, and the NIST competitions, have similarly tuned the field toward comparative learning experiments. In a typical comparative machine learning experiment, two or more algorithms are compared, for a fixed sample selection, feature selection, feature representation, and (default) algorithm parameter setting, over a number of trials (cross-validation); if the measured differences are statistically significant, conclusions are drawn about which algorithm is better suited to the problem being studied and why (mostly in terms of algorithm bias).
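As a concrete illustration of this methodology, the following sketch compares two learners on a text classification task with one fixed 10-fold cross-validation, a paired t-test over per-fold accuracies, and a McNemar test on the pooled per-instance predictions. The dataset and the two learners are placeholder choices of mine, not experiments reported here.

```python
# Hedged sketch of a typical comparative experiment: fixed data, fixed
# features, default parameters, 10-fold CV, two significance tests.
import numpy as np
from scipy import stats
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
X = TfidfVectorizer(max_features=5000).fit_transform(data.data)
y = np.array(data.target)

clf_a, clf_b = MultinomialNB(), LogisticRegression(max_iter=1000)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

acc_a, acc_b = [], []
correct_a, correct_b = [], []  # per-instance correctness, pooled over folds
for train, test in folds.split(X, y):
    pa = clf_a.fit(X[train], y[train]).predict(X[test])
    pb = clf_b.fit(X[train], y[train]).predict(X[test])
    acc_a.append((pa == y[test]).mean()); acc_b.append((pb == y[test]).mean())
    correct_a.extend(pa == y[test]);      correct_b.extend(pb == y[test])

# Paired cross-validation t-test over the ten fold accuracies.
t, p_t = stats.ttest_rel(acc_a, acc_b)

# McNemar's test on the two disagreement counts (continuity-corrected).
correct_a, correct_b = np.array(correct_a), np.array(correct_b)
b = int(np.sum(correct_a & ~correct_b))   # A right, B wrong
c = int(np.sum(~correct_a & correct_b))   # B right, A wrong
chi2 = (abs(b - c) - 1) ** 2 / max(b + c, 1)
p_mcnemar = stats.chi2.sf(chi2, df=1)
print(f"paired t-test p={p_t:.4f}, McNemar p={p_mcnemar:.4f}")
```

Note how much is held fixed here: everything except the choice of algorithm. That fixedness is precisely what the next paragraphs call into question.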
Sometimes different sample sizes are used to provide a learning curve, and sometimes the parameters of (some of) the algorithms are optimized on training data, or heuristic feature selection is attempted, but this is the exception rather than common practice in comparative experiments.

Yet everyone knows that many factors potentially play a role in the outcome of a (comparative) machine learning experiment: the data used (the sample selection and the sample size), the information sources used (the features selected) and their representation (e.g., as nominal or binary features), the class representation (error coding, binarization of classes), and the algorithm parameter settings (most ML algorithms have various parameters that can be tuned). Moreover, all these factors are known to interact. For example, (Banko and Brill, 2001) demonstrated that for confusion set disambiguation, a prototypical disambiguation-in-context problem, the amount of data used dominates the effect of the bias of the learning method employed. The effect of training data size on the relevance of POS-tag information on top of lexical information in relation finding was studied in (van den Bosch and Buchholz, 2001): the positive effect of POS tags disappears with sufficient data. And (Daelemans et al., 2003) show that the joint optimization of feature selection and algorithm parameters significantly improves accuracy compared to sequential optimization, as the sketch below illustrates. Results from comparative experiments that ignore these interactions may therefore not be reliable. I will suggest an approach to improving this methodology so as to improve reliability.
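The following sketch illustrates the joint-versus-sequential contrast on synthetic data (an assumed setup of mine, not a reproduction of the cited experiments): the joint search explores the full grid of feature-selection and parameter values together, while the sequential procedure freezes the feature-selection choice made under default parameters and so can miss interactions between the two factors.

```python
# Joint vs. sequential optimization of feature selection and algorithm
# parameters; dataset and grids are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=100, n_informative=10,
                           random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("select", SelectKBest(f_classif)), ("clf", LinearSVC())])
k_grid = [5, 10, 25, 50, 100]
c_grid = [0.01, 0.1, 1.0, 10.0]

# Joint: one grid over feature-selection and learner parameters together.
joint = GridSearchCV(pipe, {"select__k": k_grid, "clf__C": c_grid}, cv=5)
joint.fit(X_dev, y_dev)

# Sequential: pick k with the default learner first, then tune C with k frozen.
step1 = GridSearchCV(pipe, {"select__k": k_grid}, cv=5)
step1.fit(X_dev, y_dev)
best_k = step1.best_params_["select__k"]
step2 = GridSearchCV(pipe.set_params(select__k=best_k), {"clf__C": c_grid}, cv=5)
step2.fit(X_dev, y_dev)

print("joint:     ", joint.score(X_test, y_test))
print("sequential:", step2.score(X_test, y_test))
```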
2.3 Engineering

Whereas comparative machine learning work can potentially provide useful theoretical insights and results, there is a distinct feeling that it also leads to an exaggerated attention to accuracy on the dataset at hand. Given the limited transfer and reusability of learned modules when they are used in different domains, corpora, and so on, this may not be very relevant. If a WSJ-trained statistical parser loses 20% accuracy on a comparable newspaper test corpus, it does not really matter much that system A does 1% better than system B on the default WSJ-corpus partition.

In order to win shared tasks and perform best on some language processing task, various clever architectural and algorithmic variations have been proposed, sometimes with the single goal of achieving higher accuracy (ensemble methods, classifier combination in general, ...), sometimes with the goal of overcoming manual annotation bottlenecks (active learning, co-training, semi-supervised methods, ...).

This work is perfectly valid from the point of view of computational linguistics researchers looking for any old method that can boost performance and get benchmark natural language processing problems or applications solved. But from the point of view of a SIG on computational natural language learning, this work is probably too theory-independent and does not teach us enough about language learning. However, engineering work like this can suddenly become theoretically important when it is motivated not by a few percentage decimals more accuracy but rather by (psycho)linguistic plausibility. For example, the current trend of combining local classifiers with holistic inference may be a cognitively relevant principle rather than a neat engineering trick, as the sketch below suggests.
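As a hypothetical illustration of that combination (my own toy example, not a system from the literature), the sketch below scores each token with an arbitrary local classifier and then uses Viterbi decoding over tag-transition scores as the holistic inference step, so that the output sequence is globally, rather than merely locally, optimal.

```python
# Local classification + holistic inference: per-token scores are combined
# with sequence-level transition scores by Viterbi decoding.
import numpy as np

def viterbi(local_scores, transition):
    """local_scores: (n_tokens, n_tags) log-probs from any local classifier;
    transition: (n_tags, n_tags) log-probs of tag bigrams (the holistic part)."""
    n, k = local_scores.shape
    best = np.zeros((n, k))
    back = np.zeros((n, k), dtype=int)
    best[0] = local_scores[0]
    for t in range(1, n):
        # scores[i, j] = best path ending in tag i at t-1, then tag j at t.
        scores = best[t - 1][:, None] + transition + local_scores[t][None, :]
        back[t] = scores.argmax(axis=0)
        best[t] = scores.max(axis=0)
    tags = [int(best[-1].argmax())]
    for t in range(n - 1, 0, -1):          # backtrace from the last token
        tags.append(int(back[t][tags[-1]]))
    return tags[::-1]

# Toy run: 3 tokens, 2 tags; transitions strongly penalize tag 1 after tag 1,
# so inference can overrule the locally preferred tag.
local = np.log(np.array([[0.4, 0.6], [0.3, 0.7], [0.5, 0.5]]))
trans = np.log(np.array([[0.6, 0.4], [0.9, 0.1]]))
print(viterbi(local, trans))  # globally best tag sequence
```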