<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-2011">
  <Title>Combining labelled and unlabelled data: a case study on Fisher kernels and transductive inference for biological entity recognition</Title>
  <Section position="3" start_page="0" end_page="1" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The availability of electronic databases of rapidly increasing sizes has encouraged the developmentofmethodsthatcantapinto these databases to automatically generate knowledge, for example by retrieving relevant information or extracting entities and their relationships.</Paragraph>
    <Paragraph position="1"> Machine learning seems especially relevantin this context, because it helps performing these tasks with a minimum of user interaction.</Paragraph>
    <Paragraph position="2"> Anumber of problems likeentity extraction or ltering can be mapped to supervised techniques like categorisation. In addition, modern supervised classication methods like Support Vector Machines haveproven to be ecientand versatile. They do, however, rely on the availability of labelled data, where labels indicate eg whether a document is relevant or whether a candidate expression is an interesting entity.</Paragraph>
    <Paragraph position="3"> This causes two important problems that motivate our work: 1) annotating data is often a dicult and costly task involving a lot of human work  , such that large collections of labelled data are dicult to obtain, and 2) inter-annotator agreement tends to be lowineggenomics collections (Krauthammer et al., 2000), thus calling for methods that are able to deal with noise and incomplete data.</Paragraph>
    <Paragraph position="4"> On the other hand, unsupervised techniques do not require labelled data and can thus be applied regardless of the annotation problems.</Paragraph>
    <Paragraph position="5"> Unsupervised learning, however, tend to be less data-ecient than its supervised counterpart, requiring many more examples to discover signicant features in the data, and is incapable of solving the same kinds of problems. For example, an ecient clustering technique maybe able to distribute documents in a number of well-dened clusters. However, it will be unable to decide which clusters are relevant without a minimum of supervision.</Paragraph>
    <Paragraph position="6"> This motivates our study of techniques that rely on a combination of supervised and unsupervised learning, in order to leverage the availability oflargecollections ofunlabelled dataand use a limited amount of labelled documents.</Paragraph>
    <Paragraph position="7"> The focus of this study is on a particular application to the genomics literature. In genomics, a vast amountofknowledge still resides in large collections of scientic papers suchas Medline, and several approaches have been proposed toextract,(semi-)automatically, information from such papers. These approaches range from purely statistical ones to symbolic ones relying on linguistic and knowledge processing tools (Ohta et al., 1997;; Thomas et al., 2000;; Proux et al., 2000, for example). Furthermore, due to the nature of the problem at hand, meth- null If automatic annotation was available, wewould basically have solved our Machine Learning problem ods derived from machine learning are called for, (Craven and Kumlien, 1999), whether supervised, unsupervised or relying on a combination of both.</Paragraph>
    <Paragraph position="8"> Let us insist on the fact that our work is primarily concerned with combining labelled and unlabelled data, and entity extraction is used as an application in this context. As a consequence, it is not our purpose at this pointto compare our experimental results to those obtained by specic machine learning techniques applied to entity extraction (Cali, 1999). Although we certainly hope that our work can be useful for entity extraction, we rather think of it as a methodological study which can hopefully be applied to dierent applications where unlabelled data may be used to improve the results of supervised learning algorithms. In addition, performing a fair comparison of our work on standard information extraction benchmarks is not straightforward: either wewould need to obtain a large amount of unlabelled data that is comparable tothe benchmark, orwewould need to \un-label&amp;quot; a portion of the data. In both cases, comparing to existing results is dicult as the amount of information used is dierent.</Paragraph>
  </Section>
class="xml-element"></Paper>