<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0610">
  <Title>Using Uneven Margins SVM and Perceptron for Information Extraction</Title>
  <Section position="3" start_page="0" end_page="72" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Information Extraction (IE) is the process of automatic extraction of information about pre-speci ed types of events, entities or relations from text such as newswire articles or Web pages. IE is useful in many applications, such as information gathering in a variety of domains, automatic annotations of web pages for Semantic Web, and knowledge management. null A wide range of machine learning techniques have been used for IE and achieved state-of-the-art results, comparable to manually engineered IE systems. A learning algorithm usually learns a model from a set of documents which have been manually annotated by the user. Then the model can be used to extract information from new documents. Manual annotation is a time-consuming process. Hence, in many cases learning from small data sets is highly desirable. Therefore in this paper we also evaluate the performance of our algorithms on small amounts of training data and show their learning curve.</Paragraph>
    <Paragraph position="1"> The learning algorithms for IE can be classi ed broadly into two main categories: rule learning and statistical learning. The former induces a set of rules from training examples. There are many rule based learning systems, e.g. SRV (Freitag, 1998), RAPIER (Califf, 1998), WHISK (Soderland, 1999), BWI (Freitag and Kushmerick, 2000), and (LP)2 (Ciravegna, 2001). Statistical systems learn a statistical model or classi ers, such as HMMs (Freigtag and McCallum, 1999), Maximal Entropy (Chieu and Ng., 2002), the SVM (Isozaki and Kazawa, 2002; May eld et al., 2003), and Perceptron (Carreras et al., 2003). IE systems also differ from each other in the NLP features that they use. These include simple features such as token form and capitalisation information, linguistic features such as part-ofspeech, semantic information from gazetteer lists, and genre-speci c information such as document structure. In general, the more features the system uses, the better performance it can achieve.</Paragraph>
    <Paragraph position="2"> This paper concentrates on classi er-based learning for IE, which typically converts the recognition of each information entity into a set of classi cation problems. In the framework discussed here, two binary classi ers are trained for each type of information entity. One classi er is used for recognising the entity's start token and the other the entity's end token.</Paragraph>
    <Paragraph position="3">  The classi cation problem derived from IE usually has imbalanced training data, in which positive training examples are vastly outnumbered by negative ones. This is particularly true for smaller data sets where often there are hundreds of negative training examples and only few positive ones. Two approaches have been studied so far to deal with imbalanced data in IE. One approach is to under-sample majority class or over-sample minority class in order to obtain a relatively balanced training data (Zhang and Mani, 2003). However, under-sampling can potentially remove certain important examples, and over-sampling can lead to over- tting and a larger training set. Another approach is to divide the problem into several sub-problems in two layers, each of which has less imbalanced training set than the original one (Carreras et al., 2003; Sitter and Daelemans, 2003). The output of the classi er in the rst layer is used as the input to the classi ers in the second layer. As a result, this approach needs more classi ers than the original problem. Moreover, the classi cation errors in the rst layer will affect the performance of the second one.</Paragraph>
    <Paragraph position="4"> In this paper we explore another approach to handle the imbalanced data in IE, namely, adapting the learning algorithms for balanced classi cation to imbalanced data. We particularly study two popular classi cation algorithms in IE, Support Vector Machines (SVM) and Perceptron.</Paragraph>
    <Paragraph position="5"> SVM is a general supervised machine learning algorithm, that has achieved state of the art performance on many classi cation tasks, including NE recognition. Isozaki and Kazawa (2002) compared three commonly used methods for named entity recognition the SVM with quadratic kernel, maximal entropy method, and a rule based learning system, and showed that the SVM-based system performed better than the other two. May eld et al.</Paragraph>
    <Paragraph position="6"> (2003) used a lattice-based approach to named entity recognition and employed the SVM with cubic kernel to compute transition probabilities in a lattice. Their results on CoNLL2003 shared task were comparable to other systems but were not the best ones. Previous research on using SVMs for IE adopts the standard form of the SVM, which treats positive and negative examples equally. As a result, they did not consider the difference between the balanced classi cation problems, where the SVM performs quite well, and the imbalanced ones. Li and Shawe-Taylor (2003) proposes an uneven margins version of the SVM and shows that the SVM with uneven margins performs signi cantly better than the standard SVM on document classi cation problems with imbalanced training data. Since the classi cation problem for IE is also imbalanced, this paper investigates the SVM with uneven margins for IE tasks and demonstrates empirically that the uneven margins SVM does have better performance than the standard SVM.</Paragraph>
    <Paragraph position="7"> Perceptron is a simple, fast and effective learning algorithm, which has successfully been applied to named entity recognition (Carreras et al., 2003).</Paragraph>
    <Paragraph position="8"> The system uses a two-layer structure of classi ers to handle the imbalanced data. The rst layer classi es each word as entity or non-entity. The second layer classi es the named entities identi ed by the rst layer in the respective entity classes. Li et al.</Paragraph>
    <Paragraph position="9"> (2002) proposed another variant of Perceptron, the Perceptron algorithm with uneven margins (PAUM), designed especially for imbalanced data. In this paper we explore the application of PAUM to IE.</Paragraph>
    <Paragraph position="10"> The rest of the paper is structured as follows. Section 2 describes the uneven margins SVM and Perceptron algorithms. Sections 3.1 and 3.2 discuss the classi er-based framework for IE and the experimental datasets we used, respectively. We compare our systems to other state-of-the-art systems on three benchmark datasets in Section 3.3. Section 3.4 discusses the effects of the uneven margins parameter on the SVM and Perceptron's performances. Finally, Section 4 provides some conclusions.</Paragraph>
  </Section>
class="xml-element"></Paper>