<?xml version="1.0" standalone="yes"?> <Paper uid="E06-3004"> <Title>Bootstrapping Named Entity Recognition with Automatically Generated Gazetteer Lists</Title> <Section position="2" start_page="0" end_page="16" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Automatic information extraction and information retrieval concerning particular persons, locations, organizations, or titles of movies and books relies on the Named Entity Recognition (NER) task. NER consists in detecting the most salient and informative elements in a text, such as names of people, company names, locations, monetary currencies and dates. Early NER systems (Fisher et al., 1997), (Black et al., 1998), etc., participating in the Message Understanding Conferences (MUC), used linguistic tools and gazetteer lists. However, these resources are difficult to develop and are domain-sensitive.</Paragraph> <Paragraph position="1"> To surmount these obstacles, the application of machine learning approaches to NER became a research subject. Various state-of-the-art machine learning algorithms such as Maximum Entropy (Borthwick, 1999), AdaBoost (Carreras et al., 2002), Hidden Markov Models (Bikel et al., ), and Memory-based learning (Tjong Kim Sang, 2002b) have been used1. (Klein et al., 2003), (Mayfield et al., 2003), (Wu et al., 2003), (Kozareva et al., 2005c), among others, combined several classifiers to obtain better named entity recognition. However, these algorithms rely on previously hand-labeled training data. Obtaining such data is labor-intensive, time-consuming and might not even be available for languages with limited funding. These resource limitations directed NER research (Collins and Singer, 1999), (Carreras et al., 2003), (Kozareva et al., 2005a) toward the use of semi-supervised techniques.</Paragraph> <Paragraph position="2"> These techniques are needed, as we live in a multilingual society and access to information from various language sources is a reality. 
The development of NER systems for languages other than English has commenced.</Paragraph> <Paragraph position="3"> This paper presents the development of a Spanish Named Entity Recognition system based on a machine learning approach. No morphological or syntactic information was used. However, we propose and incorporate a very simple method for automatic gazetteer2 construction. This method can be easily adapted to other languages and is obtained at low cost, as it relies on n-gram extraction from unlabeled data. We compare the performance of our NER system when labeled and when unlabeled training data is present.</Paragraph> <Paragraph position="4"> The paper is organized in the following way: a brief explanation of the NER process is given in Section 2. Section 3 describes feature extraction. The experimental evaluations for the Named Entity detection and classification tasks with and without labeled data are in Sections 4 and 5. The NER task is divided into entity detection and entity classification. Entity delimitation consists in determining the boundaries of the entity (e.g. the place where it starts and the place where it finishes). This is important for tracing entities composed of two or more words such as &quot;Presidente de los Estados Unidos&quot;3 or &quot;Universidad Politecnica de Cataluña&quot;4. For this purpose, the BIO scheme was incorporated. In this scheme, tag B denotes the start of an entity, tag I continues the entity and tag O marks words that do not form part of an entity. This scheme was initially introduced in the CoNLL NER competitions (Tjong Kim Sang, 2002a), (Tjong Kim Sang and De Meulder, 2003), and we decided to adopt it for our experimental work.</Paragraph> <Paragraph position="5"> Once all entities in the text are detected, they are passed for classification into a predefined set of categories such as location, person, organization or miscellaneous5 names. This task is known as entity classification. 
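The BIO scheme described above can be sketched as follows. This is a minimal illustration of how B, I and O tags delimit multi-word entities; the function and variable names are ours, not from the paper.

```python
# Minimal sketch of BIO tagging for multi-word entities.
# Tag B-XXX starts an entity of type XXX, I-XXX continues it,
# and O marks tokens outside any entity.

def bio_tags(tokens, entities):
    """Assign BIO tags given entity spans as (start, end, type) tuples,
    where end is exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

tokens = ["Presidente", "de", "los", "Estados", "Unidos"]
# One person entity spanning all five tokens
print(bio_tags(tokens, [(0, 5, "PER")]))
```

Under this encoding the five-token entity above receives one B tag followed by four I tags, which is exactly the boundary information the detection step must recover.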
The final NER performance is measured considering the entity detection and classification tasks together.</Paragraph> <Paragraph position="6"> Our NER approach is based on machine learning. The two algorithms we used for the experiments were instance-based learning and decision trees, as implemented by (Daelemans et al., 2003). They were used with their default parameter settings. We selected the instance-based model because it is known to be useful when the amount of training data is not sufficient.</Paragraph> <Paragraph position="7"> An important part in the NER process is played by the location and person gazetteer lists, which were automatically extracted from unlabeled data. A more detailed explanation of their generation can be found in Section 3.</Paragraph> <Paragraph position="8"> To explore the effect of labeled and unlabeled training data on our NER system, two types of experiments were conducted. For the supervised approach, the labels in the training data were previously known.</Paragraph> <Paragraph position="9"> For the semi-supervised approach, the labels in the training data were hidden. We used bootstrapping (Abney, 2002), which refers to a problem setting in which one is given a small set of labeled data and a large set of unlabeled data, and the task is to induce a classifier.</Paragraph> <Paragraph position="10"> * Goals: - utilize a minimal amount of supervised examples; - obtain learning from many unlabeled examples; 3&quot;President of the United States&quot; 4&quot;Technical University of Cataluña&quot; 5book titles, sport events, etc.</Paragraph> <Paragraph position="11"> * General scheme: - initial supervision: seed examples for training an initial model; - corpus classification with the seed model; - add the most confident classifications to the training data and iterate.</Paragraph> <Paragraph position="12"> In our bootstrapping, a newly labeled example was added into the training data L if the two classifiers C1 and C2 agreed on the class of that example. 
The number n of iterations for our experiments is set to 25, and when this bound is reached the bootstrapping stops. The scheme we follow is described below.</Paragraph> <Paragraph position="13"> 1. for iteration = 0...n do 2. pool 1000 examples from the unlabeled data; 3. annotate all 1000 examples with classifiers C1 and C2; 4. for each of the 1000 examples compare the classes assigned by C1 and C2; 5. add an example into L only if the classes of C1 and C2 agree; 6. train a model with L; 7. calculate results; 8. end for Bootstrapping was previously used by (Carreras et al., 2003), who were interested in recognizing Catalan names using Spanish resources. (Becker et al., 2005) employed bootstrapping in an active learning method for tagging entities in an astronomic domain. (Yarowsky, 1995) and (Mihalcea and Moldovan, 2001) utilized bootstrapping for word sense disambiguation. (Collins and Singer, 1999) classified NEs through co-training, (Kozareva et al., 2005a) used self-training and co-training to detect and classify named entities in the news domain, and (Shen et al., 2004) conducted experiments with multi-criteria-based active learning for biomedical NER.</Paragraph> <Paragraph position="14"> The experimental data we work with is taken from the CoNLL-2002 competition. The Spanish corpus6 comes from the news domain and was previously manually annotated. The training set contains 264715 words of which 18798 are entities, and the test set has 51533 words of which 3558 are entities.</Paragraph> <Paragraph position="15"> We decided to work with available NE annotated corpora in order to conduct an exhaustive and comparative NER study when labeled and unlabeled data is present. For our bootstrapping experiment, we simply ignored the presence of the labels in the training data. Of course, this approach can be applied to other domains or languages; the only requirement is labeled test data to conduct a correct evaluation. The evaluation is computed per NE class with the help of the conlleval7 script. 
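The iteration scheme above can be sketched in code. This is a simplified, self-contained illustration of the agreement criterion, not the actual implementation; the function names and the toy majority-class learner are ours.

```python
from collections import Counter

# Sketch of agreement-based bootstrapping: a pooled unlabeled example
# enters the training set L only when the two classifiers agree on its
# class. train1/train2 stand in for the two learners (e.g. an
# instance-based model and a decision tree in the paper's setup).

def bootstrap(labeled, unlabeled, train1, train2, n_iters=25, pool_size=1000):
    """train1/train2 take labeled (x, y) pairs and return a classify(x) function."""
    L = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(n_iters):
        c1, c2 = train1(L), train2(L)            # retrain both models on L
        pool, unlabeled = unlabeled[:pool_size], unlabeled[pool_size:]
        for x in pool:                           # annotate the pooled examples
            y1, y2 = c1(x), c2(x)
            if y1 == y2:                         # keep only agreed classifications
                L.append((x, y1))
        if not unlabeled:                        # stop early if data runs out
            break
    return L

def majority_train(L):
    """Toy learner for illustration: always predicts its training majority class."""
    label = Counter(y for _, y in L).most_common(1)[0][0]
    return lambda x: label
```

The retrain-then-pool loop mirrors steps 1-8 above; in the paper's setting the loop additionally stops once the bound of 25 iterations is reached.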
The evaluation measures are:</Paragraph> <Paragraph position="17"> precision: the percentage of named entities found by the system that are correct; recall: the percentage of named entities present in the corpus that are found by the system; Fβ=1 = (2 x precision x recall) / (precision + recall). </Paragraph> </Section> </Paper>