File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/02/p02-1062_intro.xml
Size: 6,629 bytes
Last Modified: 2025-10-06 14:01:28
<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1062"> <Title>Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron</Title>
<Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 The data </SectionTitle>
<Paragraph position="0"> Over a period of a year or so we have had over one million words of named-entity data annotated. The data is drawn from web pages, the aim being to support a question-answering system over web data. A number of categories are annotated: the usual people, organization and location categories, as well as less frequent categories such as brand-names, scientific terms, event titles (such as concerts) and so on.</Paragraph>
<Paragraph position="1"> From this data we created a training set of 53,609 sentences (1,047,491 words), and a test set of 14,717 sentences (291,898 words).</Paragraph>
<Paragraph position="2"> The task we consider is to recover named-entity boundaries. We leave the recovery of the categories of entities to a separate stage of processing.1 (Footnote 1: In initial experiments, we found that forcing the tagger to recover categories as well as the segmentation, by exploding the number of tags, reduced performance on the segmentation task, presumably due to sparse data problems.) We evaluate different methods on the task through precision and recall. If a method proposes $p$ entities on the test set, and $c$ of these are correct (i.e., an entity is marked by the annotator with exactly the same span as that proposed), then the precision of the method is $100 \times c/p$. Similarly, if $t$ is the total number of entities in the human-annotated version of the test set, then the recall is $100 \times c/t$.</Paragraph> </Section>
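As an illustration of this evaluation, the following is a minimal sketch of how exact-span precision and recall could be computed. It is not code from the paper; the representation of an entity as a (sentence index, start, end) tuple is an assumption made purely for the example.

# Illustrative sketch: exact-span precision and recall for proposed entities.
# Entities are represented as (sentence_id, start, end) tuples; this
# representation is an assumption for the example, not taken from the paper.

def precision_recall(proposed, gold):
    """Return (precision, recall) as percentages.

    A proposed entity counts as correct only if an entity with exactly
    the same span appears in the gold (human-annotated) test set.
    """
    proposed, gold = set(proposed), set(gold)
    correct = len(proposed & gold)                           # c: exact-span matches
    precision = 100.0 * correct / len(proposed) if proposed else 0.0
    recall = 100.0 * correct / len(gold) if gold else 0.0
    return precision, recall

if __name__ == "__main__":
    gold = {(0, 2, 3), (0, 7, 7), (1, 0, 1)}        # t = 3 annotated entities
    proposed = {(0, 2, 3), (0, 6, 7), (1, 0, 1)}    # p = 3 proposed, c = 2 correct
    p, r = precision_recall(proposed, gold)
    print(f"precision = {p:.1f}%, recall = {r:.1f}%")   # 66.7% / 66.7%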
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 The baseline tagger </SectionTitle>
<Paragraph position="0"> The problem can be framed as a tagging task: tag each word as being either the start of an entity, a continuation of an entity, or not part of an entity at all (we will use the tags S, C and N respectively for these three cases). As a baseline model we used a maximum-entropy tagger, very similar to the ones described in (Ratnaparkhi 1996; Borthwick et al. 1998; McCallum et al. 2000). Maximum-entropy taggers have been shown to be highly competitive on a number of tagging tasks, such as part-of-speech tagging (Ratnaparkhi 1996), named-entity recognition (Borthwick et al. 1998), and information extraction (McCallum et al. 2000). Thus the maximum-entropy tagger we used represents a serious baseline for the task. We used the following features (several of the features were inspired by the approach of (Bikel et al. 1999), an HMM model which gives excellent results on named-entity extraction):</Paragraph>
<Paragraph position="1"> - The word being tagged, the previous word, and the next word.</Paragraph>
<Paragraph position="2"> - The previous tag, and the previous two tags (bigram and trigram features).</Paragraph>
<Paragraph position="3"> - A compound feature of three fields: (a) Is the word at the start of a sentence? (b) Does the word occur in a list of words which occur more frequently as lower-case than as upper-case words in a large corpus of text? (c) The type of the first letter $x$ of the word, where $\mathrm{type}(x)$ is defined as 'A' if $x$ is a capitalized letter, 'a' if $x$ is a lower-case letter, '0' if $x$ is a digit, and $x$ otherwise. For example, if the word Animal is seen at the start of a sentence, and it occurs in the list of frequent lower-cased words, then it would be mapped to the feature 1-1-A.</Paragraph>
<Paragraph position="4"> - The word with each character mapped to its type. For example, G.M. would be mapped to A.A., and Animal would be mapped to Aaaaaa.</Paragraph>
<Paragraph position="5"> - The word with each character mapped to its type, but with repeated consecutive character types collapsed in the mapped string. For example, Animal would be mapped to Aa, and G.M. would again be mapped to A.A.</Paragraph>
<Paragraph position="6"> The tagger was applied and trained in the same way as described in (Ratnaparkhi 1996). The feature templates described above are used to create a set of binary features $h_i(t, h)$ for $i = 1 \ldots n$, where $t$ is the tag and $h$ is the &quot;history&quot;, or context. An example is a feature that takes the value 1 if the current word in $h$ is Animal and $t = \mathrm{S}$, and 0 otherwise. The parameters of the model are $\alpha_i$ for $i = 1 \ldots n$, defining a conditional distribution over the tags given a history $h$ as $P(t \mid h) = \exp\bigl(\sum_i \alpha_i h_i(t, h)\bigr) / \sum_{t'} \exp\bigl(\sum_i \alpha_i h_i(t', h)\bigr)$, where the denominator sums over the three possible tags. The parameters are trained using Generalized Iterative Scaling. Following (Ratnaparkhi 1996), we only include features which occur 5 times or more in the training data. In decoding, we use a beam search to recover 20 candidate tag sequences for each sentence (the sentence is decoded from left to right, with the top 20 most probable hypotheses being stored at each point).</Paragraph> </Section>
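To make the word-shape features and the form of the tagger's conditional distribution concrete, here is a minimal illustrative sketch; it is not the authors' implementation. The function names, the toy feature names, and the weights below are invented for this example; only the character-type mapping and the softmax form of $P(t \mid h)$ follow the description above.

import math

TAGS = ["S", "C", "N"]  # start of entity, continuation of entity, not an entity

def char_type(ch):
    """'A' for an upper-case letter, 'a' for lower-case, '0' for a digit,
    otherwise the character itself."""
    if ch.isupper():
        return "A"
    if ch.islower():
        return "a"
    if ch.isdigit():
        return "0"
    return ch

def word_shape(word, collapse=False):
    """Word with each character mapped to its type; if collapse is True,
    runs of identical types are reduced to one symbol
    (Animal -> Aaaaaa or Aa, G.M. -> A.A. in both cases)."""
    types = [char_type(ch) for ch in word]
    if collapse:
        types = [t for i, t in enumerate(types) if i == 0 or t != types[i - 1]]
    return "".join(types)

def p_tag_given_history(active_features, weights):
    """Conditional distribution over TAGS:
    P(t | h) = exp(sum_i alpha_i h_i(t, h)) / sum_t' exp(sum_i alpha_i h_i(t', h)).
    active_features(t) lists the binary features firing for tag t in the
    current history; weights maps feature names to their alpha values."""
    scores = {t: sum(weights.get(f, 0.0) for f in active_features(t)) for t in TAGS}
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(scores[t]) / z for t in TAGS}

if __name__ == "__main__":
    print(word_shape("Animal"), word_shape("Animal", collapse=True))  # Aaaaaa Aa
    print(word_shape("G.M."), word_shape("G.M.", collapse=True))      # A.A. A.A.

    # Invented feature names and weights, purely for illustration.
    weights = {"word=Animal,tag=S": 1.2, "shape=Aa,tag=S": 0.8, "word=Animal,tag=N": 0.3}
    active = lambda t: [f"word=Animal,tag={t}", f"shape=Aa,tag={t}"]
    print(p_tag_given_history(active, weights))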
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Applying the baseline tagger </SectionTitle>
<Paragraph position="0"> As a baseline we trained a model on the full 53,609 sentences of training data, and decoded the 14,717 sentences of test data. This gave 20 candidates per test sentence, along with their probabilities. The baseline method is to take the most probable candidate for each test sentence and then to calculate precision and recall figures. Our aim is to come up with strategies for reranking the test-data candidates in such a way that precision and recall are improved.</Paragraph>
<Paragraph position="1"> In developing a reranking strategy, the 53,609 sentences of training data were split into a 41,992-sentence training portion and an 11,617-sentence development set. The training portion was split into 5 sections, and in each case the maximum-entropy tagger was trained on 4/5 of the data and then used to decode the remaining 1/5. The top 20 hypotheses under a beam search, together with their log probabilities, were recovered for each training sentence. In a similar way, a model trained on the 41,992-sentence set was used to produce 20 hypotheses for each sentence in the development set.</Paragraph> </Section> </Section> </Paper>