<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0427">
  <Title>Memory-based one-step named-entity recognition: Effects of seed list features, classifier stacking, and unannotated data</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experimental setup
</SectionTitle>
    <Paragraph position="0"> In two subsections we briefly detail how the memory-based learner works, and how we optimized its parameters through an automatic process called iterative deepening. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Memory-based learning
</SectionTitle>
      <Paragraph position="0"> Memory-based learning is a supervised inductive learning algorithm for learning classification tasks. Memory-based learning treats a set of training instances as points in a multi-dimensional feature space, and stores them as such in an instance base in memory (rather than performing some abstraction over them).</Paragraph>
      <Paragraph position="1"> New (test) instances are classified by matching them to all instances in memory, and by calculating with each match the distance, given by a distance function between the new instance X and each of the n memory instances Y1...n. Classification in memory-based learning is performed by the k-NN algorithm that searches for the k 'nearest neighbours' among the memory instances according to the distance function. The majority class of the k nearest neighbours then determines the class of the new instance X. Cf. (Daelemans et al., 2002) for algorithmic details and background.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Iterative deepening
</SectionTitle>
      <Paragraph position="0"> Iterative deepening (ID) is a heuristic search algorithm for the optimization of algorithmic parameter and feature selection, that combines classifier wrapping (using the training material internally to test experimental variants) (Kohavi and John, 1997) with progressive sampling of training material (Provost et al., 1999). We start with a large pool of experiments, each with a unique combination of input features and algorithmic parameter settings. In the first step, each attempted setting is applied to a small amount of training material and tested on a fixed amount of held-out data (a held-out part of the training set). Only the best settings are kept; all others are removed from the pool of competing settings. In subsequent iterations, this step is repeated, cutting the number of settings in the pool by a half and retaining the best-performing half, while at the same time doubling the amount of training material.</Paragraph>
      <Paragraph position="1"> We selected 10% of the training set as held-out data.</Paragraph>
      <Paragraph position="2"> Six iterations were performed with increasing training set sizes, starting with 2000 instances, and doubling with each iteration up to 128,000 training instances, resulting in 16 best settings after the last iteration. Selection of the best experiments was based on their overall F-rate as computed by the conlleval script.</Paragraph>
      <Paragraph position="3"> The initial pool of experiments was created by systematically varying parameters of the memory-based learner and some limited feature selections, (for details, cf.</Paragraph>
      <Paragraph position="4">  exponential decay with a=1 and a=4 * Feature selection: apart from the wordform and its provided CoNLL-2003 PoS tag, create a local window of either no, 1, or 2 wordforms to the left and right of the focus word. For all words in a window, all features are selected.</Paragraph>
      <Paragraph position="5"> The first round of the ID process therefore tests 2 x 4 x 9 x 5 x 3 = 1080 systematic permutations of these parameter settings and feature selection.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Extensions
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Seed list features
</SectionTitle>
      <Paragraph position="0"> The first extension is to incorporate language-specific seed-list (gazetteer) information. Rather than using these lists external to the classifier, we encode them as internal features associated to wordforms. For each of the four named entity classes we gathered one list of names, containing material garnered from name sites on the internet, from the training set (for the MISC category), and from the CELEX English lexical data base (Baayen et al., 1993). These lists vary in size from 1269 names to 78,732 names. Each wordform in the training and test data is then enriched with four binary features, each representing whether the word occurs in the respective seed list. One problem with seed lists is that a word can occur in more than one seed list, so that more than one of these four bits may be active.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Second-stage stacking
</SectionTitle>
      <Paragraph position="0"> The second extension is to use second-stage stacking.</Paragraph>
      <Paragraph position="1"> Stacking in general (Wolpert, 1992) encompasses a class of meta-learning systems that learn to correct errors made by lower-level classifiers. We adopt the particular method pioneered in (Veenstra, 1998) in which classifications of a first memory-based classifier are added as windowed features to the instances presented to the second classifier.</Paragraph>
      <Paragraph position="2"> Since the second-stage classifier also computes the similarities between instances using these extra features, it is able, in principle, to recognise and correct reoccurring patterns of errors within sub-sentential sequences. This could correct errors made due to the &amp;quot;blindness&amp;quot; of the first-stage classifier, which is unaware of its own classifications left or right of the wordform in the current focus position. We used stacking on top of the first extension.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Unannotated data
</SectionTitle>
      <Paragraph position="0"> For both languages a large unannotated dataset was made available for extracting data or information. Alternative to using this data to expand or bootstrap seed lists (Cucerzan and Yarowsky, 1999; Buchholz and Van den Bosch, 2000), we use the unannotated corpus to select useful instances to be added directly to the training set. Not unlike (Yarowsky, 1995) we use confidence of our classifier on unannotated data to enrich itself; that is, by adding confidently-classified instances to the memory. We make the simple assumption that entropy in the class distribution in the nearest neighbour set computed in the classification of a new instance is correlated with the reliability of the classification, when k &gt; 1. When k nearest neighbours all vote for the same class, the entropy of that class vote is 0.0. Alternatively, when the votes tie, the entropy is maximal.</Paragraph>
      <Paragraph position="1"> A secondary heuristic assumption is that it is probably not useful to add (almost) exact matches to the memory, since adding those is likely to have little effect on the performance of the k-NN classifier. More effect can be expected from adding instances to memory that have a low-entropy class distribution in their nearest neighbour set and of which the nearest neighbours are at a relatively  initial system on the test sets of both languages.</Paragraph>
      <Paragraph position="2"> settings mvdm feat. weight k dist. w Eng, initial yes gain ratio 21 IL 1 Eng, seedlist yes gain ratio 5 ID 1 Ger, initial yes gain ratio 21 IL 2 Ger, seedlist yes gain ratio 9 ID 2  tive deepening. &amp;quot;w&amp;quot; stands for window.</Paragraph>
      <Paragraph position="3"> large distance. A large distance entails that the instances contains previously unseen feature values (words), and assuming that the predicted class label is correct, these new values can be valuable in matching and therefore classifying new test material better.</Paragraph>
      <Paragraph position="4"> We applied our selection method to the first 2 million words of the unannotated English dataset. For German we were able to process 0.25 million words. First we applied the classifier with two extensions, seed list information and second stage stacking, to classify the unannotated data. We selected instances with an entropy in the class distribution lower than 0.05 and a distance of the nearest neighbour of at least 0.1. For English, in total 179,391 instances (9%) were selected from the unannotated dataset and added to the training set. For German. markedly less instances were selected: 467 (0.19%).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML