<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0429">
<Title>Named Entity Recognition using Hundreds of Thousands of Features</Title>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Features </SectionTitle>
<Paragraph position="0"> The advantage of being able to handle large numbers of features is that we do not need to consider how well a feature is likely to work in a particular language before proposing it. We use the following features (sketched in code at the end of this section):
1. the word itself, both unchanged and lower-cased;
2. the character 3-grams and 4-grams that compose the word;
3. the word's capitalization pattern and digit pattern;
4. the inverse of the word's length;
5. whether the word contains a dash;
6. whether the word is inside double quote marks;
7. the inverse of the word's position in the sentence, and of the position of that sentence in the document;
8. the POS, CHUNK and LEMMA features from the training data;
9. whether the word is part of any entity, according to a previous application of the TnT-Subcat tagger (Brants, 2000) (see below) trained on the tag set {O, ...};
10. the maximum likelihood estimate, based on the training data, of the word's prior probability of being in each class.</Paragraph>
<Paragraph position="1"> In some runs, we also use:
11. the tag assigned by a previous application of the SVM-Lattice tagger, or by another tagger.</Paragraph>
<Paragraph position="2"> Each of these features is applied not just to the word being featurized, but also to a range of words on either side of it. We typically use a range of three (or, phrased differently, a centered window of seven). We also apply some of these features to the environment of the first occurrence of the word in the document. For example, if the first occurrence of 'Bush' in the document were followed by 'League,' then the second occurrence of 'Bush' would receive the feature 'first-occurrence-is-followed-by-league.' Some values of the above features will be encountered during testing but not during training. For example, a word that occurs in the test set but not the training set will lack a known value for the first feature in the list above. To handle these cases, we assign any feature that appears only once in the training data to a special 'never-before-seen' class. This gives us examples of unseen features at training time, which we can then train on.</Paragraph>
<Paragraph position="3"> Using the Shared Task English training data, this approach to featurization leads to a feature space of well over 600,000 features, while the German data results in over a million features. Individual vectors typically have non-zero extent along only a few hundred of these features.</Paragraph>
<Paragraph position="4"> There is a significant practical consideration in applying the method. The vectors produced by the featurizer for input to the SVM package are voluminous, leading to significant I/O costs and slowing tag assignment. Two methods might ameliorate this problem. First, simple compression techniques would be quite effective in reducing file sizes, if the SVM package supported them. Second, most vectors represent negative examples; a portion of these could probably be eliminated entirely without significantly affecting system performance. We have done no tuning of our feature set, preferring to spend our time adding new features and relying on the SVMs to ignore useless ones. This is advantageous when applying the technique to a language that we do not understand (such as any of the world's various non-English languages).</Paragraph>
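<Paragraph position="5"> To make the featurization concrete, the following minimal Python sketch emits the word-level features in items 1-5 of the list above, plus the sentence-position half of item 7. It is our illustration only; the function name and feature-string encoding are assumptions, not the authors' implementation, and it assumes non-empty tokens.
def word_features(word, index):
    # item 1: the word itself, unchanged and lower-cased
    feats = ["word=" + word, "lower=" + word.lower()]
    # item 2: character 3-grams and 4-grams that compose the word
    for n in (3, 4):
        feats += [f"{n}gram={word[i:i+n]}" for i in range(len(word) - n + 1)]
    # item 3: capitalization pattern (e.g. 'Xxxx') and digit pattern (e.g. '9.9')
    feats.append("cap=" + "".join("X" if c.isupper() else "x" if c.islower() else c for c in word))
    feats.append("dig=" + "".join("9" if c.isdigit() else c for c in word))
    # item 4: inverse of the word's length
    feats.append(f"invlen={1.0 / len(word):.3f}")
    # item 5: whether the word contains a dash
    if "-" in word:
        feats.append("has-dash")
    # item 7 (first half): inverse of the word's position in the sentence
    feats.append(f"invpos={1.0 / (index + 1):.3f}")
    return feats</Paragraph>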
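<Paragraph position="6"> The windowing and first-occurrence devices can be sketched under the same assumptions; the offset prefix and dictionary encoding are our own, not the paper's.
def windowed_features(words, i, radius=3):
    # apply every feature to the word and to three words on either side
    # (a centered window of seven), prefixing each feature with its offset
    feats = []
    for offset in range(-radius, radius + 1):
        j = i + offset
        if j in range(len(words)):
            feats += [f"{offset:+d}:{f}" for f in word_features(words[j], j)]
    return feats

def first_occurrence_features(doc_words):
    # every occurrence of a word inherits a feature naming the word that
    # followed its first occurrence (the 'Bush'/'League' example above)
    first_next = {}
    for i, w in enumerate(doc_words[:-1]):
        first_next.setdefault(w, doc_words[i + 1].lower())
    return {w: "first-occurrence-is-followed-by-" + nxt for w, nxt in first_next.items()}</Paragraph>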
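<Paragraph position="7"> The 'never-before-seen' device amounts to collapsing training singletons to a shared sentinel; the sketch below is illustrative, with the sentinel string and closure structure our own.
from collections import Counter

def build_rare_feature_remap(training_feature_lists):
    # features seen exactly once in training collapse to a shared sentinel,
    # giving the learner training-time examples of "unseen" features
    counts = Counter(f for feats in training_feature_lists for f in feats)
    def remap(feats):
        # at test time, counts[f] is 0 for genuinely unseen features,
        # so they map to the same sentinel
        return [f if counts[f] > 1 else "never-before-seen" for f in feats]
    return remap</Paragraph>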
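<Paragraph position="8"> Since each vector is non-zero along only a few hundred of the 600,000+ dimensions, vectors would naturally be written in a sparse index:value form; the paper does not specify its file format, so the format and the feat_index mapping below are assumptions.
def write_sparse_vector(label, feats, feat_index, out):
    # feat_index maps feature strings to integer dimensions (hypothetical);
    # only active dimensions are written, yet at corpus scale the files
    # remain voluminous, hence the I/O concern discussed above
    pairs = sorted(feat_index[f] for f in set(feats) if f in feat_index)
    out.write(label + " " + " ".join(f"{i}:1" for i in pairs) + "\n")</Paragraph>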
</Section>
</Paper>