<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0429"> <Title>Named Entity Recognition using Hundreds of Thousands of Features</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Model </SectionTitle> <Paragraph position="0"> We are interested in a lattice-based approach to named entity recognition. In this approach, each sentence is processed individually. A lattice is built with one column per word of the sentence (plus a start state). Each column contains one vertex for each possible tag. Each vertex in one column is connected by an edge to every vertex in the next column that may legitimately follow it (some transitions, such as from I-LOC to B-PER are disallowed).</Paragraph> <Paragraph position="1"> Given such a lattice, our task is first to assign probabilities to each of the arcs, then to find the highest likelihood path through the lattice based on those probabilities. This path corresponds to the highest likelihood tagging of the sentence.</Paragraph> <Paragraph position="2"> Hidden Markov models break the probability calculations into two pieces: transition probabilities (the probability of moving from one vertex to another independent of the word at the destination node), and emission probabilities (the probability that a given word would be generated from a certain state independent of the path taken to get to that state). These probability distributions are calculated separately because the training data are typically too sparse to support a reasonable maximum likelihood estimate of the joint probability. However, there is no reason that these two distributions could not be combined given a suitable estimation technique.</Paragraph> <Paragraph position="3"> A support vector machine is a binary classifier that uses supervised training to predict whether a given vector is in a target class. All SVM training and test data occupy a single high-dimensional vector space. In its simplest form, training an SVM amounts to finding the hyperplane that separates the positive training samples from the negative samples by the largest possible margin. This hyper-plane is then used to classify the test vectors; those that lie on one side of the hyperplane are classified as members of the positive class, while others are classified as members of the negative class. In addition to the classification decision, the SVM also produces a margin for each vector-its distance from the hyperplane.</Paragraph> <Paragraph position="4"> SVMs have two useful properties for our purposes.</Paragraph> <Paragraph position="5"> First, they can handle very high dimensional spaces, as long as individual vectors are sparse (i.e., each vector has extent along only a small subset of the dimensions). Secondly, SVMs are resistant to overtraining, because only the training vectors that are closest to the hyperplane (called support vectors) dictate the parameters for the hyperplane. So SVMs would seem to be ideal candidates for estimating lattice probabilities.</Paragraph> <Paragraph position="6"> Unfortunately, SVMs do not produce probabilities, but rather margins. In fact, one of the reasons that SVMs work so well is precisely because they do not attempt to model the entire distribution of training points. To use SVMs in a lattice approach, then, a mechanism is needed to estimate probability of category membership given a margin.</Paragraph> <Paragraph position="7"> Platt (1999) suggests such a method. 
<Paragraph position="7"> Platt (1999) suggests such a method. If the range of possible margins is partitioned into bins, and the positive and negative training vectors are placed into these bins, each bin will have a certain percentage of positive examples.</Paragraph>
<Paragraph position="8"> These percentages can be approximated by a sigmoid function: P(y = 1 | f) = 1/(1 + exp(Af + B)). Platt gives a simple iterative method for estimating the sigmoid parameters A and B, given a set of training vectors and their margins.</Paragraph>
<Paragraph position="9"> This approach can work well if a sufficient number of positive training vectors are available. Unfortunately, in the CoNLL-2003 shared task, many of the possible label transitions have few exemplars. Two methods are available to handle insufficient training data: smoothing and guessing.</Paragraph>
<Paragraph position="10"> In the smoothing approach, linear interpolation is used to combine the model for the source-to-target pair that lacks sufficient data with a model built from all transitions going to the target label. For example, we could smooth the probabilities derived for the I-ORG to I-LOC transition with the probability that any tag would transition to the I-LOC state at the same point in the sentence.</Paragraph>
<Paragraph position="11"> The second approach is to guess at an appropriate model without examining the training data. While in theory this could prove to be a terrible approach, in practice for the shared task, selecting fixed sigmoid parameters works better than using Platt's method to train them. Thus, we fix A = -2 and B = 0. We continue to believe that Platt's method, or something like it, will ultimately lead to superior performance, but our current experiments use this untrained model.</Paragraph>
<Paragraph position="12"> Our overall approach, then, is to use SVMs to estimate lattice transition probabilities. First, due to the low frequency of B-XXX tags in the training data, we convert each B-XXX tag to the corresponding I-XXX tag; thus, our system never predicts B-XXX tags. Then, we featurize the training data, forming sparse vectors suitable for input to our SVM package, SVMLight 5.00 (Joachims, 1999). Our feature set is described in the following section. Next, we train one SVM for each transition type seen in the training data. We used a cubic kernel for all of our experiments; this kernel gives a consistent boost over a linear kernel while still training in a reasonable amount of time. If we were to use Platt's approach, the resulting classifiers would be applied to further (preferably held-out) training data to produce a set of margins, which would be used to estimate appropriate sigmoid parameters for each classifier. Sigmoid estimates that suffered from too few positive input vectors would be replaced by static estimates, and the sigmoids would optionally be smoothed.</Paragraph>
<Paragraph position="13"> To evaluate a test set, the test input is featurized using the same features as were used with the training data, resulting in a separate vector for each word of the input.</Paragraph>
<Paragraph position="14"> Each classifier built during the training phase is then applied to each test vector to produce a margin. The margin is mapped to a probability estimate using the static sigmoid described above. When all of the probabilities have been estimated and applied to the lattice, a Viterbi-like algorithm is used to find the most likely path through the lattice. This path identifies the final tag for each word of the input sentence.</Paragraph>
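A compact sketch of this decoding step follows: each transition classifier's margin is pushed through the fixed sigmoid (A = -2, B = 0) and a Viterbi-style dynamic program selects the most likely path. The `margins` data structure and the function names are illustrative assumptions; they are not taken from the paper or from SVMLight.

```python
# A minimal sketch of the decoding step, not the authors' implementation.
# margins[i][(prev_tag, tag)] is assumed to hold the margin produced by the
# SVM for that transition when applied to the feature vector of word i;
# transitions that are disallowed or were never trained are simply absent.

import math

A, B = -2.0, 0.0                 # the fixed sigmoid parameters from the text

def margin_to_prob(margin):
    """Map an SVM margin to a probability estimate via P = 1/(1 + exp(A*m + B))."""
    return 1.0 / (1.0 + math.exp(A * margin + B))

def decode(margins, tags, start="<START>"):
    """Viterbi-like search for the most likely tag path through the lattice."""
    n = len(margins)
    # best[i][tag] = (log-prob of the best path ending in tag at word i, previous tag)
    best = [{} for _ in range(n)]
    for tag in tags:
        if (start, tag) in margins[0]:
            best[0][tag] = (math.log(margin_to_prob(margins[0][(start, tag)])), None)
    for i in range(1, n):
        for tag in tags:
            candidates = [
                (score + math.log(margin_to_prob(margins[i][(prev, tag)])), prev)
                for prev, (score, _) in best[i - 1].items()
                if (prev, tag) in margins[i]
            ]
            if candidates:
                best[i][tag] = max(candidates)
    # Trace back from the highest-scoring final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(n - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))
```

If Platt's method were used instead of the fixed parameters, A and B would be fit per classifier from held-out margins and passed into `margin_to_prob` in place of the constants above.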
</Section>
</Paper>