<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1012"> <Title>Reducing Weight Undertraining in Structured Discriminative Learning</Title>
<Section position="3" start_page="89" end_page="89" type="metho">
<SectionTitle> 2 Conditional Random Fields </SectionTitle>
<Paragraph position="0"> Conditional random fields (CRFs) (Lafferty et al., 2001; Sutton and McCallum, 2006) are undirected graphical models of a conditional distribution. Let G be an undirected graphical model over random vectors y and x. As a typical special case, y = \{y_t\} and x = \{x_t\} for t = 1, \ldots, T, so that y is a labeling of an observed sequence x. Given the collection C of cliques in G, a CRF models the conditional probability of an assignment to the labels y given the observed variables x as
p_\Lambda(y \mid x) = \frac{1}{Z(x)} \prod_{c \in C} \Phi(y_c, x_c), \quad (1)
where \Phi is a potential function and Z(x) = \sum_{y'} \prod_{c \in C} \Phi(y'_c, x_c) is a normalization factor over all possible label assignments. We assume the potentials factorize according to a set of features \{f_k\}, which are given and fixed, so that
\Phi(y_c, x_c) = \exp\left( \sum_k \lambda_k f_k(y_c, x_c) \right). \quad (2)
The model parameters are a set of real weights \Lambda = \{\lambda_k\}, one weight for each feature.</Paragraph>
<Paragraph position="1"> Many applications have used the linear-chain CRF, in which a first-order Markov assumption is made on the hidden variables. In this case, the cliques of the conditional model are the nodes and edges, so that there are feature functions f_k(y_{t-1}, y_t, x, t) over each label transition and the observation. (Here we write the feature functions as potentially depending on the entire input sequence.) Feature functions can be arbitrary. For example, a feature function f_k(y_{t-1}, y_t, x, t) could be a binary test that has value 1 if and only if y_{t-1} has the label &quot;adjective&quot;, y_t has the label &quot;proper noun&quot;, and x_t begins with a capital letter.</Paragraph>
<Paragraph position="2"> Linear-chain CRFs correspond to finite state machines, and can be roughly understood as conditionally-trained hidden Markov models (HMMs). This class of CRFs is also a globally-normalized extension to Maximum Entropy Markov Models (McCallum et al., 2000) that avoids the label bias problem (Lafferty et al., 2001). Note that the number of state sequences is exponential in the input sequence length T. In linear-chain CRFs, the partition function Z(x), the node marginals p(y_i | x), and the Viterbi labeling can be calculated efficiently by variants of the dynamic programming algorithms for HMMs.</Paragraph>
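<Paragraph position="3"> To make the dynamic programming concrete, the following is a minimal sketch in Python (not code from the paper; the function name, array layout, and the choice to fold node features into the edge potentials are assumptions of this illustration) of the forward recursion that computes log Z(x) for a linear-chain CRF:
import numpy as np
from scipy.special import logsumexp

def log_partition(log_potentials):
    # log_potentials has shape (T-1, K, K); entry [t, i, j] is the clique score
    # sum_k lambda_k * f_k(y_t = i, y_{t+1} = j, x, t), with node features
    # folded into the edges (an assumption of this sketch).
    alpha = np.zeros(log_potentials.shape[1])     # forward log-messages
    for scores in log_potentials:
        # alpha'(j) = logsumexp_i [ alpha(i) + scores(i, j) ]
        alpha = logsumexp(alpha[:, None] + scores, axis=0)
    return logsumexp(alpha)                       # log Z(x)

# Example: T = 5 positions, K = 3 labels, random scores standing in for the
# weighted feature sums of Equation 2.
rng = np.random.default_rng(0)
print(log_partition(rng.normal(size=(4, 3, 3))))
Replacing logsumexp with a maximum (plus backpointers) gives the Viterbi labeling, and combining the forward messages with a backward pass gives the node marginals p(y_i | x).</Paragraph>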
</Section>
<Section position="4" start_page="89" end_page="90" type="metho">
<SectionTitle> 3 Weight Undertraining </SectionTitle>
<Paragraph position="0"> In this section, we give a simple demonstration of weight undertraining. In a discriminative classifier, such as a neural network or logistic regression, a few strong features can drown out the effect of many individually weaker features, even if the weak features are just as indicative when put together. To demonstrate this effect, we present an illustrative experiment using logistic regression, because of its strong relation to CRFs. (Linear-chain conditional random fields are the generalization of logistic regression to sequence data.)</Paragraph>
<Paragraph position="1"> We generate data points x = (x_1, \ldots, x_n) as independent standard normal variables. The output y is a binary variable whose probability depends on all of the x_i, with p(y = 1 \mid x) = \sigma\left(\sum_{i=1}^{n} x_i\right), where \sigma is the logistic function. The correct decision boundary in this synthetic problem is the hyperplane orthogonal to the weight vector (1, 1, \ldots, 1). Thus, if n is large, each x_i contributes weakly to the output y. Finally, we include a highly indicative feature x_S, a noisy copy of the label: x_S = a y + \epsilon with \epsilon \sim N(0, \sigma^2 = 0.04). This variable alone is sufficient to determine the distribution of y. The variable a is a parameter of the problem that determines how strongly indicative x_S is; specifically, when a = 0, the variable x_S is random noise.</Paragraph>
<Paragraph position="2"> We choose this synthetic model by analogy to Pomerleau's observations. The x_i correspond to the side of the road in Pomerleau's case--the weak features present at both testing and training--and x_S corresponds to the ditch--the strongly indicative feature that is corrupted at test time.</Paragraph>
<Paragraph position="3"> We examine how badly the learned classifier is degraded when the x_S feature is present at training time but missing at test time. For several values of the weight parameter a, we train a regularized logistic regression classifier on 1000 instances with n = 10 weak variables. In Figure 1, we show how the amount of error caused by ablating x_S varies with its strength.</Paragraph>
<Paragraph position="4"> [Figure 1: Performance of logistic regression on synthetic data; the x-axis indicates the strength of the strong feature. In the top line, the strong feature is present at training and test time. In the bottom line, the strong feature is present in the training data but missing at test time.]</Paragraph>
<Paragraph position="5"> When x_S is weakly indicative, it does not affect the predictions of the model at all, and the classifier's performance is the same whether it appears at test time or not. When x_S becomes strongly indicative, however, the classifier learns to depend on it, and performs much more poorly when x_S is ablated, even though exactly the same information is available in the weak features.</Paragraph>
</Section>
<Section position="5" start_page="90" end_page="91" type="metho">
<SectionTitle> 4 Feature Bagging </SectionTitle>
<Paragraph position="0"> In this section, we describe the feature bagging method. We divide the set of features F = \{f_k\} into a collection of possibly overlapping subsets F = \{F_1, \ldots, F_m\}, which we call feature bags. We train an individual CRF on each of the feature bags using standard MAP training, yielding individual models \{p_i(y \mid x)\}, i = 1, \ldots, m. We then average the individual CRFs into a single combined model. This averaging can be performed in several ways: we can average probabilities of entire sequences, or of individual transitions; and we can average using the arithmetic mean, or the geometric mean. This yields four combination methods:</Paragraph>
<Paragraph position="1"> 1. Per-sequence mixture. The distribution over label sequences y given inputs x is modeled as a mixture of the individual CRFs. Given nonnegative weights \{\alpha_1, \ldots, \alpha_m\} that sum to 1, the combined model is given by
p(y \mid x) = \sum_{i=1}^{m} \alpha_i p_i(y \mid x). \quad (3)
It is easily seen that if the sequence model is defined as in Equation 3, then the pairwise marginals are mixtures as well:
p(y_t, y_{t+1} \mid x) = \sum_{i=1}^{m} \alpha_i p_i(y_t, y_{t+1} \mid x), \quad (4)
where the p_i(y_t, y_{t+1} \mid x) are marginal probabilities in the individual models, which can be efficiently computed by the forward-backward algorithm.</Paragraph>
<Paragraph position="2"> We can perform decoding in the mixture model by maximizing the individual node marginals: that is, to predict y_t we choose \arg\max_{y_t} p(y_t \mid x), where p(y_t \mid x) is computed by first running forward-backward on each of the individual CRFs. In the results here, however, we compute the maximum probability sequence approximately, as follows. We form the linear-chain distribution q* that minimizes the divergence KL(p \| q) over all linear-chain distributions q, and return the most probable sequence according to q*. The mixture weights can be selected in a variety of ways, including equal voting, as in traditional bagging, or EM.</Paragraph>
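<Paragraph position="3"> As one way to realize this approximate decoding, the following Python sketch (illustrative only; the function and variable names, array shapes, and the toy inputs are assumptions, not the paper's code) mixes the experts' pairwise marginals as in Equation 4, forms a linear-chain distribution with those pairwise marginals, and runs Viterbi on it:
import numpy as np

def mixture_viterbi(pairwise_list, weights):
    # pairwise_list: one array per expert, shape (T-1, K, K), holding that
    # expert's pairwise marginals p_i(y_t, y_{t+1} | x) from forward-backward.
    # weights: the mixture weights alpha_i (nonnegative, summing to 1).
    mix = sum(a * p for a, p in zip(weights, pairwise_list))   # Equation 4
    eps = 1e-12                                 # avoid log(0) in this sketch
    delta = np.log(mix[0].sum(axis=1) + eps)    # log p(y_1 | x): initial Viterbi scores
    backpointers = []
    for pair in mix:
        # log p(y_{t+1} = j | y_t = i) = log p(y_t = i, y_{t+1} = j) - log p(y_t = i)
        trans = np.log(pair + eps) - np.log(pair.sum(axis=1, keepdims=True) + eps)
        scores = delta[:, None] + trans
        backpointers.append(scores.argmax(axis=0))
        delta = scores.max(axis=0)
    labels = [int(delta.argmax())]
    for bp in reversed(backpointers):
        labels.append(int(bp[labels[-1]]))
    return list(reversed(labels))

# Toy example: two experts, T = 4, K = 2, equal mixture weights. Real pairwise
# marginals would come from forward-backward and be mutually consistent.
rng = np.random.default_rng(0)
experts = []
for _ in range(2):
    m = rng.random((3, 2, 2))
    experts.append(m / m.sum(axis=(1, 2), keepdims=True))
print(mixture_viterbi(experts, [0.5, 0.5]))
</Paragraph>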
<Paragraph position="4"> 2. Per-sequence product of experts. These are the logarithmic opinion pools that have been applied to CRFs by Smith et al. (2005). The distribution over label sequences y given inputs x is modeled as a product of experts (Hinton, 2000). In a product of experts, instead of summing the probabilities from the individual models, we multiply them together; essentially, we take a geometric mean instead of an arithmetic mean. Given nonnegative weights \{\alpha_1, \ldots, \alpha_m\} that sum to 1, the product model is
p(y \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{m} p_i(y \mid x)^{\alpha_i}. \quad (5)
The combined model can also be viewed as a conditional random field whose features are the log probabilities from the original models:
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{m} \alpha_i \log p_i(y \mid x) \right). \quad (6)
Substituting the definition of the individual CRFs (Equation 2) into Equation 6 yields
p(y \mid x) \propto \exp\left( \sum_k \left( \sum_{i=1}^{m} \alpha_i \lambda_k^{(i)} \right) f_k(y, x) \right), \quad (7)
so the model in Equation 7 is simply a single CRF whose parameters are a weighted average of the original parameters. So feature bagging using the product method does not increase the family of models that are considered: standard training of a single CRF on all available features could potentially pick the same parameters as the bagged model. Nevertheless, in Section 5, we show that this feature bagging method performs better than standard CRF training.</Paragraph>
<Paragraph position="5"> The previous two combination methods combine the individual models by averaging probabilities of entire sequences. Alternatively, in a sequence model we can average probabilities of individual transitions p(y_t \mid y_{t-1}, x). Computing these transition probabilities requires performing probabilistic inference in each of the original CRFs, because p_i(y_t \mid y_{t-1}, x) is a ratio of marginal probabilities of the individual model. This yields two other combination methods:</Paragraph>
<Paragraph position="6"> 3. Per-transition mixture. The transition probabilities are modeled as
p(y_t \mid y_{t-1}, x) = \sum_{i=1}^{m} \alpha_i p_i(y_t \mid y_{t-1}, x). \quad (8)
Intuitively, the difference between per-sequence and per-transition mixtures can be understood generatively. In order to generate a label sequence y given an input x, the per-sequence model selects a mixture component, and then generates y using only that component. The per-transition model, on the other hand, selects a component at each time step, generates y_t from that component, selects a new component for the next step, and so on.</Paragraph>
<Paragraph position="7"> 4. Per-transition product of experts. Finally, we can combine the transition distributions using a product model
p(y_t \mid y_{t-1}, x) \propto \prod_{i=1}^{m} p_i(y_t \mid y_{t-1}, x)^{\alpha_i}. \quad (9)
Each transition distribution is thus--similarly to the per-sequence case--an exponential-family distribution whose features are the log transition probabilities from the individual models. Unlike the per-sequence product, there is no weight-averaging trick here, because the probabilities p_i(y_t \mid y_{t-1}, x) are marginal probabilities. Considered as a sequence distribution p(y \mid x), the per-transition product is a locally-normalized maximum-entropy Markov model (McCallum et al., 2000). It would not be expected to suffer from label bias, however, because each of the features takes the future into account: they are marginal probabilities from CRFs.</Paragraph>
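<Paragraph position="8"> Because of the parameter-averaging view in Equation 7, the per-sequence product of two feature bags requires no extra inference machinery at decoding time. The following is a minimal Python sketch of that view (the feature names, weights, and the value of alpha are hypothetical; this is not the paper's implementation), treating features absent from a bag as having weight zero in that expert:
from typing import Dict

def product_of_experts(lambda1: Dict[str, float],
                       lambda2: Dict[str, float],
                       alpha: float) -> Dict[str, float]:
    # Weighted average of two CRF weight vectors indexed by feature name:
    # weight alpha on the first expert and (1 - alpha) on the second.
    combined = {}
    for k in set(lambda1) | set(lambda2):
        combined[k] = alpha * lambda1.get(k, 0.0) + (1.0 - alpha) * lambda2.get(k, 0.0)
    return combined

# Hypothetical example: a lexicon-bag expert and an ngram-bag expert.
lexicon_crf = {"word=Smith": 1.2, "in-lexicon:PER": 2.0}
ngram_crf = {"word=Smith": 0.9, "suffix=ith": 0.4}
print(product_of_experts(lexicon_crf, ngram_crf, alpha=0.55))
The combined weights define a single linear-chain CRF, so decoding proceeds with the usual Viterbi algorithm.</Paragraph>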
<Paragraph position="9"> Of these four combination methods, Method 2, the per-sequence product of experts, is originally due to Smith et al. (2005). The other three combination methods are, as far as we know, novel. In the next section, we compare the four combination methods on several sequence labeling tasks. Although for concreteness we describe them in terms of sequence models, they may be generalized to arbitrary graphical structures.</Paragraph>
</Section>
<Section position="6" start_page="91" end_page="92" type="metho">
<SectionTitle> 5 Results </SectionTitle>
<Paragraph position="0"> We evaluate feature bagging on two natural language tasks, named entity recognition and noun-phrase chunking. We use the standard CoNLL 2003 English data set, which is taken from Reuters newswire and consists of a training set of 14987 sentences, a development set of 3466 sentences, and a testing set of 3684 sentences. The named-entity labels in this data set correspond to people, locations, organizations, and other miscellaneous entities. Our second task is noun-phrase chunking. We use the standard CoNLL 2000 data set, which consists of 8936 sentences for training and 2012 sentences for testing, taken from Wall Street Journal articles annotated by the Penn Treebank project. Although the CoNLL 2000 data set is labeled with other chunk types as well, here we use only the NP chunks.</Paragraph>
<Paragraph position="1"> As is standard, we compute precision and recall for both tasks based upon the chunks (or named entities, for the first task). We report the harmonic mean of precision and recall as F_1 = 2PR / (P + R).</Paragraph>
<Paragraph position="2"> For both tasks, we use per-sequence product-of-experts feature bagging with two feature bags, which we manually choose based on prior experience with the data set. For each experiment, we report two baseline CRFs, one trained on the union of the two feature sets, and one trained only on the features that were present in both bags, such as lexical identity and regular expressions. In both data sets, we trained the individual CRFs with a Gaussian prior on the parameters with variance \sigma^2 = 10.</Paragraph>
<Paragraph position="3"> For the named entity task, we use two feature bags based upon character ngrams and lexicons. Both bags contain a set of baseline features, such as word identity and regular expressions (Table 4). The ngram CRF includes binary features for character ngrams of length 2, 3, and 4 and word prefixes and suffixes of length 2, 3, and 4. The lexicon CRF includes membership features for a variety of lexicons containing people names, places, and company names. The combined model has 2,342,543 features. The mixture weight \alpha is selected using the development set.</Paragraph>
<Paragraph position="4"> For the chunking task, the two feature sets are selected based upon part of speech and lexicons. Again, a set of baseline features is used, similar to the regular expression and word identity features used on the named entity task (Table 4). The first bag also includes part-of-speech tags generated by the Brill tagger and the conjunctions of those tags used by Sha and Pereira (2003). The second bag uses lexicon membership features for lexicons containing names of people, places, and organizations. In addition, we use part-of-speech lexicons generated from the entire Treebank, such as a list of all words that appear as nouns. These lists are also used by the Brill tagger (Brill, 1994). The combined model uses 536,203 features. The mixture weight \alpha is selected using 2-fold cross validation.</Paragraph>
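<Paragraph position="5"> The paper does not spell out the weight search, so the following is only a hypothetical Python sketch of selecting the single mixture weight alpha for a two-bag product by grid search against held-out F1; evaluate_f1 is an assumed stand-in for decoding the development data with the combined model and scoring it with the CoNLL evaluation:
import numpy as np

def select_alpha(lambda1, lambda2, evaluate_f1, grid=np.linspace(0.0, 1.0, 21)):
    best_alpha, best_f1 = None, -1.0
    for alpha in grid:
        combined = {k: alpha * lambda1.get(k, 0.0) + (1.0 - alpha) * lambda2.get(k, 0.0)
                    for k in set(lambda1) | set(lambda2)}
        f1 = evaluate_f1(combined)     # decode the dev set, return chunk F1
        if f1 > best_f1:
            best_alpha, best_f1 = alpha, f1
    return best_alpha, best_f1

def dummy_f1(weights):                 # stand-in scorer for illustration only
    return -abs(weights.get("suffix=ith", 0.0) - 0.2)

print(select_alpha({"suffix=ith": 0.4}, {"suffix=ith": 0.0}, dummy_f1))
</Paragraph>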
<Paragraph position="6"> The chosen model had weight 0.55 on the lexicon model and weight 0.45 on the ngram model.</Paragraph>
<Paragraph position="7"> In both data sets, the bagged model performs better than the single CRF trained with all of the features. For the named entity task, bagging improves performance from 85.45% to 86.61%, a substantial error reduction of 8.32%. This is lower than the best reported result for this data set, 89.3% (Ando and Zhang, 2005), which uses a large amount of unlabeled data. For the chunking task, bagging improves performance from 94.34% to 94.77%, an error reduction of 7.60%. In both data sets, the improvement is statistically significant (McNemar's test; p < 0.01).</Paragraph>
<Paragraph position="8"> On the chunking task, the bagged model also outperforms the models of Kudo and Matsumoto (2001) and Sha and Pereira (2003), and equals the currently-best results of Ando and Zhang (2005), who use a large amount of unlabeled data. Although we use lexicons that were not included in the previous models, the additional features actually do not help the original CRF. Only with feature bagging do these lexicons improve performance.</Paragraph>
<Paragraph position="9"> Finally, we compare the four bagging methods of Section 4: the per-sequence product of experts used above, the per-sequence mixture, the per-transition mixture, and the per-transition product of experts. On the named entity data, all four models perform in a statistical tie, with no statistically significant difference in their performance (Table 1). [Table caption fragment: &quot;... Task. The bagged CRF performs significantly better than a single CRF with all available features (McNemar's test; p < 0.01).&quot;] As we mentioned in the last section, the decoding procedure for the per-sequence mixture is approximate. It is possible that a different decoding procedure, such as maximizing the node marginals, would yield better performance.</Paragraph>
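<Paragraph position="10"> The significance statements above rely on McNemar's test for paired classifiers. The sketch below is a generic version of that test (the unit of comparison--token, chunk, or sentence--is an assumption here, since the paper does not specify it):
import numpy as np
from scipy.stats import chi2

def mcnemar(correct_a, correct_b):
    # correct_a, correct_b: booleans saying whether each item was labeled
    # correctly by model A and by model B, respectively.
    right_a = np.asarray(correct_a, dtype=bool)
    right_b = np.asarray(correct_b, dtype=bool)
    b = int(np.sum(np.logical_and(right_a, np.logical_not(right_b))))   # A right, B wrong
    c = int(np.sum(np.logical_and(np.logical_not(right_a), right_b)))   # A wrong, B right
    stat = (abs(b - c) - 1) ** 2 / (b + c)      # continuity-corrected statistic
    return stat, chi2.sf(stat, df=1)            # test statistic and p-value

stat, p = mcnemar([1, 1, 0, 1, 0, 1], [1, 0, 0, 0, 0, 1])
print(stat, p)
</Paragraph>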
</Section>
<Section position="7" start_page="92" end_page="93" type="metho">
<SectionTitle> 6 Previous Work </SectionTitle>
<Paragraph position="0"> In the machine learning literature, there is much work on ensemble methods such as stacking, boosting, and bagging. Generally, the ensemble of classifiers is generated by training on different subsets of data, rather than different features. However, there is some literature within unstructured classification on combining models trained on feature subsets. Ho (1995) creates an ensemble of decision trees by randomly choosing a feature subset on which to grow each tree using standard decision tree learners. Other work along these lines includes that of Bay (1998) using nearest-neighbor classifiers, and more recently Bryll et al. (2003). Also, in Breiman's work on random forests (2001), ensembles of random decision trees are constructed by choosing a random feature at each node. This literature mostly has the goal of improving accuracy by reducing the classifier's variance, that is, reducing overfitting.</Paragraph>
<Paragraph position="1"> In contrast, O'Sullivan et al. (2000) specifically focus on increasing robustness by training classifiers to use all of the available features. Their algorithm FeatureBoost is analogous to AdaBoost, except that the meta-learning algorithm maintains weights on features instead of on instances. Feature subsets are automatically sampled based on which features, if corrupted, would most affect the ensemble's prediction. They show that FeatureBoost is more robust than AdaBoost on synthetically corrupted UCI data sets. Their method does not easily extend to sequence models, especially natural-language models with hundreds of thousands of features.</Paragraph>
<Paragraph position="2"> [Table caption fragment, likely from the chunking results table: &quot;The bagged CRF performs significantly better than a single CRF (McNemar's test; p < 0.01), and equals the results of (Ando and Zhang, 2005), who use a large amount of unlabeled data.&quot;]</Paragraph>
<Paragraph position="3"> [Table caption fragment, likely from Table 4 (baseline features): &quot;... is the POS tag at position t, w ranges over all words in the training data, and P ranges over all chunk tags supplied in the training data. The 'appears to be' features are based on hand-designed regular expressions.&quot;]</Paragraph>
<Paragraph position="4"> There is less work on ensembles of sequence models, as opposed to unstructured classifiers. One example is Altun, Hofmann, and Johnson (2003), who describe a boosting algorithm for sequence models, but they boost instances, not features. In fact, the main advantage of their technique is increased model sparseness, whereas in this work we aim to make full use of more features to increase accuracy and robustness.</Paragraph>
<Paragraph position="5"> Most closely related to the present work is that on logarithmic opinion pools for CRFs (Smith et al., 2005), which we have called the per-sequence product of experts in this paper. The previous work focuses on reducing overfitting, combining a model of many features with several simpler models. In contrast, here we apply feature bagging to reduce feature undertraining, combining several models with complementary feature sets. Our current positive results are probably not due to a reduction in overfitting, for as we have observed, all the models we test, including the bagged one, have 99.9% F1 on the training set. Still, feature undertraining can be viewed as a type of overfitting, because it arises when a set of features is more indicative in the training set than in the testing set. Understanding this particular type of overfitting is useful, because it motivates the choice of feature bags that we explore in this work. Indeed, one contribution of the present work is demonstrating how a careful choice of feature bags can yield state-of-the-art performance.</Paragraph>
<Paragraph position="6"> Concurrently and independently, Smith and Osborne (2006) present similar experiments on the CoNLL-2003 data set, examining a per-sequence product of experts (that is, a logarithmic opinion pool) in which the lexicon features are trained separately. Their work presents more detailed error analysis than we do here, while we present results both on other combination methods and on NP chunking.</Paragraph>
</Section>
</Paper>