Competitive generative models with structure learning for NLP classification tasks (W06-1668)

2 Model Classes and Methodology

2.1 Generative Models

In classification tasks, given a training set of instances D = {[x_i, y_i]}, where x_i are the input features for the i-th instance and y_i is its label, the task is to learn a classifier that predicts the labels of new examples. If X is the space of inputs and Y is the space of labels, a classifier is a function f : X → Y. A generative model is one that models the joint probability of inputs and labels P_D(x, y) through a distribution P_θ(x, y), dependent on some parameter vector θ. The classifier based on this generative model chooses the most likely label given an input according to the conditionalized estimated joint distribution. The parameters θ of the fitted distribution are usually estimated using the maximum joint likelihood estimate, possibly with a prior.

We study generative models represented as Bayesian Networks (Pearl, 1988), because their parameters can be estimated extremely fast: the maximizer of the joint likelihood is the closed-form relative frequency estimate. A Bayesian Network is an acyclic directed graph over a set of nodes. For every variable Z, let Pa(Z) denote the set of parents of Z. The structure of the Bayesian Network encodes the following set of independence assumptions: every variable is conditionally independent of its non-descendants given its parents. For example, the structure of the Bayesian Network model in Figure 1 encodes the independence assumption that the input features are conditionally independent given the class label.

Let the input be represented as a vector of m nominal features. We define Bayesian Networks over the m input variables X_1, X_2, ..., X_m and the class variable Y. In all networks, we add links from the class variable Y to all input features. In this way we have generative models which estimate class-specific distributions over features P(X|Y) and a prior over labels P(Y). Figure 1 shows a simple Bayesian Network of this form, which is the well-known Naive Bayes model.

A specific joint distribution for a given Bayesian Network (BN) is given by a set of conditional probability tables (CPTs), which specify the distribution over each variable given its parents, P(Z|Pa(Z)). The joint distribution factors as

  P(Z_1, Z_2, \ldots, Z_n) = \prod_{i=1}^{n} P(Z_i \mid Pa(Z_i)).

The parameters of a Bayesian Network model given its graph structure are the values of the conditional probabilities P(Z_i|Pa(Z_i)). If the model is trained by maximizing the joint likelihood of the data, the optimal parameters are the relative frequency estimates:

  \hat{P}(Z_i = v \mid Pa(Z_i) = \vec{u}) = \frac{count(Z_i = v, Pa(Z_i) = \vec{u})}{count(Pa(Z_i) = \vec{u})}.

Here v denotes a value of Z_i and \vec{u} denotes a vector of values for the parents of Z_i.
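As an illustration of these closed-form estimates (a minimal sketch, not the authors' code), the following Python fragment fits the Naive Bayes structure of Figure 1 by relative frequency and classifies with the conditionalized joint distribution; all function and variable names are hypothetical, and no smoothing is applied yet (smoothing is discussed next).

```python
from collections import Counter

def fit_naive_bayes(data):
    """Relative-frequency estimates for the Naive Bayes structure:
    a label prior P(Y) and class-conditional tables P(X_j | Y)."""
    label_counts = Counter()
    feature_counts = Counter()          # (j, value, label) -> count
    n = 0
    for x, y in data:                   # x is a tuple of m nominal feature values
        n += 1
        label_counts[y] += 1
        for j, v in enumerate(x):
            feature_counts[(j, v, y)] += 1

    prior = {y: c / n for y, c in label_counts.items()}

    def cpt(j, v, y):
        # P_hat(X_j = v | Y = y) = count(X_j = v, Y = y) / count(Y = y)
        return feature_counts[(j, v, y)] / label_counts[y]

    return prior, cpt

def classify(x, labels, prior, cpt):
    # choose the most likely label under the conditionalized joint distribution
    def joint(y):
        p = prior[y]
        for j, v in enumerate(x):
            p *= cpt(j, v, y)
        return p
    return max(labels, key=joint)
```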
Most often, smoothing is applied to avoid zero probability estimates. A simple form of smoothing is add-α smoothing, which is equivalent to a Dirichlet prior. For NLP tasks it has been shown that other smoothing methods are far superior to add-α smoothing; see, for example, Goodman (2001). In particular, it is important to incorporate lower-order information based on subsets of the conditioning information. Therefore we assume a structural form of the conditional probability tables which implements a more sophisticated type of smoothing: interpolated Witten-Bell (Witten and Bell, 1991). This kind of smoothing has also been used in the generative parser of Collins (1997) and has been shown to perform relatively well for language modeling (Goodman, 2001).

To describe the form of the conditional probability tables, we introduce some notation. Let Z denote a variable in the BN and Z_1, Z_2, ..., Z_k denote the set of its parents. The probability P(Z = z | Z_1 = z_1, Z_2 = z_2, ..., Z_k = z_k) is estimated using Witten-Bell smoothing as follows (below, the tuple of values z_1, z_2, ..., z_k is denoted by z_1^k):

  P_{WB}(z \mid z_1^k) = \lambda(z_1^k)\,\hat{P}(z \mid z_1^k) + (1 - \lambda(z_1^k))\,P_{WB}(z \mid z_1^{k-1}).

In the above equation, \hat{P} is the relative frequency estimator. The recursion is ended by interpolating with a uniform distribution 1/|V_z|, where V_z is the vocabulary of values for the prediction variable Z. We determine the interpolation back-off order by looking at the number of values of each variable. We apply the following rule: the variable with the highest number of values observed in the training set is backed off first, then the variable with the next highest number of values, and so on. Typically, the class variable will be backed off last according to this rule.

In Witten-Bell smoothing, the values of the interpolation coefficients are

  \lambda(z_1^k) = \frac{count(z_1^k)}{count(z_1^k) + d \cdot |\{z : count(z, z_1^k) > 0\}|}.

The weight of the relative frequency estimate based on a given context increases if the context has been seen more often in the training data, and decreases if the context has been seen with more distinct values of the predicted variable z.

Looking at the form of our conditional probability tables, we can see that the major parameters are estimated directly from the counts of events in the training data. In addition, there are interpolation parameters (denoted by d above), which participate in computing the interpolation weights λ. The d parameters are hyper-parameters, and we learn them on a development set of samples. We experimented with learning a single d parameter shared by all CPTs, and with learning multiple d parameters, one for every type of conditioning context in every CPT, i.e., each CPT has as many d parameters as there are back-off levels.
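The following is a minimal sketch of the interpolated Witten-Bell recursion just described, assuming counts for every back-off level are stored in shared dictionaries keyed by context tuples and a single shared d is used (as in the simplest setting above); all names are hypothetical.

```python
def witten_bell(z, context, counts, context_counts, distinct, vocab_size, d=1.0):
    """Interpolated Witten-Bell estimate P_WB(z | context).

    counts[(z, context)]    -- joint count of z with this context tuple
    context_counts[context] -- count of the context tuple itself
    distinct[context]       -- number of distinct z values seen with this context
    The recursion drops the last context variable at each back-off level
    (contexts are assumed pre-ordered by the back-off rule in the text) and
    bottoms out in a uniform distribution over the vocab_size values of Z.
    """
    if not context:                               # no conditioning left: uniform
        return 1.0 / vocab_size
    c = context_counts.get(context, 0)
    lam = c / (c + d * distinct[context]) if c > 0 else 0.0
    rel_freq = counts.get((z, context), 0) / c if c > 0 else 0.0
    backoff = witten_bell(z, context[:-1], counts, context_counts,
                          distinct, vocab_size, d)
    return lam * rel_freq + (1.0 - lam) * backoff
```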
We place some restrictions on the Bayesian Networks learned, for closer correspondence with the discriminative models and for tractability: every input variable node has the label node as a parent, and at most three parents per variable are allowed.

Our structure search method differs slightly from previously proposed methods in the literature (Heckerman, 1999; Pernkopf and Bilmes, 2005). The search space is defined as follows. We start with a Bayesian Network containing only the class variable. We denote by CHOSEN the set of variables already in the network and by REMAINING the set of unplaced variables. Initially, only the class variable Y is in CHOSEN and all other variables are in REMAINING. Starting from the current BN, the set of next candidate structures is defined as follows: for every unplaced variable R in REMAINING, and for every subset Sub of size at most two from the already placed variables in CHOSEN, consider adding R with parents Sub ∪ {Y} to the current BN. Thus the number of candidate structures for extending a current BN is on the order of m^3, where m is the number of variables.

We perform a greedy search. At each step, if the best variable B with the best set of parents Pa(B) improves the evaluation criterion, we move B from REMAINING to CHOSEN, and we continue the search until there are no variables left in REMAINING or the evaluation criterion cannot be improved.

The evaluation criterion for BNs we use is classification accuracy on a development set of samples. Thus our structure search method is discriminative, in the terminology of Grossman and Domingos (2004) and Pernkopf and Bilmes (2005). It is very easy to evaluate candidate BN structures: the main parameters in the CPTs are estimated via the relative frequency estimator on the training set, as discussed in the previous section. We do not fit the hyper-parameters d during structure search; we fit these parameters only after we have selected a final BN structure. Throughout the structure search, we use a fixed value of 1 for d for all CPTs and levels of back-off. Therefore we are using generative parameter estimation and discriminative structure search. See Section 4 for discussion of how this method relates to previous work. Notice that the optimal parameters of the conditional probability tables of variables already in the current BN do not change at all when a new variable is added, which makes the update very efficient. A compact sketch of this greedy search is given at the end of this subsection.

After the stopping criterion is met, the hyper-parameters of the resulting BN are fit on the development set. As discussed in the previous subsection, we fit either a single hyper-parameter d or multiple hyper-parameters d. The fitting criterion for the generative Bayesian Networks is the joint likelihood of the development set of samples with a Gaussian prior on the values log(d). Additionally, we explore fitting the hyper-parameters of the Bayesian Networks by optimizing the conditional likelihood of the development set of samples. In this case we call the resulting models Hybrid Bayesian Network models, since they incorporate a number of discriminatively trained parameters. Hybrid models have been proposed before and have been shown to perform very competitively (Raina et al., 2004; Och and Ney, 2002). In Section 3.2 we compare generative and hybrid Bayesian Networks.
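Here is the compact sketch of the greedy, accuracy-driven structure search referred to above (an illustration, not the authors' implementation); the evaluate callback, which trains the candidate BN by relative frequency with d = 1 and returns development-set accuracy, is a hypothetical placeholder.

```python
from itertools import combinations

def greedy_structure_search(variables, evaluate):
    """Greedy BN structure search with a discriminative evaluation criterion.

    structure maps each placed input variable to its extra parents (the class
    variable Y is always an implicit parent); evaluate(structure) returns
    development-set classification accuracy of the corresponding BN.
    """
    chosen, remaining = [], list(variables)
    structure = {}                          # start from the class-only network
    best_score = evaluate(structure)

    while remaining:
        best = None
        for r in remaining:
            # parent subsets of size at most two from already placed variables
            for k in (0, 1, 2):
                for sub in combinations(chosen, k):
                    score = evaluate({**structure, r: sub})
                    if best is None or score > best[0]:
                        best = (score, r, sub)
        if best is None or best[0] <= best_score:
            break                           # no candidate improves dev accuracy
        best_score, r, sub = best
        structure[r] = sub
        chosen.append(r)
        remaining.remove(r)
    return structure, best_score
```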
2.2 Discriminative Models

Discriminative models learn a conditional distribution P_θ(Y | \vec{X}) or discriminant functions that discriminate between classes. Here we concentrate on conditional log-linear models. A simple example of such a model is logistic regression, which directly corresponds to Naive Bayes but is trained to maximize the conditional likelihood.

To describe the form of the models we study, let us introduce some notation. We represent a tuple of nominal variables (X_1, X_2, ..., X_m) as a vector of 0s and 1s in the following standard way: we map the tuple of values of the nominal variables to a vector space whose dimensionality is the sum of the numbers of possible values of all variables. There is a single dimension in the vector space for every value of each input variable X_i. The tuple (X_1, X_2, ..., X_m) is mapped to a vector which has 1s in m places, namely the dimensions corresponding to the values of each variable X_i. We denote this mapping by Φ.

In logistic regression, the probability of a label y given an input \vec{x} is estimated as

  P(y \mid \vec{x}) = \frac{\exp(\langle \vec{w}_y, \Phi(\vec{x}) \rangle)}{\sum_{y'} \exp(\langle \vec{w}_{y'}, \Phi(\vec{x}) \rangle)}.

There is a parameter vector of feature weights \vec{w}_y for each label y. (Having a weight vector for every label is redundant up to one constraint on the weights, but it can be shown that this does not increase the representational power of the model.) We fit the parameters of the log-linear model by maximizing the conditional likelihood of the training set, including a Gaussian prior on the parameters. The prior has mean 0 and variance σ². The variance is a hyper-parameter, which we optimize on a development set.

In addition to this simple logistic regression model, as for the generative models, we consider models with much richer structure. We consider more complex mappings Φ, which incorporate conjunctions of combinations of input variables. We restrict the number of variables in the combinations to three, which directly corresponds to our limit on the number of parents in the Bayesian Network structures. This is similar to considering polynomial kernels of up to degree three, but is more general because, for example, we can add only some and not all bigram conjunctions of variables. Structure search (or feature selection) for log-linear models has been done before, e.g., by Della Pietra et al. (1997) and McCallum (2003).

We devise our structure search methodology so that it corresponds as closely as possible to our structure search for Bayesian Networks. The exact hypothesis space considered is defined by the search procedure for an optimal structure, which we describe next.

We start with an initial empty feature set and a candidate feature set consisting of all input features: CANDIDATES = {X_1, X_2, ..., X_m}. In the course of the search, the set CANDIDATES may contain feature conjunctions in addition to the initial input features. After a feature is selected from the candidate set and added to the model, it is removed from CANDIDATES and all conjunctions of that feature with all input features are added to CANDIDATES. For example, if a feature conjunction <X_i1, X_i2, ..., X_in> is selected, all of its expansions of the form <X_i1, X_i2, ..., X_in, X_i>, where X_i is not already in the conjunction, are added to CANDIDATES.

We perform a greedy search: at each step we select the feature which maximizes the evaluation criterion, add it to the model, and extend the set CANDIDATES as described above. The evaluation criterion for selecting features is classification accuracy on a development set of samples, as for the Bayesian Network structure search.
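Below is a minimal sketch of this greedy feature-conjunction selection loop (illustrative only); the evaluate callback, which retrains the log-linear model on the given feature set and returns development-set accuracy, is a hypothetical placeholder.

```python
def greedy_feature_selection(input_features, evaluate, max_order=3):
    """Greedy selection of features and conjunctions for the log-linear model.

    Features are tuples of input-variable names; evaluate(features) retrains
    the model (re-estimating all weights) and returns dev-set accuracy.
    """
    candidates = {(f,) for f in input_features}     # CANDIDATES: unigram features
    model = set()                                   # selected features/conjunctions
    best_score = evaluate(model)

    while candidates:
        scored = [(evaluate(model | {c}), c) for c in candidates]
        score, best = max(scored, key=lambda t: t[0])
        if score <= best_score:
            break                                   # no candidate improves accuracy
        best_score = score
        model.add(best)
        candidates.remove(best)
        if len(best) < max_order:                   # conjunctions of at most 3 variables
            candidates |= {best + (f,) for f in input_features if f not in best}
    return model, best_score
```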
At each step, we evaluate all candidate features. This is computationally expensive, because it requires iterative re-estimation: in addition to estimating weights for the new features, we re-estimate the old parameters, since their optimal values change. We did not perform search for the hyper-parameter σ when evaluating models; we fit σ by optimizing the development set accuracy after a model was selected. Note that our feature selection algorithm adds an input variable or a variable conjunction with all of its possible values in a single step of the search. Therefore we are adding hundreds or thousands of binary features at each step, as opposed to only one as in Della Pietra et al. (1997). This is why we can afford to perform complete re-estimation of the parameters of the model at each step.
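To make the scale of a single selection step concrete, the short sketch below (hypothetical value inventories, not from the paper's data) enumerates the binary indicator dimensions that one selected variable or conjunction contributes to the mapping Φ.

```python
from itertools import product

def expand_feature(values_by_variable, feature):
    """All binary indicator dimensions contributed by one selected feature:
    one 0/1 dimension per combination of observed values of its variables."""
    value_lists = [values_by_variable[v] for v in feature]
    return [(feature, combo) for combo in product(*value_lists)]

# A conjunction of two variables with 40 and 50 observed values adds
# 2000 binary features to the model in a single search step.
values = {"word": [f"w{i}" for i in range(40)], "tag": [f"t{i}" for i in range(50)]}
assert len(expand_feature(values, ("word", "tag"))) == 2000
```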