<?xml version="1.0" standalone="yes"?> <Paper uid="J99-2002"> <Title>Decomposable Modeling in Natural Language Processing</Title> <Section position="3" start_page="0" end_page="196" type="metho"> <SectionTitle> 2. Probabilistic Modeling </SectionTitle>
<Paragraph position="0"> We will use word sense disambiguation of the word interest as a concrete example in this section. For simplicity, we will use only two contextual features, the part of speech of the word to the left and the part of speech of the word to the right. Assume that there are 8 senses of interest and 20 part-of-speech tags. We will map the features to feature variables and the sense tag to the classification variable, yielding a discrete, finite random vector X = (FV1, ..., FVw, CV) (where w here is 2).</Paragraph>
<Paragraph position="1"> Suppose that there are N occurrences of interest in the training sample. The training sample is viewed as being composed of a set of N independent and identical trials drawn from a three-variable population distribution. The outcome of each trial is a particular combination of the values of the three variables, i.e., one of the 3,200 (8 x 20 x 20) possible configurations of the variables in X. Let f_i be the frequency and p_i the probability of the i-th configuration of the variables in X. Then (f_1, ..., f_3200) has a multinomial distribution with parameters (N, p_1, ..., p_3200), where N is fixed. The parameters p_1, ..., p_3200 define the joint probability distribution of the variables in X.</Paragraph>
<Paragraph position="2"> These parameters could be estimated directly from counts in the training data; that is, we could use the unrestricted maximum-likelihood estimate of p_i (Mood, Graybill, and Boes 1974): p̂_i = f_i / N.</Paragraph>
<Paragraph position="3"> If there is not enough training data for the estimation task at hand, then there are many configurations of the variables that seldom or never occur in the training data.</Paragraph>
<Paragraph position="4"> For these, the unrestricted maximum-likelihood estimates are unreliable.</Paragraph>
<Paragraph position="5"> An alternative is to hypothesize conditional independence assumptions of the form: variables FVi and FVj are conditionally independent of one another, given the values of the remaining variables. Then, we need only count the configurations of the sets of variables that are still interdependent.</Paragraph>
<Paragraph position="6"> A simple example will show why (Whittaker 1990). Recall that X is a vector of three variables, X = (FV1, FV2, CV). There are 3,200 parameters to be estimated, namely the probabilities P(fv1, fv2, cv), one for each configuration of the variables (where fv1, fv2, and cv represent particular values of FV1, FV2, and CV, respectively).</Paragraph>
<Paragraph position="7"> The joint distribution can be expressed as follows, according to a basic axiom of probability theory: P(fv1, fv2, cv) = P(fv1 | fv2, cv) P(fv2 | cv) P(cv) (1)</Paragraph>
<Paragraph position="8"> If FV1 and FV2 are conditionally independent given the value of CV, the joint distribution can be expressed as follows: P(fv1, fv2, cv) = P(fv1, cv) P(fv2, cv) / P(cv) (2)</Paragraph>
<Paragraph position="9"> The parameters of the model expressed in (2) are the terms on the right-hand side of the equation. They describe the marginal distributions of just the interdependent variables. Thus, we see that the conditional-independence constraint allows us to express the joint distribution in terms of these smaller marginal distributions.</Paragraph>
<Paragraph position="10"> The marginal distribution of FV1 and CV is the full joint distribution "collapsed" over FV2. For example, the estimate for P(FV1 = 0, CV = 0) is the sum of the relative frequencies of (FV1 = 0, CV = 0, FV2 = 0), (FV1 = 0, CV = 0, FV2 = 1), ..., (FV1 = 0, CV = 0, FV2 = 19), i.e., the relative frequency of configurations for which FV1 = 0 and CV = 0, whatever the value of FV2. Maximum-likelihood estimates of the parameters of marginal distributions are more reliable than those of the full joint distribution, because, in a given sample of training data, the frequency of each combination of values of the variables in a marginal distribution is always as large as or larger than the frequency of each combination of the values of the variables in the full distribution.</Paragraph>
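To make the estimation concrete, here is a minimal Python sketch (ours, not from the paper) that computes both the unrestricted maximum-likelihood estimates and the factored estimates of equation (2) from a small invented sample; the tag and sense values are hypothetical.

```python
from collections import Counter

# Hypothetical tagged training sample: (fv1, fv2, cv) triples, where fv1/fv2
# are part-of-speech tags of the neighboring words and cv is the sense tag.
sample = [("DT", "NN", 1), ("DT", "NN", 1), ("JJ", "IN", 2),
          ("DT", "IN", 1), ("JJ", "NN", 2), ("DT", "NN", 2)]
N = len(sample)

# Unrestricted maximum-likelihood estimate: p_i = f_i / N for each full
# configuration (fv1, fv2, cv).
joint = Counter(sample)
p_joint = {cfg: f / N for cfg, f in joint.items()}

# Marginal counts, obtained by "collapsing" the joint over the omitted variable.
fv1_cv = Counter((fv1, cv) for fv1, fv2, cv in sample)
fv2_cv = Counter((fv2, cv) for fv1, fv2, cv in sample)
cv_only = Counter(cv for fv1, fv2, cv in sample)

def p_factored(fv1, fv2, cv):
    """Estimate under the model in which FV1 and FV2 are conditionally
    independent given CV, as in equation (2): P(fv1,cv) * P(fv2,cv) / P(cv)."""
    if cv_only[cv] == 0:
        return 0.0
    return (fv1_cv[(fv1, cv)] / N) * (fv2_cv[(fv2, cv)] / N) / (cv_only[cv] / N)

for cfg in sorted(p_joint):
    print(cfg, "unrestricted:", round(p_joint[cfg], 3),
          "factored:", round(p_factored(*cfg), 3))
```

Note that the factored estimate is defined even for configurations that never occur in the sample, so long as their marginal configurations do.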
<Paragraph position="14"> There are many possible sets of noninteraction assumptions that could be made regarding a set of variables. The various possibilities can be formalized as probability models. A probability model (more specifically, its parametric form) expresses the relationships among the variables of the model, and specifies a family of distributions--all distributions in which those relationships hold. For example, the model in which FV1 is conditionally independent of FV2 given the value of CV is the family of all distributions for vector X in which this constraint holds. The differences among the members of this family result from differences in the values of the parameters.</Paragraph>
<Paragraph position="15"> A probabilistic model (a parametric form complete with parameter estimates) forms the basis of a probabilistic classifier. The classifier assigns to each ambiguous object the category or tag that has the highest probability of occurring, given the observed values of the feature variables: argmax_cv P(cv | fv1, ..., fvw) = argmax_cv P(fv1, ..., fvw, cv) / P(fv1, ..., fvw)</Paragraph>
<Paragraph position="17"> Since the denominator is the same for all classes, the numerator, i.e., the joint distribution defined by the model, determines which class is assigned.</Paragraph> </Section>
<Section position="4" start_page="196" end_page="199" type="metho"> <SectionTitle> 3. The Class of Models </SectionTitle>
<Paragraph position="0"> Recall that the parametric form of a model expresses a set of noninteraction assumptions regarding the relationships among the variables. Different model classes allow for different types of noninteraction assumptions.</Paragraph>
<Paragraph position="1"> The class of log-linear models is the most widely used class of probability models for analyzing discrete data. It supports a wide range of noninteraction assumptions and the use of maximum-likelihood parameter estimates (Bishop, Fienberg, and Holland 1975). Graphical models are the subset of log-linear models in which the only kind of noninteraction is conditional independence (Whittaker 1990).</Paragraph>
<Paragraph position="2"> The interdependencies among the variables in a graphical model can be expressed graphically, in a dependency graph. A dependency graph is formed by mapping each variable in the model to a node in the graph and drawing an edge between the nodes corresponding to interdependent variables. All variables that are not directly connected are conditionally independent given the values of the variables mapping to the connecting nodes. Therefore, the maximal sets of interdependent variables correspond exactly to the cliques of the graph (where a clique is a maximal fully connected component).</Paragraph>
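The correspondence between interdependence and cliques can be illustrated with a short sketch using the networkx library (our choice of tool, not the paper's); the dependency graph below is hypothetical.

```python
import networkx as nx

# A hypothetical dependency graph over a classification variable CV and four
# feature variables. An edge means the two variables are interdependent.
g = nx.Graph()
g.add_edges_from([("CV", "FV1"), ("CV", "FV2"), ("CV", "FV3"),
                  ("FV1", "FV2"), ("CV", "FV4")])

# The maximal sets of interdependent variables are the cliques of the graph.
for clique in nx.find_cliques(g):
    print(sorted(clique))  # {CV,FV1,FV2}, {CV,FV3}, {CV,FV4}

# Variables with no direct edge (e.g., FV3 and FV4) are conditionally
# independent given the variables that connect them (here, CV).
```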
<Paragraph position="3"> As shown by Darroch, Lauritzen, and Speed (1980), each graphical model describes a Markov random field. The fundamental property of a Markov random field is that the conditional probability of a variable given the values of the others is the same as the conditional probability of that variable given only the values of the variables corresponding to adjacent nodes. Thus: P(Xi | X1, ..., Xi-1, Xi+1, ..., Xn) = P(Xi | Xk, ..., Xm) (3) where Xk through Xm are the adjacent variables. It is this property of conditional independence that was used to formulate equation (2) from (1).</Paragraph>
<Paragraph position="6"> The framework described in this paper uses decomposable models, a subclass of graphical models (Whittaker 1990; Darroch, Lauritzen, and Speed 1980), because they offer many computational advantages while retaining a great deal of expressive power.</Paragraph>
<Paragraph position="7"> There are a number of different ways to define the class of decomposable models, one of which is the following: the class of decomposable models is composed of all graphical models that have triangulated dependency graphs, i.e., all cycles of length >= 4 in the dependency graph contain a chord. A chord is an edge between nonadjacent nodes in the cycle.</Paragraph>
<Paragraph position="8"> Another definition of decomposable models is the following: they are those graphical models that express the joint distribution of a set of variables as the product of marginal distributions of those variables, where the new expression is a full factorization (Whittaker 1990) of the joint distribution. A product of marginal distributions is a full factorization of a joint distribution if the former is derived from the latter by factorization steps such as that between equations (1) and (2), and "an independence statement corresponding to every pair of non-adjacent vertices in the dependency graph of X is applied exactly once to factorize the joint distribution into the product of marginal distributions" (Whittaker 1990, 393).</Paragraph>
<Paragraph position="9"> Consider a set of five random variables, X = (A, B, C, D, E) (E, say, might be the classification variable, and the others the feature variables). We will consider the model in which: (CI1) A is conditionally independent of B given the values of {C, D, E}; (CI2) A is conditionally independent of C given the values of {D, E}; (CI3) B is conditionally independent of C given the values of {D, E}.</Paragraph>
<Paragraph position="10"> We derive a full factorization of the joint distribution by applying an independence statement corresponding to each of (CI1)-(CI3) in turn. (Such a factorization exists for any decomposable model, but the independence statements must be applied in an appropriate order to achieve the factorization; see Whittaker 1990.) As in equation (1), the joint distribution of the variables can be expressed as: P(a, b, c, d, e) = P(a | b, c, d, e) P(b | c, d, e) P(c | d, e) P(d | e) P(e) (4) The dependency graph of the model corresponding to this equation is shown in Figure 1(a). Applying (CI1) to (4), the following factorization can be performed, by the definition of conditional independence: P(a | b, c, d, e) = P(a | c, d, e) (5) The resulting model is: P(a, b, c, d, e) = P(a | c, d, e) P(b | c, d, e) P(c | d, e) P(d | e) P(e) (6) The dependency graph of the model containing (CI1) is shown in Figure 1(b).</Paragraph>
<Paragraph position="14"> Factorization (5) can be understood in terms of this dependency graph by noting that the neighbors of A in this graph are {C, D, E} (and not {B, C, D, E}).</Paragraph>
<Paragraph position="15"> Applying (CI2) to (6): P(a | c, d, e) = P(a | d, e) (7) The resulting model is: P(a, b, c, d, e) = P(a | d, e) P(b | c, d, e) P(c | d, e) P(d | e) P(e) (8) The dependency graph of the model containing (CI1)-(CI2) is shown in Figure 1(c). To see that (7) can be performed, note that the neighbors of A in Figure 1(c) are {E, D}, so that A is conditionally independent of {B, C} given the values of {E, D}, and it follows from a basic axiom of probability that A is conditionally independent of {C} given the values of {E, D}.</Paragraph>
<Paragraph position="18"> Finally, applying (CI3) to (8): P(b | c, d, e) = P(b | d, e) (9) The final model incorporating all factorizations is: P(a, b, c, d, e) = P(a | d, e) P(b | d, e) P(c | d, e) P(d | e) P(e) (10) The dependency graph of the model containing all three conditional independencies is shown in Figure 1(d).</Paragraph>
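As a numerical check on this derivation (ours, using numpy; all variables are taken as binary for brevity), the sketch below builds a joint distribution directly from the factorization in equation (10) and verifies that the conditional independence the model asserts, P(a | b, c, d, e) = P(a | d, e), holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(shape):
    """Random conditional probability table, normalized over the first axis."""
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

# Conditional tables for the final model, equation (10); binary variables.
p_e = random_cpt((2,))          # P(e)
p_d_e = random_cpt((2, 2))      # P(d | e)
p_a_de = random_cpt((2, 2, 2))  # P(a | d, e)
p_b_de = random_cpt((2, 2, 2))  # P(b | d, e)
p_c_de = random_cpt((2, 2, 2))  # P(c | d, e)

# Joint built from the factorization in equation (10).
joint = np.einsum("ade,bde,cde,de,e->abcde", p_a_de, p_b_de, p_c_de, p_d_e, p_e)
assert np.isclose(joint.sum(), 1.0)

# P(a | b, c, d, e) should not depend on b or c: it equals P(a | d, e).
cond = joint / joint.sum(axis=0, keepdims=True)  # P(a | b, c, d, e)
for b in range(2):
    for c in range(2):
        assert np.allclose(cond[:, b, c, :, :], p_a_de)
print("P(a | b, c, d, e) = P(a | d, e) holds, as the model asserts.")
```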
<Paragraph position="21"> Thus, a decomposable model expresses the joint distribution of a set of variables as a product of the marginal distributions of the maximal sets of interdependent variables (the cliques in the dependency graph) scaled by the marginal distributions of the variables common to two or more of these maximal sets. The fact that this kind of closed-form expression of the joint distribution exists provides one of the key advantages in using decomposable models. The parameters of the marginal distributions can be estimated directly from the counts in the data. The joint distribution is expressed in terms of these, as in equations (6), (8), and (10). Thus, we can estimate the parameters from the data without the need for an iterative fitting procedure (as used in NLP maximum entropy modeling [Berger, Della Pietra, and Della Pietra 1996]). This property is unique to decomposable models (Pearl 1988; Whittaker 1990).</Paragraph> </Section>
<Section position="5" start_page="199" end_page="201" type="metho"> <SectionTitle> 4. Model Selection </SectionTitle>
<Paragraph position="0"> We showed above how conditional independence assumptions can be used to simplify the expression of the joint distribution. Given a particular set of variables, there are often very many different conditional independence assumptions that could be made.</Paragraph>
<Paragraph position="1"> The generation and testing of different sets of assumptions can be computationally realized as a search through a space of probability models, in our case decomposable models. Removing an edge from the dependency graph of a decomposable model is equivalent to adding a conditional independency to the model. Another way to view the derivation of equations (6), (8), and (10) is as the process of beginning with the fully connected model, in which all variables are interdependent, and successively removing edges, corresponding to adding conditional independencies (CI1)-(CI3). This is how backward search is done. In forward search, we start with the fully disconnected model, in which all variables are independent, and successively add edges, corresponding to adding interdependencies. The space of decomposable models is very large, so greedy search is typically done. In a backward search, at each step, all edges in the current model are evaluated, and one is removed; in forward search, at each step, all edges that could be added are evaluated, and one is added. (Note that decomposable models are not closed under the operations of adding and deleting edges, so a test must be performed to assure that all the models considered are decomposable, as in the sketch below.)</Paragraph>
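A minimal sketch of one greedy forward-search step, assuming networkx for the graph operations. The scoring function is a placeholder for a real edge-evaluation measure (such as the G2 test discussed next), and the variable names are hypothetical.

```python
import itertools
import networkx as nx

def is_decomposable(g):
    """A graphical model is decomposable iff its dependency graph is
    triangulated (chordal); check each connected component."""
    return all(nx.is_chordal(g.subgraph(c).copy())
               for c in nx.connected_components(g))

def forward_candidates(g):
    """Models reachable by adding one edge, keeping only decomposable ones
    (the class is not closed under edge addition, so each must be tested)."""
    for u, v in itertools.combinations(sorted(g.nodes), 2):
        if not g.has_edge(u, v):
            h = g.copy()
            h.add_edge(u, v)
            if is_decomposable(h):
                yield (u, v), h

# Forward search starts from the fully disconnected model.
g = nx.Graph()
g.add_nodes_from(["CV", "FV1", "FV2", "FV3"])

def score(edge, model):
    return 0.0  # placeholder for a goodness-of-fit statistic such as G2

# One greedy step: evaluate every admissible edge and keep the best one.
best_edge, g = max(forward_candidates(g), key=lambda pair: score(*pair))
print("added edge:", best_edge)

# A chordless 4-cycle is the canonical non-decomposable graphical model:
cycle = nx.cycle_graph(["A", "B", "C", "D"])
print(nx.is_chordal(cycle))  # False: the 4-cycle has no chord
cycle.add_edge("A", "C")     # adding a chord triangulates the graph
print(nx.is_chordal(cycle))  # True
```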
<Paragraph position="0"> As in decision tree induction, feature selection is also performed as a result of model search (Pedersen, Bruce, and Wiebe 1997). If a feature is not connected to the classification variable in a model, then that feature cannot affect which class is assigned by a classifier based on that model.</Paragraph>
<Paragraph position="2"> The goal of the search process is to find a model with the fewest interdependencies that fits the data well. The fit of the model is how closely the counts observed in a training sample correspond to those that would be expected if the model being tested were the true population model. This is measured using a goodness-of-fit statistic.</Paragraph>
<Paragraph position="3"> Read and Cressie (1988) have shown that most measures used to evaluate model fit are instances of the power divergence statistic, where different measures are generated by changing a single parameter. These include Pearson's X2; the Kullback-Leibler information divergence D, which is also known as cross entropy; and the log-likelihood ratio statistic, G2. The two most commonly used measures in NLP, D and G2, are trivially expressed in terms of each other. In the general case, D is used to evaluate the difference between any two density functions g_Y and f_Y for the same random vector Y. When D is used to evaluate model fit, g_Y is the distribution of Y in the data sample, f_Y is the distribution of Y predicted by the model, and G2 = 2N x D(g_Y; f_Y).</Paragraph>
<Paragraph position="4"> In the model search described above, models are modified an edge at a time. In evaluating an edge, we are testing the model of conditional independence between the two variables connected by that edge. The information divergence applied in this case is the same as conditional mutual information, another widely used measure in NLP.</Paragraph>
<Paragraph position="5"> Using decomposable models affords an important advantage in assessing model fit: the test for conditional independence of two nodes as described above is simplified. Rather than assessing the conditional independence of the two nodes conditioned on all of the other variables, we need only consider the other nodes in the same clique in the dependency graph.</Paragraph>
<Paragraph position="6"> In general, a goodness-of-fit statistic can be thought of as a cost function, where a lower value represents a better model fit. Model selection can be based directly on the value of a goodness-of-fit statistic, or it can be based on a cost function that combines a goodness-of-fit statistic with a penalty for model complexity, such as the Akaike information criterion (AIC) (Akaike 1974) or the Bayesian information criterion (BIC) (Schwarz 1978).</Paragraph>
<Paragraph position="7"> The final model selected can be based on a predefined cutoff value. In the case of measures such as AIC and BIC, a cutoff on the value of the measure itself can be defined. In the case of statistics such as G2, the appropriate cutoff is a predetermined threshold defining statistical significance. Alternatively, all the models generated during search can be considered, and the one with the highest accuracy on a held-out portion of the training data can be selected as the final model (Kayaalp, Pedersen, and Bruce 1997; Wiebe, Bruce, and Duan 1997). (One could also consider applying this kind of accuracy test to evaluate each edge, replacing the goodness-of-fit or cost metric; however, this would be more computationally expensive, and would not directly measure conditional independence.)</Paragraph>
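For concreteness, a short sketch (ours, with invented counts) computing G2, Pearson's X2, and D, and confirming the identity G2 = 2N x D(g_Y; f_Y) stated above.

```python
import numpy as np

# Hypothetical observed counts over the configurations of the variables, and
# the counts expected under the model being tested (same total N).
observed = np.array([30.0, 10.0, 25.0, 35.0])
expected = np.array([28.0, 14.0, 22.0, 36.0])
N = observed.sum()

# Log-likelihood ratio statistic G2 (zero observed cells contribute zero).
mask = observed > 0
G2 = 2.0 * np.sum(observed[mask] * np.log(observed[mask] / expected[mask]))

# Pearson's X2: another instance of the power divergence family.
X2 = np.sum((observed - expected) ** 2 / expected)

# Kullback-Leibler divergence D between the sample distribution g and the
# model distribution f; the text states G2 = 2N x D.
g = observed / N
f = expected / N
D = np.sum(g[mask] * np.log(g[mask] / f[mask]))

print(f"G2 = {G2:.4f}, X2 = {X2:.4f}, 2N*D = {2 * N * D:.4f}")
```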
<Paragraph position="8"> The freely available software package CoCo performs forward and backward search using all of the measures described above (Badsberg 1995); CoCo is available at http://web.math.auc.dk/jhb/CoCo/others.html. Pedersen, Bruce, and Wiebe (1997) present the results of experiments covarying these measures and the direction of search. In addition to these methods, Buntine (1996) describes other search strategies and measures, such as minimum description length, that can be used for model selection.</Paragraph>
<Paragraph position="9"> There are a number of other ways to utilize the results of a model search procedure that are extensions to the basic framework. In model switching (Kayaalp, Pedersen, and Bruce 1997) and the naive mix (Pedersen and Bruce 1997), more than one of the models generated during search is used to perform classification. In Boutilier et al. (1996), context-sensitive models are formulated. These models include independencies that hold only in certain contexts, that is, they hold only given specific assignments of values to variables.</Paragraph> </Section>
<Section position="6" start_page="201" end_page="201" type="metho"> <SectionTitle> 5. Diagnostic Analysis </SectionTitle>
<Paragraph position="0"> As seen above, the model selection framework provides many choices. Because the approach is a formal approach to probabilistic modeling, we can analyze the quality of the three determinants of classifier performance: the features, the form of the model, and the parameter estimates. In the paragraphs below, we describe how to isolate the contribution that each of them makes to classification error. This analysis can provide insight into which choices are most appropriate for a particular data set.</Paragraph>
<Paragraph position="1"> Features. For diagnostic purposes, it is revealing to train and test the model on the same data. First, consider training and testing the fully connected model on the same data. Since the fully connected model contains no conditional independence assumptions, and the model parameters are not estimated on a separate training set, the model describes the exact joint distribution of the data. Because of this, classification errors can only be due to a lack of discriminatory power of the features. That is, there must be combinations of feature values that occur with more than one class.</Paragraph>
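A sketch of this diagnostic (ours, with invented data): training and testing the fully connected model on the same data amounts to assigning each feature configuration its most frequent class, so the resulting accuracy is an upper bound determined by the features alone.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (feature tuple, class) pairs.
data = [(("DT", "NN"), 1), (("DT", "NN"), 1), (("DT", "NN"), 2),
        (("JJ", "IN"), 2), (("JJ", "IN"), 2), (("DT", "IN"), 1)]

# The fully connected model trained and tested on the same data assigns each
# feature configuration its most frequent class; errors occur exactly where a
# configuration appears with more than one class.
by_config = defaultdict(Counter)
for features, cls in data:
    by_config[features][cls] += 1

correct = sum(counts.most_common(1)[0][1] for counts in by_config.values())
print("feature ceiling accuracy:", correct / len(data))  # 5/6 here
```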
<Paragraph position="2"> Form of the model. Consider training and testing other models on the same data. As for the fully connected model, the parameter estimates are optimal for that data. However, we have added approximations to the model in the form of conditional independence assumptions. Thus, for the same data and feature set, variations in the performance of different models are due only to the different conditional independence assumptions made in those models.</Paragraph>
<Paragraph position="3"> Parameter estimates. Consider a comparison in which the features, test set, and model form are fixed, but in one case, the parameters are estimated on a separate training set, and in the other case, the parameters are estimated from the test set, as above. Differences in the performance of two such models can only be due to the parameter estimates.</Paragraph>
<Paragraph position="4"> As more conditional independence assumptions are made, the parameter estimates become more reliable, in the sense that they are based on the same or greater frequencies (see Section 2). Even so, if important interdependencies are removed from the form of the model, model performance may actually degrade. Thus, by evaluating the contribution that each of the above factors makes to model performance, we can assess how well the model search procedure is balancing model expressiveness against the reliability of the parameter estimates.</Paragraph> </Section>
<Section position="7" start_page="201" end_page="203" type="metho"> <SectionTitle> 6. Shortcomings </SectionTitle>
<Paragraph position="0"> We have described a very general and expressive framework, but of course there are some shortcomings. The approach is a supervised learning approach, and therefore requires manually tagged training data. In fact, to take full advantage of high-complexity models, a large amount of data may be required. However, by generating models of varying complexity, the model search procedure can adjust the complexity of the final model to the amount of data that is available.</Paragraph>
<Paragraph position="1"> Another point of concern is the computational complexity of the search procedure. Because it is greedy, the search procedure itself is not inefficient: the number of edges evaluated during the search is polynomial in the number of variables. However, the measures used to evaluate edges during the search are expensive to compute. Section 4 mentions a number of these measures, which can all be expressed as a function of G2. The complexity of calculating G2 is a function of the number of configurations of the variables, which is exponential in the number of variables. Therefore, the worst-case time complexity of any search procedure that uses a function of G2 is exponential in the number of variables. In practice, the method is feasible for a reasonable number of variables (certainly on the order of 100 in the final model), and, once the model is developed during training, the process does not need to be repeated.</Paragraph>
<Paragraph position="2"> 7. Relationships to Other Classes of Models It is common in NLP to simply assume a particular model form rather than searching for one that is appropriate for the data. Two kinds of statistical models widely used in NLP are the n-gram and naive Bayes models. These models are decomposable models.</Paragraph>
<Paragraph position="3"> In an n-gram model the variables are the class assigned to the current object and the classes assigned to the previous N - 1 objects, and there are edges between all pairs of variables. A naive Bayes model includes edges between the classification variable and each feature variable (and contains no other edges). Because n-gram and naive Bayes models are decomposable, they are possible candidates during model selection. However, they would be selected only if they appear to be the most appropriate models for the particular data.</Paragraph>
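The clique structure of naive Bayes makes its closed-form estimation easy to see. The sketch below (ours, with invented data) estimates the clique marginals P(fvi, cv) and the separator marginal P(cv) from counts and classifies by the resulting joint; no iterative fitting is involved.

```python
from collections import Counter

# Hypothetical training data: (feature tuple, class) pairs.
data = [(("DT", "NN"), 1), (("JJ", "NN"), 2), (("DT", "IN"), 1),
        (("JJ", "IN"), 2), (("DT", "NN"), 1)]
N = len(data)
w = 2  # number of feature variables

# Naive Bayes as a decomposable model: the cliques are {CV, FVi}, so the
# joint is the product of the clique marginals P(fvi, cv), scaled by the
# separator marginal P(cv) raised to the power w - 1.
clique = [Counter() for _ in range(w)]
cv_counts = Counter()
for features, cls in data:
    cv_counts[cls] += 1
    for i, f in enumerate(features):
        clique[i][(f, cls)] += 1

def joint(features, cls):
    p = 1.0
    for i, f in enumerate(features):
        p *= clique[i][(f, cls)] / N            # clique marginals P(fvi, cv)
    return p / (cv_counts[cls] / N) ** (w - 1)  # scale by separator P(cv)

# The class with the highest joint probability is assigned; the denominator
# P(features) is constant across classes and can be ignored.
test = ("DT", "NN")
print(max(cv_counts, key=lambda cls: joint(test, cls)))
```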
<Paragraph position="4"> In maximum entropy modeling as applied to NLP (Berger, Della Pietra, and Della Pietra 1996; Ratnaparkhi 1997), feature selection and model search are typically combined, but the procedure differs from that described here. It is important to note that decomposable models are a subset of maximum entropy models. Even so, no effort is made to select for decomposable models (and take advantage of their benefits), or to demonstrate the need for a broader class of models.</Paragraph>
<Paragraph position="5"> Bayesian networks are extensively used in artificial intelligence. They are popular because of their graphical representations and because there are probability propagation algorithms for computing the joint and conditional distributions of the variables. Decomposable models can be represented as Bayesian networks. In fact, in the widely used probability propagation algorithm described by Lauritzen and Spiegelhalter (1988) and Pearl (1988), a Bayesian network is ultimately transformed into a decomposable model, to take advantage of the computational benefits of that class of models (see the triangulation step described in Pearl [1988]).</Paragraph>
<Paragraph position="6"> Although decision trees are not formal probability models, there are similarities between decision tree induction (Breiman et al. 1984) and the model selection framework presented here. Both search procedures perform feature selection and reduce the interdependencies between features to avoid overfitting the data. For a further discussion of the relationships between graphical models and decision trees, see Buntine and Roy (1995).</Paragraph> </Section>
<Section position="8" start_page="203" end_page="204" type="metho"> <SectionTitle> 8. Word Sense Disambiguation Results </SectionTitle>
<Paragraph position="0"> In a recent collection of experiments, we applied the basic method to word sense disambiguation of 34 words from the HECTOR corpus (Atkins 1993; Hanks 1996). The words were not chosen by the authors, but were randomly selected from a set of 38 words included in the training set for the SENSEVAL evaluation project (Kilgarriff 1998). The data set for each word consisted of all sentences containing that word in the corpus. The results are presented in Figure 2. Tenfold cross-validation was performed for each word, for a total of 340 experiments. On each fold, a forward search with G2 as the goodness-of-fit test was performed. In addition, we ensured that naive Bayes was included as a competitor in each fold. For each fold, evaluation on a single held-out portion of the training data was performed to choose the final model, as sketched below.</Paragraph>
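A schematic of this evaluation protocol (ours; the model interface is hypothetical, since the experiments used the CoCo package):

```python
import random

def cross_validate(instances, search, n_folds=10, seed=0):
    """Sketch of the protocol described above: on each fold, a model search
    is run on the training portion, a held-out slice of the training data
    picks the final model, and that model is scored on the test fold.
    `search` is assumed to return a list of fitted candidate models, each
    with an accuracy(data) method (hypothetical interface)."""
    data = instances[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        cut = max(1, len(train) // 10)
        held_out, fit_part = train[:cut], train[cut:]
        candidates = search(fit_part)  # e.g., greedy forward search with G2
        final = max(candidates, key=lambda m: m.accuracy(held_out))
        scores.append(final.accuracy(test))
    return sum(scores) / n_folds
```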
The same types of features were used in each model: the part-of-speech tags one place to the left and right of the ambiguous word; the part-of-speech tags two places to the left and right of the word; the part-of-speech tag of the word; and a collocation variable for each sense of the word whose representation is per-class-binary as presented in Wiebe, Bruce, and Duan (1997). 7 Naive Bayes has been shown to be competitive with state-of-the-art classifiers, and has proven remarkably successful on many AI and NLP applications (see, for example, Leacock, Towell, and Voorhees \[1993\]; Friedman, Geiger, and Goldszmidt \[1997\]; Mooney \[1996\]; Langley, Iba, and Thompson \[1992\]). As can be seen by comparing columns 5 and 6, the model selection procedure achieves an overall average accuracy that is 1.4 percentage points higher than exclusively using the naive Bayes classifier.</Paragraph> <Paragraph position="1"> Evaluating the results on a per-word basis more clearly shows the benefits of performing model selection in these experiments. There are more words for which model selection is better than there are words for which model selection is worse. Further, we assessed the statistical significance of the differences in accuracy presented, in Figure 2 between the two methods for the individual words, using a paired t-test (described in Cohen \[1995\]) with 0.05 as the significance level. For six of the words, the model selection performance is significantly better than the performance of exclusively using naive Bayes. Further, the model selection procedure is not significantly worse than naive Bayes for any of the words.</Paragraph> <Paragraph position="2"> In addition, on average, the set of words for which model selection is superior are more difficult than the ones on which naive Bayes is superior: for the former set, the average number of senses is 10 and the average entropy is 2.2; for the latter set, the average number of senses is 7 and the average entropy is 1.7 (see columns 2 and 4 in Figure 2). We can also see that, on average, there is less annotated data available for the words on which model selection does better, a total of 524 tagged instances (see column 3 in Figure 2), than for those on which it does worse, a total of 645 tagged instances. 8 This supports the idea that model selection tailors the complexity of the model to the amount of data that is available.</Paragraph> <Paragraph position="3"> As shown in column 8, models are generated during model search that provide high accuracy. In fact, the accuracy of the best model generated is consistently higher than that of both naive Bayes and the final model actually selected during the search.</Paragraph> <Paragraph position="4"> This illustrates that there are further potential gains to be exploited by investigating alternative methods for selecting the best model for each fold.</Paragraph> <Paragraph position="5"> 7 The variable for each sense S is binary, corresponding to the absence or presence of any word in a set specifically chosen for S. A word W is chosen for S if P(S I W) > 0.5. 8 In 10-fold cross validation, 90% of the data is used for training on each fold.</Paragraph> </Section> class="xml-element"></Paper>