<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1002">
  <Title>Conditional Structure versus Conditional Estimation in NLP Models</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Objective Functions: Naive-Bayes
</SectionTitle>
    <Paragraph position="0"> For bag-of-words WSD, we have a corpus D of labeled examples .s; o/. Each o = hoii is a list of context words, and the corresponding s is the correct sense of a fixed target word occuring in that context.</Paragraph>
    <Paragraph position="1"> A particular model for this task is the familiar multi-Association for Computational Linguistics.</Paragraph>
    <Paragraph position="2"> Language Processing (EMNLP), Philadelphia, July 2002, pp. 9-16. Proceedings of the Conference on Empirical Methods in Natural nomial Naive-Bayes (NB) model (Gale et al., 1992; McCallum and Nigam, 1998), where we assume conditional independence between each of the oi . This NB model gives a joint distribution over the s and hoii variables:</Paragraph>
    <Paragraph position="4"> It also implicitly makes conditional predictions:</Paragraph>
    <Paragraph position="6"> In NLP, NB models are typically used in this latter way to make conditional decisions, such as chosing the most likely word sense.1 The parameters 2 D h sI ojsi for this model are the sense priors P.s/ and the sense-conditional word distributions P.ojs/. These are typically set using (smoothed) relative frequency estimators (RFEs):</Paragraph>
    <Paragraph position="8"> These intuitive relative frequency estimators are the estimates for 2 which maximize the joint likelihood (JL) of D according to the NB model:</Paragraph>
    <Paragraph position="10"> A NB model which has been trained to maximize JL will be referred to as NB-JL. It is worth emphasizing that, in NLP applications, the model is typically trained jointly, then used for its P.sjo/ predictions.</Paragraph>
    <Paragraph position="11"> We can set the parameters in other ways, without changing our model. If we are doing classification, we may not care about JL. Rather, we will want to minimize whatever kinds of errors we get charged for. The JL objective is the evaluation criterion for language modeling, but a decision process' evaluation is more naturally phrased in terms of P.sjo/. If we want to maximize the probability assigned to the correct labeling of the corpus, the appropriate objective is conditional likelihood (CL):</Paragraph>
    <Paragraph position="13"> This focuses on the sense predictions, not the words, which is what we cared about in the first place.</Paragraph>
    <Paragraph position="14"> Figure 1 shows an example of the trade-offs between JL and CL. Assume there are two classes (1 and 2), two words (a and b), and only 2-word contexts. Assume the actual distribution (training and test) is 3 each of (1, ab) and (1, ba) and one (2, aa) 1A possible use for the joint predictions would be a topicconditional unigram language model.</Paragraph>
    <Paragraph position="16"> maximizing NB model has priors of 6/7 and 1/7, like the data. The actual (joint) distribution is not in the family of NB models, and so it cannot be learned perfectly. Still, the NB-JL assigns reasonable probabilities to all occurring events. However, its priors cause it to incorrectly predict that aa belongs to class 1. On the other hand, maximizing CL will push the prior for sense 1 arbitrarily close to zero. As a result, its conditional predictions become more accurate at the cost of its joint prediction. NB-CL joint prediction assigns vanishing mass to events other than (2, aa), and so its joint likelihood score gets arbitrarily bad.</Paragraph>
    <Paragraph position="17"> There are other objectives (or loss functions). In the SENSEVAL competition (Kilgarriff, 1998), we guess sense distributions, and our score is the sum of the masses assigned to the correct senses. This objective is the sum of conditional likelihoods (SCL):</Paragraph>
    <Paragraph position="19"> SCL is less appropriate that CL when the model is used as a step in a probabilistic process, rather than in isolation. CL is more appropriate for filter processes, because it highly punishes assigning zero or near-zero probabilities to observed outcomes.</Paragraph>
    <Paragraph position="20"> If we choose single senses and receive a score of either 1 or 0 on an instance, then we have 0/1-loss (Friedman, 1997). This gives the &amp;quot;number correct&amp;quot; and so we refer to the corresponding objective as ac-</Paragraph>
    <Paragraph position="22"> In the following experiments, we illustrate that, for a fixed model structure, it is advantageous to maximize objective functions which are similar to the evaluation criteria. Although in principle we can optimize any of the objectives above, in practice some are harder to optimize than others. As stated above, JL is trivial to maximize with a NB model. CL and SCL, since they are continuous in 2, can be optimized by gradient methods. Acc is not continuous in 2 and is unsuited to direct optimization (indeed, finding an optimum is NP-complete).</Paragraph>
    <Paragraph position="23"> When optimizing an arbitrary function of 2, we have to make sure that our probabilities remain wellformed. If we want to have a well-formed joint NB interpretation, we must have non-negative parameters and the inequalities 8s Po ojs 1 and Ps s 1.</Paragraph>
    <Paragraph position="24"> If we want to be guaranteed a non-deficient joint interpretation, we can require equality. However, if we relax the equality then we have a larger feasible space which may give better values of our objective.</Paragraph>
    <Paragraph position="25"> We performed the following WSD experiments with Naive-Bayes models. We took as data the collection of SENSEVAL-2 English lexical sample WSD corpora.2 We set the NB model parameters in several ways. We optimized JL (using the RFEs).3 We also optimized SCL and (the log of) CL, using a conjugate gradient (CG) method (Press et al., 1988).4 For CL and SCL, we optimized each objective both over the space of all distributions and over the subspace of non-deficient models (giving CL and SCL ). Acc was not directly optimized.</Paragraph>
    <Paragraph position="26"> Unconstrained CL corresponds exactly to a conditional maximum entropy model (Berger et al., 1996; Lafferty et al., 2001). This particular case, where there are multiple explanatory variables and a single categorical response variable, is also precisely the well-studied statistical model of (multinomial) logistic regression (Agresti, 1990). Its optimization problem is concave (over log parameters) and therefore has a unique global maximum. For CL , SCL, and SCL , we are only guaranteed local optima, but in practice we detected no maxima which were not  various estimates would be smoothed as similarly as possible, we smoothed implicitly, by adding smoothing data. We added one instance of each class occurring with the bag containing each vocabulary word once. This gave the same result as add-one smoothing on the RFEs for NB-JL, and ensured that NB-CL would not assign zero conditional probability to any unseen event. The smoothing data did not, however, result in smoothed estimates for SCL; any conditional probability will sum to one over the smoothing instances. For this objective, we added a penalty term proportional to P 2, which ensured that no conditional sense probabilities reached 0 or 1.</Paragraph>
    <Paragraph position="27"> 4All optimization was done using conjugate gradient ascent over log parameters i D log i , rather than the given parameters due to sensitivity near zero and improved quality of quadratic approximations during optimization. Linear constraints over are not linear in log space, and were enforced using a quadratic Lagrange penalty method (Bertsekas, 1995).  ous objectives. Scores are usually higher on both training and test sets for the objective maximized, and discriminative criteria lead to better test-set accuracy. The best scores are in bold. global over the feasible region.</Paragraph>
    <Paragraph position="28"> Figure 2 shows, for each objective maximized, the values of all objectives on both the training and test set. Optimizing for a given objective generally gave the best score for that objective for both the training set and the test set. The exception is NB-SCL and NB-SCL* which have lower SCL score than NB-CL and NB-CL*. This is due to the penalty used for smoothing the summed models (see fn. 3).</Paragraph>
    <Paragraph position="29"> Accuracy is higher when optimizing the discriminative objectives, CL and SCL, than when optimizing JL (including for macro-averaging, where each word's contribution to average accuracy is made equal). That these estimates beat NB-JL on accuracy is unsurprising, since Acc is a discretization of conditional predictions, not joint ones. This supports the claim that maximizing conditional likelihood, or other discriminative objectives, improves test set accuracy for realistic NLP tasks. NB-SCL, though harder to maximize in general, gives better test-set accuracy than NB-CL.5 NB-CL* is somewhere between JL and CL for all objectives on the training data. Its behavior shows that the change from a standard NB approach (NB-JL) to a maximum entropy classifier (NB-CL) can be broken into two aspects: a change in objective and an abandonment of a non-deficiency constraint.6 Note that the JL score for NB-CL*, is not very much lower than for NB-JL, despite a large change in CL.</Paragraph>
    <Paragraph position="30"> It would be too strong to state that maximizing CL 5This difference seems to be partially due to the different smoothing methods used: Chen and Rosenfeld (1999) show that quadratic penalties are very effective in practice, while the smoothing-data method is quite crude.</Paragraph>
    <Paragraph position="31">  WSD on most SENSEVAL-2 word sets. The relative improvement gained by switching to conditional estimation is positively correlated to training set size.</Paragraph>
    <Paragraph position="32"> (in particular) and discriminative objectives (in general) is always better than maximizing JL for improving test-set accuracy. Even on the present task, CL strictly beat JL in accuracy for only 15 of 24 words.</Paragraph>
    <Paragraph position="33"> Figure 3 shows a plot of the relative accuracy for CL: .AccCL AccJL/=AccJL. The x-axis is the average number of training instances per sense, weighted by the frequency of that sense in the test data. There is a clear trend that larger training sets saw a larger benefit from using NB-CL. The scatter in this trend is partially due to the wide range in data set conditions. The data sets exhibit an unusual amount of drift between training and test distributions. For example, the test data for amaze consists entirely of 70 instances of the less frequent of its two training senses, and represents the highest point on this graph, with NB-CL having a relative accuracy increase of 28%. This drift between the training and test corpora generally favors conditional estimates. On the other hand, many of these data sets are very small, individually, and 6 of the 7 sets where NB-JL wins are among the 8 smallest, 4 of them in fact being the 4 smallest. Ng and Jordan (2002) show that, between NB-JL and NB-CL, the discriminative NB-CL should, in principle, have a lower asymptotic error, but the generative NB-JL should perform better in low-data situations. They argue that unless one has a relatively large data set, one is in fact likely to be better off with the generative estimate. Their claim seems too strong here; even smaller data sets often show benefit to accuracy from CL estimation, although all would qualify as small on their scale.</Paragraph>
    <Paragraph position="34"> Since the number of senses and skew towards common senses is so varied between SENSEVAL-2 words, we turned to larger data sets to test the effective &amp;quot;break-even&amp;quot; size for WSD data, using the hard and line data from Leacock et al. (1998). Figure 4 shows the accuracy of NB-CL and NB-JL as the amount of training data increases. Conditional beats  provement is greater with larger training sets. Only for the line data does the conditional model ever drop below the joint model.</Paragraph>
    <Paragraph position="35"> For this task, then, NB-CL is performing better than expected. This appears to be due to two ways in which CL estimation is suited to linguistic data. First, the Ng and Jordan results do not involve smoothed data. Their data sets do not require it like linguistic data does, and smoothing largely prevents the low-data overfitting that can plague conditional models. There is another, more interesting reason why conditional estimation for this model might work better for an NLP task like WSD than for a general machine learning task. One signature difficulty in NLP is that the data contains a great many rare observations. In the case of WSD, the issue is in telling the kinds of rare events apart. Consider a word w which occurs only once, with a sense s. In the joint model, smoothing ensures that w does not signal s too strongly. However, every w which occurs only once with s will receive the same P.wjs/. Ideally, we would want to be able to tell the accidental singletons from true indicator words. The conditional model implicitly does this to a certain extent. If w occurs with s in an example where other good indicator words are present, then those other words' large weights will explain the occurrence of s, and without w having to have a large weight, its expected count with s in that instance will approach 1. On the other hand, if no trigger words occur in that instance, there will be no other explanation for s other than the presence of w and the other non-indicative words. Therefore, w's weight, and the other words', will grow until s is predicted sufficiently strongly.</Paragraph>
    <Paragraph position="36"> As a concrete illustration, we isolated two senses of &amp;quot;line&amp;quot; into a two-sense data set. Sense 1 was &amp;quot;a queue&amp;quot; and sense 2 was &amp;quot;a phone line.&amp;quot; In this corpus, the words transatlantic and flowers both occur only once, and only with the &amp;quot;phone&amp;quot; sense (plus once with each in the smoothing data). However, transatlantic occurs in the instance thanks, anyway, the transatlantic line 2 died. , while flowers occurs in the longer instance . . . phones with more than one line 2, plush robes, exotic flowers, and complimentary wine. In the first instance, the only non-singleton content word is died which occurs once with sense 1 and twice with sense 2. However, in the other case, phone occurs 191 times with sense 2 and only 4 times with sense 1. Additionally, there are more words in the second instance with which flowers can share the burden of increasing its expectation. Experimentally,</Paragraph>
    <Paragraph position="38"> With joint estimation, both words signal sense 2 with equal strength. With conditional estimation, the presense of words like phone cause flowers to indicate sense 2 less strongly that transatlantic. Given that the conditional estimation is implicitly differentially weighting rare events in a plausibly way, it is perhaps unsurprising that a task like WSD would see the benefits on smaller corpus sizes than would be expected on standard machine-learning data sets.7 These trends are reliable, but sometimes small. In practice, one must decide if, for example, a 5% error reduction is worth the added work: CG optimization, especially with constraints, is considerably harder to implement than simple RFE estimates for JL. It is also considerably slower: the total training time for the entire SENSEVAL-2 corpus was less than 3 seconds for NB-JL, but two hours for NB-CL.</Paragraph>
    <Paragraph position="39"> 3 Model Structure: HMMs and CMMs We now consider sequence data, with POS tagging as a concrete NLP example. In the previous section, we had a single model, but several ways of estimating parameters. In this section, we have two different model structures.</Paragraph>
    <Paragraph position="40"> First is the classic hidden Markov model (HMM), shown in figure 6a. For an instance .s; o/, where 7Interestingly, the common approach of discarding low-count events (for both training speed and overfitting reasons) when estimating the conditional models used in maxent taggers robs the system of the opportunity to exploit this effect of conditional estimation.</Paragraph>
    <Paragraph position="41">  estimation is slightly advantageous. For a fixed objective, the MEMM is inferior, though it can be improved by unobserving unambiguous words.</Paragraph>
    <Paragraph position="42"> o D hoii is a word sequence and s D hsii is a tag sequence, we write the following (joint) model:</Paragraph>
    <Paragraph position="44"> where we use a start state s0 to simplify notation.</Paragraph>
    <Paragraph position="45"> The parameters of this model are the transition and emission probabilities. Again, we can set these parameters to maximize JL, as is typical, or we can set them to maximize other objectives, without changing the model structure. If we maximize CL, we get (possibly deficient) HMMs which are instances of the conditional random fields of Lafferty et al. (2001).8 Figure 5 shows the tagging accuracy of an HMM trained to maximize each objective. JL is the standard HMM. CL duplicates the simple CRFs in (Lafferty et al., 2001). CL is again an intermediate, where we optimized conditional likelihood but required the HMM to be non-deficient. This separates out the benefit of the conditional objective from the benefit from the possibility of deficiency (which relates to label bias, see below). In accordance with our observations in the last section, and consistent with the results of (Lafferty et al., 2001), the CL accuracy is slightly higher than JL for this fixed model.</Paragraph>
    <Paragraph position="46"> Another model often used for sequence data is the upward Conditional Markov Model (CMM), shown as a graphical model in figure 6b. This is the model used in maximum entropy tagging. The graphical model shown gives a joint distribution over .s; o/, just like an HMM. It is a conditionally structured model, in the sense that that distribution can be written as P.s; o/ D P.sjo/P.o/. Since tagging only uses P.sjo/, we can discard what the model says about P.o/. The model as drawn assumes that each observation is independent, but we could add any arrows we please among the oi without changing the conditional predictions. Therefore, it is common to think about this model as if the joint interpretation were absent, and not to model the observations at all. For models which are conditional in the sense of 8The general class of CRFs is more expressive and reduces to deficient HMMs only when they have just these features.</Paragraph>
    <Paragraph position="48"> the upward conditional Markov model (CMM).</Paragraph>
    <Paragraph position="49"> the factorization above, the JL and CL estimates for P.sjo/ will always be the same. It is therefore tempting to believe that since one can find closed-form CL estimates (the RFEs) for these models, one can gain the benefit of conditional estimation. We will show that this is not true, at least not here.</Paragraph>
    <Paragraph position="50"> Adopting the CMM has effects in and of itself, regardless of whether a maximum entropy approach is used to populate the P.sjs 1; o/ estimates. The ML estimate for this model is the RFE for P.sjs 1; o/.</Paragraph>
    <Paragraph position="51"> For tagging, sparsity makes this impossible to reliably estimate directly, but even if we could do so, we would have a graphical model with several defects.</Paragraph>
    <Paragraph position="52"> Every graphical model embodies conditional independence assumptions. The NB model assumes that observations are independent given the class. The HMM assumes the Markov property that future observations are independent from past ones given the intermediate state. Both assumptions are obviously false in the data, but the models do well enough for the tasks we ask of them. However, the assumptions in this upward model are worse, both qualitatively and quantitatively. It is a conditional model, in that the model can be factored as P.o/P.sjo/. As a result, it makes no useful statement about the distribution of the data, making it useless, for example, for generation or language modeling. But more subtly note that states are independent of future observations. As a result, future cues are unable to influence past decisions in certain cases. For example, imagine tagging an entire sentence where the first word is an unknown word. With this model structure, if we ask about the possible tags for the first word, we will get back the marginal distribution over (sentence-initial) unknown words' tags, regardless of the following words.</Paragraph>
    <Paragraph position="53"> We constructed two taggers. One was an HMM, as in figure 6a. It was trained for JL, CL , and CL. The second was a CMM, as in figure 6b. We used a maximum entropy model over the (word, tag) and (previous-tag, tag) features to approximate the P.sjs 1; o/ conditional probabilities. This CMM is referred to as an MEMM. A 9-1 split of the Penn tree-bank was used as the data corpus. To smooth these models as equally as possible and to give as unified a treatment of unseen words as possible, we mapped all words which occurred only once in training to an unknown token. New words in the test data were also mapped to this token.9 Using these taggers, we examined what kinds of errors actually occurred. One kind of error tendency in CMMs which has been hypothesized in the literature is called label bias (Bottou, 1991; Lafferty et al., 2001). Label bias is a type of explaining-away phenomenon (Pearl, 1988) which can be attributed to the local conditional modeling of each state. The idea is that states whose following-state distributions have low entropy will be preferred. Whatever mass arrives at a state must be pushed to successor states; it cannot be dumped on alternate observations as in an HMM. In theory, this means that the model can get into a dysfunctional behavior where a trajectory has no relation to the observations but will still stumble onward with high conditional probability. The sense in which this is an explaining-away phenomenon is that the previous state explains the current state so well that the observation at the current state is effectively ignored. What we found in the case of POS tagging was the opposite. The state-state distributions are on average nowhere near as sharply distributed as the state-observation distributions. This gives rise to the reverse explaining-away effect. The observations explain the states above them so well that the previous states are effectively ignored. We call this observation bias.</Paragraph>
    <Paragraph position="54"> As an example, consider what happens when a word has only a single tag. The conditional distribution for the tag above that word will always assign conditional probability one to that single tag, regardless of the previous tag. Figure 7 shows the sentence All the indexes dove ., in which All should be tagged as a predeterminer (PDT).10 Most occurrences of All, however, are as a determiner (DT, 106/135 vs 26/135), and it is much more common for a sentence to begin with a determiner than a predeterminer. The  other words occur with only one tag in the treebank.11 The HMM tags this sentence correctly, because two determiners in a row is rarer than All being a predeterminer (and a predeterminer beginning a sentence). However, the MEMM shows exactly the effect described above, choosing the most common tag (DT) for All, since the choice of tag for All does not effect the conditional tagging distribution for the. The MEMM parameters do assign a lower weight to the DT DT feature than to the PDT DT feature, but the the ensures a DT tag, regardless.</Paragraph>
    <Paragraph position="55"> Exploiting the joint interpretation of the CMM, what we can do is to unobserve word nodes, leaving the graphical model as it is, but changing the observation status of a given node to &amp;quot;not observed&amp;quot;. For example, we can retain our knowledge that the state above the is DT, but &amp;quot;forget&amp;quot; that we know that the word at that position is the. If we do inference in this example with the unobserved, taking a weighted sum over all values of that node, then the conditional distribution over tag sequences changes as shown under MEMM+: the correct tagging has once again become most probable. Unobserving the word itself is not a priori a good idea. It could easily put too much pressure on the last state to explain the fixed state. This effect is even visible in this small example: the likelihood of the more typical PDT-DT tag sequence is even higher for MEMM+ than the HMM.</Paragraph>
    <Paragraph position="56"> These issues are quite important for NLP, since state-of-the-art statistical taggers are all based on one of these two models. In order to check which, if either, of label or observation bias is actually contributing to tagging error, we performed the following experiments with our simple HMM and MEMM taggers.</Paragraph>
    <Paragraph position="57"> First, we measured, on the training data, the entropy of the next-state distribution P.sjs 1/ for each state s. For both the HMM and MEMM, we then measured the relative overproposal rate for each state: the number of errors where that state was incorrectly predicted in the test set, divided by the overall frequency of that state in the correct answers. The label bias hypothesis makes a concrete prediction: lower entropy 11For the sake of clarity, this example has been slightly doctored by the removal of several non-DT occurrences of the in the  positively correlated with the relative over-proposal frequency (y-axis) of the tags for the MEMM model, though it is slightly so with the HMM model.</Paragraph>
    <Paragraph position="58"> states should have higher relative overproposal values, especially for the MEMM. Figure 8 shows that the trends, if any, are not clear. There does appear to be a slight tendency to have higher error on the low-entropy tags for the HMM, but if there is any superficial trend for the MEMM, it is the reverse.</Paragraph>
    <Paragraph position="59"> On the other hand, if systematically unobserving unambiguous observations in the MEMM led to an increase in accuracy, then we would have evidence of observation bias. Figure 5 shows that this is exactly the case. The error rate of the MEMM drops when we unobserve these single-tag words (from 10.8% to 9.5%), and the error rate in positions before such words drops even more sharply (17.1% to 15.0%).</Paragraph>
    <Paragraph position="60"> The drop in overall error in fact cuts the gap between the HMM and the MEMM by about half.</Paragraph>
    <Paragraph position="61"> The claim here is not that label bias is impossible for MEMMs, nor that state-of-the-art maxent taggers would necessarily benefit from the unobserving of fixed-tag words - if there are already (tag, nextword) features in the model, this effect should be far weaker. The claim is that the independence assumptions embodied by the conditionally structured model were the primary root of the lower accuracy for this model. Label bias and observation bias are both explaining-away phenomena, and are both consequences of these assumptions. Explaining-away effects will be found quite generally in conditionallystructured models, and should be carefully considered before such models are adopted. The effect can be good or bad: In the case of the NB-CL model, there was also an explaining-away effects among the words. This is exactly the cause for flowers being a weaker indicator than transatlantic in our conditional estimation example. In that case, we wanted certain word occurrences to be explained away by the presence of more explanatory words. However, when some of the competing conditioned features are previous local decisions, ignoring them can be harmful.</Paragraph>
  </Section>
class="xml-element"></Paper>