<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1042">
  <Title>Joint and conditional estimation of tagging and parsing models</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Many statistical NLP applications, such as tagging and parsing, involve finding the value of some hidden variable Y (e.g., a tag or a parse tree) which maximizes a conditional probability distribution P (Y jX), where X is a given word string. The model parameters are typically estimated by maximum likelihood: i.e., maximizing the likelihood of the training I would like to thank Eugene Charniak and the other members of BLLIP for their comments and suggestions. Fernando Pereira was especially generous with comments and suggestions, as were the ACL reviewers; I apologize for not being able to follow up all of your good suggestions. This research was supported by NSF awards 9720368 and 9721276 and NIH award R01 MH60922-01A2.</Paragraph>
    <Paragraph position="1"> data. Given a (fully observed) training corpus D = ((y1;x1);::: ;(yn;xn)), the maximum (joint) likelihood estimate (MLE) of is:</Paragraph>
    <Paragraph position="3"> However, it turns out there is another maximum likelihood estimation method which maximizes the conditional likelihood or &amp;quot;pseudo-likelihood&amp;quot; of the training data (Besag, 1975). Maximum conditional likelihood is consistent for the conditional distribution. Given a training corpus D, the maximum conditional likelihood estimate (MCLE) of the model parameters is:</Paragraph>
    <Paragraph position="5"> Figure 1 graphically depicts the difference between the MLE and MCLE. Let be the universe of all possible pairs (y;x) of hidden and visible values. Informally, the MLE selects the model parameter which make the training data pairs (yi;xi) as likely as possible relative to all other pairs (y0;x0) in . The MCLE, on the other hand, selects the model parameter in order to make the training data pair (yi;xi) more likely than other pairs (y0;xi) in , i.e., pairs with the same visible value xi as the training datum.</Paragraph>
    <Paragraph position="6"> In statistical computational linguistics, maximum conditional likelihood estimators have mostly been used with general exponential or &amp;quot;maximum entropy&amp;quot; models because standard maximum likelihood estimation is usually computationally intractable (Berger et al., 1996; Della Pietra et al., 1997; Jelinek, 1997). Well-known computational linguistic models such as</Paragraph>
    <Paragraph position="8"> likely as possible (relative to ), while the MCLE makes (yi; xi) as likely as possible relative to other pairs (y0; xi).</Paragraph>
    <Paragraph position="9"> Maximum-Entropy Markov Models (McCallum et al., 2000) and Stochastic Unification-based Grammars (Johnson et al., 1999) are standardly estimated with conditional estimators, and it would be interesting to know whether conditional estimation affects the quality of the estimated model. It should be noted that in practice, the MCLE of a model with a large number of features with complex dependencies may yield far better performance than the MLE of the much smaller model that could be estimated with the same computational effort. Nevertheless, as this paper shows, conditional estimators can be used with other kinds of models besides MaxEnt models, and in any event it is interesting to ask whether the MLE differs from the MCLE in actual applications, and if so, how.</Paragraph>
    <Paragraph position="10"> Because the MLE is consistent for the joint distribution P(Y;X) (e.g., in a tagging application, the distribution of word-tag sequences), it is also consistent for the conditional distribution P(Y jX) (e.g., the distribution of tag sequences given word sequences) and the marginal distribution P(X) (e.g., the distribution of word strings). On the other hand, the MCLE is consistent for the conditional distribution P(Y jX) alone, and provides no information about either the joint or the marginal distributions. Applications such as language modelling for speech recognition and EM procedures for estimating from hidden data either explicitly or implicitly require marginal distributions over the visible data (i.e., word strings), so it is not statistically sound to use MCLEs for such applications. On the other hand, applications which involve predicting the value of the hidden variable from the visible variable (such as tagging or parsing) usually only involve the conditional distribution, which the MCLE estimates directly.</Paragraph>
    <Paragraph position="11"> Since both the MLE and MCLE are consistent for the conditional distribution, both converge in the limit to the &amp;quot;true&amp;quot; distribution if the true distribution is in the model class. However, given that we often have insufficient data in computational linguistics, and there are good reasons to believe that the true distribution of sentences or parses cannot be described by our models, there is no reason to expect these asymptotic results to hold in practice, and in the experiments reported below the MLE and MCLE behave differently experimentally. null A priori, one can advance plausible arguments in favour of both the MLE and the MCLE. Informally, the MLE and the MCLE differ in the following way. Since the MLE is obtained by maximizing Qi P (yijxi)P (xi), the MLE exploits information about the distribution of word strings xi in the training data that the MCLE does not. Thus one might expect the MLE to converge faster than the MCLE in situations where training data is not over-abundant, which is often the case in computational linguistics.</Paragraph>
    <Paragraph position="12"> On the other hand, since the intended application requires a conditional distribution, it seems reasonable to directly estimate this conditional distribution from the training data as the MCLE does. Furthermore, suppose that the model class is wrong (as is surely true of all our current language models), i.e., the &amp;quot;true&amp;quot; model P(Y;X) 6= P (Y;X) for all , and that our best models are particularly poor approximations to the true distribution of word strings P(X). Then ignoring the distribution of word strings in the training data as the MCLE does might indeed be a reasonable thing to do.</Paragraph>
    <Paragraph position="13"> The rest of this paper is structured as follows. The next section formulates the MCLEs for HMMs and PCFGs as constrained optimization problems and describes an iterative dynamic-programming method for solving them. Because of the computational complexity of these problems, the method is only applied to a simple PCFG based on the ATIS corpus. For this example, the MCLE PCFG does perhaps produce slightly better parsing results than the standard MLE (relative-frequency) PCFG, although the result does not reach statistical significance.</Paragraph>
    <Paragraph position="14"> It seems to be difficult to find model classes for which the MLE and MCLE are both easy to compute. However, often it is possible to find two closely related model classes, one of which has an easily computed MLE and the other which has an easily computed MCLE. Typically, the model classes which have an easily computed MLE define joint probability distributions over both the hidden and the visible data (e.g., over word-tag pair sequences for tagging), while the model classes which have an easily computed MCLE define conditional probability distributions over the hidden data given the visible data (e.g., over tag sequences given word sequences).</Paragraph>
    <Paragraph position="15"> Section 3 investigates closely related joint and conditional tagging models (the latter can be regarded as a simplification of the Maximum Entropy Markov Models of McCallum et al. (2000)), and shows that MLEs outperform the MCLEs in this application. The final empirical section investigates two different kinds of stochastic shift-reduce parsers, and shows that the model estimated by the MLE outperforms the model estimated by the MCLE.</Paragraph>
  </Section>
class="xml-element"></Paper>