<?xml version="1.0" standalone="yes"?> <Paper uid="J05-1003"> <Title>Discriminative Reranking for Natural Language Parsing</Title> <Section position="3" start_page="27" end_page="29" type="metho"> <SectionTitle> 2. History-Based Models </SectionTitle> <Paragraph position="0"> Before discussing the reranking approaches, we describe history-based models (Black et al. 1992). They are important for a few reasons. First, several of the best-performing parsers on the WSJ treebank (e.g., Ratnaparkhi 1997; Charniak 1997, 2000; Collins 1997, 1999; Henderson 2003) are cases of history-based models. Many systems applied to part-of-speech tagging, speech recognition, and other language or speech tasks also fall into this class of model. Second, a particular history-based model (that of Collins [1999]) is used as the initial model for our approach. Finally, it is important to describe history-based models--and to explain their limitations--to motivate our departure from them.</Paragraph> <Paragraph position="1"> Parsing can be framed as a supervised learning task: to induce a function f : X → Y given training examples (x_i, y_i), where x_i ∈ X and y_i ∈ Y. GEN(x) denotes the set of candidates for a given input x. In the parsing problem x is a sentence, and GEN(x) is a set of candidate trees for that sentence. [Footnote 1: Note, however, that log-linear models which employ regularization methods instead of feature selection--see, for example, Johnson et al. (1999) and Lafferty, McCallum, and Pereira (2001)--are likely to be comparable in terms of efficiency to our feature selection approach. See section 6.3 for more discussion.] A particular characteristic of the problem is the complexity of GEN(x): GEN(x) can be very large, and each member of GEN(x) has a rich internal structure. This contrasts with "typical" classification problems in which GEN(x) is a fixed, small set, for example, {−1, +1} in binary classification problems.</Paragraph> <Paragraph position="5"> In probabilistic approaches, a model is defined which assigns a probability P(x, y) to each (x, y) pair. The most likely parse for each sentence x is then arg max_{y ∈ GEN(x)} P(x, y). This leaves the question of how to define P(x, y). In history-based approaches, a one-to-one mapping is defined between each pair (x, y) and a decision sequence ⟨d_1, ..., d_n⟩, and the probability is decomposed as P(x, y) = ∏_{i=1}^{n} P(d_i | Φ(d_1, ..., d_{i−1})), where ⟨d_1, ..., d_{i−1}⟩ is the history for the ith decision and Φ is a function which groups histories into equivalence classes, thereby making independence assumptions in the model. Probabilistic context-free grammars (PCFGs) are one example of a history-based model. The decision sequence ⟨d_1, ..., d_n⟩ is defined as the sequence of rule expansions in a top-down, leftmost derivation of the tree. The history is equivalent to a partially built tree, and Φ picks out the nonterminal being expanded (i.e., the leftmost nonterminal in the fringe of this tree), making the assumption that P(d_i | Φ(d_1, ..., d_{i−1})) depends only on the nonterminal being expanded. In the resulting model a tree with rule expansions ⟨A_i → β_i⟩ for i = 1, ..., n is assigned a probability ∏_{i=1}^{n} P(β_i | A_i).</Paragraph>
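As an illustration of the decomposition above, the following is a minimal Python sketch (not from the paper; the grammar and rule probabilities are hypothetical): each decision d_i is a rule expansion, and Φ maps the history to the nonterminal being expanded, so a tree's probability is simply the product of its rule probabilities.

    # Minimal sketch of a PCFG viewed as a history-based model (hypothetical numbers).
    decision_probs = {
        # key: (Phi(history) = nonterminal being expanded, decision = chosen right-hand side)
        ("S", ("NP", "VP")): 1.0,
        ("NP", ("she",)): 0.4,
        ("NP", ("her",)): 0.6,
        ("VP", ("VBD", "NP")): 0.7,
        ("VBD", ("saw",)): 1.0,
    }

    def phi(history, nonterminal):
        # For a PCFG, the equivalence class of the history is just the leftmost
        # nonterminal in the fringe of the partially built tree.
        return nonterminal

    def parse_probability(decisions):
        """decisions: list of (nonterminal being expanded, chosen right-hand side)."""
        p, history = 1.0, []
        for nonterminal, rhs in decisions:
            p *= decision_probs[(phi(history, nonterminal), rhs)]
            history.append((nonterminal, rhs))
        return p

    # Top-down, leftmost derivation of "she saw her"
    derivation = [("S", ("NP", "VP")), ("NP", ("she",)), ("VP", ("VBD", "NP")),
                  ("VBD", ("saw",)), ("NP", ("her",))]
    print(parse_probability(derivation))  # 1.0 * 0.4 * 0.7 * 1.0 * 0.6 ≈ 0.168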
<Paragraph position="12"> Our base model, that of Collins (1999), is also a history-based model. It can be considered to be a type of PCFG, where the rules are lexicalized. An example rule would be VP(saw) → VBD(saw) NP-C(her) NP(today). Lexicalization leads to a very large number of rules; to make the number of parameters manageable, the generation of the right-hand side of a rule is broken down into a number of decisions, as follows: First the head nonterminal (VBD in the above example) is chosen.</Paragraph> <Paragraph position="13"> Next, left and right subcategorization frames are chosen ({} and {NP-C}).</Paragraph> <Paragraph position="14"> Nonterminal sequences to the left and right of the VBD are chosen (an empty sequence to the left, ⟨NP-C, NP⟩ to the right). Finally, the lexical heads of the modifiers are chosen (her and today). [Footnote 2: To be more precise, generative probabilistic models assign joint probabilities P(x, y) to each (x, y) pair. Similar arguments apply to conditional history-based models, which define conditional probabilities P(y | x) through a definition P(y | x) = ∏_{i=1}^{n} P(d_i | x, Φ(d_1, ..., d_{i−1})), where ⟨d_1, ..., d_n⟩ are again the decisions made in building a parse, and Φ is a function that groups histories into equivalence classes. Note that x is added to the domain of Φ (the context on which decisions are conditioned). See Ratnaparkhi (1997) for one example of a method using this approach.] Figure 1 illustrates this process. Each of the above decisions has an associated probability conditioned on the left-hand side of the rule (VP(saw)) and other information in some cases.</Paragraph> [Figure 1: The sequence of decisions involved in generating the right-hand side of a lexical rule.] <Paragraph position="15"> History-based approaches lead to models in which the log-probability of a parse tree can be written as a linear sum of parameters α_k multiplied by features h_k(x, y), where each h_k(x, y) is the count of a different "event" or fragment within the tree. As an example, consider a PCFG with rules ⟨A_k → β_k⟩ for 1 ≤ k ≤ m. If h_k(x, y) is the number of times ⟨A_k → β_k⟩ is seen in the tree, and α_k = log P(β_k | A_k) is the parameter associated with that rule, then log P(x, y) = Σ_{k=1}^{m} α_k h_k(x, y).</Paragraph> <Paragraph position="25"> All models considered in this article take this form, although in the boosting models the score for a parse is not a log-probability. The features h_k define an m-dimensional vector of counts which represent the tree. The parameters α_k represent the influence of each feature on the score of a tree.</Paragraph>
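To make the feature-vector view concrete, here is a minimal Python sketch (not from the paper; the rules and probabilities are the same hypothetical ones as in the earlier sketch): the tree is represented by its rule counts h_k(x, y), each parameter is α_k = log P(β_k | A_k), and the score is the linear sum Σ_k α_k h_k(x, y).

    # Minimal sketch: log P(x, y) as a linear sum of parameters times feature counts.
    from math import log

    alpha = {                        # one parameter per rule, alpha_k = log P(beta_k | A_k)
        "S -> NP VP": log(1.0),
        "NP -> she": log(0.4),
        "NP -> her": log(0.6),
        "VP -> VBD NP": log(0.7),
        "VBD -> saw": log(1.0),
    }

    def score(feature_counts):
        """feature_counts: h_k(x, y), the number of times each rule occurs in the tree."""
        return sum(alpha[rule] * count for rule, count in feature_counts.items())

    # Rule counts for the tree of "she saw her" from the earlier sketch
    h = {"S -> NP VP": 1, "NP -> she": 1, "VP -> VBD NP": 1, "VBD -> saw": 1, "NP -> her": 1}
    print(score(h))  # log(0.4) + log(0.7) + log(0.6) = log(0.168), matching the product form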
<Paragraph position="26"> A drawback of history-based models is that the choice of derivation has a profound influence on the parameterization of the model. (Similar observations have been made in the related cases of belief networks [Pearl 1988] and language models for speech recognition [Rosenfeld 1997].) When designing a model, it would be desirable to have a framework in which features can be easily added to the model. Unfortunately, with history-based models adding new features often requires a modification of the underlying derivations in the model. Modifying the derivation to include a new feature type can be a laborious task. In an ideal situation we would be able to encode arbitrary features h_k without having to worry about formulating a derivation that included these features.</Paragraph> <Paragraph position="27"> To take a concrete example, consider part-of-speech tagging using a hidden Markov model (HMM). We might have the intuition that almost every sentence has at least one verb and therefore that sequences including at least one verb should have increased scores under the model. Encoding this constraint in a compact way in an HMM takes some ingenuity. The obvious approach--to add to each state the information about whether or not a verb has been generated in the history--doubles the number of states (and parameters) in the model. In contrast, it would be trivial to implement a feature h_k(x, y) which is 1 if y contains a verb and 0 otherwise.</Paragraph> </Section> <Section position="4" start_page="29" end_page="31" type="metho"> <SectionTitle> 3. Logistic Regression and Boosting </SectionTitle> <Paragraph position="0"> We now turn to machine-learning methods for the ranking task. In this section we review two methods for binary classification problems: logistic regression (or maximum-entropy) models and boosting. These methods form the basis for the reranking approaches described in later sections of the article. Maximum-entropy models are a very popular method within the computational linguistics community; see, for example, Berger, Della Pietra, and Della Pietra (1996) for an early article which introduces the models and motivates them. Boosting approaches to classification have received considerable attention in the machine-learning community since the introduction of AdaBoost by Freund and Schapire (1997).</Paragraph> <Paragraph position="1"> Boosting algorithms, and in particular the relationship between boosting algorithms and maximum-entropy models, are perhaps not familiar topics in the NLP literature. However, there has recently been much work drawing connections between the two methods (Friedman, Hastie, and Tibshirani 2000; Lafferty 1999; Duffy and Helmbold 1999; Mason, Bartlett, and Baxter 1999; Lebanon and Lafferty 2001; Collins, Schapire, and Singer 2002); in this section we review this work. Much of this work has focused on binary classification problems, and this section is also restricted to problems of this type. Later in the article we show how several of the ideas can be carried across to reranking problems.</Paragraph> <Section position="1" start_page="29" end_page="31" type="sub_section"> <SectionTitle> 3.1 Binary Classification Problems </SectionTitle> <Paragraph position="0"> The general setup for binary classification problems is as follows: The "input domain" (set of possible inputs) is X. The "output domain" (set of possible labels) is simply a set of two labels, Y = {−1, +1}. The training set is a set of examples (x_i, y_i) for i = 1, ..., n, where x_i ∈ X and y_i ∈ Y, and each input x is represented by m features h_k(x) for k = 1, ..., m.</Paragraph> <Paragraph position="2"> We show that both logistic regression and boosting implement a linear, or hyperplane, classifier. This means that given an input example x and parameter values ᾱ = ⟨α_1, ..., α_m⟩, the output from the classifier is sign(F(x, ᾱ)) (1), where F(x, ᾱ) = Σ_{k=1}^{m} α_k h_k(x) (2). The hyperplane F(x, ᾱ) = 0 divides the points of the space into two halves and has ᾱ as its normal. Points lying on one side of this hyperplane are classified as +1; points on the other side are classified as −1.</Paragraph>
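As a concrete illustration of the hyperplane classifier, here is a minimal Python sketch (not from the paper; the feature values and parameters are hypothetical):

    # Minimal sketch of the linear (hyperplane) classifier sign(F(x, alpha)).
    def F(h, alpha):
        """F(x, alpha) = sum_k alpha_k * h_k(x); h holds the feature values h_1(x), ..., h_m(x)."""
        return sum(a * v for a, v in zip(alpha, h))

    def classify(h, alpha):
        """Return +1 for points on one side of the hyperplane F(x, alpha) = 0, -1 for the other."""
        return 1 if F(h, alpha) >= 0 else -1

    alpha = [0.5, -1.2, 0.3]                   # parameter vector; also the hyperplane's normal
    print(classify([1.0, 0.2, 2.0], alpha))    # +1: F = 0.5 - 0.24 + 0.6 = 0.86
    print(classify([0.0, 1.0, 0.5], alpha))    # -1: F = -1.2 + 0.15 = -1.05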
<Paragraph position="3"> The central question in learning is how to set the parameters ᾱ, given the training examples (x_1, y_1), ..., (x_n, y_n). Logistic regression and boosting involve different algorithms and criteria for training the parameters ᾱ, but recent work (Friedman, Hastie, and Tibshirani 2000; Lafferty 1999; Duffy and Helmbold 1999; Mason, Bartlett, and Baxter 1999; Lebanon and Lafferty 2001; Collins, Schapire, and Singer 2002) has shown that the methods have strong similarities. The next section describes parameter estimation methods.</Paragraph> </Section> <Section position="2" start_page="31" end_page="31" type="sub_section"> <SectionTitle> 3.2 Loss Functions for Logistic Regression and Boosting </SectionTitle> <Paragraph position="0"> A central idea in both logistic regression and boosting is that of a loss function, which drives the parameter estimation methods of the two approaches. This section describes loss functions for binary classification. Later in the article, we introduce loss functions for reranking tasks which are closely related to the loss functions for classification tasks.</Paragraph> <Paragraph position="1"> First, consider a logistic regression model. The parameters of the model ᾱ are used to define a conditional probability P(y | x, ᾱ) = 1 / (1 + e^{−y F(x, ᾱ)}), where F(x, ᾱ) is as defined in equation (2). Some form of maximum-likelihood estimation is often used for parameter estimation. The parameters are chosen to maximize the log-likelihood of the training set; equivalently, we talk (to emphasize the similarities to the boosting approach) about minimizing the negative log-likelihood.</Paragraph> <Paragraph position="4"> The negative log-likelihood, LogLoss(ᾱ), is defined as LogLoss(ᾱ) = −Σ_{i=1}^{n} log P(y_i | x_i, ᾱ) = Σ_{i=1}^{n} log(1 + e^{−y_i F(x_i, ᾱ)}). There are many methods in the literature for minimizing LogLoss(ᾱ) with respect to ᾱ, for example, generalized or improved iterative scaling (Berger, Della Pietra, and Della Pietra 1996).</Paragraph> </Section> </Section> </Paper>