<?xml version="1.0" standalone="yes"?>
<Paper uid="J05-1003">
  <Title>Discriminative Reranking for Natural Language Parsing</Title>
  <Section position="2" start_page="0" end_page="27" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Machine-learning approaches to natural language parsing have recently shown some success in complex domains such as news wire text. Many of these methods fall into the general category of history-based models, in which a parse tree is represented as a derivation (sequence of decisions) and the probability of the tree is then calculated as a product of decision probabilities. While these approaches have many advantages, it can be awkward to encode some constraints within this framework. In the ideal case, the designer of a statistical parser would be able to easily add features to the model  Submission received: 15th October 2003; Accepted for publication: 29th April 2004 that are believed to be useful in discriminating among candidate trees for a sentence. In practice, however, adding new features to a generative or history-based model can be awkward: The derivation in the model must be altered to take the new features into account, and this can be an intricate task.</Paragraph>
    <Paragraph position="1"> This article considers approaches which rerank the output of an existing probabilistic parser. The base parser produces a set of candidate parses for each input sentence, with associated probabilities that define an initial ranking of these parses. A second model then attempts to improve upon this initial ranking, using additional features of the tree as evidence. The strength of our approach is that it allows a tree to be represented as an arbitrary set of features, without concerns about how these features interact or overlap and without the need to define a derivation which takes these features into account.</Paragraph>
    <Paragraph position="2"> We introduce a new method for the reranking task, based on the boosting approach to ranking problems described in Freund et al. (1998). The algorithm can be viewed as a feature selection method, optimizing a particular loss function (the exponential loss function) that has been studied in the boosting literature. We applied the boosting method to parsing the Wall Street Journal (WSJ) treebank (Marcus, Santorini, and Marcinkiewicz 1993). The method combines the log-likelihood under a baseline model (that of Collins [1999]) with evidence from an additional 500,000 features over parse trees that were not included in the original model. The baseline model achieved 88.2% F-measure on this task. The new model achieves 89.75% Fmeasure, a 13% relative decrease in F-measure error.</Paragraph>
    <Paragraph position="3"> Although the experiments in this article are on natural language parsing, the approach should be applicable to many other natural language processing (NLP) problems which are naturally framed as ranking tasks, for example, speech recognition, machine translation, or natural language generation. See Collins (2002a) for an application of the boosting approach to named entity recognition, and Walker, Rambow, and Rogati (2001) for the application of boosting techniques for ranking in the context of natural language generation.</Paragraph>
    <Paragraph position="4"> The article also introduces a new, more efficient algorithm for the boosting approach which takes advantage of the sparse nature of the feature space in the parsing data. Other NLP tasks are likely to have similar characteristics in terms of sparsity. Experiments show an efficiency gain of a factor of 2,600 for the new algorithm over the obvious implementation of the boosting approach. Efficiency issues are important, because the parsing task is a fairly large problem, involving around one million parse trees and over 500,000 features. The improved algorithm can perform 100,000 rounds of feature selection on our task in a few hours with current processing speeds. The 100,000 rounds of feature selection require computation equivalent to around 40 passes over the entire training set (as opposed to 100,000 passes for the ''naive'' implementation).</Paragraph>
    <Paragraph position="5"> The problems with history-based models and the desire to be able to specify features as arbitrary predicates of the entire tree have been noted before. In particular, previous work (Ratnaparkhi, Roukos, and Ward 1994; Abney 1997; Della Pietra, Della Pietra, and Lafferty 1997; Johnson et al. 1999; Riezler et al. 2002) has investigated the use of Markov random fields (MRFs) or log-linear models as probabilistic models with global features for parsing and other NLP tasks. (Log-linear models are often referred to as maximum-entropy models in the NLP literature.) Similar methods have also been proposed for machine translation (Och and Ney 2002) and language understanding in dialogue systems (Papineni, Roukos, and Ward 1997, 1998). Previous work (Friedman, Hastie, and Tibshirani 1998) has drawn connections between log-linear models and  boosting for classification problems. One contribution of our research is to draw similar connections between the two approaches to ranking problems.</Paragraph>
    <Paragraph position="6"> We argue that the efficient boosting algorithm introduced in this article is an attractive alternative to maximum-entropy models, in particular, feature selection methods that have been proposed in the literature on maximum-entropy models. The earlier methods for maximum-entropy feature selection methods (Ratnaparkhi, Roukos, and Ward 1994; Berger, Della Pietra, and Della Pietra 1996; Della Pietra, Della Pietra, and Lafferty 1997; Papineni, Roukos, and Ward 1997, 1998) require several full passes over the training set for each round of feature selection, suggesting that at least for the parsing data, the improved boosting algorithm is several orders of magnitude more efficient.</Paragraph>
    <Paragraph position="7">  In section 6.4 we discuss our approach in comparison to these earlier methods for feature selection, as well as the more recent work of McCallum (2003); Zhou et al. (2003); and Riezler and Vasserman (2004).</Paragraph>
    <Paragraph position="8"> The remainder of this article is structured as follows. Section 2 reviews history-based models for NLP and highlights the perceived shortcomings of history-based models which motivate the reranking approaches described in the remainder of the article. Section 3 describes previous work (Friedman, Hastie, and Tibshirani 2000; Duffy and Helmbold 1999; Mason, Bartlett, and Baxter 1999; Lebanon and Lafferty 2001; Collins, Schapire, and Singer 2002) that derives connections between boosting and maximum-entropy models for the simpler case of classification problems; this work forms the basis for the reranking methods. Section 4 describes how these approaches can be generalized to ranking problems. We introduce loss functions for boosting and MRF approaches and discuss optimization methods. We also derive the efficient algorithm for boosting in this section. Section 5 gives experimental results, investigating the performance improvements on parsing, efficiency issues, and the effect of various parameters of the boosting algorithm. Section 6 discusses related work in more detail. Finally, section 7 gives conclusions.</Paragraph>
    <Paragraph position="9"> The reranking models in this article were originally introduced in Collins (2000). In this article we give considerably more detail in terms of the algorithms involved, their justification, and their performance in experiments on natural language parsing.</Paragraph>
  </Section>
class="xml-element"></Paper>