<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-2030">
  <Title>Feature Selection for a Rich HPSG Grammar Using Decision Trees</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Hand-built NLP grammars frequently have a depth of linguistic representation and constraints not present in current treebanks, giving them potential importance for tasks requiring deeper processing. On the other hand, these manually built grammars need to solve the disambiguation problem to be practically usable.</Paragraph>
    <Paragraph position="1"> This paper presents work on the problem of probabilistic parse selection from among a set of alternatives licensed by a hand-built grammar in the context of the newly developed Redwoods HPSG treebank (Oepen et al., 2002). HPSG (Head-driven Phrase Structure Grammar) is a modern constraint-based lexicalist (unification) grammar, described in Pollard and Sag (1994).</Paragraph>
    <Paragraph position="2"> The Redwoods treebank makes available syntactic and semantic analyses of much greater depth than, for example, the Penn Treebank. Therefore there are a large number of features available that could be used by stochastic models for disambiguation. Other researchers have worked on extracting features useful for disambiguation from unification grammar analyses and have built log linear models a.k.a. Stochastic Unification Based Grammars (Johnson et al., 1999; Riezler et al., 2000). Here we also use log linear models to estimate conditional probabilities of sentence analyses. Since feature selection is almost prohibitive for these models, because of high computational costs, we use PCFG models to select features for log linear models. Even though this method may be expected to be suboptimal, it proves to be useful. We select features for PCFGs using decision trees and use the same features in a conditional log linear model. We compare the performance of the two models using equivalent features.</Paragraph>
    <Paragraph position="3"> Our PCFG models are comparable to branching process models for parsing the Penn Treebank, in which the next state of the model depends on a history of features. In most recent parsing work the history consists of a small number of manually selected features (Charniak, 1997; Collins, 1997). Other researchers have proposed automatically selecting the conditioning information for various states of the model, thus potentially increasing greatly the space of possible features and selectively choosing the best predictors for each situation. Decision trees have been applied for feature selection for statistical parsing models by Magerman (1995) and Haruno et al. (1998). Another example of automatic feature selection for parsing is in the context of a deterministic parsing model that chooses parse actions based on automatically induced decision structures over a very rich feature set (Hermjakob and Mooney, 1997).</Paragraph>
    <Paragraph position="4"> Our experiments in feature selection using decision trees suggest that single decision trees may not be able to make optimal use of a large number of relevant features. This may be due to the greedy search procedures or to the fact that trees combine information from different features only through partitioning of the space. For example they have difficulty in weighing evidence from different features without fully partitioning the space.</Paragraph>
    <Paragraph position="5"> A common approach to overcoming some of the problems with decision trees - such as reducing their variance or increasing their representational power - has been building ensembles of decision trees by, for example, bagging (Breiman, 1996) or boosting (Freund and Schapire, 1996). Haruno et al. (1998) have experimented with boosting decision trees, reporting significant gains. Our approach is to build separate decision trees using different (although not disjoint) subsets of the feature space and then to combine their estimates by using the average of their predictions. A similar method based on random feature subspaces has been proposed by Ho (1998), who found that the random feature sub-space method outperformed bagging and boosting for datasets with a large number of relevant features where there is redundancy in the features. Other examples of ensemble combination based on different feature subspaces include Zheng (1998) who learns combinations of Naive Bayes classifiers and Zenobi and Cunningham (2001) who create ensembles of kNN classifiers.</Paragraph>
    <Paragraph position="6"> We begin by describing the information our HPSG corpus makes available and the subset we have attempted to use in our models. Next we describe our ensembles of decision trees for learning parameterizations of branching process models. Finally, we report parse disambiguation results for these models and corresponding conditional log linear models.</Paragraph>
  </Section>
class="xml-element"></Paper>