<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1020">
<Title>A Fast Algorithm for Feature Selection in Conditional Maximum Entropy Modeling</Title>
<Section position="2" start_page="0" end_page="2" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Maximum Entropy (ME) modeling has received a lot of attention in language modeling and natural language processing over the past few years (e.g., Rosenfeld, 1994; Berger et al., 1996; Ratnaparkhi, 1998; Koeling, 2000). One of the main advantages of using ME modeling is the ability to incorporate various features in the same framework with a sound mathematical foundation. There are two main tasks in ME modeling: the feature selection process, which chooses from a feature space a subset of good features to be included in the model; and the parameter estimation process, which estimates the weighting factors for each selected feature in the exponential model. This paper is primarily concerned with the feature selection process in ME modeling.</Paragraph>
<Paragraph position="1"> While the majority of the work on ME modeling has focused on parameter estimation, less effort has been devoted to feature selection. This is partly because feature selection may not be necessary for certain tasks when parameter estimation algorithms are fast. However, when a feature space is large and complex, it is clearly advantageous to perform feature selection: it not only speeds up probability computation and reduces the memory required when the model is applied, but also shortens the cycle of model selection during training.</Paragraph>
<Paragraph position="2"> Feature selection is a very difficult optimization task when the feature space under investigation is large, because we are essentially trying to find the best subset from the collection of all possible feature subsets, which has a size of 2^|W|, where |W| is the size of the feature space.</Paragraph>
<Paragraph position="3"> In the past, most researchers resorted to a simple count cutoff technique for selecting features (Rosenfeld, 1994; Ratnaparkhi, 1998; Reynar and Ratnaparkhi, 1997; Koeling, 2000), where only the features that occur in a corpus more than a pre-defined cutoff threshold are selected. Chen and Rosenfeld (1999) experimented with a feature selection technique that uses a χ² test to decide whether a feature should be included in the ME model, where the χ² statistic is computed from the count under a prior distribution and the count in the real training data. It is a simple and probably effective technique for language modeling tasks. However, since ME models are optimized using their likelihood or likelihood gains as the criterion, it is important to establish the relationship between the χ² test score and the likelihood gain, and such a relationship has not been established. Berger et al. (1996) presented an incremental feature selection (IFS) algorithm in which only one feature is added at each selection step and the estimated parameter values are kept for the features selected in previous steps. While this greedy search assumption is reasonable, the speed of the IFS algorithm is still an issue for complex tasks. To better understand its performance, we re-implemented the algorithm. Given a task with 600,000 training instances, it takes nearly four days to select 1000 features from a feature space of a little more than 190,000 features.
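To make the greedy selection procedure just described concrete, the following is a minimal, illustrative sketch of the IFS idea: at every step, each candidate feature is scored by the approximate log-likelihood gain it would add to the current conditional ME model, the best feature is selected, and the weights of previously selected features are kept fixed. The toy data, the feature encoding, and the helper names (`approx_gain`, `ifs`) are assumptions made purely for illustration; this is not the implementation evaluated in this paper.

```python
"""Illustrative sketch of incremental feature selection (IFS) for a
conditional maximum entropy model. Assumptions: toy data and binary
(context attribute, label) features; not the authors' implementation."""
import math

# Toy training data: (context attributes, label) pairs.
DATA = [(("w=bank", "prev=the"), "NOUN"), (("w=bank", "prev=to"), "VERB"),
        (("w=run", "prev=to"), "VERB"), (("w=run", "prev=the"), "NOUN")]
LABELS = sorted({y for _, y in DATA})

def feature_space(data):
    # Candidate binary features: (context attribute, label) pairs.
    return sorted({(attr, y) for ctx, _ in data for attr in ctx for y in LABELS})

def active(feat, ctx, y):
    attr, fy = feat
    return 1.0 if (attr in ctx and y == fy) else 0.0

def cond_probs(ctx, selected, weights):
    # p(y | ctx) under the exponential model built from the selected features.
    scores = [sum(weights[f] * active(f, ctx, y) for f in selected) for y in LABELS]
    z = sum(math.exp(s) for s in scores)
    return {y: math.exp(s) / z for y, s in zip(LABELS, scores)}

def avg_loglik(data, selected, weights):
    return sum(math.log(cond_probs(ctx, selected, weights)[y]) for ctx, y in data) / len(data)

def approx_gain(feat, data, selected, weights, newton_steps=25):
    """Best weight for one new feature, holding all previously selected
    weights fixed (the key IFS approximation); returns (gain, weight)."""
    base = avg_loglik(data, selected, weights)
    alpha, n = 0.0, len(data)
    for _ in range(newton_steps):
        grad = hess = 0.0
        for ctx, y in data:
            p = cond_probs(ctx, selected + [feat], {**weights, feat: alpha})
            mean = sum(p[yy] * active(feat, ctx, yy) for yy in LABELS)
            var = sum(p[yy] * active(feat, ctx, yy) ** 2 for yy in LABELS) - mean ** 2
            grad += (active(feat, ctx, y) - mean) / n
            hess -= var / n
        if abs(hess) < 1e-9:
            break
        alpha -= grad / hess          # Newton step on the one-dimensional gain
    gain = avg_loglik(data, selected + [feat], {**weights, feat: alpha}) - base
    return gain, alpha

def ifs(data, k):
    selected, weights = [], {}
    for _ in range(k):
        gains = {f: approx_gain(f, data, selected, weights)
                 for f in feature_space(data) if f not in selected}
        best = max(gains, key=lambda f: gains[f][0])   # feature with maximal gain
        selected.append(best)
        weights[best] = gains[best][1]
    return selected, weights

if __name__ == "__main__":
    feats, w = ifs(DATA, k=3)
    for f in feats:
        print(f, round(w[f], 3))
```

Even in this toy form, the cost structure is visible: every selection step rescans all remaining candidates and re-evaluates the model on the whole training set, which is what makes the original IFS algorithm slow on large feature spaces.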
Berger and Printz (1998) proposed an f-orthogonal condition for selecting k features at the same time without significantly affecting the quality of the selected features. While this technique is applicable to certain feature sets, such as the word link features reported in their paper, the f-orthogonal condition usually does not hold if part-of-speech tags are dominantly present in a feature subset. Past work, including Ratnaparkhi (1998) and Zhou et al. (2003), has shown that the IFS algorithm uses far fewer features than the count cutoff method while maintaining similar precision and recall on tasks such as prepositional phrase attachment, text categorization and base NP chunking. This leads us to further explore possible improvements to the IFS algorithm.</Paragraph>
<Paragraph position="4"> In section 2, we briefly review the IFS algorithm. A fast feature selection algorithm is then described in section 3. Section 4 presents a number of experiments, which demonstrate the massive speed-up and the quality of the features selected by the new algorithm. Finally, we conclude our discussion in section 5.</Paragraph>
</Section>
</Paper>