<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1020">
  <Title>A Fast Algorithm for Feature Selection in Conditional Maximum Entropy Modeling</Title>
  <Section position="3" start_page="2" end_page="2" type="metho">
    <SectionTitle>
2 The Incremental Feature Selection Algorithm
</SectionTitle>
    <Paragraph position="0"> gorithm For better understanding of our new algorithm, we start with briefly reviewing the IFS feature selection algorithm. Suppose the conditional ME model takes the following form:</Paragraph>
    <Paragraph position="2"> are their corresponding weights, and Z(x) is the normalization factor.</Paragraph>
    <Paragraph position="3"> The algorithm makes the approximation that the addition of a feature f in an exponential model affects only its associated weight a, leaving unchanged the l-values associated with the other features. Here we only present a sketch of the algorithm in Figure 1. Please refer to the original paper for the details.</Paragraph>
    <Paragraph position="4"> In the algorithm, we use I for the number of training instances, Y for the number of output classes, and F for the number of candidate features or the size of the candidate feature set.</Paragraph>
    <Paragraph position="6"> 3. if termination condition is met, then stop 4. Model adjustment: for instance i such that there is y and f</Paragraph>
    <Paragraph position="8"> One difference here from the original IFS algorithm is that we adopt a technique in (Goodman, 2002) for optimizing the parameters in the conditional ME training. Specifically, we use array z to store the normalizing factors, and array sum for all the un-normalized conditional probabilities sum[i, y]. Thus, one only needs to modify those sum[i, y] that satisfy f</Paragraph>
    <Paragraph position="10"> , y)=1, and to make changes to their corresponding normalizing factors z[i]. In contrast to what is shown in Berger et al 1996's paper, here is how the different values in this variant of the IFS algorithm are computed.</Paragraph>
    <Paragraph position="11"> Let us denote A</Paragraph>
    <Paragraph position="13"> Then, the model can be represented by sum(y|x) and Z(x) as follows:</Paragraph>
    <Paragraph position="15"> ) correspond to sum[i,y] and z[i] in Figure 1, respectively. Assume the selected feature set is S, and f is currently being considered. The goal of each selection stage is to select the feature f that maximizes the gain of the log likelihood, where the a and gain of f are derived through following steps: Let the log likelihood of the model be  With the approximation assumption in Berger et al (1996)'s paper, the un-normalized component and the normalization factor of the model have the following recursive forms:</Paragraph>
    <Paragraph position="17"> The maximum approximate gain and its corresponding a are represented as:</Paragraph>
  </Section>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3 A Fast Feature Selection Algorithm
</SectionTitle>
    <Paragraph position="0"> The inefficiency of the IFS algorithm is due to the following reasons. The algorithm considers all the candidate features before selecting one from them, and it has to re-compute the gains for every feature at each selection stage. In addition, to compute a parameter using Newton's method is not always efficient. Therefore, the total computation for the whole selection processing can be very expensive.</Paragraph>
    <Paragraph position="1"> Let g(j, k) represent the gain due to the addition of feature f j to the active model at stage k. In our experiments, it is found even if D (i.e., the additional number of stages after stage k) is large, for most j, g(j, k+D) - g(j, k) is a negative number or at most a very small positive number. This leads us to use the g(j, k) to approximate the upper bound of g(j, k+D).</Paragraph>
    <Paragraph position="2"> The intuition behind our new algorithm is that when a new feature is added to a model, the gains for the other features before the addition and after the addition do not change much. When there are changes, their actual amounts will mostly be within a narrow range across different features from top ranked ones to the bottom ranked ones. Therefore, we only compute and compare the gains for the features from the top-ranked downward until we reach the one with the gain, based on the new model, that is bigger than the gains of the remaining features. With a few exceptions, the gains of the majority of the remaining features were computed based on the previous models.</Paragraph>
    <Paragraph position="3"> As in the IFS algorithm, we assume that the addition of a feature f only affects its weighting factor a. Because a uniform distribution is assumed as the prior in the initial stage, we may derive a closed-form formula for a(j, 0) and g(j, 0) as follows. null  where [?] denotes an empty set, p [?] is the uniform distribution. The other steps for computing the gains and selecting the features are given in Figure 2 as a pseudo code. Because we only compute gains for a small number of top-ranked features, we call this feature selection algorithm as Selective Gain Computation (SGC) Algorithm.</Paragraph>
    <Paragraph position="4"> In the algorithm, we use array g to keep the sorted gains and their corresponding feature indices. In practice, we use a binary search tree to maintain the order of the array.</Paragraph>
    <Paragraph position="5"> The key difference between the IFS algorithm and the SGC algorithm is that we do not evaluate all the features for the active model at every stage (one stage corresponds to the selection of a single feature). Initially, the feature candidates are ordered based on their gains computed on the uniform distribution. The feature with the largest gain gets selected, and it forms the model for the next stage. In the next stage, the gain of the top feature in the ordered list is computed based on the model just formed in the previous stage. This gain is compared with the gains of the rest features in the list. If this newly computed gain is still the largest, this feature is added to form the model at stage 3. If the gain is not the largest, it is inserted in the ordered list so that the order is maintained. In this case, the gain of the next top-ranked feature in the ordered list is re-computed using the model at the current stage, i.e., stage 2.</Paragraph>
    <Paragraph position="6"> This process continues until the gain of the top-ranked feature computed under the current model is still the largest gain in the ordered list. Then, the model for the next stage is created with the addition of this newly selected feature. The whole feature selection process stops either when the number of the selected features reaches a pre-defined value in the input, or when the gains become too small to be useful to the model.</Paragraph>
    <Paragraph position="8"> 3. if termination condition is met, then stop 4. Model adjustment: for instance i such that there is y and f</Paragraph>
    <Paragraph position="10"> In addition to this basic version of the SGC algorithm, at each stage, we may also re-compute additional gains based on the current model for a pre-defined number of features listed right after feature f * (obtained in step 2) in the ordered list. This is to make sure that the selected feature f</Paragraph>
    <Paragraph position="12"> indeed the feature with the highest gain within the pre-defined look-ahead distance. We call this variant the look-ahead version of the SGC algorithm.</Paragraph>
  </Section>
  <Section position="5" start_page="2" end_page="10" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> A number of experiments have been conducted to verify the rationale behind the algorithm. In particular, we would like to have a good understanding of the quality of the selected features using the SGC algorithm, as well as the amount of speedups, in comparison with the IFS algorithm.</Paragraph>
    <Paragraph position="1"> The first sets of experiments use a dataset {(x, y)}, derived from the Penn Treebank, where x is a 10 dimension vector including word, POS tag and grammatical relation tag information from two adjacent regions, and y is the grammatical relation tag between the two regions. Examples of the grammatical relation tags are subject and object with either the right region or the left region as the head. The total number of different grammatical tags, i.e., the size of the output space, is 86. There are a little more than 600,000 training instances generated from section 02-22 of WSJ in Penn Treebank, and the test corpus is generated from section 23.</Paragraph>
    <Paragraph position="2"> In our experiments, the feature space is partitioned into sub-spaces, called feature templates, where only certain dimensions are included. Considering all the possible combinations in the 10dimensional space would lead to 2  feature templates. To perform a feasible and fair comparison, we use linguistic knowledge to filter out implausible subspaces so that only 24 feature templates are actually used. With this amount of feature templates, we get more than 1,900,000 candidate features from the training data. To speed up the experiments, which is necessary for the IFS algorithm, we use a cutoff of 5 to reduce the feature space down to 191,098 features. On average, each candidate feature covers about 485 instances, which accounts for 0.083% over the whole training instance set and is computed through:  The first experiment is to compare the speed of the IFS algorithm with that of SGC algorithm.</Paragraph>
    <Paragraph position="3"> Theoretically speaking, the IFS algorithm computes the gains for all the features at every stage. This means that it requires O(NF) time to select a feature subset of size N from a candidate feature set of size F. On the other hand, the SGC algorithm considers much fewer features, only 24.1 features on average at each stage, when selecting a feature from the large feature space in this experiment.</Paragraph>
    <Paragraph position="4"> Figure 3 shows the average number of features computed at the selected points for the SGC algorithm, SGC with 500 look-ahead, as well as the IFS algorithm. The averaged number of features is taken over an interval from the initial stage to the current feature selection point, which is to smooth out the fluctuation of the numbers of features each selection stage considers. The second algorithm looks at an additional fixed number of features, 500 in this experiment, beyond the ones considered by the basic SGC algorithm. The last algorithm has a linear decreasing number of features to select, because the selected features will not be considered again. In Figure 3, the IFS algorithm stops after 1000 features are selected. This is because it takes too long for this algorithm to complete the entire selection process. The same thing happens in  SGC algorithm, in comparison with the IFS algorithm. null To see the actual amount of time taken by the SGC algorithms and the IFS algorithm with the currently available computing power, we use a Linux workstation with 1.6Ghz dual Xeon CPUs and 1 GB memory to run the two experiments simultaneously. As it can be expected, excluding the beginning common part of the code from the two algorithms, the speedup from using the SGC algorithm is many orders of magnitude, from more than 100 times to thousands, depending on the number of features selected. The results are shown in Fig- null comparison with the IFS algorithm.</Paragraph>
    <Paragraph position="5"> To verify the quality of the selected features using our SGC algorithm, we conduct four experiments: one uses all the features to build a conditional ME model, the second uses the IFS algorithm to select 1,000 features, the third uses our SGC algorithm, the fourth uses the SGC algorithm with 500 look-ahead, and the fifth takes the top n most frequent features in the training data. The precisions are computed on section 23 of the WSJ data set in Penn Treebank. The results are listed in Figure 5. Three factors can be learned from this figure. First, the three IFS and SGC algorithms perform similarly. Second, 3000 seems to be a dividing line: when the models include fewer than 3000 selected features, the IFS and SGC algorithms do not perform as well as the model with all the features; when the models include more than 3000 selected features, their performance significantly surpass the model with all the features. The inferior performance of the model with all the features at the right side of the chart is likely due to the data over-fitting problem. Third, the simple count cutoff algorithm significantly underperforms the other feature selection algorithms when feature subsets with no more than 10,000 features are considered.</Paragraph>
    <Paragraph position="6"> To further confirm the findings regarding precision, we conducted another experiment with Base NP recognition as the task. The experiment uses section 15-18 of WSJ as the training data, and section 20 as the test data. When we select 1,160 features from a simple feature space using our SGC algorithm, we obtain a precision/recall of 92.75%/93.25%. The best reported ME work on this task includes Koeling (2000) that has the precision/recall of 92.84%/93.18% with a cutoff of 5, and Zhou et al. (2003) has reached the performance of 93.04%/93.31% with cutoff of 7 and reached a performance of 92.46%/92.74% with 615 features using the IFS algorithm. While the results are not directly comparable due to different feature spaces used in the above experiments, our result is competitive to these best numbers. This shows that our new algorithm is both very effective in selecting high quality features and very efficient in performing the task.</Paragraph>
  </Section>
class="xml-element"></Paper>