<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1071"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics A Progressive Feature Selection Algorithm for Ultra Large Feature Spaces</Title> <Section position="5" start_page="562" end_page="563" type="metho"> <SectionTitle> 3 Progressive Feature Selection Algorithm </SectionTitle> <Paragraph position="0"> In general, the more contextual information is used, the better a system performs. However, richer context can lead to a combinatorial explosion of the feature space. When the feature space is huge (e.g., on the order of tens of millions of features or more), the SGC algorithm exceeds the memory limits of commonly available computing platforms with gigabytes of memory.</Paragraph> <Paragraph position="1"> To address this limitation of the SGC algorithm, we propose a progressive feature selection (PFS) algorithm that selects features in multiple rounds. The main idea of the PFS algorithm is to split the feature space into tractable disjoint sub-spaces so that the SGC algorithm can be performed on each one of them. In the merge step, the features that SGC selects from the different sub-spaces are merged into groups. Instead of re-generating the feature-to-instance mapping table for each sub-space during splitting and merging, we create the new mapping table from the previous round's tables by collecting the entries that correspond to the selected features. Then, the SGC algorithm is performed on each of the feature groups and new features are selected from each of them. In other words, the feature space splitting and sub-space merging are performed mainly on the feature-to-instance mapping tables. This is the key step that makes the PFS algorithm so efficient.</Paragraph> <Paragraph position="2"> At the beginning of each round of feature selection, a uniform prior distribution is always assumed for the new CME model. 
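The round-by-round split/select/merge loop described above can be sketched as follows. This is a toy illustration under our own assumptions, not the authors' implementation: `sgc_select` merely stands in for the SGC selector with a placeholder gain function, and the real algorithm operates on feature-to-instance mapping tables rather than plain Python lists.

```python
# Minimal, self-contained sketch of the progressive feature selection (PFS)
# loop. The gain function and selection sizes are toy assumptions; the real
# algorithm runs SGC over a CME model and carries the feature-to-instance
# mapping tables across rounds instead of rebuilding them.

def sgc_select(subspace, k, gain):
    # Stand-in for SGC(): keep the k features with the highest estimated gain.
    return sorted(subspace, key=gain, reverse=True)[:k]

def pfs(features, gain, n_splits=4, rounds=2, budget=8):
    candidates = list(features)
    for _ in range(rounds):
        # Split the candidate pool into disjoint sub-spaces small enough for
        # SGC to handle, select within each, then merge the survivors.
        subspaces = [candidates[i::n_splits] for i in range(n_splits)]
        per_split = max(1, budget // n_splits)
        candidates = [f for s in subspaces
                      for f in sgc_select(s, per_split, gain)]
    return sorted(candidates, key=gain, reverse=True)[:budget]

# Toy usage: features are integers and the "gain" is the value itself.
best = pfs(range(100), gain=lambda f: f)
```

Because each round only ever hands SGC a tractable sub-space, the peak memory needed is bounded by the sub-space size rather than the full feature space.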
A more precise description of the PFS algorithm is given in Table 1. In Table 1, SGC() invokes the SGC algorithm, and Opt() optimizes feature weights. The functions split() and merge() are used to split and merge the feature space, respectively.</Paragraph> <Paragraph position="3"> Two variations of the split() function are investigated in this paper; they are described below: 1. random-split: randomly split a feature space into n disjoint subspaces, and select an equal number of features from each feature subspace.</Paragraph> <Paragraph position="4"> 2. dimension-based-split: split a feature space into disjoint subspaces based on feature dimensions/variables, and select the number of features for each feature subspace according to a certain distribution.</Paragraph> <Paragraph position="5"> We use a simple method for merge() in the experiments reported here, i.e., adding together the features from the set of selected feature subspaces. One may imagine other variations of the split() function, such as allowing overlapping subspaces. Other alternatives for merge() are also possible, such as randomly grouping the selected feature subspaces in the dimension-based split. Due to space limitations, they are not discussed here.</Paragraph> <Paragraph position="6"> This approach can in principle be applied to other machine learning algorithms as well.</Paragraph> </Section> <Section position="6" start_page="563" end_page="566" type="metho"> <SectionTitle> 4 Experiments with PFS for Edit Region Identification </SectionTitle> <Paragraph position="0"> In this section, we demonstrate the benefits of the PFS algorithm for identifying edit regions.</Paragraph> <Paragraph position="1"> The main reason we use this task is that edit region detection uses features from several levels, including prosodic, lexical, and syntactic ones. 
It is a big challenge to find a set of good features in such a huge feature space.</Paragraph> <Paragraph position="2"> First, we present the additional features that the PFS algorithm allows us to include.</Paragraph> <Paragraph position="3"> Then, we briefly introduce the variant of the Switchboard corpus used in the experiments. Finally, we compare results from two variants of the PFS algorithm.</Paragraph> <Section position="1" start_page="563" end_page="563" type="sub_section"> <SectionTitle> 4.1 Edit Region Identification Task </SectionTitle> <Paragraph position="0"> In spoken utterances, disfluencies, such as self-editing, pauses, and repairs, are common phenomena. Charniak and Johnson (2001) and Kahn et al. (2005) have shown that improved edit region identification leads to better parsing accuracy - they observe a relative reduction in parsing f-score error of 14% (2% absolute) between automatic and oracle edit removal.</Paragraph> <Paragraph position="1"> The focus of our work is to show that our new PFS algorithm enables the exploration of much larger feature spaces for edit identification - including prosodic features, their confidence scores, and various feature combinations - and that it consequently further improves edit region identification. Memory limitations prevent us from including all of these features in experiments using the boosting method described in Johnson and Charniak (2004) and Zhang and Weng (2005). For the same reason, we could not use the new features with the SGC algorithm either.</Paragraph> <Paragraph position="2"> The features used here are grouped according to variables, which define feature sub-spaces as in Charniak and Johnson (2001) and Zhang and Weng (2005). In this work, we use a total of 62 variables, which include 16 variables from Charniak and Johnson (2001) and Johnson and Charniak (2004), an additional 29 variables from Zhang and Weng (2005), 11 hierarchical POS tag variables, and 8 prosody variables (labels and their confidence scores). 
Furthermore, we explore 377 combinations of these 62 variables, which include 40 combinations from Zhang and Weng (2005). The complete list of the variables is given in Table 2, and the combinations used in the experiments are given in Table 3. One additional note is that some features are obtained after the rough copy procedure is performed; we used the same procedure as Zhang and Weng (2005). For a fair comparison with the work by Kahn et al. (2005), word fragment information is retained.</Paragraph> </Section> <Section position="2" start_page="563" end_page="564" type="sub_section"> <SectionTitle> 4.2 The Re-segmented Switchboard Data </SectionTitle> <Paragraph position="0"> In order to include prosodic features and to be able to compare with the state-of-the-art, we use the University of Washington re-segmented Switchboard corpus, described in Kahn et al. (2005).</Paragraph> <Paragraph position="1"> In this corpus, the Switchboard sentences were segmented into V5-style sentence-like units (SUs) (LDC, 2004). The resulting sentences fit more closely with the boundaries that can be detected through automatic procedures (e.g., Liu et al., 2005). Because edit region identification results on the original Switchboard are not directly comparable with results on the newly segmented data, the state-of-the-art results reported by Charniak and Johnson (2001) and Johnson and Charniak (2004) were reproduced on this new corpus by Kahn et al. (2005).</Paragraph> <Paragraph position="2"> The re-segmented UW Switchboard corpus is labeled with a simplified subset of the ToBI prosodic system (Ostendorf et al., 2001). 
The three simplified labels in the subset are p, 1 and 4, where p refers to a general class of disfluent boundaries (e.g., word fragments, abruptly shortened words, and hesitations); 4 refers to break level 4, which describes a boundary that has a boundary tone and phrase-final lengthening; and 1 is used to cover the break index levels 0, 1, 2, and 3.</Paragraph> <Paragraph position="4"> Among the original 18 variables, two variables are not used in our experiments, because they are mostly covered by the other variables. Partial word flags only contribute to 3 features in the final selected feature list.</Paragraph> <Paragraph position="5"> Since the majority of the corpus is labeled via automatic methods, the f-scores for the prosodic labels are not high. In particular, 4 and p have f-scores of about 70% and 60%, respectively (Wong et al., 2005). Therefore, in our experiments, we also take prosody confidence scores into consideration.</Paragraph> <Paragraph position="6"> Besides the symbolic prosody labels, the corpus preserves the majority of the previously annotated syntactic information as well as the edit region labels.</Paragraph> <Paragraph position="7"> In the following experiments, to make the results comparable, the same data subsets described in Kahn et al. (2005) are used for training, development, and testing.</Paragraph> </Section> <Section position="3" start_page="564" end_page="566" type="sub_section"> <SectionTitle> 4.3 Experiments </SectionTitle> <Paragraph position="0"> The best result on the UW Switchboard for edit region identification uses a TAG-based approach (Kahn et al., 2005). On the original Switchboard corpus, Zhang and Weng (2005) reported nearly 20% better results using the boosting method with a much larger feature space. 
To allow comparison with the best past results, we create a new CME baseline with the same set of features as that used in Zhang and Weng (2005).</Paragraph> <Paragraph position="1"> We design a number of experiments to test the following hypotheses: 1. PFS can include a huge number of new features, which leads to an overall performance improvement.</Paragraph> <Paragraph position="2"> 2. Richer context, represented by the combinations of different variables, has a positive impact on performance.</Paragraph> <Paragraph position="3"> 3. When the same feature space is used, PFS performs as well as the original SGC algorithm.</Paragraph> <Paragraph position="4"> The new models from the PFS algorithm are trained on the training data and tuned on the development data. The results of our experiments on the test data are summarized in Table 4. The first three lines show that the TAG-based approach is outperformed by the new CME baseline (line 3) using all the features in Zhang and Weng (2005).</Paragraph> <Paragraph position="5"> However, the f-score from CME is significantly smaller than the reported results using the boosting method. In other words, using CME instead of boosting incurs a performance hit. The improvement from PFS is not applied to the boosting algorithm at this time because doing so would require significant changes to the available algorithm.</Paragraph> <Paragraph position="6"> The next four lines in Table 4 show that additional combinations of the feature variables used in Zhang and Weng (2005) give an absolute improvement of more than 1%. This improvement is realized through increasing the search space to more than 20 million features, 8 times the maximum size that the original boosting and CME algorithms are able to handle.</Paragraph> <Paragraph position="7"> Table 4 shows that prosody labels alone make no difference in performance. Instead, for each position in the sentence, we compute the entropy of the distribution of the labels' confidence scores. 
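This confidence-entropy feature can be sketched as follows. The code is a hypothetical re-implementation under our own naming: the entropy of the label confidence distribution at a position is normalized by the entropy of the uniform distribution so the score lies in [0, 1], and the bucketing mirrors the "cut2" setting described in the text (10 equal buckets, scores below 0.2 ignored).

```python
import math

def entropy(p):
    # Shannon entropy (natural log) of a probability distribution.
    return -sum(x * math.log(x) for x in p if x > 0)

def entropy_score(confidences):
    # Normalize the confidence values into a distribution, then compare its
    # entropy against that of the uniform distribution over the same labels.
    total = sum(confidences)
    p = [c / total for c in confidences]
    h_uniform = math.log(len(p))
    if h_uniform == 0.0:  # a single label carries no uncertainty
        return 1.0
    return 1.0 - entropy(p) / h_uniform

def bucketize(score, n_buckets=10, cutoff=0.2):
    # "cut2"-style discretization: divide [0, 1] into 10 equal buckets
    # and ignore any score below the 0.2 cutoff.
    if score < cutoff:
        return None
    return min(int(score * n_buckets), n_buckets - 1)
```

A peaked distribution (a confident labeling) yields a score near 1, while a uniform distribution yields 0, so high scores flag positions where the prosodic label is trustworthy.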
We normalize the entropy to the range [0, 1] according to the formula below: score = 1 - H(p) / H(Uniform) (4) Including this feature does result in a good improvement. In the table, cut2 means that we equally divide the feature scores into 10 buckets and any score below 0.2 is ignored. The total contribution from the combined feature variables leads to a 1.9% absolute improvement. This confirms the first two hypotheses.</Paragraph> <Paragraph position="8"> When Gaussian smoothing (Chen and Rosenfeld, 1999), labeled as +Gau, and post-processing (Zhang and Weng, 2005), labeled as +post, are added, we observe a 17.66% relative improvement (or 3.85% absolute) over the previous best f-score of 78.2 from Kahn et al. (2005). To test hypothesis 3, we are constrained to the feature spaces that both the PFS and SGC algorithms can process. Therefore, we take all the variables from Zhang and Weng (2005) as the feature space for these experiments. The results are listed in Table 5. We observed no f-score degradation with PFS. Surprisingly, the total amount of time PFS spends on selecting its best features is smaller than the time SGC uses in selecting its best features. This confirms our hypothesis 3. The last set of experiments for edit identification is designed to find out what split strategies the PFS algorithm should adopt in order to obtain good results. Two different split strategies are tested here. In all the experiments reported so far, we use 10 random splits, i.e., all the features are randomly assigned to 10 subsets of equal size. We may also envision a split strategy that divides the features based on feature variables (or dimensions), such as word-based, tag-based, etc. 
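The two split strategies, together with the simple merge() step, can be sketched as follows. This is illustrative code under our own assumptions; the function names, the seed handling, and the dimension-lookup callback are ours, not the paper's.

```python
import random

def random_split(features, n, seed=0):
    # random-split: shuffle, then deal the features into n disjoint
    # subsets of (nearly) equal size.
    pool = list(features)
    random.Random(seed).shuffle(pool)
    return [pool[i::n] for i in range(n)]

def dimension_split(features, dimension_of):
    # dimension-based-split: group features by their variable/dimension
    # (e.g., word-based, tag-based, rough-copy-based, prosody-based).
    groups = {}
    for f in features:
        groups.setdefault(dimension_of(f), []).append(f)
    return list(groups.values())

def merge(selected_subspaces):
    # merge(): simply add together the features selected from each subspace.
    return [f for subspace in selected_subspaces for f in subspace]
```

With random-split, each subset receives the same feature budget; with dimension-based-split, the budget per subset can instead follow a chosen distribution over the dimensions.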
The four dimensions used in the experiments are listed as the top categories in Tables 2 and 3, and the results are given in Table 6.</Paragraph> <Paragraph position="9"> In Table 6, the first two columns show the criteria for splitting the feature spaces and the number of features to be allocated to each group. Random and Dimension mean random-split and dimension-based-split, respectively. When the criterion is Random, the features are allocated to different groups randomly, and each group gets the same number of features. In the case of the dimension-based split, we determine the number of features allocated to each dimension in two ways. When the split is Uniform, the same number of features is allocated to each dimension. When the split is Prior, the number of features to be allocated to each dimension is determined in proportion to the importance of each dimension. To determine the importance, we use the distribution of the selected features from each dimension in the model &quot;+ HTag + HTagComb + WTComb + RCComb + PComb: cut2&quot;, namely: Word-based 15%, Tag-based 70%, RoughCopy-based 7.5%, and Prosody-based 7.5%. From the results, we can see no significant difference between the random-split and the dimension-based-split.</Paragraph> <Paragraph position="10"> To see whether the improvements translate into parsing results, we have conducted one more set of experiments on the UW Switchboard corpus. We apply the latest version of Charniak's parser (2005-08-16) and the same procedure as Charniak and Johnson (2001) and Kahn et al. (2005) to the output from our best edit detector in this paper.</Paragraph> <Paragraph position="11"> To make it more comparable with the results in Kahn et al. (2005), we repeat the same experiment with the gold edits, using the latest parser. Both results are listed in Table 7. 
The difference between our best detector and the gold edits in parsing (1.51%) is smaller than the difference between the TAG-based detector and the gold edits (1.9%). In other words, if we use the gold edits as the upper bound, we see a relative error reduction of 20.5%.</Paragraph> <Paragraph position="12"> It is a bit of a cheat to use the distribution from the selected model. However, even with this distribution, we do not see any improvement over the version with random-split.</Paragraph> </Section> </Section> </Paper>