<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3211"> <Title>Mixing Weak Learners in Semantic Parsing</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The Data </SectionTitle> <Paragraph position="0"> The classifiers were trained on data derived from the PropBank corpus (Kingsbury et al., 2002). The same observations and features are used as described by (Pradhan et al., 2003). They acquired the original data from the July 15, 2002 release of PropBank, which the University of Pennsylvania created by manually labeling the constituents of the Penn TreeBank gold-standard parses (Marcus et al., 1994). Predicate usages (at present, strictly verbs) are hand annotated with 22 possible semantic roles plus the null role to indicate grammatical constituents that are not arguments of the predicate. The argument labels can have different meanings depending on their target predicate, but the annotation method attempted to assign consistent meanings to labels, especially when associated with similar verbs. There are seven core roles or arguments, labeled ARG0-5 and ARG9. ARG0 usually corresponds to the semantic agent and ARG1 to the entity most affected by the action. In addition to the core arguments, there are 15 adjunctive arguments, such as ARGM-LOC which identifies locatives. Thus our previous example, &quot;She bought the vase in Egypt&quot;, would be parsed as shown in example 2. Figure 1 shows the associated syntactic parse without the parts of speech.</Paragraph> <Paragraph position="1"> (2) [Arg0 She] [P bought] [Arg1 the vase] [ArgM-Loc in Egypt] Development tuning is based on PropBank section 00 and final results are reported for section 23. We trained and tested on the same subset of observations as did Pradhan et al. (2003). They indicated that a small number of sentences (less than 1%) were discarded due to manual tagging errors in the original PropBank labeling process, (e.g., an empty role tag). This one percent reduction applies to all sections of the corpus (training, development and test). They removed an additional 2% of the training data due to issues involving the named entity tagger splitting corpus tokens into multiple words.</Paragraph> <Paragraph position="2"> However, where these issues occurred in tagging the section 23 test sentences, they were manually corrected. The size of the dataset is shown in Table 1.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Algorithm </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Random Forests </SectionTitle> <Paragraph position="0"> Breiman (2001) defines a random forest as &quot;a classifier consisting of a collection of tree structured classifiers {h(x,Thk),k=1,...} where the {Thk} are independently identically distributed random [training] vectors and each tree casts a unit vote for icates, and labeled arguments in thousands the most popular class at input x.&quot; Thus Bagging (Breiman, 1996) is a form of Random Forest, where each tree is grown based on the selection, with replacement, of N random training examples, where N is the number of total examples in the training set.</Paragraph> <Paragraph position="1"> Breiman (2001) describes two new subclasses of Random Forests, Forest-RI and Forest-RC. 
In each, he combines Bagging, using the CART methodology to create trees, with random feature selection (Amit and Geman, 1997) at each node in the tree.</Paragraph> <Paragraph position="2"> That is, at each node he selects a different random subset of the input features and considers only these in establishing the decision at that node.</Paragraph> <Paragraph position="3"> The key idea behind Random Forests is that by injecting randomness into the individual trees via random feature selection, the correlation between their classification results is minimized. A lower correlation, combined with reasonably good classification accuracy for individual trees, leads to a much higher accuracy for the composite forest. In fact, Breiman shows that a theoretical upper bound can be established for the generalization error in terms of the strength of the forest, s, and the mean value of the classification correlation between individual trees, r̄.</Paragraph> <Paragraph position="4"> The strength, s, is the expected margin over the input space, where the margin of an ensemble classifier is defined as the difference between the fraction of the ensemble members that vote for the correct class and the fraction voting for the most popular alternative class. See (Breiman, 2001) for a detailed description of s and r̄ and how they are calculated.</Paragraph> <Paragraph position="5"> The upper bound on the generalization error is given by the following equation: </Paragraph> <Paragraph position="6"> PE* ≤ r̄(1 - s^2)/s^2 (1) </Paragraph> <Paragraph position="7"> Breiman found that Forest-RI and Forest-RC compare favorably to AdaBoost in general, are far less sensitive to noise in the training data, and can learn well using weak inputs.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Feature Issues </SectionTitle> <Paragraph position="0"> Before describing the variant of Random Forests we use here, it is helpful to discuss a couple of important issues related to the input features. In the experiments here, the true input features to the algorithm are all categorical. Breiman's approach to handling categorical inputs is as follows. He modifies their selection probability such that they are V - 1 times as likely as a numeric input to be selected for evaluation at each node, where V is the number of values the categorical feature can take. Then, when a categorical input is selected, he randomly chooses a subset of the category values and converts the input into a binary-valued feature whose value is one if the training observation's corresponding input value is in the chosen subset and zero otherwise.</Paragraph> <Paragraph position="1"> In many machine learning approaches, a categorical feature having V different values would be converted to V (or V - 1) separate binary-valued features (e.g., this is the case with SVMs). Here, we process them as categorical features, but conceptually think of them as separate binary-valued features.
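Breiman's treatment of a categorical input, as just described, can be sketched as follows (Python; the subset size and the toy values are assumptions chosen for illustration, not the authors' code):

    import random

    def compose_binary_feature(category_values, subset_size):
        """Pick a random subset of a categorical input's values; the composed
        feature is 1 if an observation's value is in the subset, 0 otherwise."""
        chosen = set(random.sample(sorted(category_values), subset_size))
        return lambda value: 1 if value in chosen else 0

    # e.g., a composed feature over a toy HEAD WORD input
    in_subset = compose_binary_feature({"vase", "Egypt", "bought", "with"}, 2)
    print(in_subset("vase"), in_subset("with"))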
In an attempt to minimize confusion, we will refer to the categorical input features simply as inputs or as input features, the equivalent set of binary-valued features as the binary-valued features, and the features that are randomly composed in the tree building process (via random category value subset selection) as composed features.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Algorithm Description </SectionTitle> <Paragraph position="0"> Take any tree building algorithm (e.g., C5.0 (Quinlan, 2002)) and modify it such that, instead of examining all of the input features at each node, it considers only a random subset of those features. Construct a large number of trees using all of the training data (we build 128 trees in each experiment). Finally, allow the trees to individually cast unit votes for each test observation. The majority vote determines the classification, and ties are broken in favor of the class that occurs most frequently in the training set.</Paragraph> <Paragraph position="1"> Our implementation is most similar to Forest-RI, but has several differences, some significant. These differences involve not using Bagging, the use of a single forest rather than two competing forests, the assumed size of V̂i (the number of relevant values for input i), the probability of selecting individual inputs, how composed features are created, and the underlying tree building algorithm. We delineate each of these differences in the following paragraphs.</Paragraph> <Paragraph position="2"> Forest-RI combines random feature selection with Bagging. Surprisingly, we found that, in our experiments, the use of Bagging actually hurt the classification accuracy of the forests, so we removed it from the algorithm. This means that we use all training observations to construct each tree in the forest. This is somewhat counter-intuitive, given that it should increase correlation in the outputs of the trees. However, the strength of the forest is based in part on the accuracy of its trees, which will increase when utilizing more training data. We also hypothesize that, given the feature sets here, the correlation isn't affected significantly by the removal of Bagging. The reason for this is the massive number of binary-valued features in the problem (577,710 in just the baseline feature set). Given this fact, using random feature selection alone might result in substantially uncorrelated trees. As seen in equation 1 and shown empirically in (Breiman, 2001), the lack of correlation produced by random feature selection directly improves the error bound.</Paragraph> <Paragraph position="3"> Forest-RI involves growing two forests and selecting the one most likely to provide the best results. These two forests are constructed using different values for F, the number of random features evaluated at each node. The choice of which forest is more likely to provide the best results is based on estimates using the observations not included in the training data (the out-of-bag observations). Since we did not use Bagging, all of our observations are used in the training of each tree and we could not take this approach.
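The forest-level procedure described at the start of this subsection can be summarized in a short sketch, assuming a build_tree routine that performs the per-node random feature selection; the names are illustrative, not the authors' implementation:

    from collections import Counter

    def grow_forest(observations, labels, build_tree, n_trees=128):
        """Grow every tree on all of the training data (no Bagging); the only
        randomness is the per-node random feature selection inside build_tree."""
        return [build_tree(observations, labels) for _ in range(n_trees)]

    def forest_classify(forest, x, training_class_counts):
        votes = Counter(tree.classify(x) for tree in forest)
        top = max(votes.values())
        tied = [cls for cls, v in votes.items() if v == top]
        # ties are broken in favor of the class most frequent in the training set
        return max(tied, key=lambda cls: training_class_counts[cls])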
Additionally, it is not clear that this provided better results in (Breiman, 2001), and preliminary experiments (not reported here) suggest that it might be more effective to simply find a good value for F.</Paragraph> <Paragraph position="4"> To create composed features, we randomly select a number of the input's category values, C, given by the following equation: </Paragraph> <Paragraph position="5"> C = ... (2) </Paragraph> <Paragraph position="6"> where V̂ is the number of category values still potentially relevant. Random category value selection is consistent with Breiman's work, as noted in section 3.2. This random selection method should act to further reduce the correlation between trees, and Breiman notes that it gets around the problem caused by categorical inputs with large numbers of values. However, he leaves the number of values chosen unspecified. There is also no indication of what to do as the categorical input becomes more sparse near the leaves of the tree (e.g., if the algorithm sends every constituent whose head word is in a set Ph down the right branch of the node, what effect does this have on future random value selection in each branch). This is the role of V̂ in the above equation.</Paragraph> <Paragraph position="7"> A value is potentially relevant if it is not known to have been effectively removed by a previous decision. The decision at a given node typically sends all of the observations whose input is in the selected category value subset down one branch, and the remaining observations are sent down the other (boolean compositions would result in exceptions).</Paragraph> <Paragraph position="8"> The list of relevant category values for a given input is immediately updated when the decision has obvious consequences (e.g., the values in Ph are removed from the list of relevant values used by the left branch in the previous example, and the list for the right branch is set to Ph). However, a decision based on one input can also affect the remaining relevant category values of other inputs (e.g., suppose that at the node in our previous example, all prepositional phrase (PP) constituents had the head word &quot;with&quot;, and &quot;with&quot; was a member of Ph; then the phrase type PP would no longer be relevant to decisions in the left branch, since all associated observations were sent down the right branch). Rather than update all of these lists at each node (a computationally expensive proposition), we only determine the unique category values when there are fewer than 1000 observations left on the path, or when the number of observations has been cut to less than half what it was the last time unique values were determined. In early experimentation, this reduced the accuracy by about 0.4% relative to calculating the remaining category values after each decision. So, when speed is not important, one should update the lists after each decision.</Paragraph> <Paragraph position="9"> Breiman indicates that, when several of the inputs are categorical, in order to increase strength enough to obtain a good accuracy rate, the number of inputs evaluated at each node must be increased to two to three times ⌊1 + log2 M⌋ (where M is the number of inputs). It is not clear whether the input selection process is with or without replacement.</Paragraph> <Paragraph position="10"> Some of the inputs in the semantic parsing problem have five orders of magnitude more category values than others. Given this issue, if the selection is without replacement, it leads to evaluating features composed from each of our seven baseline inputs (figure 2) at each node.
This would likely increase correlation, since those inputs with a very small number of category values will almost always be the most informative near the root of the tree and would be consistently used for the uppermost decisions in the tree. On the other hand, if selection is with replacement, then using the Forest-RI method for calculating the input selection probability will result in those inputs with few category values almost never being chosen. For example, the baseline feature set has 577710 equivalent binary-valued features by the Forest-RI definition, including two true binary inputs. The probability of one of these two inputs not being chosen in a given random draw according to the Forest-RI method is 577709/577710 (see section 3.2 above). With M=7 inputs, generating 3⌊1 + log2 M⌋ = 9 random composed features results in these two binary inputs having a selection probability of 1 - (577709/577710)^9, or 0.000016.</Paragraph> <Paragraph position="11"> Our compromise is first to use C and V̂ from equation 2 to calculate a baseline number of composable features for each input i. This quantity is the total number of potentially relevant category values divided by the number used to create a composed feature: </Paragraph> <Paragraph position="12"> fi = V̂i / Ci (3) </Paragraph> <Paragraph position="13"> Second, given the large number of composable features fi, we also evaluate a larger number, F, of random features at each node in the tree: </Paragraph> <Paragraph position="14"> F = ... (4) </Paragraph> <Paragraph position="15"> where f is the sum of fi over all inputs. Finally, selection and feature composition are done with replacement. The final feature selection process has at least two significant effects we find positive. First, the number of composable features reflects the fact that several category values are considered simultaneously, effectively splitting on Ci binary-valued features. This has the effect of reducing the selection probability of many-valued inputs and increasing the probability of selecting inputs with fewer category values. Using the baseline feature set as an example, the probability of evaluating one of the binary-valued inputs at the root of the tree increases from 0.000016 to 0.0058. Second, as category values are used, they are periodically removed from the set under consideration, reducing the corresponding size of V̂i, and the input selection probabilities are then adjusted accordingly. This has the effect of continuously raising the selection probability for those inputs that have not yet been utilized.</Paragraph> <Paragraph position="16"> Finally, we use ID3 to grow trees rather than CART, which is the tree algorithm Forest-RI uses.</Paragraph> <Paragraph position="17"> We don't believe this should have any significant effect on the final results.
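The 0.000016 figure quoted above can be reproduced directly (the 0.0058 figure for the compromise scheme depends on equations 2-4 and is not re-derived here):

    # With 577,710 equivalent binary-valued features and 9 random composed
    # features drawn per node, the probability that a particular true binary
    # input is evaluated is:
    p = 1 - (577709 / 577710) ** 9
    print(f"{p:.6f}")   # 0.000016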
The choice of ID3 over CART was purely based on already having an implementation of ID3.</Paragraph> <Paragraph position="18"> From a set of possible split decisions, ID3 chooses the decision which leads to the minimum weighted average entropy among the training observations assigned to each branch, as determined by class labels (Quinlan, 1986; Mitchell, 1997).</Paragraph> <Paragraph position="19"> These algorithm enhancements are appropriate for any task with high dimensional categorical inputs, which includes many NLP applications.</Paragraph> <Paragraph position="20"> Figure 2: The baseline feature set.
PREDICATE: the lemma of the predicate whose arguments are to be classified - the infinitive form of marked verbs in the corpus
CONSTITUENT PHRASE TYPE: the syntactic type assigned to the constituent/argument being classified
HEAD WORD (HW): the head word of the target constituent
PARSE TREE PATH (PATH): the sequence of parse tree constituent labels from the argument to its predicate
POSITION: a binary value indicating whether the target argument precedes or follows its predicate
VOICE: a binary value indicating whether the predicate was used in an active or passive phrase
SUB-CATEGORIZATION: the parse tree expansion of the predicate's grandparent constituent - see (Gildea and Jurafsky, 2002) for details</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 The Experiments </SectionTitle> <Paragraph position="0"> Four experiments are reported: the first uses the baseline features of Gildea and Jurafsky (2002); the second is composed of features proposed by Pradhan et al. (2003) and Surdeanu et al. (2003); the third experiment evaluates a new feature set; and the final experiment addresses a method of reducing the feature space. The experiments all focus strictly on the classification task - given a syntactic constituent known to be an argument of a given predicate, decide which argument role is the appropriate one to assign to the constituent.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Experiment 1: Baseline Feature Set </SectionTitle> <Paragraph position="0"> The first experiment compares the random forest classifier to three other classifiers: a statistical Bayesian approach with backoff (Gildea and Palmer, 2002), a decision tree classifier (Surdeanu et al., 2003), and a Support Vector Machine (SVM) (Pradhan et al., 2003). The baseline feature set utilized in this experiment is described in Figure 2 (see (Gildea and Jurafsky, 2002) for details).</Paragraph> <Paragraph position="1"> Surdeanu et al. omit the SUB-CATEGORIZATION feature, but add a binary-valued feature that indicates the governing category of noun-phrase argument constituents.</Paragraph> <Paragraph position="2"> This feature takes on the value S or VP depending on which constituent type (sentence or verb phrase, respectively) eventually dominates the argument in the parse tree. This generally indicates grammatical subjects versus objects, respectively. They also used the predicate with its case and morphology intact, in addition to using its lemma. Surdeanu et al. indicate that, due to memory limitations on their hardware, they trained on only 75K of the PropBank argument constituents - about 60% of the annotated data.</Paragraph> <Paragraph position="3"> Table 2 shows the results of experiment 1, comparing the classifier accuracies as trained on the baseline feature set.
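The significance claims that follow rely on the difference-of-two-proportions test of Dietterich (1998) (and, for per-argument comparisons, McNemar's test). A minimal sketch of the former, assuming only the two overall accuracies and a shared test-set size n (not the authors' code):

    from math import sqrt

    def two_proportion_z(acc_a, acc_b, n):
        """z statistic for the difference between two classifiers' error
        proportions on the same n test observations (Dietterich, 1998)."""
        p_a, p_b = 1.0 - acc_a, 1.0 - acc_b     # error rates
        p_bar = (p_a + p_b) / 2.0               # pooled error estimate
        return (p_a - p_b) / sqrt(2.0 * p_bar * (1.0 - p_bar) / n)

    # |z| > 2.58 corresponds to a two-sided significance level of p = 0.01.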
Using a difference of two proportions test as described in (Dietterich, 1998), the accuracy differences are all statistically significant at p=0.01. The Random Forest approach outperforms the Bayesian method and the Decision Tree method. However, it does not perform as well as the SVM classifier. Interestingly, the classification accuracy of the first tree in the Random Forest, given in row four, is almost as high as that of the C5 decision trees (Quinlan, 2002) of Surdeanu et al.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Experiment 2: Extended Feature Set </SectionTitle> <Paragraph position="0"> The second experiment compares the random forest classifier to the boosted decision tree and the SVM using all of the features reported by Pradhan et al. The additional features used in this experiment are listed in Figure 3 (see sources for further details). In addition to the extra features noted in the previous experiment, Surdeanu et al. report on four more features, not included here: content word part of speech (CW PoS), CW named entity class, and two phrasal verb collocation features. (The CW PoS feature did not improve development results and was omitted.)</Paragraph> <Paragraph position="1"> Table 3 shows the results of experiment 2, comparing the classifier accuracies using the full feature sets reported in each source. Surdeanu et al. also applied boosting in this experiment and chose the outcome of the boosting iteration that performed best.</Paragraph> <Paragraph position="2"> Using the difference of two proportions test, the accuracy differences are all statistically significant at p=0.01. The Random Forest approach outperforms the Boosted Decision Tree method by 3.5%, but trails the SVM classifier by 2.3%. In analyzing the performance on individual argument classes using McNemar's test, Random Forest performs significantly better on ARG0 (p=0.001) than the SVM; the degrees of freedom prevent significance at p=0.1 for any other arguments, but the SVM appears to perform much better on ARG2 and ARG3.</Paragraph> <Paragraph position="3"> Figure 3: Additional features used in experiment 2.
NAMED ENTITIES: seven binary-valued features indicating whether specific named entities (PERSON, ORGANIZATION, DATE, TIME, MONEY, LOCATION, and PERCENT) occurred anywhere in the target constituent (Surdeanu et al., 2003)
HW POS: the grammatical part of speech of the target constituent's head word (Surdeanu et al., 2003)
CONTENT WORD (CW): &quot;lexicalized feature that selects an informative word from the constituent, different from the head word&quot; (Surdeanu et al., 2003)
VERB CLUSTER: a generalization of the verb predicate by clustering verbs into 64 classes (Pradhan et al., 2003)
HALF PATH: the sequence of parse tree constituent labels from the argument to the lowest common ancestor of the predicate (Pradhan et al., 2003)</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Experiment 3: New Features </SectionTitle> <Paragraph position="0"> We evaluated several new features and report on the most significant here, as described in figure 4. The results are reported in table 4. The accuracy improvements relative to the results from experiment 2 are all statistically significant at p=0.001 (McNemar's test is used for all significance tests in this section). Comparing the SVM results in experiment 2 to the best results here shows statistical significance. Due to space, we cannot report all experiments; contact the first author for more information.
The other features we evaluated involved: the phrase type of the parent constituent, the list of phrase types encompassing the sentence fragment between the target predicate and constituent, the prefix and suffix of the CW and HW, animacy, high frequency words preceding and following the predicate, and the morphological form of the predicate. All of these improved accuracy on the development set (some with statistical significance at p=0.01), but we suspect the development baseline was at a low point, since these features largely did not improve performance when combined with CW Base and GP.</Paragraph> <Paragraph position="1"> Figure 4: New features evaluated in experiment 3.
GOVERNING PREPOSITION (GP): if the constituent's parent is a PP, this is the associated preposition (e.g., in &quot;made of [Arg2 gallium arsenide]&quot;, this feature is 'of', since the Arg2-NP is governed by an 'of'-based PP)
CW BASE: starting with the CW, convert it to its singular form, remove any prefix, and convert digits to 'n' (e.g., this results in the following CW - CW Base mappings: accidents - accident, non-...)
In analyzing the effect on individual argument classes, seven argument classes have high χ2 values (ARG2-4, ARGM-DIS (discourse), ARGM-LOC (locative), ARGM-MNR (manner), and ARGM-TMP (temporal)), but given the large number of degrees of freedom, only ARGM-TMP is significant (p=0.05). Example section-00 sentence fragments in which the GP feature corrected the classification of an ARG2 role (the target predicate is marked P) include &quot;[P banned] to [everyday visitors]&quot;, &quot;[P considered] as [an additional risk for the investor]&quot;, and &quot;[P made] of [gallium arsenide]&quot;. Comparing the SVM results to the best results here, the Random Forest performs significantly better on Arg0 (p=0.001), and the SVM is significantly better on Arg1 (p=0.001). Again the degrees of freedom prevent significance at p=0.1, but the Random Forest outperforms the SVM with a fairly high χ2 value on ARG4, ARGM-DIS, ARGM-LOC, and ARGM-TMP.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Experiment 4: Dimensionality Reduction </SectionTitle> <Paragraph position="0"> We originally assumed we would be using binary-valued features with sparse matrices, much like in the SVM approach. Since many of the features have a very large number of values (e.g., the PATH feature has over 540k values), we sought ways to reduce the number of equivalent binary-valued features. This section reports on one of these methods, which should be of interest to others in resource-constrained environments.</Paragraph> <Paragraph position="1"> In this experiment, we preprocess the baseline inputs described in Figure 2 to reduce their number of category values. Specifically, for each original category value, vi ∈ V, we determine whether it occurs in observations associated with one or more than one semantic role label, R. If it is associated with more than one R, vi is left as is. When vi maps to only a single Rj, we replace vi with an arbitrary value, vk ∉ V, which is the same for all such v occurring strictly in association with Rj. The PATH input starts with 540732 original feature values and has only 1904 values after this process, while HEAD WORD is reduced from 33977 values to 13208 and PHRASE TYPE is reduced from 62 to 44 values.</Paragraph> <Paragraph position="2"> The process has no effect on the other baseline input features. The total reduction in equivalent binary-valued features is 97%.
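A sketch of the value-collapsing preprocessing just described, applied per input feature (the data layout and the surrogate naming are assumptions for illustration):

    from collections import defaultdict

    def collapse_single_role_values(value_role_pairs):
        """Map every category value seen with exactly one role label to a single
        surrogate value shared by all such values for that role; values observed
        with more than one role are left unchanged (training observations only)."""
        roles_seen = defaultdict(set)
        for value, role in value_role_pairs:
            roles_seen[value].add(role)
        mapping = {}
        for value, roles in roles_seen.items():
            if len(roles) == 1:
                (role,) = roles
                mapping[value] = "__ONLY__" + role   # hypothetical surrogate value
            else:
                mapping[value] = value
        return mapping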
We also test the effect of disregarding feature values during training if they occur only once in the training data. This has a more modest effect, reducing PATH to 156788 values and HEAD WORD to 29482 values, with no other reductions. The total reduction in equivalent binary-valued features is 67%.</Paragraph> <Paragraph position="3"> Training on the baseline feature set, the net effect of these two procedures was less than a 0.3% loss of accuracy on the development set. The McNemar test indicates this is not significant at p=0.1. In the end, our implementation used categorical features, rather than binary-valued features (e.g., rather than use 577710 binary-valued features to represent the baseline inputs, we use 7 features which might take on a large number of values - PATH has 540732 values). In this case, the method does not result in as significant a reduction in the memory requirements.</Paragraph> <Paragraph position="4"> While we did not use this feature reduction in any of the experiments reported previously, we see it as being very beneficial to others whose implementation may be more resource-constrained, particularly those using a binary-valued feature representation.</Paragraph> <Paragraph position="5"> The method also reduced training time by 17% and should lead to much larger reductions for implementations using binary-valued features. For example, the worst-case training time for SVMs is quadratic in the number of features, and this method reduced the dimensionality to 3% of its original size. Therefore, the method has the theoretical potential to reduce training time by up to 100(1 - 0.03^2) = 99.91%. While it is unlikely to approach this in practice, it should provide significant savings. This may be especially helpful during model selection or feature evaluation, after which one could revert to the full dimensionality for final training to improve classification accuracy. The slight decrement in accuracy may also be overcome by the ability to handle larger datasets.</Paragraph> </Section> </Section> </Paper>