<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-2030">
  <Title>Feature Selection for a Rich HPSG Grammar Using Decision Trees</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Characteristics of the Treebank and Features Used
</SectionTitle>
    <Paragraph position="0"/>
      <Paragraph position="0"> The Redwoods treebank (Oepen et al., 2002) is an under-construction treebank of sentences corresponding to a particular HPSG grammar, the LinGO ERG (Flickinger, 2000). The current preliminary version contains 10,000 sentences of spoken dialog material drawn from the Verbmobil project.</Paragraph>
      <Paragraph position="1"> The Redwoods treebank makes available the entire HPSG signs for sentence analyses, but we have used in our experiments only small subsets of this representation. These are (i) derivation trees composed of identifiers of lexical items and constructions used to build the analysis, and (ii) semantic dependency trees which encode semantic head-to-head relations. The Redwoods treebank provides deeper semantics expressed in the Minimum Recursion Semantics formalism (Copestake et al., 2001), but in the present experiments we have not explored this fully.</Paragraph>
      <Paragraph position="2"> The nodes in the derivation trees represent combining rule schemas of the HPSG grammar, and not  phrasal categories of the standard sort. The whole HPSG analyses can be recreated from the derivation trees, using the grammar. The preterminals of the derivation trees are lexical labels. These are much finer grained than Penn Treebank preterminals tags, and more akin to those used in Tree-Adjoining Grammar models (Bangalore and Joshi, 1999). There are a total of about 8;000 lexical labels occurring in the treebank. One might conjecture that a supertagging approach could go a long way toward parse disambiguation. However, an upper bound for such an approach for our corpus is below 55 percent parse selection accuracy, which is the accuracy of an oracle tagger that chooses at random among the parses having the correct tag sequence (Oepen et al., 2002).</Paragraph>
      <Paragraph position="3"> The semantic dependency trees are labelled with relations most of which correspond to words in the sentence. These labels provide some abstraction because some classes of words have the same semantic label -- for example all days of week are grouped in one class, as are all numbers.</Paragraph>
      <Paragraph position="4"> As an example the derivation tree for one analysis of the short sentence Let us see is shown in figure 1. The semantic dependency tree for the same sentence is: let_rel pron_rel see_understand_rel In addition to this information we have used the main part of speech information of the lexical head to annotate nodes in the derivation trees with labels like verb, noun, preposition, etc.</Paragraph>
      <Paragraph position="5"> Other information that we have not explored includes subcategorization information, lexical types (these are a rich set of about 500 syntactic types), individual features such as tense, aspect, gender, etc. Another resource is the type hierarchy which can be explored to form equivalence classes on which to base statistical estimation.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Models
3.1 Generative Models
</SectionTitle>
    <Paragraph position="0"> We learn generative models that assign probabilities to the derivation trees and the dependency trees. We train these models separately and in the final stage we combine them to yield a probability or score for an entire sentence analysis. We rank the possible analyses produced by the HPSG grammar in accordance with the estimated scores.</Paragraph>
    <Paragraph position="1"> We first describe learning such a generative model for derivation trees using a single decision tree and a set of available features. We will call the set of available features ff1;:::;fmg a history.</Paragraph>
    <Paragraph position="2"> We estimate the probability of a derivation tree as</Paragraph>
    <Paragraph position="4"> other words, the probability of the derivation tree is the product of the probabilities of the expansion of each node given its history of available features.</Paragraph>
    <Paragraph position="5"> Given a training corpus of derivation trees corresponding to preferred analyses of sentences we learn the distribution P(expansionjhistory) using decision trees. We used a standard decision tree learning algorithm where splits are determined based on gain ratio (Quinlan, 1993). We grew the trees fully and we calculated final expansion probabilities at the leaves by linear interpolation with estimates one level above. This is a similar, but more limited, strategy to the one used by Magerman (1995).</Paragraph>
    <Paragraph position="6"> The features over derivation trees which we made available to the learner are shown in Table 1. The node direction features indicate whether a node is a left child, a right child, or a single child. A number of ancestor features were added to the history. The grammar used, the LinGO ERG has rules which are maximally binary, and the complements and adjuncts of a head are collected through multiple rules. Moreover, it makes extensive use of unary rules for various kinds of &amp;quot;type changing&amp;quot; operations. A simple PCFG is reasonably effective to the extent that important dependencies are jointly expressed in a local tree, as is mostly the case for the much flatter representations used in the Penn Treebank. Here, this is not the case, and the inclusion of ancestor nodes in the history makes necessary information more often local in our models. Grandparent annotation was used previously by Charniak and Carroll  Similarly we learn generative models over semantic dependency trees. For these trees the expansion of a node is viewed as consisting of separate trials for each dependent. Any conditional dependencies among children of a node can be captured by expanding the history. The features used for the semantic dependency trees are shown in Table 2. This set of only 5 features for semantic trees makes the feature subset selection method less applicable since there is no obvious redundancy in the set. However the method still outperforms a single decision tree. The model for generation of semantic dependents to the left and right is as follows: First the left dependents are generated from right to left given the head, its parent, right sister, and the number of dependents to the left that have already been generated. After that, the right dependents are generated from left to right, given the head, its parent, left sister and number of dependents to the right that have already been generated. We also add stop symbols at the ends to the left and right. This model is very similar to the markovized rule models in Collins (1997). For example, the joint probability of the dependents of let_rel in the above example would be:</Paragraph>
    <Paragraph position="8"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Conditional Log Linear Models
</SectionTitle>
      <Paragraph position="0"> A conditional log linear model for estimating the probability of an HPSG analysis given a sentence has a set of features ff1;:::;fmg defined over analyses and a set of corresponding weights f,1;:::;,mg for them. In this work we have defined features over derivation trees and syntactic trees as described for the branching process models.</Paragraph>
      <Paragraph position="1"> For a sentence s with possible analyses t1;:::;tk, the conditional probability for analysis ti is given by:</Paragraph>
      <Paragraph position="3"> As in Johnson et al. (1999) we trained the model by maximizing the conditional likelihood of the preferred analyses and using a Gaussian prior for smoothing (Chen and Rosenfeld, 1999). We used conjugate gradient for optimization.</Paragraph>
      <Paragraph position="4"> Given an ensemble of decision trees estimating probabilities P(expansionjhistory) we define features for a corresponding log linear model as follows: For each path from the root to a leaf in any of the decision trees, and for each possible expansion for that path that was seen in the training set, we add a feature feh(t). For a tree t, this feature has as value the number of time the expansion e occurred in t with the history h.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We present experimental results comparing the parse ranking performance of different models. The accuracy results are averaged over a ten-fold cross-validation on the data set summarized in Table 3.</Paragraph>
    <Paragraph position="1"> The sentences in this data set have exactly one preferred parse selected by a human annotator. At this early stage, the treebank is expected to be noisy because all annotation was done by a single annotator.</Paragraph>
    <Paragraph position="2"> Accuracy results denote the percentage of test sentences for which the highest ranked analysis was the correct one. This measure scores whole sentence accuracy and is therefore more strict than the labelled precision/recall measures and more appropriate for the task of parse ranking. When a model ranks a set of m parses highest with equal scores and one of those parses is the preferred parse in the treebank, we compute the accuracy on this sentence as 1=m.</Paragraph>
    <Paragraph position="3"> To give an idea about the difficulty of the task on the corpus we have used, we also show a baseline which is the expected accuracy of choosing a parse sentences length lex ambig struct ambig 5277 7.0 4.1 7.3  els: single decision tree compared to simpler models null at random and accuracy results from simpler models that have been used broadly in NLP. PCFG-S is a simple PCFG model where we only have the node label (feature 0) in the history, and PCFG-GP has only the node and its parent's labels (features 0 and 1) as in PCFG grammars with grandparent annotation. null Table 4 shows the accuracy of parse selection of the three simple models mentioned above defined over derivation trees and the accuracy achieved by a single decision tree (PCFG-DTAll) using all features in Table 1. The third column contains accuracy results for log linear models using the same features. We can note from Table 4 that the generative models greatly benefit from the addition of more conditioning information, while the log linear model performs very well even with only simple rule features, and its accuracy does not increase so sharply with the addition of more complex features.</Paragraph>
    <Paragraph position="4"> The error reduction from PCFG-S to PCFG-DTAll is 25.36%, while the corresponding error reduction for the log linear model is 12%. The error reduction for the log linear model from PCFG-GP to PCFG-DTAll is very small which suggests an overfitting effect. PCFG-S is doing much worse than the log linear model with the same features, and this is true for the training data as well as for the test data. A partial explanation for this is the fact that PCFG-S tries to maximize the likelihood of the correct parses under strong independence assumptions, whereas the log linear model need only worry about making the correct parses more probable than the incorrect ones. Next we show results comparing the single deci- null sion tree model (PCFG-DTAll) to an ensemble of 11 decision trees based on different feature subspaces.</Paragraph>
    <Paragraph position="5"> The decision trees in the ensemble are used to rank the possible parses of a sentence individually and then their votes are combined using a simple majority vote. The sets of features in each decision tree are obtained by removing two features from the whole space. The left preterminal features (features with numbers 7 and 8) participate in only one decision tree. Also, features 2, 3, and 5 participate in all decision trees since they have very few possible values and should not partition the space too quickly. The feature space of each of the 10 decision trees not containing the left preterminal features was formed by removing two of the features from among those with numbers {0, 1, 4, 6, and 9} from the initial feature space (minus features 7 and 8). This method for constructing feature subspaces is heuristic, but is based on the intuition of removing the features that have the largest numbers of possible values.1 Table 5 shows the accuracy results for models based on derivation trees, semantic dependency trees, and a combined model. The first row shows parse ranking accuracy using derivation trees of generative and log linear models over the same features. Results are shown for features selected by a a single decision tree, and an ensemble of 11 decision tree models based on different feature subspaces as described above. The relative improvement in accuracy of the log linear model from single to multiple decision trees is fairly small.</Paragraph>
    <Paragraph position="6"> The second row shows corresponding models for the semantic dependency trees. Since there are a small number of features used for this task, the performance gain from using feature subspaces is  ery combination of two features from the whole space of features 0-8 to obtain subspaces. This results in a large number of feature subspaces (36). The performance of this method was slightly worse than the result reported in Table 5 (78.52%). We preferred to work with an ensemble of 11 decision trees for computational reasons.</Paragraph>
    <Paragraph position="7"> not so large. It should be noted that there is a 90:9% upper bound on parse ranking accuracy using semantic trees only. This is because for many sentences there are several analyses with the same semantic dependency structure. Interestingly, for semantic trees the difference between the log linear and generative models is not so large. Finally, the last row shows the combination of models over derivation trees and semantic trees. The feature sub-space ensemble of 11 decision tree models for the derivation trees is combined with the ensemble of 5 feature subspace models over semantic dependencies to yield a larger ensemble that ranks possible sentence analyses based on weighted majority vote (with smaller weights for the semantic models). The improvement for PCFG models from combining the syntactic and semantic models is about 5:4% error reduction from the error rate of the better (syntactic) models. The corresponding log linear model contains all features from the syntactic and semantic decision trees in the ensemble. The error reduction due to the addition of semantics is 6:1% for the log linear model. Overall the gains from using semantic information are not as good as we expected. Further research remains to be done in this area.</Paragraph>
    <Paragraph position="8"> The results show that decision trees and ensembles of decision trees can be used to greatly improve the performance of generative models over derivation trees and dependency trees. The performance of generative models using a lot of conditioning information approaches the performance of log linear models although the latter remain clearly superior.</Paragraph>
    <Paragraph position="9"> The corresponding improvement in log linear models when adding more complex features is not as large as the improvement in generative models. On the other hand, there might be better ways to incorporate the information from additional history in log linear models.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Error Analysis
</SectionTitle>
    <Paragraph position="0"> It is interesting to see what the hard disambiguation decisions are, that the combined syntactic-semantic models can not at present get right.</Paragraph>
    <Paragraph position="1"> We analyzed some of the errors made by the best log linear model defined over derivation trees and semantic dependency trees. We selected for analysis sentences that the model got wrong on one of the training - test splits in the 10 fold cross-validation on the whole corpus. The error analysis suggests the following breakdown: + About 40% of errors are due to inconsistency or errors in annotation + About 15% of the errors are due to grammar limitations + About 45% of the errors are real errors and we could hope to get them right The inconsistency in annotation hurts the performance of the model both when in the training data some sentences were annotated incorrectly and the model tried to fit its parameters to explain them, and when in the test data the model chose the correct analysis but it was scored as incorrect because of incorrect annotation. It is not straightforward to detect inconsistencies in the training data by inspecting test data errors. Therefore the percentages we have reported are not exact.</Paragraph>
    <Paragraph position="2"> The log linear model seems to be more susceptible to errors in the training set annotation than the PCFG models, because it can easily adjust its parameters to fit the noise (causing overfitting), especially when given a large number of features. This might partly explain why the log linear model does not profit greatly over this data set from the addition of a large number of features.</Paragraph>
    <Paragraph position="3"> A significant portion of the real errors made by the model are PP attachment errors. Another class of errors come from parallel structures and long distance dependencies. For example, the model did not disambiguate correctly the sentence Is anywhere from two thirty to five on Thursday fine?, preferring the interpretation from [two thirty] to [five on Thursday] rather than what would be the more common meaning [from [two thirty] to [five]] [on Thursday]. This disambiguation decision seems to require common world knowledge or it might be addressable with addition of knowledge about parallel structures. ( (Johnson et al., 1999) add features measuring parallelism).</Paragraph>
    <Paragraph position="4"> We also compared the errors made by the best log linear model using only derivation tree features to the ones made by the combined model. The large majority of the errors made by the combined model were also made by the syntactic model. Examples of errors corrected with the help of semantic information include: The sentence How about on the twenty fourth Monday? (punctuation is not present in the corpus) was analyzed by the model based on derivation trees to refer to the Monday after twenty three Mondays from now, whereas the more common interpretation would be that the day being referred to is the twenty fourth day of the month, and it is also a Monday.</Paragraph>
    <Paragraph position="5"> There were several errors of this sort corrected by the dependency trees model.</Paragraph>
    <Paragraph position="6"> Another interesting error corrected by the semantic model was for the sentence: We will get a cab and go. The syntactic model chose the interpretation of that sentence in the sense: We will become a cab and go, which was overruled by the semantic model.</Paragraph>
  </Section>
class="xml-element"></Paper>