<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1052">
  <Title>Decision Tree Parsing using a Hidden Derivation Model</Title>
  <Section position="2" start_page="272" end_page="273" type="metho">
    <SectionTitle>
2. A derivation history model
</SectionTitle>
    <Paragraph position="0"> Current treebanks are a collection of n-ary branching trees, with each node in a tree labeled by either a non-terminal label or a part-of-speech label (called a tag). Usually, grammarians elevate constituents to the status of elementary units in a parse, especially in the case of rewrite-rule grammars where each rewrite rule defines a legal constituent. However, if a parse tree is interpreted as a geometric pattern, a constituent is no more than a set of edges which meet at the same tree node. In Figure 1, the noun phrase,&amp;quot;N&amp;quot;, which spans the tags &amp;quot;AT VVC NN 1&amp;quot;, which correspond to an article, a command verb, and a singular noun, respectively, consists of an edge extending to the right from &amp;quot;AT,&amp;quot; an edge extending straight up from &amp;quot;VVC,&amp;quot; and an edge extending to the left from &amp;quot;NNI&amp;quot; (see Figure 1).</Paragraph>
    <Paragraph position="1"> We introduce a new definition for a derivation of a parse tree by using Figure 2 which gives a subtree used in our parser for representing the noun phrase &amp;quot;the Enter key&amp;quot;. We associate with every node in the parse tree two features, a name which is either a tag or a non-terminal label, and an extension which indicates whether the edge going to its parent is going right, left, up, or unary. Unary corresponds to a renaming of a nonterminal. By specifying the two features (name and extension) for each node we can reconstruct the parse tree. The order of the nodes in which we specify these two features defines the derivation order. We only consider bottom-up derivations.</Paragraph>
    <Paragraph position="2"> In a bottom-up derivation, a node is named first, it may be extended only after it's named, and it is not named until all of the nodes beneath it are extended. Naming a node maybe a tagging or labeling action depending on whether or not the node is a leaf in the parse tree.</Paragraph>
    <Paragraph position="3"> Using Figure 2, one derivation is to tag the first word &amp;quot;the&amp;quot; as &amp;quot;AT&amp;quot;, then to extend it &amp;quot;right&amp;quot;, then to tag the third word &amp;quot;key&amp;quot; as &amp;quot;NNI&amp;quot;, then to tag the second word &amp;quot;Enter&amp;quot; as &amp;quot;VVC&amp;quot; (command verb), then to extend the resulting node by a &amp;quot;unary&amp;quot;, then to label the resulting node as &amp;quot;Nn&amp;quot; (computer noun), then to extend the resulting node &amp;quot;up&amp;quot;, then to extend the &amp;quot;NNi&amp;quot; node by a &amp;quot;left&amp;quot; to yield a node that spans the whole phrase &amp;quot;the Enter key&amp;quot;. By our definition of bottom-up derivation, it's only at this point in the derivation that we can label the node that spans the whole phrase as &amp;quot;N&amp;quot;, and then extend it &amp;quot;left&amp;quot; as is implied in Figure 2. Using the node numbering scheme in Figure 2, we have at the beginning of this derivation the words with the nodes {2, 4, 5} that have unassigned names. These are the active nodes at this point.</Paragraph>
    <Paragraph position="4"> Suppose that node 2 is picked and then tagged &amp;quot;AT&amp;quot;. That corresponds to the derivation \[2\]; at this point, only nodes {2, 4, 5} are active. If we pick node 2 again, then an extension step is required and the derivation is \[22\]. The derivation presented at the beginning of this paragraph corresponds to the sequence of nodes \[2254433511\].</Paragraph>
    <Paragraph position="5"> To derive the tree in Figure I when we are given the three-tag sequence, there are 6 possible derivations. We could start by extending any of the 3 tags, then we have either of 2 choices to extend, and we extend the one remaining choice, then we name the resulting node. This leads to 3x2xl=6 derivations for that tree.</Paragraph>
    <Paragraph position="6"> If we use a window of 1, then only a single derivation is permitted and we call it the bottom-up leftmost derivation. In our example, this leftmost derivation would be \[224433551\].</Paragraph>
  </Section>
  <Section position="3" start_page="273" end_page="274" type="metho">
    <SectionTitle>
3. The Parsing Model
</SectionTitle>
    <Paragraph position="0"> We represent a derivation of a parse tree by the sequence of nodes as they are visited by the derivation, denoted by d.</Paragraph>
    <Paragraph position="1"> Denote by ~ the i-th node of the derivation d. Denote by ld, the nanm feature for a node selected at the i-th step in the derivation and by ed~ its extension. A parse derivation is constructed' by the following 2-step algorithm: * select which node to extend among active nodes using p( active = di \[context), * then either - assign a name to the selected node whether it is tagging or labelling a node (constituent) with a non-terminal label using p(la, \[ context), or - extend the selected node (which adds an edge to the parse graph) using p(ed, \[ contezt).</Paragraph>
    <Paragraph position="2"> If the node selected has its name identified then an extension step is performed otherwise a naming step is performed. Note that only extension steps change which nodes are active. We have a different probabilistic model for each type of step in a parse derivation. The probabilistic models do not use the whole derivation history as context; but rather a five node window around the node in question. We will discuss this in more detail later on.</Paragraph>
    <Paragraph position="3"> The probability of a derivation of a parse tree is the product of the probabilities of each of the feature value assignments in that derivation and the probability of each active node selection made in that derivation:</Paragraph>
    <Paragraph position="5"> = p(active = dj I conte t(di-1))p(wj I ont t(dl)) where xj is either the name lj of node dj or its extension ej and d~ is the derivation up to thej-th step. The probability of a parse tree given the sentence is the sum over all derivations of that parse tree:</Paragraph>
    <Paragraph position="7"> Due to computational complexity, we restrict the number of bottom-up derivations we consider by using a window of n active nodes. For a window of 2, we can only choose either of the two leftmost nodes in the above process. So for the parse in Figure 1, we only get 4 derivations with a derivation window of 2.</Paragraph>
    <Paragraph position="8"> Eesh charscter used by the computer Is listed  Each internal node contains, from top to bottom, a label, word, tag, and extension value, and each leaf node contains a word, tag, and extension value.</Paragraph>
    <Paragraph position="9"> 4. Probabilistic Models for Node Features Node Representation We do not use all the subtree information rooted at a node N to condition our probabilistic models. But rather we have an equivalence class defined by the node name (if it's available), we also have for constituent nodes, a word, along with its corresponding part-of-speech tag, that is selected from each constituent to act as a lexical representative. The lexical representative from a constituent corresponds loosely to the linguistic notion of a head word. For example, the lexical representative of a noun phrase is the rightmost noun, and the lexical representative of a verb phrase is the leftmost non-auxiliary verb. However, the correlation to linguistic theory ends there. The deterministic rules (one per label) which select the representative word from each constituent were developed in the better part of an hour, in keeping with the philosophy of avoiding excessive dependence on carefully crafted rule-based methods. Figure 3 illustrates the word and tag features propagated along the parse tree for an example sentence. Each internal node is represented as a 4-feature vector: label, head word, head tag, and extension. Notation In the remainder of this section, the following notational scheme will be used. wi and ti refer to the word corresponding to the ith token in the sentence mad its part-of- null speech tag, respectively. N ~ refers to the 4-tuple of feature values at the kth node in the current parse state, where the nodes are numbered from left to right. N/~, N~, Nt k, and N~ refer, respectively, to the label, word, tag, and extension feature values at the node k. N C/j refers to the jth child of the current node where the leftmost child is child 1. N e-~ refers to the jth child of the current node where the rightmost child is child 1. The symbol Q,te refers to miscellaneous questions about the current state of the parser, such as the number of nodes in the sentence and the number of children of a particular node.</Paragraph>
    <Paragraph position="10"> The Tagging Model The tag feature value prediction is conditioned on the two words to the left, the two words to the right, and all information at two nodes to the left and two nodes to the right.</Paragraph>
    <Paragraph position="12"> The Extension Model The extension feature value prediction is conditioned on the node information at the node being extended, all information from two nodes to the left and two nodes to the right, and the two leftmost children and the two rightmost children of the current node (these will be redundant if there are less than 4 children at a node).</Paragraph>
    <Paragraph position="13"> v(N I o=te t) The Label Model The label feature value prediction is conditioned on questions about the presence of selected words in the constituent, all information from two nodes to the left and two nodes to the right, and the two leftmost children and the two rightmost children of the current node.</Paragraph>
    <Paragraph position="14"> p(N~ I contezt) ~ p(N~ I Q ~Nk-INk-2Nk+INk+2NC/I NC~NC-~NC-~) questions about the history. We have described in earlier papers, \[6, 4\], how we use mutual information clustering of words to define a set of classes on words that form the basis of the binary questions about words in the history. We also have defined by the same mutual information on the bigram tag distribution classes for binary questions on tags. We have identified by hand a set of classes for the binary questions on the labels. The decision trees are grown using the standard methods described in \[5\]. In the case of hidden derivations, the forward-backward algorithms can be used to get partial counts for the different events used in building the decision trees.</Paragraph>
  </Section>
  <Section position="4" start_page="274" end_page="275" type="metho">
    <SectionTitle>
5. Expectation Maximization Training
</SectionTitle>
    <Paragraph position="0"> The proposed history-based model cannot be estimated by direct frequency counts because the model contains a hidden component: the derivation model. The order in which the treebank parse trees were constructed is not encoded in the treebank, but the parser assigns probabilities to specific derivations of a parse tree. A forward-backward (FB) algorithm can be easily defined to compute a posteriori probabilities for. the states. These probabilities can then be used to define counts for the different events that are used to build the decision trees.</Paragraph>
    <Paragraph position="1"> To train the parser, all legal derivations of a parse tree (according to the derivational window constraint) are computed. ~p(N~\[ N~NtkNpN~N~-iN ~-2 Each derivation can be viewed as a path from a common ini-Nk+iNk+2NC~NC~NC-lNC-~}ial state, the words in the sentence, to a common final state, the completed parse tree. These derivations form a lattice of states, since different derivations of the same parse tree inevitably merge. For instance, the state created by tagging the first word in the sentence and then the second is the same state created by tagging the second word and then the first.</Paragraph>
    <Paragraph position="2"> These two derivations of this state have different probability estimates, but the state can be viewed as one state for future actions, since it represents a single history.</Paragraph>
    <Paragraph position="3"> The Derivation Model In initial experiments, the active node selection process was modelled by a uniform (p(active) = 1/n) model with n = 2. Our intuition was that by parametrizing the choice of which active node to process, we could improve the parser by delaying labeling and extension steps when the partial parse indicates ambiguity. We used the current node information and the node information available within the five node window.</Paragraph>
    <Section position="1" start_page="274" end_page="275" type="sub_section">
      <SectionTitle>
5.1. Decision Trees and the Forward-Backward
Algorithm
</SectionTitle>
      <Paragraph position="0"> Each leaf of decision tree represents the distribution of a class of histories. The parameters of these distributions can be updated using the F-B algorithm.</Paragraph>
      <Paragraph position="1"> Initially, the models in the parser are assumed to be uniform. Accordingly, each event in each derivation contributes equally to tlm process which selects which questions to ask about the history in order to predict each feature value. However, k k ~ 1 k ~ ~+1 ~ 2theunifdegrlnmdegdelis certainly not avery good model of p(active I contezt) ,~ p(active \[ Q &amp;quot;N N &amp;quot;- N - N N &amp;quot;-~ )feature value assignments. And, since some derivations of a parse tree are better than others, the events generated by Statistical Decision Trees The above probability distribu- the better derivations should contribute more to the decision tion are each modeled as a statistical decision tree with binary tree-growing process. The decision trees grown using the  uniform as!;umption collectively form a parsing model, MI.</Paragraph>
      <Paragraph position="2"> The F-B count for each event in the training corpus using MI can be used to grow a new set of decision trees, M2.</Paragraph>
      <Paragraph position="3"> The decision trees in M2 are constructed in a way which gives more weight to the events which contributed most to the probability of the corpus. However, there is no guarantee that M2 is a betl.er model than MI. It isn't even guaranteed that the probability of the training corpus according to M2 is higher than the probability according to MI. However, based on experimental results, the use of F-B counts in the construction of new decision trees is effective in acquiring a better model of the data.</Paragraph>
      <Paragraph position="4"> Thereis no &gt;way of knowing, apriori, which combination of the previously mentioned applications of the forward-backward algorithm will produce the best model. After initial experimentation, the following sequence of training steps proved effective: Grow initial decision trees (MI) based on uniform models null Create M2 by pruning trees in MI to a maximum depth of 10.</Paragraph>
      <Paragraph position="5"> Grow decision trees (M3) from F-B counts from M2.</Paragraph>
      <Paragraph position="6"> Perform F-B reestimation for leaves of decision trees in M3.</Paragraph>
      <Paragraph position="7"> Smoothing Decision Trees Once the leaf distributions for a, set of decision trees are fixed, the model must be smoothed using held-out data to avoid overtraining on the original training corpus.</Paragraph>
      <Paragraph position="8"> Each node in a decision tree potentially assigns a different distribution to the set of future values predicted by that tree. The problem of smoothing is to decide which combination of the distributions along a path from a leaf to the root will result in the most accurate model. The decision trees are smoothed by assigning a parameter to each node. This parameter represents the extent to which the distribution at that node should be trusted with respect to the distribution at the parent node.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="275" end_page="275" type="metho">
    <SectionTitle>
6. Experimental Results
</SectionTitle>
    <Paragraph position="0"> Task Domain We have chosen computer manuals as a task domain. We picked the most frequent 3000 words from 10 manuals as our vocabulary. We then extracted about 35,000 sentences covered by this vocabulary3 from40,000,000 words of computer manuals. This corpus was treebanked by the University of Lancaster. The Treebank uses 17 non-terminal labels and 240 tags4.</Paragraph>
    <Paragraph position="1"> actual vocabulary is around 7,000 words when we include the many symbols, formulas, and numbers that occur in tl~e manuals *we have projected the tag set to 193  and average number of non-terminals per sentence for the blind test set.</Paragraph>
    <Paragraph position="2"> A parse produced by the parser is judged to be correct under the &amp;quot;Exact Match&amp;quot; criterion if it agrees with the Treebank parse structurally and all NT labels and tags agree5 Length Experiment 1 The parser using a stack decoding search which produced 1 parse for each sentence, and this parse was compared to the treebank parse for that sentence. On this test set, the parser produced the correct parse, i.e. a parse which matched the treebank parse exactly, for 38% of the sentences. Ignoring part-of-speech tagging errors, it produced the correct parse tree for 47% of the sentences. Further, the correct parse tree is present in the top 20 parses produced by the parser for 64% of the sentences.</Paragraph>
    <Paragraph position="3">  No other parsers have reported results on exactly matching treebank parses, so we also evaluated on the crossing brackets measure from [2], which represents the percentage of sentences for which none of the constituents in a parser's analysis violate the constituent boundaries of the treebank parse. The crossing-brackets measure is a very weak measure of parsing accuracy, since it does not verify prepositional phrase attachment or any other decision which is indicated by omitting structure. However, based on analysis of parsing errors, in the current state-of-the-art, increases in the crossing brackets measure appear to correlated with improvements in overall parsing performance. This may not remain true as parsers become more accurate.</Paragraph>
    <Paragraph position="4"> Constituent1 Sentence The 1100 sentence corpus that we used in this first experiment was one of the test corpora used in several experiments reported in [2]. The grammar-based parser discussed in [2] uses a P-CFG based on a rule-based grammar developed by a grammarian by examining the same training set used above over a period of more than 3 years. This P-CFG parser produced parses which passed the crossing brackets test for 69% of the 1100 sentences. Our decision tree hidden derivation parser improves upon this result, passing the crossing brackets test for 78% of the sentences. The details of this experiment are discussed in [9].</Paragraph>
    <Paragraph position="5"> % sample of 5000 sentences (a training set of 4000, a development test of 500, and an evaluation test of 500) is available by request from roukos Q watson.ibm.com.</Paragraph>
    <Paragraph position="6">  that Exact Match accuracy decreases by two percentage points with a significant reduction in computational complexity. Using the simpler single derivation model, we built a new set of models. We also combined the naming and extension steps into one, improved some of our processing of the casing of words, and added a few additional questions. Using these models, we ran on all sentences in our blind test set. Table 1 gives some statistics a function of sentence length on our test set of 1656 sentences. Table 2 gives the parser's performance e. In Table 2, we show a measure of treebank consistency. During treebanking, a random sample of about 1000 sentences was treebanked by two treebankers. The percentage of sentences for which they both produce the same exact trees (tags included) is shown as Treebank Consistency in Table 2. We also show the percentage of sentences that match the Treebank, the percentage where the Treebank parse is among the top 20 parses produced by the parser, and the percentage of sentences without a crossing bracket. Currently, the parser parses every third sentence exactly as a treebanker and is about 15 percentage points below what the treebankers agree on when they are parsing in production mode. A more carefully treebanked test set may be necessary in the future as we improve our parser.</Paragraph>
    <Paragraph position="7"> We also explored the effect of training set size on parsing performance with an earlier version of the parsing model.</Paragraph>
    <Paragraph position="8"> Table 3 shows the Exact Match score for sentences of 23 words or less. From this data, we see that we have a small improvement in accuracy by doubling the training set size from 15k to 30k sentences.</Paragraph>
    <Paragraph position="9"> /</Paragraph>
  </Section>
  <Section position="6" start_page="275" end_page="275" type="metho">
    <SectionTitle>
7. Conclusion
</SectionTitle>
    <Paragraph position="0"> We presented a &amp;quot;linguistically&amp;quot; naive parsing model that has a parsing accuracy rate that we believe is state-of-the-art. We anticipate that by refining the &amp;quot;linguistic&amp;quot; features that can be examined by the decision trees, we can improve the parser's performance significantly. Of particular interest are linguistic 6 While we prefer to use Exact Match for automatic parsing, we computed the PARSEVAL performance measures to be: 80% Recall, 81% Precision, and 10% Crossing Brackets on the unseen test set of Experiment 2. Note: On this test set, 65.7% of the sentences are parsed without any crossing brackets.  features that may be helpful in conjunction and other long distance dependency. We are currently investigating some mehtods for building in some of these features.</Paragraph>
  </Section>
  <Section position="7" start_page="275" end_page="275" type="metho">
    <SectionTitle>
Acknowledgement
</SectionTitle>
    <Paragraph position="0"> We wish to thank Robert T. Ward for his measurements of Treebank consistency. This work was supported in part by</Paragraph>
  </Section>
class="xml-element"></Paper>