File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/95/p95-1037_metho.xml
Size: 17,458 bytes
Last Modified: 2025-10-06 14:14:06
<?xml version="1.0" standalone="yes"?> <Paper uid="P95-1037"> <Title>Statistical Decision-Tree Models for Parsing*</Title>
<Section position="3" start_page="276" end_page="278" type="metho"> <SectionTitle> 2 Decision-Tree Modeling </SectionTitle>
<Paragraph position="0"> Much of the work in this paper depends on replacing human decision-making skills with automatic decision-making algorithms. The decisions under consideration involve identifying constituents and constituent labels in natural language sentences.</Paragraph>
<Paragraph position="1"> Grammarians, the human decision-makers in parsing, solve this problem by enumerating the features of a sentence which affect the disambiguation decisions and indicating which parse to select based on the feature values. The grammarian accomplishes two critical tasks: identifying the features which are relevant to each decision, and deciding which choice to select based on the values of the relevant features.</Paragraph>
<Paragraph position="2"> Decision-tree classification algorithms account for both of these tasks, and they also accomplish a third task which grammarians classically find difficult. By assigning a probability distribution to the possible choices, decision trees provide a ranking system which not only specifies the order of preference for the possible choices, but also gives a measure of the relative likelihood that each choice is the one which should be selected.</Paragraph>
<Section position="1" start_page="276" end_page="276" type="sub_section"> <SectionTitle> 2.1 What is a Decision Tree? </SectionTitle>
<Paragraph position="0"> A decision tree is a decision-making device which assigns a probability to each of the possible choices based on the context of the decision: P(f | h), where f is an element of the future vocabulary (the set of choices) and h is a history (the context of the decision). This probability P(f | h) is determined by asking a sequence of questions q_1 q_2 ... q_n about the context, where the i-th question asked is uniquely determined by the answers to the i-1 previous questions.</Paragraph>
<Paragraph position="1"> For instance, consider the part-of-speech tagging problem. The first question a decision tree might ask is: 1. What is the word being tagged? If the answer is the, then the decision tree needs to ask no more questions; it is clear that the decision tree should assign the tag f = determiner with probability 1. If, instead, the answer to question 1 is bear, the decision tree might next ask the question: 2. What is the tag of the previous word? If the answer to question 2 is determiner, the decision tree might stop asking questions and assign the tag f = noun with very high probability, and the tag f = verb with much lower probability. However, if the answer to question 2 is noun, the decision tree would need to ask still more questions to get a good estimate of the probability of the tagging decision. The decision tree described in this paragraph is shown in Figure 1.</Paragraph>
<Paragraph position="2"> Each question asked by the decision tree is represented by a tree node (an oval in the figure) and the possible answers to this question are associated with branches emanating from the node. Each node defines a probability distribution on the space of possible decisions. A node at which the decision tree stops asking questions is a leaf node. The leaf nodes represent the unique states in the decision-making problem, i.e. all contexts which lead to the same leaf node have the same probability distribution for the decision.</Paragraph> </Section>
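As an illustration of how such a tree assigns P(f | h), here is a minimal sketch that mirrors the toy tagging tree of Figure 1. The questions, vocabulary, and leaf probabilities are invented for illustration and are not values from the paper.

```python
# A minimal sketch of a decision tree assigning P(f | h) for part-of-speech
# tagging, in the spirit of Figure 1. All numbers below are illustrative.

class Leaf:
    def __init__(self, distribution):
        self.distribution = distribution      # dict: tag -> probability

    def predict(self, history):
        return self.distribution

class Node:
    def __init__(self, question, branches, default):
        self.question = question              # function: history -> answer
        self.branches = branches              # dict: answer -> child node
        self.default = default                # child used for unseen answers

    def predict(self, history):
        answer = self.question(history)
        child = self.branches.get(answer, self.default)
        return child.predict(history)

# Question 1: what is the word being tagged?
# Question 2: what is the tag of the previous word?
tree = Node(
    question=lambda h: h["word"],
    branches={
        "the": Leaf({"determiner": 1.0}),
        "bear": Node(
            question=lambda h: h["prev_tag"],
            branches={"determiner": Leaf({"noun": 0.9, "verb": 0.1})},
            default=Leaf({"noun": 0.5, "verb": 0.5}),  # would ask more questions
        ),
    },
    default=Leaf({"noun": 0.4, "verb": 0.3, "determiner": 0.3}),
)

print(tree.predict({"word": "bear", "prev_tag": "determiner"}))
# {'noun': 0.9, 'verb': 0.1}
```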
<Section position="2" start_page="276" end_page="278" type="sub_section"> <SectionTitle> 2.2 Decision Trees vs. n-grams </SectionTitle>
<Paragraph position="0"> A decision-tree model is not really very different from an interpolated n-gram model. In fact, they are equivalent in representational power. The main differences between the two modeling techniques are how the models are parameterized and how the parameters are estimated.</Paragraph>
<Paragraph position="1"> First, let's be very clear about what we mean by an n-gram model. Usually, an n-gram model refers to a Markov process where the probability of a particular token being generated depends on the values of the previous n-1 tokens generated by the same process. By this definition, an n-gram model has |W|^n parameters, where |W| is the number of unique tokens generated by the process.</Paragraph>
<Paragraph position="2"> However, here let's define an n-gram model more loosely as a model which defines a probability distribution on a random variable given the values of n-1 random variables, P(f | h_1 h_2 ... h_{n-1}). There is no assumption in this definition that the random variables F and H_i range over the same vocabulary. The number of parameters in this n-gram model is |F| |H_1| |H_2| ... |H_{n-1}|.</Paragraph>
<Paragraph position="3"> Using this definition, an n-gram model can be represented by a decision-tree model with n-1 questions. For instance, the part-of-speech tagging model P(t_i | w_i t_{i-1} t_{i-2}) can be interpreted as a 4-gram model, where H_1 is the variable denoting the word being tagged, H_2 is the variable denoting the tag of the previous word, and H_3 is the variable denoting the tag of the word two words back. Hence, this 4-gram tagging model is the same as a decision-tree model which always asks the sequence of 3 questions: 1. What is the word being tagged? 2. What is the tag of the previous word? 3. What is the tag of the word two words back? But can a decision-tree model be represented by an n-gram model? No, but it can be represented by an interpolated n-gram model. The proof of this assertion is given in the next section.</Paragraph>
<Paragraph position="4"> The standard approach to estimating an n-gram model is a two-step process. The first step is to count the number of occurrences of each n-gram in a training corpus. This process determines the empirical (relative-frequency) distribution P~: P~(f | h_1 h_2 ... h_{n-1}) = Count(h_1 h_2 ... h_{n-1} f) / Count(h_1 h_2 ... h_{n-1}). The second step is smoothing the empirical distribution using a separate, held-out corpus. This step improves the empirical distribution by finding statistically unreliable parameter estimates and adjusting them based on more reliable information.</Paragraph>
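To make the first, counting step concrete, the sketch below computes the relative-frequency estimate P~(f | h_1 ... h_{n-1}) = Count(h_1 ... h_{n-1} f) / Count(h_1 ... h_{n-1}) from a list of (history, future) events; the smoothing step by interpolation continues below. The toy tagging events are invented for illustration and are not data from the paper.

```python
from collections import Counter

def empirical_ngram(events):
    """events: iterable of (history, future) pairs, where history is a tuple of
    answer values (h_1, ..., h_{n-1}) and future is the predicted token f.
    Returns the relative-frequency estimate P~(f | h_1 ... h_{n-1})."""
    joint = Counter()      # Count(h_1 ... h_{n-1} f)
    history = Counter()    # Count(h_1 ... h_{n-1})
    for h, f in events:
        joint[(h, f)] += 1
        history[h] += 1
    return lambda f, h: joint[(h, f)] / history[h] if history[h] else 0.0

# Toy tagging events: history = (word, previous tag), future = tag.
events = [
    (("bear", "determiner"), "noun"),
    (("bear", "determiner"), "noun"),
    (("bear", "determiner"), "verb"),
    (("the", None), "determiner"),
]
p = empirical_ngram(events)
print(p("noun", ("bear", "determiner")))   # 2/3
```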
<Paragraph position="5"> A commonly-used technique for smoothing is deleted interpolation. Deleted interpolation estimates a model P(f | h_1 h_2 ... h_{n-1}) using a linear combination of empirical models P~(f | h_{k_1} h_{k_2} ... h_{k_m}), where m < n and k_{i-1} < k_i < n for all i <= m. For example, a model P~(f | h_1 h_2 h_3) might be interpolated as follows:</Paragraph>
<Paragraph position="6"> P(f | h_1 h_2 h_3) = lambda_1(h_1 h_2 h_3) P~(f | h_1 h_2 h_3) + lambda_2(h_1 h_2 h_3) P~(f | h_1 h_2) + lambda_3(h_1 h_2 h_3) P~(f | h_1) + lambda_4(h_1 h_2 h_3) P~(f),</Paragraph>
<Paragraph position="7"> where sum_i lambda_i(h_1 h_2 h_3) = 1 for all histories h_1 h_2 h_3.</Paragraph>
<Paragraph position="8"> The optimal values for the lambda_i functions can be estimated using the forward-backward algorithm (Baum, 1972).</Paragraph>
<Paragraph position="9"> A decision-tree model can be represented by an interpolated n-gram model as follows. A leaf node in a decision tree can be represented by the sequence of question answers, or history values, which leads the decision tree to that leaf. Thus, a leaf node defines a probability distribution based on the values of those questions: P(f | h_{k_1} h_{k_2} ... h_{k_m}), where m < n and k_{i-1} < k_i < n, and where h_{k_i} is the answer to one of the questions asked on the path from the root to the leaf.(1) But this is the same as one of the terms in the interpolated n-gram model. So, a decision tree can be defined as an interpolated n-gram model in which, for each history, the lambda function assigns weight 1 to the single term whose conditioning values h_{k_1} h_{k_2} ... h_{k_m} are the answers asked on the path to the leaf reached by that history, and weight 0 to every other term.</Paragraph>
<Paragraph position="10"> (1) Note that in a decision tree, the leaf distribution is not affected by the order in which questions are asked. Asking about h_1 followed by h_2 yields the same future distribution as asking about h_2 followed by h_1.</Paragraph> </Section>
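The following sketch, assuming the four back-off terms of the example above and hand-set component models, shows both the interpolation and the sense in which a decision tree is the special case where exactly one lambda is 1 for each history. The component models and the tree's lambda function are toy stand-ins, not models estimated from data or taken from the paper.

```python
# Hedged sketch: deleted interpolation over a three-question history, and a
# decision tree expressed as a degenerate (0/1-weighted) interpolation.

def interpolate(p_full, p_h1h2, p_h1, p_uni, lambdas):
    """P(f | h1 h2 h3) as a lambda-weighted sum of empirical back-off models."""
    def p(f, h1, h2, h3):
        l1, l2, l3, l4 = lambdas(h1, h2, h3)
        assert abs(l1 + l2 + l3 + l4 - 1.0) < 1e-9    # weights must sum to 1
        return (l1 * p_full(f, h1, h2, h3)
                + l2 * p_h1h2(f, h1, h2)
                + l3 * p_h1(f, h1)
                + l4 * p_uni(f))
    return p

# Toy component models (stand-ins for relative-frequency estimates).
p_full = lambda f, h1, h2, h3: 0.9 if f == "noun" else 0.1
p_h1h2 = lambda f, h1, h2:     0.8 if f == "noun" else 0.2
p_h1   = lambda f, h1:         0.7 if f == "noun" else 0.3
p_uni  = lambda f:             0.5

# A decision tree sets exactly one lambda to 1 per history: the term whose
# conditioning answers are the ones asked on the path to the leaf.
tree_lambdas = lambda h1, h2, h3: (0, 0, 1, 0) if h1 == "the" else (1, 0, 0, 0)

p = interpolate(p_full, p_h1h2, p_h1, p_uni, tree_lambdas)
print(p("noun", "bear", "determiner", "noun"))   # 0.9 (full-history leaf)
print(p("noun", "the", None, None))              # 0.7 (one-question leaf)
```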
<Section position="3" start_page="278" end_page="278" type="sub_section"> <SectionTitle> 2.3 Decision-Tree Algorithms </SectionTitle>
<Paragraph position="0"> The point of showing the equivalence between n-gram models and decision-tree models is to make clear that the power of decision-tree models lies not in their expressiveness, but in how they can be automatically acquired for very large modeling problems. As n grows, the parameter space for an n-gram model grows exponentially, and it quickly becomes computationally infeasible to estimate the smoothed model using deleted interpolation. Also, as n grows large, the likelihood that the deleted interpolation process will converge to an optimal or even near-optimal parameter setting becomes vanishingly small.</Paragraph>
<Paragraph position="1"> On the other hand, the decision-tree learning algorithm increases the size of a model only as the training data allows. Thus, it can consider very large history spaces, i.e. n-gram models with very large n. Regardless of the value of n, the number of parameters in the resulting model remains relatively constant, depending mostly on the number of training examples.</Paragraph>
<Paragraph position="2"> The leaf distributions in decision trees are empirical estimates, i.e. relative-frequency counts from the training data. Unfortunately, they assign probability zero to possible events which happen not to occur in the training data. Therefore, just as it is necessary to smooth empirical n-gram models, it is also necessary to smooth empirical decision-tree models.</Paragraph>
<Paragraph position="3"> The decision-tree learning algorithms used in this work were developed over the past 15 years by the IBM Speech Recognition group (Bahl et al., 1989). The growing algorithm is an adaptation of the CART algorithm in (Breiman et al., 1984). For detailed descriptions and discussions of the decision-tree algorithms used in this work, see (Magerman, 1994).</Paragraph>
<Paragraph position="4"> An important point which has been omitted from this discussion of decision trees is the fact that only binary questions are used in these decision trees. A question which has k values is decomposed into a sequence of binary questions using a classification tree on those k values. For example, a question about a word is represented as 30 binary questions. These 30 questions are determined by growing a classification tree on the word vocabulary as described in (Brown et al., 1992). The 30 questions represent 30 different binary partitions of the word vocabulary, and they are defined such that it is possible to identify each word by asking all 30 questions. For more discussion of the use of binary decision-tree questions, see (Magerman, 1994).</Paragraph> </Section> </Section>
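A minimal sketch of this binary decomposition: each word is assigned a bit string recording its path through a binary classification tree over the vocabulary, in the spirit of (Brown et al., 1992), and the i-th binary question asks for the i-th bit. The four-bit vocabulary below is an invented miniature used only for illustration; the trees described above use 30 bits.

```python
# Hedged sketch: a |V|-valued word question decomposed into binary questions.
# The bit strings are placeholders, not the classes from (Brown et al., 1992).

WORD_BITS = {
    "the":   "0000",
    "a":     "0001",
    "brown": "0110",
    "cow":   "1010",
    "bear":  "1011",
}

def word_question(word, i):
    """Binary question i about a word: does bit i of its tree path equal 1?"""
    return WORD_BITS[word][i] == "1"

# Asking all of the questions identifies the word exactly, because each word's
# path through the classification tree is unique.
answers = tuple(word_question("cow", i) for i in range(4))
print(answers)                                            # (True, False, True, False)
assert len(set(WORD_BITS.values())) == len(WORD_BITS)     # paths are unique
```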
<Section position="4" start_page="278" end_page="280" type="metho"> <SectionTitle> 3 SPATTER Parsing </SectionTitle>
<Paragraph position="0"> The SPATTER parsing algorithm is based on interpreting parsing as a statistical pattern recognition process. A parse tree for a sentence is constructed by starting with the sentence's words as leaves of a tree structure, and labeling and extending these nodes until a single-rooted, labeled tree is constructed. This pattern recognition process is driven by the decision-tree models described in the previous section.</Paragraph>
<Section position="1" start_page="278" end_page="279" type="sub_section"> <SectionTitle> 3.1 SPATTER Representation </SectionTitle>
<Paragraph position="0"> A parse tree can be viewed as an n-ary branching tree, with each node in the tree labeled by either a non-terminal label or a part-of-speech label. If a parse tree is interpreted as a geometric pattern, a constituent is no more than a set of edges which meet at the same tree node. For instance, the noun phrase &quot;a brown cow&quot; consists of an edge extending to the right from &quot;a,&quot; an edge extending to the left from &quot;cow,&quot; and an edge extending up from &quot;brown&quot; (Figure 2 illustrates the labeling of extensions in SPATTER).</Paragraph>
<Paragraph position="1"> In SPATTER, a parse tree is encoded in terms of four elementary components, or features: words, tags, labels, and extensions. Each feature has a fixed vocabulary, with each element of a given feature vocabulary having a unique representation. The word feature can take on the value of any word. The tag feature can take on any value in the part-of-speech tag set. The label feature can take on any value in the non-terminal set. The extension can take on any of the following five values: right - the node is the first child of a constituent; left - the node is the last child of a constituent; up - the node is neither the first nor the last child of a constituent; unary - the node is a child of a unary constituent; root - the node is the root of the tree.</Paragraph>
<Paragraph position="2"> For an n-word sentence, a parse tree has n leaf nodes, where the word feature value of the i-th leaf node is the i-th word in the sentence. The word feature value of an internal node is intended to contain the lexical head of the node's constituent. A deterministic lookup table based on the label of the internal node and the labels of its children is used to approximate this linguistic notion.</Paragraph>
<Paragraph position="3"> The SPATTER representation of the example sentence is shown in Figure 3. The nodes are constructed bottom-up from left to right, with the constraint that no constituent node is constructed until all of its children have been constructed. The order in which the nodes of the example sentence are constructed is indicated in the figure.</Paragraph> </Section>
<Section position="2" start_page="279" end_page="279" type="sub_section"> <SectionTitle> 3.2 Training SPATTER's models </SectionTitle>
<Paragraph position="0"> SPATTER consists of three main decision-tree models: a part-of-speech tagging model, a node-extension model, and a node-labeling model.</Paragraph>
<Paragraph position="1"> Each of these decision-tree models is grown using questions about the X of the current node, of nearby nodes to the Y, and of the current node's children from the Y, where X is one of word, tag, label, or extension, and Y is either left or right. For each of these nodes, the decision tree could also ask about the number of children and the span of the node. For the tagging model, the values of the previous two words and their tags are also asked, since they might differ from the head words of the previous two constituents.</Paragraph>
<Paragraph position="2"> The training algorithm proceeds as follows. The training corpus is divided into two sets, approximately 90% for tree growing and 10% for tree smoothing. For each parsed sentence in the tree-growing corpus, the correct state sequence is traversed. Each state transition from s_i to s_{i+1} is an event; the history is made up of the answers to all of the questions at state s_i, and the future is the value of the action taken from state s_i to state s_{i+1}. Each event is used as a training example for the decision-tree growing process for the appropriate feature's tree (e.g. each tagging event is used for growing the tagging tree, etc.). After the decision trees are grown, they are smoothed on the tree-smoothing corpus using a variation of the deleted interpolation algorithm described in (Magerman, 1994).</Paragraph> </Section>
<Section position="3" start_page="279" end_page="280" type="sub_section"> <SectionTitle> 3.3 Parsing with SPATTER </SectionTitle>
<Paragraph position="0"> The parsing procedure is a search for the highest probability parse tree. The probability of a parse is just the product of the probabilities of the actions made in constructing the parse, according to the decision-tree models.</Paragraph>
<Paragraph position="1"> Because of the size of the search space (roughly O(|T|^n |N|^n), where |T| is the number of part-of-speech tags, n is the number of words in the sentence, and |N| is the number of non-terminal labels), it is not possible to compute the probability of every parse. However, the specific search algorithm used is not very important, so long as there are no search errors. A search error occurs when the highest probability parse found by the parser is not the highest probability parse in the space of all parses.</Paragraph>
<Paragraph position="2"> SPATTER's search procedure uses a two-phase approach to identify the highest probability parse of a sentence. First, the parser uses a stack decoding algorithm to quickly find a complete parse for the sentence. Once the stack decoder has found a complete parse of reasonable probability (> 10^-5), it switches to a breadth-first mode to pursue all of the partial parses which have not been explored by the stack decoder. In this second mode, it can safely discard any partial parse which has a probability lower than the probability of the highest probability completed parse. Using these two search modes, SPATTER guarantees that it will find the highest probability parse.</Paragraph>
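A minimal sketch of the pruning logic of the second, breadth-first phase: once a complete parse is in hand, any partial parse whose probability already falls below it can be discarded, because extending a parse only multiplies in further probabilities no greater than 1. The expand and is_complete callbacks and the toy states below are placeholders, not SPATTER's actual data structures.

```python
# Hedged sketch of the breadth-first phase with safe probability pruning.

def exhaust_partials(partials, best_complete_prob, expand, is_complete):
    """partials: (probability, state) pairs left unexplored by the stack decoder;
    best_complete_prob: probability of the complete parse it already found.
    Returns the probability and state of the best complete parse found."""
    best, best_state = best_complete_prob, None
    frontier = list(partials)
    while frontier:
        prob, state = frontier.pop()
        if prob < best:
            continue        # safe pruning: extensions cannot raise the probability
        if is_complete(state):
            best, best_state = prob, state
        else:
            frontier.extend(expand(prob, state))   # children as (prob, state) pairs
    return best, best_state

# Toy usage: states are strings; a state is complete when it has length 3.
expand = lambda prob, s: [(prob * 0.5, s + "x"), (prob * 0.4, s + "y")]
is_complete = lambda s: len(s) == 3
print(exhaust_partials([(1.0, "a")], best_complete_prob=0.05,
                       expand=expand, is_complete=is_complete))
# (0.25, 'axx')
```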
<Paragraph position="3"> The only limitation of this search technique is that, for sentences which are modeled poorly, the search might exhaust the available memory before completing both phases. However, these search errors conveniently occur on sentences which SPATTER is likely to get wrong anyway, so there isn't much performance lost due to search errors. Experimentally, the search algorithm guarantees that the highest probability parse is found for over 96% of the sentences parsed.</Paragraph> </Section> </Section> </Paper>