<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2048">
  <Title>Some Applications of Tree-based Modelling to Speech and Language</Title>
  <Section position="2" start_page="0" end_page="343" type="intro">
    <SectionTitle>
2. Classification and Regression Trees
</SectionTitle>
    <Paragraph position="0"> An excellent description of the theory and implementation of tree-based statistical models can be found in Classit~cation and Regression Trees \[L. Breiman, et al, 1984\]. A brief description of these ideas will be provided here.</Paragraph>
    <Paragraph position="1"> Figure 1 shows an example of a tree for classifying whether a stop (in the context C V) is voiced or voiceless based on factors such as voice onset time, closure duration, and phonetic context. Let us first see how to use such a tree for classification. Then we will see how the tree was generated.</Paragraph>
    <Paragraph position="2"> Suppose we have a stop with a VOT of 30 msec that is preceded by a nasal and followed by a high back vowel. Starting at the root node in Figure 1, the first decision is whether the VOT is greater or less than 35.4 msec. Since in our example, it is less, we take the left branch. The next split, labelled &amp;quot;l-cm&amp;quot;, refers to the consonantal manner of the the preceding (left) segment. Since in this case it is nasal, we take the right branch. The next split is on the vowel place of following (right) segment. Since it is high back in this case, we take the right branch, reaching a terminal node. The node is labelled &amp;quot;yes&amp;quot;, indicating that this example should be classified as voiced.</Paragraph>
    <Paragraph position="3"> In the training set, 739 of the 1189 examples that reached this node were correctly classified. This tree is a subtree of a better classifier to be described in the next section; this example was pruned for illustrative purposes.</Paragraph>
    <Paragraph position="4"> This is an example of a classification tree, since the decision is to choose one of several classes; in this case, there are two classes: {voiced, voiceless}. In other words, the predicted variable, y, is categorical. Trees can be created for continuous y also. In this case they are called regression trees with the terminal nodes labelled with a real number (or, more generally, a vector).</Paragraph>
    <Paragraph position="5"> Classifying with an existing tree is easy; the interesting question is how to generate the tree for a given problem. There are three basic questions that have to be answered when generating a tree: (1) what are the splitting rules, (2) what are the stopping rules, and (3) what prediction is made at each terminal node?  Let us begin answering these questions by introducing some notation. Consider that we have N samples of data, with each sample consisting of M features, xl, x2, x3,.., xm. In the voiced/voiceless stop example, xl might be VOT, x2 the phonetic class of the preceding segment, etc. Just as the y (dependent) variable can be continuous or categorical, so can the x (independent) variables. E.g., VOT is continuous, while phonetic class is categorical (can not be usefully ordered).</Paragraph>
    <Paragraph position="6"> The first question -- what stopping rule? -- refers to what split to take at a given node. It has two parts: (a) what candidates should be considered, and (b) which is the best choice among candidates for a given node? A simple choice is to consider splits based on one x variable at a time. If the independent variable being considered is continuous -oo &lt; x &lt; c~, consider splits of the form: z_&lt;k vs. x&gt;k, Vk.</Paragraph>
    <Paragraph position="7"> In other words, consider all binary cuts of that variable. If the independent variable is categorical x E {1, 2, ..., n} = X, consider splits of form: x~A vs. xEX-A, VAcX.</Paragraph>
    <Paragraph position="8"> In other words, consider all binary partitions of that variable. More sophisticated splitting rules would allow combinations of a such splits at a given node; e.g., linear combinations of continuous variables, or boolean combinations of categorical variables.</Paragraph>
    <Paragraph position="9"> A simple choice to decide which of these splits is the best at a given node is to select the one that minimizes the estimated classification or prediction error after that split based on the training set. Since this is done stepwise at each node, this is not guaranteed to be globally optimal even for the training set. In fact, there are cases where this is a bad choice. Consider Figure 2, where two different splits are illustrated for a classification problem having two classes (No. 1 and No. 2) and 800 samples in the training set (with 400 in each class). If we label each child node according to the greater class present there, we see that the two different splits illustrated both give 200 samples misclassified. Thus, minimizing the error gives no preference to either of these splits.</Paragraph>
    <Paragraph position="10"> The example on the right, however, is better because it creates at least one very pure node (no misclassification) which needs no more splitting. At the next split, the other node can be attacked. In other words, the stepwise optimization makes creating purer nodes at each step desirable. A simple way to do this is to minimize the entropy at each node for categorical y. Minimizing the mean square error is a common choice for continuous y.</Paragraph>
    <Paragraph position="11"> The second question -- what stopping rule? -- refers when to declare a node terminal. Too large trees may match the training data well, but they won't necessarily perform well on new test data, since they have overfit the data. Thus, a procedure is needed to find an &amp;quot;honest-sized&amp;quot; tree.</Paragraph>
    <Paragraph position="12"> Early attempts at this tried to find good stopping rules based on absolute purity, differential purity from the parent, and other such &amp;quot;local&amp;quot; evaluations. Unfortunately, good thresholds for these vary from problem to problem.</Paragraph>
    <Paragraph position="13"> A better choice is as follows: (a) grow an over-large tree with very conservative stopping rules, (b) form a sequence of subtrees, To,..., Tn, ranging from the full tree to just the root node, (c) estimate an &amp;quot;honest&amp;quot; error rate for each subtree, and then (d) choose the subtree with the minimum &amp;quot;honest&amp;quot; error rate.  Split 1 Split 2 Figure 2. Two different splits with the same misclassification rate.</Paragraph>
    <Paragraph position="14"> To form the sequence of subtrees in (b), vary c~ from 0 (for full tree) to oo (for just the root node) in: min \[R(T) + alTI\]. T where R(T) is the classification or prediction error for that subtree and I TI is the number of terminal nodes in the subtree. This is called the cost-complexity pruning sequence.</Paragraph>
    <Paragraph position="15"> To estimate an &amp;quot;honest&amp;quot; error rate in (c), test the subtrees on data different from the training data, e.g., grow the tree on 9/10 of the available data and test on 1/10 of the data repeating 10 times and averaging. This is often called cross-validation.</Paragraph>
    <Paragraph position="16"> Figure 3 shows misclassification rate vs. tree length for the voiced-voiceless stop classification problem. The bottom curve shows misclassification for the training data, which continues to improve with increasing tree length. The higher curve shows the cross-validated misclassification rate, which reaches a minimum with a tree size of about 30 and then rises again with increasing tree length. In fact a tree length of around 10 is very near optimal and would be a good choice for this problem.</Paragraph>
    <Paragraph position="17"> The last question -- what prediction is made at a terminal node? -- is easy to answer. If the predicted variable is categorical, choose the most frequent class among the training samples at that node (plurality vote). If it is continuous, choose the mean of the training samples at that node.</Paragraph>
    <Paragraph position="18"> The approach described here can be used on quite large problem. We have grown trees with hundreds of thousands of samples with a hundred different independent variables. The time complexity, in fact, grows only linearly with the number of input variables. The one expensive operation is forming the binary partitions for categorical x's. This increases exponentially with the number of distinct values the variable can assume. Let us now discuss some applications of these ideas to some problems in speech and language.</Paragraph>
    <Paragraph position="20"/>
  </Section>
class="xml-element"></Paper>