<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0502"> <Title>A Sequential Model for Multi-Class Classification</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Multi-Class Classification </SectionTitle> <Paragraph position="0"> Several works within the machine learning community have attempted to develop general approaches to multi-class classification. One of the most promising approaches is that of error correcting output codes (Dietterich and Bakiri, 1995); however, this approach has not been able to handle a large number of classes (over 10 or 15, say) well, and its use for most large scale NLP applications is therefore questionable. Statisticians have studied several schemes, such as learning a single classifier for each of the class labels (one vs. all) or learning a discriminator for each pair of class labels, and have discussed their relative merits (Hastie and Tibshirani, 1998).</Paragraph> <Paragraph position="1"> Although it has been argued that the latter should provide better results than other schemes, experimental results have been mixed (Allwein et al., 2000), and in some cases more involved schemes, e.g., learning a classifier for each set of three class labels (and deciding on the prediction in a tournament-like fashion), were shown to perform better (Teow and Loe, 2000). Moreover, none of these methods seems computationally plausible for large scale problems, since the number of classifiers one needs to train is at least quadratic in the number of class labels.</Paragraph> <Paragraph position="2"> Within NLP, several learning studies have already addressed the problem of multi-class classification.</Paragraph> <Paragraph position="3"> In (Kudoh and Matsumoto, 2000) the method of &quot;all pairs&quot; was used to learn phrase annotations for shallow parsing. A large number of different classifiers (one per pair of classes) were used in this task, making it infeasible as a general solution. All other cases we know of have taken into account some properties of the domain and, in fact, several of the works can be viewed as instantiations of the sequential model we formalize here, albeit done in an ad-hoc fashion.</Paragraph> <Paragraph position="4"> In speech recognition, a sequential model is used to process the speech signal. Abstracting away some details, the first classifier used is a speech signal analyzer; it assigns a positive probability only to some of the words (using Levenshtein distance (Levenshtein, 1966) or somewhat more sophisticated techniques (Levinson et al., 1990)). These words are then assigned probabilities using a different contextual classifier, e.g., a language model, and then (as done in most current speech recognizers) an additional sentence-level classifier uses the outcome of the word classifiers in a word lattice to choose the most likely sentence.</Paragraph> <Paragraph position="5"> Several word prediction tasks make decisions in a sequential way as well. In spelling correction, confusion sets are created using a classifier that takes the word transcription as input and outputs a positive probability for potential words. In conventional spellers, the output of this classifier is then given to the user, who selects the intended word. In context sensitive spelling correction (Golding and Roth, 1999; Mangu and Brill, 1997) an additional classifier is then utilized to predict among words that are supported by the first classifier, using contextual and lexical information of the surrounding words.
In all studies done so far, however, the first classifier - the confusion sets - was constructed manually by the researchers.</Paragraph> <Paragraph position="6"> Other word prediction tasks have also constructed the lists of confusion sets manually (Lee and Pereira, 1999; Dagan et al., 1999; Lee, 1999), and justifications were given as to why this is a reasonable way to construct them. (Even-Zohar and Roth, 2000) present a similar task in which the generation of confusion sets was automated. Their study also quantified experimentally the advantage of using early classifiers to restrict the size of the confusion set.</Paragraph> <Paragraph position="7"> Many other NLP tasks, such as POS tagging, named entity recognition and shallow parsing, require multi-class classifiers. In several of these cases the number of classes can be very large (e.g., POS tagging in some languages, or POS tagging when a finer proper noun tag is used). The sequential model suggested here is a natural solution.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Sequential Model </SectionTitle> <Paragraph position="0"> We study the problem of learning a multi-class classifier $f : X \rightarrow C$, where $X \subseteq \{0,1\}^N$, $C = \{c_1, \ldots, c_m\}$, and the number of classes $m$ is typically large, on the order of $10^2$-$10^3$. We address this problem using the Sequential Model (SM), in which simpler classifiers are sequentially used to filter subsets of $C$ out of consideration.</Paragraph> <Paragraph position="1"> The sequential model is formally defined as a tuple $(O, F, T)$. The ordering $O = (o_1, \ldots, o_n)$ determines the order in which the classifiers are learned and evaluated; for convenience we write $f_i$ for $f_{o_i}$. $F = \{f_1, \ldots, f_n\}$ is the set of classifiers used by the model, each $f_i$ defined over its own domain $X_i$, and $T = \{t_1, \ldots, t_n\}$ is a set of constant thresholds.</Paragraph> <Paragraph position="2"> Given $x \in X_i$ and a set $C_{i-1} \subseteq C$ of remaining candidate classes (with $C_0 = C$), the $i$th classifier outputs a probability distribution $P_i(c \mid x)$ over $C_{i-1}$ (the output of many classifiers can be viewed, after appropriate normalization, as a confidence measure that can be used as our $P_i$). The set of remaining candidates after the $i$th classification stage is determined by the threshold: $C_i = \{c \in C_{i-1} : P_i(c \mid x) > t_i\}$.</Paragraph> <Paragraph position="3"> The sequential process can be viewed as a multiplication of distributions. (Hinton, 2000) argues that a product of distributions (or &quot;experts&quot;, PoE) is an efficient way to make decisions in cases where several different constraints play a role, and is advantageous over additive models. In fact, due to the thresholding step, our model can be viewed as a selective PoE. The thresholding ensures that the SM has the following monotonicity property: $C_n \subseteq C_{n-1} \subseteq \cdots \subseteq C_1 \subseteq C_0 = C$; that is, as we evaluate the classifiers sequentially, confusion sets of smaller or equal size are considered.</Paragraph> <Paragraph position="4"> A desirable design goal for the SM is that, w.h.p., the classifiers have one-sided error (even at the price of rejecting fewer classes). That is, if $c^*$ is the true target, then we would like to have $c^* \in C_i$ at every stage $i$. The rest of this paper presents a concrete instantiation of the SM, and then provides a theoretical analysis of some of its properties (Sec. 5). This work does not address the question of how the structure of the SM - the decomposition and the ordering of the classifiers - is chosen.</Paragraph> </Section>
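To make the definition concrete, the following is a minimal sketch of SM evaluation. It assumes each stage classifier exposes a function returning a probability per remaining candidate; the names and interfaces are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, Dict, List, Sequence

# A stage classifier maps (example, current candidate set) to a probability per candidate.
StageClassifier = Callable[[object, List[str]], Dict[str, float]]

def sm_predict(x: object,
               classifiers: Sequence[StageClassifier],
               thresholds: Sequence[float],
               classes: List[str]) -> str:
    """Evaluate a sequential model: each stage filters the confusion set C_i."""
    candidates = list(classes)                      # C_0 = C
    scores = {c: 1.0 for c in candidates}           # running (selective) product of experts
    for f_i, t_i in zip(classifiers, thresholds):
        p_i = f_i(x, candidates)                    # P_i(c | x) over the current candidates
        scores = {c: scores[c] * p_i.get(c, 0.0) for c in candidates}
        # Thresholding step: C_i = {c in C_{i-1} : P_i(c | x) > t_i}
        candidates = [c for c in candidates if p_i.get(c, 0.0) > t_i]
        if len(candidates) <= 1:
            break
    # Final prediction: the most likely candidate under the accumulated product.
    return max(candidates or classes, key=lambda c: scores.get(c, 0.0))
```

In the POS-tagging instantiation of Section 4, the stages would be, for example, a capitalization classifier, a suffix classifier, and a contextual classifier.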
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Example: POS Tagging </SectionTitle> <Paragraph position="0"> This section describes a two-part POS tagging experiment in which we compare, under identical conditions, two classification models: an SM and a single classifier. Both are provided with the same input features, and the only difference between them is the model structure.</Paragraph> <Paragraph position="1"> In the first part, the comparison is done in the context of assigning POS tags to unknown words - words which were not seen during training, so the learner has no baseline knowledge about the possible POS tags they may take. This experiment emphasizes the accuracy advantage of using the SM during evaluation. The second part is done in the context of POS tagging of known words. It compares processing time as well as accuracy of assigning POS tags to known words (that is, the classifier utilizes knowledge about the possible POS tags the target word may take). This part exhibits a large reduction in training time using the SM over the more common one-vs-all method, while the accuracy of the two methods is almost identical.</Paragraph> <Paragraph position="2"> Two types of features - lexical features and contextual features - may be used when learning how to tag words for POS. Contextual features capture the information in the surrounding context and the word lemma, while the lexical features capture the morphology of the unknown word.</Paragraph> <Paragraph position="3"> Several issues make the POS tagging problem a natural one to study within the SM: (i) a relatively large number of classes (about 50); (ii) a natural decomposition of the feature space into contextual and lexical features; (iii) lexical knowledge (for unknown words) and the word lemma (for known words) provide, w.h.p., one-sided error (Mikheev, 1997).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 The Tagger Classifiers </SectionTitle> <Paragraph position="0"> The domain in our experiment is defined using a set of contextual and lexical features, all of which are computed relative to the target word $w$. For an unknown word, the baseline is the proper singular noun tag &quot;NNP&quot; for capitalized words and the common singular noun tag &quot;NN&quot; otherwise. (This feature is introduced only in some of the experiments.) The capitalization classifier has a two-valued input and can therefore create only two confusion sets. The suffix classifier has a bounded number of possible inputs (26 characters + 10 digits + 5 other symbols) and can therefore emit only a correspondingly bounded number of confusion sets. (Figure: Contextual and Lexical features in a Sequential Model.) The figure illustrates the SM that was used in the experiments. All the classifiers in the sequential model, as well as the single classifier, use the SNoW learning architecture (Roth, 1998) with the Winnow update rule. SNoW (Sparse Network of Winnows) is a multi-class classifier that is specifically tailored for learning in domains in which the potential number of features taking part in decisions is very large, but in which decisions actually depend on a small number of those features. SNoW works by learning a sparse network of linear functions over a pre-defined or incrementally learned feature space. SNoW has already been used successfully on several tasks in natural language processing (Roth, 1998; Roth and Zelenko, 1998; Golding and Roth, 1999; Punyakanok and Roth, 2001).</Paragraph>
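As an illustration of the kind of feature decomposition used here, the sketch below computes simple lexical and contextual feature sets for a target word. The specific features (suffixes, capitalization, a two-word window) are illustrative assumptions, not the exact feature set of the paper.

```python
import re
from typing import Dict, List

def lexical_features(word: str) -> Dict[str, object]:
    """Illustrative lexical (morphological) features of the target word."""
    return {
        "capitalized": word[:1].isupper(),
        "suffix2": word[-2:].lower(),
        "suffix3": word[-3:].lower(),
        "has_digit": bool(re.search(r"\d", word)),
        "has_hyphen": "-" in word,
    }

def contextual_features(sentence: List[str], i: int) -> Dict[str, str]:
    """Illustrative contextual features: the words surrounding position i."""
    def tok(j: int) -> str:
        return sentence[j] if 0 <= j < len(sentence) else "<PAD>"
    return {"w-2": tok(i - 2), "w-1": tok(i - 1), "w+1": tok(i + 1), "w+2": tok(i + 2)}

# Example: features for the unknown word "flurbing" in a short sentence.
sent = ["he", "was", "flurbing", "the", "data"]
print(lexical_features(sent[2]))
print(contextual_features(sent, 2))
```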
<Paragraph position="1"> Specifically, for each class label $c$ SNoW learns a function $f_c : X \rightarrow [0,1]$ that maps a feature-based representation $x$ of the input instance to a number $p_c(x) \in [0,1]$, which can be interpreted as the probability of $c$ being the class label corresponding to $x$. At prediction time, given $x \in X$, SNoW outputs $\hat{c} = \arg\max_{c \in C} p_c(x)$. (1)</Paragraph> <Paragraph position="3"> All functions - in our case one target node for each of the roughly 50 POS tags - reside over the same feature space, but can be thought of as autonomous functions (networks). That is, a given example is treated autonomously by each target subnetwork; an example labeled $c$ is considered a positive example for the function learned for $c$ and a negative example for the rest of the functions (target nodes).</Paragraph> <Paragraph position="4"> The network is sparse in that a target node need not be connected to all nodes in the input layer. For example, it is not connected to input nodes (features) that were never active with it in the same sentence.</Paragraph> <Paragraph position="5"> Although SNoW is used with all these different targets, the SM utilizes it by determining the confusion set dynamically. That is, in evaluation (prediction), the maximum in Eq. 1 is taken only over the currently applicable confusion set. Moreover, in training, a given example is used to train only the target networks that are in the currently applicable confusion set. That is, an example that is positive for target $c$ is viewed as positive for this target (if it is in the confusion set) and as negative for the other targets in the confusion set. All other targets do not see this example.</Paragraph> <Paragraph position="6"> The case of POS tagging of known words is handled in a similar way. In this case, all possible tags are known. In training, we record, for each word $w$, all POS tags with which it was tagged in the training corpus. During evaluation, whenever word $w$ occurs, it is tagged with one of these POS tags. That is, in evaluation, the confusion set consists only of those tags observed with the target word in training, and the maximum in Eq. 1 is taken only over these. This is always the case when using the contextual classifier (or the classifier that uses all the features), both in the SM and as a single classifier. In training, though, for the sake of this experiment, we treat this classifier differently depending on whether it is trained for the SM or as a single classifier. When trained as a single classifier (e.g., (Roth and Zelenko, 1998)), it uses each $c$-tagged example as a positive example for $c$ and a negative example for all other tags. On the other hand, the SM classifier is trained on a $c$-tagged example of word $w$ by using it as a positive example for $c$ and a negative example only for the tags in the effective confusion set, that is, those POS tags which have been observed as tags of $w$ in the training corpus.</Paragraph>
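A minimal sketch of the training regime just described (one-vs-confusion-set instead of one-vs-all); the learner interface and helper names are assumptions for illustration, not SNoW's actual API.

```python
from collections import defaultdict

def train_with_confusion_sets(examples, learners):
    """Update each target only on examples whose confusion set contains it.

    `examples` is an iterable of (features, word, gold_tag) triples; `learners[tag]`
    is any online binary learner exposing update(features, label).
    """
    examples = list(examples)

    # First pass: record the tags observed with each word; these are the
    # (known-word) confusion sets used during training and evaluation.
    tags_seen = defaultdict(set)
    for feats, word, gold in examples:
        tags_seen[word].add(gold)

    # Second pass: one-vs-confusion-set updates instead of one-vs-all.
    for feats, word, gold in examples:
        for tag in tags_seen[word]:
            # Positive update for the gold tag; negative updates only for the
            # competing tags in the confusion set. All other targets skip it.
            learners[tag].update(feats, label=(tag == gold))
```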
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Experimental Results </SectionTitle> <Paragraph position="0"> The data for the experiments was extracted from the Penn Treebank WSJ and Brown corpora. The test corpus is disjoint from the training corpus and contains a set of unknown words, that is, words that do not occur in the training corpus. (Numbers, the POS tag &quot;CD&quot;, are not counted among the unknown words.)</Paragraph> <Paragraph position="1"> Table 1: POS tagging of unknown words using contextual features (accuracy in percent). The contextual classifier uses only contextual features; &quot;contextual + baseline&quot; is the same classifier with the addition of the baseline feature (&quot;NNP&quot; or &quot;NN&quot;).</Paragraph> <Paragraph position="2"> Table 1 summarizes the results of the experiments with a single classifier that uses only contextual features. Notice that adding the baseline POS feature significantly improves the results, but that not much is gained over the baseline alone. The reason is that the baseline feature is almost always correct on the training data. For that reason, in the next experiments we do not use the baseline at all, since it could hide the phenomenon addressed. (In practice, one might want to use a more sophisticated baseline.)</Paragraph> <Paragraph position="4"> Table 2: POS tagging of unknown words using contextual and lexical features (accuracy in percent). The columns compare a classifier based only on contextual features, a single classifier that uses all the features, and the sequential model.</Paragraph> <Paragraph position="5"> Table 2 summarizes the results of the main experiment in this part. It exhibits the advantage of using the SM (columns 3 and 4) over a single classifier that makes use of the same feature set (column 2). In both cases, all features are used. In the single-classifier case, one classifier is trained on input that consists of all these features and chooses a label from among all class labels. In the SM, different classifiers are used sequentially - each using only part of the feature space and restricting the set of possible outcomes available to the next classifier in the sequence - and the final classifier chooses only from among those labels left as candidates.</Paragraph> <Paragraph position="10"> It is interesting to note that further improvement can be achieved, as shown in the rightmost column. Given that the last stage of this sequential model is identical to the single classifier that uses all the features, this shows the contribution of the filtering done in the first two stages. In addition, this result shows that the input spaces of the classifiers need not be disjoint.</Paragraph> <Paragraph position="11"> POS Tagging of Known Words. Essentially everyone who learns a POS tagger for known words makes use of a &quot;sequential model&quot; assumption during evaluation, by restricting the set of candidates as discussed in Sec. 4.1. The focus of this experiment is thus to investigate the advantage of the SM during training. In this case, a single (one-vs-all) classifier trains each tag against all other tags, while an SM classifier trains it only against the effective confusion set (Sec. 4.1). Table 3 compares the performance of the contextual classifier trained with the one-vs-all method to the same classifier trained the SM way.
The results are only for known words; the results of Brill's tagger (Brill, 1995) are presented for comparison.</Paragraph> <Paragraph position="12"> Table 3: POS tagging of known words using contextual features (accuracy in percent). One-vs-all denotes training in which an example serves as a positive example for the true tag and as a negative example for all other tags. SM denotes training in which an example serves as a positive example for the true tag and as a negative example only for a restricted set of tags determined by a previous classifier - here, a simple baseline restriction.</Paragraph> <Paragraph position="13"> While in principle (see Sec. 5) the SM should do better (and never worse) than the one-vs-all classifier, we believe that in this case the SM does not have a performance advantage because the classifiers work in a very high dimensional feature space, which allows the one-vs-all classifier to find a separating hyperplane that separates the positive examples from many different kinds of negative examples (even irrelevant ones).</Paragraph> <Paragraph position="14"> However, the key advantage of the SM in this case is the significant decrease in computation time, both in training and in evaluation. Table 4 shows that in the POS tagging task, training using the SM is 6 times faster than with the one-vs-all method and about 3000 times faster than Brill's learner. In addition, the evaluation time of our tagger was about half that of Brill's tagger.</Paragraph> <Paragraph position="15"> Table 4: Training and evaluation time for POS tagging of known words using contextual features (in CPU seconds). Train: training time over the training sentences. Brill's learner was interrupted after 12 days of training (the default threshold was used). Test: average number of seconds to evaluate a single sentence. All runs were done on the same machine.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 The Sequential Model: Theoretical Justification </SectionTitle> <Paragraph position="0"> In this section, we discuss some of the theoretical aspects of the SM and explain some of its advantages. In particular, we discuss the following issues: 1. Domain Decomposition: When the input feature space can be decomposed, we show that it is advantageous to do so and to learn several classifiers, each on a smaller domain.</Paragraph> <Paragraph position="1"> 2. Range Decomposition: Reducing the confusion set size is advantageous both in training and in testing the classifiers.</Paragraph> <Paragraph position="2"> (a) Test: A smaller confusion set is shown to yield a smaller expected error.</Paragraph> <Paragraph position="3"> (b) Training: Under the assumption that a small confusion set (determined dynamically by previous classifiers in the sequence) is used when a classifier is evaluated, it is shown that training the classifiers this way is advantageous.</Paragraph> <Paragraph position="4"> 3. Expressivity: The SM can be viewed as a way to generate an expressive classifier by building on a number of simpler ones. We argue that the SM way of generating an expressive classifier has advantages over other ways of doing it, such as a decision tree (Sec. 5.3).</Paragraph> <Paragraph position="5"> In addition, the SM has several significant computational advantages both in training and in testing, since it only needs to consider a subset of the set of candidate class labels. We will not discuss these issues in detail here.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Decomposing the Domain </SectionTitle> <Paragraph position="0"> Decomposing the domain is not an essential part of the SM; it is possible that all the classifiers used actually use the same domain. As we show below, though, when a decomposition is possible, it is advantageous to use it.</Paragraph> <Paragraph position="1"> We show below that when it is possible to decompose the domain into subsets that are conditionally independent given the class label, the SM with classifiers defined on these subsets is as accurate as the optimal single classifier. (In fact, this is shown for a pure product of simpler classifiers; the SM uses a selective product.) In the following we assume that $X_1, \ldots, X_n$ provide a decomposition of the domain $X$ (Sec. 3) and that $x = (x_1, \ldots, x_n) \in X_1 \times \cdots \times X_n$. By conditional independence we mean that $P(x_1, \ldots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$ for every class $c$. Then $$\arg\max_{c \in C} P(c \mid x_1, \ldots, x_n) = \arg\max_{c \in C} \frac{P(x_1, \ldots, x_n \mid c)\, P(c)}{P(x_1, \ldots, x_n)} = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(x_i \mid c) = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} \frac{P(c \mid x_i)\, P(x_i)}{P(c)} = \arg\max_{c \in C} \frac{\prod_{i=1}^{n} P(c \mid x_i)}{P(c)^{n-1}}.$$ The first equality is the Bayes rule; the denominator does not depend on $c$ and can therefore be treated as a constant. The second equality is derived by applying the independence assumption. The third is derived by applying the Bayes rule to each term separately, rewriting $P(x_i \mid c)$ in terms of $P(c \mid x_i)$; the factors $P(x_i)$ do not depend on $c$ and are dropped in the final step.</Paragraph> <Paragraph position="2"> We note that although the conditional independence assumption is a strong one, it is a reasonable assumption in many NLP applications; in particular, when cross-modality information is used, this assumption typically holds for a decomposition that is done across modalities. For example, in POS tagging, lexical information is often conditionally independent of contextual information given the true POS. (E.g., assume that the word is a gerund; then the context is independent of the &quot;ing&quot; word ending.) In addition, decomposing the domain has significant advantages from the learning theory point of view (Roth, 1999). Learning over domains of lower dimensionality implies better generalization bounds or, equivalently, more accurate classifiers for a fixed-size training set.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Decomposing the Range </SectionTitle> <Paragraph position="0"> The SM attempts to reduce the size of the candidate set. We justify this by considering two cases: (i) Test: we argue that prediction among a smaller set of classes has advantages over prediction among a large set of classes; (ii) Training: we argue that it is advantageous to ignore irrelevant examples.</Paragraph> <Paragraph position="1"> 5.2.1 Decomposing the range during Test. The following discussion formalizes the intuition that a smaller confusion set is preferred. Let $f : X \rightarrow C$ be the true target function and $P(c_j \mid x)$ the probability assigned by the final classifier to class $c_j \in C$ given example $x \in X$. Assuming that the prediction is done, naturally, by choosing the most likely class label, the expected error when using a confusion set of size $k$ is $$E_k = \sum_{x \in X} P(x)\, \mathbf{1}\Big[\arg\max_{1 \le j \le k} P(c_j \mid x) \neq f(x)\Big].$$ Claim 1 shows that reducing the size of the confusion set can only help, that is, $E_{k-1} \le E_k$; this holds under the assumption that the true class label is not eliminated from consideration by downstream classifiers, that is, under the one-sided error assumption.</Paragraph>
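As a sanity check of the argument behind Claim 1, the following toy experiment (not from the paper; all names and data are made up) compares the error of argmax prediction over the full class set with the error over filtered candidate sets that always retain the gold label:

```python
import random

def expected_error(examples, scores, candidate_sets):
    """Empirical error of argmax prediction restricted to each example's candidate set."""
    wrong = 0
    for x, gold in examples:
        pred = max(candidate_sets[x], key=lambda c: scores[x][c])
        wrong += (pred != gold)
    return wrong / len(examples)

# Toy check of the one-sided-error argument: removing only non-gold candidates
# from the confusion set can never increase the error.
random.seed(0)
classes = list(range(10))
examples = [(i, random.choice(classes)) for i in range(1000)]
scores = {x: {c: random.random() for c in classes} for x, _ in examples}

full_sets = {x: classes for x, _ in examples}
filtered_sets = {x: [gold] + random.sample([c for c in classes if c != gold], 3)
                 for x, gold in examples}

assert expected_error(examples, scores, filtered_sets) <= expected_error(examples, scores, full_sets)
```

If the gold label were allowed to be dropped (violating one-sided error), the inequality could fail, which is exactly why the design goal of Sec. 3 matters.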
<Paragraph position="2"> Moreover, it is easy to see that the proof of Claim 1 allows us to relax the one-sided error assumption and assume instead that the previous classifiers err only with sufficiently small probability. We will assume now, as suggested by the previous discussion, that in the evaluation stage the smallest possible set of candidates will be considered by each classifier. Based on this assumption, Claim 2 shows that training this way is advantageous; that is, utilizing the SM in training yields a better classifier.</Paragraph> <Paragraph position="3"> Let $A$ be a learning algorithm that is trained to minimize the expected loss $$E = \sum_{x} P(x)\, L(h(x), y),$$ where $x$ is an example, $y \in \{-1, +1\}$ is the true class, $h$ is the hypothesis, $L$ is a loss function, and $P(x)$ is the probability of seeing example $x$ under the example distribution (see (Allwein et al., 2000)). (Notice that in this section we are using a general loss function $L$; we could use, in particular, the binary loss function used in Sec. 5.2.) We phrase and prove the next claim, w.l.o.g., for the case of 2 vs. $m$ class labels. Proof. Assume that the algorithm $A$, when trained on a sample $S$, produces a hypothesis that minimizes the empirical error over $S$.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Expressivity </SectionTitle> <Paragraph position="0"> The SM is a decision process that is conceptually similar to a decision tree process (Rasoul and Landgrebe, 1991; Mitchell, 1997), especially if one allows more general classifiers in the decision tree nodes. In this section we show that (i) the SM can express any decision tree (DT), and (ii) the SM is more compact than a decision tree, even when the DT makes use of more expressive internal nodes (Murthy et al., 1994).</Paragraph> <Paragraph position="1"> The next theorem shows that for a fixed set of functions (queries) over the input features, any binary decision tree can be represented as an SM. Extending the proof beyond binary decision trees is straightforward.</Paragraph> <Paragraph position="2"> Theorem 3. Let $T$ be a binary decision tree with $n$ internal nodes. Then there exists a sequential model $S$ such that $S$ and $T$ have the same size and produce the same predictions.</Paragraph> <Paragraph position="3"> Proof (Sketch): Given a decision tree $T$ on $n$ nodes, we show how to construct an SM that produces equivalent predictions. Assign a classifier to each internal node of $T$, and order the classifiers such that a classifier that is assigned to node $v$ is processed before any classifier that was assigned to any of the children of $v$. Define each classifier $f_i$ that was assigned to node $v \in T$ to have an influence on the outcome iff node $v$ lies on the evaluation path of $T$; then the predictions of $T$ and $S$ are identical. This completes the proof and shows that the resulting SM is of equivalent size to the original decision tree.</Paragraph> <Paragraph position="4"> We note that given an SM, it is also relatively easy (details omitted) to construct a decision tree that produces the same decisions as the final classifier of the SM. However, the simple construction results in a decision tree that is exponentially larger than the original SM.
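To make the Theorem 3 construction concrete, here is a small illustrative sketch (assumed data structures, not the paper's construction): each internal node of a binary decision tree becomes one filtering stage, parents are ordered before children, and a stage influences the outcome only when its node lies on the evaluation path.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set

@dataclass
class Node:
    test: Optional[Callable[[dict], bool]] = None  # None for a leaf
    label: Optional[str] = None                    # class label at a leaf
    left: Optional["Node"] = None                  # followed when test(x) is False
    right: Optional["Node"] = None                 # followed when test(x) is True

def leaf_labels(node: Node) -> Set[str]:
    if node.test is None:
        return {node.label}
    return leaf_labels(node.left) | leaf_labels(node.right)

def tree_to_sequential_model(root: Node) -> List[Callable[[dict, Set[str]], Set[str]]]:
    """One filtering stage per internal node; parents are ordered before children."""
    stages = []
    queue = [(root, lambda x: True)]   # (node, predicate: is this node on x's path?)
    while queue:
        node, reached = queue.pop(0)
        if node.test is None:          # leaves contribute no classifier
            continue
        L, R = leaf_labels(node.left), leaf_labels(node.right)

        def stage(x, candidates, node=node, reached=reached, L=L, R=R):
            # The classifier influences the outcome only if its node lies on the
            # evaluation path of x; otherwise it leaves the confusion set alone.
            if not reached(x):
                return candidates
            return candidates & (R if node.test(x) else L)

        stages.append(stage)
        queue.append((node.left,  lambda x, n=node, r=reached: r(x) and not n.test(x)))
        queue.append((node.right, lambda x, n=node, r=reached: r(x) and n.test(x)))
    return stages

def sm_classify(x: dict, stages, all_labels: Set[str]) -> str:
    candidates = set(all_labels)
    for stage in stages:
        candidates = stage(x, candidates)
    return next(iter(candidates))      # exactly one label survives

# Depth-2 example: the SM reproduces the tree's predictions.
tree = Node(test=lambda x: x["cap"],
            left=Node(label="NN"),
            right=Node(test=lambda x: x["ing"], left=Node(label="NNP"), right=Node(label="VBG")))
stages = tree_to_sequential_model(tree)
print(sm_classify({"cap": True, "ing": True}, stages, leaf_labels(tree)))  # -> VBG
```

The converse direction (simulating an SM by a decision tree) requires, in general, an exponentially larger tree, which is the content of Theorem 4.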
Theorem 4 shows that this difference in expressivity is inherent.</Paragraph> <Paragraph position="8"> Theorem 4. Let $n$ be the number of classifiers in a sequential model $S$ and also the number of internal nodes in a decision tree $T$. Let $m$ be the number of classes in the output of $S$ and also the maximum degree of the internal nodes of $T$. Denote by $N(T)$ and $N(S)$ the number of functions representable by $T$ and $S$, respectively. Then, when $m \ll n$, $N(S)$ is exponentially larger than $N(T)$. Proof (Sketch): The proof follows by counting the number of functions that can be represented using a decision tree with $n$ internal nodes (Wilf, 1994), and the number of functions that can be represented using a sequential model with $n$ intermediate classifiers. Given the exponential gap, it follows that one may need an exponentially large decision tree to represent a predictor equivalent to an SM of size $n$.</Paragraph> </Section> </Section> </Paper>