<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1023"> <Title>Weakly Supervised Natural Language Learning Without Redundant Views</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Weakly Supervised Algorithms </SectionTitle>
<Paragraph position="0"> In this section, we give a high-level description of our implementation of the three weakly supervised algorithms that we use in our comparison, namely, co-training, self-training, and EM.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Co-Training </SectionTitle>
<Paragraph position="0"> Co-training (Blum and Mitchell, 1998) is a multi-view weakly supervised algorithm that trains two classifiers that can help augment each other's labeled data using two separate but redundant views of the data. Each classifier is trained using one view of the data and predicts the labels for all instances in the data pool, which consists of a randomly chosen subset of the unlabeled data. Each then selects its most confident predictions from the pool and adds the corresponding instances with their predicted labels to the labeled data while maintaining the class distribution in the labeled data.</Paragraph>
<Paragraph position="1"> The number of instances to be added to the labeled data by each classifier at each iteration is limited by a pre-specified growth size to ensure that only the instances that have a high probability of being assigned the correct label are incorporated. The data pool is refilled with instances drawn from the unlabeled data and the process is repeated for several iterations. During testing, each classifier makes an independent decision for a test instance and the decision associated with the higher confidence is taken to be the final prediction for the instance.</Paragraph> </Section>
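The loop just described can be summarized in code. The following is only a minimal sketch, not the system used in the paper: it assumes scikit-learn-style naive Bayes classifiers over two binary-encoded feature views, omits the class-distribution balancing step, and the names (cotrain, cotrain_predict, the pool and growth defaults) are illustrative.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

def cotrain(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
            pool_size=500, growth_size=10, iterations=50, seed=0):
    rng = np.random.default_rng(seed)
    y_lab = np.asarray(y_lab)
    unlabeled = list(range(len(X1_unlab)))
    pool = []
    for _ in range(iterations):
        # refill the data pool with a random subset of the unlabeled data
        while len(pool) < pool_size and unlabeled:
            pool.append(unlabeled.pop(rng.integers(len(unlabeled))))
        h1 = BernoulliNB().fit(X1_lab, y_lab)
        h2 = BernoulliNB().fit(X2_lab, y_lab)
        for h, X_view in ((h1, X1_unlab), (h2, X2_unlab)):
            if not pool:
                break
            probs = h.predict_proba(X_view[pool])
            # each classifier moves its growth_size most confident pool instances,
            # together with their predicted labels, into the labeled data
            chosen = np.argsort(-probs.max(axis=1))[:growth_size]
            picked = [pool[i] for i in chosen]
            labels = h.classes_[probs.argmax(axis=1)[chosen]]
            X1_lab = np.vstack([X1_lab, X1_unlab[picked]])
            X2_lab = np.vstack([X2_lab, X2_unlab[picked]])
            y_lab = np.concatenate([y_lab, labels])
            pool = [p for p in pool if p not in set(picked)]
    return h1, h2

def cotrain_predict(h1, h2, x1, x2):
    # at test time, the more confident of the two classifiers makes the final decision
    p1, p2 = h1.predict_proba(x1)[0], h2.predict_proba(x2)[0]
    return h1.classes_[p1.argmax()] if p1.max() >= p2.max() else h2.classes_[p2.argmax()]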
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Self-Training </SectionTitle>
<Paragraph position="0"> Self-training is a single-view weakly supervised algorithm that has appeared in various forms in the literature.</Paragraph>
<Paragraph position="1"> The version of the algorithm that we consider here is a variation of the one presented in Banko and Brill (2001).</Paragraph>
<Paragraph position="2"> Initially, we use bagging (Breiman, 1996) to train a committee of classifiers using the labeled data. Specifically, each classifier is trained on a bootstrap sample created by randomly sampling instances with replacement from the labeled data until the size of the bootstrap sample is equal to that of the labeled data. Then each member of the committee (or bag) predicts the labels of all unlabeled data. The algorithm selects an unlabeled instance for addition to the labeled data if and only if all bags agree upon its label. This ensures that only the unlabeled instances that have a high probability of being assigned the correct label will be incorporated into the labeled set. The above steps are repeated until all unlabeled data is labeled or a fixed point is reached. Following Breiman (1996), we perform simple majority voting using the committee to predict the label of a test instance.</Paragraph> </Section>
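A compact sketch of this committee-based variant follows. It is illustrative rather than the paper's implementation: it assumes binary labels encoded as 0/1, scikit-learn's BernoulliNB over binarized features, and a cap on the number of rounds.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

def self_train(X_lab, y_lab, X_unlab, n_bags=7, max_rounds=100, seed=0):
    rng = np.random.default_rng(seed)
    y_lab = np.asarray(y_lab)
    remaining = np.arange(len(X_unlab))
    bags = []
    for _ in range(max_rounds):
        # train one classifier per bootstrap sample of the current labeled data
        bags = []
        for _ in range(n_bags):
            idx = rng.integers(0, len(X_lab), size=len(X_lab))
            bags.append(BernoulliNB().fit(X_lab[idx], y_lab[idx]))
        if len(remaining) == 0:
            break                      # all unlabeled data has been labeled
        preds = np.array([b.predict(X_unlab[remaining]) for b in bags])
        unanimous = np.all(preds == preds[0], axis=0)
        if not unanimous.any():
            break                      # fixed point: no instance receives a unanimous label
        # add unanimously labeled instances, with their predicted labels, to the labeled set
        X_lab = np.vstack([X_lab, X_unlab[remaining[unanimous]]])
        y_lab = np.concatenate([y_lab, preds[0][unanimous]])
        remaining = remaining[~unanimous]
    return bags

def committee_vote(bags, X_test):
    # simple majority voting over the committee for test instances
    preds = np.array([b.predict(X_test) for b in bags]).astype(int)
    return np.array([np.bincount(col).argmax() for col in preds.T])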
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 EM </SectionTitle>
<Paragraph position="0"> The use of EM as a single-view weakly supervised classification algorithm is introduced in Nigam et al. (2000).</Paragraph>
<Paragraph position="1"> Like the classic unsupervised EM algorithm (Dempster et al., 1977), weakly supervised EM assumes a parametric model of data generation. The labels of the unlabeled data are treated as missing data. The goal is to find a model such that the posterior probability of its parameters is locally maximized given both the labeled data and the unlabeled data.</Paragraph>
<Paragraph position="2"> Initially, the algorithm estimates the model parameters by training a probabilistic classifier on the labeled instances. Then, in the E-step, all unlabeled data is probabilistically labeled by the classifier. In the M-step, the parameters of the generative model are re-estimated using both the initially labeled data and the probabilistically labeled data to obtain a maximum a posteriori (MAP) hypothesis. The E-step and the M-step are repeated for several iterations. The resulting model is then used to make predictions for the test instances.</Paragraph> </Section> </Section>
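As a concrete illustration of the E-step/M-step loop (again only a sketch, not the paper's implementation), the probabilistic labels can be passed to scikit-learn's multinomial naive Bayes through sample weights; the default Laplace smoothing stands in for the MAP estimate, and count-valued features are assumed.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def weakly_supervised_em(X_lab, y_lab, X_unlab, n_iter=7):
    classes = np.unique(y_lab)
    model = MultinomialNB().fit(X_lab, y_lab)      # initial parameters from the labeled data
    for _ in range(n_iter):
        # E-step: probabilistically label all unlabeled data
        post = model.predict_proba(X_unlab)        # columns follow model.classes_ == classes
        # M-step: re-estimate parameters from labeled + probabilistically labeled data;
        # each unlabeled instance appears once per class, weighted by its posterior
        X_all = np.vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate([y_lab] + [np.full(len(X_unlab), c) for c in classes])
        w_all = np.concatenate([np.ones(len(X_lab))] + [post[:, k] for k in range(len(classes))])
        model = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return model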
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Machine Learning Framework for Coreference Resolution </SectionTitle>
<Paragraph position="0"> Noun phrase coreference resolution refers to the problem of determining which noun phrases (NPs) refer to each real-world entity mentioned in a document. In this section, we give an overview of the coreference resolution system to which the weakly supervised algorithms described in the previous section are applied.</Paragraph>
<Paragraph position="1"> The framework underlying the system is a standard combination of classification and clustering employed by supervised learning approaches (e.g. Ng and Cardie (2002); Soon et al. (2001)). Specifically, coreference resolution is recast as a classification task, in which a pair of NPs is classified as co-referring or not based on constraints that are learned from an annotated corpus. Training instances are generated by pairing each NP with each of its preceding NPs in the document. The classification associated with a training instance is one of COREFERENT or NOT COREFERENT depending on whether the NPs co-refer in the text. A separate clustering mechanism then coordinates the possibly contradictory pairwise classifications and constructs a partition on the set of NPs.</Paragraph>
<Paragraph position="2"> We perform the experiments in this paper using our coreference resolution system (see Ng and Cardie (2002)). For the sake of completeness, we include the descriptions of the 25 features employed by the system in Table 1. Linguistically, the features can be divided into five groups: lexical, grammatical, semantic, positional, and others. However, we use naive Bayes rather than decision tree induction as the underlying learning algorithm to train a coreference classifier, simply because (1) it provides a generative model assumed by EM and hence facilitates comparison between different approaches and (2) it is more robust to the skewed class distributions inherent in coreference data sets than decision tree learners. When the coreference system is used within the weakly supervised setting, a weakly supervised algorithm bootstraps the coreference classifier from the given labeled and unlabeled data rather than from a much larger set of labeled instances.</Paragraph>
<Paragraph position="3"> Table 1: Feature set for the coreference system, used to generate an instance representing two NPs, NP_i and NP_j, in document D, where NP_i precedes NP_j. Non-relational features test some property P of one of the NPs under consideration and take on a value of YES or NO depending on whether P holds. Relational features test whether some property P holds for the NP pair under consideration and indicate whether the NPs are COMPATIBLE or INCOMPATIBLE w.r.t. P; a value of NOT APPLICABLE is used when property P does not apply.
Grammatical:
NUMBER - C if the NP pair agree in number; I if they disagree; NA if number information for one or both NPs cannot be determined.
GENDER - C if the NP pair agree in gender; I if they disagree; NA if gender information for one or both NPs cannot be determined.
ANIMACY - C if the NPs match in animacy; else I.
APPOSITIVE - C if the NPs are in an appositive relationship; else I.
PREDNOM - C if the NPs form a predicate nominal construction; else I.
BINDING - I if the NPs violate conditions B or C of the Binding Theory; else C.
CONTRAINDICES - I if the NPs cannot be co-indexed based on simple heuristics; else C. For instance, two non-pronominal NPs separated by a preposition cannot be co-indexed.
SPAN - I if one NP spans the other; else C.
MAXIMALNP - I if both NPs have the same maximal NP projection; else C.
SYNTAX - I if the NPs have incompatible values for the BINDING, CONTRAINDICES, SPAN or MAXIMALNP constraints; else C.
INDEFINITE - I if NP_j is an indefinite and not appositive; else C.
PRONOUN - I if NP_i is a pronoun and NP_j is not; else C.
EMBEDDED_1 - Y if NP_i is an embedded noun; else N.
TITLE - I if one or both of the NPs is a title; else C.
Semantic:
WNCLASS - C if the NPs have the same WordNet semantic class; I if they don't; NA if the semantic class information for one or both NPs cannot be determined.
ALIAS - C if one NP is an alias of the other; else I.
Positional:
SENTNUM - Distance between the NPs in terms of the number of sentences.
Others:
PRO_RESOLVE - C if NP_j is a pronoun and NP_i is its antecedent according to a naive pronoun resolution algorithm; else I.</Paragraph>
<Paragraph position="4"> We conclude this section by noting that view factorization is a non-trivial task for coreference resolution. For many lexical tagging problems such as part-of-speech tagging, views can be drawn naturally from the left-hand and right-hand context. For other tasks such as named entity classification, views can be derived from features inside and outside the phrase under consideration (Collins and Singer, 1999). Unfortunately, neither of these options is possible for coreference resolution. We will explore several heuristic methods for view factorization in the next section.</Paragraph> </Section>
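To make the instance-creation and clustering steps of this framework concrete, here is a schematic sketch, not the paper's code: extract_features and pair_is_coreferent are hypothetical callables, gold chains are assumed to be given as sets of NP indices, and positive pairwise decisions are simply merged transitively with union-find, which is just one way to coordinate contradictory pairwise classifications into a partition.

def make_instances(nps, extract_features, gold_chains=None):
    """Pair each NP with each of its preceding NPs; label from gold chains when given."""
    instances = []
    for j in range(1, len(nps)):
        for i in range(j):
            label = None
            if gold_chains is not None:
                label = int(any({i, j} <= chain for chain in gold_chains))
            instances.append((i, j, extract_features(nps[i], nps[j]), label))
    return instances

def cluster(nps, pair_is_coreferent):
    """Build a partition of the NPs by merging every pair classified as COREFERENT."""
    parent = list(range(len(nps)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for j in range(1, len(nps)):
        for i in range(j):
            if pair_is_coreferent(nps[i], nps[j]):
                parent[find(i)] = find(j)
    groups = {}
    for k in range(len(nps)):
        groups.setdefault(find(k), []).append(nps[k])
    return list(groups.values())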
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Evaluation </SectionTitle>
<Paragraph position="0"> In this section, we empirically test our hypothesis that single-view weakly supervised algorithms can potentially outperform their multi-view counterparts for problems without a natural feature split.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Experimental Setup </SectionTitle>
<Paragraph position="0"> To ensure a fair comparison of the weakly supervised algorithms, the experiments are designed to determine the best parameter setting of each algorithm (in terms of its effectiveness in improving performance) for the data sets we investigate. Specifically, we keep the parameters common to all three weakly supervised algorithms (i.e. the labeled and unlabeled data) constant and vary the algorithm-specific parameters, as described below.</Paragraph>
<Paragraph position="1"> Evaluation. We use the MUC-6 (1995) and MUC-7 (1998) coreference data sets for evaluation. The training set is composed of 30 &quot;dry run&quot; texts, one of which is selected to be the annotated text; the remaining 29 texts are used as unannotated data. For MUC-6, 3486 training instances are generated from 84 NPs in the annotated text. For MUC-7, 3741 training instances are generated from 87 NPs. The unlabeled data is composed of 488173 instances and 478384 instances for the MUC-6 and MUC-7 data sets, respectively. Testing is performed by applying the bootstrapped coreference classifier and the clustering algorithm described in section 3 on the 20-30 &quot;formal evaluation&quot; texts for each of the MUC-6 and MUC-7 data sets.</Paragraph>
<Paragraph position="2"> Co-training parameters. The co-training parameters are set as follows.</Paragraph>
<Paragraph position="3"> Views. We tested three pairs of views. Table 2 reproduces the 25 features of the coreference system and shows the views we employ. Specifically, the three view pairs are generated by the following methods.</Paragraph>
<Paragraph position="4"> Mueller et al.'s heuristic method. Starting from two empty views, the iterative algorithm selects for each view the feature whose addition maximizes the performance of the respective view on the labeled data at each iteration.3 This method produces the view pair V1 and V2 in Table 2 for the MUC-6 data set. A different view pair is produced for MUC-7.</Paragraph>
<Paragraph position="5"> Random splitting of features into views. Starting from two empty views, an iterative algorithm that randomly chooses a feature for each view at each step is used to split the feature set. The resulting view pair V3 and V4 is used for both the MUC-6 and MUC-7 data sets.</Paragraph>
<Paragraph position="6"> Splitting of features according to the feature type. Specifically, one view comprises the lexico-syntactic features and the other the remaining ones. This approach produces the view pair V5 and V6, which is used for both data sets.</Paragraph>
<Paragraph position="7"> Pool size. We tested pool sizes of 500, 1000, 5000.</Paragraph>
<Paragraph position="8"> Growth size. We tested values of 10, 50, 100, 200, 250.</Paragraph>
<Paragraph position="9"> 3 Space limitation precludes a detailed description of this method. See Mueller et al. (2002) for details.</Paragraph>
<Paragraph position="10"> Table 2: Feature views employed by the coreference system. Column 1 lists the 25 features shown in Table 1. Columns 2-7 show three different pairs of views that we have attempted for co-training coreference classifiers.</Paragraph>
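The second and third view-construction methods are simple enough to sketch. The feature names and the lexico-syntactic grouping used below are illustrative only; the paper's exact V3-V6 splits are given in Table 2.

import random

# hypothetical grouping; see Table 1 for the full feature inventory
LEXICO_SYNTACTIC = {"APPOSITIVE", "PREDNOM", "BINDING", "CONTRAINDICES",
                    "SPAN", "MAXIMALNP", "SYNTAX", "PRONOUN", "EMBEDDED_1"}

def random_split(features, seed=0):
    """Deal the features into two views at random (cf. view pair V3/V4)."""
    feats = list(features)
    random.Random(seed).shuffle(feats)
    return feats[0::2], feats[1::2]

def split_by_type(features):
    """One view gets the lexico-syntactic features, the other the rest (cf. V5/V6)."""
    v1 = [f for f in features if f in LEXICO_SYNTACTIC]
    v2 = [f for f in features if f not in LEXICO_SYNTACTIC]
    return v1, v2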
<Paragraph position="11"> Number of co-training iterations. We monitored performance on the test data at every 10 iterations of co-training and ran the algorithm until performance stabilized.</Paragraph>
<Paragraph position="12"> Self-training parameters. Given the labeled and unlabeled data, self-training requires only the specification of the number of bags. We tested all odd numbers of bags between 1 and 25.</Paragraph>
<Paragraph position="13"> EM parameters. Given the labeled and unlabeled data, EM has only one parameter -- the number of iterations. We ran EM to convergence and kept track of its test set performance at every iteration.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Results and Discussion </SectionTitle>
<Paragraph position="0"> Results are shown in Table 3, where performance is reported in terms of recall, precision, and F-measure using the model-theoretic MUC scoring program (Vilain et al., 1995). The baseline coreference system, which is trained only on the labeled document using naive Bayes, achieves an F-measure of 55.5 and 43.8 on the MUC-6 and MUC-7 data sets, respectively.</Paragraph>
<Paragraph position="1"> The results shown in row 2 of Table 3 correspond to the best F-measure scores achieved by co-training for the two data sets based on co-training runs that comprise all of the parameter combinations described in the previous subsection. The parameter settings with which the best results are obtained are also shown in the table. To get a better picture of the behavior of co-training, we present the learning curve for the co-training run that gives rise to the best F-measure for the MUC-6 data set in Figure 1. The horizontal (dotted) line shows the performance of the baseline system, which achieves an F-measure of 55.5, as described above. As co-training progresses, F-measure peaks at iteration 220 and then gradually drops below that of the baseline after iteration 570.</Paragraph>
<Paragraph position="2"> Figure 1: Learning curve of co-training (pool size = 5000, growth size = 50) for the MUC-6 data set.</Paragraph>
<Paragraph position="3"> Although co-training produces substantial improvements over the baseline at its best parameter settings, a closer examination of our results reveals that they corroborate previous findings: the algorithm is sensitive not only to the number of iterations, but to other input parameters such as the pool size and the growth size as well (Nigam and Ghani, 2000; Pierce and Cardie, 2001). The lack of a principled method for determining these parameters in a weakly supervised setting where labeled data is scarce remains a serious disadvantage for co-training.</Paragraph>
<Paragraph position="4"> Self-training results are shown in row 3 of Table 3: self-training performs substantially better than both the baseline and co-training for both data sets. In contrast to co-training, however, self-training is relatively insensitive to its input parameter. Figure 2 shows the fairly consistent performance of self-training with seven or more bags for the MUC-6 data set. We observe similar trends for the MUC-7 data set. These results are consistent with empirical studies of bagging across a variety of classification tasks where seven to 25 bags are deemed sufficient (Breiman, 1996).</Paragraph>
<Paragraph position="5"> Figure 2: Performance of self-training for the MUC-6 data set.</Paragraph>
<Paragraph position="6"> To gain a deeper insight into the behavior of self-training, we plot the learning curve for self-training using 7 bags in Figure 3, again for the MUC-6 data set. At iteration 0 (i.e. before any unlabeled data is incorporated), the F-measure score achieved by self-training is higher than that of the baseline system (58.5 vs. 55.5). The observed difference is due to voting within the self-training algorithm. Voting has proved to be an effective technique for improving the accuracy of a classifier when training data is scarce by reducing the variance associated with a particular training corpus (Breiman, 1996). After the first iteration, there is a rapid increase in F-measure, which is accompanied by large gains in precision and smaller drops in recall. These results are consistent with our intuition regarding self-training: at each iteration the algorithm incorporates only instances whose label it is most confident about into the labeled data, thereby ensuring that precision will increase.4</Paragraph>
<Paragraph position="7"> Figure 3: Learning curve of self-training (7 bags) for the MUC-6 data set.</Paragraph>
<Paragraph position="8"> 4 When tackling the task of confusion set disambiguation, Banko and Brill (2001) observe only modest gains from self-training by bootstrapping from a seed corpus of one million words. We speculate that a labeled data set of this size can possibly enable them to train a reasonably good classifier with which self-training can only offer marginal benefits, but the relationship between the behavior of self-training and the size of the seed (labeled) corpus remains to be shown.</Paragraph>
<Paragraph position="9"> As we can see from Table 3, the recall level achieved by co-training is much lower than that of self-training. This is an indication that each co-training view is insufficient to learn the concept: the feature split limits any interaction of features in different views that might produce better recall. Overall, these results provide evidence that self-training is a better alternative to co-training for weakly supervised learning for problems such as coreference resolution where no natural feature split exists.</Paragraph>
<Paragraph position="10"> On the other hand, EM only gives rise to modest performance gains over the baseline system, as we can see from row 4 of Table 3. The performance of EM depends in part on the correctness of the underlying generative model (Nigam et al., 2000), which in our case is naive Bayes. In this model, an instance with n feature values (f_1, f_2, ..., f_n) is created by first choosing the class c with prior probability P(c) and then generating each available feature f_k with probability P(f_k | c) independently, under the assumption that the feature values are conditionally independent given the class. As a result, model correctness is adversely affected by redundant features, which clearly invalidate the conditional independence assumption. In fact, naive Bayes is known to be bad at handling redundant features (Langley and Sage, 1994).</Paragraph>
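Spelled out, with c denoting the class and f_1, ..., f_n the feature values, the generative model just described factors the joint probability as

    P(c, f_1, \ldots, f_n) = P(c) \prod_{k=1}^{n} P(f_k \mid c),

so redundant, mutually dependent features directly violate the independence of the P(f_k | c) terms.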
<Paragraph position="11"> We hypothesize that the presence of redundant features causes the generative model and hence EM to perform poorly. Although self-training depends on the same model, it only makes use of the binary decisions returned by the model and is therefore more robust to the naive Bayes assumptions, as reflected in its fairly impressive empirical performance.5 In contrast, the fact that EM relies on the probability estimates of the model makes it more sensitive to the correctness of the model.</Paragraph>
<Paragraph position="12"> 5 It is possible for naive Bayes classifiers to return optimal classifications even if the conditional independence assumption is violated. See Domingos and Pazzani (1997) for an analysis.</Paragraph> </Section> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Meta-Bootstrapping with Feature Selection </SectionTitle>
<Paragraph position="0"> If our hypothesis regarding the presence of redundant features were correct, then feature selection could result in an improved generative model, which could in turn improve the performance of weakly supervised EM. This section discusses a wrapper-based feature selection method for EM.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 A Two-Tiered Bootstrapping Algorithm </SectionTitle>
<Paragraph position="0"> We now describe the FS-EM algorithm for boosting the performance of weakly supervised algorithms via feature selection. Although named after EM, the algorithm as described is potentially applicable to all single-view weakly supervised algorithms. FS-EM takes as input a supervised learner, a single-view weakly supervised learner, a labeled data set L, and an unlabeled data set U. In addition, like co-training, it assumes knowledge of the positive class prior (i.e. the true percentage of positive instances in the data) and requires a deviation threshold d that we will explain shortly.</Paragraph>
<Paragraph position="1"> FS-EM, which has a two-level bootstrapping structure, is reminiscent of the meta-bootstrapping algorithm introduced in Riloff and Jones (1999). The outer-level bootstrapping task is feature selection, whereas the inner-level task is to learn a bootstrapped classifier from labeled and unlabeled data as described in section 4. At a high level, FS-EM uses a forward feature selection algorithm to impose a total ordering on the features based on the order in which the features are selected. Specifically, FS-EM performs the three steps below for each feature f that has not been selected. First, it uses the weakly supervised learner to train a classifier h from the labeled and unlabeled data (L and U) using only the feature f as well as the features selected thus far. Second, the algorithm uses h to classify all of the instances in L and U. Finally, FS-EM trains a new model on just U, which is now labeled by h.</Paragraph>
<Paragraph position="2"> At the end of the three steps, exactly one model is trained for each feature that has not been selected. The forward selection algorithm then selects the feature with which the corresponding model achieves the best performance on L (w.r.t. the true labels of the instances in L) for addition to F_sel (the set of features selected thus far).6 The process is repeated until all features have been selected.</Paragraph>
<Paragraph position="3"> Unfortunately, since L can be small, selecting a feature for incorporation into F_sel by measuring the performance of the corresponding model on L may not accurately reflect the actual model performance. To handle this problem, FS-EM has a preference for adding features whose inclusion results in a classification in which the positive class prior (i.e. the probability that an instance is labeled as positive) does not deviate from the true positive class prior, p, by more than the pre-specified threshold value d. A large deviation from the true prior is an indication that the resulting classification of the data does not correspond closely to the actual classification.</Paragraph>
<Paragraph position="4"> This algorithmic bias is particularly useful for weakly supervised learners (such as EM) that optimize an objective function other than classification accuracy and can potentially produce a classification that is substantially different from the actual one. Specifically, FS-EM attempts to ensure that the classification produced by the weakly supervised learner weakly agrees with the actual classification, where the weak disagreement rate between two classifications is defined as the difference between their positive class priors. Note that weak agreement is a necessary but not sufficient condition for two classifications to be identical.7 Nevertheless, if the addition of any of the features to F_sel does not produce a classification that weakly agrees with the true one, FS-EM picks the feature whose inclusion results in a positive class prior that has the least deviation instead. This step can be viewed as introducing &quot;pseudo-random&quot; noise into the feature selection process. The hope is that the deviation of the high-scoring, &quot;high-deviation&quot; features can be lowered by first incorporating those with &quot;low deviation&quot;, thus continuing to strive for weak agreement while potentially achieving better performance on L.</Paragraph>
<Paragraph position="5"> The final set of features, F_final, is composed of the first k features chosen by the feature selection algorithm, where k is the largest number of features that can achieve the best performance on L subject to the condition that the corresponding classification produced by the weakly supervised algorithm weakly disagrees with the true one by at most d. The output of FS-EM is a classifier that the weakly supervised learner learns from L and U using only the features in F_final. The pseudo-code describing FS-EM is shown in Figure 4.</Paragraph>
<Paragraph position="6"> Figure 4: Pseudo-code describing FS-EM. Its inputs are a supervised learning algorithm, a single-view weakly supervised learner, the labeled data L, the unlabeled data U, the true positive class prior, and the deviation threshold d.</Paragraph> </Section>
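The outer feature-selection loop of FS-EM can be sketched as follows. This is an illustrative reconstruction, not the authors' code: weak_learner(X_l, y_l, X_u) is assumed to return a fitted classifier with a predict method (e.g. a wrapper around the weakly_supervised_em sketch shown earlier), supervised_learner(X, y) a fitted supervised model, the positive class is assumed to be encoded as 1, and the prior whose deviation is checked is measured over the classification of L and U together.

import numpy as np

def fs_em(weak_learner, supervised_learner, X_lab, y_lab, X_unlab,
          true_pos_prior, delta=0.01):
    n_feats = X_lab.shape[1]
    selected, remaining, history = [], list(range(n_feats)), []
    while remaining:
        candidates = []
        for f in remaining:
            feats = selected + [f]
            # inner bootstrapping: classifier learned from L and U with this feature subset
            h = weak_learner(X_lab[:, feats], y_lab, X_unlab[:, feats])
            labels_l = h.predict(X_lab[:, feats])
            labels_u = h.predict(X_unlab[:, feats])
            # new model trained on U alone (labeled by h) and scored on L's true labels
            m = supervised_learner(X_unlab[:, feats], labels_u)
            acc = float(np.mean(m.predict(X_lab[:, feats]) == y_lab))
            prior = float(np.mean(np.concatenate([labels_l, labels_u]) == 1))
            candidates.append((f, acc, abs(prior - true_pos_prior)))
        agreeing = [c for c in candidates if c[2] <= delta]
        if agreeing:                              # prefer weakly agreeing classifications
            f, acc, dev = max(agreeing, key=lambda c: c[1])
        else:                                     # otherwise take the least-deviating feature
            f, acc, dev = min(candidates, key=lambda c: c[2])
        selected.append(f)
        remaining.remove(f)
        history.append((acc, dev))
    # final set: the largest prefix scoring best on L while weakly agreeing with the truth
    eligible = [(k, acc) for k, (acc, dev) in enumerate(history, start=1) if dev <= delta]
    if eligible:
        best_acc = max(acc for _, acc in eligible)
        k = max(k for k, acc in eligible if acc == best_acc)
    else:
        k = n_feats
    final = selected[:k]
    return weak_learner(X_lab[:, final], y_lab, X_unlab[:, final]), final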
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Results and Discussion </SectionTitle>
<Paragraph position="0"> We instantiate FS-EM with naive Bayes as the supervised learner and EM as the weakly supervised learner, providing it with the same amount of labeled and unlabeled data as in previous experiments and setting the deviation threshold d to 0.01. EM is run for 7 iterations whenever it is invoked.8 Results using FS-EM are shown in row 5 of Table 3. In comparison to EM, F-measure increases from 57.6 to 65.4 for MUC-6, and from 46.4 to 60.5 for MUC-7, allowing FS-EM to even surpass the performance of self-training. These results are consistent with our hypothesis that the performance of EM can be boosted by improving the underlying generative model using feature selection.</Paragraph>
<Paragraph position="1"> 8 Seven is used because we follow the choice of previous work (Muslea et al., 2002; Nigam and Ghani, 2000). Additional experiments in which EM is run for 5 and 9 iterations give similar results.</Paragraph>
<Paragraph position="2"> Finally, although FS-EM is only applicable to two-class problems, it can be generalized fairly easily to handle multi-class problems, where the true label distribution is assumed to be available and the weak agreement rate can be measured based on the similarity of two distributions.</Paragraph> </Section> </Section> </Paper>