<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0908"> <Title>Text Classification by Bootstrapping with Keywords, EM and Shrinkage</Title> <Section position="3" start_page="52" end_page="53" type="metho"> <SectionTitle> 2 Generating Preliminary Labels with Keywords </SectionTitle> <Paragraph position="0"> The first step in the bootstrapping process is to use the keywords to generate preliminary labels for as many of the unlabeled documents as possible. Each class is given just a few keywords. Figure 1 shows examples of the number and type of keywords given in our experimental domain--the human-provided keywords are shown in the nodes in non-italic font.</Paragraph> <Paragraph position="1"> In this paper, we generate preliminary labels from the keywords by term-matching in a rule-list fashion: for each document, we step through the keywords and place the document in the category of the first keyword that matches. Finding enough keywords to obtain broad coverage while simultaneously finding sufficiently specific keywords to obtain high accuracy is very difficult; it requires intimate knowledge of the data and a lot of trial and error.</Paragraph> <Paragraph position="2"> As a result, classification by keyword matching is both inaccurate and incomplete. Keywords tend to provide high precision and low recall; this brittleness will leave many documents unlabeled. Some documents will match keywords from the wrong class. In general we expect the low recall of the keywords to be the dominating factor in overall error.</Paragraph> <Paragraph position="3"> In our experimental domain, for example, 59% of the unlabeled documents do not contain any keywords.</Paragraph> <Paragraph position="4"> Another method of priming bootstrapping with keywords would be to take each set of keywords as a labeled mini-document containing just a few words.</Paragraph> <Paragraph position="5"> This could then be used as input to any standard learning algorithm. Testing this, and other keyword labeling approaches, is an area of ongoing work.</Paragraph> </Section>
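As a concrete illustration of the rule-list keyword matching just described, the following sketch (our own, not the authors' code) assigns each document the class of the first keyword it contains and leaves the rest unlabeled. The function name, the example rules, and the toy documents are all hypothetical; the actual keyword lists used in the paper appear in its Figure 1.

```python
# Illustrative sketch (our own, not the authors' code) of rule-list keyword matching:
# each document gets the class of the first keyword that matches; the rest stay unlabeled.

def preliminary_labels(documents, keyword_rules):
    """documents: list of token lists; keyword_rules: ordered list of (keyword, class) pairs."""
    labels = {}
    for doc_id, tokens in enumerate(documents):
        token_set = {t.lower() for t in tokens}
        for keyword, class_name in keyword_rules:
            if keyword.lower() in token_set:
                labels[doc_id] = class_name   # first matching rule wins
                break
        # documents matching no keyword are simply left out of `labels`
    return labels

# Hypothetical rules and documents; the paper's actual keywords are shown in its Figure 1.
rules = [("corpus", "NLP"), ("linguistics", "NLP"), ("reinforcement", "Reinforcement Learning")]
docs = [["statistical", "parsing", "on", "a", "large", "corpus"], ["robot", "navigation"]]
print(preliminary_labels(docs, rules))   # {0: 'NLP'}; document 1 remains unlabeled
```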
<Section position="4" start_page="53" end_page="55" type="metho"> <SectionTitle> 3 The Bootstrapping Algorithm </SectionTitle> <Paragraph position="0"> The goal of the bootstrapping step is to generate a naive Bayes classifier from the inputs: the (inaccurate and incomplete) preliminary labels, the unlabeled data and the class hierarchy. One straightforward method would be to simply take the unlabeled documents with preliminary labels, and treat this as labeled data in a standard supervised setting. This approach provides only minimal benefit for three reasons: (1) the labels are rather noisy, (2) the sample of preliminarily-labeled documents is skewed from the regular document distribution (i.e., it includes only documents containing keywords), and (3) data are sparse in comparison to the size of the feature space. Adding the remaining unlabeled data and running EM helps counter the first and second of these problems. Adding hierarchical shrinkage to naive Bayes helps counter the first and third. We begin a detailed description of our bootstrapping algorithm with a short overview of standard naive Bayes text classification, then proceed by adding EM to incorporate the unlabeled data, and conclude by explaining hierarchical shrinkage. An outline of the entire algorithm is presented in Table 1.</Paragraph> <Paragraph position="1">
* Inputs: A collection D of unlabeled documents, a class hierarchy, and a few keywords for each class.
* Generate preliminary labels for as many of the unlabeled documents as possible by term-matching with the keywords in a rule-list fashion.
* Initialize all the λ_j's to be uniform along each path from a leaf class to the root of the class hierarchy.
* Iterate the EM algorithm:
  * (M-step) Build the maximum likelihood multinomial at each node in the hierarchy given the class probability estimates for each document (Equations 1 and 2). Normalize all the λ_j's along each path from a leaf class to the root of the class hierarchy so that they sum to 1.
  * (E-step) Calculate the expectation of the class labels of each document using the classifier created in the M-step (Equation 3). Increment the new λ_j's by attributing each word of held-out data probabilistically to the ancestors of each class.
* Output: A naive Bayes classifier that takes an unlabeled document and predicts a class label.
Table 1: An outline of the bootstrapping algorithm described in Sections 2 and 3.</Paragraph> <Section position="1" start_page="53" end_page="54" type="sub_section"> <SectionTitle> 3.1 The naive Bayes framework </SectionTitle> <Paragraph position="0"> We build on the framework of multinomial naive Bayes text classification (Lewis, 1998; McCallum and Nigam, 1998). It is useful to think of naive Bayes as estimating the parameters of a probabilistic generative model for text documents. In this model, first the class of the document is selected. The words of the document are then generated based on the parameters for the class-specific multinomial (i.e., unigram model). Thus, the classifier parameters consist of the class prior probabilities and the class-conditioned word probabilities. More formally, each class, c_j, has a document frequency relative to all other classes, written P(c_j). For every word w_t in the vocabulary V, P(w_t|c_j) indicates the frequency that the classifier expects word w_t to occur in documents in class c_j.</Paragraph> <Paragraph position="1"> In the standard supervised setting, learning of the parameters is accomplished using a set of labeled training documents, D. To estimate the word probability parameters, P(w_t|c_j), we count the frequency with which word w_t occurs among all word occurrences for documents in class c_j. We supplement this with Laplace smoothing that primes each estimate with a count of one to avoid probabilities of zero. Let N(w_t, d_i) be the count of the number of times word w_t occurs in document d_i, and define P(c_j|d_i) ∈ {0, 1}, as given by the document's class label. Then, the estimate of the probability of word w_t in class c_j is:

$$P(w_t|c_j) = \frac{1 + \sum_{d_i \in D} N(w_t, d_i) P(c_j|d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{d_i \in D} N(w_s, d_i) P(c_j|d_i)} \quad (1)$$</Paragraph> <Paragraph position="2"> The class prior probability parameters are set in the same way, where |C| indicates the number of classes:

$$P(c_j) = \frac{1 + \sum_{d_i \in D} P(c_j|d_i)}{|C| + |D|} \quad (2)$$</Paragraph> <Paragraph position="3"> Given an unlabeled document and a classifier, we determine the probability that the document belongs in class c_j using Bayes' rule and the naive Bayes assumption--that the words in a document occur independently of each other given the class. If we denote w_{d_i,k} to be the kth word in document d_i, then classification becomes:

$$P(c_j|d_i) = \frac{P(c_j) \prod_{k} P(w_{d_i,k}|c_j)}{\sum_{r=1}^{|C|} P(c_r) \prod_{k} P(w_{d_i,k}|c_r)} \quad (3)$$</Paragraph> <Paragraph position="4"> Empirically, when given a large number of training documents, naive Bayes does a good job of classifying text documents (Lewis, 1998). More complete presentations of naive Bayes for text classification are provided by Mitchell (1997) and McCallum and Nigam (1998).</Paragraph> </Section>
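To make the parameter estimation and classification equations concrete, here is a minimal sketch of Laplace-smoothed multinomial naive Bayes with (possibly fractional) class weights P(c_j|d_i). It is an illustration under our own naming, not the authors' implementation.

```python
import math
from collections import defaultdict

def train_naive_bayes(docs, label_weights, classes, vocab):
    """docs: list of token lists; label_weights[i] maps class -> P(c|d_i), hard (0/1) or soft.
    Returns log priors (Equation 2) and log word probabilities (Equation 1)."""
    word_count = {c: defaultdict(float) for c in classes}
    total_words = {c: 0.0 for c in classes}
    class_mass = {c: 0.0 for c in classes}
    for tokens, weights in zip(docs, label_weights):
        for c, w in weights.items():
            class_mass[c] += w
            for t in tokens:
                word_count[c][t] += w
                total_words[c] += w
    log_prior = {c: math.log((1.0 + class_mass[c]) / (len(classes) + len(docs)))
                 for c in classes}                                     # Equation 2
    log_word = {c: {t: math.log((1.0 + word_count[c][t]) / (len(vocab) + total_words[c]))
                    for t in vocab}
                for c in classes}                                      # Equation 1
    return log_prior, log_word

def classify(tokens, log_prior, log_word, classes):
    """Posterior P(c_j|d_i) via Bayes' rule and the word-independence assumption (Equation 3)."""
    log_post = {c: log_prior[c] + sum(log_word[c][t] for t in tokens if t in log_word[c])
                for c in classes}
    m = max(log_post.values())
    norm = sum(math.exp(v - m) for v in log_post.values())
    return {c: math.exp(v - m) / norm for c, v in log_post.items()}
```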
<Section position="2" start_page="54" end_page="54" type="sub_section"> <SectionTitle> 3.2 Adding unlabeled data with EM </SectionTitle> <Paragraph position="0"> In the standard supervised setting, each document comes with a label. In our bootstrapping scenario, the preliminary keyword labels are both incomplete and inaccurate--the keyword matching leaves many documents unlabeled, and labels some incorrectly. In order to use the entire data set in a naive Bayes classifier, we use the Expectation-Maximization (EM) algorithm to generate probabilistically-weighted class labels for all the documents. This results in classifier parameters that are more likely given all the data.</Paragraph> <Paragraph position="1"> EM is a class of iterative algorithms for maximum likelihood or maximum a posteriori parameter estimation in problems with incomplete data (Dempster et al., 1977). Given a model of data generation, and data with some missing values, EM iteratively uses the current model to estimate the missing values, and then uses the missing value estimates to improve the model. Using all the available data, EM will locally maximize the likelihood of the parameters and give estimates for the missing values. In our scenario, the class labels of the unlabeled data are the missing values.</Paragraph> <Paragraph position="2"> In implementation, EM is an iterative two-step process. Initially, the parameter estimates are set in the standard naive Bayes way from just the preliminarily labeled documents. Then we iterate the E- and M-steps. The E-step calculates probabilistically-weighted class labels, P(c_j|d_i), for every document using the classifier and Equation 3.</Paragraph> <Paragraph position="3"> The M-step estimates new classifier parameters using all the documents, by Equations 1 and 2, where P(c_j|d_i) is now continuous, as given by the E-step.</Paragraph> <Paragraph position="4"> We iterate the E- and M-steps until the classifier converges. The initialization step from the preliminary labels identifies each mixture component with a class and seeds EM so that the local maxima that it finds correspond well to class definitions.</Paragraph> <Paragraph position="5"> In previous work (Nigam et al., 1999), we have shown this technique significantly increases text classification accuracy when given limited amounts of labeled data and large amounts of unlabeled data.</Paragraph> <Paragraph position="6"> The expectation here is that EM will both correct and complete the labels for the entire data set.</Paragraph> </Section>
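The EM loop of this subsection can be sketched as follows, reusing the hypothetical train_naive_bayes and classify helpers from the naive Bayes sketch above; again this is an illustrative simplification (a fixed iteration count in place of a convergence test), not the authors' code.

```python
# Illustrative sketch (not the authors' code) of the EM loop of Section 3.2, reusing the
# hypothetical train_naive_bayes and classify helpers sketched after Section 3.1.

def em_bootstrap(docs, preliminary, classes, vocab, iters=10):
    """docs: list of token lists; preliminary: dict doc index -> class name from keyword
    matching (documents absent from this dict start out unlabeled)."""
    # Initialization: standard naive Bayes trained on the preliminarily labeled documents only.
    labeled = [(docs[i], {c: 1.0}) for i, c in preliminary.items()]
    log_prior, log_word = train_naive_bayes([d for d, _ in labeled],
                                            [w for _, w in labeled], classes, vocab)
    for _ in range(iters):
        # E-step: probabilistically weighted labels P(c_j|d_i) for every document (Equation 3).
        weights = [classify(d, log_prior, log_word, classes) for d in docs]
        # M-step: re-estimate the classifier from all documents (Equations 1 and 2).
        log_prior, log_word = train_naive_bayes(docs, weights, classes, vocab)
        # (In practice one would iterate until the parameters or the likelihood converge.)
    return log_prior, log_word
```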
<Section position="3" start_page="54" end_page="55" type="sub_section"> <SectionTitle> 3.3 Improving sparse data estimates with shrinkage </SectionTitle> <Paragraph position="0"> Even when provided with a large pool of documents, naive Bayes parameter estimation during bootstrapping will suffer from sparse data because naive Bayes has so many parameters to estimate (|V||C| + |C|). Using the provided class hierarchy, we can integrate the statistical technique shrinkage into the bootstrapping algorithm to help alleviate the sparse data problem.</Paragraph> <Paragraph position="1"> Consider trying to estimate the probability of the word "intelligence" in the class NLP. This word should clearly have non-negligible probability there; however, with limited training data we may be unlucky, and the observed frequency of "intelligence" in NLP may be very far from its true expected value.</Paragraph> <Paragraph position="2"> One level up the hierarchy, however, the Artificial Intelligence class contains many more documents (the union of all the children). There, the probability of the word "intelligence" can be more reliably estimated.</Paragraph> <Paragraph position="3"> Shrinkage calculates new word probability estimates for each leaf class by a weighted average of the estimates on the path from the leaf to the root. The technique balances a trade-off between specificity and reliability. Estimates in the leaf are most specific but unreliable; further up the hierarchy estimates are more reliable but unspecific. We can calculate mixture weights for the averaging that are guaranteed to maximize the likelihood of held-out data with the EM algorithm.</Paragraph> <Paragraph position="4"> One can think of hierarchical shrinkage as a generative model that is slightly augmented from the one described in Section 3.1. As before, a class (leaf) is selected first. Then, for each word position in the document, an ancestor of the class (including itself) is selected according to the shrinkage weights. Then, the word itself is chosen based on the multinomial word distribution of that ancestor. If each word in the training data were labeled with which ancestor was responsible for generating it, then estimating the mixture weights would be a simple matter of maximum likelihood estimation from the ancestor emission counts. But these ancestor labels are not provided in the training data, and hence we use EM to fill in these missing values. We use the term vertical EM to refer to this process that calculates ancestor mixture weights; we use the term horizontal EM to refer to the process of filling in the missing class (leaf) labels on the unlabeled documents. Both vertical and horizontal EM run concurrently, with interleaved E- and M-steps.</Paragraph> <Paragraph position="5"> More formally, let {P^1(w_t|c_j), ..., P^k(w_t|c_j)} be word probability estimates, where P^1(w_t|c_j) is the maximum likelihood estimate using training data just in the leaf, P^2(w_t|c_j) is the maximum likelihood estimate in the parent using the training data from the union of the parent's children, P^{k-1}(w_t|c_j) is the estimate at the root using all the training data, and P^k(w_t|c_j) is the uniform estimate (P^k(w_t|c_j) = 1/|V|). The interpolation weights among c_j's "ancestors" (which we define to include c_j itself) are written {λ_j^1, λ_j^2, ..., λ_j^k}, where the λ_j^a sum to 1. The new word probability estimate based on shrinkage, denoted P̂(w_t|c_j), is then

$$\hat{P}(w_t|c_j) = \lambda_j^1 P^1(w_t|c_j) + \lambda_j^2 P^2(w_t|c_j) + \cdots + \lambda_j^k P^k(w_t|c_j) \quad (4)$$</Paragraph> <Paragraph position="6"> The λ_j vectors are calculated by the iterations of EM. In the E-step we calculate, for each class c_j and each word of unlabeled held-out data, the probability that the word was generated by each ancestor. In the M-step, we normalize the sum of these expectations to obtain new mixture weights λ_j. Without the use of held-out data, all the mixture weight would concentrate in the leaves.</Paragraph>
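As a rough illustration of the shrinkage estimate of Equation 4 and the held-out-data EM for the mixture weights just described (the formal E- and M-step updates are given below as Equations 5 and 6), the following sketch handles a single leaf-to-root path in isolation; the concurrent vertical/horizontal interleaving of the full algorithm is omitted, and all names are ours, not the authors'.

```python
# Illustrative sketch (our simplification, not the authors' implementation) of shrinkage
# for a single leaf class: Equation 4 plus EM over held-out data for the lambda weights.

def shrunk_prob(word, ancestor_estimates, lambdas):
    """Equation 4: weighted average of the estimates along the path from leaf to root.
    ancestor_estimates: list of dicts word -> probability, ordered leaf, parent, ..., root,
    followed by the uniform estimate; lambdas: matching mixture weights summing to 1."""
    return sum(lam * est.get(word, 0.0) for lam, est in zip(lambdas, ancestor_estimates))

def estimate_lambdas(heldout_words, ancestor_estimates, iters=50):
    """EM for one class's mixture weights, chosen to raise held-out likelihood."""
    k = len(ancestor_estimates)
    lambdas = [1.0 / k] * k                      # uniform initialization along the path
    for _ in range(iters):
        totals = [0.0] * k
        for w in heldout_words:
            # E-step: probability that each ancestor generated this held-out word occurrence.
            betas = [lam * est.get(w, 0.0) for lam, est in zip(lambdas, ancestor_estimates)]
            z = sum(betas)
            if z > 0.0:
                for a in range(k):
                    totals[a] += betas[a] / z
        # M-step: normalize the accumulated expectations over the ancestors.
        if sum(totals) > 0.0:
            lambdas = [t / sum(totals) for t in totals]
    return lambdas
```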
<Paragraph position="7"> Specifically, we begin by initializing the λ mixture weights for each leaf to a uniform distribution. Let β_j^a(d_{i,k}) denote the probability that the ath ancestor of c_j was used to generate word occurrence d_{i,k}. The E-step consists of estimating the β's:

$$\beta_j^a(d_{i,k}) = \frac{\lambda_j^a P^a(w_{d_i,k}|c_j)}{\sum_{m} \lambda_j^m P^m(w_{d_i,k}|c_j)} \quad (5)$$

In the M-step, we derive new and guaranteed improved weights, λ, by summing and normalizing the β's over the word occurrences of the held-out data:

$$\lambda_j^a = \frac{\sum_{d_i,k} \beta_j^a(d_{i,k})}{\sum_{m} \sum_{d_i,k} \beta_j^m(d_{i,k})} \quad (6)$$</Paragraph> <Paragraph position="8"> The E- and M-steps iterate until the λ's converge. These weights are then used to calculate new shrinkage-based word probability estimates, as in Equation 4. Classification of new test documents is performed just as before (Equation 3), where the Laplace estimates of the word probabilities are replaced by shrinkage-based estimates.</Paragraph> <Paragraph position="9"> A more complete description of hierarchical shrinkage for text classification is presented by McCallum et al. (1998).</Paragraph> </Section> </Section> <Section position="5" start_page="55" end_page="55" type="metho"> <SectionTitle> 4 Related Work </SectionTitle> <Paragraph position="0"> Other research efforts in text learning have also used bootstrapping approaches. The co-training algorithm (Blum and Mitchell, 1998) for classification works in cases where the feature space is separable into naturally redundant and independent parts. For example, web pages can be thought of as the text on the web page, and the collection of text in hyperlink anchors pointing to that page.</Paragraph> <Paragraph position="1"> A recent paper by Riloff and Jones (1999) bootstraps a dictionary of locations from just a small set of known locations. Here, their mutual bootstrapping algorithm works by iteratively identifying syntactic constructs indicative of known locations, and identifying new locations using these indicative constructs. The preliminary labeling by keyword matching used in this paper is similar to the seed collocations used by Yarowsky (1995). There, in a word sense disambiguation task, a bootstrapping algorithm is seeded with some examples of common collocations with the particular sense of some word (e.g., the seed "life" for the biological sense of "plant").</Paragraph> </Section> </Paper>