<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0613">
  <Title>Unsupervised Models for Named Entity Classification</Title>
  <Section position="4" start_page="100" end_page="101" type="metho">
    <SectionTitle>
2 The Problem
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="100" end_page="101" type="sub_section">
      <SectionTitle>
2.1 The Data
</SectionTitle>
      <Paragraph position="0"> 971,746 sentences of New York Times text were parsed using the parser of (Collins 96). (Footnote 1: Thanks to Ciprian Chelba for running the parser and providing the data.) Word sequences that met the following criteria were then extracted as named entity examples: * The word sequence was a sequence of consecutive proper nouns (words tagged as NNP or NNPS) within a noun phrase, and its last word was the head of the noun phrase.</Paragraph>
      <Paragraph position="1">  * The NP containing the word sequence appeared in one of two contexts: 1. There was an appositive modifier to the NP, whose head is a singular noun (tagged NN). For example, take "... says Maury Cooper, a vice president at S.&amp;P." In this case, Maury Cooper is extracted. It is a sequence of proper nouns within an NP; its last word Cooper is the head of the NP; and the NP has an appositive modifier (a vice president at S.&amp;P.) whose head is a singular noun (president).</Paragraph>
      <Paragraph position="2"> 2. The NP is a complement to a preposition, which is the head of a PP. This PP modifies another NP, whose head is a singular noun. For example, "... fraud related to work on a federally funded sewage plant in Georgia". In this case, Georgia is extracted: the NP containing it is a complement to the preposition in; the PP headed by in modifies the NP a federally funded sewage plant, whose head is the singular noun plant. In addition to the named-entity string (Maury Cooper or Georgia), a contextual predictor was also extracted. In the appositive case, the contextual predictor was the head of the modifying appositive (president in the Maury Cooper example); in the second case, the contextual predictor was the preposition together with the noun it modifies (plant_in in the Georgia example). From here on we will refer to the named-entity string itself as the spelling of the entity, and the contextual predicate as the context.</Paragraph>
    </Section>
    <Section position="2" start_page="101" end_page="101" type="sub_section">
      <SectionTitle>
2.2 Feature Extraction
</SectionTitle>
      <Paragraph position="0"> Having found (spelling, context) pairs in the parsed data, a number of features are extracted. The features are used to represent each example for the learning algorithm. In principle a feature could be an arbitrary predicate of the (spelling, context) pair; for reasons that will become clear, features are limited to querying either the spelling or context alone.</Paragraph>
      <Paragraph position="1"> The following features were used: full-string=x: The full string (e.g., for Maury Cooper, full-string=Maury_Cooper).</Paragraph>
      <Paragraph position="2"> contains(x): If the spelling contains more than one word, this feature applies for any words that the string contains (e.g., Maury Cooper contributes two such features, contains(Maury) and contains(Cooper)).</Paragraph>
      <Paragraph position="3"> allcap1: This feature appears if the spelling is a single word which is all capitals (e.g., IBM would contribute this feature).</Paragraph>
      <Paragraph position="4"> allcap2: This feature appears if the spelling is a single word which is all capitals or full periods, and contains at least one period (e.g., N.Y. would contribute this feature, IBM would not).</Paragraph>
      <Paragraph position="6"> nonalpha=x: Appears if the spelling contains any characters other than upper or lower case letters. In this case nonalpha is the string formed by removing all upper/lower case letters from the spelling (e.g., for Thomas E. Petry nonalpha=., for A.T.&amp;T. nonalpha=..&amp;.).</Paragraph>
      <Paragraph position="8"> context=x: The context for the entity. The Maury Cooper and Georgia examples would contribute context=president and context=plant_in respectively.</Paragraph>
      <Paragraph position="9"> context-type=x: context-type=appos in the appositive case, context-type=prep in the PP case.</Paragraph>
      <Paragraph position="10"> Table 1 gives some examples of entities and their features.</Paragraph>
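      <Paragraph><![CDATA[
The feature definitions above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name extract_features, its argument names, and the decision to ignore whitespace when building nonalpha (so that Thomas E. Petry yields nonalpha=. as in the text) are assumptions made here.

```python
import re


def extract_features(spelling, context, context_type):
    """Return the feature set for one (spelling, context) pair.

    Hypothetical helper: assumes `spelling` is a non-empty string of
    whitespace-separated words.
    """
    words = spelling.split()
    feats = {"full-string=" + "_".join(words)}

    if len(words) > 1:
        # contains(x) applies only to multi-word spellings.
        feats.update("contains(%s)" % w for w in words)
    else:
        word = words[0]
        if re.fullmatch(r"[A-Z]+", word):
            feats.add("allcap1")  # single word, capitals only
        elif re.fullmatch(r"[A-Z.]+", word) and "." in word:
            feats.add("allcap2")  # capitals and periods, at least one period

    # Assumption: whitespace is dropped together with the letters.
    residue = re.sub(r"[A-Za-z\s]", "", spelling)
    if residue:
        feats.add("nonalpha=" + residue)

    feats.add("context=" + context)
    feats.add("context-type=" + context_type)
    return feats
```

For the Maury Cooper example, extract_features("Maury Cooper", "president", "appos") gives full-string=Maury_Cooper, contains(Maury), contains(Cooper), context=president and context-type=appos.
]]></Paragraph>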
    </Section>
  </Section>
  <Section position="5" start_page="101" end_page="104" type="metho">
    <SectionTitle>
3 Unsupervised Algorithms based on Decision Lists
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="101" end_page="101" type="sub_section">
      <SectionTitle>
3.1 Supervised Decision List Learning
</SectionTitle>
      <Paragraph position="0"> The first unsupervised algorithm we describe is based on the decision list method from (Yarowsky 95). Before describing the unsupervised case we first describe the supervised version of the algorithm. Input to the learning algorithm: n labeled examples of the form $(x_i, y_i)$. $y_i$ is the label of the ith example (given that there are k possible labels, $y_i$ is a member of $\mathcal{Y} = \{1 \ldots k\}$). $x_i$ is a set of $m_i$ features $\{x_{i1}, x_{i2}, \ldots, x_{im_i}\}$ associated with the ith example. Each $x_{ij}$ is a member of $\mathcal{X}$, where $\mathcal{X}$ is a set of possible features.</Paragraph>
      <Paragraph position="1"> Output of the learning algorithm: a function $h : \mathcal{X} \times \mathcal{Y} \rightarrow [0, 1]$ where $h(x, y)$ is an estimate of the conditional probability $p(y|x)$ of seeing label y given that feature x is present. Alternatively, h can be thought of as defining a decision list of rules $x \rightarrow y$ ranked by their "strength" $h(x, y)$.</Paragraph>
      <Paragraph position="2"> The label for a test example with features $\mathbf{x}$ is then defined as $$\hat{y}(\mathbf{x}) = \arg\max_{x \in \mathbf{x},\, y \in \mathcal{Y}} h(x, y) \qquad (1)$$ In this paper we define $h(x, y)$ as the following function of counts seen in training data: $$h(x, y) = \frac{\mathrm{Count}(x, y) + \alpha}{\mathrm{Count}(x) + k\alpha} \qquad (2)$$</Paragraph>
      <Paragraph position="4"> where $\mathrm{Count}(x, y)$ is the number of times feature x is seen with label y in training data, and $\mathrm{Count}(x) = \sum_{y \in \mathcal{Y}} \mathrm{Count}(x, y)$. $\alpha$ is a smoothing parameter, and k is the number of possible labels. In this paper k = 3 (the three labels are person, organization, location), and we set $\alpha = 0.1$. Equation 2 is an estimate of the conditional probability of the label given the feature, $P(y|x)$ (see footnote 2).</Paragraph>
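      <Paragraph><![CDATA[
The smoothed strength estimate of equation 2, (Count(x, y) + alpha) / (Count(x) + k * alpha), and the arg-max decision rule of equation 1 can be illustrated as follows. This is a minimal sketch under stated assumptions: the function names, the representation of h as a dictionary keyed by (feature, label), and the tie-breaking behaviour are choices made here for illustration and are not taken from the paper.

```python
from collections import Counter


def train_decision_list(examples, labels, alpha=0.1):
    """Estimate h(x, y) for every seen feature x and every label y.

    examples: list of (feature_set, label) pairs (hypothetical format).
    """
    count_xy = Counter()  # Count(x, y)
    count_x = Counter()   # Count(x) = sum over y of Count(x, y)
    for features, y in examples:
        for x in features:
            count_xy[(x, y)] += 1
            count_x[x] += 1
    k = len(labels)
    return {
        (x, y): (count_xy[(x, y)] + alpha) / (count_x[x] + k * alpha)
        for x in count_x
        for y in labels
    }


def classify(features, h):
    """Apply the strongest rule x -> y whose feature x is present."""
    candidates = [(s, x, y) for (x, y), s in h.items() if x in features]
    if not candidates:
        return None  # no rule applies
    # Ties on strength fall back to string order (an arbitrary choice).
    return max(candidates)[2]
```

For a feature seen twice, both times with the same label, the estimate is (2 + 0.1) / (2 + 0.3), so the strengths over the three labels still sum to one.
]]></Paragraph>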
    </Section>
    <Section position="2" start_page="101" end_page="102" type="sub_section">
      <SectionTitle>
3.2 An Unsupervised Algorithm
</SectionTitle>
      <Paragraph position="0"> We now introduce a new algorithm for learning from unlabeled examples, which we will call DL-CoTrain (DL stands for decision list, the term Cotrain is taken from (Blum and Mitchell 98)). (Footnote 2: (Yarowsky 95) describes the use of more sophisticated smoothing methods. It is not clear how to apply these methods in the unsupervised case, as they require cross-validation techniques; for this reason we use the simpler smoothing method shown here.)</Paragraph>
      <Paragraph position="1">  The input to the unsupervised algorithm is an initial, "seed" set of rules. In the named entity domain these rules were</Paragraph>
      <Paragraph position="3"> Each of these rules was given a strength of 0.9999. The following algorithm was then used to induce new rules: 1. Set n = 5. (n is the maximum number of rules of each type induced at each iteration.) 2. Initialization: Set the spelling decision list equal to the set of seed rules.</Paragraph>
      <Paragraph position="4"> 3. Label the training set using the current set of spelling rules. Examples where no rule applies are left unlabeled.</Paragraph>
      <Paragraph position="5"> 4. Use the labeled examples to induce a decision list of contextual rules, using the method described in section 3.1.</Paragraph>
      <Paragraph position="6"> Let Count'(x) be the number of times feature x is seen with some known label in the training data. For each label (Person, Organization and Location), take the n contextual rules with the highest value of Count'(x) whose unsmoothed strength is above some threshold $p_{min}$. (If fewer than n rules have precision greater than $p_{min}$, we keep only those rules which exceed the precision threshold.) $p_{min}$ was fixed at 0.95 in all experiments in this paper. (Footnote 3: Note that taking the top n most frequent rules already makes the method robust to low count events, hence we do not use smoothing, allowing low-count high-precision features to be chosen on later iterations.)</Paragraph>
      <Paragraph position="9"> Thus at each iteration the method induces at most $n \times k$ rules, where k is the number of possible labels (k = 3 in the experiments in this paper).</Paragraph>
      <Paragraph position="10"> 5. Label the training set using the current set of contextual rules. Examples where no rule applies are left unlabeled.</Paragraph>
      <Paragraph position="11"> 6. On this new labeled set, select up to $n \times k$ spelling rules using the same method as in step 4. Set the spelling rules to be the seed set plus the rules selected.</Paragraph>
      <Paragraph position="12"> 7. If n &lt; 2500 set n = n + 5 and return to step 3. Otherwise, label the training data with the combined spelling/contextual decision list, then induce a final decision list from the labeled examples where all rules (regardless of strength) are added to the decision list.</Paragraph>
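      <Paragraph><![CDATA[
The rule-selection step used in step 4 (and again in step 6) can be sketched as below. This is a hedged illustration rather than the authors' code: the name select_rules, the input format, and the alphabetical tie-break among equally frequent features are assumptions. The precision test is unsmoothed, as the text specifies.

```python
from collections import Counter


def select_rules(labeled, labels, n, p_min=0.95):
    """Pick, per label, up to n rules x -> y ranked by Count'(x).

    labeled: list of (feature_set, label) pairs for one view only
    (contextual features in step 4, spelling features in step 6).
    Returns {feature: (label, unsmoothed_precision)}.
    """
    count_xy, count_x = Counter(), Counter()
    for feats, y in labeled:
        for x in feats:
            count_xy[(x, y)] += 1
            count_x[x] += 1  # Count'(x): times x is seen with a known label

    rules = {}
    for y in labels:
        # Keep features whose unsmoothed precision for y exceeds p_min.
        passing = [
            (count_x[x], x)
            for x in count_x
            if count_xy[(x, y)] / count_x[x] > p_min
        ]
        # Most frequent first; ties broken alphabetically (an assumption).
        passing.sort(key=lambda c: (-c[0], c[1]))
        for cnt, x in passing[:n]:
            # With p_min above 0.5 a feature can pass for one label only.
            rules[x] = (y, count_xy[(x, y)] / cnt)
    return rules
```

Because at most n rules are kept for each of the k labels, one call returns at most n times k rules, matching the bound stated above.
]]></Paragraph>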
    </Section>
    <Section position="3" start_page="102" end_page="103" type="sub_section">
      <SectionTitle>
3.3 The Algorithm in (Yarowsky 95)
</SectionTitle>
      <Paragraph position="0"> We can now compare this algorithm to that of (Yarowsky 95). The core of Yarowsky's algorithm is as follows:</Paragraph>
      <Paragraph position="2"> 1. Initialization: Set the decision list equal to the set of seed rules.</Paragraph>
      <Paragraph position="3"> 2. Label the training set using the current set of rules.</Paragraph>
      <Paragraph position="4"> 3. Use the labels to learn a decision list $h(x, y)$ where h is defined by the formula in equation 2, with counts restricted to training data examples that have been labeled in step 2. Set the decision list to include all rules whose (smoothed) strength is above some threshold $p_{min}$.</Paragraph>
      <Paragraph position="5"> 4. Return to step 2.</Paragraph>
      <Paragraph position="6">  There are two differences between this method and the DL-CoTrain algorithm: * The DL-CoTrain algorithm is rather more cautious, imposing a gradually increasing limit on the number of rules that can be added at each iteration. * The DL-CoTrain algorithm has separated the spelling and contextual features, alternating between labeling and learning with the two types of features. Thus an explicit assumption about the redundancy of the features (that either the spelling or context alone should be sufficient to build a classifier) has been built into the algorithm. To measure the contribution of each modification, a third, intermediate algorithm, Yarowsky-cautious, was also tested. Yarowsky-cautious does not separate the spelling and contextual features, but does have a limit on the number of rules added at each stage. (Specifically, the limit n starts at 5 and increases by 5 at each iteration.) The first modification, cautiousness, is a relatively minor change. It was motivated by the observation that the (Yarowsky 95) algorithm added a very large number of rules in the first few iterations. Taking only the highest frequency rules is much "safer", as they tend to be very accurate. This intuition is borne out by the experimental results. The second modification is more important, and is discussed in the next section.</Paragraph>
    </Section>
    <Section position="4" start_page="103" end_page="104" type="sub_section">
      <SectionTitle>
3.4 Justification for the Separation of
Contextual and Spelling Features
</SectionTitle>
      <Paragraph position="0"> An important reason for separating the two types of features is that this opens up the possibility of theoretical analysis of the use of unlabeled examples.</Paragraph>
      <Paragraph position="1"> (Blum and Mitchell 98) describe learning in the following situation: * Each example is represented by a feature vector x drawn from a set of possible values (an instance space) X. The task is to learn a classification function $f : X \rightarrow Y$ where Y is a set of possible labels.</Paragraph>
      <Paragraph position="3"> * The features can be separated into two types: $X = X_1 \times X_2$, where $X_1$ and $X_2$ correspond to two different "views" of an example. In the named entity task, $X_1$ might be the instance space for the spelling features, $X_2$ might be the instance space for the contextual features. By this assumption, each element $x \in X$ can also be represented as $(x_1, x_2) \in X_1 \times X_2$.</Paragraph>
      <Paragraph position="5"> * Each view of the example is sufficient for classification. That is, there exist functions $f_1$ and $f_2$ such that for any example $x = (x_1, x_2)$, $f(x) = f_1(x_1) = f_2(x_2)$. We never see an example $x = (x_1, x_2)$ in training or test data such that $f_1(x_1) \neq f_2(x_2)$.</Paragraph>
      <Paragraph position="6"> Thus the method makes the fairly strong assumption that the features can be partitioned into two types such that each type alone is sufficient for classification. * $x_1$ and $x_2$ are not correlated too tightly. (For example, there is not a deterministic function from $x_1$ to $x_2$.) Now assume we have n pairs $(x_{1,i}, x_{2,i})$ drawn from $X_1 \times X_2$, where the first m pairs have labels $y_i$, whereas for $i = m+1 \ldots n$ the pairs are unlabeled. In a fully supervised setting, the task is to learn a function f such that for all $i = 1 \ldots m$, $f(x_{1,i}, x_{2,i}) = y_i$. In the cotraining case, (Blum and Mitchell 98) argue that the task should be to induce functions $f_1$ and $f_2$ such that 1. $f_1(x_{1,i}) = f_2(x_{2,i}) = y_i$ for $i = 1 \ldots m$; 2. $f_1(x_{1,i}) = f_2(x_{2,i})$ for $i = m+1 \ldots n$.</Paragraph>
      <Paragraph position="8"> So $f_1$ and $f_2$ must (1) correctly classify the labeled examples, and (2) agree with each other on the unlabeled examples. The key point is that the second constraint can be remarkably powerful in reducing the complexity of the learning problem.</Paragraph>
      <Paragraph position="9"> (Blum and Mitchell 98) give an example that illustrates just how powerful the second constraint can be. Consider the case where $|X_1| = |X_2| = N$ and N is a "medium" sized number so that it is feasible to collect O(N) unlabeled examples. Assume that the two classifiers are "rote learners": that is, $f_1$ and $f_2$ are defined through look-up tables that list a label for each member of $X_1$ or $X_2$. The problem is a binary classification problem. The problem can be represented as a graph with 2N vertices corresponding to the members of $X_1$ and $X_2$. Each unlabeled pair $(x_{1,i}, x_{2,i})$ is represented as an edge between nodes corresponding to $x_{1,i}$ and $x_{2,i}$ in the graph.</Paragraph>
      <Paragraph position="10"> An edge indicates that the two features must have the same label. Given a sufficient number of randomly drawn unlabeled examples (i.e., edges), we will induce two completely connected components that together span the entire graph. Each vertex within a connected component must have the same label; in the binary classification case, we need a single labeled example to identify which component should get which label.</Paragraph>
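      <Paragraph><![CDATA[
The graph argument can be made concrete with a small union-find sketch: each unlabeled pair merges the components of its two view values, and a labeled vertex then labels its whole component. The function name propagate_labels, the ("view1", value) node encoding, and the toy data in the usage note are illustrative assumptions, not part of (Blum and Mitchell 98).

```python
def propagate_labels(pairs, seeds):
    """Spread seed labels over connected components of the bipartite graph.

    pairs: list of (x1, x2) unlabeled examples, one edge each.
    seeds: {node: label}, where a node is ("view1", x1) or ("view2", x2).
    Returns {node: label} for every node connected to some seed.
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for x1, x2 in pairs:
        a, b = find(("view1", x1)), find(("view2", x2))
        parent[a] = b  # the two features must share a label

    component_label = {find(node): label for node, label in seeds.items()}
    return {
        node: component_label[find(node)]
        for node in list(parent)
        if find(node) in component_label
    }
```

With the pairs (Maury Cooper, president), (John Smith, president), (Georgia, plant_in) and (Texas, plant_in), a single seed on Maury Cooper and another on Georgia are enough to label all six vertices.
]]></Paragraph>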
      <Paragraph position="11"> (Blum and Mitchell 98) go on to give PAC results for learning in the cotraining case. They also describe an application of cotraining to classifying web pages (the two feature sets are the words on the page, and other pages pointing to the page).</Paragraph>
      <Paragraph position="12"> The method halves the error rate in comparison to a method using the labeled examples alone.</Paragraph>
      <Paragraph position="13"> Limitations of (Blum and Mitchell 98): While the assumptions of (Blum and Mitchell 98) are useful in developing both theoretical results and an intuition for the problem, the assumptions are quite limited. In particular, it may not be possible to learn functions $f_1(x_{1,i}) = f_2(x_{2,i})$ for $i = m+1 \ldots n$: either because there is some noise in the data, or because it is just not realistic to expect to learn perfect classifiers given the features used for representation. It may be more realistic to replace the second criterion with a softer one; for example, (Blum and Mitchell 98) suggest the alternative 1. $f_1(x_{1,i}) = f_2(x_{2,i}) = y_i$ for $i = 1 \ldots m$; 2. The choice of $f_1$ and $f_2$ must minimize the number of examples for which $f_1(x_{1,i}) \neq f_2(x_{2,i})$.</Paragraph>
      <Paragraph position="14"> Alternatively, if $f_1$ and $f_2$ are probabilistic learners, it might make sense to encode the second constraint as one of minimizing some measure of the distance between the distributions given by the two learners. The question of what soft function to pick, and how to design algorithms which optimize it, is an open question, but appears to be a promising way of looking at the problem.</Paragraph>
      <Paragraph position="15"> The DL-CoTrain algorithm can be motivated as being a greedy method of satisfying the above two constraints. At each iteration the algorithm increases the number of rules, while maintaining a high level of agreement between the spelling and contextual decision lists. Inspection of the data shows that at n = 2500, the two classifiers both give labels on 44,281 (49.2%) of the unlabeled examples, and give the same label on 99.25% of these cases.</Paragraph>
      <Paragraph position="16"> So the success of the algorithm may well be due to its success in maximizing the number of unlabeled examples on which the two decision lists agree. In the next section we present an alternative approach that builds two classifiers while attempting to satisfy the above constraints as much as possible. The algorithm, called CoBoost, has the advantage of being more general than the decision-list learning algorithm. [Figure 1. Input: $(x_1, y_1), \ldots, (x_m, y_m)$; $x_i \in 2^{\mathcal{X}}$, $y_i = \pm 1$]</Paragraph>
      <Paragraph position="18"/>
    </Section>
  </Section>
  <Section position="6" start_page="104" end_page="107" type="metho">
    <SectionTitle>
4 A Boosting-based algorithm
</SectionTitle>
    <Paragraph position="0"> This section describes an algorithm based on boosting algorithms, which were previously developed for supervised machine learning problems. We first give a brief overview of boosting algorithms. We then discuss how we adapt and generalize a boosting algorithm, AdaBoost, to the problem of named entity classification. The new algorithm, which we call CoBoost, uses labeled and unlabeled data and builds two classifiers in parallel. (We would like to note though that unlike previous boosting algorithms, the CoBoost algorithm presented here is not a boosting algorithm under Valiant's (Valiant 84) Probably Approximately Correct (PAC) model.)</Paragraph>
    <Section position="1" start_page="104" end_page="105" type="sub_section">
      <SectionTitle>
4.1 The AdaBoost algorithm
</SectionTitle>
      <Paragraph position="0"> This section describes AdaBoost, which is the basis for the CoBoost algorithm. AdaBoost was first introduced in (Freund and Schapire 97); (Schapire and Singer 98) gave a generalization of AdaBoost which we will use in this paper. For a description of the application of AdaBoost to various NLP problems see the paper by Abney, Schapire, and Singer in this volume.</Paragraph>
      <Paragraph position="1"> The input to AdaBoost is a set of training examples $((x_1, y_1), \ldots, (x_m, y_m))$. Each $x_i \in 2^{\mathcal{X}}$ is the set of features constituting the ith example. For the moment we will assume that there are only two possible labels: each $y_i$ is in $\{-1, +1\}$. AdaBoost is given access to a weak learning algorithm, which accepts as input the training examples, along with a distribution over the instances. The distribution specifies the relative weight, or importance, of each example; typically, the weak learner will attempt to minimize the weighted error on the training set, where the distribution specifies the weights.</Paragraph>
      <Paragraph position="2"> The weak learner for two-class problems computes a weak hypothesis h from the input space into the reals ($h : 2^{\mathcal{X}} \rightarrow \mathbb{R}$), where the sign (see footnote 4) of h(x) is interpreted as the predicted label and the magnitude $|h(x)|$ is the confidence in the prediction: large numbers for $|h(x)|$ indicate high confidence in the prediction, and numbers close to zero indicate low confidence. The weak hypothesis can abstain from predicting the label of an instance x by setting h(x) = 0. The final strong hypothesis, denoted f(x), is then the sign of a weighted sum of the weak hypotheses, $f(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$, where the weights $\alpha_t$ are determined during the run of the algorithm, as we describe below.</Paragraph>
      <Paragraph position="3"> Pseudo-code describing the generalized boosting algorithm of Schapire and Singer is given in Figure 1. Note that $Z_t$ is a normalization constant that ensures the distribution $D_{t+1}$ sums to 1; it is a function of the weak hypothesis $h_t$ and the weight for that hypothesis $\alpha_t$ chosen at the tth round. The normalization factor plays an important role in the AdaBoost algorithm. Schapire and Singer show that the training error is bounded above by $$\frac{1}{m} \sum_{i=1}^{m} \exp\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right) = \prod_{t=1}^{T} Z_t \qquad (3)$$</Paragraph>
      <Paragraph position="5"> Thus, in order to greedily minimize an upper bound on training error, on each iteration we should search for the weak hypothesis $h_t$ and the weight $\alpha_t$ that minimize $Z_t$.</Paragraph>
      <Paragraph position="6"> In our implementation, we make perhaps the simplest choice of weak hypothesis. Each $h_t$ is a function that predicts a label (+1 or -1) on examples containing a particular feature $x_t$, while abstaining on other examples: $h_t(x) = \pm 1$ if $x_t \in x$, and $h_t(x) = 0$ otherwise.</Paragraph>
      <Paragraph position="8"> The prediction of the strong hypothesis can then be written as $$f(x) = \mathrm{sign}\left(\sum_{t : x_t \in x} \alpha_t h_t(x)\right)$$ (Footnote 4: We define sign(0) = 0.)</Paragraph>
      <Paragraph position="9"> We now briefly describe how to choose $h_t$ and $\alpha_t$ at each iteration. Our derivation is slightly different from the one presented in (Schapire and Singer 98) as we restrict $\alpha_t$ to be positive. $Z_t$ can be written as follows: $$Z_t = \sum_{i=1}^{m} D_t(i) \exp(-\alpha_t y_i h_t(x_i)) = W_0 + W_+ e^{-\alpha_t} + W_- e^{\alpha_t} \qquad (4)$$ where $W_0 = \sum_{i : h_t(x_i) = 0} D_t(i)$, $W_+ = \sum_{i : h_t(x_i) = y_i} D_t(i)$ and $W_- = \sum_{i : h_t(x_i) = -y_i} D_t(i)$.</Paragraph>
      <Paragraph position="11"> Following the derivation of Schapire and Singer, providing that $W_+ &gt; W_-$, Equ. (4) is minimized by setting $$\alpha_t = \frac{1}{2} \ln\left(\frac{W_+}{W_-}\right) \qquad (5)$$</Paragraph>
      <Paragraph position="13"> Since a feature may be present in only a few examples, $W_-$ can be in practice very small or even 0, leading to extreme confidence values. To prevent this we "smooth" the confidence by adding a small value, $\epsilon$, to both $W_+$ and $W_-$, giving $\alpha_t = \frac{1}{2} \ln\left(\frac{W_+ + \epsilon}{W_- + \epsilon}\right)$. Plugging the value of $\alpha_t$ from Equ. (5) and $h_t$ into Equ. (4) gives $$Z_t = W_0 + 2\sqrt{W_+ W_-} \qquad (6)$$</Paragraph>
      <Paragraph position="15"> In order to minimize $Z_t$, at each iteration the final algorithm should choose the weak hypothesis (i.e., a feature $x_t$) which has values for $W_+$ and $W_-$ that minimize Equ. (6), with $W_+ &gt; W_-$.</Paragraph>
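      <Paragraph><![CDATA[
The search for the weak hypothesis can be sketched as follows: for every feature, accumulate the weight of the examples it would label correctly (W+) and incorrectly (W-), score it with W0 + 2 * sqrt(W+ * W-), and keep the minimiser together with its smoothed confidence. The function name, the tuple it returns, and the default value of eps are assumptions for illustration; the paper does not state the value of the smoothing constant here.

```python
import math
from collections import defaultdict


def best_weak_hypothesis(examples, dist, eps=1e-3):
    """Return (Z, feature, predicted_sign, alpha) minimising Z.

    examples: list of (feature_set, y) with y in {+1, -1}.
    dist: weight D_t(i) for each example, in the same order.
    """
    # weight[x][y]: total weight of examples with label y that contain x.
    weight = defaultdict(lambda: {+1: 0.0, -1: 0.0})
    for (feats, y), d in zip(examples, dist):
        for x in feats:
            weight[x][y] += d

    total = sum(dist)
    best = None
    for x, w in weight.items():
        # Predict the label with the larger weight so that W+ > W-.
        sign = +1 if w[+1] > w[-1] else -1
        w_plus, w_minus = w[sign], w[-sign]
        if w_plus <= w_minus:
            continue  # exact tie: the constraint W+ > W- cannot be met
        w_zero = total - w_plus - w_minus  # weight of abstentions
        z = w_zero + 2.0 * math.sqrt(w_plus * w_minus)
        if best is None or z < best[0]:
            alpha = 0.5 * math.log((w_plus + eps) / (w_minus + eps))
            best = (z, x, sign, alpha)
    return best
```

A feature that covers half of the weight and is never wrong scores Z = 0.5, since W- = 0 removes the square-root term; the eps term keeps its confidence finite.
]]></Paragraph>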
    </Section>
    <Section position="2" start_page="105" end_page="107" type="sub_section">
      <SectionTitle>
4.2 The CoBoost algorithm
</SectionTitle>
      <Paragraph position="0"> We now describe the CoBoost algorithm for the named entity problem. Following the convention presented in earlier sections, we assume that each example is an instance pair of the form $(x_{1,i}, x_{2,i})$ where $x_{j,i} \in 2^{\mathcal{X}_j}$, $j \in \{1, 2\}$. In the named-entity problem each example is a (spelling, context) pair. The first m pairs have labels $y_i$, whereas for $i = m+1, \ldots, n$ the pairs are unlabeled. We make the assumption that for each example, both $x_{1,i}$ and $x_{2,i}$ alone are sufficient to determine the label $y_i$.</Paragraph>
      <Paragraph position="4"> The learning task is to find two classifiers $f_1 : 2^{\mathcal{X}_1} \rightarrow \{-1, +1\}$ and $f_2 : 2^{\mathcal{X}_2} \rightarrow \{-1, +1\}$ such that $f_1(x_{1,i}) = f_2(x_{2,i}) = y_i$ for examples $i = 1, \ldots, m$, and $f_1(x_{1,i}) = f_2(x_{2,i})$ as often as possible on examples $i = m+1, \ldots, n$. To achieve this goal we extend the auxiliary function that bounds the training error (see Equ. (3)) to be defined over unlabeled as well as labeled instances. Denote by $g_j(x) = \sum_t \alpha_t^j h_t^j(x)$, $j \in \{1, 2\}$, the unthresholded strong hypothesis (i.e., $f_j(x) = \mathrm{sign}(g_j(x))$). We define the following function: $$Z_{co} = \sum_{i=1}^{m} \exp(-y_i g_1(x_{1,i})) + \sum_{i=1}^{m} \exp(-y_i g_2(x_{2,i})) + \sum_{i=m+1}^{n} \exp(-f_2(x_{2,i}) g_1(x_{1,i})) + \sum_{i=m+1}^{n} \exp(-f_1(x_{1,i}) g_2(x_{2,i})) \qquad (7)$$</Paragraph>
      <Paragraph position="6"> If $Z_{co}$ is small, then it follows that the two classifiers must have a low error rate on the labeled examples, and that they also must give the same label on a large number of unlabeled instances. To see this, note that the first two terms in the above equation correspond to the function that AdaBoost attempts to minimize in the standard supervised setting (Equ. (3)), with one term for each classifier.</Paragraph>
      <Paragraph position="7"> The two new terms force the two classifiers to agree, as much as possible, on the unlabeled examples.</Paragraph>
      <Paragraph position="8"> Put another way, the minimum of Equ. (7) is at</Paragraph>
      <Paragraph position="10"> the sum of the classification error of the labeled examples and the number of disagreements between the two classifiers on the unlabeled examples. Formally, let $e_1$ ($e_2$) be the number of classification errors of the first (second) learner on the training data, and let $e_{co}$ be the number of unlabeled examples on which the two classifiers disagree. Then, it can be verified that $e_1 + e_2 + 2e_{co} \le Z_{co}$. We can now derive the CoBoost algorithm as a means of minimizing $Z_{co}$. The algorithm builds two classifiers in parallel from labeled and unlabeled data. As in boosting, the algorithm works in rounds. Each round is composed of two stages; each stage updates one of the classifiers while keeping the other classifier fixed. Denote the unthresholded classifiers after t - 1 rounds by $g_j^{t-1}$ and assume that it is the turn for the first classifier to be updated while the second one is kept fixed. We first define "pseudo-labels", $\tilde{y}_i$, as follows: $\tilde{y}_i = y_i$ for $1 \le i \le m$, and $\tilde{y}_i = \mathrm{sign}(g_2^{t-1}(x_{2,i}))$ for $m+1 \le i \le n$.</Paragraph>
      <Paragraph position="12"> Thus the first m labels are simply copied from the labeled examples, while the labels of the remaining (n - m) examples are taken as the current output of the second classifier. We can now add a new weak hypothesis $h_t^1$ based on a feature in $\mathcal{X}_1$ with a confidence value $\alpha_t^1$. $h_t^1$ and $\alpha_t^1$ are chosen to minimize the function $$Z_{co}^1 = \sum_{i=1}^{n} \exp\left(-\tilde{y}_i \left(g_1^{t-1}(x_{1,i}) + \alpha_t^1 h_t^1(x_{1,i})\right)\right) \qquad (8)$$</Paragraph>
      <Paragraph position="14"> We now define, for $1 \le i \le n$, the following virtual distribution: $$D_t^1(i) = \frac{1}{Z_t^1} \exp\left(-\tilde{y}_i\, g_1^{t-1}(x_{1,i})\right)$$</Paragraph>
      <Paragraph position="16"> As before, $Z_t^1$ is a normalization constant. Equ. (8) can now be rewritten, up to the constant factor $Z_t^1$, as $$\sum_{i=1}^{n} D_t^1(i) \exp\left(-\tilde{y}_i \alpha_t^1 h_t^1(x_{1,i})\right)$$ which is of the same form as the function $Z_t$ used in AdaBoost. Using the virtual distribution $D_t^1(i)$ and pseudo-labels $\tilde{y}_i$, values for $W_0$, $W_+$ and $W_-$ can be calculated for each possible weak hypothesis (i.e., for each feature $x \in \mathcal{X}_1$); the weak hypothesis with minimal value for $W_0 + 2\sqrt{W_+ W_-}$ can be chosen as before; and the weight for this weak hypothesis $\alpha_t = \frac{1}{2} \ln\left(\frac{W_+ + \epsilon}{W_- + \epsilon}\right)$ can be calculated. This procedure is repeated for T rounds while alternating between the two classifiers. The pseudo-code describing the algorithm is given in Fig. 2.</Paragraph>
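      <Paragraph><![CDATA[
One stage of CoBoost (updating the first classifier while the second is fixed) starts by building pseudo-labels and the virtual distribution. The sketch below illustrates only these two quantities, following the definitions in the text; the function names and the list-based representation of the classifier outputs are assumptions made here, and this is not a transcription of the pseudo-code in Fig. 2.

```python
import math


def sign(v):
    """Return +1, -1 or 0; sign(0) = 0 encodes an abstention."""
    return (v > 0) - (v < 0)


def pseudo_labels(labels, g_other):
    """Copy the m known labels, then use the fixed classifier's output.

    labels: the m labels y_i of the labeled examples (+1 or -1).
    g_other: unthresholded outputs of the fixed classifier on all n
    examples, labeled examples first.
    """
    m = len(labels)
    return list(labels) + [sign(g) for g in g_other[m:]]


def virtual_distribution(pseudo, g_current):
    """Weights proportional to exp(-pseudo_label * g) for the classifier
    being updated, normalised to sum to one.

    An instance whose pseudo-label is 0 gets the constant weight exp(0),
    which no choice of weak hypothesis can change.
    """
    raw = [math.exp(-y * g) for y, g in zip(pseudo, g_current)]
    z = sum(raw)  # the normalization constant
    return [r / z for r in raw]
```

The resulting weights and pseudo-labels can then be passed to the same W+, W-, W0 computation that AdaBoost uses. Before the updated classifier has any rules, all its outputs are zero and the distribution is uniform.
]]></Paragraph>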
      <Paragraph position="17"> The CoBoost algorithm described above divides the function $Z_{co}$ into two parts: $Z_{co} = Z_{co}^1 + Z_{co}^2$. On each step CoBoost searches for a feature and a weight so as to minimize either $Z_{co}^1$ or $Z_{co}^2$. In practice, this greedy approach almost always results in an overall decrease in the value of $Z_{co}$. Note, however, that there might be situations in which $Z_{co}$ in fact increases.</Paragraph>
      <Paragraph position="18"> One implementation issue deserves some elaboration. Note that in our formalism a weak hypothesis can abstain. In fact, during the first rounds many of the predictions of $g_1$, $g_2$ are zero. Thus corresponding pseudo-labels for instances on which $g_j$ abstains are set to zero and these instances do not contribute to the objective function. Each learner is free to pick the labels for these instances. This allows the learners to "bootstrap" each other by filling the labels of the instances on which the other side has abstained so far.</Paragraph>
      <Paragraph position="19"> The CoBoost algorithm just described is for the case where there are two labels: for the named entity task there are three labels, and in general it will be useful to generalize the CoBoost algorithm to the multiclass case. Several extensions of AdaBoost for multiclass problems have been suggested (Freund and Schapire 97; Schapire and Singer 98). In this work we extended the AdaBoost.MH (Schapire and Singer 98) algorithm to the cotraining case. AdaBoost.MH maintains a distribution over instances and labels; in addition, each weak-hypothesis outputs a confidence vector with one confidence value for each possible label. We again adopt an approach where we alternate between two classifiers: one classifier is modified while the other remains fixed. Pseudo-labels are formed by taking seed labels on the labeled examples, and the output of the fixed classifier on the unlabeled examples. AdaBoost.MH can be applied to the problem using these pseudo-labels in place of supervised examples.</Paragraph>
      <Paragraph position="20"> For the experiments in this paper we made a couple of additional modifications to the CoBoost algorithm. The algorithm in Fig. 2 was extended to have an additional, innermost loop over the three possible labels. The weak hypothesis chosen was then restricted to be a predictor in favor of this label. Thus at each iteration the algorithm is forced to pick features for the location, person and organization labels in turn for the classifier being trained. This modification brings the method closer to the DL-CoTrain algorithm described earlier, and is motivated by the intuition that all three labels should be kept healthily populated in the unlabeled examples, preventing one label from dominating; this deserves more theoretical investigation.</Paragraph>
      <Paragraph position="21"> We also removed the context-type feature type when using the CoBoost approach. This "default" feature type has 100% coverage (it is seen on every example) but a low, baseline precision. When this feature type was included, CoBoost chose this default feature at an early iteration, thereby giving non-abstaining pseudo-labels for all examples, with eventual convergence to the two classifiers agreeing by assigning the same label to almost all examples.</Paragraph>
      <Paragraph position="22"> Again, this deserves further investigation.</Paragraph>
      <Paragraph position="23"> Finally, we would like to note that it is possible to devise similar algorithms based on objective functions other than the one given in Equ. (7), such as the likelihood function used in maximum-entropy problems and other generalized additive models (Lafferty 99). We are currently exploring such algorithms.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="107" end_page="108" type="metho">
    <SectionTitle>
5 An EM-based approach
</SectionTitle>
    <Paragraph position="0"> The Expectation Maximization (EM) algorithm (Dempster, Laird and Rubin 77) is a common approach for unsupervised training; in this section we describe its application to the named entity problem. A generative model was applied (similar to naive Bayes) with the three labels as hidden variables on unlabeled examples, and observed variables on (seed) labeled examples. The model was parameterized such that the joint probability of a (label, feature-set) pair $P(y_i, x_i)$ is written as $$P(y_i, x_i) = P(y_i)\, P(m_i) \prod_{j=1}^{m_i} P(x_{ij} \mid y_i) \qquad (9)$$</Paragraph>
    <Paragraph position="2"> The model assumes that (y, x) pairs are generated by an underlying process where the label is first chosen with some prior probability $P(y_i)$; the number of features $m_i$ is then chosen with some probability $P(m_i)$; finally the features are independently generated with probabilities $P(x_{ij} \mid y_i)$.</Paragraph>
    <Paragraph position="3"> We again assume a training set of n examples $\{x_1 \ldots x_n\}$ where the first m examples have labels $\{y_1 \ldots y_m\}$, and the last (n - m) examples are unlabeled. For the purposes of EM, the "observed" data is $\{(x_1, y_1), \ldots, (x_m, y_m), x_{m+1}, \ldots, x_n\}$, and the hidden data is $\{y_{m+1} \ldots y_n\}$. The likelihood of the observed data under the model is $$\prod_{i=1}^{m} P(y_i, x_i) \times \prod_{i=m+1}^{n} \sum_{y=1}^{k} P(y, x_i) \qquad (10)$$</Paragraph>
    <Paragraph position="5"> where $P(y_i, x_i)$ is defined as in (9). Training under this model involves estimation of parameter values for P(y), P(m) and P(x|y). The maximum likelihood estimates (i.e., parameter values which maximize (10)) cannot be found analytically, but the EM algorithm can be used to hill-climb to a local maximum of the likelihood function from some initial parameter settings. In our experiments we set the parameter values randomly, and then ran EM to convergence. Given parameter estimates, the label for a test example x is defined as $$\arg\max_{y \in \{1 \ldots k\}} P(y, x) \qquad (11)$$</Paragraph>
    <Paragraph position="7"> We should note that the model in equation 9 is deficient, in that it assigns greater than zero probability to some feature combinations that are impossible. For example, the independence assumptions mean that the model fails to capture the dependence between specific and more general features (for example the fact that the feature full-string=New_York is always seen with the features contains(New) and contains(York) and is never seen with a feature such as contains(Group)). Unfortunately, modifying the model to account for these kinds of dependencies is not at all straightforward.</Paragraph>
    <Paragraph position="8"> [Table caption: The baseline method tags all entities as the most frequent class type (organization).]</Paragraph>
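    <Paragraph><![CDATA[
Under the generative model of this section, the posterior over labels for one example is proportional to P(y) times the product of P(x_j | y) over its features; the factor P(m) is the same for every label and cancels. This posterior is what the E-step of EM needs for unlabeled examples, and its arg-max gives the test-time label. The sketch below assumes, purely for illustration, that the parameters are stored in two dictionaries (prior and cond) that contain a strictly positive entry for every feature queried; the names are hypothetical.

```python
def posterior(features, prior, cond):
    """Return {label: P(label | features)} under the independence model.

    prior: {label: P(y)}.
    cond: {label: {feature: P(x | y)}} (assumed to cover every feature).
    """
    joint = {}
    for y, p_y in prior.items():
        p = p_y
        for x in features:
            p *= cond[y][x]  # features are generated independently given y
        joint[y] = p
    total = sum(joint.values())  # assumed non-zero
    return {y: p / total for y, p in joint.items()}


def classify(features, prior, cond):
    """Pick the label with the highest joint (equivalently, posterior)."""
    post = posterior(features, prior, cond)
    return max(post, key=post.get)
```

With two equally likely labels and P(contains(Mr.) | person) = 0.8 against 0.1 for location, the posterior for person is 0.4 / 0.45, or 8/9.
]]></Paragraph>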
  </Section>
</Paper>