<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3234"> <Title>Trained Named Entity Recognition Using Distributional Clusters</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> We experimented with the MUC 6 named entity data set, which consists of a training set of 318 documents, a validation set of 30 documents, and a test set of 30 documents.</Paragraph> <Paragraph position="1"> All documents are annotated to identify three types of name (PERSON, ORGANIZATION, MUC 6 fields produced by BWI using the baseline feature set. An initial and terminal detector is shown for each field.</Paragraph> <Paragraph position="2"> LOCATION), two types of temporal expression (DATE, TIME), and two types of numeric expression (MONEY, PERCENT). It is common to report performance in terms of precision, recall, and their harmonic mean (F1), a convention to which we adhere.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Baseline </SectionTitle> <Paragraph position="0"> Using the wildcards listed in Table 2, we trained BWI for 500 boosting iterations on each of the seven entity fields. The output out each of these training runs consists of a0a2a1a3a1a5a4a7a6 a26a9a8a2a1a3a1a3a1 boundary detectors. Look-ahead was set to 3.</Paragraph> <Paragraph position="1"> Table 3 shows a few of the boundary detectors induced by this procedure. These detectors were selected manually to illustrate the kinds of patterns generated. Note how some of the detectors amount to field-specific gazetteer entries. Others have more interesting (and typically intuitive) structure. We defer quantitative evaluation to the next section, where a comparison with the cluster-enhanced extractors will be made.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Adding Cluster Features </SectionTitle> <Paragraph position="0"> The MUC 6 dataset was produced using articles from the Wall Street Journal. In order to produce maximally relevant clusters, we used documents from the WSJ portion of the North American News corpus as input to co-clustering--some 119,000 documents in total. Note that there is a temporal disparity between the MUC 6 corpus and this clustering corpus, which has an undetermined impact on performance.</Paragraph> <Paragraph position="1"> enced by detectors in Table 4.</Paragraph> <Paragraph position="2"> We used this data to produce 200 clusters, as described in Section 3. Treating each of these clusters as an unlabeled gazetteer, we then defined corresponding wildcards. For example, the value of wildcard <C35> only matches a term belonging to Cluster 35. In order to reduce the training time of a given boundary learning problem, we tabulated the frequency of wildcard occurrence within three tokens of any occurrences of the target boundary and omitted from training wildcards testing true fewer than ten times.1 Table 4, which lists sample detectors from these runs, includes some that are clearly impossible to express using the baseline feature set. An example is the first row, which matches a third-person present-tense verb used in quote attribution, followed by a first name (see Table 5). 
<Paragraph position="3"> Table 6 shows the performance of the two variants on the individual MUC 6 fields, tested over the "dryrun" and "formal" test sets combined. In this table, we scored each field individually using our own evaluation software. An entity instance was judged to be correctly extracted if a prediction precisely identified its boundaries (ignoring "ALT" attributes). Non-matching predictions and missed entities were counted as false positives and false negatives, respectively. We assessed the statistical significance of precision and recall scores by computing beta confidence intervals at the 95% level. In the table, the higher precision or recall is in boldface if its separation from the lower score is significant.</Paragraph>
[Table 6 caption (fragment): "... without (Base) and with (Clust) cluster-based features. Significantly better precision or recall scores, at the 95% confidence level, are in boldface."]
<Paragraph position="5"> Except for TIME and LOCATION, all fields benefit from inclusion of the cluster features. TIME, which is scarce in the training and test sets, is insensitive to their inclusion. The effect on LOCATION is more interesting. It shares in the general tendency of cluster features to increase recall, but loses precision as a result.2 Although the increase in recall is approximately the same as the loss in precision, the F1 score, which is more heavily influenced by the lower of precision and recall, drops slightly.</Paragraph>
[Footnote 2: Note, however, that none of the differences observed for LOCATION are significant at the 95% level.]
<Paragraph position="6"> While the effect of the cluster features on precision is inconsistent, they typically benefit recall. This effect is most dramatic in the case of ORGANIZATION, where, at the expense of a small drop in precision, recall increases by more than 20 points.</Paragraph>
<Paragraph position="7"> The somewhat counter-intuitive improvements in precision on some fields (particularly the significant improvement on PERSON) are attributable to our learning framework. Boosting for a sufficient number of iterations forces a learner to account for all boundary tokens through one or more detectors. To the extent that the baseline's features cannot account for as many of the boundary tokens, the learner is forced to induce a larger number of over-specialized detectors that rely on questionable patterns in the data. Depending on the task, these detectors can lead to a larger proportion of false positives.</Paragraph>
<Paragraph position="9"> The relatively weak result for DATE comes as a surprise. Inspection of the data leads us to attribute this to two factors. On the one hand, there is considerable temporal drift between the training and test sets. Many of the dates are specific to contemporaneous events; patterns based on specific years therefore generalize in only a limited way.</Paragraph>
<Paragraph position="10"> At the same time, the notion of date, as understood in the MUC 6 corpus, is reasonably subtle. Meaning roughly "non-TIME temporal expression," it includes everything from shorthand date expressions to more interesting phrases, such as "the first six months of fiscal 1994." In passing, we note a few potentially relevant idiosyncrasies in these experiments. Most significant is a representational choice we made in tokenizing the cluster corpus. In tallying frequencies, we treated all numeric expressions as occurrences of a special term, "*num*". Consequently, the tokens "1989" and "10,000" are treated as instances of the same term, and clustering has no opportunity to distinguish, say, years from monetary amounts.</Paragraph>
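As an illustration of this representational choice, a minimal sketch of such a normalization follows. The regular expression is our own guess at what counts as a numeric expression, not the exact rule used when the clustering corpus was built.

```python
import re

# Treat any purely numeric token (digits, optionally with commas or a decimal
# point) as an occurrence of the special term "*num*".
_NUMERIC = re.compile(r"^\d[\d,.]*$")

def normalize_token(token):
    return "*num*" if _NUMERIC.match(token) else token

# Both "1989" and "10,000" collapse to the same term:
assert normalize_token("1989") == normalize_token("10,000") == "*num*"
```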
<Paragraph position="11"> The (perhaps) disappointing performance on the relatively simple fields, TIME and PERCENT, somewhat under-reports the strength of the learner.</Paragraph>
<Paragraph position="12"> As noted above, TIME occurs only very infrequently. Consequently, little training data is available for this field, and mistakes (BWI missed one of the three instances in the test set) have a large effect on the TIME-specific scores. In the case of PERCENT, we ignored MUC instructions not to attempt to recognize instances in tabular regions. One of the documents contains a significant number of unlabeled percentages in such a table. BWI duly recognized these, to the detriment of the reported precision.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 MUC Evaluation </SectionTitle>
<Paragraph position="0"> For comparison with numbers reported in the literature, we used the learned extractors to produce mark-up and evaluated the result using the MUC 6 scorer. The MUC 6 evaluation framework differs from ours in two key ways. Most importantly, all entity types are to be processed simultaneously. We benefit from this framework, since spurious predictions for one entity type may be superseded by correct predictions for a related type. The opportunity is greatest for the three name types; in inspecting the false positives, we observed a number of confusions among these fields.3 The MUC scorer is also more lenient than ours, awarding points for extraction of alternative strings and forgiving the inclusion of certain functional tokens in the extracted text.</Paragraph>
[Footnote 3: For example, companies are occasionally named after people (e.g., Liz Claiborne).]
[Table 7 caption (fragment): "... by the MUC 6 scorer."]
<Paragraph position="2"> In moving to the multi-entity extraction setting, the obvious approach is to collect predictions from all extractors simultaneously. However, this requires a strategy for dealing with overlapping predictions (e.g., a single text fragment labeled as both a person and an organization). We resolve such conflicts by preferring in each case the extraction with the highest confidence. In order to render confidence scores more comparable, we normalized the weights of the detectors making up each boundary classifier so that they sum to one.</Paragraph>
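The normalization and conflict-resolution scheme just described might be sketched as follows. This is an illustration under our own assumptions (simple span tuples and a greedy highest-confidence-first pass), not the authors' implementation.

```python
def normalize_weights(weights):
    """Scale a boundary classifier's detector weights so they sum to one,
    making confidence scores more comparable across extractors."""
    total = sum(weights)
    return [w / total for w in weights] if total else list(weights)

def resolve_overlaps(predictions):
    """predictions: list of (start, end, field, confidence) spans from all extractors.
    Whenever two spans overlap, keep the one with the higher confidence."""
    kept = []
    for pred in sorted(predictions, key=lambda p: p[3], reverse=True):
        start, end = pred[0], pred[1]
        # Accept the prediction only if it overlaps nothing already kept.
        if all(end <= s or start >= e for s, e, _, _ in kept):
            kept.append(pred)
    return sorted(kept, key=lambda p: p[0])
```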
<Paragraph position="3"> A comparison of Table 7 with Table 6 suggests the extent to which BWI benefits from the multi-field mark-up setting. Note that here we used only the "formal" test set for evaluation, in contrast with the numbers in Table 6, which combine the two test sets. The lift we observe from cluster features is also in evidence here, chiefly as an increase in recall, particularly for PERSON and ORGANIZATION. There is now also an increase in global precision, attributable in large part to the benefit of extracting multiple fields simultaneously.</Paragraph>
<Paragraph position="4"> The F1 score produced by BWI is comparable to the best machine-learning-based results reported elsewhere. Bikel et al. (1997) report a summary F1 of 0.93 on the same test set, but using a model trained on 450,000 words; we count approximately 130,000 words in the experiments reported here. The numbers reported by Bennett et al. (1997) for PERSON, ORGANIZATION, and LOCATION (F1 of 0.947, 0.815, and 0.925, respectively) are slightly better than the numbers BWI reaches on the same fields. Note, however, that the features provided to their learner include syntactic labels and carefully engineered semantic categories, whereas we eschew knowledge- and labor-intensive resources. This has important implications for the portability of the approaches to new domains and languages.</Paragraph>
<Paragraph position="6"> By taking a few post-processing steps, it is possible to realize further improvements. For example, the learner occasionally identifies terms and phrases that some simple rules can reliably reject. By suppressing any prediction that consists entirely of a stopword, we increase the precision of both ORGANIZATION and LOCATION to 0.86 (from 0.84 and 0.80) and overall F1 to 0.88.</Paragraph>
<Paragraph position="7"> We can also exploit what Cucerzan and Yarowsky (1999) call the one sense per discourse phenomenon, the tendency of terms to have a fixed meaning within a single document. By marking up unmarked strings that match extracted entity instances in the same document, we can improve the recall of some fields. We added this post-processing step for the PERSON and ORGANIZATION fields. It increased the recall of PERSON from 0.95 to 0.98 and of ORGANIZATION from 0.74 to 0.79, with minimal changes to precision and a slight improvement in summary F1.</Paragraph>
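A rough sketch of these two post-processing steps is given below. The stopword list, span representation, and exact-string matching are illustrative assumptions; only the logic (suppress all-stopword predictions, then propagate labels within a document) reflects the description above.

```python
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "at", "to"}  # illustrative list only

def suppress_stopword_predictions(predictions, text):
    """Drop any predicted entity whose extracted string consists entirely of stopwords."""
    kept = []
    for start, end, field, conf in predictions:
        tokens = text[start:end].lower().split()
        if tokens and all(t in STOPWORDS for t in tokens):
            continue
        kept.append((start, end, field, conf))
    return kept

def propagate_within_document(predictions, text, fields=("PERSON", "ORGANIZATION")):
    """Mark up unmarked strings that exactly match an entity already extracted
    elsewhere in the same document (the one-sense-per-discourse heuristic)."""
    result = list(predictions)
    covered = [(s, e) for s, e, _, _ in predictions]
    for start, end, field, conf in predictions:
        if field not in fields:
            continue
        surface = text[start:end]
        if not surface.strip():
            continue
        pos = text.find(surface)
        while pos != -1:
            span = (pos, pos + len(surface))
            if all(span[1] <= s or span[0] >= e for s, e in covered):
                result.append((span[0], span[1], field, conf))
                covered.append(span)
            pos = text.find(surface, pos + 1)
    return result
```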
</Section>
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Analysis and Related Work </SectionTitle>
<Paragraph position="0"> The promise of this general method--supervised learning on a small training set using features derived from a larger unlabeled set--lies in the support it provides for rapid deployment in novel domains and languages. Without relying on any linguistic resources more advanced than a tokenizer and some orthographic features, we can produce an NER module using only a few annotated documents.</Paragraph>
<Paragraph position="1"> How few depends ultimately on the difficulty of the domain. We might also expect the benefit of distributional features to decrease with increasing training set size. Figure 1 displays the F1 learning-curve performance of BWI, both with and without cluster features, on the two fields that benefit most from these features, PERSON and ORGANIZATION. As expected, the difference appears to be greatest at the low end of the horizontal axis (although overfitting complicates the comparison). At the same time, the improvement is fairly consistent at all training set sizes. Either the baseline feature set is ultimately too impoverished for this task, or, more likely, the complete MUC 6 training set (318 documents) is small for this class of learner.</Paragraph>
[Figure 1 caption (fragment): "... number of documents."]
<Paragraph position="3"> Techniques to lessen the need for annotation for NER have received a fair amount of attention recently. The prevailing approach to this problem is a bootstrapping technique in which, starting with a few hand-labeled examples, the system iteratively adds automatic labels to a corpus, training itself, as it were. Examples of this are Cucerzan and Yarowsky (1999), Thelen and Riloff (2002), and Collins and Singer (1999).</Paragraph>
<Paragraph position="4"> These techniques address the same problem as this paper, but are otherwise quite different from the work described here. The labeling method (seeding) is an indirect form of corpus annotation. The promise of all such approaches is that, by starting with a small number of seeds, reasonable results can be achieved at low expense. However, it is difficult to tell how much labeling corresponds to a given number of seeds, since this depends on the coverage of the seeds. Note, too, that any bootstrapping approach must confront the problem of instability; poor initial decisions by a bootstrapping algorithm can lead to large eventual performance degradations. We might expect a lightly supervised learner with access to features based on a full-corpus analysis to yield more consistently strong results.</Paragraph>
<Paragraph position="5"> Of the three approaches mentioned above, only Cucerzan and Yarowsky do not presuppose a syntactic analysis of the corpus, so their work is perhaps the most comparable to this one. Of course, comparisons must be strongly qualified, given the different labeling methods and data sets. Nevertheless, the performance of cluster-enhanced BWI at the low end of the horizontal axis compares favorably with the English F1 performance of 0.543 they report using 190 seed words. And, arguably, annotating 10-20 documents is no more labor-intensive than assembling a list of 190 seed words.</Paragraph>
<Paragraph position="6"> Strong corroboration for the approach advocated in this paper is provided by Miller et al. (2004), in which cluster-based features are combined with the discriminative sequential model proposed in Collins (2002) to advance the state of the art. In addition, using active learning, the authors are able to reduce human labeling effort by an order of magnitude.</Paragraph>
<Paragraph position="7"> Miller et al. use a proprietary data set for training and testing, so it is difficult to make a close comparison of outcomes. At roughly comparable training set sizes, they appear to achieve a score of about 0.89 (F1) with a "conventional" HMM, versus 0.93 using the discriminative learner trained with cluster features (compared with 0.86 reached by BWI). Both the HMM and the Collins model are constrained to account for an entire sentence in tagging it, making determinations for all fields simultaneously, in contrast to the individual, local boundary detections made by BWI. This characteristic probably accounts for the accuracy advantage they appear to enjoy.</Paragraph>
<Paragraph position="8"> An interesting distinguishing feature of Miller et al. is their use of hierarchical clustering. While much is made of the ability of their approach to accommodate different levels of granularity automatically, no evidence is provided that the hierarchy provides real benefit. At the same time, our work shows that significant gains can be realized with a single, sufficiently granular partition of terms. It is known, moreover, that greedy agglomerative clustering leads to partitions that are sub-optimal in terms of a mutual information objective function (see, for example, Brown et al. (1992)). Ultimately, it is left to future research to determine how sensitive, if at all, the NER gains are to the details of the clustering.</Paragraph>
</Section> </Section> </Paper>