<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1013">
<Title>A Large-Scale Exploration of Effective Global Features for a Joint Entity Detection and Tracking Model</Title>
<Section position="6" start_page="101" end_page="103" type="evalu">
<SectionTitle> 5 Experimental Results </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="101" end_page="102" type="sub_section">
<SectionTitle> 5.1 Data </SectionTitle>
<Paragraph position="0"> We use the official 2004 ACE training and test set for evaluation purposes; however, we exclude from the training set the Fisher conversations data, since it is very different from the other data sets and there is no Fisher data in the 2004 test set. This amounts to 392 training documents, consisting of 8.1k sentences and 160k words. There are a total of 24k mentions in the data, corresponding to 10k entities (note that the data is not annotated for cross-document coreference, so instances of Bill Clinton appearing in two different documents are counted as two different entities). Roughly half of the entities are people, a fifth are organizations, a fifth are GPEs, and the remaining are mostly locations or facilities.</Paragraph>
<Paragraph position="1"> The test data is 192 documents, 3.5k sentences and 64k words, with 10k mentions to 4.5k entities. In all cases, we use a beam of 16 for training and test, and ignore features that occur fewer than five times in the training data.</Paragraph>
</Section>
<Section position="2" start_page="102" end_page="102" type="sub_section">
<SectionTitle> 5.2 Evaluation Metrics </SectionTitle>
<Paragraph position="0"> There are many possible evaluation metrics for this data. We use the ACE metric as our primary measure of quality. It is computed, roughly, by first matching system mentions with reference mentions, then using those matches to align system entities with reference entities. Once this matching is complete, costs for type errors, false alarms and misses are combined to give an ACE score ranging from 0 to 100, with 100 being perfect (we use v.10 of the ACE evaluation script).</Paragraph>
</Section>
<Section position="3" start_page="102" end_page="102" type="sub_section">
<SectionTitle> 5.3 Joint versus Pipelined </SectionTitle>
<Paragraph position="0"> We compare the performance of the joint system with the pipelined system. For the pipelined system, to build the mention detection module, we use the same technique as for the full system, but simply do not include coreference chain information in the hypotheses (essentially treating each mention as if it were in its own chain).
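To make the pipelined set-up concrete, the following is a minimal, runnable sketch of two-stage decoding in Python. The mention detector and the link rule here are toy stand-ins introduced purely for illustration; they are not the beam-search model or features used in the paper.

# Minimal sketch of the pipelined set-up (illustrative stand-ins only):
# stage 1 proposes mentions with no coreference information, so every
# mention starts in its own chain; stage 2 then links the fixed mentions
# into entities in a single left-to-right pass.
from dataclasses import dataclass

@dataclass
class Mention:
    span: tuple      # (start, end) token offsets
    etype: str       # e.g. PER, ORG, GPE
    chain: int = -1  # entity id; -1 means "its own chain" (no coreference yet)

def detect_mentions(tokens):
    """Stage 1 (toy heuristic, not the paper's detector): capitalized tokens."""
    return [Mention(span=(i, i + 1), etype="PER")
            for i, tok in enumerate(tokens) if tok[0].isupper()]

def link_mentions(mentions):
    """Stage 2 (toy rule): attach each mention, left to right, to the most
    recent chain with the same entity type, otherwise start a new chain."""
    chains = []
    for m in mentions:
        target = next((i for i in range(len(chains) - 1, -1, -1)
                       if chains[i][-1].etype == m.etype), None)
        if target is None:
            chains.append([m])
            m.chain = len(chains) - 1
        else:
            chains[target].append(m)
            m.chain = target
    return chains

tokens = "Clinton met reporters before Clinton left".split()
print(link_mentions(detect_mentions(tokens)))

In the joint system, by contrast, a single hypothesis carries the mention spans, their types and their chain assignments together, so errors in one part of the structure can be traded off against the other during search.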
For the stand-alone coreference system, we assume that the correct mentions and types are always given, and simply hypothesize the chain (though still in a left-to-right manner).1 Run as such, the joint model achieves an ACE score of 79.4 and the pipelined model achieves an ACE score of 78.1, a reasonably substantial improvement for performing both tasks simultaneously.</Paragraph>
<Paragraph position="1"> We have also computed the performance of these two systems ignoring the coreference scores (this is done by considering each mention to be its own entity and recomputing the ACE score). In this case, the joint model, ignoring its coreference output, achieves an ACE score of 85.6 and the pipelined model achieves a score of 85.3. The joint model does marginally better, but it is unlikely to be statistically significant. In the 2004 ACE evaluation, the best three performing systems achieved scores of 79.9, 79.7 and 78.2; it is unlikely that our system is significantly worse than these.</Paragraph>
<Paragraph position="2"> (Footnote 1) One subtle difficulty with the joint model has to do with the online nature of the learning algorithm: at the beginning of training, the model is guessing randomly at which words are entities and which are not. Because of the large number of initial errors made on this part of the task, the weights learned by the coreference model are initially very noisy. We experimented with two methods for compensating for this effect. The first was to give the mention identification model a head start: it was run for one full pass through the training data, ignoring the coreference aspect, and the following iterations were then trained jointly. The second method was to update the coreference weights only when the mention was identified correctly. On development data, the second was more efficient and outperformed the first in ACE score, so we use it for the experiments reported in this section.</Paragraph>
<Paragraph position="3"> [Figure 2 caption fragment: "... feature classes are removed."]</Paragraph>
</Section>
<Section position="4" start_page="102" end_page="103" type="sub_section">
<SectionTitle> 5.4 Feature Comparison for Coreference </SectionTitle>
<Paragraph position="0"> In this section, we analyze the effects of the different base feature types on coreference performance.</Paragraph>
<Paragraph position="1"> We use a model with perfect mentions, entity types and mention types (with the exception of pronouns: we do not assume we know pronoun types, since this gives away too much information), and measure the performance of the coreference system. When run with the full feature set, the model achieves an ACE score of 89.1; when run with no added features beyond simple biases, it achieves 65.4. The best performing system in the 2004 ACE competition achieved a score of 91.5 on this task; the next best system scored 88.2, which puts us squarely in the middle of these two (though likely not statistically significantly different).
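To illustrate this evaluation set-up, the sketch below chains gold mentions left to right, scoring each candidate entity with a pluggable linkage over its mentions (the linkage variants are compared in Section 5.5). The string-overlap pair scorer and the new-entity threshold are hypothetical stand-ins for illustration, not the paper's learned model.

# Minimal sketch of coreference-only decoding over gold mentions:
# process mentions left to right and either link each one to the
# best-scoring partial entity or start a new entity.  The pairwise
# scorer and the 0.5 new-entity threshold are illustrative only.

def pair_score(m1, m2):
    """Toy pairwise compatibility: substring overlap between mention strings."""
    a, b = m1.lower(), m2.lower()
    return 1.0 if a in b or b in a else 0.0

def entity_score(entity, mention, linkage):
    """Score a mention against a partial entity under a given linkage type."""
    scores = [pair_score(m, mention) for m in entity]
    if linkage == "max":
        return max(scores)
    if linkage == "min":
        return min(scores)
    if linkage == "average":
        return sum(scores) / len(scores)
    if linkage == "first":
        return scores[0]
    if linkage == "last":
        return scores[-1]
    raise ValueError("unknown linkage: " + linkage)

def chain(mentions, linkage="max", new_entity_threshold=0.5):
    entities = []                          # each entity is a list of mentions
    for mention in mentions:
        best_idx, best = None, new_entity_threshold
        for idx, entity in enumerate(entities):
            score = entity_score(entity, mention, linkage)
            if score > best:
                best_idx, best = idx, score
        if best_idx is None:
            entities.append([mention])     # start a new entity
        else:
            entities[best_idx].append(mention)
    return entities

gold = ["Bill Clinton", "the president", "Clinton", "Chelsea Clinton"]
print(chain(gold, linkage="max"))   # links the three "Clinton" mentions
print(chain(gold, linkage="min"))   # more conservative; fewer links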
Moreover, the best performing system took advantage of additional data that they labeled in house.</Paragraph>
<Paragraph position="2"> To compute feature performance, we begin with all feature types and iteratively remove them one by one, at each step removing the class whose removal costs the least in performance (we do not include the history features, since these are not relevant to the coreference task). The results are shown in Figure 2. Across the top line, we list the ten feature classes. The first row of results shows the performance of the system after removing just one feature class. In this case, removing lexical features reduces performance to 88.9, while removing string-match features reduces performance to 83.6. The non-shaded box (in this case, syntactic features) shows the feature set that can be removed with the least penalty in performance. The second row repeats this process after the syntactic features have been removed.</Paragraph>
<Paragraph position="3"> As we can see from this figure, we can freely remove syntax, semantics and classes with little decrease in performance. From that point, patterns are dropped, followed by lists and inference, each with a performance drop of about 0.4 or 0.5. Removing the knowledge-based features results in a large drop, from 87.6 down to 85.6, and removing the count-based features drops performance another 0.7 points. Based on this, we conclude that the most important feature classes for the coreference problem are, in order: string-match features, lexical features, count features and knowledge-based features, the latter two of which are novel to this work.</Paragraph>
</Section>
<Section position="5" start_page="103" end_page="103" type="sub_section">
<SectionTitle> 5.5 Linkage Types </SectionTitle>
<Paragraph position="0"> As stated in the previous section, the coreference-only task with intelligent link achieves an ACE score of 89.1. The next best score is with min link (88.7), followed by average link with a score of 88.1. There is then a rather large drop with max link to 86.2, followed by another drop for last link to 83.5; first link performs the poorest, scoring 81.5.</Paragraph>
</Section>
</Section>
</Paper>