<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1060"> <Title>Factorizing Complex Models: A Case Study in Mention Detection</Title> <Section position="5" start_page="477" end_page="479" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> All the experiments in this section are run on the ACE 2003 and 2004 data sets, in all three languages covered: Arabic, Chinese, and English.</Paragraph> <Paragraph position="1"> Since the evaluation test set is not publicly available, we have split the publicly available data into an 80%/20% split. To facilitate future comparisons with the work presented here, and to simulate a realistic scenario, the splits are created based on article dates: the test data is selected as the last 20% of the data in chronological order. This way, the documents in the training and test data sets do not overlap in time, and the ones in the test data postdate the ones in the training data.</Paragraph> <Paragraph position="2"> Table 2 presents the number of documents in the training/test datasets for the three languages.</Paragraph> <Paragraph position="3"> 11For instance, the full label B-PER is consistent with the partial label B, but not with O or I.</Paragraph> <Paragraph position="4"> Each word in the training data is labeled with one of the following properties:12 * if it is not part of any entity, it is labeled O * if it is part of an entity, it contains a tag specifying whether it starts a mention (B-) or is inside a mention (I-). 
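The IOB2 labeling scheme above can be made concrete with a short decoding sketch. This is illustrative only: the helper name and the example tag set (PER, GPE) are assumptions, not the paper's actual label inventory beyond the B-/I-/O convention it describes.

```python
def decode_mentions(tags):
    """Decode a sequence of IOB2 tags into (start, end, type) mention spans.

    B-X opens a mention of entity type X, I-X continues it, and O marks
    tokens outside any mention; an inconsistent I- tag closes the current
    mention, matching a conservative reading of the scheme.
    """
    mentions, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                mentions.append((start, i, etype))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and etype == tag[2:]:
            continue  # still inside the current mention
        else:  # O, or an I- tag that does not match the open mention
            if start is not None:
                mentions.append((start, i, etype))
            start, etype = None, None
    if start is not None:
        mentions.append((start, len(tags), etype))
    return mentions

# "John Smith visited New York ." -> a person mention and a GPE mention
tags = ["B-PER", "I-PER", "O", "B-GPE", "I-GPE", "O"]
print(decode_mentions(tags))  # [(0, 2, 'PER'), (3, 5, 'GPE')]
```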
It is also labeled with the entity type of the mention (seven possible types: person, organization, location, facility, geo-political entity, weapon, and vehicle), the mention type (named, nominal, pronominal, or premodifier13), and the entity subtype (which depends on the main entity type).</Paragraph> <Paragraph position="5"> The underlying classifier used to run the experiments in this article is a maximum entropy model with a Gaussian prior (Chen and Rosenfeld, 1999), making use of a wide range of features, including lexical (words and morphs in a 3-word window, prefixes and suffixes of length up to 4, WordNet (Miller, 1995) for English), syntactic (POS tags, text chunks), gazetteers, and the output of other information extraction models. These features were described in (Florian et al., 2004), and are not discussed here. All three methods (AIO, joint, and cascade) instantiate classifiers based on the same feature types whenever possible. In terms of language-specific processing, the Arabic system uses morphological segments as input, while the Chinese system is a character-based model (the input elements x ∈ X are characters), but it has access to word segments as features.</Paragraph> <Paragraph position="6"> Performance in the ACE task is officially evaluated using a special-purpose measure, the ACE value metric (NIST, 2003; NIST, 2004). This metric assigns a score based on the similarity between the system's output and the gold standard at both the mention and entity level, and assigns different weights to different entity types (e.g. the person entity weighs considerably more than a facility entity, at least in the 2003 and 2004 evaluations). 
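As a rough illustration of the underlying learner, here is a minimal maximum entropy classifier trained by gradient ascent with an L2 weight penalty, which is how a Gaussian prior enters the gradient. The features, data, and hyperparameters are toy stand-ins; the actual system's feature set and training procedure are far richer than this sketch.

```python
import math

def train_maxent(data, classes, sigma2=10.0, lr=0.1, iters=300):
    """Toy maximum entropy model: w maps (feature, class) pairs to weights.

    The Gaussian prior with variance sigma2 shows up as the -w/sigma2 term
    in the gradient, i.e. an L2 penalty pulling weights toward zero.
    """
    w = {}
    for _ in range(iters):
        grad = {}
        for feats, label in data:
            scores = {c: sum(w.get((f, c), 0.0) for f in feats) for c in classes}
            z = sum(math.exp(s) for s in scores.values())
            for c in classes:
                p = math.exp(scores[c]) / z
                delta = (1.0 if c == label else 0.0) - p  # observed - expected
                for f in feats:
                    grad[(f, c)] = grad.get((f, c), 0.0) + delta
        for k in set(grad) | set(w):
            w[k] = w.get(k, 0.0) + lr * (grad.get(k, 0.0) - w.get(k, 0.0) / sigma2)
    return w

def predict(w, feats, classes):
    return max(classes, key=lambda c: sum(w.get((f, c), 0.0) for f in feats))

# Toy observations: each example is a set of active features plus a label
data = [({"cap", "title"}, "B-PER"), ({"cap"}, "B-ORG"),
        ({"lower"}, "O"), ({"lower", "the"}, "O")]
classes = ["B-PER", "B-ORG", "O"]
w = train_maxent(data, classes)
print(predict(w, {"lower"}, classes))  # "O"
```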
Since this article focuses on the mention detection task, we decided to use the more intuitive (unweighted) F-measure: the harmonic mean of precision and recall.</Paragraph> <Paragraph position="7"> 12The mention encoding is the IOB2 encoding presented in (Tjong Kim Sang and Veenstra, 1999) and introduced by (Ramshaw and Marcus, 1994) for the task of base noun phrase chunking.</Paragraph> <Paragraph position="8"> 13This is a special class, used for mentions that modify other labeled mentions; e.g. French in &quot;French wine&quot;. This tag is specific only to ACE'04.</Paragraph> <Paragraph position="9"> For the cascade model, the sub-task flow is presented in Figure 1. In the first step, we identify the mention boundaries together with their entity type (e.g. person, organization, etc.). In preliminary experiments, we also tried to &quot;cascade&quot; this first step, detecting mention boundaries before assigning entity types. The two strategies performed similarly overall: the separated model yielded higher recall at the expense of precision, while the combined model had higher precision but lower recall. We decided to use the system with higher precision.</Paragraph> <Paragraph position="10"> Once the mentions are identified and classified with the entity type property, the data is passed, in parallel, to the mention type detector and the subtype detector.</Paragraph> <Paragraph position="11"> For English and Arabic, we spent three person-weeks to annotate additional data labeled with only the entity type information: 550k words for English and 200k words for Arabic. As mentioned earlier, adding this data to the cascade model is trivial: the data is simply added to the training data, and the model is retrained. For the AIO model, we built another mention classifier on the additional training data and used it to label the original ACE training data. 
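The cascade flow just described (mention boundaries plus entity type first, then mention type and subtype in parallel) can be sketched as a simple pipeline. The classifier interfaces and stub models below are hypothetical placeholders, not the paper's actual implementation:

```python
def cascade_tag(tokens, entity_model, mention_type_model, subtype_model):
    """Hypothetical cascade: step 1 predicts spans with entity types;
    steps 2a and 2b then label each detected span independently
    ("in parallel"), conditioning only on step 1's output."""
    # Step 1: mention boundaries + entity type, e.g. (0, 2, "PER")
    spans = entity_model(tokens)
    results = []
    for start, end, etype in spans:
        mention = tokens[start:end]
        results.append({
            "span": (start, end),
            "entity_type": etype,
            # Steps 2a/2b depend on step 1 but not on each other
            "mention_type": mention_type_model(mention, etype),
            "subtype": subtype_model(mention, etype),
        })
    return results

# Stub models standing in for trained classifiers
entity_model = lambda toks: [(0, 2, "PER")]
mention_type_model = lambda m, t: "NAM" if m[0][0].isupper() else "NOM"
subtype_model = lambda m, t: "Individual" if t == "PER" else "Other"

print(cascade_tag(["John", "Smith", "arrived"],
                  entity_model, mention_type_model, subtype_model))
```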
It is important to note here that the ACE training data (called T in Section 3.4) is consistent with the additional training data T′: the annotation guidelines for T′ are the same as for the original ACE data, but we only labeled entity type information. The resulting classifications are then used as features in the final AIO classifier. The joint model uses the additional partially-labeled data in the way described in Section 3.4; the probabilities ^q(x,y) are updated every 5 iterations.</Paragraph> <Paragraph position="12"> Table 3 presents the results: overall, the cascade model performs significantly better than the all-in-one model in four out of the six tested cases - the numbers presented in bold indicate that the difference in performance relative to the AIO model is statistically significant.14 The joint model, while managing to recover some ground, falls between the AIO and the cascade models.</Paragraph> <Paragraph position="13"> When additional partially-labeled data was available, the cascade and joint models received a statistically significant boost in performance, while the all-in-one model's performance barely changed.</Paragraph> <Paragraph position="14"> This can be explained by the fact that the entity-type-only model is itself error-prone; measuring the performance of the model on the training data yields an F-measure of 82;15 therefore the AIO model will only access partially-correct 14To assert the statistical significance of the results, we ran a paired Wilcoxon test over the series obtained by computing the F-measure on each document in the test set. The results are significant at a level of at least 0.009.</Paragraph> <Paragraph position="15"> 15Since the additional training data is consistent in the labeling of the entity type, such a comparison is indeed possible. 
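The consistency between fully labeled data T and partially labeled data T′ relies on the notion sketched in footnote 11: a full label such as B-PER agrees with the partial label B, but not with O or I. A minimal check of that convention (the function name is our own) might look like:

```python
def consistent(full_label, partial_label):
    """Check whether a fully specified label (e.g. "B-PER") agrees with a
    partial, entity-type-free label (e.g. "B"), per footnote 11: B-PER is
    consistent with B, but not with O or I."""
    if partial_label == "O":
        return full_label == "O"
    # A partial "B" or "I" must match the boundary prefix of the full label
    return full_label.split("-", 1)[0] == partial_label

print(consistent("B-PER", "B"))  # True
print(consistent("B-PER", "O"))  # False
print(consistent("B-PER", "I"))  # False
```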
The above-mentioned score is on entity types only.</Paragraph> <Paragraph position="16"> data, and is unable to make effective use of it.</Paragraph> <Paragraph position="17"> In contrast, the training data for the entity type in the cascade model effectively triples, and this change is reflected positively in the 1.5-point increase in F-measure.</Paragraph> <Paragraph position="18"> Not all properties are equally valuable: the entity type is arguably more interesting than the other properties. If we restrict ourselves to evaluating the entity type output only (by projecting the output label to the entity type only), the difference in performance between the all-in-one and cascade models is even more pronounced: the cascade model outperforms the all-in-one and joint models in all cases except English'03, where the difference is not statistically significant.</Paragraph> <Paragraph position="19"> As far as run-time speed is concerned, the AIO and cascade models behave similarly: our implementation tags approximately 500 tokens per second (averaged over the three languages, on a Pentium III, 1.2 GHz, 2 GB of memory). Since a MaxEnt implementation's speed depends mostly on the number of features that fire on average on an example, and not on the total number of features, the joint model runs at roughly half that speed: the average number of features firing on a particular example is considerably higher. On average, the joint system can tag approximately 240 words per second. The training time is also considerably longer; it takes 15 times as long to train the joint model as it takes to train the all-in-one model (60 mins/iteration compared to 4 mins/iteration); the cascade model trains faster than the AIO model.</Paragraph> <Paragraph position="20"> Finally, it is worth mentioning that a system based on the cascade model participated in the ACE'04 competition, yielding very competitive results in all three languages.</Paragraph> </Section> </Paper>
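The entity-type-only evaluation mentioned above amounts to projecting each output label onto its entity type before scoring. A minimal sketch of that projection, together with the unweighted F-measure (harmonic mean of precision and recall over mention spans), might look like the following; the tuple layout is an assumption for illustration:

```python
def project(mentions):
    """Keep only span and entity type, dropping mention type / subtype."""
    return {(start, end, etype) for start, end, etype, *rest in mentions}

def f_measure(gold, predicted):
    """Unweighted F-measure: harmonic mean of precision and recall."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [(0, 2, "PER", "NAM"), (4, 5, "GPE", "NAM")]
pred = [(0, 2, "PER", "NOM"), (4, 5, "GPE", "NAM")]  # wrong mention type on the first span
print(f_measure(project(gold), project(pred)))  # 1.0 once projected to entity type only
```

Comparing the full four-field labels instead would score the first span as an error, so the projected evaluation isolates entity-type quality from the other properties.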