<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0319"> <Title>Probabilistic Coreference in Information Extraction</Title> <Section position="9" start_page="169" end_page="171" type="evalu"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="169" end_page="170" type="sub_section"> <SectionTitle> 5.1 Training the Maximum Entropy Models </SectionTitle> <Paragraph position="0"> For reasons described below, we trained separate pairwise probability models for each of the two approaches. We ran FASTUS over our development corpus, 72 texts of which produced coreference data.</Paragraph> <Paragraph position="1"> The texts gave rise to 132 coreference sets, and produced characteristics of context for 647 potential coreference relationships between pairs of templates.</Paragraph> <Paragraph position="2"> We created a key by analyzing the texts and entering the correct coreference relationships.</Paragraph> <Paragraph position="3"> We created three splits of training and test data.</Paragraph> <Paragraph position="4"> In the first split, the training set contained 60 messages, giving rise to 110 coreference sets, and the test set contained 12 messages, giving rise to 22 coreference sets. In the second split, the training set contained 57 messages, giving rise to 102 coreference sets, and the test set contained 15 messages, giving rise to 30 coreference sets. The third test set was created by combining the first and second test sets.</Paragraph> <Paragraph position="5"> The training set contained 47 messages, giving rise to 88 coreference sets, and the test set contained 25 messages (the first two test sets overlapped by two messages), which gave rise to 44 coreference sets.</Paragraph> <Paragraph position="6"> For training the maximum entropy model, only the sets of characteristics of context for pairwise coreference are relevant; the number of such sets differed between the two approaches as discussed below. 
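The pairwise model described above can be sketched as a conditional maximum entropy (logistic) model over binary characteristics of context. The feature names and toy training pairs below are invented for illustration, and plain gradient ascent stands in for whatever parameter-estimation procedure the authors actually used; this is a minimal sketch, not the paper's implementation.

```python
import math

# Hypothetical binary characteristics of context for a template pair (S, T);
# the names are illustrative, not the paper's actual feature set.
FEATURES = ["subsumes", "consistent", "two_slots_match", "name_match",
            "T_indefinite", "T_definite_preferred", "dist_far"]

def maxent_prob(weights, feats):
    """P(coreferent | feats) under a conditional maxent (logistic) model."""
    z = sum(w for w, f in zip(weights, feats) if f)
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, labels, epochs=500, lr=0.5):
    """Fit the lambda_i weights by gradient ascent on the log-likelihood."""
    w = [0.0] * len(FEATURES)
    for _ in range(epochs):
        for feats, y in zip(pairs, labels):
            p = maxent_prob(w, feats)
            for i, f in enumerate(feats):
                if f:
                    w[i] += lr * (y - p)  # observed minus expected
    return w

# Toy training pairs: feature vectors plus coreference outcome (1 = coreferent).
pairs = [
    [0, 1, 1, 1, 0, 1, 0],  # consistent, slots and names match -> coreferent
    [1, 1, 0, 0, 0, 0, 0],  # S properly subsumes T -> not coreferent
    [0, 1, 0, 0, 1, 0, 1],  # indefinite T, far away -> not coreferent
    [0, 1, 1, 0, 0, 1, 0],
]
labels = [1, 0, 0, 1]

w = train(pairs, labels)
print({name: round(wi, 2) for name, wi in zip(FEATURES, w)})
```

On data like this, the learned weights come out negative for features such as `subsumes` and positive for the slot-match features, mirroring the sign pattern the paper reports for its λi values.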
The evaluations were performed on the test sets with respect to the final distribution generated for the coreference sets, with the result being measured in terms of the average cross-entropy between the model and the test data.</Paragraph> <Paragraph position="7"> Data for the Evidential Model The evidential model utilizes the pairwise probabilities between all pairs of templates in a coreference set. Therefore, we used all such pairs in each training set to train the maximum entropy model. In the first training set, the 110 coreference sets gave rise to characteristics of context for 578 pairs of templates; in the second, the 102 coreference sets gave rise to characteristics for 581 pairs of templates. In the third training set, the 88 coreference sets gave rise to characteristics for 525 pairs of templates.</Paragraph> <Paragraph position="8"> The maximum entropy algorithm selected similar sets of features to model in each case.9 Among the systems of λi values learned, negative values were learned for the features in which template S properly subsumes template T and in which S and T are otherwise consistent. These two features model the cases in which template T contains information not contained in template S, reflecting the fact that expressions referring to the same entity usually do not become more specific as the discourse proceeds. A positive value was learned for the feature modeling cases in which templates S and T had at least two identical non-nil slot values, as well as for the feature modeling an exact match of complex name values.</Paragraph> <Paragraph position="9"> As one might expect, a negative value was learned for the case in which template T was created from an indefinite expression. A positive value was learned for the case in which template T was created from a definite expression and S was (perhaps transitively) the preferred referent according to the coreference module. 
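The average cross-entropy used as the evaluation measure above can be sketched as follows. The probabilities are illustrative, not figures from the experiments; with only two outcomes per decision, the uniform model scores exactly 1 bit, the baseline the paper compares against.

```python
import math

def avg_cross_entropy(probs, outcomes):
    """Average cross-entropy (in bits) between a model's probabilities of
    'coreferent' and the observed binary outcomes (1 = coreferent)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        total -= math.log2(p if y == 1 else 1.0 - p)
    return total / len(outcomes)

outcomes = [1, 0, 0, 1, 0]

# The uniform model assigns 0.5 everywhere: exactly 1 bit per decision.
print(avg_cross_entropy([0.5] * 5, outcomes))  # 1.0

# A mildly informative model (probabilities are invented for illustration).
print(avg_cross_entropy([0.7, 0.3, 0.4, 0.6, 0.2], outcomes))  # ~0.565
```

Lower is better: a model at 0.80 bits, as reported below, is only a modest improvement over the 1-bit uniform baseline, which is what motivates the paper's remark about the difficulty of the problem.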
Interestingly, no value was learned for template S being a possible but non-preferred referent, but a small positive value was learned for it not being on the list at all - presumably this covers cases in which the coreference module fails to identify an existing referent. All the distance features except for close and mid-distance received negative λi values, suggesting that coreference between close and mid-distance templates was more likely than coreference between templates that were very close, far away, and very far away.</Paragraph> <Paragraph position="10"> The cross-entropy of the learned model as applied to the training data in each case was about 0.80.</Paragraph> <Paragraph position="11"> Given that the cross-entropy of the uniform distribution and the data is 1 (as there are only two possible values for the random variable, i.e., S and T are coreferent or not), this relatively small reduction suggests that the problem has some amount of difficulty, which is consistent with the notable lack of clear signals of coreference characteristic of the texts in our domain.</Paragraph> <Paragraph position="12"> Data for the Merging Decision Model Unlike the evidential model, the merging decision model does not always utilize all of the pairwise probabilities between pairs in a coreference set. For instance, in determining the probability of a coreference configuration ((A B C)), it does not consider the probability assigned to the pair A and C except to check that they are compatible. Therefore, the training set for the maximum entropy algorithm was pared down to only contain those pairs that the merger would have considered in deriving the correct coreference configurations. The resulting data had the same coreference sets as the training data for the 
evidential approach, but consisted of characteristics of context for 415 template pairs in the first training set, 405 pairs in the second training set, and 370 pairs in the third training set. The features selected were similar to those in the training of the evidential model.</Paragraph> <Paragraph position="13"> The cross-entropies of the learned maximum entropy models and the training data were notably better than those for the evidential model, at about 0.70 in each case. This improvement is not particularly surprising. In the evidential case, the fact that all pairs of templates are considered results in a certain amount of &quot;washing out&quot; of the data, due to redundancy in coreference relationships. For instance, coreference between two templates that are far away might be unlikely if there are no coreferring expressions between them, but quite likely if there are. When just considering the pairwise feature sets, these two cases are not distinguished, so the resulting probability will be mixed. However, in the merging decision case, pairs that are far away will not be in the data set if there are coreferring expressions between them, and thus the probability for coreference at long distances will be diminished.</Paragraph> <Paragraph position="14"> The result is a &quot;cleaner&quot; set of data in which clearer distinctions may be found, as evidenced by the lower cross-entropy achieved.</Paragraph> </Section> <Section position="2" start_page="170" end_page="171" type="sub_section"> <SectionTitle> 5.2 Evaluation Results </SectionTitle> <Paragraph position="0"> The cross-entropies of the various approaches as applied to the three sets of test data are shown in Table 2. The number within parentheses indicates the number of times that the coreference set with the highest probability was the correct one. 
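The evaluation above picks out the coreference configuration with the highest probability. A brute-force version of that step can be sketched as below: enumerate every way of partitioning a coreference set and score each configuration from pairwise probabilities. The probabilities in `P` are invented, and this evidential-style product score is a simplification of the paper's model (no normalization over configurations is shown). Note that the number of configurations is the Bell number of the set size, which is why the paper later calls for pruning in applications with longer texts.

```python
from itertools import combinations

def partitions(items):
    """Yield all ways to partition items into coreference sets."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block, or into a new singleton block.
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

# Hypothetical pairwise coreference probabilities between templates A, B, C.
P = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.3}

def pair_p(x, y):
    return P.get((x, y), P.get((y, x)))

def score(config):
    """Evidential-style score: each pair contributes its probability of
    being coreferent (same block) or not (different blocks)."""
    block_of = {t: i for i, block in enumerate(config) for t in block}
    s = 1.0
    for x, y in combinations(sorted(block_of), 2):
        p = pair_p(x, y)
        s *= p if block_of[x] == block_of[y] else 1.0 - p
    return s

configs = list(partitions(["A", "B", "C"]))
print(len(configs))            # 5 partitions of a 3-element set
best = max(configs, key=score)
print(best)                    # [['A', 'B'], ['C']]
```

Here the pairwise evidence (A-B likely coreferent, both unlikely to corefer with C) selects the configuration ((A B) (C)), illustrating how the highest-probability configuration is compared against the key.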
As hoped, both the evidential and merging decision approaches outperformed the uniform and greedy approaches with respect to cross-entropy.10 Interestingly, and perhaps surprisingly, the evidential approach outperformed the merging decision model, even though in many respects the latter is more natural and elegant. While considering feature sets for all pairs may wash out the training data for the pairwise probability model somewhat, the evidence provided by all pairs appears to more than make up for the difference. (10: The merging decision approach did not do any better than the greedy approach in terms of raw accuracy, and in fact did somewhat worse in the third test. Again, however, the reduction in cross-entropy is important, as the statistics produced by the system will be integrated with other probabilistic factors in the downstream system.)</Paragraph> <Paragraph position="1"> Given that a goal of these experiments is to see how well the strategies would perform with a fairly crude, easily computable, and portable set of characteristics of context, we are encouraged by the results of these experiments, especially considering the limited amount of training data that was available.</Paragraph> <Paragraph position="2"> Nonetheless, additional data is necessary to confirm the results of these initial evaluations. Although the consistency of the results between the first two training/test divisions may suggest that the amount of training data is sufficient for the rather coarsely grained feature set used, the sizes of the test sets are potentially of concern, which motivated our inclusion of the third training/test division. Despite the reduction in training data and corresponding increase in test data, the results of this experiment appear to be consistent with the first two.</Paragraph> <Paragraph position="3"> There are a variety of characteristics of context that one might add to improve the models. 
For instance, one could add a characteristic indicating when a template is created from a phrase in a subject line or table, as many cases of coreference with subsequent indefinite phrases occur in this circumstance. Other types of information about text type, text structure, and more finely grained distinctions with respect to referential types (e.g., modeling pronouns differently than other definite NPs) would all likely further improve the model, although for some of these, additional training data would be required and more domain and genre dependence may result.</Paragraph> <Paragraph position="4"> While this work was motivated by a need to pass probabilistic output to a downstream data fusion system, these methods can also be applied system-internally, to supplant existing algorithms for merging in IE settings that do not allow for probabilistic output. In this scenario, the system simply performs the template merging dictated by the most probable coreference configuration for a given coreference set. However, as noted earlier, the texts in our application are relatively short, and therefore the coreference sets are usually of manageable size. Significantly larger coreference sets can lead to an enormous number of possible coreference configurations.</Paragraph> <Paragraph position="5"> Therefore, to address this task in applications with much longer texts, mechanisms beyond those that were necessary here will be required for intelligently pruning the search space and subsequently smoothing the distributions.</Paragraph> </Section> </Section> </Paper>