<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0319">
  <Title>Probabilistic Coreference in Information Extraction</Title>
  <Section position="3" start_page="0" end_page="163" type="metho">
    <SectionTitle>
2 Overview of the Problem
</SectionTitle>
    <Paragraph position="0"> Let us consider an example text of the sort that we encounter in our application: 1 1The texts in our application are messages consisting of free text, possibly interspersed with formatted tables or charts which themselves may contain natural language fragments that require analysis. While this example is shorter than most texts in our corpus, the relevant free text portions of the messages are typically no longer than a few paragraphs. The style displayed in this example is fairly typical, although in some cases the sentence structure is more telegraphic.</Paragraph>
  </Section>
  <Section position="4" start_page="163" end_page="163" type="metho">
    <SectionTitle>
[Figure: information extraction architecture. NL text and other information sources feed into information extraction (pattern recognition followed by template merging), whose output goes to downstream processing.]
</SectionTitle>
    <Paragraph position="0"> A rail depot was found 100 km southwest of the capitol of Raleigh, consisting of extensive admin and support areas (similar to the ammunition depot in Fairview), two material storage areas, extensive transshipment facilities (some of which are under construction immediately east of the depot), and several training areas.</Paragraph>
    <Paragraph position="1"> We focus on the four mentions of depots in the text, which are highlighted with italics. The pattern matching phases of FASTUS produce templates similar to those shown in Figure 2.</Paragraph>
  </Section>
  <Section position="5" start_page="163" end_page="164" type="metho">
    <SectionTitle>
Figure 2: Templates produced by the pattern matching phases for the example passage.
Template A: FACILITY: DEPOT, NUMBER: 1, LOCATION: KINSTON, TYPE: RAIL
Template B: FACILITY: DEPOT, NUMBER: 1, TYPE: RAIL
Template C: FACILITY: DEPOT, NUMBER: 1, LOCATION: FAIRVIEW, TYPE: AMMUNITION
Template D: FACILITY: DEPOT, NUMBER: 1
</SectionTitle>
    <Paragraph position="0"> We will refer to a set of templates that have potential coreference relationships among them as a coreference set, 2 and possible partitions of coreferential templates in the set as coreference configurations. In the coreference set containing templates A, B, C, and D, system knowledge external to the probabilistic model indicates that the type Ammunition in template C is not compatible with the type Rail in A and B; therefore these are taken a priori to be non-coreferential. Given these incompatibilities, seven possible coreference configurations remain. Template names grouped within parentheses are taken to be mutually coreferring; we will refer to such a grouping as a cell of the coreference configuration.</Paragraph>
    <Paragraph position="2"> 1. (A B D) (C)
2. (A B) (C D)
3. (A B) (C) (D)
4. (A D) (B) (C)
5. (B D) (A) (C)
6. (C D) (A) (B)
7. (A) (B) (C) (D)
The first of these configurations expresses the correct coreference relationships for the example.</Paragraph>
    <Paragraph position="3"> Given a coreference set of templates, possibly coupled with a list of template pairs known a priori not to corefer, the task is to assign a probability distribution over the possible coreference configurations for that set.</Paragraph>
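As a concrete illustration of how the configurations arise, the sketch below (not part of the original system; the template names and the incompatibility constraint follow the example above) enumerates the set partitions of {A, B, C, D} and discards any partition that places an a priori incompatible pair in the same cell, leaving the seven configurations listed above.

def partitions(items):
    """Enumerate all set partitions of a list of items."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # place `first` into each existing cell in turn ...
        for i, cell in enumerate(smaller):
            yield smaller[:i] + [[first] + cell] + smaller[i + 1:]
        # ... or into a new cell of its own
        yield smaller + [[first]]

# A priori constraint from the example: template C (TYPE Ammunition) cannot
# corefer with A or B (TYPE Rail).
INCOMPATIBLE = {frozenset("AC"), frozenset("BC")}

def valid(config):
    """A configuration is admissible if no cell contains an incompatible pair."""
    return all(frozenset((x, y)) not in INCOMPATIBLE
               for cell in config
               for x in cell for y in cell if x != y)

configs = [c for c in partitions(list("ABCD")) if valid(c)]
print(len(configs))                 # 7, matching the configurations listed above
for c in configs:
    print([tuple(cell) for cell in c])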
    <Paragraph position="4"> Relationship to Past Work While there have been previous investigations of empirical approaches to coreference, these have generally centered on the task of assigning correct referents for anaphoric expressions (Connolly, Burger, and Day, 1994; Aone and Bennett, 1995; Lappin and Leass, 1994; Dagan and Itai, 1990; Dagan et al., 1995; Kennedy and Boguraev, 1996a; Kennedy and Boguraev, 1996b). 2Templates A, B, C, and D constitute the only coreference set in this example, since none of the other NPs (e.g., the various "areas" mentioned) are compatible with any of the others. In general, however, a text can give rise to any number of distinct coreference sets, each of which will be assigned its own probability distribution.</Paragraph>
    <Paragraph position="5"> The current task deviates from that problem in several respects. First, in our task, all coreference relationships among templates are modeled regardless of the "referentiality" of the phrases that led to their creation. For instance, indefinites will sometimes corefer with a previously described entity; a typical case is illustrated by the coreference between the indefinite "a rail depot" and the depot introduced in the subject line in the example passage. Also, entities described with bare plurals are commonly found to be coreferential with other entities, in addition to cases in which they have their more standard generic meanings. On the other hand, definite noun phrases are often not referential to items evoked in the text (e.g., "the ammunition depot in Fairview"). Determining when such expressions are discourse-anaphoric is part of the task; this information is generally not known to the system a priori. Second, the results of this task will be evaluated by the probability assigned to the correct state of affairs with respect to an entire coreference set, and not by the number of correct antecedents assigned to anaphoric expressions. Modeling at the level of coreference sets ensures that the probabilities are consistent when considering the global state of affairs being described in the text. Furthermore, the role of probabilities for this application goes beyond selecting the correct coreference relationships - the probability assigned to an alternative will be central in determining how the downstream system will weigh it against information from other sources during data fusion. A system that assigns a probability of 0.9 to correct answers is more successful than one that assigns a probability of 0.6 to them.</Paragraph>
    <Paragraph position="6"> The Limitations of IE Systems The properties of typical IE systems such as FASTUS also make this task challenging. For one, successful modeling of coreference relationships is hampered by the crudeness of the representations used. The templates that are created are fairly shallow and may be incomplete. A reliance on detailed information about the context can prove detrimental if such information is often missed by the system. Also, FASTUS does not build up complex representations for the syntax and semantics of sentences, placing limits on the extent to which such information can be utilized in determining coreference. Lastly, there are the inaccuracies that result from processing real text. The pattern matching phases of FASTUS may intermittently misanalyze phrases that serve as antecedents for subsequent referring expressions. Therefore, for example, with respect to an identified coreference set, it may be correct to place a referential pronoun in its own cell (implying that it does not corefer with anything), simply because system error caused its antecedent not to be included in the set.</Paragraph>
    <Paragraph position="7"> Outline of the Approach The number of coreference configurations over which a distribution is to be assigned depends on the number of templates in the coreference set, and the set of a priori constraints against coreference between some of its members. As there are many scenarios that will never be encountered in a corpus of training data of any reasonable size, it would be hopeless to attempt to estimate a conditional distribution for each possibility directly. To make matters worse, training data comes at a cost, as keys have to be coded by hand. One of the goals of this effort is to make it possible to train probabilities in new domains quickly, which requires an approach that is successful with a limited amount of training data.</Paragraph>
    <Paragraph position="8"> However, it would be reasonable to expect that we have enough data to estimate distributions for coreference sets with only two members. This suggests a two-step approach. First, we develop a general model of coreference between any two templates, and apply it to pairwise combinations of templates in a given coreference set without regard to the other templates in the set. We then utilize a method for combining the resulting probabilities to form a distribution over all the possible coreference configurations. We describe our method for modeling probabilities between pairs of templates in the next section, and describe two methods for deriving a distribution over the coreference configurations in Section 4. We report on an evaluation and comparison of the approaches in Section 5.</Paragraph>
  </Section>
  <Section position="6" start_page="164" end_page="165" type="metho">
    <SectionTitle>
3 Training A Model for Pairs of
Templates
</SectionTitle>
    <Paragraph position="0"> Our first task is to derive a model for determining the probability that two templates corefer, conditioned on various characteristics of the context. For this we employ an approach to maximum entropy modeling described by Berger et al. (1996).</Paragraph>
    <Paragraph position="1"> Maximum Entropy Modeling Suppose we wish to model some random process, such as that which determines coreference between two templates generated by an IE system, based on various characteristics of the context that influence this process, such as the content of the templates themselves, the form of the natural language expressions from which the templates were created, and the distance between those expressions in the text. We refer to the collection of such characteristics for a given example as its context x, and the value denoting the output of the process as y. We can define a set of binary features that relate a possible value of a characteristic of x with a possible outcome y, i.e., whether the two templates corefer (y = 1) or not (y = 0). For example, a feature f1(x, y) pairing the characteristic of S and T having identical slot values with the outcome that they corefer would be defined as follows.</Paragraph>
    <Paragraph position="2"> Binary Feature f1(x, y):</Paragraph>
    <Paragraph position="3"> f1(x, y) = 1 if S and T have identical slot values and S and T corefer, and 0 otherwise.</Paragraph>
  </Section>
  <Section position="7" start_page="165" end_page="166" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> From these features we can define constraints on the probabilistic model that is learned, in which we assume that the expected value of the feature with respect to the distribution of the training data (p_d) holds with respect to the general model (p_m).</Paragraph>
    <Paragraph position="1"> Constraints: Σ_{x,y} p_d(x, y) f_i(x, y) = Σ_{x,y} p_d(x) p_m(y | x) f_i(x, y), for each selected feature f_i.</Paragraph>
    <Paragraph position="3"> Given that we have chosen a set of such constraints to impose on our model, we wish to identify that model which has the maximum entropy - this is the model that assumes the least information beyond those constraints. Berger et al. (1996) show that this model is a member of an exponential family with one parameter for each constraint, specifically a model of the form p_m(y | x) = (1/Z(x)) exp(Σ_i λ_i f_i(x, y)).</Paragraph>
    <Paragraph position="5"> The parameters λ1, ..., λn are Lagrange multipliers that impose the constraints corresponding to the chosen features f1, ..., fn. The term Z(x) normalizes the probabilities by summing over all possible outcomes y. Berger et al. (1996) demonstrate that the optimal values for the λi's can be obtained by maximizing the likelihood of the training data with respect to the model, which can be performed using their improved iterative scaling algorithm.</Paragraph>
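To make the exponential form concrete, the following sketch (with an illustrative feature set and weights, not the trained model; the "distance" feature is invented for the example) computes p_m(y | x) for the binary outcome y, with Z(x) summing over both outcomes as described.

import math

def f1(x, y):
    """1 if S and T have identical slot values and the outcome is 'corefer' (y = 1)."""
    return 1.0 if x["identical_slots"] and y == 1 else 0.0

def f2(x, y):
    """1 if the mentions are very far apart and the outcome is 'do not corefer' (y = 0).
    This feature is invented for illustration."""
    return 1.0 if x["distance"] == "very far away" and y == 0 else 0.0

FEATURES = [f1, f2]
LAMBDAS = [1.3, 0.9]    # illustrative weights; in practice obtained by improved iterative scaling

def p_m(y, x):
    """p_m(y | x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x)."""
    def unnormalized(y_):
        return math.exp(sum(lam * f(x, y_) for lam, f in zip(LAMBDAS, FEATURES)))
    z = unnormalized(0) + unnormalized(1)     # Z(x): sum over the possible outcomes
    return unnormalized(y) / z

context = {"identical_slots": True, "distance": "close"}
print(p_m(1, context))   # probability that the two templates corefer in this context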
    <Paragraph position="6"> In practice, we will not want to incorporate constraints for all of the features that we might define, but only those that are most relevant and informative. Therefore, we use a procedure for selecting which of our pool of features should be made active.</Paragraph>
    <Paragraph position="7"> At each iteration, the algorithm approximates the gain in the model's predictiveness that would result from imposing the constraints corresponding to each of the existing inactive features, and selects the one with the highest anticipated payoff. Upon making this feature active, the λi's for all active features are (re)trained so that the constraints are all met simultaneously. The feature selection process is iterated until the approximate gain for all the remaining inactive features is negligible.</Paragraph>
    <Paragraph position="8"> Characteristics of Context for Template Coreference We now need a set of possible characteristics of context on which the algorithm could choose to conditionalize in deriving the probabilistic model. For our initial experiments, we utilized a set of easily computable, but fairly crude, characteristics. 3 These characteristics fall into three categories. In what follows, we take S and T to be arbitrary templates where the natural language expression from which T was created appears later in the text than the expression from which S was created. null The first category relates to the contents of the templates themselves. We model the relationship between S and T as one of the following: S and T have identical slot values, S is properly subsumed by T, S properly subsumes T, or S and T are otherwise consistent. For instance, in our example in Section 2, template A is properly subsumed by template B, and A, B, and C are all properly subsumed by D, since in each case the latter template is more general than the former. We also have a binary characteristic for S and T having at least two (non-nil) slot values in common. Finally, we have a characteristic for modeling when the values of the NAME slot of a template are both multi-worded and identical; this is a crude heuristic for identifying matching unique identifiers.</Paragraph>
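The template-content characteristics can be read directly off the slot fillers. The sketch below is illustrative only (templates as dictionaries of non-nil slots, relation names paraphrased), with slot values following the templates shown above.

def content_relation(s, t):
    """Relation between the non-nil slot values of S and T, as described above."""
    s_slots, t_slots = set(s.items()), set(t.items())
    if s_slots == t_slots:
        return "identical slot values"
    if t_slots < s_slots:
        return "S properly subsumed by T"   # T is the more general template
    if s_slots < t_slots:
        return "S properly subsumes T"
    return "otherwise consistent"           # incompatible pairs are excluded a priori

def share_two_slots(s, t):
    """Binary characteristic: S and T have at least two (non-nil) slot values in common."""
    return len(set(s.items()) & set(t.items())) >= 2

A = {"FACILITY": "DEPOT", "NUMBER": "1", "LOCATION": "KINSTON", "TYPE": "RAIL"}
B = {"FACILITY": "DEPOT", "NUMBER": "1", "TYPE": "RAIL"}
D = {"FACILITY": "DEPOT", "NUMBER": "1"}

print(content_relation(A, B))   # S properly subsumed by T
print(content_relation(A, D))   # S properly subsumed by T
print(share_two_slots(A, B))    # True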
    <Paragraph position="9"> The second category of characteristics relates to the form of reference used in the expression from which T was created. 3One could imagine a variety of more detailed and informative characteristics of context than those used here. However, in performing these experiments, we are interested in how far we can get with a fairly simple strategy that will port relatively easily to new domains, rather than relying heavily on information that is specific to our current domain. A fairly coarse-grained set of characteristics also allows us to restrict ourselves to a relatively small set of training data; likewise we will not want to encode a large set of data for each new domain.</Paragraph>
    <Section position="1" start_page="166" end_page="166" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> Specifically, we model whether T was described with an indefinite phrase, a definite phrase (including pronouns), or neither of these (e.g., a bare, non-pronominal noun phrase). In the case of definite expressions, we also consider the recommendations of a distinct coreference module within FASTUS. We have a characteristic representing whether the potential antecedent is the preferred antecedent, 4 a non-preferred, but possible antecedent, or not on the list of possible antecedents. 5 The final category of characteristics relates to the distance in the text between the expressions from which S and T were created, which we categorize as being in one of five equivalence classes: very close, close, mid-distance, far away, and very far away. These distances are measured crudely (i.e., by character length) so as not to be dependent on the accuracy of methods for identifying more complex boundaries (e.g., clause, sentence, and discourse segment boundaries).</Paragraph>
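A minimal sketch of the distance characteristic; distance is measured in characters as described, but the five thresholds below are placeholders, since the actual class boundaries used in the experiments are not given.

def distance_class(char_distance):
    """Bin the character distance between the two mentions into five classes.
    The thresholds are illustrative placeholders."""
    if char_distance < 50:
        return "very close"
    if char_distance < 150:
        return "close"
    if char_distance < 500:
        return "mid-distance"
    if char_distance < 1500:
        return "far away"
    return "very far away"

print(distance_class(320))   # mid-distance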
      <Paragraph position="1"> The results of training the maximum entropy models are discussed in Section 5. To illustrate the approaches described in the next section, we will use the probabilities for the templates from the example passage in Section 2, shown in Table 1, which were produced from the parameters induced from one of the training sets.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="166" end_page="169" type="metho">
    <SectionTitle>
4 Inferring a Model for Coreference Sets
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="166" end_page="166" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> We now have a method for obtaining a model that assigns probabilities to the pairs of templates (henceforth, "pairwise probabilities") in a coreference set that can possibly corefer. If there are only two templates in the coreference set, then we have the distribution we seek. However, if there are more than two templates, we must utilize the pairwise probabilities to derive a distribution over the members of the set of coreference configurations. In the following sections, we describe two approaches to recovering such a distribution, followed by a description of two baseline metrics. An evaluation of these approaches is then given in Section 5.</Paragraph>
      <Paragraph position="1"> 4We also take S to be the preferred antecedent of T if there is a chain of preferred referents linking them, e.g., if there is a template R that is the preferred referent of T and template S is the preferred referent of R. 5Although we do not model information about the surface positions of the expressions from which S and T were created within their respective sentences, the coreference module does take such information into account in determining likely antecedents of definite expressions.</Paragraph>
    </Section>
    <Section position="2" start_page="166" end_page="168" type="sub_section">
      <SectionTitle>
4.1 An Evidential Reasoning Approach
</SectionTitle>
      <Paragraph position="0"> The first approach we describe uses the pairwise probabilities as sources of evidence that inform the choice of model for the coreference sets. The list of coreference configurations for our example passage is repeated below; we will refer to these configurations by their corresponding numbers.</Paragraph>
      <Paragraph position="1"> 1. (A B D) (C)
2. (A B) (C D)
3. (A B) (C) (D)
4. (A D) (B) (C)
5. (B D) (A) (C)
6. (C D) (A) (B)
7. (A) (B) (C) (D)
We recast a probability that two templates S and T corefer as a mass distribution over two members of the power set of coreference configurations, namely the set containing exactly those configurations in which S and T occupy the same cell, and the set containing those in which they do not. For instance, the probability that A and B corefer was determined to be 0.671; mapping this to corresponding sets of coreference configurations results in the mass distribution mAB, in which mAB({Configs 1, 2, 3}) = 0.671 and mAB({Configs 4, 5, 6, 7}) = 0.329. This mass distribution can be seen as representing the beliefs of an observer who only has access to templates A and B, and who is therefore ignorant about their relationship to C and D. We can view the other pairwise probabilities for the coreference set in the same manner.</Paragraph>
      <Paragraph position="2"> In the best of all worlds, we might identify a model that is consistent with the mass distributions provided by all the pairwise probabilities. However, such a model may not, and often will not, exist. This is the case for the pairwise probabilities in our example, which can be seen most easily by considering only templates A, C, and D. The probability of A and D coreferring is 0.505 and of C and D coreferring is 0.504. Because we know that A and C cannot corefer, the coreference configurations in which A and D corefer and the configurations in which C and D corefer are mutually exclusive. Therefore, there would have to be a distribution that assigns 0.505 of probability mass to a set of configurations that is mutually exclusive from a set that is assigned 0.504 of probability mass. Obviously, this cannot be done with a set of probabilities that add up to 1. This inconsistency arises from the manner in which the pairwise probabilities are estimated. The probability of coreference between templates situated similarly to A and D may be 0.505 with respect to all contexts in the training data; however, it is almost certainly not this high with respect to the subset of cases in which a template similar to C is similarly situated. The same reasoning applies to the probability of C and D coreferring in light of the existence of A. Unfortunately, the existence of templates other than the pair being modeled is the type of conditional information for which we have little hope of accounting in a general and statistically significant manner.</Paragraph>
      <Paragraph position="3"> Therefore, we may be left with a series of mass distributions defined over sets of coreference configurations that are in inherent conflict. Instead of viewing these distributions as constraints on the underlying probabilistic model, we view them as sources of evidence. The question is then how to take these sources into account, given that they may be partially contradictory. Dempster's Rule of Combination (Dempster, 1968) provides a mechanism for doing this. Dempster's rule combines two mass distributions m1 and m2 to form a third distribution m3 that represents the consensus of the original two distributions; the new mass distribution in effect leans toward the areas of agreement between the original distributions and away from points of conflict. Dempster's rule is defined as follows:</Paragraph>
      <Paragraph position="4"> m3(A) = ( Σ_{Ai ∩ Aj = A} m1(Ai) m2(Aj) ) / (1 - κ) for A ≠ ∅, where the conflict κ = Σ_{Ai ∩ Aj = ∅} m1(Ai) m2(Aj).</Paragraph>
      <Paragraph position="5"> The Ai in our case are members of the power set of possible coreference configurations. In our example above, mAB assigns probability mass to two such Ai, the set containing configurations 1, 2, and 3, and the set containing configurations 4, 5, 6, and 7.</Paragraph>
      <Paragraph position="6"> The value κ is called the conflict between the mass distributions being combined; it provides a measure of the degree of disagreement between them. When κ = 0, the original distributions are compatible; when κ = 1, they are in complete conflict and the result is undefined. When 0 &lt; κ &lt; 1, some conflict between the distributions exists; Dempster's rule has the effect of focusing on the agreement between the distributions by eliminating the conflicting portions and normalizing what remains.</Paragraph>
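Dempster's rule can be sketched directly over mass functions represented as dictionaries keyed by frozensets of configuration numbers. In the example below, m_AB uses the values given above, and m_AD uses the 0.505 pairwise probability cited earlier; its focal sets follow from the configuration list (configurations 1 and 4 are the ones in which A and D share a cell).

def dempster_combine(m1, m2):
    """Dempster's rule of combination for two mass functions over sets of configurations."""
    combined = {}
    conflict = 0.0                                   # kappa: mass falling on the empty set
    for a1, v1 in m1.items():
        for a2, v2 in m2.items():
            inter = a1 & a2
            if inter:
                combined[inter] = combined.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2
    if conflict >= 1.0:
        raise ValueError("complete conflict: combination undefined")
    return {a: v / (1.0 - conflict) for a, v in combined.items()}

# m_AB: A and B corefer in configurations 1-3 (mass 0.671), not in 4-7 (mass 0.329).
m_ab = {frozenset({1, 2, 3}): 0.671, frozenset({4, 5, 6, 7}): 0.329}
# m_AD: A and D corefer only in configurations 1 and 4 (pairwise probability 0.505).
m_ad = {frozenset({1, 4}): 0.505, frozenset({2, 3, 5, 6, 7}): 0.495}

for focal_set, mass in dempster_combine(m_ab, m_ad).items():
    print(sorted(focal_set), round(mass, 3))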
      <Paragraph position="7"> We can therefore use Dempster's Rule to resolve the conflict between the pairwise probability distributions to generate a distribution over the coreference configurations. Because we have pairwise probabilities for each possibly coreferring pair in the coreference set, it turns out that the Dempster solution is more easily stated and computed here than in the general case. The solution is identical to the one that results when the probabilities of all the relevant pairwise relations (indicating either coreference or not) are multiplied, normalized by the amount of probability mass assigned to coreference configurations that are impossible because coreference is transitive. For instance, the probability for the coreference configuration ((A B) (C)) is initially computed to be 6</Paragraph>
      <Paragraph position="9"> However, using this method, impossible combinations (e.g., A =c B, B =c C, A ≠c C) will also receive positive probability mass. If we normalize the probabilities of the possible combinations by redistributing the probability mass assigned to all impossible combinations, the result is the same as that obtained by iteratively combining the pairwise distributions using Dempster's Rule.</Paragraph>
      <Paragraph position="10"> The resulting distribution for our example is:
1. (A B D) (C) = .383
2. (A B) (C D) = .184
3. (A B) (C) (D) = .123
4. (A D) (B) (C) = .062
5. (B D) (A) (C) = .125
6. (C D) (A) (B) = .061
7. (A) (B) (C) (D) = .061
In motivating our approach, we noted that we cannot expect to have the amount of training data necessary to directly estimate distributions for all the possible scenarios with which we may be confronted. Limiting ourselves to modeling probabilities between pairs of templates, however, leads to inconsistencies because of the failure to take into account the crucial information provided by the existence of other compatible templates. Dempster's Rule can be seen as a very coarse-grained approach to conditioning on context in this regard. The contributions of the pairwise models are conditioned not on the existence of other templates in context, but by virtue of the existence of conflicting models derived from those templates. For instance, the pairwise probability of coreference between C and D was originally 0.504, which might be reasonable if those were the only two templates generated from the text. 7 However, the probability that C and D corefer in the final distribution is only 0.245, the sum of the probabilities of the two partitions in which C and D occupy the same cell. This adjustment results from the existence of templates A and B: the fact that template D has a high probability of coreferring with each, combined with the fact that template C is incompatible with each, reduces the likelihood that C and D corefer. Therefore, the preferences for particular coreferential dependencies can change when considering the larger picture of possible coreference sets. 6We use the notation =c to indicate coreference. 7Actually this number is lower than it would have been, because template B was identified as the preferred antecedent for template D instead of template C. If C and D were the only two templates generated, then C would have been identified as the preferred antecedent, thus raising the probability.</Paragraph>
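The multiply-and-normalize computation just described can be sketched as follows. The A-B, A-D, and C-D probabilities are the values cited in the text; the B-D value is a placeholder (the full Table 1 values are not quoted above), so the printed distribution only approximates the one given.

from itertools import combinations

TEMPLATES = ["A", "B", "C", "D"]
PAIRWISE = {frozenset("AB"): 0.671, frozenset("AD"): 0.505, frozenset("CD"): 0.504,
            frozenset("BD"): 0.60,                       # placeholder value
            frozenset("AC"): 0.0, frozenset("BC"): 0.0}  # incompatible a priori

CONFIGS = [
    [("A", "B", "D"), ("C",)],
    [("A", "B"), ("C", "D")],
    [("A", "B"), ("C",), ("D",)],
    [("A", "D"), ("B",), ("C",)],
    [("B", "D"), ("A",), ("C",)],
    [("C", "D"), ("A",), ("B",)],
    [("A",), ("B",), ("C",), ("D",)],
]

def same_cell(config, x, y):
    return any(x in cell and y in cell for cell in config)

def raw_score(config):
    """Multiply, for every pair, p if the pair corefers in this configuration, else 1 - p."""
    score = 1.0
    for x, y in combinations(TEMPLATES, 2):
        p = PAIRWISE[frozenset((x, y))]
        score *= p if same_cell(config, x, y) else 1.0 - p
    return score

scores = [raw_score(c) for c in CONFIGS]
total = sum(scores)                       # normalize over the valid configurations only
for config, s in zip(CONFIGS, scores):
    print(config, round(s / total, 3))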
      <Paragraph position="11"> In practice, coreference sets that are significantly larger than the one we have considered here can lead to an explosive number of possible coreference configurations. We have implemented simple methods for pruning very low probability configurations during processing and for smoothing the resulting distribution. The latter step is accomplished, when necessary, by eliminating certain low-probability configurations at the end of processing. The probability mass from these configurations is distributed uniformly over all the possible configurations that have been eliminated. While this is unlikely to be the best strategy for smoothing from the standpoint of probabilistic modeling, we are constrained by the number of alternatives we can report to the downstream system. Smoothing in this way allows us to report only the coreference configurations with non-negligible probability, along with a single probability that is assigned uniformly to the remainder of the possible configurations.</Paragraph>
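The reporting-time smoothing can be sketched as below: configurations above a threshold are reported individually, and the mass of the eliminated configurations is spread uniformly over them and reported as a single probability (the threshold value is illustrative).

def smooth_for_reporting(distribution, threshold=0.1):
    """Keep configurations whose probability meets the threshold; return them together
    with one uniform probability covering each eliminated configuration."""
    kept = {c: p for c, p in distribution.items() if p >= threshold}
    eliminated_mass = sum(p for c, p in distribution.items() if c not in kept)
    n_eliminated = len(distribution) - len(kept)
    uniform = eliminated_mass / n_eliminated if n_eliminated else 0.0
    return kept, uniform

dist = {"(A B D)(C)": 0.383, "(A B)(C D)": 0.184, "(A B)(C)(D)": 0.123,
        "(A D)(B)(C)": 0.062, "(B D)(A)(C)": 0.125, "(C D)(A)(B)": 0.061,
        "(A)(B)(C)(D)": 0.061}
kept, uniform = smooth_for_reporting(dist)
print(kept)      # configurations reported with their own probabilities
print(uniform)   # single probability reported for each remaining configuration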
    </Section>
    <Section position="3" start_page="168" end_page="169" type="sub_section">
      <SectionTitle>
4.2 A Model Based on Merging Decisions
</SectionTitle>
      <Paragraph position="0"> The second approach we consider models the likelihood of correctness of decisions that a template merger such as the one used in FASTUS would make in processing a text. To illustrate, consider the case in our example in which the probability of the coreference configuration ((A B D) (C)) is determined.</Paragraph>
      <Paragraph position="1"> The merger would make the following decisions in deriving such a configuration, in which the notation "B&amp;A" represents the template that results from templates A and B having previously been merged.</Paragraph>
      <Paragraph position="3"> 1. B =c A? → yes
2. C =c B&amp;A? → no
3. D =c C? → no
4. D =c B&amp;A? → yes
We therefore model the probability of this coreference configuration as the product of each of the corresponding pairwise probabilities. Since we cannot model coreference involving objects that have resulted from previous (hypothetical) merges - the appropriate feature values for distance and form of referring expression would become unclear - we make the following approximation:</Paragraph>
      <Paragraph position="5"> P(X =c Y1&amp;...&amp;Yn) ≈ P(X =c Yn), in which Yn is the most recently created template in Y1, ..., Yn.</Paragraph>
      <Paragraph position="6"> Using the probabilities from Table 1, 8 the probability assigned to ((A B D) (C)) would therefore be</Paragraph>
      <Paragraph position="8"> Note that unlike the evidential approach, the probability of the pair D and A coreferring does not come into play, given that coreference between D and B and between B and A has been factored in.</Paragraph>
      <Paragraph position="9"> This approach yields a probabilistic model as given, that is, the probabilities sum to 1 without normalization. However, in certain circumstances the approximation above will generate probability mass for an impossible case, specifically when it is known a priori that X is incompatible with one of the templates Y1, ..., Yn-1. For instance, if templates B and C in our example had been compatible (with A and C remaining incompatible), then the approximation above would assign positive probability mass to the coreference configuration ((A B C) (D)), because the zero probability of A coreferring with C would not come into play. Therefore we modify the above approximation to apply only if X and each of Y1, ..., Yn-1 are compatible; otherwise, the probability mass assigned is used for normalization. One can see that this can only improve the pure form of the model.</Paragraph>
      <Paragraph position="10"> Using the pairwise probabilities from Table 1, the results of the model as applied to the example are:</Paragraph>
      <Paragraph position="11"> 1. (A B D) (C) = .250
2. (A B) (C D) = .338
3. (A B) (C) (D) = .083
4. (A D) (B) (C) = .020
5. (B D) (A) (C) = .123
6. (C D) (A) (B) = .166
7. (A) (B) (C) (D) = .020</Paragraph>
      <Paragraph position="12"> 8We use these probabilities for ease of comparison. In reality, the pairwise probabilities for this model were trained with an adapted set of training data as explained below, and so these numbers are in actuality a bit different.</Paragraph>
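A sketch of the decision-sequence computation as we read the description above. The A-B, A-D, and C-D probabilities are the values cited earlier, the B-D probability is a placeholder, and configurations that would require merging a priori incompatible templates receive zero mass, with the remainder renormalized as described; the printed numbers therefore only approximate the distribution above.

ORDER = ["A", "B", "C", "D"]                 # order of creation in the text
PAIRWISE = {frozenset("AB"): 0.671, frozenset("AD"): 0.505, frozenset("CD"): 0.504,
            frozenset("BD"): 0.60}           # B-D is a placeholder value
INCOMPATIBLE = {frozenset("AC"), frozenset("BC")}

CONFIGS = [
    (("A", "B", "D"), ("C",)), (("A", "B"), ("C", "D")), (("A", "B"), ("C",), ("D",)),
    (("A", "D"), ("B",), ("C",)), (("B", "D"), ("A",), ("C",)),
    (("C", "D"), ("A",), ("B",)), (("A",), ("B",), ("C",), ("D",)),
]

def config_score(config):
    """Product of merging decisions, scoring each decision with the pairwise probability
    of the new template and the most recently created template in the cell considered."""
    score = 1.0
    for i, x in enumerate(ORDER):
        earlier = ORDER[:i]
        if not earlier:
            continue
        target = next(cell for cell in config if x in cell)
        target_so_far = sorted((t for t in target if t in earlier), key=ORDER.index)
        if any(frozenset((x, t)) in INCOMPATIBLE for t in target_so_far):
            return 0.0                       # impossible merge; its mass goes to normalization
        # Existing cells just before x is processed, most recently extended first.
        cells = [sorted((t for t in cell if t in earlier), key=ORDER.index) for cell in config]
        cells = sorted((c for c in cells if c), key=lambda c: ORDER.index(c[-1]), reverse=True)
        for cell in cells:
            p = PAIRWISE.get(frozenset((x, cell[-1])), 0.0)
            if cell == target_so_far:
                score *= p                   # decision: merge x into this cell
                break
            score *= 1.0 - p                 # decision: do not merge x into this cell
    return score

scores = [config_score(c) for c in CONFIGS]
total = sum(scores)
for config, s in zip(CONFIGS, scores):
    print(config, round(s / total, 3))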
    </Section>
    <Section position="4" start_page="169" end_page="169" type="sub_section">
      <SectionTitle>
4.3 Two Bases of Comparison
</SectionTitle>
      <Paragraph position="0"> We compared the two learned models with two baseline models. First, as an absolute baseline, we compared the model with the uniform distribution, that is, the distribution that assigns equal probability to each alternative. We then sought a more challenging, yet straightforward baseline. We defined a simple, "greedy" approach to merging similar to the one used in FASTUS, in which merging of newly created templates is attempted iteratively through the prior discourse, starting with the most recently produced object. Any unifications that succeed are performed. For instance, in the above example, the greedy method produces the configuration ((A B) (C D)), because A is compatible with B, C is not compatible with either, and D is compatible with C (with which merging would be attempted before the earlier-evoked templates B and A). Alternatively, in cases in which all of the templates in a coreference set are pairwise compatible, the greedy method will produce the configuration in which they are all coreferential.</Paragraph>
      <Paragraph position="1"> We then calculated how often this approach yielded the correct results in each training set. We distinguished between three values: the percentage of correctness for coreference sets of cardinality 2 (call this p2), the percentage for coreference sets of cardinality 3 (call this p3), and the percentage for coreference sets of cardinality 4 or more (call this p&gt;3). The greedy model was defined such that the result of the greedy merging strategy is assigned the appropriate probability pk, with the remainder of the probability mass 1 - pk distributed uniformly among the remaining possible alternatives. (No alternatives were included that were a priori known to be impossible due to incompatibilities.) For instance, in the first training set we describe below, p2 = .571, p3 = .652, and p&gt;3 = .344 (the percentage for the whole training corpus was p = .555).</Paragraph>
    </Section>
  </Section>
  </Section>
</Paper>