<?xml version="1.0" standalone="yes"?>
<Paper uid="M95-1005">
  <Title>A Model-Theoretic Coreference Scoring Scheme</Title>
  <Section position="2" start_page="0" end_page="45" type="metho">
    <SectionTitle>
* Key Links: &lt;A-B B-C B-D&gt;
* Response Links: &lt;A-B C-D&gt;
</SectionTitle>
    <Paragraph position="0"> Note that the key links generate an equivalence class, the set {A B C D}. Technically, the links are a spanning tree of the set's implicit equivalence graph, i.e., the fully connected graph whose nodes are the entities A, B, C, and D. The following figure shows the spanning tree in dark lines, and the rest of the graph in gray lines.</Paragraph>
    <Paragraph position="1"> This is just one such spanning tree for the overall equivalence class; there are of course others, including the &quot;non-problematic&quot; case of &lt;A-B B-C C-D&gt;. (Footnote 1: Coreference task definition, version 2.0 and earlier.)</Paragraph>
    <Paragraph position="2"> Either way, a minimal spanning tree of the equivalence relation will always be of size 3, which aligns with the intuitive notion that three links will always be necessary to make four entities coreferential under the criterion of strict identity.</Paragraph>
    <Paragraph position="3"> Returning to the task of scoring coreference for this problematic case, we note that a response of &lt;A-B C-D&gt; induces two equivalence classes, thus partitioning the set of key entities into subsets {A B} and {C D}. It is intuitive that the precision score for this response should be 2/2 = 1, since 2 out of 2 of the response links are &quot;correct&quot;. That is, both response links are arcs in the equivalence graph generated by the key. For recall, Sundheim et al. advance the desirable score of 2/3, which is not obtained by the syntactic scoring measure. This score appeals to the intuitive notion that of the three links necessary to make the key entities fully coreferential, the response only provides two.</Paragraph>
    <Paragraph position="4"> Thinking model-theoretically, we note that the response corresponds to a subgraph of the fully-connected equivalence graph.</Paragraph>
    <Paragraph position="5"> The recall score of 2/3 aligns with the fact that one equivalence arc is required to &quot;complete&quot; the response graph, yielding one of the following four spanning trees.</Paragraph>
    <Paragraph position="7"> Note that the problem with the syntactic (link-wise) scorer is that there are combinatorially many such spanning trees for a given equivalence class, while keys only list one.</Paragraph>
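    <Paragraph> The spanning-tree reading of the first example can be checked mechanically. The following is a minimal sketch, not part of the original scoring software; the helper name classes_from_links is our own, and it merges links naively rather than efficiently. It builds the equivalence classes for the key &lt;A-B B-C B-D&gt; and the response &lt;A-B C-D&gt;, counts the links in any spanning tree of the key class, and lists the arcs that would complete the response.

```python
def classes_from_links(links):
    """Merge undirected links into disjoint equivalence classes."""
    classes = []
    for a, b in links:
        # Collect every existing class that this link touches.
        touching = [c for c in classes if a in c or b in c]
        merged = {a, b}.union(*touching)
        classes = [c for c in classes if c not in touching] + [merged]
    return classes


key_links = [("A", "B"), ("B", "C"), ("B", "D")]
response_links = [("A", "B"), ("C", "D")]

key_classes = classes_from_links(key_links)
response_classes = classes_from_links(response_links)

# Any spanning tree over n entities has exactly n - 1 arcs.
min_links = sum(len(c) - 1 for c in key_classes)

# Arcs that would join the two response classes into one tree.
left, right = response_classes
completing_arcs = sorted((x, y) for x in left for y in right)

print(key_classes, min_links, completing_arcs)
```

    Under these assumptions the key yields one class of four entities, three links are needed, and exactly four arcs (A-C, A-D, B-C, B-D) can complete the response.</Paragraph>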
  </Section>
  <Section position="3" start_page="45" end_page="46" type="metho">
    <SectionTitle>
COMPUTING MODEL-THEORETIC RECALL
</SectionTitle>
    <Paragraph position="0"> How then can we turn this notion of minimal missing links into a computationally effective scoring procedure that works in the general case? Roughly stated, the scoring mechanism for recall must form the equivalence sets generated by the key, and then determine for each such key set how many subsets the response partitions the key set into. The score then follows by simple arithmetic.</Paragraph>
    <Paragraph position="1"> Getting a bit more formal (but not much), let us define recall using these notions. First, let S be an equivalence set generated by the key, and let R1 ... Rm be the equivalence classes generated by the response. Then we define the following functions over S:
* p(S) is a partition of S relative to the response. Each subset of S in the partition is formed by intersecting S and those response sets Ri that overlap S. Note that the equivalence classes defined by the response may include implicit singleton sets; these correspond to elements that are mentioned in the key but not in the response.</Paragraph>
    <Paragraph position="2"> For example, say the key generates the equivalence class S = {A B C D} and the response is simply &lt;A-B&gt;. The relative partition p(S) is then {A B}, {C}, and {D}.</Paragraph>
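    <Paragraph> The relative partition can be sketched in a few lines. This is an illustration under our own assumptions: the function name relative_partition is hypothetical, and the response classes are assumed to be pairwise disjoint, as equivalence classes must be.

```python
def relative_partition(base_set, other_classes):
    """Partition base_set relative to other_classes.

    Each overlapping class contributes its intersection with base_set;
    elements covered by no class become implicit singleton subsets.
    """
    parts = []
    covered = set()
    for cls in other_classes:
        overlap = base_set.intersection(cls)
        if overlap:
            parts.append(overlap)
            covered.update(overlap)
    parts.extend({x} for x in sorted(base_set - covered))
    return parts


S = {"A", "B", "C", "D"}
response_classes = [{"A", "B"}]  # the single response link A-B
p_S = relative_partition(S, response_classes)
print(p_S)
```

    For the example just given, the sketch returns the three subsets {A B}, {C}, and {D}, the last two being implicit singletons.</Paragraph>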
    <Paragraph position="4"> As noted above, the number of links missing from the response is the number of links necessary to fully reunite the components of the p(S) partition. We note that this is simply one fewer than the number of elements in the partition, that is, $|p(S)| - 1$.</Paragraph>
    <Paragraph position="6"> Looking in isolation at a single equivalence class in the key, the recall error for that class is just the number of missing links divided by the minimal number of correct links, which is one fewer than the size of the class, i.e., $\frac{|p(S)| - 1}{|S| - 1}$. Recall is one minus this error, which simplifies to $\frac{|S| - |p(S)|}{|S| - 1}$.</Paragraph>
    <Paragraph position="8"> To see how this works in practice, consider the second problematic example noted by Sundheim et al.</Paragraph>
  </Section>
  <Section position="4" start_page="46" end_page="47" type="metho">
    <SectionTitle>
* Key Links: &lt;A-B B-C&gt;
* Response Links: &lt;A-C&gt;
</SectionTitle>
    <Paragraph position="0"> The key generates a single equivalence class S = {A B C}. The size of the class is |S| = 3, and the minimum number of links necessary to establish the class is $|S| - 1 = 2$.</Paragraph>
    <Paragraph position="2"> The response partitions this class into a partition p(S) of size 2, containing {A C} and {B}, where the latter element is implicitly defined. Working through the arithmetic, we have: $\frac{|S| - |p(S)|}{|S| - 1} = \frac{3 - 2}{3 - 1} = \frac{1}{2}$.</Paragraph>
    <Paragraph position="4"> This score of 1/2 is the intuitively &quot;correct&quot; one that the syntactic measure fails to calculate.</Paragraph>
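    <Paragraph> The single-class arithmetic is small enough to state as code. The sketch below assumes the closed form (|S| - |p(S)|) / (|S| - 1) for recall on one key class, which is our reading of the surrounding prose; the function name class_recall is hypothetical. A singleton key class has no links to recover, so the sketch rejects it rather than divide by zero.

```python
from fractions import Fraction


def class_recall(key_size, partition_size):
    """Recall for one key class: (|S| - |p(S)|) / (|S| - 1)."""
    if key_size == 1:
        raise ValueError("a singleton key class has no links to recall")
    return Fraction(key_size - partition_size, key_size - 1)


# Key A-B B-C gives S = {A B C}; response A-C splits it into {A C} and {B}.
second_example = class_recall(key_size=3, partition_size=2)

# Key A-B B-C B-D gives S = {A B C D}; response A-B C-D splits it in two.
first_example = class_recall(key_size=4, partition_size=2)

print(second_example, first_example)
```

    The two calls reproduce the scores of 1/2 and 2/3 discussed for the two problematic cases.</Paragraph>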
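    <Paragraph position="6"> Finally, we note that extending this measure from a single key equivalence class to an entire test set T simply requires summing over the key equivalence classes. That is, $R_T = \frac{\sum_i (|S_i| - |p(S_i)|)}{\sum_i (|S_i| - 1)}$, where the $S_i$ are the key equivalence classes in T.</Paragraph>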
    <Paragraph position="8"> The recall scoring procedure operates by merging the subsets of a key equivalence class that are defined by equivalence classes in the response . It is of course the case that the response classes may not be proper subsets of the key . When the response overlaps the key in such a way as to produce a non-trivial set difference, as in the following figure, the response contains precision errors.</Paragraph>
    <Paragraph position="9"> How may we use our model-theoretic notions to provide a scoring mechanism for precision? In the case of recall, we conceptually needed to add links to the response, building up the response's equivalence classes so as to end up with the key. In the case of precision, we need to do the converse: add links to equivalence classes in the key so as to yield equivalence classes in the response. We are switching the &quot;figure&quot; and the &quot;ground&quot;; that is, we are switching our notion of where the base sets come from (the response rather than the key), and of what defines the partitions on those base sets (the key rather than the response).</Paragraph>
    <Paragraph position="10"> More precisely, given an equivalence class S' defined by the response, we must determine the minimal number of links to be added to the key, so as to ensure that each of the members of the response set is in the same key set. Once again, we proceed by generating a relative partition, in this case the partition of the response equivalence class S' relative to key equivalence classes K1 ... Kn. Elements of the response that are not found in the key once again generate implicit subsets, one per element. The number of missing links is once again 1 less than the size of the partition.</Paragraph>
    <Paragraph position="11"> For the example above, we see that the response generates an equivalence class of size 3, namely the set S' = {A B C}. The key partitions this class into subsets {B C} and {A}, where the latter is implicit. The partition is of size 2, and so the minimal number of links that need to be added to reunite the partition is just 1.</Paragraph>
    <Paragraph position="12"> Switching the figure and ground in the recall formula, the scoring arithmetic for precision works itself out as follows, where S' is now an equivalence class from the response, and p'(S') is the partition of S' vis-a-vis the key(s). We then have: $P = \frac{|S'| - |p'(S')|}{|S'| - 1}$.</Paragraph>
    <Paragraph position="14"> Thus, like the scheme proposed in Sundheim et al., we have an aesthetically pleasing inverse relationship between precision and recall. For the example above, these formulae yield a precision of 1/2, which is intuitively appropriate, since of the two minimum links needed to generate the response class {A B C}, the key only provides one, B-C. To extend from a single response to a complete test set T, we once again sum over the test set, this time iterating over response equivalence classes.</Paragraph>
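    <Paragraph> The figure-and-ground switch means that one routine can serve for both measures. The sketch below is illustrative only; the names partition_size and link_score are our own, the classes are given directly as sets rather than derived from links, and a test set whose base classes are all singletons would raise ZeroDivisionError.

```python
from fractions import Fraction


def partition_size(base_set, other_classes):
    """Size of the partition of base_set relative to other_classes."""
    overlapping = [c for c in other_classes if base_set.intersection(c)]
    covered = set().union(*overlapping)
    # Elements found in no other class count as implicit singletons.
    return len(overlapping) + len(base_set - covered)


def link_score(base_classes, other_classes):
    """Sum (|S| - |p(S)|) over base classes, divided by the sum of (|S| - 1)."""
    found = sum(len(s) - partition_size(s, other_classes) for s in base_classes)
    needed = sum(len(s) - 1 for s in base_classes)
    return Fraction(found, needed)


def recall(key_classes, response_classes):
    return link_score(key_classes, response_classes)


def precision(key_classes, response_classes):
    # Same arithmetic with figure and ground switched.
    return link_score(response_classes, key_classes)


key_classes = [{"B", "C"}]
response_classes = [{"A", "B", "C"}]
print(precision(key_classes, response_classes), recall(key_classes, response_classes))
```

    With the key class {B C} and the response class {A B C}, precision comes out at 1/2 and recall at 1, since the response recovers the one key link.</Paragraph>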
    <Paragraph position="15"> Table 1 shows the precision and recall scores, using the model-theoretic measures, for all of the examples given in Sundheim et al. Note that these results agree with the original scoring proposal for the first three cases, but agree with intuition for the last two.</Paragraph>
  </Section>
  <Section position="5" start_page="47" end_page="47" type="metho">
    <SectionTitle>
EXAMPLES WITH MORE COMPLEXITY
</SectionTitle>
    <Paragraph position="0"> The examples so far have been purposefully simplified in that we have only considered cases that defined one key class and one response class. Let us now consider some more complex examples where keys and responses don't so neatly overlap.</Paragraph>
    <Paragraph position="1"> To begin with, imagine that the key and response are as follows.</Paragraph>
  </Section>
  <Section position="6" start_page="47" end_page="50" type="metho">
    <SectionTitle>
* Key: &lt;B-C C-D D-E E-G G-H H-J&gt;
* Response: &lt;A-B B-C D-E E-F G-H H-I&gt;
</SectionTitle>
    <Paragraph position="0"> The key establishes a single equivalence class, while the response defines three: {A B C}, {D E F}, {G H I}. In addition, the key contains the element J, which is missing entirely from the response. These are shown in the following figure, where the thick lines denote the key class, and thin ones denote response classes.</Paragraph>
    <Paragraph position="1"> Eyeballing the problem, we note that as there are seven elements in the key, six links must minimally be provided to achieve 100% recall. The response only provides three correct ones, so we would expect recall to come out at 50%. Precision should be 50% as well, as half of the links indicated in the response are not in the key. Working through the math, we note that p(S), the partition of the key with respect to the response, yields four subsets, shown in the following figure with thin lines.</Paragraph>
    <Paragraph position="2"> Evaluating for recall: $R = \frac{|S| - |p(S)|}{|S| - 1} = \frac{7 - 4}{7 - 1} = \frac{3}{6}$, or 50%.</Paragraph>
    <Paragraph position="4"> For precision, we must consider the three equivalence classes defined in the response.</Paragraph>
    <Paragraph position="5"> Partitioning these three classes with respect to the key yields two subsets in each of the classes, for a total of six subsets. These are shown in thick lines in the following figure.</Paragraph>
    <Paragraph position="6"> To evaluate for precision, we must use the corpus-wide formula: $P = \frac{(3 - 2) + (3 - 2) + (3 - 2)}{(3 - 1) + (3 - 1) + (3 - 1)} = \frac{3}{6}$, or 50%.</Paragraph>
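    <Paragraph> The seven-element example can be replayed end to end from the link lists. This is a sketch under stated assumptions: all function names are our own, the response's last two links are read as G-H and H-I, and classes are merged naively rather than efficiently.

```python
from fractions import Fraction


def classes_from_links(links):
    """Merge undirected links into disjoint equivalence classes."""
    classes = []
    for a, b in links:
        touching = [c for c in classes if a in c or b in c]
        classes = [c for c in classes if c not in touching]
        classes.append({a, b}.union(*touching))
    return classes


def partition_size(base_set, other_classes):
    """Overlapping classes plus one implicit singleton per uncovered element."""
    overlapping = [c for c in other_classes if base_set.intersection(c)]
    return len(overlapping) + len(base_set - set().union(*overlapping))


def link_score(base_classes, other_classes):
    found = sum(len(s) - partition_size(s, other_classes) for s in base_classes)
    needed = sum(len(s) - 1 for s in base_classes)
    return Fraction(found, needed)


def parse(text):
    """Turn 'A-B B-C' into [('A', 'B'), ('B', 'C')]."""
    return [tuple(link.split("-")) for link in text.split()]


key = classes_from_links(parse("B-C C-D D-E E-G G-H H-J"))
response = classes_from_links(parse("A-B B-C D-E E-F G-H H-I"))

recall = link_score(key, response)
precision = link_score(response, key)
print(recall, precision)
```

    Both scores reduce to 3/6, matching the 50% expected from eyeballing the problem.</Paragraph>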
    <Paragraph position="8"> Before closing, let us consider an even more complex example with multiple sets in both the key and the response.</Paragraph>
  </Section>
  <Section position="7" start_page="50" end_page="52" type="metho">
    <SectionTitle>
* Key: &lt;A-B B-C D-E E-F F-G&gt;
* Response: &lt;A-B C-D F-G G-H&gt;
</SectionTitle>
    <Paragraph position="0"> Figure 1 shows the equivalence classes corresponding to the key and response. Figure 2 shows how the response partitions the two key sets. Based on this, we can compute recall using the corpus-wide formula for recall. Since there are two key sets, there will be two terms each in the numerator and the denominator: $R = \frac{(3 - 2) + (4 - 3)}{(3 - 1) + (4 - 1)} = \frac{2}{5}$, or 40%.</Paragraph>
    <Paragraph position="2"> This is consistent with the observation that the response only provides two out of the five links that are minimally required to fully designate all the coreference relations. Turning to precision, Figure 3 shows the partitions induced on the three response sets by the key.</Paragraph>
    <Paragraph position="4"> $P = \frac{(2 - 1) + (2 - 2) + (3 - 2)}{(2 - 1) + (2 - 1) + (3 - 1)} = \frac{2}{4}$, or 50%. It is delightful that the formula yields exactly what intuition would dictate in this case.</Paragraph>
    <Paragraph position="5"> COMPUTATIONAL CONSIDERATIONS, BRIEFLY EXPLORED. Our scoring expressions are easy to compute. The key step is the formation of equivalence classes, which can be accomplished by many algorithms. Tarjan's classic UNION-FIND, to cite the obvious example, operates in time effectively linear in the number of entities under consideration. This is substantially less expensive than enumerating the transitive closure of keys and answers, as is required by the original syntactic scoring procedure. Not often does attention to model theory yield efficient algorithms, but in this particular case the effort was well worth the while.</Paragraph>
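    <Paragraph> To make the computational point concrete, the following sketch scores the last example with a disjoint-set forest. It is one possible realization, not the scorer actually used: the class UnionFind and the function score are our own names, and path compression with union by size is simply the textbook variant of UNION-FIND.

```python
from fractions import Fraction


class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # compress the path just walked
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[rb] > self.size[ra]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def score(base_links, other_links):
    """Sum of (|S| - |p(S)|) over base classes, over the sum of (|S| - 1)."""
    base, other = UnionFind(), UnionFind()
    for a, b in base_links:
        base.union(a, b)
    for a, b in other_links:
        other.union(a, b)
    members = {}
    for x in list(base.parent):
        members.setdefault(base.find(x), []).append(x)
    found = needed = 0
    for cls in members.values():
        # Distinct roots in the other forest give the relative partition size;
        # an unseen entity gets a fresh root, i.e. an implicit singleton.
        parts = {other.find(x) for x in cls}
        found += len(cls) - len(parts)
        needed += len(cls) - 1
    return Fraction(found, needed)


key = [("A", "B"), ("B", "C"), ("D", "E"), ("E", "F"), ("F", "G")]
response = [("A", "B"), ("C", "D"), ("F", "G"), ("G", "H")]

recall = score(key, response)
precision = score(response, key)
print(recall, precision)
```

    Recall comes out at 2/5 and precision at 2/4, and no transitive closure is ever enumerated: each entity is looked up once in each forest.</Paragraph>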
  </Section>
</Paper>