<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1020">
  <Title>Inferring Sentence-internal Temporal Relations</Title>
  <Section position="8" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5.2 Results
</SectionTitle>
    <Paragraph position="0"> Our results are summarised in Table 6. We measured how well subjects agree with the gold-standard (i.e., the corpus from which the experimental items were selected) and how well they agree with each other. We also show how well the ensembles from Section 4 agree with the humans and the gold-standard (inter-subject agreement is shown in boldface). We measured agreement using the Kappa coefficient (Siegel and Castellan, 1988) but also report percentage agreement to facilitate comparison with our model. In all cases we compute pairwise agreements and report the mean. In Table 6, H refers to the subjects, G to the gold-standard, and E to the ensemble.</Paragraph>
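The pairwise evaluation protocol described above can be sketched as follows. This is a minimal illustration (not the authors' code, and the helper names are ours): Cohen's kappa is computed for each pair of annotations, alongside raw percentage agreement, and both are averaged over all pairs.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's kappa between two annotations of the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # observed agreement: fraction of items labelled identically
    po = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # chance agreement from each annotator's marginal label distribution
    pe = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (po - pe) / (1 - pe)

def mean_pairwise_agreement(annotations):
    """Mean kappa and mean percentage agreement over all annotator pairs."""
    pairs = list(combinations(annotations, 2))
    kappas = [cohen_kappa(a, b) for a, b in pairs]
    pct = [sum(x == y for x, y in zip(a, b)) / len(a) for a, b in pairs]
    return sum(kappas) / len(kappas), sum(pct) / len(pct)
```

Reporting percentage agreement next to kappa, as the paper does, is useful because kappa discounts chance agreement and the two can diverge when the label distribution is skewed.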
    <Paragraph position="1"> As shown in Table 6, there is less agreement among humans for the interpretation task than for the sentence fusion task. This is expected given that some of the markers are semantically similar and in some cases more than one marker is compatible with the meaning of the two clauses. Also note that neither the model nor the subjects have access to the context surrounding the sentence whose marker must be inferred (we discuss this further in Section 6). Additional analysis of the interpretation data revealed that the majority of disagreements arose for as and once clauses. Once was also problematic for our model (see the Recall in Table 5). Only 33% of the subjects agreed with the gold-standard for as clauses; 35% of the subjects agreed with the gold-standard for once clauses. For the other markers, subject agreement with the gold-standard was around 55%. The highest agreement was observed for since and until (63% and 65%, respectively).</Paragraph>
    <Paragraph position="2"> The ensemble's agreement with the gold-standard approximates human performance on the interpretation task (.413 for E-G vs. .421 for H-G). The agreement of the ensemble with the subjects is also close to the upper bound, i.e., inter-subject agreement (see E-H and H-H in Table 6). A similar pattern emerges for the fusion task: comparison between the ensemble and the gold-standard yields an agreement of .489 (see E-G), whereas subject and gold-standard agreement is .522 (see H-G); agreement of the ensemble with the subjects is .468, where the upper bound is .490 (see E-H and H-H, respectively).</Paragraph>
  </Section>
</Paper>