<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3213">
  <Title>Unsupervised Semantic Role Labelling</Title>
  <Section position="7" start_page="0" end_page="5" type="evalu">
    <SectionTitle>
6 Experimental Results
</SectionTitle>
    <Paragraph position="0"> Of over 960K slots we extracted from the corpus, 120K occurred with one of 54 target verbs. Of these, our validation data consisted of 278 slots, and our test data of 554 slots. We focus on the analysis of test data; the pattern on the validation data was nearly identical in all respects.</Paragraph>
    <Paragraph position="1"> The target slots fall into several categories, depending on the human judgements: argument slots, adjunct slots, and &amp;quot;bad&amp;quot; slots (chunking errors). We report detailed analysis over the slots identified as arguments. We also report overall accuracy if adjunct and &amp;quot;bad&amp;quot; slots are included in the slots to be labelled. This comparison is similar to that made by Gildea and Jurafsky (2002) and others, either using arguments as delimited in the FrameNet corpus, or having to automatically locate argument boundaries. Furthermore, we report results over individual slot classes.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Evaluation Measures and Comparisons
</SectionTitle>
      <Paragraph position="0"> We report results after the &amp;quot;unambiguous&amp;quot; data is assigned, and at the end of the algorithm, when no more slots can be labelled. At either of these steps it is possible for some slots to have been assigned and some to remain unassigned. Rather than performing a simple precision/recall analysis, we report a finer-grained breakdown that gives a more precise picture of the results. For the assigned slots, we report percent correct (of total, not of assigned) and percent incorrect. For the unassigned slots, we report percent &amp;quot;possible&amp;quot; (i.e., slots whose candidate list contains the correct role) and percent &amp;quot;impossible&amp;quot; (i.e., slots whose candidate list does not contain the correct role--and which may in fact be empty). All these percent figures are out of all argument slots (for the first set of results), and out of all slots (for the second set); see Table 3. Correctness is determined by the human judgements on the chunked slots, as reported above.</Paragraph>
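The four-way breakdown described above (correct/incorrect for assigned slots, possible/impossible for unassigned ones, all out of the total) can be sketched as follows. This is a minimal illustrative tally, not the authors' code; the slot fields are assumed names.

```python
# Minimal sketch of the four-way evaluation breakdown: every slot falls into
# exactly one bucket, and all percentages are out of *all* slots, so the
# four figures sum to 100%. Field names ('assigned', 'gold', 'candidates')
# are assumptions for illustration.

def evaluate(slots):
    counts = {"correct": 0, "incorrect": 0, "possible": 0, "impossible": 0}
    for s in slots:
        if s["assigned"] is not None:
            counts["correct" if s["assigned"] == s["gold"] else "incorrect"] += 1
        else:
            # unassigned: "possible" if the gold role is still a candidate
            counts["possible" if s["gold"] in s["candidates"] else "impossible"] += 1
    n = len(slots)
    return {k: 100.0 * v / n for k, v in counts.items()}
```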
      <Paragraph position="1"> Using our notion of slot class, we compare our results to a baseline that assigns all slots the role with the highest probability for that slot class, P(r|sc). When using general thematic roles, this is a more informed baseline than P(r|v), as used in other work. We are using a very different verb lexicon, corpus, and human standard than in previous research.</Paragraph>
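The slot-class baseline reduces to picking the most frequent role per slot class. A hedged sketch, assuming the maximum-likelihood estimate is taken from (slot class, role) pairs; the data layout is illustrative, not from the paper:

```python
from collections import Counter, defaultdict

# Sketch of the baseline: label every slot with argmax_r P(r | sc), which
# under a maximum-likelihood estimate is simply the most frequent role
# observed for that slot class. Input format is an assumption.

def train_baseline(labelled):            # labelled: [(slot_class, role), ...]
    by_class = defaultdict(Counter)
    for slot_class, role in labelled:
        by_class[slot_class][role] += 1
    return {sc: roles.most_common(1)[0][0] for sc, roles in by_class.items()}

def baseline_label(model, slot_class):
    # returns None for a slot class never seen in training
    return model.get(slot_class)
```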
      <Paragraph position="2"> The closest work is that of Gildea and Jurafsky (2002) which maps FrameNet roles to a set of 18 thematic roles very similar to our roles, and also operates on a subset of the BNC (albeit manually rather than randomly selected). We mention the performance of their method where appropriate below.</Paragraph>
      <Paragraph position="3"> However, our results are compared to human annotation of chunked data, while theirs (and other supervised results) are compared to manually annotated full sentences. Our percentage correct values therefore do not take into account argument constituents that are simply missed by the chunker.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Results on Argument Slots
</SectionTitle>
      <Paragraph position="0"> Table 3 summarizes our results. In this section, we focus on argument slots as identified by our human judges (the first panel of results in the table). There are a number of things to note. First, our performance on these slots is very high: 90.1% correct at the end of the algorithm, with 7.0% incorrect and only 2.9% left unassigned. (The latter have null candidate lists.) This is a 56% reduction in error rate over the baseline. (Some approaches assume delimited arguments; others train, as well as test, only on such arguments. In our approach, all previously annotated slots are used in the iterative training of the probability model. Thus, even when we report results on argument slots only, adjunct and &amp;quot;bad&amp;quot; slots may have induced errors in their labelling.) Second, we see that even after the initial unambiguous role assignment step, the algorithm achieves close to the baseline percent correct. Furthermore, over 96% of the initially assigned roles are correct. This means that much of the work in narrowing down the candidate lists is actually being performed during frame matching. It is noteworthy that such a simple method of choosing the initial candidates can be so useful, and it would seem that even supervised methods might benefit from employing such an explicit use of the lexicon to narrow down role candidates for a slot.</Paragraph>
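The unambiguous assignment step described above can be sketched as: after frame matching prunes each slot's candidate roles against the verb lexicon, any slot left with exactly one candidate is labelled immediately, and the rest are deferred to the probability model. This is an illustrative sketch only; the lexicon lookup itself is assumed to have already populated the candidate sets.

```python
# Sketch of the "unambiguous" assignment step. A slot whose pruned candidate
# set contains exactly one role is labelled right away; slots with zero or
# several candidates are left for the iterative probability model.
# Slot representation (dicts with 'candidates'/'assigned') is an assumption.

def assign_unambiguous(slots):
    assigned, remaining = [], []
    for slot in slots:
        cands = slot["candidates"]          # roles licensed by the lexicon
        if len(cands) == 1:
            slot["assigned"] = next(iter(cands))
            assigned.append(slot)
        else:                               # empty or still ambiguous
            remaining.append(slot)
    return assigned, remaining
```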
      <Paragraph position="1"> After unambiguous role assignment, about 21% of the test data remains unassigned (116 slots). Of these 116 slots, 100 have a non-null candidate list.</Paragraph>
      <Paragraph position="2"> These 100 are assigned by our iterative probability model, so we are especially interested in the results on them. We find that 76 of these 100 are assigned correctly (accounting for the 13.7% increase to 90.1%), and 24 are assigned incorrectly, yielding a 76% accuracy for the probability model portion of our algorithm on identified argument slots.</Paragraph>
      <Paragraph position="3"> Moreover, we also find that all specificity levels of the probability model (see Figure 1) are employed in making these decisions--about a third of the decisions are made by each level. This indicates that while there is sufficient data in many cases to warrant using the exact probability formula P(r|v,sc,n), the class-based generalizations we propose prove to be very useful to the algorithm.</Paragraph>
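The backed-off model can be sketched as trying the most specific estimate first and falling through to more general, class-based levels when the data is too sparse. The count threshold and table layout below are assumptions for illustration, not taken from the paper:

```python
from collections import Counter

# Hedged sketch of backed-off role selection: try the most specific
# distribution (e.g. conditioned on verb, slot class, and noun) first, then
# progressively more general levels, using the first level with enough data.
# The min_count threshold and table structure are assumptions.

def pick_role(candidates, keys, tables, min_count=10):
    """keys:   conditioning keys per level, most specific first
       tables: one dict per level, mapping key -> Counter over roles"""
    for key, table in zip(keys, tables):
        dist = table.get(key)
        if dist and sum(dist.values()) >= min_count:
            scored = {r: dist[r] for r in candidates if dist[r] > 0}
            if scored:
                return max(scored, key=scored.get)  # argmax over candidates
    return None  # no level had enough evidence; leave the slot unassigned
```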
      <Paragraph position="4"> As a point of comparison, the supervised method of Gildea and Jurafsky (2002) achieved 82.1% accuracy on identified arguments using general thematic roles. However, they had a larger and more varied target set, consisting of 1462 predicates from 67 FrameNet frames (classes), which makes their task harder than ours. We are aware that our test set is small compared to supervised approaches, which have a large amount of labelled data available. However, our almost identical results across the validation and test sets indicate consistent behaviour that may generalize to a larger test set, at least on similar classes of verbs.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="5" type="sub_section">
      <SectionTitle>
6.3 Differences Among Slot Classes
</SectionTitle>
      <Paragraph position="0"> When using general thematic roles with a small set of verb classes, the probability used for the baseline, P(r|sc), works very well for subjects and objects (which are primarily Agents and Themes, respectively, for our verbs). Indeed, when we examine each of the slot classes individually, we find that, for subjects and objects, the percent correct achieved by the algorithm is indistinguishable from the baseline (both are around 93%, for both subjects and objects).</Paragraph>
      <Paragraph position="1"> For PP objects, on the other hand, the baseline is only around 11% correct, while we achieve 78.5% correct, a 76% reduction in error rate. Clearly, when more roles are available, even P(r|sc) becomes a weak predictor. We could simply assign the default role for subjects and objects when using general thematic roles, but we think this is too simplistic. First, when we broaden our range of verb classes, subjects and objects will have more possible roles; as we have seen with PPs, when more roles are available, the performance of a default role degrades. Second, although we achieve the same correctness as the baseline, our algorithm does not simply assign the dominant role in these cases: some subjects are assigned Theme, while some objects are assigned Recipient or Source. These roles would never be possible in these slots if a default assignment were followed.</Paragraph>
    </Section>
    <Section position="4" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
6.4 Results Including All Target Slots
</SectionTitle>
      <Paragraph position="0"> We also consider our performance given frame matching and chunking errors, which can lead to adjuncts or even &amp;quot;bad&amp;quot; constituents being labelled. Only arguments should be labelled, while non-arguments should remain unlabelled. Of 98 slots judged to be adjuncts, 19 are erroneously given labels. Including the adjunct slots, our percent correct goes from 90.1% to 88.7%. Of the 20 &amp;quot;bad&amp;quot; slots, 12 were labelled; including these, correctness is reduced slightly further, to 87.2%, as shown in the second panel of results in Table 3. The error rate reduction here of 65% is higher than on arguments only, because the baseline always labels (in error) adjuncts and &amp;quot;bad&amp;quot; slots. (Gildea and Jurafsky (2002) achieved 63.6% accuracy when having to identify arguments for thematic roles, though note again that this is on a much larger and more general test set. Also, although we take into account errors on identified chunks that are not arguments, we are not counting chunker errors of missing arguments.) Due to the rarity of indirect object slots in the chunker output, the test data included no such slots; the validation set included one, which the algorithm correctly labelled.</Paragraph>
      <Paragraph position="1"> As others have shown (Gildea and Palmer, 2002), semantic role labelling is more accurate with better preprocessing of the data. However, we also think our algorithm may be extendable to deal with many of the adjunct cases we observed. Often, adjuncts express time or location; while not argument roles, these do express generalizable semantic relations.</Paragraph>
      <Paragraph position="2"> In future work, we plan to explore the notion of expanding our frame matching step to go beyond VerbNet by initializing potential adjuncts with appropriate roles.</Paragraph>
    </Section>
  </Section>
</Paper>