<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3214">
  <Title>The Influence of Argument Structure on Semantic Role Assignment</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Experiment 2: Explaining the Variance With Argument Structure
</SectionTitle>
    <Paragraph position="0"> With Argument Structure With two measures for the uniformity of argument structure at hand, we now proceed to test our main hypothesis.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Data and Experimental Setup
</SectionTitle>
      <Paragraph position="0"> As argued in Section 3.4, our aim in this experiment is to control for the most plausible sources of performance variance and isolate the influence of argument structure.</Paragraph>
      <Paragraph position="1"> To meet this condition, we perform both the experiments and the uniformity measure calculation on a controlled subset of the data, with the condition that both the number of verbs and the number of sentences are the same for each frame.</Paragraph>
      <Paragraph position="2"> Following the methodology in Keller and Lapata (2003), we divide the verbs into four frequency bands, frequency being absolute number of annotated sentences: low (5), medium-low (12), medium-high (22), and high (38). We set the boundaries between the bands as the quartiles of all the verbs containing at least 5 annotated examples7. For each frame, 2 verbs in each frequency band are randomly chosen. This reduces our frame sample from 196 to 40. We furthermore randomly select a number of sentences for each verb which matches the boundaries between frequency bands, that is, all verbs in each frequency bands are artificially set to have the same number of annotated sentences. This method assures that all frames in the experiment have 8 verbs and 154 sentences, so that both the performance figures and the uniformity measures were acquired under equal conditions.</Paragraph>
      <Paragraph position="3"> The models for semantic role assignment were trained in the same way as for Experiment 1 (see Section 3.1), using the same features. We also performed 10-fold cross validation as before. The uniformity measures a34a19a35 and a36a37a34a19a35 were computed according to the definitions in Section 4.2.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Results and Discussion
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the overall results and variance across frames for the new dataset. Table 4 contains detailed performance results (Columns 1 and 2) and uniformity figures (Columns 3 and 4) for five randomly drawn frames.</Paragraph>
      <Paragraph position="1">  across frames for Experiment 2.</Paragraph>
      <Paragraph position="2"> The overall results for the new, controlled dataset are 3 to 5 points F-score worse than in Experiment 1, which is a result of the artificial limitation of larger frames to fewer training examples. Otherwise, the same tendencies hold: The memory-based learner again performs better than the maximum entropy learner, and overlap evaluation returns higher scores than exact match. More relevantly, the data show the same amount of variance across frames as before (between 10 and 11%), even though the most plausible sources of variance are controlled for. The variation over cross validation runs is somewhat larger, but still small (2.0%/1.9% for Maxent and 0.9%/0.8% for MBL, respectively).</Paragraph>
      <Paragraph position="3"> We can now test our main hypothesis through an analysis of the correlation between performance and 7We consider 5 to be the (very) minimum number of instances necessary to construct a representative argument structure for a predicate.</Paragraph>
      <Paragraph position="4">  from Exp. 2. a34a39a35 = normalised uniformity, a36a37a34a19a35 = weighted uniformity (in percentages).</Paragraph>
      <Paragraph position="5"> uniformity figures. We log-transformed both variables to guarantee normal distribution and used the standard Pearson product-moment correlation coefficient, testing for positive correlation (higher uniformity - higher performance). The results in Table 5 show that all correlation tests are significant, and most are highly significant. This constitutes very good empirical support for our hypothesis.</Paragraph>
      <Paragraph position="6">  levels for correlating frame performance and frame uniformity for the dataset from Experiment 2.</Paragraph>
      <Paragraph position="7"> We find that a36a38a34a19a35 yields consistently higher correlation measures (and therefore more significant correlations) than a34a19a35 , which supports our hypothesis from Section 4 that a36a37a34a19a35 is a better measure for argument structure uniformity. Recall that the intuition behind the weighting is to let well-attested predicates (those with higher frequency) have a larger influence upon the measure. However, an independent experiment for the adequacy of the measures should be devised to verify this hypothesis. A comparison of the evaluation modes shows that frame uniformity correlates more strongly with the overlap evaluation measures than with exact match.</Paragraph>
      <Paragraph position="8"> We presume that this is due to the evaluation figures in exact match mode being somewhat noisier. All other things being equal, random errors introduced during the different processing stages (e.g. parsing errors) are more likely to influence the exact match outcome: A processing error which leads to a partially right argument assignment will influence the outcome of the exact match evaluation, but not of the overlap evaluation.</Paragraph>
      <Paragraph position="9"> As for the two statistical frameworks, uniformity is better correlated with the Maxent model than with the MBL model, even though MBL performs better on the evaluation. However, this does not mean that the correlation will become weaker for semantic role labelling systems performing at higher levels of accuracy. We compared our current models with an earlier version, which had an overall lower performance of about 5 points F-score. Using the same data, the correlation coefficients a1 a11 were on average 0.09 points lower, and the p-values were not significant for the Maxent model in exact match mode. This indicates that correlations tend to increase for better models.</Paragraph>
      <Paragraph position="10"> Therefore, we attribute the difference between the Maxent and the MBL model to their individual properties, or more specifically to differences in the distribution of the performance figures for the individual frames around the mean. While they are more evenly distributed in the MBL model, they present a higher peak with more outliers in the Max-ent model, which is also reflected in the slightly higher standard deviation of the Maxent model (cf.</Paragraph>
      <Paragraph position="11"> Tables 1 and 3). In short, the Maxent model appears to be more sensitive to differences in the data.</Paragraph>
      <Paragraph position="12"> Nevertheless, both models correlate strongly with each other in both evaluation modes (a1 a11 a1 a0 a2 a6a1a0 ,</Paragraph>
      <Paragraph position="14"> overlap). Thus, they agree to a large extent on which frames are easy or difficult to label.</Paragraph>
      <Paragraph position="15"> Our present results, thus, seem to indicate that the influence of argument structure cannot be solved by simply improving existing systems or choosing other statistical frameworks. Instead, there is a systematic relationship between the uniformity of the argument structures of the predicates in the frames and the performance of automatic role assignment.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>