<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3506">
  <Title>Catching Metaphors</Title>
  <Section position="7" start_page="44" end_page="45" type="evalu">
    <SectionTitle>
6 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="44" end_page="44" type="sub_section">
      <SectionTitle>
6.1 Classifier Choice
</SectionTitle>
      <Paragraph position="0"> Because of its ease of use and Java compatibility, we used an updated version of the Stanford conditional log linear (aka maxent) classifier written by Dan Klein (Stanford Classifier, 2003). Maxent classifiers are designed to maximize the conditional log likelihood of the training data where the conditional likelihood of a particular class c on training example</Paragraph>
      <Paragraph position="2"> Here Z is a normalizing factor, fi is the vector of features associated with example i and oc is the vector of weights associated with class c. Additionally, the Stanford classifier uses by default a Gaussian prior of 1 on the features, thus smoothing the feature weights and helping prevent overfitting.</Paragraph>
    </Section>
    <Section position="2" start_page="44" end_page="44" type="sub_section">
      <SectionTitle>
6.2 Baselines
</SectionTitle>
      <Paragraph position="0"> We use two different baselines to assess performance. They correspond to selecting the majority class of the training set overall or the majority class of verb specifically. The strong bias toward metaphor is reflected in the overall baseline of 93.80% for the validation set. The verb baseline is  higher,95.50%forthevalidationset,duetothepresence of words such as treat which are predominantly literal.</Paragraph>
    </Section>
    <Section position="3" start_page="44" end_page="45" type="sub_section">
      <SectionTitle>
6.3 Validation Set Results
</SectionTitle>
      <Paragraph position="0"> the feature sets described in the previous section.</Paragraph>
      <Paragraph position="1"> The overall and verb baselines are 605 and 616 out of 645 total examples in the validation set.</Paragraph>
      <Paragraph position="2"> The first feature set we experimented with was just the verb. We then added each argument in turn; trying ARG0 (Feature Set 2), ARG1 (Feature Set 3), ARG2 (Feature Set 4) and ARG3 (Feature Set 5).</Paragraph>
      <Paragraph position="3"> Adding ARG1 gave the best performance gain.</Paragraph>
      <Paragraph position="4"> ARG1 corresponds to the semantic role of mover in most of PropBank annotations for motion-related verbs. For example, stocks is labeled as ARG1 in both Stocks fell 10 points and Stocks were being thrown out of windows3. Intuitively, the mover role is highly informative in determining whether a motion verb is being used metaphorically, thus it makes sense that adding ARG1 added the single biggest  examples on the validation set for metaphor (M), literal (L) and total (Total) is shown. jump in performance compared to the other arguments. null Once we determined that ARG1 was the best argument to add, we also experimented with combining ARG1 with the other arguments. Validation results are shown for these other feature combinations (Feature Sets 6,7, 8 and 9) Using the best feature sets (Feature Sets 3,6), 621 targets are correctly labeled by the classifier. The accuracy is 96.98%, reducing error on the validation set by 40% and 17% over the baselines.</Paragraph>
    </Section>
    <Section position="4" start_page="45" end_page="45" type="sub_section">
      <SectionTitle>
6.4 Test Set Results
</SectionTitle>
      <Paragraph position="0"> We retrained the classifier using Feature Set 3 over the training and validation sets, then tested it on the test set. The overall and verb baselines are 800 and 817 out of 861 total examples, respectively. The classifier correctly labeled 819 targets in the test set.</Paragraph>
      <Paragraph position="1"> The results, broken down by frame, are shown in</Paragraph>
    </Section>
    <Section position="5" start_page="45" end_page="45" type="sub_section">
      <SectionTitle>
6.5 Discussion
</SectionTitle>
      <Paragraph position="0"> A comprehensive assessment of the classifier's performance requires a measure of interannotator agreement. Interannotator agreement represents a ceiling on the performance that can be expected on the classification task. Due to the very high baseline, even rare disagreements by human annotators affects the interpretation of the classifier's performance. Unfortunately, we did not have the resources available to redundantly annotate the corpus.</Paragraph>
      <Paragraph position="1"> We examined the 42 remaining errors and categorized them into four types:  The fixable errors are those that could be fixed givenmoreexperimentationwiththefeaturesetsand more data. Many of these errors are probably caused by the verbal bias, but a verbal bias that should not be insurmountable (for example, 2 or 3 metaphor to each 1 literal).</Paragraph>
      <Paragraph position="2"> The 27 errors caused by verbal biases are ones where the verb is so strongly biased to a particular metaphoric class that it is unsurprising that a test example of the opposite class was missed. Verbs like treat (0 metaphoric to 20 literal) and lead (345 metaphoric to 0 literal) are in this category.</Paragraph>
      <Paragraph position="3"> Thetworemainingerrorsarecaseswheretheverb was not present in the training data.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>