<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1007">
  <Title>Acquisition of Verb Entailment from Text</Title>
  <Section position="7" start_page="52" end_page="54" type="evalu">
    <SectionTitle>
6 Results and Discussion
</SectionTitle>
    <Paragraph position="0"> We now present the results of the evaluation of the method. In Section 6.1, we study its parameters and determine the best configuration. In Section 6.2, we compare its performance against that of human subjects as well as that of two state-of-the-art lexical resources: the verb entailment knowledge contained in WordNet2.0 and the inference rules from the DIRT database (Lin and Pantel, 2001).</Paragraph>
    <Section position="1" start_page="52" end_page="53" type="sub_section">
      <SectionTitle>
6.1 Model parameters
</SectionTitle>
      <Paragraph position="0"> We first examined the following parameters of the model: the window size, the use of paragraph boundaries, and the effect of the shared anchor on the quality of the model.</Paragraph>
      <Paragraph position="1"> 6.1.1 Window size and paragraph boundaries As was mentioned in Section 4.1, a free parameter in our model is a threshold on the distance between two clauses, that we take as an indicator that the clauses are discourse-related. To find an optimal threshold, we experimented with windows of 1, 2 ... 25 clauses around a given clause, taking clauses appearing within the window as potentially related to the given one. We also looked at the effect paragraph boundaries have on the identification of related clauses. Figure 2 shows two curves depicting the accuracy of the method as a function of the window size: the first one describes performance  when paragraph boundaries are taken into account (PAR) and the second one when they are ignored (NO PAR).</Paragraph>
      <Paragraph position="2">  of window size, with and without paragraph boundaries used for delineating coherent text.</Paragraph>
      <Paragraph position="3"> One can see that both curves rise fairly steeply up to window size of around 7, indicating that many entailment pairs are discovered when the two clauses appear close to each other. The rise is the steepest  between windows of 1 and 3, suggesting that entailment relations are most often explicated in clauses appearing very close to each other.</Paragraph>
      <Paragraph position="4"> PAR reaches its maximum at the window of 15, where it levels off. Considering that 88% of paragraphs in BNC contain 15 clauses or less, we take this as an indication that a segment of text where both a premise and its consequence are likely to be found indeed roughly corresponds to a paragraph. NO PAR's maximum is at 10, then the accuracy starts to decrease, suggesting that evidence found deeper inside other paragraphs is misleading to our model.</Paragraph>
      <Paragraph position="5"> NO PAR performs consistently better than PAR until it reaches its peak, i.e. when the window size is less than 10. This seems to suggest that several initial and final clauses of adjacent paragraphs are also likely to contain information useful to the model. We tested the difference between the maxima of PAR and NO PAR using the sign test, the non-parametric equivalent of the paired t-test. The test did not reveal any significance in the difference between their accuracies (6-, 7+, 116 ties: p = 1.000).  We further examined how the criterion of the common anchor influenced the quality of the model. We compared this model (ANCHOR) against the one that did not require that two clauses share an anchor (NO ANCHOR), i.e. considering only co-occurrence of verbs concatenated with specific syntactic role labels. Additionally, we included into the experiment a model that looked at plain verbs co-occurring inside a context window (PLAIN). Figure 3 compares the performance of these three models (paragraph boundaries were taken into account in all of them). Compared with ANCHOR, the other two models achieve considerably worse accuracy scores. The differences between the maximum of ANCHOR and those of the other models are significant according to the sign test (ANCHOR vs NO ANCHOR: 44+, 8-, 77 ties: p &lt; 0.001; ANCHOR vs PLAIN: 44+, 10-, 75 ties: p &lt; 0.001). Their maxima are also reached sooner (at the window of 7) and thereafter their performance quickly degrades. This indicates that the common anchor criterion is very useful, especially for locating related clauses at larger distances in the  accuracy of the method.</Paragraph>
      <Paragraph position="6"> The accuracy scores for NO ANCHOR and PLAIN are very similar across all the window size settings. It appears that the consistent co-occurrence of specific syntactic labels on two verbs gives no additional evidence about the verbs being related.</Paragraph>
    </Section>
    <Section position="2" start_page="53" end_page="54" type="sub_section">
      <SectionTitle>
6.2 Human evaluation
</SectionTitle>
      <Paragraph position="0"> Once the best parameter settings for the method were found, we compared its performance against human judges as well as the DIRT inference rules and the verb entailment encoded in the WordNet 2.0 database.</Paragraph>
      <Paragraph position="1"> Human judges. To elicit human judgments on the evaluation data, we automatically converted the templates into a natural language form using a number of simple rules to arrange words in the correct grammatical order. In cases where an obligatory syntactic position near a verb was missing, we supplied the pronouns someone or something in that position. In each template pair, the premise was turned into a statement, and the consequence into a question. Figure 4 illustrates the result of converting the test item from the previous example (Figure 1) into the natural language form.</Paragraph>
      <Paragraph position="2"> During the experiment, two judges were asked to mark those statement-question pairs in each test item, where, considering the statement, they could answer the question affirmatively. The judges' decisions coincided in 95 of 129 test items. The Kappa statistic is k=0.725, which provides some indication about the upper bound of performance on this task.</Paragraph>
      <Paragraph position="3">  rect consequence is marked by an asterisk.</Paragraph>
      <Paragraph position="4"> DIRT. We also experimented with the inference rules contained in the DIRT database (Lin and Pantel, 2001). According to (Lin and Pantel, 2001), an inference rule is a relation between two verbs which are more loosely related than typical paraphrases, but nonetheless can be useful for performing inferences over natural language texts. We were interested to see how these inference rules perform on the entailment recognition task.</Paragraph>
      <Paragraph position="5"> For each dependency tree path (a graph linking a verb with two slots for its arguments), DIRT contains a list of the most similar tree paths along with the similarity scores. To decide which is the most likely consequence in each test item, we looked up the DIRT database for the corresponding two dependency tree paths. The template pair with the greatest similarity was output as the correct answer.</Paragraph>
      <Paragraph position="6"> WordNet. WordNet 2.0 contains manually encoded entailment relations between verb synsets, which are labeled as &amp;quot;cause&amp;quot;, &amp;quot;troponymy&amp;quot;, or &amp;quot;entailment&amp;quot;. To identify the template pair satisfying entailment in a test item, we checked whether the two verbs in each pair are linked in WordNet in terms of one of these three labels. Because Word-Net does not encode the information as to the relative plausibility of relations, all template pairs where verbs were linked in WordNet, were output as correct answers.</Paragraph>
      <Paragraph position="7"> Figure 5 describes the accuracy scores achieved by our entailment acquisition algorithm, the two human judges, DIRT and WordNet. For comparison purposes, the random baseline is also shown.</Paragraph>
      <Paragraph position="8"> Our algorithm outperformed WordNet by 0.38 and DIRT by 0.15. The improvement is significant vs. WordNet (73+, 27-, 29 ties: p&lt;0.001) as well as vs. DIRT (37+, 20-, 72 ties: p=0.034).</Paragraph>
      <Paragraph position="9"> We examined whether the improvement on DIRT was due to the fact that DIRT had less extensive  proposed algorithm, WordNet, DIRT, two human judges, and a random baseline.</Paragraph>
      <Paragraph position="10"> coverage, encoding only verb pairs with similarity above a certain threshold. We re-computed the accuracy scores for the two methods, ignoring cases where DIRT did not make any decision, i.e. where the database contained none of the five verb pairs of the test item. On the resulting 102 items, our method was again at an advantage, 0.735 vs. 0.647, but the significance of the difference could not be established (21+, 12-, 69 ties: p=0.164).</Paragraph>
      <Paragraph position="11"> The difference in the performance between our algorithm and the human judges is quite large (0.103 vs. Judge 1 and 0.088 vs Judge 2), but significance to the 0.05 level could not be found (vs. Judge 1: 17-, 29+, 83 ties: p=0.105; vs. Judge 2: 15-, 27+, ties 87: p=0.09).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>