<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0810">
  <Title>A First Evaluation of Logic Form Identification Systems</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Test Data
</SectionTitle>
    <Paragraph position="0"> The test data was compiled so that the impact of external tools that different sytems might use in the LF identification process be minimal. For example, it is well-known that the accuracy of automatic syntactic parsing drops drastically for sentences larger than 40 words and thus we kept the size of the collected sentences below the 40 words threshold. The average sentence size in the test data is 9.89 words.</Paragraph>
    <Paragraph position="1"> Special attention was paid to covering linguistic phenomena such as: coordination, compound nouns, ditransitives, multiword expressions (give up, as well as, etc.), relative clauses and others.</Paragraph>
    <Paragraph position="2"> Different sources were used to look up such cases: Treebank, WordNet and the web.</Paragraph>
    <Paragraph position="3"> The size of the test set (4155 arguments, 2398 predicates, 300 sentences) allows a better evaluation of the vertical scalability (coverage of as many linguistics problems as possible) of sytems rather than their horizontal scalability (handling large data sets without significant deterioration of performance displayed on small sets).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Annotation Guidelines
</SectionTitle>
    <Paragraph position="0"> The annotation part is the most critical part of any evaluation exercise. For the Logic Form Identification task the following steps were applied to obtain the correct LF for the test data: 1. logic forms for the test data were automatically obtained using an extended version of the LF derivation engine developed in (Rus, 2002) for LFi of WordNet glosses. As part of this step, sentences were preprocessed: tokenized (separating punctuation from words) using the Penn Treebank guidelines, tagged with Brill's tagger (Brill, 1992) and then parsed with Collins' statistical parser (Collins, 1996).</Paragraph>
    <Paragraph position="1">  2. a first manual checking of the previously generated LF was done.</Paragraph>
    <Paragraph position="2"> 3. a second manual checking was done by another annotator.</Paragraph>
    <Paragraph position="3"> 4. quality assurance of the previous steps was performed by individual annotators by checking  specific cases (ditransitives, relative pronouns, etc.) with much emphasis on consistency.</Paragraph>
    <Paragraph position="4"> 5. annotators agreement was done with a human moderator solving conflicting cases.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Metrics
</SectionTitle>
    <Paragraph position="0"> Two performance measures to evaluate Logic Form Identification methods were developed by Rus in (Rus and Moldovan, 2002) for the particular task of LFi for WordNet glosses (the definitions of concepts are shorter than regular sentences in terms of number of words, etc.). Each measure has advantages in some context.</Paragraph>
    <Paragraph position="1"> Predicate level performance is defined as the number of predicates with correct arguments divided by the total number of predicates. This measure focuses on the derivation method, though at a coarse-grained level because it does not capture the capability of a method to successfully identify a specific argument, e.g. the subject of a verb.</Paragraph>
    <Paragraph position="2"> Gloss level performance is the number of entire glosses correctly transformed into logic forms divided by the total number of glosses attempted. This measure catches contextual capabilities of a method in that it gives an idea of how well a method performs at gloss level. It is a more appropriate measure when one tries to see the impact of using full glosses in logic forms to applications such as planning. This measure is specific to the particular task of LFi for concept definitions and thus is not suited for general open text tasks.</Paragraph>
    <Paragraph position="3"> Let us consider the following gloss from WordNet: null Abbey is a convent ruled by an abbess.</Paragraph>
    <Paragraph position="4"> and let us suppose that some system, say Sys is able to generate the following logic form (please note that the subject of rule event is missing):</Paragraph>
    <Paragraph position="6"> Since one of the arguments is missing the predicate level performance is 5/6 (there are 6 predicates and for five of them the system generated all the arguments correctly) and the gloss level performance is 0/1 (this measure awards cases where all the predicates in the statement have all their arguments correctly assigned).</Paragraph>
    <Paragraph position="7"> None of the two measures can distinguish between two systems, where one misses the subject of the rule event and the other misses both the subject and object (both systems will miss one predicate).</Paragraph>
    <Paragraph position="8"> We propose two new, finer metrics in the next section, that are more suitable for a less restrictive LFi task: precision and recall. Both precision and recall can be defined at argument and predicate level, respectively.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Argument Level
</SectionTitle>
      <Paragraph position="0"> We define Precision at argument level as the number of correctly identified arguments divided by the number of all identified arguments. Recall at argument level is the number of correctly identified arguments divided by the number of arguments that were supposed to be identified.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Predicate Level
</SectionTitle>
      <Paragraph position="0"> Precision at predicate level is the number of correctly and fully identified predicates (with ALL arguments correctly identified) divided by the number of all attempted predicates. Recall at predicate level is the number of correctly and fully identified predicates (with ALL arguments correctly identified) divided by the number of all predicates that were supposed to be identified.</Paragraph>
      <Paragraph position="1"> Let us suppose that some system outputs the following logic form for the above example:</Paragraph>
      <Paragraph position="3"> where x4 is incorrectly indentified as the direct object of eating event. In the correct output there are 11 slots to be filled and the predicate eat should have 4 arguments. The previously defined measures for the sample output are given in Table 1:  ment and predicate level.</Paragraph>
      <Paragraph position="4"> In addition, we report a more global measure called exact sentence which is defined as the number of sentences whose logic form was fully identified (all predicates and arguments correctly found) divided by the number of sentences attempted. This is similar to gloss level performance measure presented before. We proposed and computed several variants for it which are described below.</Paragraph>
      <Paragraph position="5"> Sentence-Argument (Sent-A): How many sentences have ALL arguments correctly detected out of all attempted sentences.</Paragraph>
      <Paragraph position="6"> Sentence-Predicate (Sent-P): How many sentences have ALL predicates correctly detected out of all attempted sentences.</Paragraph>
      <Paragraph position="7"> Sentence-Argument-Predicate Sent-AP: How many sentences have ALL arguments correctly detected out of sentences which have ALL predicates correctly detected</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Sentence-Argument-Predicate-Sentences Sent-
</SectionTitle>
      <Paragraph position="0"> APSent: How many sentences have ALL arguments and ALL predicates correctly detected out of all attempted sentences.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Extra Resources
</SectionTitle>
    <Paragraph position="0"> A package of trial data was provided to interested participants. The trial package contains two data files: (1) English sentences and (2) their corresponding logic form. A software evaluator was available for download on the web page of the task. We compiled a dictionary of collocations from WordNet which was also freely available for download. It includes 62,611 collocations.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Submission Format
</SectionTitle>
    <Paragraph position="0"> Each team was supposed to submit a file containing on each line the answer to a input sentence using the following pattern:  The field Y000 was generated as is, for all lines. It will be used in future trials.</Paragraph>
  </Section>
class="xml-element"></Paper>