<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2803">
<Title>A Little Goes a Long Way: Quick Authoring of Semantic Knowledge Sources for Interpretation</Title>
<Section position="6" start_page="0" end_page="0" type="evalu">
<SectionTitle> 5 Evaluation </SectionTitle>
<Paragraph position="0"> A preliminary evaluation was run for the physics domain.</Paragraph>
<Paragraph position="1"> For our evaluation we used a corpus of essays written by students in response to 5 simple qualitative physics questions such as &quot;If a man is standing in an elevator holding his keys in front of his face, and if the cable holding the elevator snaps and the man then lets go of the keys, what will be the relationship between the position of the keys and that of the man as the elevator falls to the ground? Explain why.&quot; A predicate language definition was designed, consisting of 40 predicates, 31 predicate types, 160 tokens, 37 token types, and 15 abstract types.</Paragraph>
<Paragraph position="2"> The language was meant to represent the physical objects mentioned in our set of physics problems, body states (e.g., freefall, contact, non-contact), quantities that can be measured (e.g., force, velocity, acceleration, speed), features of these quantities (e.g., direction, magnitude), comparisons between quantities (equivalence, non-equivalence, relative size, relative time, relative location), physics laws, and dependency relations. An initial set of 250 example sentences was then annotated, including sentences from each of the 5 physics problems.</Paragraph>
<Paragraph position="3"> Next, a set of 202 novel test sentences, each between 4 and 64 words long, was extracted from the corpus. Since comparisons, such as between the accelerations of objects in freefall together, are important for the reasoning in all of the questions used for corpus collection, we focused the coverage evaluation specifically on sentences pertaining to comparisons, such as those in Figures 1 and 2. The goal of the evaluation was to test the extent to which knowledge generated from annotated examples generalizes to novel examples.</Paragraph>
<Paragraph position="4"> Since obtaining the correct predicate language representation requires obtaining a correct syntactic parse, we first evaluated CARMEL's syntactic coverage over the corpus of test sentences to obtain an upper bound for expected performance. We assigned the syntactic interpretation of each sentence a score of None, Bad, Partial, or Acceptable. A grade of None indicates that no interpretation was built by the grammar. Bad indicates that parses were generated, but they contained errorful functional relationships between constituents. Partial indicates that no parse was generated that covered the entire sentence, but the portions that were covered were completely correct for at least one interpretation of the sentence. Acceptable indicates that a complete parse was built that contained no incorrect functional relationships; if any word of the sentence was not covered, it was one that would not change the meaning of the sentence. For example, &quot;he had the same velocity as you had&quot; means the same as &quot;he had the same velocity as you&quot;, so if the final &quot;had&quot; was not part of the parse but the parse was otherwise correct, it was counted as Acceptable. Overall the coverage of the grammar was very good. 166 sentences were graded Acceptable, which is about 83% of the corpus.
8 received a grade of Partial, 26 Bad, and 1 None.</Paragraph>
<Paragraph position="5"> We then applied the same set of grades to the quality of the predicate language output. Note that the grade assigned to an analysis represents the correctness and completeness of the predicate representation the system obtained for that sentence. In this case, a grade of Acceptable meant that all aspects of the intended meaning were accounted for, and no misleading information was encoded.</Paragraph>
<Paragraph position="6"> Partial indicated that some non-trivial part of the intended meaning was communicated. Any interpretation containing any misleading information was counted as Bad. If no predicate language representation was returned, the sentence was graded as None. As expected, grades for semantic interpretation were not as high as for syntactic analysis. In particular, 107 sentences were assigned a grade of Acceptable, 45 Partial, and 36 Bad, and 14 received no interpretation. Our evaluation demonstrates that knowledge generated from annotated examples can be used to interpret novel sentences; however, there are still gaps in the coverage of the automatically generated knowledge sources that need to be filled in with new annotated examples.</Paragraph>
<Paragraph position="7"> Furthermore, the small but noticeable percentage of bad interpretations indicates that some previously annotated examples need to be modified in order to prevent these bad interpretations from being generated.</Paragraph>
</Section>
</Paper>