<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1005"> <Title>Effectively Using Syntax for Recognizing False Entailment</Title> <Section position="5" start_page="38" end_page="39" type="evalu"> <SectionTitle> 6 Results and Discussion </SectionTitle> <Paragraph position="0"> In this section we present the final results of our system on the PASCAL RTE-1 test set, and examine our features in an ablation study. The PASCAL RTE-1 development and test sets consist of 567 and 800 examples, respectively, with the test set split equally between true and false examples.</Paragraph> <Section position="1" start_page="38" end_page="38" type="sub_section"> <SectionTitle> 6.1 Results and Performance Comparison on the PASCAL RTE-1 Test Set </SectionTitle> <Paragraph position="0"> Table 2 displays the accuracy and confidence-weighted score (CWS) of our final system on each of the tasks for both the development and test sets. Our overall test set accuracy of 62.50% represents a 2.1% absolute improvement over the task-independent system described in (Tatu and Moldovan, 2005), and a 20.2% relative improvement in accuracy over their system with respect to an uninformed baseline accuracy of 50%.</Paragraph> <Paragraph position="1"> To compute confidence scores for our judgments, any entailment determined to be false by any heuristic was assigned maximum confidence; no attempt was made to distinguish between entailments rejected by different heuristics. The confidence of all other predictions was calculated as the absolute value of the difference between the output score(H,T) of the lexical similarity model and the threshold t = 0.1285, as tuned for highest accuracy on our development set. 
We would expect a higher CWS to result from learning a more appropriate confidence function; nonetheless, our overall test set CWS of 0.6534 is higher than that of previously reported task-independent systems (however, the task-dependent system reported in (Raina et al., 2005) achieves a CWS of 0.686).</Paragraph> </Section> <Section position="2" start_page="38" end_page="39" type="sub_section"> <SectionTitle> 6.2 Feature analysis </SectionTitle> <Paragraph position="0"> Table 3 displays the results of our feature ablation study, analyzing the individual effect of each feature.</Paragraph> <Paragraph position="1"> Of the seven heuristics used in our final system for node alignment (including lexical similarity and paraphrase detection), our ablation study showed that five were helpful to varying degrees on our test set, but that removal of either MindNet similarity scores or paraphrase detection resulted in no accuracy loss on the test set.</Paragraph> <Paragraph position="2"> Of the six false entailment heuristics used in the final system, five resulted in an accuracy improvement on the test set (the most effective by far was &quot;Argument Movement&quot;, resulting in a net gain of 20 correctly classified false examples); inclusion of the &quot;Superlative Mismatch&quot; feature resulted in a small net loss of two examples.</Paragraph> <Paragraph position="3"> We note that our heuristics for false entailment, where applicable, were indeed significantly more accurate than our final system as a whole; on the set of examples predicted false by our heuristics we had 71.3% accuracy on the training set (112 correct out of 157 predicted), and 72.9% accuracy on the test set (164 correct out of 225 predicted).</Paragraph> </Section> </Section> </Paper>