<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1076"> <Title>Hong Kong</Title> <Section position="11" start_page="608" end_page="609" type="evalu"> <SectionTitle> 5. Experiments </SectionTitle>
<Paragraph position="0"> RC experiments are run on the Remedia corpus as well as the ChungHwa corpus. The Remedia training set has 55 stories, each with about five questions. The Remedia test set has 60 stories and 5 questions per story. The ChungHwa corpus is derived from the book, &quot;English Reading Comprehension in 100 days,&quot; published by Chung Hwa Book Co., (H.K.) Ltd. The ChungHwa training set includes 100 English stories, each with four questions on average.</Paragraph>
<Paragraph position="1"> The ChungHwa test set includes 50 stories and their questions. We use HumSent, the percentage of questions for which the system selects the human-marked answer sentence, as the prime evaluation metric for reading comprehension.</Paragraph>
<Paragraph position="2"> The three kinds of knowledge sources are used incrementally on top of the baseline in our experimental setup, and results are labeled with the following abbreviations: bag-of-words (BOW), metadata (MD), Web-derived answer patterns (AP) and contextual knowledge (Context).</Paragraph>
<Section position="1" start_page="608" end_page="608" type="sub_section"> <SectionTitle> 5.1 Results on Remedia </SectionTitle>
<Paragraph position="0"> Table 8 shows the RC results for various question types in the Remedia test set.</Paragraph>
<Paragraph position="1"> We observe that the HumSent accuracies vary substantially across the different interrogatives. The system performs best for when questions and worst for why questions. The use of Web-derived answer patterns brings improvements for all interrogatives. The other knowledge sources, namely metadata and context, bring improvements for some question types but degrade others.</Paragraph>
<Paragraph position="2"> Figure 1 shows the overall RC results of our system. The relative incremental gains due to the use of metadata, Web-derived answer patterns and context are 20.7%, 10.9% and 8.2%, respectively. We also ran pairwise t-tests on the statistical significance of these incremental improvements over BOW; the results are shown in Table 9. The improvements due to metadata matching and Web-derived answer patterns are statistically significant (p&lt;0.05), but the improvement due to context is not.</Paragraph>
<Paragraph position="3"> We also compared our results across the various interrogatives with those previously reported in (Riloff and Thelen, 2000). Their system is based on handcrafted rules with deterministic algorithms. The comparison (see Table 10) shows that our approach, which is based on data-driven patterns and statistics, achieves comparable performance.</Paragraph> </Section>
<Section position="2" start_page="608" end_page="609" type="sub_section"> <SectionTitle> 5.2 Results on ChungHwa </SectionTitle>
<Paragraph position="0"> Experimental results for the ChungHwa corpus are presented in Figure 2. The HumSent accuracies obtained are generally higher than those on Remedia. We observe trends similar to those on Remedia, i.e., the use of metadata, Web-derived answer patterns and context brings incremental gains in RC performance. However, the actual gains are much smaller.</Paragraph>
<Paragraph position="1"> To understand the underlying reason for the reduced performance gains as we migrated from Remedia to ChungHwa, we analyzed question lengths as well as the degree of word match between questions and answers in the two corpora. Figure 3 shows that questions in ChungHwa are longer on average than those in Remedia. Longer questions carry more information, which benefits the BOW approach in finding the correct answer.</Paragraph>
<Paragraph position="2"> ChungHwa also has a higher proportion of questions with a match-size (i.e., the number of words shared by a question and its answer sentence) larger than 2. This gives the BOW approach an advantage in RC. It is also observed that approximately 10% of the Remedia questions have no correct answer (i.e., match-size = -1) and about 25% share no words with the correct answer sentence (match-size = 0). This explains the overall discrepancy in HumSent accuracies between the two corpora.</Paragraph>
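To make the match-size statistic and the BOW baseline it favors concrete, here is a minimal sketch in Python; the function names, tokenization and data layout are illustrative assumptions of ours, not the authors' implementation:

import re

def bag_of_words(text):
    # Lowercased word bag; a real system would likely also remove stopwords and stem.
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def match_size(question, answer_sentence):
    # Number of words shared by a question and its answer sentence.
    # Following the convention above, -1 marks a question with no correct answer.
    if answer_sentence is None:
        return -1
    return len(bag_of_words(question) & bag_of_words(answer_sentence))

def bow_answer(question, story_sentences):
    # BOW baseline: choose the story sentence with the largest word overlap.
    return max(story_sentences, key=lambda s: match_size(question, s))

def humsent_accuracy(qa_pairs, story_sentences):
    # HumSent over one story: the fraction of questions for which the selected
    # sentence is the human-marked answer sentence.
    hits = sum(bow_answer(q, story_sentences) == gold for q, gold in qa_pairs)
    return hits / len(qa_pairs)

Under this sketch, a match-size of 0 leaves the BOW scorer with no signal at all, which is why the roughly 25% of Remedia questions with no word overlap bound the HumSent accuracy that word matching alone can attain.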
<Paragraph position="3"> While our approach leverages a variety of knowledge sources for RC, our system is still unable to correctly answer 58% of the questions in Remedia. An example of such an elusive question is: Question: When do the French celebrate their freedom? Answer Sentence: To the French, July 14 has the same meaning as July 4th does to the United States.</Paragraph> </Section> </Section> </Paper>