<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1063">
<Title>QuestionBank: Creating a Corpus of Parse-Annotated Questions</Title>
<Section position="4" start_page="497" end_page="497" type="intro">
<SectionTitle> 2 Background and Motivation </SectionTitle>
<Paragraph position="0"> High quality probabilistic, treebank-based parsing resources can be rapidly induced from appropriate treebank material. However, treebank- and machine learning-based grammatical resources reflect the characteristics of the training data. They generally underperform on test data substantially different from the training data.</Paragraph>
<Paragraph position="1"> Previous work on parser performance and domain variation by Gildea (2001) showed that when a parser is trained on the Penn-II Treebank and tested on the Brown corpus, parser accuracy drops by 5.7% compared to parsing the Wall Street Journal (WSJ) based Penn-II Treebank Section 23. This shows a negative effect on parser performance even when the test data is not radically different from the training data (both the Penn-II and Brown corpora consist primarily of written texts of American English; the main difference is the considerably more varied nature of the text in the Brown corpus). Gildea also shows how to resolve this problem by adding appropriate data to the training corpus, but notes that a large amount of additional data has little impact if it is not matched to the test material.</Paragraph>
<Paragraph position="2"> Work on more radical domain variance and on adapting treebank-induced LFG resources to analyse ATIS (Hemphill et al., 1990) question material is described in Judge et al. (2005). The research established that even a small amount of additional training data can give a substantial improvement in question analysis in terms of both CFG parse accuracy and LFG grammatical functional analysis, with no significant negative effects on non-question analysis. Judge et al. (2005) suggest, however, that further improvements are possible given a larger question training corpus.</Paragraph>
<Paragraph position="3"> Clark et al. (2004) worked specifically with question parsing to generate dependencies for QA with Penn-II treebank-based Combinatory Categorial Grammars (CCGs). They use what-questions taken from the TREC QA datasets as the basis for a What-Question corpus with CCG annotation.</Paragraph>
</Section>
</Paper>