<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2006">
<Title>Evaluating the Accuracy of an Unlexicalized Statistical Parser on the PARC DepBank</Title>
<Section position="3" start_page="0" end_page="41" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0">Considerable progress has been made in accurate statistical parsing of realistic texts, yielding rooted, hierarchical and/or relational representations of full sentences. However, much of this progress has been made with systems based on large lexicalized probabilistic context-free grammar-like (PCFG-like) models trained on the Wall Street Journal (WSJ) subset of the Penn Treebank (PTB). Evaluation of these systems has been mostly in terms of the PARSEVAL scheme, using tree similarity measures of (labelled) precision and recall and crossing bracket rate applied to section 23 of the WSJ PTB. (See, e.g., Collins (1999) for a detailed exposition of one such very fruitful line of research.) We evaluate the comparative accuracy of an unlexicalized statistical parser trained on a smaller treebank and tested on a subset of section 23 of the WSJ using a relational evaluation scheme. We demonstrate that a parser which is competitive in accuracy (without sacrificing processing speed) can be quickly developed without reliance on large in-domain manually-constructed treebanks. This makes it more practical to use statistical parsers in diverse applications needing access to aspects of predicate-argument structure.</Paragraph>
<Paragraph position="1">We define a lexicalized statistical parser as one which utilizes probabilistic parameters concerning lexical subcategorization and/or bilexical relations over tree configurations. Current lexicalized statistical parsers developed, trained and tested on the PTB achieve a labelled F1-score - the harmonic mean of labelled precision and recall - of around 90%. Klein and Manning (2003) argue that such results represent about a 4% absolute improvement over a carefully constructed unlexicalized PCFG-like model trained and tested in the same manner. Gildea (2001) shows that WSJ-derived bilexical parameters in Collins' (1999) Model 1 parser contribute less than 1% to parse selection accuracy when test data is in the same domain, and yield no improvement for test data selected from the Brown Corpus. Bikel (2004) shows that, in Collins' (1999) Model 2, bilexical parameters contribute less than 0.5% to accuracy on in-domain data, while lexical subcategorization-like parameters contribute just over 1%.</Paragraph>
<Paragraph position="2">Several alternative relational evaluation schemes have been developed (e.g. Carroll et al., 1998; Lin, 1998). However, until recently, no WSJ data had been carefully annotated to support relational evaluation. King et al. (2003) describe the PARC 700 Dependency Bank (hereinafter DepBank), which consists of 700 WSJ sentences randomly drawn from section 23. These sentences have been annotated with syntactic features and with bilexical head-dependent relations derived from the F-structure representation of Lexical Functional Grammar (LFG). DepBank facilitates comparison of PCFG-like statistical parsers developed from the PTB with other parsers whose output is not designed to yield PTB-style trees, using an evaluation which is closer to the prototypical parsing task of recovering predicate-argument structure.</Paragraph>
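<Paragraph>To make the relational scoring concrete, the following minimal sketch (in Python; it is not the scorer used by any of the systems discussed, and the head-dependent triples are invented for illustration rather than drawn from DepBank) computes labelled precision, recall and their harmonic mean, the F1-score, over sets of bilexical relations:

    # Minimal sketch of relational evaluation: precision, recall and F1
    # over sets of (head, relation, dependent) triples. The example
    # triples below are illustrative, not taken from DepBank.

    def relational_prf(gold, predicted):
        matched = len(gold.intersection(predicted))
        precision = matched / len(predicted) if predicted else 0.0
        recall = matched / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    gold = {("saw", "subj", "Kim"), ("saw", "dobj", "Sandy"),
            ("saw", "mod", "yesterday")}
    predicted = {("saw", "subj", "Kim"), ("saw", "dobj", "Sandy")}

    p, r, f1 = relational_prf(gold, predicted)
    print("P=%.2f R=%.2f F1=%.2f" % (p, r, f1))  # P=1.00 R=0.67 F1=0.80

Scoring of this kind presupposes that each parser's native output has been mapped into a common relational representation, which is why the transformation rules discussed next matter.</Paragraph>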
<Paragraph position="3">Kaplan et al. (2004) compare the accuracy and speed of the PARC XLE Parser to Collins' Model 3 parser. They develop transformation rules for both, designed to map native output to a subset of the features and relations in DepBank. They compare the performance of a grammatically cut-down and a complete version of the XLE parser to the publicly available version of Collins' parser.</Paragraph>
<Paragraph position="4">One fifth of DepBank is held out to optimize the speed and accuracy of the three systems. They conclude from the results of these experiments that the cut-down XLE parser is two-thirds the speed of Collins' Model 3 but 12% more accurate, while the complete XLE system is 20% more accurate but five times slower. F1-score percentages range from the mid- to high 70s, suggesting that the relational evaluation is harder than PARSEVAL.</Paragraph>
<Paragraph position="5">Both Collins' Model 3 and the XLE Parser use lexicalized models for parse selection trained on the rest of the WSJ PTB. Therefore, although Kaplan et al. demonstrate an improvement in accuracy at some cost to speed, questions remain about viability for applications, at some remove from the financial news domain, for which substantial treebanks are not available. The parser we deploy, like the XLE parser, is based on a manually-defined feature-based unification grammar. However, our approach is somewhat different, making maximal use of more generic structural rather than lexical information, both within the grammar and within the probabilistic parse selection model. Here we compare the accuracy of our parser with Kaplan et al.'s results by repeating their experiment. This comparison is not straightforward, given both the system-specific nature of some of the annotation in DepBank and the scoring reported. We therefore extend DepBank with a set of grammatical relations derived from our own system output and highlight how issues of representation and scoring can affect results and their interpretation.</Paragraph>
<Paragraph position="6">In §2 we describe our development methodology and the resulting system in greater detail. §3 describes the extended DepBank that we have developed and motivates our additions. §2.4 discusses how we trained and tuned our current system and describes our limited use of information derived from WSJ text. §4 details the various experiments undertaken with the extended DepBank and gives detailed results. §5 discusses these results and proposes further lines of research.</Paragraph>
</Section>
</Paper>