<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0209"> <Title>APPORTIONING DEVELOPMENT EFFORT IN A PROBABILISTIC LR PARSING SYSTEM THROUGH EVALUATION</Title> <Section position="10" start_page="97" end_page="98" type="concl"> <SectionTitle> 6. CONCLUSIONS </SectionTitle> <Paragraph position="0"> In this paper we have outlined an approach to robust domain-independent parsing, in which subcategorisation constraints play no part, resulting in coverage that greatly improves upon more conventional grammar-based approaches to NL text analysis. We described an implemented system, and evaluated its performance along several different dimensions. We assessed its coverage and that of previous versions on a development corpus and an unseen corpus, and demonstrated that the grammar refinement we have carried out has led to substantial improvements in coverage and reductions in spurious ambiguity. We also evaluated the accuracy of parse selection with respect to treebank analyses, and, by varying the amount of training material, we showed that it requires comparatively little data to achieve a good level of accuracy.</Paragraph> <Paragraph position="1"> We have made good progress in increasing grammar coverage, though we have now reached a point of diminishing returns. Further significant improvements in this area would require corpus-specific additions and tuning whose benefit would not necessarily carry over to other corpora. In the application we are currently using the system for, namely automatic extraction of subcategorisation frames, and more generally argument structure, from large amounts of text (Briscoe &amp; Carroll, 1996), we do not need full coverage; 70-80% appears to be sufficient. However, further improvements in coverage will require some automated approach to rule induction driven by parse failure. 
Since our evaluations indicate that our system achieves a good level of accuracy with little treebank data, and that 67-75% coverage was achieved for English quite early in the grammar refinement effort, porting the current system to other languages should be possible with small-to-medium-sized treebanks (around 20K words) and feasible manual effort (of the order of 12 person-months for grammar-writing and treebanking). This may yield a system accurate enough for some types of application, given that the system is not restricted to returning the single highest ranked analysis but can return the n-highest ranked for further application-specific selection.</Paragraph> <Paragraph position="2"> Although we report promising results, parse selection that is sufficiently accurate for many practical applications will require a more lexicalised system. Magerman's (1995) parser is an extension of the history-based parsing approach developed at IBM (Black et al., 1993) in which rules are conditioned on lexical and other (essentially arbitrary) information available in the parse history. In future work, we intend to explore a more restricted and semantically-driven version of this approach in which, firstly, probabilities are associated with different subcategorisation possibilities, and secondly, alternative predicate-argument structures derived from the grammar are ranked probabilistically. However, the massively increased coverage obtained here by relaxing subcategorisation constraints underlines the need to acquire accurate and complete subcategorisation frames in a corpus-driven fashion, before such constraints can be exploited robustly and effectively with free text.</Paragraph> </Section> </Paper>