<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1047">
<Title>Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II Treebank</Title>
<Section position="7" start_page="0" end_page="0" type="concl">
<SectionTitle>6 Conclusions</SectionTitle>
<Paragraph position="0"> We have presented an algorithm and its implementation for the extraction of semantic forms or subcategorisation frames from the Penn-II Treebank, automatically annotated with LFG f-structures. We have substantially extended the earlier approach of van Genabith et al. (1999), which was small-scale and 'proof of concept'. We have scaled our approach to all WSJ sections of Penn-II (50,000 trees). Our approach does not predefine the subcategorisation frames we extract, as many other approaches do. We extract abstract syntactic function-based subcategorisation frames (LFG semantic forms), traditional CFG category-based frames, as well as mixed function-category-based frames. Unlike many other approaches to subcategorisation frame extraction, our system properly reflects the effects of long-distance dependencies and distinguishes between active and passive frames.</Paragraph>
<Paragraph position="1"> Finally, our system associates conditional probabilities with the frames we extract. We carried out an extensive evaluation of the complete induced lexicon (not just a sample) against the full COMLEX resource. To our knowledge, this is the most extensive qualitative evaluation of subcategorisation extraction for English. The only evaluation of a similar scale is that carried out by Schulte im Walde (2002) for German. Our results compare well with hers.</Paragraph>
<Paragraph position="2"> We believe our semantic forms are fine-grained, and by choosing to evaluate against COMLEX we set our sights high: COMLEX is considerably more detailed than the OALD or LDOCE used for other evaluations.</Paragraph>
<Paragraph position="3"> Work is currently under way to extend the coverage of our acquired lexicons by applying our methodology to the Penn-III treebank, a more balanced corpus resource with a number of text genres (in addition to the WSJ sections). The induction of lexical resources is part of a larger project on the acquisition of wide-coverage, robust, probabilistic, deep unification grammar resources from treebanks. We are already using the extracted semantic forms in parsing new text with robust, wide-coverage PCFG-based LFG grammar approximations automatically acquired from the f-structure-annotated Penn-II treebank (Cahill et al., 2004a). We hope to apply our lexical acquisition methodology beyond existing parse-annotated corpora (Penn-II and Penn-III): new text is parsed by our PCFG-based LFG approximations into f-structures, from which we can then extract further semantic forms. The work reported here is part of the core component for bootstrapping this approach.</Paragraph>
<Paragraph position="4"> As the extraction algorithm we presented derives semantic forms at the f-structure level, it is easily applied to other, even typologically different, languages. We have successfully ported our automatic annotation algorithm to the TIGER Treebank, despite German being a less configurational language than English, and extracted wide-coverage, probabilistic LFG grammar approximations and lexical resources for German (Cahill et al., 2003).
Currently, we are migrating the technique to Spanish, which has freer word order than English and less morphological marking than German. Preliminary results have been very encouraging.</Paragraph>
</Section>
</Paper>
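
Editor's note: the conclusions state that conditional probabilities are associated with the extracted frames. The following is a minimal sketch of one way such relative-frequency estimates could be computed from extracted (lemma, frame) observations; the function name frame_probabilities, the bracketed frame notation, and the example data are illustrative assumptions, not the authors' implementation.

from collections import defaultdict

def frame_probabilities(lemma_frame_pairs):
    """Estimate P(frame | lemma) by relative frequency over
    extracted (lemma, frame) observations."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for lemma, frame in lemma_frame_pairs:
        counts[lemma][frame] += 1
        totals[lemma] += 1
    return {
        lemma: {frame: n / totals[lemma] for frame, n in frames.items()}
        for lemma, frames in counts.items()
    }

# Hypothetical observations in an LFG-style, function-based notation:
observations = [
    ("give", "[subj,obj,obl:to]"),
    ("give", "[subj,obj,obj2]"),
    ("give", "[subj,obj,obl:to]"),
    ("believe", "[subj,comp]"),
]

print(frame_probabilities(observations))
# give: [subj,obj,obl:to] -> 0.67, [subj,obj,obj2] -> 0.33; believe: [subj,comp] -> 1.0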