<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1013"> <Title>High Precision Extraction of Grammatical Relations</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 The Analysis System </SectionTitle> <Paragraph position="0"> In this investigation we extend a statistical shallow parsing system for English developed originally by Carroll, Minnen and Briscoe (1998). Briefly, the system works as follows: input text is labelled with part-of-speech (PoS) tags by a tagger, and these are parsed using a wide-coverage unification-based 'phrasal' grammar of English PoS tags and punctuation. For disambiguation, the parser uses a probabilistic LR model derived from parse tree structures in a treebank, augmented with a set of lexical entries for verbs, acquired automatically from a 10 million word sample of the British National Corpus (Leech, 1992), each entry containing subcategorisation frame information and an associated probability. The parser is therefore 'semi-lexicalised' in that verbal argument structure is disambiguated lexically, but the rest of the disambiguation is purely structural.</Paragraph> <Paragraph position="1"> The coverage of the grammar, that is, the proportion of sentences for which at least one complete spanning analysis is found, is around 80% when applied to the SUSANNE corpus (Sampson, 1995). In addition, the system is able to perform parse failure recovery, finding the highest scoring sequence of phrasal fragments (following the approach of Kiefer et al., 1999), and the system has produced at least partial analyses for over 98% of the sentences in the written part of the British National Corpus.</Paragraph> <Paragraph position="2"> The parsing system reads off grammatical relation tuples (GRs) from the constituent structure tree that is returned from the disambiguation phase. 
The extraction uses information about which grammar rules introduce subjects, complements, and modifiers, and about which daughter(s) is/are the head(s) and which the dependents. In Carroll et al.'s evaluation the system achieves GR accuracy that is comparable to published results for other systems: extraction of non-clausal subject relations with 83% precision, compared with Grefenstette's (1998) figure of 80%; and overall F-score of unlabelled head-dependent pairs of 80%, as opposed to Lin's (1998) 83% and Srinivas's (2000) 84% (this with respect only to binary relations, and omitting the analysis of control relationships). Blaheta and Charniak (2000) report an F-score of 87% for assigning grammatical function tags to constituents, but the task, and therefore the scoring method, is rather different.</Paragraph> <Paragraph position="3"> For the work reported in this paper we have extended Carroll et al.'s basic system, implementing a version of Schmid and Rooth's expected governor technique (see section 1 above) but adapted for unification-based grammar and GR-based analyses.</Paragraph> <Paragraph position="4"> Each sentence is analysed as a set of weighted GRs, where the weight associated with each grammatical relation is computed as the sum of the probabilities of the parses that relation was derived from, divided by the sum of the probabilities of all parses.</Paragraph> <Paragraph position="5"> So, if we assume that Schmid and Rooth's example sentence Peter reads every paper on markup has 2 parses, one where on markup attaches to the preceding noun having overall probability p_1 and the other where it has verbal attachment with probability p_2, then some of the weighted GRs would be</Paragraph> <Paragraph position="7"> Figure 1 contains a more extended example of a weighted GR analysis for a short sentence from the SUSANNE corpus, and also gives a flavour of the relation types that the system returns. 
The GR scheme is described in detail by Carroll, Briscoe and Sanfilippo (1998).</Paragraph> </Section></Paper>
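<!--
The GR weighting described in Section 2 (the summed probability of the parses a relation is derived from, divided by the summed probability of all parses) can be sketched as below. This is only an illustration under stated assumptions, not the authors' implementation: the function name weighted_grs, the tuple layout, the relation labels (subj, obj, mod) and the two parse probabilities are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_grs(parses):
    """Return a dict mapping each GR tuple to its weight.

    `parses` is a list of (probability, grs) pairs, where `grs` is an
    iterable of hashable GR tuples read off one parse (assumed layout).
    """
    total = sum(prob for prob, _ in parses)
    if total <= 0:
        raise ValueError("parse probabilities must sum to a positive value")

    mass = defaultdict(float)
    for prob, grs in parses:
        # Count each relation once per parse, even if it is listed twice.
        for gr in set(grs):
            mass[gr] += prob

    # Normalise by the probability mass of all parses.
    return {gr: m / total for gr, m in mass.items()}


# Hypothetical two-parse example for "Peter reads every paper on markup".
# The probabilities and labels are made up for illustration.
parses = [
    # Parse 1: "on markup" attaches to the noun "paper".
    (0.006, [("subj", "reads", "Peter"),
             ("obj", "reads", "paper"),
             ("mod", "on", "paper", "markup")]),
    # Parse 2: "on markup" attaches to the verb "reads".
    (0.002, [("subj", "reads", "Peter"),
             ("obj", "reads", "paper"),
             ("mod", "on", "reads", "markup")]),
]

weights = weighted_grs(parses)
for gr, w in sorted(weights.items(), key=lambda item: item[1], reverse=True):
    print(round(w, 3), gr)
```

With these made-up inputs, relations shared by both parses receive weight 1, while the two competing attachments split the mass in proportion to their parse probabilities (0.006 and 0.002 out of a total of 0.008).
-->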