
<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1114">
  <Title>Can Subcategorisation Probabilities Help a Statistical Parser?</Title>
  <Section position="5" start_page="119" end_page="124" type="metho">
    <SectionTitle>
3 The Experiment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="119" end_page="119" type="sub_section">
      <SectionTitle>
3.1 The Baseline Parser
</SectionTitle>
      <Paragraph position="0"> The baseline parsing system comprises:
* an HMM part-of-speech tagger (Elworthy, 1994), which produces either the single highest-ranked tag for each word, or multiple tags with associated forward-backward probabilities (which are used with a threshold to prune lexical ambiguity);
* a robust finite-state lemmatiser for English, an extended and enhanced version of the University of Sheffield GATE system morphological analyser (Cunningham et al., 1995);
* a wide-coverage unification-based 'phrasal' grammar of English PoS tags and punctuation (an associated footnote mentions acquiring selection preferences automatically from (partially) parsed data);</Paragraph>
      <Paragraph position="1"> * a fast generalised LR parser using this grammar, taking the results of the tagger as input, and performing disambiguation using a probabilistic model similar to that of Briscoe &amp; Carroll (1993); and
* training and test treebanks (of 4600 and 500 sentences respectively) derived semi-automatically from the SUSANNE corpus (Sampson, 1995).
The grammar consists of 455 phrase structure rule schemata in the format accepted by the parser (a syntactic variant of a Definite Clause Grammar with iterative (Kleene) operators). It is 'shallow' in that no attempt is made to fully analyse unbounded dependencies. However, the distinction between arguments and adjuncts is expressed, following X-bar theory, by Chomsky-adjunction of adjuncts to maximal projections (XP → XP Adjunct) as opposed to 'government' of arguments (i.e. arguments are sisters within X1 projections;</Paragraph>
      <Paragraph position="3"> analyses are rooted (in S) so the grammar assigns global, shallow and often 'spurious' analyses to many sentences. Currently, the coverage of this grammar--the proportion of sentences for which at least one analysis is found--is 79% when applied to the SUSANNE corpus, a 138K word treebanked and balanced subset of the Brown corpus.</Paragraph>
      <Paragraph position="4"> Inui et al. (1997) have recently proposed a novel model for probabilistic LR parsing which they justify as theoretically more consistent and principled than the Briscoe &amp; Carroll (1993) model. We use this new model since we have found that it indeed also improves disambiguation accuracy.</Paragraph>
      <Paragraph position="5"> The 500-sentence test corpus consists only of in-coverage sentences, and contains a mix of written genres: news reportage (general and sports), belles lettres, biography, memoirs, and scientific writing. The mean sentence length is 19.3 words (including punctuation tokens).</Paragraph>
    </Section>
    <Section position="2" start_page="119" end_page="121" type="sub_section">
      <SectionTitle>
3.2 Incorporating Acquired
Subcategorisation Information
</SectionTitle>
      <Paragraph position="0"> The test corpus contains a total of 485 distinct verb lemmas. We ran the Briscoe &amp; Carroll (1997) subcategorisation acquisition system on the first 10 million words of the BNC, for each of these verbs saving the first 1000 cases in which a possible instance of a subcategorisation frame</Paragraph>
      <Paragraph position="2"> Table 1: VSUBCAT values in the grammar.</Paragraph>
      <Paragraph position="3"> was identified. For each verb the acquisition system hypothesised a set of lexical entries corresponding to frames for which it found enough evidence. Over the complete set of verbs we ended up with a total of 5228 entries, each with an associated frequency normalised with respect to the total number of frames for all hypothesised entries for the particular verb.</Paragraph>
      <Paragraph position="4"> In the experiment each acquired lexical entry was assigned a probability based on its normalised frequency, with smoothing--to allow for unseen events--using the (comparatively crude) add-1 technique. We did not use the lexical entries themselves during parsing, since missing entries would have compromised coverage. Instead, we factored in their probabilities during parse ranking at the end of the parsing process.</Paragraph>
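To make this step concrete, the following is a minimal Python sketch (not the original implementation) of per-verb normalisation with add-1 smoothing; the frame labels and counts are invented for illustration, and the inventory size of 160 classes is the figure mentioned later in this section.

# Minimal sketch: per-verb normalisation of acquired frame frequencies with
# add-1 smoothing so that unseen frames receive a small non-zero probability.

def frame_probabilities(frame_counts, num_classes=160):
    """Return add-1 smoothed probabilities for one verb's acquired frames.

    frame_counts -- {frame_label: raw frequency} for a single verb
    num_classes  -- size of the frame inventory (160 classes are mentioned
                    later in this section); one extra count is reserved for
                    every possible frame
    """
    total = sum(frame_counts.values())
    denom = total + num_classes
    probs = {frame: (count + 1) / denom for frame, count in frame_counts.items()}
    unseen_prob = 1 / denom
    return probs, unseen_prob

# Hypothetical counts for a single verb, purely for illustration.
probs, unseen = frame_probabilities({"NP": 620, "NP_PP": 240, "SCOMP": 140})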
      <Paragraph position="5"> We ranked complete derivations based on the product of (1) the (purely structural) derivation probability according to the probabilistic LR model, and (2) for each verb instance in the derivation, the probability of the verbal lexical entry that would be used in the particular analysis context. The entry was located via the VSUBCAT value assigned to the verb in the analysis by the immediately dominating verbal phrase structure rule in the grammar: VSUBCAT values are also present in the lexical entries since they were acquired using the same grammar. Table 1 lists the VSUBCAT values. The values are mostly self-explanatory; however, examples of some of the less obvious ones are given in (1).</Paragraph>
      <Paragraph position="6"> (1) They made (NP_WHPP) a great fuss about what to do.</Paragraph>
      <Paragraph position="7"> They admitted (PP_COMP) to the authorities that they had entered illegally.</Paragraph>
      <Paragraph position="8"> It dawned (PP_WHS) on him what he should do.</Paragraph>
      <Paragraph position="9">  Some VSUBCAT values correspond to several of the 160 subcategorisation classes distinguished by the acquisition system. In these cases the sum of the probabilities of the corresponding entries was used. The finer distinctions stem from the use by the acquisition system of additional information about classes of specific prepositions, particles and other function words appearing within verbal frames. In this experiment we ignored these distinctions.</Paragraph>
      <Paragraph position="10"> In taking the product of the derivation and subcategorisation probabilities we have lost some of the properties of a statistical language model. The product is no longer strictly a probability, although we do not attempt to use it as such: we use it merely to rank competing analyses. Better integration of these two sets of probabilities is an area which requires further investigation.</Paragraph>
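As an illustration of the ranking just described, the sketch below combines the two probability sources; the data layout (a table of smoothed entry probabilities keyed by verb lemma and VSUBCAT value, plus a list of verb instances per derivation) is an assumption made for this example, not the system's actual representation.

# Sketch of the parse-ranking score: the structural LR derivation probability
# multiplied, for each verb instance, by the probability of the lexical entry
# picked out by the VSUBCAT value assigned in that analysis. Each entry
# probability is assumed to already sum over the acquisition system's finer
# subcategorisation classes that map onto the same VSUBCAT value.

def rank_score(derivation_prob, verb_instances, entry_probs, unseen_prob):
    """verb_instances: [(verb_lemma, vsubcat_value), ...]
    entry_probs:     {verb_lemma: {vsubcat_value: probability}}
    unseen_prob:     smoothed mass for entries not hypothesised in acquisition
    """
    score = derivation_prob
    for verb, vsubcat in verb_instances:
        score *= entry_probs.get(verb, {}).get(vsubcat, unseen_prob)
    # Not a true probability: used only to rank competing analyses.
    return score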
    </Section>
    <Section position="3" start_page="121" end_page="123" type="sub_section">
      <SectionTitle>
3.3 Quantitative Evaluation
3.3.1 Bracketing
</SectionTitle>
      <Paragraph position="0"> We evaluated parser accuracy on the unseen test corpus with respect to the phrasal bracketing annotation standard described by Carroll et al. (1997) rather than the original SUSANNE bracketings, since the analyses assigned by the grammar and by the corpus differ for many constructions.3 [Footnote 3: Our previous attempts to produce SUSANNE annotation scheme analyses were not entirely successful, since SUSANNE does not have an underlying grammar, or even a formal description of the possible bracketing configurations. Our evaluation results were often more sensitive to the exact mapping we used than to changes we made to the parsing system itself.] However, with the exception of SUSANNE 'verb groups' our annotation standard is bracket-consistent with the treebank analyses (i.e. no 'crossing brackets'). Table 2 shows the accuracy of the baseline parser with respect to (unlabelled) bracketings, and also the accuracy of this model when augmented with the extracted subcategorisation information. Briefly, the evaluation metrics compare unlabelled bracketings derived from the test treebank with those derived from parses, computing recall, the ratio of matched brackets over all brackets in the treebank; precision, the ratio of matched brackets over all brackets found by the parser; mean crossings, the number of times a bracketed sequence output by the parser overlaps with one from the treebank but neither is properly contained in the other, averaged over all sentences;</Paragraph>
      <Paragraph position="1">  and zero crossings, the percentage of sentences for which the analysis returned has zero crossings (see Grishman, Macleod &amp; Sterling, 1992).</Paragraph>
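For concreteness, the sketch below computes these four measures from bracketings represented as (start, end) token spans; it only illustrates the definitions above and is not the evaluation software used in the experiments.

# Unlabelled bracketing recall, precision, mean crossings and zero-crossings
# rate. Each sentence's bracketing is a list of (start, end) token spans.

def bracket_scores(gold_sents, test_sents):
    matched = gold_total = test_total = crossings = zero_cross = 0
    for gold, test in zip(gold_sents, test_sents):
        gold_total += len(gold)
        test_total += len(test)
        matched += sum(1 for bracket in test if bracket in gold)
        # a test bracket "crosses" if it overlaps a gold bracket without
        # either properly containing the other
        n_cross = sum(
            1 for (s, e) in test
            if any(s < gs < e < ge or gs < s < ge < e for (gs, ge) in gold)
        )
        crossings += n_cross
        zero_cross += (n_cross == 0)
    n = len(gold_sents)
    return {
        "recall": matched / gold_total,
        "precision": matched / test_total,
        "mean_crossings": crossings / n,
        "zero_crossings_%": 100 * zero_cross / n,
    }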
      <Paragraph position="2"> Since the test corpus contains only in-coverage sentences our results are relative to the 80% or so of sentences that can be parsed. In experiments measuring the coverage of our system (Carroll &amp; Briscoe, 1996), we found that the mean length of failing sentences was little different to that of successfully parsed ones. We would therefore argue that the remaining 20% of sentences are not significantly more complex, and that our results are not skewed by parse failures. Indeed, in these experiments a fair proportion of unsuccessfully parsed sentences were elliptical noun or prepositional phrases, fragments from dialogue and so forth, which we do not attempt to cover.</Paragraph>
      <Paragraph position="3"> On these measures, there is no significant difference between the baseline and lexicalised versions of the parser. In particular, the mean crossing rates per sentence are almost identical.</Paragraph>
      <Paragraph position="4"> This is in spite of the fact that the two versions return different highest-ranked analyses for 30% of the sentences in the test corpus. The reason for the similarity in scores appears to be that the annotation scheme and evaluation measures are relatively insensitive to argument/adjunct and attachment distinctions. For example, in the sentence (2) from the test corpus (2) Salem (AP) - the statewide meeting of war mothers Tuesday in Salem will hear a greeting from Gov. Mark Hatfield.</Paragraph>
      <Paragraph position="5"> the phrasal analyses returned by the baseline and lexicalised parsers are, respectively (3a) and (3b).</Paragraph>
      <Paragraph position="6">  (3) a ... (VP will hear (NP a greeting) (PP from (NP Gov. Mark Hatfield))) ...</Paragraph>
      <Paragraph position="7"> b ... (VP will hear (NP a greeting (PP from (NP Gov. Mark Hatfield)))) ...</Paragraph>
      <Paragraph position="8">  The latter is correct, but the former, incorrectly taking the PP to be an argument of the verb, is penalised only lightly by the evaluation measures: it has zero crossings, and 75% recall and precision. This type of annotation and evaluation scheme may be appropriate for a phrasal parser, such as the baseline version of the parser, which does not have the knowledge to resolve such ambiguities. Unfortunately, it masks differences between such a phrasal parser and one which can use lexical information to make informed decisions between complementation and modification possibilities.4 [Footnote 4: Shortcomings of this combination of annotation and evaluation scheme have been noted previously by Lin (1996), Carpenter &amp; Manning (1997) and others. Carroll, Briscoe &amp; Sanfilippo (1998) summarise the various criticisms that have been made.] We therefore also evaluated the baseline and lexicalised parser against the 500 test sentences marked up in accordance with a second, grammatical relation-based (GR) annotation scheme (described in detail by Carroll, Briscoe &amp; Sanfilippo, 1998).</Paragraph>
      <Paragraph position="9"> In general, grammatical relations (GRs) are viewed as specifying the syntactic dependency which holds between a head and a dependent.</Paragraph>
      <Paragraph position="10"> The set of GRs forms a hierarchy; the ones we are concerned with are shown in figure 1. Subj(ect) GRs divide into clausal (xsubj/csubj) and non-clausal (ncsubj) relations. Comp(lement) GRs divide into clausal, and into non-clausal direct object (dobj), second (non-clausal) complement in ditransitive constructions (obj2), and indirect object complement introduced by a preposition (iobj). In general the parser returns the most specific (leaf) relations in the GR hierarchy, except when it is unable to determine whether clausal subjects/objects are controlled from within or without (i.e. csubj vs. xsubj, and ccomp vs. xcomp respectively), in which case it returns the less specific relation.</Paragraph>
      <Paragraph position="11"> Each relation is parameterised with a head (lemma) and a dependent (lemma)--also optionally a type and/or specification of grammatical function. For example, the sentence (4a) would be marked up as in (4b).</Paragraph>
      <Paragraph position="12"> (4) a Paul intends to leave IBM.</Paragraph>
      <Paragraph position="14"> Carroll, Briscoe &amp; Sanfilippo (1998) justify this new evaluation annotation scheme and compare it with others (constituent- and dependency-based) that have been proposed in the literature. The relatively large size of the test corpus has meant that to date we have in some cases not distinguished between c/xsubj and between c/xcomp, and we have not marked up modification relations; we thus report evaluation with respect to argument relations only (but including the relation arg_mod--a semantic argument which is syntactically realised as a modifier, such as the passive 'by-phrase'). The mean number of GRs per sentence in the test corpus is 4.15.</Paragraph>
      <Paragraph position="15"> When computing matches between the GRs produced by the parser and those in the corpus annotation, we allow a single level of subsumption: a relation from the parser may be one level higher in the GR hierarchy than the actual correct relation. For example, if the parser returns clausal, this is taken to match both the more specific xcomp and ccomp. Also, an unspecified filler (_) for the type slot in the iobj and clausal relations successfully matches any actual specified filler. The head slot fillers are in all cases the base forms of single head words, so, for example, 'multi-component' heads, such as the names of people, places or organisations, are reduced to one word; thus the slot filler corresponding to Mr. Bill Clinton would be Clinton. For real-world applications this might not be the desired behaviour--one might instead want the token Mr._Bill_Clinton. This could be achieved by invoking a processing phase similar to the conventional 'named entity' identification task in information extraction.</Paragraph>
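The matching rules just described can be sketched as follows; the parent links encode the hierarchy fragment described above, and the four-slot tuple layout (relation, head, dependent, type) is a simplification assumed for illustration.

# Parent links for the fragment of the GR hierarchy discussed above.
GR_PARENT = {
    "ncsubj": "subj", "xsubj": "subj", "csubj": "subj",
    "dobj": "comp", "obj2": "comp", "iobj": "comp", "clausal": "comp",
    "xcomp": "clausal", "ccomp": "clausal",
}

def gr_matches(parser_gr, gold_gr):
    """A parser GR matches a gold GR if the relation is identical or one level
    higher in the hierarchy, heads and dependents (base forms of single head
    words) agree, and an unspecified type slot '_' matches any gold filler."""
    p_rel, p_head, p_dep, p_type = parser_gr
    g_rel, g_head, g_dep, g_type = gold_gr
    rel_ok = p_rel == g_rel or GR_PARENT.get(g_rel) == p_rel
    type_ok = p_type == "_" or p_type == g_type
    return rel_ok and p_head == g_head and p_dep == g_dep and type_ok

# e.g. a parser 'clausal' relation matches a gold 'xcomp' (hypothetical tuples):
assert gr_matches(("clausal", "intend", "leave", "_"),
                  ("xcomp", "intend", "leave", "to"))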
      <Paragraph position="16"> Considering the previous example (2), but this time with respect to GRs, the sets returned by the baseline and lexicalised parsers are (5a) and (5b), respectively.</Paragraph>
      <Paragraph position="17"> (5) a ncsubj(hear, meeting, _)
            dobj(hear, greeting, _)
            iobj(from, hear, Hatfield)
          b ncsubj(hear, meeting, _)
            dobj(hear, greeting, _)
The latter is correct, but the former, incorrectly taking the PP to be an argument of the verb hear, is penalised more heavily than in the bracketing annotation and evaluation schemes: it gets only 67% precision. There is also no misleadingly low crossing score, since there is no analogue to this in the GR scheme.</Paragraph>
      <Paragraph position="18"> Table 3 gives the result of evaluating the baseline and lexicalised versions of the parser on the GR annotation. The measures compare the set of GRs in the annotated test corpus with those returned by the parser, in terms of recall, the percentage of GRs correctly found by the parser out of all those in the treebank; and precision,</Paragraph>
      <Paragraph position="19"> the percentage of GRs returned by the parser that are actually correct. In the evaluation, GR recall of the lexicalised parser drops by 0.5% compared with the baseline, while precision increases by 9.0%. The drop in recall is not statistically significant at the 95% level (paired t-test, 1.46, 499 df, p &gt; 0.1), whereas the increase in precision is significant even at the 99.95% level (paired t-test, 5.14, 499 df, p &lt; 0.001).</Paragraph>
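The significance figures are from paired t-tests over per-sentence scores; a minimal sketch of such a test, using scipy as an assumed tool choice rather than necessarily what was used for the paper, is:

# Paired t-test over per-sentence evaluation scores (e.g. GR precision) for
# the baseline and lexicalised parsers on the same 500 test sentences.
from scipy import stats

def paired_test(baseline_scores, lexicalised_scores):
    result = stats.ttest_rel(lexicalised_scores, baseline_scores)
    return result.statistic, result.pvalue  # t value and two-tailed p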
      <Paragraph position="20"> Table 4 gives the number of each type of GR returned by the two models, compared with the correct numbers in the test corpus. The baseline parser returns a mean of 4.65 relations per sentence, whereas the lexicalised parser returns only 4.15, the same as the test corpus. This is further, indirect evidence that the lexicalised probabilistic system models the data more accurately.</Paragraph>
    </Section>
    <Section position="4" start_page="123" end_page="124" type="sub_section">
      <SectionTitle>
3.4 Discussion
</SectionTitle>
      <Paragraph position="0"> In addition to the quantitative analysis of parser accuracy reported above, we have also performed a qualitative analysis of the errors made.</Paragraph>
      <Paragraph position="1"> We looked at each of the errors made by the lexicalised version of the parser on the 500-sentence test corpus, and categorised them into errors concerning: complementation, modification, coordination, structural attachment of textual adjuncts, and phrase-internal misbracketing. Of course, multiple errors within a given sentence may interact, in the sense that one error may so disrupt the structure of an analysis that it necessarily leads to one or more other errors being made. In all cases, though, we considered all of the errors and did not attempt to determine whether or not one of them was the 'root cause'.</Paragraph>
      <Paragraph position="2"> Table 5 summarises the number of errors of each type over the test corpus.</Paragraph>
      <Paragraph position="3"> Typical examples of the five error types identified are:
complementation: ... decried the high rate of unemployment in the state misanalysed as decry followed by an NP and a PP complement;
modification: in ... surveillance of the pricing practices of the concessionaires for the purpose of keeping the prices reasonable, the PP modifier for the purpose of ... attached 'low' to concessionaires rather than 'high' to surveillance;
co-ordination: the NP priests, soldiers, and other members of the party misanalysed as just two conjuncts, with the first conjunct containing the first two words in apposition;
textual: in But you want a job guaranteed when you return, I continued my attack, the (textual) adjunct I ... attack attached to the VP guaranteed ... return rather than the S But ... return; and
misbracketing: Nowhere in Isfahan is this rich aesthetic life of the Persians ... had of misanalysed as a particle, with the Persians becoming a separate NP.</Paragraph>
      <Paragraph position="4"> There are no obvious trends within each type of error, although some particularly numerous sub-types can be identified. In 8 of the 30 cases of textual misanalysis, a sentential textual adjunct preceded by a comma was attached too low. The most common type of modification error was--in 20 of the 134 cases--misattachment of a PP modifier of N to a higher VP. The majority of the complementation errors were verbal, accounting for 115 of the total of 124. In 15 cases of incorrect verbal complementation a passive construction was incorrectly analysed as active, often with a following 'by' prepositional phrase erroneously taken to be a complement.</Paragraph>
      <Paragraph position="5"> Other shortcomings of the system were evident in the treatment of co-ordinated verbal heads, and of phrasal verbs. The grammatical relation extraction module is currently unable to return GRs in which the verbal head alone appears in the sentence as a conjunct--as in the VP ... to challenge and counter-challenge the authentication. This can be remedied fairly easily. Phrasal verbs, such as to consist of, are identified as such by the subcategorisation acquisition system. The grammar used by the shallow parser analyses phrasal verbs in two stages: firstly the verb itself and the following particle are combined to form a sub-constituent, and then phrasal complements are attached. The simple mapping from VSUBCAT values to subcategorisation classes cannot cope with the second level of embedding of phrasal verbs, so these verbs do not pick up any lexical information at parse time.</Paragraph>
    </Section>
  </Section>
</Paper>