<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2006"> <Title>Evaluating the Accuracy of an Unlexicalized Statistical Parser on the PARC DepBank</Title> <Section position="4" start_page="41" end_page="43" type="metho"> <SectionTitle> 2 Unlexicalized Statistical Parsing </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="41" end_page="41" type="sub_section"> <SectionTitle> 2.1 System Architecture </SectionTitle> <Paragraph position="0"> Both the XLE system and Collins' Model 3 pre-process textual input before parsing. Similarly, our baseline system consists of a pipeline of modules. First, text is tokenized using a deterministic finite-state transducer. Second, tokens are part-of-speech and punctuation (PoS) tagged using a first-order Hidden Markov Model (HMM) utilizing a lexicon of just over 50K words and an unknown word handling module. Third, deterministic morphological analysis is performed on each token-tag pair with a finite-state transducer. Fourth, the lattice of lemma-affix-tags is parsed using a grammar over such tags. Finally, the n-best parses are computed from the parse forest using a probabilistic parse selection model conditioned on the structural parse context. The output of the parser can be displayed as syntactic trees, and/or factored into a sequence of bilexical grammatical relations (GRs) between lexical heads and their dependents.</Paragraph> <Paragraph position="1"> The full system can be extended in a variety of ways - for example, by pruning PoS tags but allowing multiple tag possibilities per word as input to the parser, by incorporating lexical subcategorization into parse selection, by computing GR weights based on the proportion and probability of the n-best analyses yielding them, and so forth - broadly trading accuracy and greater domain-dependence against speed and reduced sensitivity to domain-specific lexical behaviour (Briscoe and Carroll, 2002; Carroll and Briscoe, 2002; Watson et al., 2005; Watson, 2006). However, in this paper we focus exclusively on the baseline unlexicalized system.</Paragraph> </Section> <Section position="2" start_page="41" end_page="42" type="sub_section"> <SectionTitle> 2.2 Grammar Development </SectionTitle> <Paragraph position="0"> The grammar is expressed in a feature-based, unification formalism. There are currently 676 phrase structure rule schemata, 15 feature propagation rules, 30 default feature value rules, 22 category expansion rules and 41 feature types which together define 1124 compiled phrase structure rules in which categories are represented as sets of features, that is, attribute-value pairs, possibly with variable values, possibly bound between mother and one or more daughter categories. 142 of the phrase structure schemata are manually identified as peripheral rather than core rules of English grammar. Categories are matched using fixed-arity term unification at parse time.</Paragraph>
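To make the matching step concrete, the following is a minimal sketch of fixed-arity term unification over attribute-value categories. The tuple representation, the Var class and all names are illustrative assumptions, not the system's actual implementation.

```python
# Illustrative sketch (not the system's implementation): fixed-arity term
# unification over categories represented as tuples of feature values.
# A Var instance stands for a variable value that may be shared (bound)
# between mother and daughter categories.

class Var:
    """An initially unbound feature value."""

def resolve(value, bindings):
    # Chase variable bindings until a concrete value or an unbound Var remains.
    while isinstance(value, Var) and value in bindings:
        value = bindings[value]
    return value

def unify(cat1, cat2, bindings):
    """Unify two fixed-arity categories; return extended bindings or None."""
    if len(cat1) != len(cat2):
        return None                      # arity mismatch: categories differ
    for v1, v2 in zip(cat1, cat2):
        v1, v2 = resolve(v1, bindings), resolve(v2, bindings)
        if v1 is v2:
            continue                     # same variable or identical value
        if isinstance(v1, Var):
            bindings[v1] = v2            # bind the variable
        elif isinstance(v2, Var):
            bindings[v2] = v1
        elif v1 != v2:
            return None                  # clash on concrete feature values
    return bindings

# A rule daughter V with an unbound VSUBCAT variable matches a lexical
# category whose VSUBCAT value is instantiated:
vsubcat = Var()
print(unify(("V", "NP_NP", "sg"), ("V", vsubcat, "sg"), {}))  # binds vsubcat
```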
<Paragraph position="1"> The lexical categories of the grammar consist of feature-based descriptions of the 149 PoS tags and 13 punctuation tags (a subset of the CLAWS tagset, see e.g. Sampson, 1995) which constitute the preterminals of the grammar. The number of distinct lexical categories associated with each preterminal varies from 1 for some function words through to around 35 as, for instance, tags for main verbs are associated with a VSUBCAT attribute taking 33 possible values. The grammar is designed to enumerate possible valencies for predicates by including separate rules for each pattern of possible complementation in English. The distinction between arguments and adjuncts is expressed by adjunction of adjuncts to maximal projections (XP → XP Adjunct), as opposed to government of arguments (i.e. arguments are sisters within X1 projections; X1 → X0 Arg1 ... ArgN).</Paragraph> <Paragraph position="4"> Each phrase structure schema is associated with one or more GR specifications which can be conditioned on feature values instantiated at parse time and which yield a rule-to-rule mapping from local trees to GRs. The set of GRs associated with a given derivation defines a connected, directed graph with individual nodes representing lemma-affix-tags and arcs representing named grammatical relations. The encoding of this mapping within the grammar is similar to that of F-structure mapping in LFG. However, the connected graph is not constructed, and completeness and coherence constraints are not used to filter the phrase structure derivation space.</Paragraph> <Paragraph position="5"> The grammar finds at least one parse rooted in the start category for 85% of the Susanne treebank, a 140K word balanced subset of the Brown Corpus, which we have used for development (Sampson, 1995). Much of the remaining data consists of phrasal fragments marked as independent text sentences, for example in dialogue. Grammatical coverage includes the majority of construction types of English; however, the handling of some unbounded dependency constructions, particularly comparatives and equatives, is limited because of the lack of fine-grained subcategorization information in the PoS tags and by the need to balance depth of analysis against the size of the derivation space. On the Susanne corpus, the geometric mean of the number of analyses for a sentence of length n is 1.31^n. The microaveraged F1-score for GR extraction on held-out data from Susanne is 76.5% (see section 4.2 for details of the evaluation scheme).</Paragraph> <Paragraph position="6"> The system has been used to analyse about 150 million words of English text drawn primarily from the PTB, TREC, BNC, and Reuters RCV1 datasets in connection with a variety of projects.</Paragraph> <Paragraph position="7"> The grammar and PoS tagger lexicon have been incrementally improved by manually examining cases of parse failure on these datasets. However, the effort invested amounts to a few days' effort for each new dataset, as opposed to the main grammar development effort, centred on Susanne, which has extended over some years and now amounts to about 2 years' effort (see Briscoe, 2006 for further details).</Paragraph> </Section> <Section position="3" start_page="42" end_page="43" type="sub_section"> <SectionTitle> 2.3 Parser </SectionTitle> <Paragraph position="0"> To build the parsing module, the unification grammar is automatically converted into an atomic-categoried context-free 'backbone', and a non-deterministic LALR(1) table is constructed from this, which is used to drive the parser. The residue of features not incorporated into the backbone is unified on each rule application (reduce action). In practice, the parser takes average time roughly quadratic in the length of the input to create a packed parse forest represented as a graph-structured stack.
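One plausible shape for such a packed forest is sketched below; this is a simplified, hypothetical data structure (the parser's actual representation is not specified here), intended only to show how packing supports dynamic programming over per-node probabilities.

```python
# Simplified, hypothetical sketch of a packed parse forest node: each node
# packs the alternative ways of deriving the same backbone category over
# the same span, so n-best search can operate on per-node probabilities.
from dataclasses import dataclass, field
from math import prod

@dataclass
class ForestNode:
    category: str            # atomic backbone category
    start: int               # span start (token index)
    end: int                 # span end
    # each packed alternative: (probability contributed by the LR actions
    #                           that built it, list of daughter ForestNodes)
    alternatives: list = field(default_factory=list)

    def best_prob(self) -> float:
        """Viterbi-style probability of the best derivation rooted here.
        (Memoization, needed for efficiency on a shared forest, is omitted.)"""
        if not self.alternatives:        # lexical leaf
            return 1.0
        return max(p * prod(d.best_prob() for d in ds)
                   for p, ds in self.alternatives)
```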
The statistical disambiguation phase is trained on Susanne treebank bracketings, producing a probabilistic generalized LALR(1) parser (e.g. Inui et al., 1997) which associates probabilities with alternative actions in the LR table. The parser is passed as input the sequence of most probable lemma-affix-tags found by the tagger. During parsing, probabilities are assigned to subanalyses based on the LR table actions that derived them. The n-best (i.e. most probable) parses are extracted by a dynamic programming procedure over subanalyses (represented by nodes in the parse forest). The search is efficient since probabilities are associated with single nodes in the parse forest and no weight function over ancestor or sibling nodes is needed. Probabilities capture structural context, since nodes in the parse forest partially encode a configuration of the graph-structured stack and lookahead symbol, so that, unlike a standard PCFG, the model discriminates between derivations which differ only in the order of application of the same rules, and also conditions rule application on the PoS tag of the lookahead token.</Paragraph> <Paragraph position="1"> When there is no parse rooted in the start category, the parser returns a connected sequence of partial parses which covers the input, based on subanalysis probability and a preference for longer and non-lexical subanalysis combinations (e.g. Kiefer et al., 1999). In these cases, the GR graph will not be fully connected.</Paragraph> </Section> <Section position="4" start_page="43" end_page="43" type="sub_section"> <SectionTitle> 2.4 Tuning and Training Method </SectionTitle> <Paragraph position="0"> The HMM tagger has been trained on 3M words of balanced text drawn from the LOB, BNC and Susanne corpora, which are available with hand-corrected CLAWS tags. The parser has been trained from 1.9K trees for sentences from Susanne that were interactively parsed to manually obtain the correct derivation, and also from 2.1K further sentences with unlabelled bracketings derived from the Susanne treebank. These bracketings guide the parser to one or possibly several closely-matching derivations, and these are used to derive probabilities for the LR table using (weighted) Laplace estimation. Actions in the table involving rules marked as peripheral are assigned a uniform low prior probability to ensure that derivations involving such rules are consistently ranked lower than those involving only core rules.</Paragraph> <Paragraph position="1"> To improve performance on WSJ text, we examined some parse failures from sections other than section 23 to identify patterns of consistent failure. We then manually modified and extended the grammar with a further 6 rules, mostly to handle cases of indirect and direct quotation that are very common in this dataset. This involved 3 days' work. Once completed, the parser was retrained on the original data. A subsequent limited inspection of top-ranked parses led us to disable 6 existing rules which applied too freely to the WSJ text; these were designed to analyse auxiliary ellipsis, which appears to be rare in this genre. We also catalogued incorrect PoS tags from WSJ parse failures and manually modified the tagger lexicon where appropriate. These modifications mostly consisted of adjusting lexical probabilities of extant entries with highly-skewed distributions. We also added some tags to extant entries for infrequent words. These modifications took a further day. The tag transition probabilities were not re-estimated.
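The estimation step described above can be sketched as follows, assuming simple add-one (Laplace) smoothing over per-state action counts and a fixed low prior for peripheral-rule actions; the exact weighting scheme used by the system is not specified here, and all names are assumptions.

```python
# Hypothetical sketch of Laplace estimation of LR-table action
# probabilities. counts[state][action] holds (possibly fractional) counts
# accumulated from derivations consistent with the treebank bracketings.
PERIPHERAL_PRIOR = 1e-6   # assumed uniform low prior for peripheral rules

def estimate_action_probs(counts, is_peripheral):
    probs = {}
    for state, actions in counts.items():
        core = [a for a in actions if not is_peripheral(a)]
        # add-one smoothing over the core actions available in this state
        total = sum(actions[a] for a in core) + len(core)
        probs[state] = {}
        for a in actions:
            if is_peripheral(a):
                # a fixed tiny probability (left unnormalized here) keeps
                # derivations using peripheral rules below core derivations
                probs[state][a] = PERIPHERAL_PRIOR
            else:
                probs[state][a] = (actions[a] + 1) / total
    return probs
```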
Thus, we have made no use of the PTB itself and only limited use of WSJ text.</Paragraph> <Paragraph position="2"> This method of grammar and lexicon development incrementally improves the overall performance of the system averaged across all the datasets that it has been applied to. It is very likely that retraining the PoS tagger on the WSJ and retraining the parser using the PTB would yield a system which would perform more effectively on DepBank. However, one of our goals is to demonstrate that an unlexicalized parser trained on a modest amount of annotated text from other sources, coupled to a tagger also trained on generic, balanced data, can perform competitively with systems which have been (almost) entirely developed and trained using the PTB, whether or not these systems deploy hand-crafted grammars or ones derived automatically from treebanks.</Paragraph> </Section> </Section> <Section position="5" start_page="43" end_page="44" type="metho"> <SectionTitle> 3 Extending and Validating DepBank </SectionTitle> <Paragraph position="0"> DepBank was constructed by parsing the selected section 23 WSJ sentences with the XLE system and outputting syntactic features and bilexical relations from the F-structure found by the parser.</Paragraph> <Paragraph position="1"> These features and relations were subsequently checked, corrected and extended interactively with the aid of software tools (King et al., 2003).</Paragraph> <Paragraph position="2"> The choice of relations and features is based quite closely on LFG and, in fact, overlaps substantially with the GR output of our parser. Figure 1 illustrates some DepBank annotations used in the experiment reported by Kaplan et al. and our hand-corrected GR output for the example Ten of the nation's governors meanwhile called on the justices to reject efforts to limit abortions. We have kept the GR representation simpler and more readable by suppressing lemmatization, token numbering and PoS tags, but have left the DepBank annotations unmodified.</Paragraph> <Paragraph position="3"> The example illustrates some differences between the schemes. For instance, the subj and ncsubj relations overlap as both annotations contain such a relation between call(ed) and Ten, but the GR annotation also includes this relation between limit and effort(s) and between reject and justice(s), while DepBank links these two verbs to a variable pro. This reflects a difference of philosophy about resolution of such 'understood' relations in different constructions. Viewed as output appropriate to specific applications, either approach is justifiable.</Paragraph> <Paragraph position="6"> However, for evaluation, these DepBank relations add little or no information not already specified by the xcomp relations in which these verbs also appear as dependents. On the other hand, DepBank includes an adjunct relation between meanwhile and call(ed), while the GR annotation treats meanwhile as a text adjunct (ta) of governors, delimited by balanced commas, following Nunberg's (1990) text grammar but conveying less information here.</Paragraph> <Paragraph position="7"> There are also issues of incompatible tokenization and lemmatization between the systems, and of differing syntactic annotation of similar information, which lead to problems mapping between our GR output and the current DepBank. Finally, differences in the linguistic intuitions of the annotators and errors of commission or omission on both sides can only be uncovered by manual comparison of output (e.g. xmod vs. xcomp for limit efforts above). Thus we reannotated the DepBank sentences with GRs using our current system, and then corrected and extended this annotation utilizing a software tool to highlight differences between the extant annotations and our own.2 This exercise, though time-consuming, uncovered problems in both annotations, and yields a doubly-annotated and potentially more valuable resource in which annotation disagreements over complex attachment decisions, for instance, can be inspected.</Paragraph> <Paragraph position="8"> The GR scheme includes one feature in DepBank (passive), several splits of relations in DepBank, such as adjunct, adds some of DepBank's featural information, such as subord form, as a subtype slot of a relation (ccomp), merges DepBank's oblique with iobj, and so forth. But it does not explicitly include all the features of DepBank, or even the reduced set of semantically-relevant features used in the experiments and evaluation reported in Kaplan et al. Most of these features can be computed from the full GR representation of bilexical relations between numbered lemma-affix-tags output by the parser. For instance, num features, such as the plurality of justices in the example, can be computed from the full det GR (det justice+s NN2:4 the AT:3) based on the CLAWS tag (NN2 indicating 'plural') selected for output. The few features that cannot be computed from GRs and CLAWS tags directly, such as stmt type, could be computed from the derivation tree.</Paragraph> </Section>
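As an illustration of deriving such features, the sketch below computes a num value from the full det GR cited above, assuming the CLAWS convention that NN1/NN2 distinguish singular from plural common nouns; the tuple encoding of the GR and all names are hypothetical.

```python
# Illustrative sketch: deriving a DepBank-style `num` feature from the full
# GR output, using the CLAWS tag on the head token (NN2 = plural common
# noun, NN1 = singular). The tuple encoding of the GR is an assumption.
def num_feature(gr):
    # gr like: ("det", "justice+s", "NN2:4", "the", "AT:3")
    relation, head, head_tag = gr[0], gr[1], gr[2]
    claws = head_tag.split(":")[0]          # strip the token number
    if claws.startswith("NN"):
        return (head, "pl" if claws.endswith("2") else "sg")
    return None                             # no num feature for this head

print(num_feature(("det", "justice+s", "NN2:4", "the", "AT:3")))
# -> ('justice+s', 'pl')
```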
<Section position="6" start_page="44" end_page="44" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="44" end_page="44" type="sub_section"> <SectionTitle> 4.1 Experimental Design </SectionTitle> <Paragraph position="0"> We selected the same 560 sentences as test data as Kaplan et al., and all modifications that we made to our system (see §2.4) were made on the basis of (very limited) information from other sections of WSJ text.3 We have made no use of the further 140 held-out sentences in DepBank. The results we report below are derived by choosing the most probable tag for each word returned by the PoS tagger and by choosing the unweighted GR set returned for the most probable parse, with no lexical information guiding parse ranking.</Paragraph> </Section> </Section> </Paper>