
<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1035">
  <Title>Inside-Outside Estimation of a Lexicalized PCFG for German</Title>
  <Section position="4" start_page="0" end_page="270" type="metho">
    <SectionTitle>
3 Grammar
</SectionTitle>
    <Paragraph position="0"> The grammar is a manually developed headed context-free phrase structure grammar for German subordinate clauses with 5508 rules and  analyze&gt; Deutsche i. deutsch'ADJ.Pos+NN.Fem.Akk.Sg 2. deutsch^ADJ.Pos+NN.Fem.Nom.Sg 3. deutsch^ADJ.Pos+NN.Masc.Nom. Sg. Sw 4. deutsch^ADJ.Pos+NN.Neut.Akk.Sg. Sw 5. deutsch^ADJ.Pos+NN.Neut.Nom. Sg.Sw 6. deutsch-ADJ.Pos+NN.NoGend.Akk.Pi.St 7. deutsch^ADJ.Pos+NN.NoGend.Nom.Pl.St 8. *deutsch+ADJ.Pos.Fem.Akk.Sg 9. *deutsch+ADJ.Pos.Fem.Nom.Sg i0. *deutsch+ADJ.Pos.Masc.Nom.Sg.Sw ii. *deutsch+ADJ.Pos.Neut.Akk.Sg.Sw 12. *deutsch+ADJ.Pos.Neut.Nom.Sg. Sw 13. *deutsch+ADJ.Pos.NoGend.Akk.Pi.St 14. *deutsch+ADJ.Pos.NoGend.Nom.Pl.St ==&gt; Deutsche { ADJ.E, NNADJ.E }</Paragraph>
    <Paragraph position="2"> 562 categories, 209 of which are terminal categories. The formalism is that of Carroll and Rooth (1998), henceforth C+R: mother -&gt; non-heads head' non-heads (freq) The rules are head marked with a prime. The non-head sequences may be empty, freq is a rule frequency, which is initialized randomly and subsequently estimated by the inside outsidealgorithm. To handle systematic patterns related to features, rules were generated by Lisp functions, rather than being written directly in the above form. With very few exceptions (rules for coordination, S-rule), the rules do not have more than two daughters.</Paragraph>
    <Paragraph position="3"> Grammar development is facilitated by a chart browser that permits a quick and efficient discovery of grammar bugs (Carroll, 1997a). Fig. 3 shows that the ambiguity in the chart is quite considerable even though grammar and corpus are restricted. For the entire corpus, we computed an average 9202 trees per clause. In the chart browser, the categories filling the cells indicate the most probable category for that span with their estimated frequencies. The pop-up window under IP presents the ranked list of all possible categories for the covered span. Rules (chart edges) with frequencies can be viewed with a further menu. In the chart browser, colors are used to display frequencies (between 0 and 1) estimated by the inside-outside algorithm. This allows properties shared across tree analyses to be checked at a glance; often grammar and estimation bugs can be detected without mouse operations.</Paragraph>
    <Paragraph position="4"> The grammar covers 88.5~o of the clauses and 87.9% of the tokens contained in the corpus.</Paragraph>
    <Paragraph position="5"> Parsing failures are mainly due to UNTAGGED words contained in 6.6% of the failed clauses, the pollution of the corpus by infinitival constructions (~1.3%), and a number of coordinations not covered by the grammar (~1.6%).</Paragraph>
    <Section position="1" start_page="269" end_page="270" type="sub_section">
      <SectionTitle>
3.1 Case features and agreement
</SectionTitle>
      <Paragraph position="0"> On nominal categories, in addition to the four cases Nom, Gen, Dat, and Akk, case features with a disjunctive interpretation (such as Dir for Nom or Akk) are used. The grammar is written in such a way that non-disjunctive features are introduced high up in the tree. This results in some reduction in the size of the parse forest, and some parameter pooling. Essentially the full range of agreement inside the noun phrase is enforced. Agreement between the nominative NP and the tensed verb (e.g. in number) is not enforced by the grammar, in order to control the number of parameters and rules.</Paragraph>
      <Paragraph position="1"> For noun phrases we employ Abney's chunk grammar organization (Abney, 1996). The noun chunk (NC) is an approximately non-recursive projection that excludes post-head complements and (adverbial) adjuncts introduced higher than pre-head modifiers and determiners but includes participial pre-modifiers with their complements. Since we perform complete context free parsing, parse forest construction, and inside-outside estimation, chunks are not motivated by deterministic parsing. Rather, they facilitate evaluation and graphical debugging, by tending to increase the span of constituents with high estimated frequency.</Paragraph>
      <Paragraph position="2">  class # frame types VPA.na.na VPA.na.na VPA 15 n, na, nad, nai, nap, nar, nd, ndi, ~ ndp, ndr, ni, nir, np, npr, nr / \ / \ NP.Nom VPA.na.a NP.Akk VPA.na.n VPP 13 d, di, dp, dr, i, ir, n, nd, ni, np, p, pr, r ~ /~ VPI 10 a, ad, ap, ar, d, dp, dr, p, pr, r NP.Akk VPA.na NP.Nom VPA.na VPK 2 i, n</Paragraph>
    </Section>
    <Section position="2" start_page="270" end_page="270" type="sub_section">
      <SectionTitle>
3.2 Subcategorisation frames of verbs
</SectionTitle>
      <Paragraph position="0"> The grammar distinguishes four subcategorisation frame classes: active (VPA), passive (VPP), infinitival (VPI) frames, and copula constructions (VPK). A frame may have maximally three arguments. Possible arguments in the frames are nominative (n), dative (d) and accusative (a) NPs, reflexive pronouns (r), PPs (p), and infinitival VPs (i). The grammar does not distinguish plain infinitival VPs from zu-infinitival VPs. The grammar is designed to partially distinguish different PP frames relative to the prepositional head of the PP. A distinct category for the specific preposition becomes visible only when a subcategorized preposition is cancelled from the subcat list. This means that specific prepositions do not figure in the evaluation discussed below.</Paragraph>
      <Paragraph position="1"> The number and the types of frames in the different frame classes are given in figure 4.</Paragraph>
      <Paragraph position="2"> German, being a language with comparatively free phrase order, allows for scrambling of arguments. Scrambling is reflected in the particular sequence in which the arguments of the verb frame are saturated. Compare figure 5 for an example of a canonical subject-object order in an active transitive frame and its scrambled objectsubject order. The possibility of scrambling verb arguments yields a substantial increase in the number of rules in the grammar (e.g. 102 combinatorically possible argument rules for all in VPA frames). Adverbs and non-subcategorized PPs are introduced as adjuncts to VP categories which do not saturate positions in the subcat frame.</Paragraph>
      <Paragraph position="3"> In earlier experiments, we employed a flat clausal structure, with rules for all permutations of complements. As the number of frames increased, this produced prohibitively many rules, particularly with the inclusion of adjuncts.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="270" end_page="271" type="metho">
    <SectionTitle>
4 Parameters
</SectionTitle>
    <Paragraph position="0"> The parameterization is as in C+R, with one significant modification. Parameters consist of (i) rule parameters, corresponding to right hand  sides conditioned by parent category and parent head; (ii) lexical choice parameters for non-head children, corresponding to child lemma conditioned by child category, parent category, and parent head lemma. See C+R or Charniak (1995) for an explanation of how such parame~ ters define a probabilistic weighting of trees. The change relative to C+R is that lexicalization is by uninflected lemma rather than word form.</Paragraph>
    <Paragraph position="1"> This reduces the number of lexical parameters, giving more acceptable model sizes and eliminating splitting of estimated frequencies among inflectional forms. Inflected forms are generated at the leaves of the tree, conditioned on terminal category and lemma. This results in a third family of parameters, though usually the choice of inflected form is deterministic.</Paragraph>
    <Paragraph position="2"> A parameter pooling feature is used for argument filling where all parent categories of the form VP.x.y are mapped to a category VP.x in defining lexical choice parameters. The consequence is e.g. that an accusative daughter of a nominative-accusative verb uses the same lexical choice parameter, whether a default or scrambled word order is used. (This feature was used by C/R for their phrase trigram grammar, not in the linguistic part of their grammar.) Not all desirable parameter pooling can be expressed in this way, though; for instance rule parameters are not pooled, and so get split when the parent category bears an inflectional feature.</Paragraph>
  </Section>
  <Section position="6" start_page="271" end_page="272" type="metho">
    <SectionTitle>
5 Estimation
</SectionTitle>
    <Paragraph position="0"> The training of our probabilistic CFG proceeds in three steps: (i) unlexicalized training with the supar parser, (ii) bootstrapping a lexicalized model from the trained unlexicalized one with the ultra parser, and finally (iii) lexicalized training with the hypar parser (Carroll, 1997b). Each of the three parsers uses the inside-outside algorithm, supar and ultra use an unlexicalized weighting of trees, while hypar uses a lexicalized weighting of trees, ultra and hypar both collect frequencies for lexicalized rule and lexical choice events, while supar collects only unlexicalized rule frequencies.</Paragraph>
    <Paragraph position="1"> Our experiments have shown that training an unlexicalized model first is worth the effort. Despite our use of a manually developed grammar that does not have to be pruned of superfluous rules like an automatically generated grammar,  on heldout data) (iteration: cross-entropy the lexicalized model is notably better when preceded by unlexicalized training (see also Ersan and Charniak (1995) for related observations). A comparison of immediate lexicalized training (without prior training of an unlexicalized model) and our standard training regime that involves preliminary unlexicalized training speaks in favor of our strategy (cf. the different 'lex 0' and 'lex 2' curves in figures 8 and 9). However, the amount of unlexicalized training has to be controlled in some way.</Paragraph>
    <Paragraph position="2"> A standard criterion to measure overtraining is to compare log-likelihood values on held-out data of subsequent iterations. While the log-likelihood value of the training data is theoretically guaranteed to converge through subsequent iterations, a decreasing log-likelihood value of the held-out data indicates overtraining. Instead of log-likelihood, we use the inversely proportional cross-entropy measure.</Paragraph>
    <Paragraph position="3"> Fig. 6 shows comparisons of different sizes of training and heldout data (training/heldout): (A) 50k/50k, (B) 500k/500k, (C) 4.1M/500k.</Paragraph>
    <Paragraph position="4"> The overtraining effect is indicated by the increase in cross-entropy from the penultimate to the ultimate iteration in the tables. Overtraining results for lexicalized models are not yet available. null However, a comparison of precision/recall measures on categories of different complexity through iterative unlexicalized training shows that the mathematical criterion for overtraining may lead to bad results from a linguistic point of view. While we observed more or less converging precision/recall measures for lower level structures such as noun chunks, iterative unlexicalized training up to the overtraining threshold turned out to be disastrous for the evaluation of complex categories that depend on almost the</Paragraph>
    <Paragraph position="6"/>
    <Paragraph position="8"> entire span of the clause. The recognition of sub-categorization frames through 60 iterations of unlexicalized training shows a massive decrease in precision/recall from the best to the last iteration, even dropping below the results with the randomly initialized grammar (see Fig. 9).</Paragraph>
    <Section position="1" start_page="272" end_page="272" type="sub_section">
      <SectionTitle>
5.1 Training regime
</SectionTitle>
      <Paragraph position="0"> We compared lexicalized training with respect to different starting points: a random unlexicalized model, the trained unlexicalized model with the best precision/recall results, and an unlexicalized model that comes close to the cross-entropy overtraining threshold. The details of the training steps are as follows:  (1) 0, 2 and 60 iterations of unlexicalized parsing with supar; (2) lexicalization with ultra using the entire corpus; (3) 23 iterations of lexicalized parsing with hypar.</Paragraph>
      <Paragraph position="1">  The training was done on four machines (two 167 MHz UltraSPARC and two 296 MHz SUNW UltraSPARC-II). Using the grammar described here, one iteration of supar on the entire corpus takes about 2.5 hours, lexicalization and generating an initial lexicalized model takes more than six hours, and an iteration of lexicalized parsing can be done in 5.5 hours.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML