<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1206"> <Title>The effect of alternative tree representations on tree bank grammars</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Different tree structure </SectionTitle> <Paragraph position="0"> representations of adjunction There is considerable variation in the tree structures used in the linguistic literature to represent various linguistic constructions. In this paper we focus on variations in the representation of adjunction constructions, particularly PP adjunction, but similiar variation occurs in other constructions as well.</Paragraph> <Paragraph position="1"> Early analyses in transformational grammar typically adopted a 'flat' representation of adjunction structures in which adjuncts are represented as siblings of the phrasal head, as shown in Figure 1. This representation does not systematically distinguish between adjuncts and arguments, as both are attached as children of a single maximal projection.</Paragraph> <Paragraph position="2"> The Penn II tree bank represents PP adjunction to VP in this manner, presumably because it permits the annotators to avoid having to determine whether the PP in question is an adjunct or an argument.</Paragraph> <Paragraph position="3"> Because this representation attaches all of the adjuncts modifying the same phrase to the same node, distinct CFG productions are required for each possible number of adjuncts. Thus the set of all possible trees following this representation scheme can only be generated by a CFG if one imposes an upper bound on the number of PPs that can be adjoined to any one single phrase, but according to standard linguistic wisdom there is no natural bound on the number of PPs that may be adjoined to a single phrase.</Paragraph> <Paragraph position="4"> Johnson 40 The effect of alternative tree representations</Paragraph> <Paragraph position="6"> of a lexical head (in this case, the verb ate). The Penn II tree bank represents VP adjunction in this manner.</Paragraph> <Paragraph position="7"> Later transformational analyses adopted the more complex 'Chomsky adjunction' representation of adjunction structures for theory-internal reasons (e.g., it was a corollary of Emmonds' &quot;Structure Preserving Hypothesis&quot;).</Paragraph> <Paragraph position="8"> This representation provides an additional level of recursive phrasal structure for each adjunct, as depicted in Figure 2.</Paragraph> <Paragraph position="9"> Modern transformational grammar, following Chomsky's X I theory of phrase structure, represents adjunction with similiar recursive structures; the major difference being that the non-maximal phrasal nodes are given a new, distinct category label.</Paragraph> <Paragraph position="10"> Because the Chomsky adjunction structure and the X I theory based on it use a single rule to recursively adjoin an arbitrary number of adjuncts, the set of all tree structures required by this representation scheme can be generated by a CFG.</Paragraph> <Paragraph position="11"> The Penn II tree bank uses a mixed kind of representation for NP adjunction, involving two levels of phrasal structure irrespective of the number of adjuncts, as shown in Figure 3. This representation permits adjuncts to be systematically distinguished from arguments, although this does not seem to have been done systematically in the Penn II corpus. 1 Just as with the corpus are described in detail in (Bies et al., 1995). 
<Paragraph position="10"> 1 The annotation of the corpus is described in detail in (Bies et al., 1995). The mixed representation arises from the fact that &quot;postmodifiers are Chomsky-adjoined to the phrase they modify&quot; with the proviso that &quot;consecutive unrelated adjuncts are non-recursively attached to the NP they modify&quot;. However, because constructions such as appositives, emphatic reflexives and phrasal titles are associated with their own level of NP structure, it is possible for NPs with more than two levels of structure to appear.</Paragraph> <Paragraph position="11"> [Figure 4 caption (fragment): ... PCFG that generates the trees in Figure 3, yet it does not fit the general representational scheme for adjunction structures used in the Penn II tree bank.]</Paragraph> <Paragraph position="12"> Perhaps more seriously for PCFG modelling of such tree structures, a PCFG which can generate a nontrivial subset of such 'two level' NP tree structures will also generate tree structures which are not instances of this representational scheme. For example, the NP production needed to produce the leftmost tree in Figure 3 can apply recursively, generating an alternative tree structure for the yield of the rightmost tree of Figure 3, as shown in Figure 4. It is not clear what interpretation to give tree structures such as these, as they do not fit the chosen representational scheme for adjunction structures.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 PCFG models of PP adjunction </SectionTitle> <Paragraph position="0"> This section presents a theoretical investigation into the effect of different tree representations on the performance of PCFG models of PP adjunction. The analysis of four different models is presented here.</Paragraph> <Paragraph position="1"> [Figure 2 caption (fragment): ... the unique sibling of a phrasal node (in this case, VP). Chomsky's X' theory, used by modern transformational grammar, analyses adjunction in a structurally similar way, except that the non-maximal (in these examples, non-root) phrasal nodes are given a new category label. Figure 3 caption (fragment): ... attached as siblings of a single NP node.]</Paragraph> <Paragraph position="2"> Clearly actual tree bank data is far more complicated than the simple models investigated in this section, and the next section investigates the effects of different tree representations empirically by applying tree transformations to the Penn II tree bank representations. However, the theoretical models discussed in this section show clearly that the choice of tree representation can in principle affect the generalizations made by a PCFG.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 The Penn II tree bank representations </SectionTitle> <Paragraph position="0"> Suppose we train a PCFG on a corpus T1 consisting only of two different tree structures: the NP attachment structure labelled (A1) and the VP attachment tree labelled (B1). In the Penn II tree bank, structure (A1) occurs 7,033 times in the F2-21 subcorpora and 279 times in the F22 subcorpus, and structure (B1) occurs 7,717 times in the F2-21 subcorpora and 299 times in the F22 subcorpus. Thus the relative frequency f of NP attachment is approximately 0.48 in both the F2-21 subcorpora and the F22 subcorpus.</Paragraph>
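<Paragraph position="1"> The analysis carried out below (production counts, probability estimates, tree likelihoods, estimated relative frequencies) can be sketched in a few lines of code. The trees (A1) and (B1) themselves did not survive in this file, so the sketch assumes minimal Penn-style shapes, with a toy NP -> N base expansion; it illustrates the relative-frequency estimator, not the paper's exact figures.

from collections import Counter

# Assumed tree shapes (not recoverable from this file):
A1 = ("VP", "V", ("NP", ("NP", "N"), "PP"))   # NP attachment
B1 = ("VP", "V", ("NP", "N"), "PP")           # flat VP attachment

def rules(tree):
    # Yield the productions (lhs, rhs) used in a tuple-encoded tree.
    label, children = tree[0], tree[1:]
    yield (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

def induce_pcfg(weighted_trees):
    # Relative-frequency (maximum likelihood) estimation.
    counts, totals = Counter(), Counter()
    for tree, weight in weighted_trees:
        for lhs, rhs in rules(tree):
            counts[(lhs, rhs)] += weight
            totals[lhs] += weight
    return {r: c / totals[r[0]] for r, c in counts.items()}

def likelihood(tree, pcfg):
    p = 1.0
    for r in rules(tree):
        p *= pcfg[r]
    return p

f = 0.48                                # relative frequency of (A1)
P1 = induce_pcfg([(A1, f), (B1, 1.0 - f)])
pA, pB = likelihood(A1, P1), likelihood(B1, P1)
print(round(pA / (pA + pB), 2))         # estimated relative frequency: 0.23

Under these assumed trees the estimate works out to f^2, about 0.23 at f = 0.48, well below the training frequency; the exact value in the paper depends on the true shapes of (A1) and (B1).</Paragraph>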
<Paragraph position="1"> Returning to the theoretical analysis, the relative frequency counts C1 and the non-unit production probability estimates P1 for the PCFG induced from this two-tree corpus are as follows: [the table of counts and probability estimates did not survive in this file]. Of course, in an actual tree bank the counts of all these productions would also include their occurrences in other constructions, so the theoretical analysis presented here is a crude idealization.</Paragraph> <Paragraph position="2"> Thus the estimated likelihoods using P1 of the tree structures (A1) and (B1) are as follows: [the formulas did not survive in this file]. These satisfy P1(A1) < f and P1(B1) < (1 - f) except at f = 0 and f = 1, so in general the estimated frequencies using P1 differ from the frequencies of (A1) and (B1) in the training corpus.</Paragraph> <Paragraph position="3"> This is not too surprising, as the PCFG P1 assigns non-zero probability to trees not in the training corpus. For example, P1 assigns non-zero probability to the tree in Figure 4. We discuss the ramifications of this in section 6.</Paragraph> <Paragraph position="4"> In any case, in the parsing applications mentioned earlier the absolute magnitude of the probability of a tree is not of direct interest; rather we are concerned with its probability relative to the probabilities of other, alternative tree structures. Thus it is arguably more reasonable to ignore the &quot;spurious&quot; tree structures generated by P1 but not present in the training corpus, and compare the estimated relative frequencies of (A1) and (B1) under P1 to their frequencies in the training data.</Paragraph> <Paragraph position="5"> Ideally the estimated relative frequency f̂1 of (A1), namely P1(A1)/(P1(A1) + P1(B1)), will be close to its actual frequency f in the training corpus. The relationship between f and f̂1 is plotted in Figure 5. The value of f̂1 can diverge substantially from f.</Paragraph> <Paragraph position="6"> [Figure 5 caption: the estimated relative frequency of NP attachment using the PCFG models discussed in the text, as a function of the relative frequency f of NP attachment in the training data.]</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Chomsky adjunction representations </SectionTitle> <Paragraph position="0"> [The trees (A2) and (B2), their counts C2 and the production probability estimates P2 did not survive in this file.] As in the previous subsection, P2(A2) < f and P2(B2) < (1 - f) because the PCFG assigns non-zero probability to trees not in the training corpus. Again, we calculate the estimated relative frequencies of (A2) and (B2) under P2.</Paragraph> <Paragraph position="1"> The relationship between f and f̂2 is plotted in Figure 5. The value of f̂2 can diverge from f, although not as widely as f̂1. For example, at f = 0.48, f̂2 = 0.36. Thus the precise tree structure representations used to train a PCFG can have a marked effect on its performance.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Penn II representations with parent annotation </SectionTitle> <Paragraph position="0"> One of the weaknesses of a PCFG is that it is insensitive to non-local relationships between nodes. If these relationships are significant then a PCFG will be a poor language model. Indeed, the sense in which the set of trees generated by a CFG is &quot;context free&quot; is precisely that the label on a node completely characterizes the relationships between the subtree dominated by the node and the set of nodes that properly dominate this subtree. Thus one way of relaxing the independence assumptions implicit in a PCFG model is to systematically encode more information in node labels about their context.</Paragraph> <Paragraph position="1"> This subsection explores a particularly simple kind of contextual encoding: the label of the parent of each non-root, non-preterminal node is appended to that node's label. The labels of the root node and the terminal and preterminal nodes are left unchanged. For example, assuming that the Penn II format trees (A1) and (B1) of subsection 4.1 are immediately dominated by a node labelled S, this relabelling applied to those trees produces the trees (A3) and (B3) below. [The relabelled trees (A3) and (B3) did not survive in this file.]</Paragraph>
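<Paragraph position="2"> A sketch of this relabelling on the tuple-encoded trees used in the sketches above (the '^' separator is an assumption; the text says only that the parent's label is appended):

def parent_annotate(tree, parent=None):
    # Append the parent's label to each non-root, non-preterminal node.
    # Leaves (plain strings, standing in for terminals and preterminals)
    # and the root, which has no parent, are left unchanged.
    label, children = tree[0], tree[1:]
    new_label = label if parent is None else label + "^" + parent
    return (new_label,) + tuple(
        c if isinstance(c, str) else parent_annotate(c, label)
        for c in children)

# Wrapping the assumed (B1) in an S clause and relabelling:
print(parent_annotate(("S", ("VP", "V", ("NP", "N"), "PP"))))
# ('S', ('VP^S', 'V', ('NP^VP', 'N'), 'PP'))

The relabelled tree is what (B3) plausibly looks like under the assumed shapes; note that the annotation uses the parent's original label, not its annotated one.</Paragraph>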
<Paragraph position="3"> We can perform the same theoretical analysis on this two-tree corpus that we applied to the previous corpora to investigate the effect of this relabelling on the PCFG modelling of PP attachment structures. The counts C3 and the non-unit production probability estimates P3 for the PCFG induced from this two-tree corpus are as follows: [the table of counts and probability estimates did not survive in this file]. As in the previous subsections, P3(A3) < f and P3(B3) < (1 - f). Again, we calculate the estimated relative frequencies of (A3) and (B3) under P3.</Paragraph> <Paragraph position="4"> The relationship between f and f̂3 is plotted in Figure 5. The value of f̂3 can diverge from f, just like the other estimates. For example, at f = 0.48, f̂3 = 0.46. Thus, as expected, increased context information in the form of an enriched node labelling scheme can markedly change PCFG modelling performance.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="493" type="metho"> <SectionTitle> 5 Tree transformations </SectionTitle> <Paragraph position="0"> The last section presented simplified theoretical analyses of the effect of variation in tree representation and node labelling on PCFG modelling of PP attachment preferences. This section reports the results of an empirical investigation into the effect of changes in tree representation. These experiments were conducted by: (1) systematically transforming the trees in the training corpus F2-21 by applying a tree transform X, (2) inducing a PCFG G_X from the transformed F2-21 trees, (3) finding the maximum likelihood parses of the yield of each sentence in the F22 corpus with respect to the PCFG G_X, (4) applying the inverse transform X^-1 to these maximum likelihood parse trees to yield a sequence of 'detransformed' trees using (approximately) the same representational system as the tree bank itself, and (5) evaluating the detransformed trees with the standard labelled precision and recall measures.</Paragraph>
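<Paragraph position="1"> The five steps string together as in the following skeleton (a sketch under assumptions: the function names and the tuple-tree encoding are not the paper's, and the transform, estimator, parser and metric are passed in as functions):

def yield_of(tree):
    # Terminal yield of a (label, children...) tuple tree.
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1:] for w in yield_of(child)]

def run_experiment(train_trees, test_trees, transform, inverse_transform,
                   induce, parse, evaluate):
    transformed = [transform(t) for t in train_trees]           # step 1
    grammar = induce(transformed)                               # step 2
    parses = [parse(grammar, yield_of(t)) for t in test_trees]  # step 3
    detransformed = [inverse_transform(p) for p in parses]      # step 4
    return evaluate(detransformed, test_trees)                  # step 5

The identity transform gives the Id baseline described below; swapping in the parent annotation or the adjunction transforms gives the other columns of Table 1.</Paragraph>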
<Paragraph position="2"> Statistics were also collected on the properties of the grammar G_X and its detransformed maximum likelihood parses; the full results are presented in Table 1. The columns of that table correspond to different sequences of trees as follows.</Paragraph> <Paragraph position="3"> F22: the trees from the F22 subcorpus of the Penn II tree bank. F22 Id: the maximum likelihood parses of the yields of the F22 subcorpus using the PCFG estimated from the F22 subcorpus itself. Id: the maximum likelihood parses of the yields of the F22 subcorpus using the PCFG estimated from the F2-21 subcorpora (i.e., this corresponds to applying an identity transform). Parent: as above, except that the parent annotation transform described in subsection 4.3 was used in training and evaluation. VP: as in Id, except that the flat VP structures used in the Penn II tree bank were transformed into recursive Chomsky adjunction structures as described below. NP: as above, except that the one-level NP structures used in the Penn II tree bank were transformed into recursive Chomsky adjunction structures. VP-NP: as above, except that both NP and VP structures were transformed into recursive Chomsky adjunction structures.</Paragraph> <Paragraph position="4"> The F22 tree sequence column provides information on the distribution of subtrees in the test tree sequence itself. The F22 Id PCFG gives data on the case where the PCFG is trained on the same data that it is evaluated on, namely the F22 subcorpus. This column is included because it is often assumed that the performance of such a model is a reasonable upper bound on what can be expected from models induced from training data distinct from the test data. The remaining columns describe PCFGs induced from versions of the F2-21 subcorpora obtained by applying tree transformations in the manner described above.</Paragraph> <Paragraph position="5"> [Table 1 caption (fragment): ... column corresponds to the sequence of trees, either consisting of the F22 subcorpus or transforms of the maximum likelihood parses of the yields of the F22 subcorpus with respect to different PCFGs, as explained in the text. The first row reports the number of productions in these PCFGs, and the next two rows give the labelled precision and recall of these sequences of trees. The last four rows report the number of times particular kinds of subtrees appear in these sequences of trees, as explained in the text.]</Paragraph> <Paragraph position="6"> The VP transform is the result of exhaustively applying the two tree transforms shown below. The first transform converts VP expansions with final PPs into Chomsky adjunction structures, and the second transform adjoins final PPs with a following comma punctuation into Chomsky adjunction structures. In both cases it is required that the 'lowered' sequence of subtrees α be of length 2 or greater; this ensures that the transforms will only apply a finite number of times. These two rules have the effect of converting VP final PPs into Chomsky adjunction structures.</Paragraph> <Paragraph position="7"> [The tree schemata for these two transforms did not survive in this file.]</Paragraph>
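<Paragraph position="8"> A sketch of the first of these transforms on tuple-encoded trees (the VP and PP labels and the length condition follow the text; the encoding and the bottom-up application order are assumptions, and the comma variant would be a second clause of the same shape):

def label(t):
    # Node label; strings stand in for preterminals.
    return t if isinstance(t, str) else t[0]

def vp_transform(tree):
    # (VP alpha... PP) => (VP (VP alpha...) PP) whenever the final child
    # is a PP and the lowered sequence alpha has length 2 or greater.
    if isinstance(tree, str):
        return tree
    children = tuple(vp_transform(c) for c in tree[1:])
    if (tree[0] == "VP" and len(children) >= 3
            and label(children[-1]) == "PP"):
        inner = vp_transform(("VP",) + children[:-1])
        return ("VP", inner, children[-1])
    return (tree[0],) + children

print(vp_transform(("VP", "V", "NP", "PP", "PP")))
# ('VP', ('VP', ('VP', 'V', 'NP'), 'PP'), 'PP')

The length-2 condition is what guarantees termination: each application strictly shortens the lowered sequence, so in this example the recursion bottoms out at VP -> V NP.</Paragraph>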
<Paragraph position="9"> The NP transform is similar to the VP transform. It too is the result of exhaustively applying two tree transformation rules. These have the effect of converting NP final PPs into Chomsky adjunction structures. In this case, we require that α be of length 1 or greater.</Paragraph> <Paragraph position="10"> The NP-VP transform is the result of applying all four of the above tree transforms. The rows of Table 1 provide descriptions of these tree sequences (after 'untransformation', as described above) and, if appropriate, the PCFGs that generated them.</Paragraph> <Paragraph position="11"> The labelled precision and recall figures are obtained by regarding a sequence of trees as a multiset or bag of edges, i.e., triples (N, l, r) where N is a nonterminal label and l and r are left and right string positions in the yield of the entire corpus. (Root nodes and preterminal nodes are ignored in these edge sets, as they are given as input to the parser.) Relative to a 'test sequence' of trees (here the F22 subcorpus), the labelled precision and recall of a sequence of trees with the same yield are computed using multiset intersection: precision is the fraction of edges in the tree sequence to be evaluated which also appear in the test sequence, and recall is the fraction of edges in the test sequence which also appear in the sequence to be evaluated.</Paragraph> <Paragraph position="12"> The rows labelled NP attachments and VP attachments provide the number of times the tree schemata representing a single PP attachment (not reproduced in this file) match the tree sequence. 2 In these schemata, V can be instantiated by any of the verbal preterminal tags used in the Penn II corpus.</Paragraph> <Paragraph position="13"> The rows labelled NP* attachments and VP* attachments provide the number of times that more relaxed schemata match the tree sequence. Here α can be instantiated by any sequence of trees, and V can be instantiated by the same range of preterminal tags as above.</Paragraph>
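<Paragraph position="14"> The edge-multiset scoring just described can be sketched as follows (the tuple-tree encoding is assumed, and the sketch scores a single tree pair rather than accumulating bags over the yield of the whole corpus as the text specifies):

from collections import Counter

def edges(tree, start=0, is_root=True):
    # Multiset of labelled edges (label, left, right) in a tuple tree.
    # Leaves (strings, standing in for preterminals) contribute no edge,
    # and neither does the root, following the definition in the text.
    if isinstance(tree, str):
        return Counter(), start + 1
    bag, pos = Counter(), start
    for child in tree[1:]:
        sub, pos = edges(child, pos, False)
        bag += sub
    if not is_root:
        bag[(tree[0], start, pos)] += 1
    return bag, pos

def precision_recall(candidate, test):
    cbag, _ = edges(candidate)
    tbag, _ = edges(test)
    hits = sum(min(cbag[e], tbag[e]) for e in cbag)  # multiset intersection
    return hits / sum(cbag.values()), hits / sum(tbag.values())

gold = ("S", ("VP", "V", ("NP", ("NP", "N"), "PP")))  # NP attachment
guess = ("S", ("VP", "V", ("NP", "N"), "PP"))         # VP attachment
print(precision_recall(guess, gold))   # (1.0, 0.666...): one NP edge missed
</Paragraph>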
<Paragraph position="15"> As expected, the PCFG based on the Parent transformation, which copies the label of each parent node onto those of its children, outperforms all other PCFGs in terms of labelled precision and recall.</Paragraph> <Paragraph position="16"> 2 The Penn II tree bank provides a 'pseudo-attachment' notation for indicating ambiguous attachment. However, this is used relatively infrequently--the pseudo-attachment markup appears only 27 times in the entire Penn II tree bank--and was ignored here. Pseudo-attachment structures count as VP attachment structures here.</Paragraph> <Paragraph position="17"> The various adjunction transformations had only minimal effect on labelled precision and recall. Perhaps this is because PP attachment ambiguities, despite their important role in linguistic and parsing theory, are just one source of ambiguity among many in real language, so the alternative representations have only a minor effect.</Paragraph> <Paragraph position="18"> Indeed, in some cases moving to the purportedly linguistically more realistic Chomsky adjunction representations actually decreased performance on these measures. On reflection, perhaps this should not be surprising. The Chomsky adjunction representations are motivated within the theoretical framework of Transformational Grammar, which explicitly argues for nonlocal, indeed non context free, dependencies. Thus their poor performance when used as input to a statistical model which is insensitive to such dependencies is to be expected. Indeed, it might be the case that the additional adjunction nodes inserted by the tree transformations above have the effect of converting a local dependency (which can be described by a PCFG) into a nonlocal dependency (which cannot).</Paragraph> <Paragraph position="19"> Another initially surprising property of the tree sequences produced by the PCFGs is that they do not reflect at all well the frequency of the different kinds of PP attachment found in the Penn II corpus. This is in fact to be expected, since the sequences consist of maximum likelihood parses. To see this, consider any of the examples analysed in section 4. In all of these cases, the corpora contained two tree structures, and the induced PCFG associates each with an estimated likelihood. If these likelihoods differ, then a maximum likelihood parser will always return the same maximum likelihood tree structure each time it is presented with its yield, and will never return the tree structure with lower likelihood, even though the PCFG assigns it a nonzero likelihood.</Paragraph> <Paragraph position="20"> Thus the surprising fact is that these PCFG parsers ever produce a nonzero number of NP attachments and VP attachments in the same tree sequence. This is possible because the node label V in the attachment schemata above abbreviates several different preterminal labels (i.e., the set of all verbal tags). Further investigation shows that once the V label in the NP and VP attachment schemata is instantiated with a particular verbal tag, only either the relevant NP attachment schema or the VP attachment schema appears in the tree sequence. For instance, in the Id tree sequence (i.e., produced by the standard tree bank grammar) the 67 NP attachments all occurred with the V label instantiated to the verbal tag AUX. 3</Paragraph> <Paragraph position="21"> 3 This tag was introduced by (Charniak, 1996) to distinguish auxiliary verbs from main verbs.</Paragraph> </Section> <Section position="8" start_page="493" end_page="493" type="metho"> <SectionTitle> 6 Subsumed rules in tree bank grammars </SectionTitle> <Paragraph position="0"> It was mentioned in subsection 4.1 that it is possible for the PCFG induced from a tree bank to generate trees that are not meaningful representations with respect to the original tree bank representational scheme. The PCFG induced from the F2-21 subcorpus contains the two productions NP -> NP PP and NP -> NP PP PP. These productions generate the Penn II representations of one and two PP adjunctions to NP, as explained above. However, the second of these productions will never be used in a maximum likelihood parse, as the parse of the sequence NP PP PP involving two applications of the first rule has a higher estimated likelihood.</Paragraph> <Paragraph position="1"> In fact, all of the productions of the form NP -> NP PP^n with n > 1 in the PCFG induced from the F2-21 subcorpus are subsumed by the NP -> NP PP production in this way. Thus PP adjunction to NP in the maximum likelihood parses using this PCFG always appears as Chomsky adjunction, even though the original tree bank did not use this representational scheme for adjunction! Indeed, a large number of productions in the PCFG induced from the F2-21 subcorpus are subsumed in this way: of the 14,962 productions in the PCFG, 1,327, or just under 9%, are subsumed by combinations of two or more productions.</Paragraph>
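<Paragraph position="2"> A sketch of this subsumption check for the NP -> NP PP^n family (the probabilities below are illustrative assumptions, not the tree bank estimates): the flat rule is subsumed when n chained applications of NP -> NP PP are jointly more probable, in which case it can never win a maximum likelihood parse.

# Illustrative, hand-picked probabilities; not estimated from the tree bank.
pcfg = {
    ("NP", ("NP", "PP")): 0.20,
    ("NP", ("NP", "PP", "PP")): 0.03,
    ("NP", ("NP", "PP", "PP", "PP")): 0.005,
}

def flat_rule_subsumed(n):
    # Compare the flat NP -> NP PP^n rule with n chained applications
    # of NP -> NP PP, which span the same sequence NP PP ... PP.
    flat = pcfg[("NP", ("NP",) + ("PP",) * n)]
    chained = pcfg[("NP", ("NP", "PP"))] ** n
    return chained > flat

for n in (2, 3):
    print(n, flat_rule_subsumed(n))   # True, True under these numbers
</Paragraph>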
<Paragraph position="3"> Since these subsumed productions are never used to construct a maximum likelihood parse, they can be ignored if only maximum likelihood parses are required.</Paragraph> </Section> </Paper>