<?xml version="1.0" standalone="yes"?>
<Paper uid="J98-4004">
  <Title>PCFG Models of Linguistic Tree Representations</Title>
  <Section position="3" start_page="614" end_page="614" type="metho">
    <SectionTitle>
2. PCFG Models of Tree Structures
</SectionTitle>
    <Paragraph position="0"> The theory of PCFGs is described elsewhere (e.g., Charniak \[1993\]), so it is only summarized here. A PCFG is a CFG in which each production A ~ o~ in the grammar's set of productions P is associated with an emission probability P(A ~ o~) that satisfies a normalization constraint</Paragraph>
    <Paragraph position="2"> and a consistency or tightness constraint not discussed here, that PCFGs estimated from tree banks using the relative frequency estimator always satisfy (Chi and Geman 1998).</Paragraph>
    <Paragraph position="3"> A PCFG defines a probability distribution over the (finite) parse trees generated by the grammar, where the probability of a tree T is given by</Paragraph>
    <Paragraph position="5"> where Cr(A ~ o~) is the number of times the production A ~ oL is used in the derivation T.</Paragraph>
    <Paragraph position="6"> The PCFG that assigns maximum likelihood to the sequence ~ of trees in a treebank corpus is given by the relative frequency estimator.</Paragraph>
    <Paragraph position="8"> Here C~ (A ~ o~) is the number of times the production A --~ oz is used in derivations of the trees in ~.</Paragraph>
    <Paragraph position="9"> This estimation procedure can be used in a broad-coverage parsing procedure as follows: A PCFG G is estimated from a treebank corpus ~ of training data. In the work presented here the actual lexical items (words) are ignored, and the terminals of the trees are taken to be the part-of-speech (POS) tags assigned to the lexical items. Given a sequence of POS tags to be analyzed, a dynamic programming method based on the CKY algorithm (Aho and Ullman 1972) is used to search for a maximum-likelihood parse using this PCFG.</Paragraph>
  </Section>
  <Section position="4" start_page="614" end_page="616" type="metho">
    <SectionTitle>
3. Tree Representations of Linguistic Constructions
</SectionTitle>
    <Paragraph position="0"> For something so apparently fundamental to syntactic research, there is considerable disagreement among linguists as to just what the right tree structure analysis of various linguistic constructions ought to be. Figure 1 shows some of the variation in PP modification structures postulated in generative syntactic approaches over the past 30 years.</Paragraph>
    <Paragraph position="1"> The flat attachment structure was popular in the early days of transformational grammar, and is used to represent VPs in the WSJ corpus. In this representation both  arguments and adjuncts are sisters to the lexical head, and so are not directly distinguished in the tree structure.</Paragraph>
    <Paragraph position="2"> The adjunction representation was introduced by Chomsky (it is often called &amp;quot;Chomsky adjunction'); in that representation arguments are sisters to the lexical head, while adjuncts are adjoined as sisters to a phrasal node: either a maximal projection (as shown in Figure 1) or a &amp;quot;l-bar&amp;quot; projection in the &amp;quot;X-bar&amp;quot; theory of grammar and its descendants.</Paragraph>
    <Paragraph position="3"> The third representation depicted in Figure 1 is a mixed representation in which phrases with adjuncts have exactly two levels of phrasal projection. The lower level contains the lexical head, and all adjuncts are attached as sisters to a maximal projection at the higher level. To a first approximation, this is the representation used for NPs with PP modifiers or complements in the WSJ corpus used in this study. 1 If the standard linguistic intuition that the number of PP modifiers permitted in natural language is unbounded is correct, then only the Chomsky adjunction representation trees can be generated by a CFG, as the other two representations depicted in Figure 1 require a different production for each possible number of PP modifiers. For example, the rule schema VP ~ V NP PP*, which generates the flat attachment structure, abbreviates an infinite number of CF productions.</Paragraph>
    <Paragraph position="4"> In addition, if a treebank using the two-level representation contains at least one node with a single PP modifier, then the PCFG induced from it will generate Chomsky adjunction representations of multiple PP modification, in addition to the two-level representations used in the treebank. (Note that this is not a criticism of the use of this representation in a treebank, but of modeling such a representation with a PCFG). This raises the question: how should a parse tree be interpreted that does not fit the representational scheme used to construct the treebank training data? 1 The Penn treebank annotation conventions are described in detail in Bies et al. (1995). The two-level representation arises from the conventions that &amp;quot;postmodifiers are Chomsky-adjoined to the phrase they modify&amp;quot; (11.2.1.1) and that &amp;quot;consecutive unrelated adjuncts are non-recursively attached to the NP they modify&amp;quot; (11.2.1.3.a) (parenthetical material identifies relevant subsections in Bies et al. \[1995\]). Arguments are not systematically distinguished from adjunct PPs, and &amp;quot;only clausal complements of NP are placed inside \[the innermost\] NP&amp;quot; as a sister of the head noun. However, because certain constructions are encoded recursively, such as appositives, emphatic reflexives, phrasal titles, etc., it is possible for NPs with more than two levels of structure to appear.</Paragraph>
  </Section>
  <Section position="5" start_page="616" end_page="623" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> As noted above, the WSJ corpus represents PP modification to NPs using the two-level representation. The PCFG estimated from sections 2-21 of this corpus contains the following two productions:</Paragraph>
    <Paragraph position="2"> These productions generate the two-level representations of one and two PP adjunctions to NP, as explained above. However, the second of these productions will never be used in a maximum-likelihood parse, as the parse of sequence NP PP PP involving two applications of the first rule has a higher estimated likelihood.</Paragraph>
    <Paragraph position="3"> In fact, all of the productions of the form NP ~ NP ppn where n &gt; 1 in the PCFG induced from sections 2-21 of the WSJ corpus are subsumed by the NP ~ NP PP production in this way. Thus PP adjunctions to NP in the maximum-likelihood parses using this PCFG always appear as Chomsky adjunctions, even though the original treebank uses a two-level representation! A large number of productions in the PCFG induced from sections 2-21 of the WSJ corpus are subsumed by higher-likelihood combinations of shorter, higher-probability productions. Of the 14,962 productions in the PCFG, 1,327 productions, or just under 9%, are subsumed by combinations of two or more productions. 2 Since the subsumed productions are never used to construct a maximum-likelihood parse, they can be ignored if only maximum-likelihood parses are required. Moreover, since these subsumed productions tend to be longer than the productions that subsume them, removing them from the grammar reduces the average parse time of the exhaustive PCFG parser used here by more than 9%.</Paragraph>
    <Paragraph position="4"> Finally, note that the overgeneration of the PCFG model of the two-level adjunction structures is due to an independence assumption implicit in the PCFG model; specifically, that the upper and lower NPs in the two-level structure have the same expansions, and that these expansions have the same distributions. This assumption is clearly incorrect for the two-level tree representations. If we systematically relabel one of these NPs with a fresh label, then a PCFG induced from the resulting transformed treebank no longer has this property. The &amp;quot;parent annotation&amp;quot; transform discussed below, which appends the category of a parent node onto the label of all of its nonterminal children as sketched in Figure 2, has just this effect. Charniak and Carroll (1994) describe this transformation as adding &amp;quot;pseudo context-sensitivity&amp;quot; to the language model because the distribution of expansions of a node depends on nonlocal context, viz,, the category of its parent. 3 This nonlocal information is sufficient to distinguish the upper and lower NPs in the structures considered here.</Paragraph>
    <Paragraph position="5"> Indeed, even though the PCFG estimated from the trees obtained by applying the &amp;quot;parent annotation&amp;quot; transformation to sections 2-21 of the WSJ corpus contains 22,773 productions (i.e., 7,811 more than the PCFG estimated from the untransformed corpus), only 965 of them, or just over 4%, are subsumed by two or more other productions.</Paragraph>
    <Paragraph position="6"> 2 These were found by parsing the right-hand side fl of each production A ~ fl with the treebank grammar: if a higher-likelihood derivation A --~+ fl can be found then the production is subsumed. As a CL reviewer points out, Krotov et al. (1997) investigate rule redundancy in CFGs estimated from treebanks. They discussed, but did not irwestigate, rule subsumption in treebank PCFGs.</Paragraph>
    <Paragraph position="7">  can generate Chomsky adjunction structures because it contains the production NP--~NP PP, the PCFG induced from tree (b) can only generate two-level NPs.</Paragraph>
    <Paragraph position="8"> 4. A Theoretical Investigation of Alternative Tree Structures  We can gain some theoretical insight into the effect that different tree representations have on PCFG language models by considering several artifical corpora whose estimated PCFGs are simple enough to study analytically. PP attachment was chosen for investigation here because the alternative structures are simple and clear, but presumably the same points could be made for any construction that has several alternative tree representations. Correctly resolving PP attachment ambiguities requires information, such as lexical information (Hindle and Rooth 1993), that is simply not available to the PCFG models considered here. Still, one might hope that a PCFG model might be able to accurately reflect general statistical trends concerning attachment preferences in the training data, even if it lacks the information to correctly resolve individual cases. But as the analysis in this section makes clear, even this is not always obtained. For example, suppose our corpora only contain two trees, both of which have yields V Det N P Det N, are always analyzed as a VP with a direct object NP and a PP, and differ only as to whether the PP modifies the NP or the VP. The corpora differ as to how these modifications are represented as trees. The dependencies in these corpora (specifically, the fact that the PP is either attached to the NP or to the VP) violate the independence assumptions implicit in a PCFG model, so one should not expect a PCFG model to exactly reproduce any of these corpora. As a CL reviewer points out, the results presented here depend on the assumption that there is exactly one PP. Nevertheless, the analysis of these corpora highlights two important points: * the choice of tree representation can have a noticable effect on the performance of a PCFG language model, and * the accuracy of a PCFG model can depend not just on the trees being modeled, but on their frequency.</Paragraph>
    <Section position="1" start_page="617" end_page="619" type="sub_section">
      <SectionTitle>
4.1 The Penn II Representations
</SectionTitle>
      <Paragraph position="0"> Suppose we train a PCFG on a corpus ~1 consisting only of two different tree structures: the NP attachment structure labeled (A1) and the VP attachment tree labeled (B0 depicted in Figure 3. These trees are called the &amp;quot;Penn II&amp;quot; tree representations here because these are the representations used to encode PP modification in version II of the WSJ corpus constructed at the University of Pennsylvania. Suppose that (A0  The training corpus ~1. This corpus, which uses Penn II tree representations, consists of the trees (A1) with relative frequencyf and the trees (B1) with relative frequency 1 -f. The PCFG f~l is estimated from this corpus.</Paragraph>
      <Paragraph position="1"> occurs in the corpus with relative frequencyf and (B1) occurs with relative frequency 1-f.</Paragraph>
      <Paragraph position="2"> In fact, in the WSJ corpus, structure (A1) occurs 7,033 times in sections 2-21 and 279 times in section 22, while structure (B1) occurs 7,717 times in sections 2-21 and 299 times in section 22. Thus f ~ 0.48 in both the F2-21 subcorpora and the F22 corpus.</Paragraph>
      <Paragraph position="3"> Returning to the theoretical analysis, the relative frequency counts C1 and the nonunit production probability estimates P1 for the PCFG induced from this two-tree corpus are as follows:</Paragraph>
      <Paragraph position="5"> Of course, in a real treebank the counts of all these productions would also include their occurrences in other constructions, so the theoretical analysis presented here is but a crude idealization. Empirical studies using actual corpus data are presented in Section 5.</Paragraph>
      <Paragraph position="6"> Thus the estimated likelihoods using P1 of the tree structures (A1) and (B1) are:</Paragraph>
      <Paragraph position="8"> Clearly Pl(al) &lt;f and \]51(B1) &lt; (1 -f) except at f= 0 andf = 1, so in general the estimated frequencies using \]~1 differ from the frequencies of (A~) and (B1) in the training corpus. This is not too surprising, as the PCFG Pl assigns nonzero probability to trees not in the training corpus (e.g., to trees with more than one PP).</Paragraph>
      <Paragraph position="9"> In any case, in the parsing applications mentioned earlier the absolute magnitude of the probability of a tree is not of direct interest; rather we are concerned with its probability relative to the probabilities of other, alternative tree structures for the same yield. Thus it is arguably more reasonable to ignore the &amp;quot;spurious&amp;quot; tree structures</Paragraph>
      <Paragraph position="11"> The estimated relative frequency f of NP attachment. This graph shows f as a function of the relative frequency f of NP attachment in the training data for various models discussed in the text.</Paragraph>
      <Paragraph position="12"> generated by Pl but not present in the training corpus, and compare the estimated relative frequencies of (A1) and (Ba) under Pl to their frequencies in the training data. Ideally the estimated relative frequency fl of (A1)</Paragraph>
      <Paragraph position="14"> will be close to its actual frequencyf in the training corpus. The relationship between f and fl is plotted in Figure 4. As inspection of Figure 4 makes clear, the value of fl can diverge substantially fromf. For example, atf = 0.48 (the estimate obtained from the WSJ corpus presented above)fl = 0.15. Thus a PCFG language model induced from the simple two-tree corpus above can underestimate the relative frequency of NP attachment by a factor of more than 3.</Paragraph>
    </Section>
    <Section position="2" start_page="619" end_page="620" type="sub_section">
      <SectionTitle>
4.2 Chomsky Adjunction Representations
</SectionTitle>
      <Paragraph position="0"> Now suppose that the corpus contains the following two trees (A2) and (B2) of Figure 5, which are the Chomsky adjunction representations of NP attached and VP attached PP's, respectively, with relative frequencies f and 1 -f as before. Note that unlike the Penn II representations, the Chomsky adjunction representation represents NP and VP modification by PPs symmetrically.</Paragraph>
      <Paragraph position="1">  The training corpus ~2. This two-tree corpus, which uses Chomsky adjuncfion tree representations, consists of the trees (A2) with relative frequencyf and the trees (B2) with relative frequency 1 -f. The PCFG P2 is estimated from this corpus.</Paragraph>
      <Paragraph position="2"> The counts C2 and the nonunit production probability estimates P2 for the PCFG induced from this two-tree corpus are as follows:</Paragraph>
      <Paragraph position="4"> As in the previous subsection, P2(A2) Kf and P2(B2) &lt; (1 -f) because the PCFG assigns nonzero probability to trees not in the training corpus. Again, we calculate the estimated relative frequencies of (A2) and (B2) under P2.</Paragraph>
      <Paragraph position="6"> The relationship between/and f2 is also plotted in Figure 4. The value off2 can diverge from f, although not as widely as fl. For example, atf = 0.48f2 = 0.36. Thus the precise tree structure representations used to train a PCFG can have a marked effect on the probability distribution that it generates.</Paragraph>
    </Section>
    <Section position="3" start_page="620" end_page="622" type="sub_section">
      <SectionTitle>
4.3 Flattened Tree Representations
</SectionTitle>
      <Paragraph position="0"> The previous subsection showed that inserting additional nodes into the tree structure can result in a PCFG language model that better models the distribution of trees in the training corpus. This subsection investigates the effect of removing the lower NP node in the WSJ NP modification structure, again resulting in a pair of more symmetric tree  The training corpus ~3. The NP modification tree representation used in the Penn II WSJ corpus is &amp;quot;flattened&amp;quot; to make it similar to the VP modification representation. The PCFG P3 is estimated from this corpus.</Paragraph>
      <Paragraph position="1"> structures, as shown in Figure 6. As explained in Section 1, flattening the tree structures in general corresponds to weakening the independence assumptions in the induced PCFG models, so one might expect this to improve the induced language model.</Paragraph>
      <Paragraph position="2"> The counts C3 and the nonunit production probability estimates P3 for the PCFG induced from this two-tree corpus are as follows:  The relationship between f and f3 is also plotted in Figure 4. The value of f3 diverges from f, as before: atf = 0.48f3 = 0.23. As Figure 4 shows, the estimated relative frequency f3 using the flattened tree representations is always closer to f than the estimated relative frequency fl using the Penn II representations, but is only closer to f than the estimated relative frequency f2 using the Chomsky adjunction representations for f greater than approximately 0.7.</Paragraph>
      <Paragraph position="3">  The training corpus ~4. This corpus, which uses Penn II treebank tree representations in which each preterminal node's parent's category is appended onto its own label, consists of the trees (A4) with relative frequencyf and the trees (B4) with relative frequency 1 -f. The PCFG P4 is estimated from this corpus.</Paragraph>
    </Section>
    <Section position="4" start_page="622" end_page="623" type="sub_section">
      <SectionTitle>
4.4 Penn II Representations with Parent Annotation
</SectionTitle>
      <Paragraph position="0"> As mentioned in Section 1, another way of relaxing the independence assumptions implicit in a PCFG model is to systematically encode more information in node labels about their context. This subsection explores a particularly simple kind of contextual encoding: the label of the parent of each nonroot nonpreterminal node is appended to that node's label. The labels of the root node and the terminal and preterminal nodes are left unchanged.</Paragraph>
      <Paragraph position="1"> For example, assuming that the Penn II format trees (A1) and (B1) of Section 4.1 are immediately dominated by a node labeled S, this relabeling applied to those trees produces the trees (A4) and (B4) depicted in Figure 7.</Paragraph>
      <Paragraph position="2"> We can perform the same theoretical analysis on this two-tree corpus that we applied to the previous corpora to investigate the effect of this relabeling on the PCFG modeling of PP attachment structures.</Paragraph>
      <Paragraph position="3"> The counts C4 and the nonunit production probability estimates P4 for the PCFG induced from this two-tree corpus are as follows:</Paragraph>
      <Paragraph position="5"> As in the previous subsection P4(a4) &lt; f and P4(B4) &lt; (1 --f). Again, we calculate the estimated relative frequencies of (a4) and (B4) under P4.</Paragraph>
      <Paragraph position="7"> Computational Linguistics Volume 24, Number 4 The relationship between f and f4 is plotted in Figure 4. The value of f4 can diverge from f, just like the other estimates. For example, atf = 0.48 f4 = 0.46, which is closer to f than any of the other relative frequency estimates presented earlier. (However, forf less than approximatel^y 0.38, the relative fre^quency estimate using the Chomsky adjunction representations f2 is closer to f than f4). Thus as expected, increasing the context information in the form of an enriched node-labeling scheme can improve the performance of a PCFG language model.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="623" end_page="628" type="metho">
    <SectionTitle>
5. Empirical Investigation of Different Tree Representations
</SectionTitle>
    <Paragraph position="0"> The previous section presented theoretical evidence that varying the tree representations used to estimate a PCFG language model can have a noticeable impact on that model's performance. However, as anyone working with statistical language models knows, the actual performance of a language model on real language data can often differ dramatically from one's expectations, even when it has an apparently impeccable theoretical basis. For example, on the basis of the theoretical models presented in the last section (and, undoubtedly, a background in theoretical linguistics) I expected that PCFG models induced from Chomksy adjunction tree representations would perform better than models induced from the Penn II representations. However, as shown in this section, this is not the case, but some of the other tree representations investigated here induce PCFGs that do perform noticeably better than the Penn II representations.</Paragraph>
    <Paragraph position="1"> It is fairly straightforward to mechanically transform the Penn II tree representations in the WSJ corpus into something close to the alternative tree representations described above, although the diversity of local trees in the WSJ corpus makes this task more difficult. For example, what is the Chomsky adjunction representation of a VP with no apparent verbal head? In addition, the Chomsky adjunction representation requires argument PPs to be attached as sisters of the lexical head, while adjunct PPs are attached as sisters of a nonlexical projection. Argument PPs are not systematically distinguished from adjunct PPs in the Penn II tree representations, and reliably determining whether a particular PP is an argument or an adjunct is extremely difficult, even for trained linguists. Nevertheless, the tree transformations investigated below should give at least an initial idea as to the influence of different kinds of tree representation on the induced PCFG language models.</Paragraph>
    <Section position="1" start_page="623" end_page="625" type="sub_section">
      <SectionTitle>
5.1 The Tree Transformations
</SectionTitle>
      <Paragraph position="0"> The tree transformations investigated in this section are listed below. Each is given a short name, which is used to identify it in the rest of the paper. Designing the tree transformations is complicated by the fact that there are in general many different tree transformations that correctly transform the simple cases discussed in Section 4, but behave differently on more complex constructions that appear in the WSJ corpus.</Paragraph>
      <Paragraph position="1"> The actual transformations investigated here have the advantage of simplicity, but many other different transformations would correctly transform the trees discussed in Sections 3 and 4 and be just as linguistically plausible as the transforms below, yet would presumably induce PCFGs with very different properties.</Paragraph>
      <Paragraph position="2"> Id is an identity transformation, i.e., it does not modify the trees at all. This condition studies the behavior of the Penn II tree representation used in the WSJ corpus.</Paragraph>
      <Paragraph position="3"> NP-VP produces trees that represent PP modification of both NPs and VPs using Chomsky adjunction. The NP-VP transform is the result of exhaustively  are exhaustively reapplied to produce the Chomsky adjunction tree representations from Penn II tree representations in the NP-VP transformation. In the N'-V' transformation the boxed NP and VP nodes are relabeled with N' and V' respectively. In these schema a is a sequence of trees of length 1 or greater and fl is a sequence of trees of length 2 or greater.</Paragraph>
      <Paragraph position="5"> applying all four of the tree transforms depicted in Figure 8. The first and fourth transforms turn NP and VP nodes whose rightmost child is a PP into Chomsky adjunction structures, and the second and third transforms adjoin final PPs with a following comma punctuation into Chomsky adjunction structures. The constraints that a &gt; 1 and fl &gt; 2 ensures that these transforms will only apply a finite number of time to any given subtree.</Paragraph>
      <Paragraph position="6"> produces trees that represent PP modification of NPs and VPs with a Chomsky adjunction representation that uses an intermediate level of X t structure. This is the result of repeatedly applying the four transformations depicted in Figure 8 as in the NP-VP transform, with the modification that the new nonmaximal nodes are labeled N t or V' as appropriate (rather than NP or VP).</Paragraph>
      <Paragraph position="7"> produces trees in which NPs have a flatter structure than the two-level representation of NPs used in the Penn II treebank. Only subtrees consisting of a parent node labeled NP whose first child is also labeled NP are affected by this transformation. The effect of this transformation is to excise all the children nodes labeled NP from the tree, and to attach their children as direct descendants of the parent node, as depicted in the schema below.</Paragraph>
      <Paragraph position="8">  Computational Linguistics Volume 24, Number 4 Parent appends to each nonroot nonterminal node's label its parent's category. The effect of this transformation is to produce trees of the kind discussed in Section 4.4.</Paragraph>
    </Section>
    <Section position="2" start_page="625" end_page="627" type="sub_section">
      <SectionTitle>
5.2 Evaluation of Parse Trees
</SectionTitle>
      <Paragraph position="0"> It is straightforward to estimate PCFGs using the relative frequency estimator from the sequences of trees produced by applying these transforms to the WSJ corpus. We turn now to the question of evaluating the different PCFGs so obtained.</Paragraph>
      <Paragraph position="1"> None of the PCFGs induced from the various tree representations discussed here reliably identifies the correct tree representations on sentences from held-out data. It is standard to evaluate broad-coverage parsers using less-stringent criteria that measure how similiar the trees produced by the parser are to the &amp;quot;correct&amp;quot; analysis trees in a portion of the treebank held out for testing purposes. This study uses the 1,578 sentences in section 22 of the WSJ corpus of length 40 or less for this purpose.</Paragraph>
      <Paragraph position="2"> The labeled precision and recall figures are obtained by regarding the sequence of trees e produced by a parser as a multiset or bag E(e) of edges, i.e., triples IN, 1, r / where N is a nonterminal label and 1 and r are left and right string positions in yield of the entire corpus. (Root nodes and preterminal nodes are not included in these edge sets, as they are given as input to the parser). Relative to a test sequence of trees e r (here section 22 of the WSJ corpus) the labeled precision and recall of a sequence of trees e with the same yield as e t are calculated as follows, where the n operation denotes multiset intersection.</Paragraph>
      <Paragraph position="4"> Thus, precision is the fraction of edges in the tree sequence to be evaluated that also appear in the test tree sequence, and recall is the fraction of edges in the test tree sequence that also appear in tree sequence to be evaluated.</Paragraph>
      <Paragraph position="5"> It is straightforward to use the PCFG estimation techniques described in Section 2 to estimate PCFGs from the result of applying these transformations to sections 2-21 of the Penn II WSJ corpus. The resulting PCFGs can be used with a parser to obtain maximum-likelihood parse trees for the POS tag yields of the trees of the held-out test corpus (section 22 of the WSJ corpus). While the resulting parse trees can be compared to the trees in the test corpus using the precision and recall measures described above, the results would not be meaningful as the parse trees reflect a different tree representation to that used in the test corpus, and thus are not directly comparable with the test corpus trees. For example, the node labels used in the PCFG induced from trees produced by applying the parent transform are pairs of categories from the original Penn II WSJ tree bank, and so the labeled precision and recall measures obtained by comparing the parse trees obtained using this PCFG with the trees from the tree bank would be close to zero.</Paragraph>
      <Paragraph position="6"> One might try to overcome this by applying the same transformation to the test trees as was used to obtain the training trees for the PCFG, but then the resulting precision and recall measures would not be comparable across transformations. For example, as two different Penn II format trees may map to the same flattened tree, the flatten transformation is in general not invertible. Thus a parsing system that produces perfect flat tree representations provides less information than one that produces perfect Penn II tree representations, and one might expect that all else being equal, a  The tree transformation/detransformation process.</Paragraph>
      <Paragraph position="7"> Parsing system using flat representations will score higher (or at least differently) in terms of precision and recall than an equivalent one producing Penn II representations. The approach developed here overcomes this problem by applying an additional tree transformation step that converts the parse trees produced using the PCFG back to the Penn II tree representations, and compares these trees to the held-out test trees using the labeled precision and recall trees. This transformation/detransformation process is depicted in Figure 9. It has the virtue that all precision and recall measures involve trees using the Penn II tree representations, but it does involve an additional detransformation step.</Paragraph>
      <Paragraph position="8"> It is straightforward to define detransformers for all of the tree transformations described in this section except for the flattening transform. The difficulty in this case is that several different Penn II format trees may map onto the same flattened tree, as mentioned above. The detransformer for the flattening transform was obtained by recording for each distinct local tree in the flattened tree representation of the training corpus the various tree fragments in the Penn II format training corpus it could have been derived from. The detransformation of a flattened tree is effected by replacing each local tree in the parse tree with its most frequently occuring Penn II format fragment.</Paragraph>
      <Paragraph position="9"> This detransformation step is in principle an additional source of error, in that a parser could produce flawless parse trees in its particular tree representation, but the transformation to the corresponding Penn II tree representations might itself introduce errors. For example, it might be that several different Penn II tree representations can correspond to a single parse tree, as is the case with a parser producing flattened tree representations. To determine if detransformation can be done reliably, for each tree transformation, labeled precision and recall measures were calculated comparing the result of applying the transformation and the corresponding detransformation to the  Computational Linguistics Volume 24, Number 4 Table 1 The results of an empirical study of the effect of tree structure on PCFG models. Each column corresponds to a sequence of trees, either consisting of section 22 of the WSJ corpus or transforms of the maximum-likelihood parses of the yields of the section 22 subcorpus with respect to different PCFGs, as explained in the text. The first row reports the number of productions in these PCFGs, and the next two rows give the labeled precision and recall of these sequences of trees. The last four rows report the number of times particular kinds of subtrees appear in these sequences of trees, as explained in the text.</Paragraph>
      <Paragraph position="10">  test corpus trees with the original trees of the test corpus. In all cases except for the flattening transform these precision and recall measures were always greater than 99.5%, indicating that the transformation/detransformation process is quite reliable. For the flattening transform the measures were greater than 97.5%, suggesting that while the error introduced by this process is noticable, the transformation/detransformation process does not introduce a very large error on its own.</Paragraph>
    </Section>
    <Section position="3" start_page="627" end_page="628" type="sub_section">
      <SectionTitle>
5.3 Results
</SectionTitle>
      <Paragraph position="0"> Table 1 presents an analysis of the sequences of trees produced via this detransformation process applied to the maximum-likelihood-parse trees. The columns of this table correspond to sequences of parse trees for section 22 of the WSJ corpus. The column labeled &amp;quot;22&amp;quot; describes the trees given in section 22 of the WSJ corpus, and the column labeled &amp;quot;22 Id&amp;quot; describes the maximum-likelihood-parse trees of section 22 of the WSJ corpus using the PCFG induced from those very trees. This is thus an example of training on the test data, and is often assumed to provide an upper bound on the performance of a learning algorithm. The remaining columns describe the sequences of trees produced using the transformation/detransformation process described above.</Paragraph>
      <Paragraph position="1"> The first three rows of the table show the number of productions in each PCFG (which is the number of distinct local trees in the corresponding transformed training corpus), and the labeled precision and recall measures for the detransformed parse trees.</Paragraph>
      <Paragraph position="2"> Randomization tests for paired sample data were performed to assess the significance of the difference between the labeled precision and recall scores for the output of the Id PCFG and the other PCFGs (Cohen 1995). The labeled precision and recall scores for the Flatten and Parent transforms differed significantly from each other and also from the Id transform at the 0.01 level, while neither the NP-VP nor the N'-V ~ transform differed significantly from each other or the Id transform at the 0.1 level.</Paragraph>
      <Paragraph position="3"> The remaining rows of Table 1 show the number of times certain tree schema appear in these (detransformed) tree sequences. The rows labeled NP attachments and VP attachments provide the number of times the following tree schema, which  The rows labeled NP* attachments and VP* attachments provide the number of times that the following more relaxed schema match the tree sequence. Here oL can be instantiated by any sequence of trees, and V can be instantiated by the same range of preterminal tags as above.</Paragraph>
    </Section>
    <Section position="4" start_page="628" end_page="628" type="sub_section">
      <SectionTitle>
5.4 Discussion
</SectionTitle>
      <Paragraph position="0"> As expected, the PCFGs induced from the output of the Flatten transform and Parent transform significantly improve precision and recall over the original treebank PCFG (i.e., the PCFG induced from the output of the Id transform). The PCFG induced from the output of the Parent transform performed significantly better than any other PCFG investigated here. As discussed above, both the Parent and the Flatten transforms induce PCFGs that are sensitive to what would be non-CF dependencies in the original treebank trees, which perhaps accounts for their superior performance. Both the Flatten and Parent transforms induced PCFGs that have substantially more productions than the original treebank grammar, perhaps reflecting the fact that they encode more contextual information than the original treebank grammar, albeit in different ways. Their superior performance suggests that the reduction in bias obtained by the weakening of independence assumptions that these transformations induce more than outweighs any associated increase in variance.</Paragraph>
      <Paragraph position="1"> The various adjunction transformations only had minimal effect on labeled precision and recall. Perhaps this is because PP attachment ambiguities, despite their important role in linguistic and parsing theory, are just one source of ambiguity among many in real language, and the effect of the alternative representations is only minor.</Paragraph>
      <Paragraph position="2"> Indeed, moving to the purportedly linguistically more realistic Chomsky adjunction representations did not improve performance on these measures. On reflection, perhaps this should not be surprising. The Chomsky adjunction representations are motivated within the theoretical framework of Transformational Grammar, which explicitly argues for nonlocal, indeed, non-context-free, dependencies. Thus its poor per-</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="628" end_page="629" type="metho">
    <SectionTitle>
4 The Penn II markup scheme permits a pseudo-attachment notation for indicating ambiguous
</SectionTitle>
    <Paragraph position="0"> attachment. However, this is only used relatively ilffrequently--the pseudo-attachment markup only appears 27 times in the entire Penn II treebank--and was ignored here. Pseudo-attachment structures count as VP attachment structures here.</Paragraph>
    <Paragraph position="1">  Computational Linguistics Volume 24, Number 4 formance when used as input to a statistical model that is insensitive to such dependencies is perhaps to be expected. Indeed, it might be the case that inserting the additional adjunction nodes inserted by the NP-VP and Nt-V ~ transformations above have the effect of converting a local dependency (which can be described by a PCFG) into a nonlocal dependency (which cannot).</Paragraph>
    <Paragraph position="2"> Another initially surprising property of the tree sequences produced by the PCFGs is that they do not reflect at all well the frequency of the different kinds of PP attachment found in the Penn II corpus. This is in fact to be expected, since the sequences consist of maximum-likelihood parses. To see this, consider any of the examples analyzed in Section 4. In all of these cases, the corpora contained two tree structures, and the induced PCFG associates each with an estimated likelihood. If these likelihoods differ, then a maximum-likelihood parser will always return the same maximum-likelihood tree structure each time it is presented with its yield, and will never return the tree structure with lower likelihood, even though the PCFG assigns it a nonzero likelihood.</Paragraph>
    <Paragraph position="3"> Thus the surprising fact is that these PCFG parsers ever produce a nonzero number of NP attachments and VP attachments in the same tree sequence. This is possible because the node label V in the attachment schema above abbreviates several different preterminal labels (i.e., the set of all verbal tags). Further investigation shows that once the V label in NP attachment and VP attachment schemas is instantiated with a particular verbal tag, only either the relevant NP attachment schema or the VP attachment schema appears in the tree sequence. For instance, in the Id tree sequence (i.e., produced by the standard tree bank grammar) the 67 NP attachments all occurred with the V label instantiated to the verbal tag AUX. 5 It is worth noting that the 8% improvement in average precision and recall obtained by the parent annotation transform is approximately half of the performance difference between a parser using a PCFG induced directly from the tree bank (i.e., using the Id transform above) and the best currently available broad-coverage parsing systems, which exploit lexical as well as purely syntactic information (Charniak 1997).</Paragraph>
    <Paragraph position="4"> In order to better understand just why the parent annotation transform performs so much better than the other transforms, transformation/detransformation experiments were performed in which the parent annotation transform was performed selectively either on all nodes with a given category label, or all nodes with a given category label and parent category label. Figure 10 depicts the effect of selective application of the parent annotation transform on the change of the average of precision and recall with respect to the Id transform. It is clear that distinguishing the context of NP and S nodes is responsible for an important part of the improvement in performance. Merely distinguishing root from nonroot S nodes--a distinction made in early transformational grammar but ignored in more recent work--improves average precision and recall by approximately 3%. Thus it is possible that the performance gains achieved by the parent annotation transform have little to do with PP attachment.</Paragraph>
  </Section>
class="xml-element"></Paper>