<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1044"> <Title>Parsing with Treebank Grammars: Empirical Bounds, Theoretical Models, and the Structure of the Penn Treebank</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Observed Performance </SectionTitle> <Paragraph position="0"> In this section, we outline the observed performance of the parser for various settings. We frequently speak in terms of the following: a span is a range of words in the chart, e.g., [1,3];4 an edge is a category over a span, e.g., NP:[1,3]; a traversal is a way of making an edge from an active and a passive edge, e.g., NP:[1,3] built by combining an active edge over [1,2] with a passive edge over [2,3].</Paragraph> <Paragraph position="2"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Time </SectionTitle> <Paragraph position="0"> The parser has an O(CNn^3) theoretical time bound, where n is the number of words in the sentence to be parsed, N is the number of nonterminal categories in the grammar, and C is the number of (active) states in the FSA encoding of the grammar. The time bound is derived from counting the number of traversals processed by the parser, each taking O(1) time.</Paragraph> <Paragraph position="1"> In figure 3, we see the average time5 taken per sentence length for several settings, with the empirical exponent (and correlation r-value) from the best-fit simple power law model to the right. Notice that most settings show time growth greater than O(n^3).</Paragraph> <Paragraph position="2"> Although O(n^3) is only an asymptotic bound, there are good explanations for the observed behavior. There are two primary causes for the super-cubic time values. The first is theoretically uninteresting.</Paragraph> <Paragraph position="3"> The parser is implemented in Java, which uses garbage collection for memory management.
Even when there is plenty of memory for a parse's primary data structures, &quot;garbage collection thrashing&quot; can occur when parsing longer sentences, as temporary objects cause increasingly frequent reclamation. To see past this effect, which inflates the empirical exponents, we turn to the actual traversal counts, which better illuminate the issues at hand. Figures 4 (a) and (b) show the traversal curves corresponding to the times in figure 3. 3 Another logical possibility would be trie encodings which compact the grammar states by common suffix rather than common prefix, as in (Leermakers, 1992). The savings are less than for prefix compaction.</Paragraph> <Paragraph position="4"> 4 Note that the number of words (or size) of a span is equal to the difference between the endpoints.</Paragraph> <Paragraph position="5"> 5 The hardware was a 700 MHz Intel Pentium III, and we used up to 2GB of RAM for very long sentences or very poor parameters. With good parameter settings, the system can parse 100+ word treebank sentences.</Paragraph> <Paragraph position="6"> [Figure 4 caption fragment: the number produced with a bottom-up strategy (shown for TRIE-NOTRANSFORM, others are similar).]</Paragraph>
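The empirical exponents and r-values reported alongside the curves come from fitting a simple power law t = a * n^b to the per-length averages. A minimal sketch of such a fit (log-log least squares; the data below is synthetic, not the paper's measurements):

```python
import math

def fit_power_law(ns, ts):
    """Least-squares fit of t = a * n**b in log-log space; returns (a, b)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in ts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    # Slope of the log-log regression line is the exponent b.
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

# Synthetic data drawn from an exact cubic: the fit recovers exponent 3.
ns = list(range(5, 60, 5))
a, b = fit_power_law(ns, [2e-4 * n ** 3 for n in ns])
```

On real timing data, the recovered exponent b is the number figure 3 reports next to each setting.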
However, unlike time and traversals, which in practice can diverge, memory requirements match the number of edges in the chart almost exactly, since the large data structures are all proportional in size to the number of edges E = O(Cn^2).6 Almost all edges stored are active edges (the overwhelming majority for sentences longer than 30 words), of which there can be O(Cn^2): one for every grammar state and span. Passive edges, of which there can be O(Nn^2), one for every category and span, are a shrinking minority. This is because, while N is bounded above by 27 in the treebank7 (for spans of size 2 or more), C numbers in the thousands (see figure 12). Thus, required memory will be implicitly modeled when we model active edges in section 4.3.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Tree Transforms </SectionTitle> <Paragraph position="0"> Figure 4 (a) shows the effect of the tree transforms on traversal counts. The NOUNARIES settings are much more efficient than the others; however, this efficiency comes at a price in terms of the utility of the final parse. For example, regardless of which NOUNARIES 6 A standard chart parser might conceivably require storing more than O(Cn^2) traversals on its agenda, but ours provably never does.</Paragraph> <Paragraph position="1"> 7 This count is the number of phrasal categories with the introduction of a TOP label for the unlabeled top treebank nodes.</Paragraph> <Paragraph position="2"> transform is chosen, there will be NP nodes missing from the parses, making the parses less useful for any task requiring NP identification.
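To make the transform trade-off concrete, here is a hypothetical sketch (not the paper's actual preprocessing code) of a NOUNARIES-style transform on trees represented as nested tuples; keep="high" and keep="low" mimic the choice, as in NOUNARIESHIGH and NOUNARIESLOW, of which label of a unary chain survives:

```python
def is_preterminal(t):
    """A (tag, word) pair; tag-over-word unaries are left alone here."""
    return not isinstance(t, str) and len(t) == 2 and isinstance(t[1], str)

def collapse_unaries(tree, keep="high"):
    """Collapse phrasal unary chains, keeping the top or bottom label."""
    if isinstance(tree, str) or is_preterminal(tree):
        return tree
    children = [collapse_unaries(c, keep) for c in tree[1:]]
    if len(children) == 1 and not is_preterminal(children[0]):
        child = children[0]
        kept = tree[0] if keep == "high" else child[0]
        return (kept,) + tuple(child[1:])
    return (tree[0],) + tuple(children)

# The S-over-NP-over-NX chain collapses to a single node either way,
# so the NP node is lost regardless of the keep direction.
t = ("S", ("NP", ("NX", ("NN", "trade"), ("NN", "figures"))))
```

Either setting deletes the intermediate NP, which is exactly the kind of loss described above for tasks requiring NP identification.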
For the remainder of the paper, we will focus on the settings NOTRANSFORM and NOEMPTIES.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Grammar Encodings </SectionTitle> <Paragraph position="0"> Figure 4 (b) shows the effect of each grammar encoding on traversal counts. The more compacted the grammar representation, the more time-efficient the parser is.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3.5 Top-Down vs. Bottom-Up </SectionTitle> <Paragraph position="0"> Figure 4 (c) shows the effect on total edges and traversals of using top-down and bottom-up strategies.</Paragraph> <Paragraph position="1"> There are some extremely minimal savings in traversals due to top-down filtering effects, but there is a corresponding penalty in edges, as rules whose left-corner cannot be built are introduced. Given the highly unrestrictive nature of the treebank grammar, it is not very surprising that top-down filtering provides so little benefit. However, this is a useful observation about real-world parsing performance. The advantages of top-down chart parsing in providing grammar-driven prediction are often advanced (e.g., Allen 1995:66), but in practice we find almost no value in this for broad-coverage CFGs. While some part of this is perhaps due to errors in the treebank, a large part just reflects the true nature of broad-coverage grammars: e.g., once you allow adverbial phrases almost anywhere and allow PPs, (participial) VPs, and (temporal) NPs to be adverbial phrases, along with phrases headed by adverbs, then there is very little useful top-down control left. With such a permissive grammar, the only real constraints are in the POS tags which anchor the local trees (see section 4.3).
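The bottom-up regime can be made concrete with a toy exhaustive chart recognizer (a simplified sketch, not the paper's Java parser: no empties, no unary rules) that counts the three quantities defined at the top of this section, passive edges, active edges, and traversals:

```python
from collections import defaultdict

def chart_stats(grammar, tags):
    """Exhaustive bottom-up chart recognition over a POS-tag sequence.
    grammar: (lhs, rhs_tuple) rules with len(rhs) >= 2; no empties/unaries.
    Returns (chart, passive_count, active_count, traversal_count)."""
    n = len(tags)
    passive = defaultdict(set)   # (i, j) -> categories found over [i, j)
    active = defaultdict(set)    # (i, j) -> dotted rules (lhs, rhs, dot)
    traversals = 0
    for i, t in enumerate(tags):
        passive[(i, i + 1)].add(t)
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # fundamental rule: active edge over [i, k) + passive over [k, j)
            for k in range(i + 1, j):
                for lhs, rhs, dot in list(active[(i, k)]):
                    for cat in list(passive[(k, j)]):
                        if rhs[dot] == cat:
                            traversals += 1
                            if dot + 1 == len(rhs):
                                passive[(i, j)].add(lhs)
                            else:
                                active[(i, j)].add((lhs, rhs, dot + 1))
            # bottom-up rule invocation: every passive edge seeds the
            # rules whose first child it could be
            for cat in list(passive[(i, j)]):
                for lhs, rhs in grammar:
                    if rhs[0] == cat:
                        active[(i, j)].add((lhs, rhs, 1))
    npass = sum(len(v) for v in passive.values())
    nact = sum(len(v) for v in active.values())
    return passive, npass, nact, traversals
```

For the toy grammar S -> NP VP, NP -> DT NN, VP -> VB NP over the tags DT NN VB DT NN, this finds an S over the whole span with 9 passive edges, 5 active edges, and 4 traversals; there is no top-down filtering anywhere, matching the strategy adopted below.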
Therefore, for the remainder of the paper, we consider only bottom-up settings.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Models </SectionTitle> <Paragraph position="0"> In the remainder of the paper we provide simple models that nevertheless accurately capture the varying magnitudes and exponents seen for different grammar encodings and tree transformations. Since the n^3 term of O(CNn^3) comes directly from the number of start, split, and end points for traversals, it is certainly not responsible for the varying growth rates. An initially plausible possibility is that the quantity bounded by the N term is non-constant in n in practice, because longer spans are more ambiguous in terms of the number of categories they can form. This turns out to be generally false, as discussed in section 4.2. Alternately, the effective C term could be growing with n, which turns out to be true, as discussed in section 4.3.</Paragraph> <Paragraph position="1"> The number of (possibly zero-size) spans for a sentence of length n is fixed: (n+1)(n+2)/2. Thus, to be able to evaluate and model the total edge counts, we look to the number of edges over a given span.</Paragraph> <Paragraph position="2"> Definition 1 The passive (or active) saturation of a given span is the number of passive (or active) edges over that span.</Paragraph> <Paragraph position="3"> In the total time and traversal bound O(CNn^3), the effective value of C is determined by the active saturation, while the effective value of N is determined by the passive saturation. An interesting fact is that the saturation of a span is, for the treebank grammar and sentences, essentially independent of what size sentence the span is from and where in the sentence the span begins.
Thus, for a given span size, we report the average over all spans of that size occurring anywhere in any sentence parsed.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Treebank Grammar Structure </SectionTitle> <Paragraph position="0"> The reason that effective growth is not found in the N component is that passive saturation stays almost constant as span size increases. However, the more interesting result is not that saturation is relatively constant (for spans beyond a small, grammar-dependent size), but that the saturation values are extremely large compared to N (see section 4.2). For the NOTRANSFORM and NOEMPTIES grammars, most categories are reachable from most other categories using rules which can be applied over a single span. Once you get one of these categories over a span, you will get the rest as well. We now formalize this.</Paragraph> <Paragraph position="1"> Definition 2 A category X is empty-reachable in a grammar G if X can be built using only empty terminals. The empty-reachable set for the NOTRANSFORM grammar is shown in figure 5.8 These 23 categories plus the tag -NONE- create a passive saturation of 24 for zero-spans for NOTRANSFORM (see figure 9).</Paragraph> <Paragraph position="2"> Definition 3 A category Y is same-span-reachable from a category X in a grammar G if Y can be built from X using a parse tree in which, aside from at most 8 The set of phrasal categories used in the Penn Treebank is documented in Manning and Schütze (1999, 413); Marcus et al.
(1993, 281) has an early version.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> ADJP ADVP FRAG INTJ NAC NP NX PP PRN QP RRC S SBAR SBARQ SINV SQ TOP UCP VP WHADVP WHNP WHPP X </SectionTitle> <Paragraph position="0"> one instance of X, every node not dominating that instance is an instance of an empty-reachable category.</Paragraph> <Paragraph position="1"> The same-span-reachability relation induces a graph over the 27 non-terminal categories. The strongly connected component (SCC) reduction of that graph is shown in figures 6 and 7.9 Unsurprisingly, the largest SCC, which contains most &quot;common&quot; categories (S, NP, VP, PP, etc.), is slightly larger for the NOTRANSFORM grammar, since the empty-reachable set is nonempty. However, note that even for NOTRANSFORM, the largest SCC is smaller than the empty-reachable set, since empties provide direct entry into some of the lower SCCs, in particular because of WH-gaps.</Paragraph> <Paragraph position="2"> Interestingly, this same high-reachability effect occurs even for the NOUNARIES grammars, as shown in the next section.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Passive Edges </SectionTitle> <Paragraph position="0"> The total growth and saturation of passive edges is relatively easy to describe. Figure 8 shows the total number of passive edges by sentence length, and figure 9 shows the saturation as a function of span size.10 The grammar representation does not affect which passive edges will occur for a given span.</Paragraph> <Paragraph position="1"> The large SCCs cause the relative independence of passive saturation from span size for the NOTRANSFORM and NOEMPTIES settings. Once any category in the SCC is found, all will be found, as well as all categories reachable from that SCC.
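The two definitions can be operationalized directly. Below is an illustrative sketch over a tiny invented grammar (not the treebank grammar): a fixpoint for empty-reachability, and a one-step same-span relation (a rule may be crossed when all sibling positions are empty-reachable) whose mutual-reachability classes are the SCCs:

```python
from collections import defaultdict

def empty_reachable(rules, empty_terminal="-NONE-"):
    """Fixpoint: categories buildable from empty terminals alone."""
    reach = {empty_terminal}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in reach and all(x in reach for x in rhs):
                reach.add(lhs)
                changed = True
    return reach - {empty_terminal}

def same_span_sccs(rules, empty):
    """SCCs of the one-step same-span relation X -> lhs, allowed when
    every sibling of X on the rule's RHS is empty-reachable."""
    step = defaultdict(set)
    for lhs, rhs in rules:
        for i, x in enumerate(rhs):
            if all(s in empty for k, s in enumerate(rhs) if k != i):
                step[x].add(lhs)
    def reachable_from(a):
        seen, stack = set(), [a]
        while stack:
            for v in step[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen
    nodes = {c for lhs, rhs in rules for c in (lhs,) + tuple(rhs)}
    r = {a: reachable_from(a) for a in nodes}
    return {frozenset({a} | {b for b in r[a] if a in r[b]}) for a in nodes}

# Toy grammar: NP is empty-reachable, which puts S and VP into one SCC.
rules = [("NP", ("-NONE-",)), ("S", ("NP", "VP")),
         ("VP", ("VB", "NP")), ("VP", ("S",))]
```

In this toy grammar the empty NP lets S and VP build each other over the same span, a miniature version of the large SCC containing the common treebank categories.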
For these settings, the passive saturation can be summarized by three saturation numbers: one for zero-spans (empties), one for one-spans, and one for all larger spans.</Paragraph> <Paragraph position="3"> Taking averages directly from the data, we have our first model, shown on the right in figure 9.</Paragraph> <Paragraph position="4"> For the NOUNARIES settings, there will be no same-span reachability and hence no SCCs. To reach a new category always requires the use of at least one overt word. However, for spans of size 6 or so, enough words exist that the same high saturation effect will still be observed. This can be modeled quite simply by assuming each terminal unlocks a fixed fraction of the nonterminals, as seen in the right graph of figure 9, but we omit the details here.</Paragraph> <Paragraph position="5"> Using these passive saturation models, we can directly estimate the total passive edge counts by summing the predicted saturation over the (n - k + 1) spans of each size k from 0 to n.</Paragraph> <Paragraph position="7"> 10 The maximum possible passive saturation for any span greater than one is equal to the number of phrasal categories in the treebank grammar: 27. However, empty and size-one spans can additionally be covered by POS tag edges.</Paragraph> <Paragraph position="8"> The predictions are shown in figure 8.</Paragraph> <Paragraph position="10"> We correctly predict that the passive edge total exponents will be slightly less than 2.0 when unaries are present, and greater than 2.0 when they are not. With unaries, the linear terms in the reduced equation are significant over these sentence lengths and drag down the exponent. The linear terms are larger for NOTRANSFORM and therefore drag the exponent down more.11 Without unaries, the more gradual saturation growth increases the total exponent, more so for NOUNARIESLOW than NOUNARIESHIGH.
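The summation estimate for total passive edges can be sketched as follows; the three-value saturation function below uses the zero-span value of 24 quoted for NOTRANSFORM, while the one-span and larger-span values are illustrative placeholders, not numbers from the paper:

```python
def psat(k, p_zero=24.0, p_one=30.0, p_big=28.0):
    """Three-value passive-saturation model: zero-spans, one-spans,
    and everything larger (p_one and p_big are made-up placeholders)."""
    if k == 0:
        return p_zero
    if k == 1:
        return p_one
    return p_big

def predicted_passive_total(n):
    """Sum the saturation over the (n - k + 1) spans of each size k."""
    return sum((n - k + 1) * psat(k) for k in range(n + 1))
```

Because psat is eventually constant, the total grows roughly like p_big * n^2 / 2, with the zero- and one-span terms contributing the linear addends that drag the best-fit exponent below 2.0.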
However, note that for spans around 8 and onward, the saturation curves are essentially constant for all settings.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Active Edges </SectionTitle> <Paragraph position="0"> Active edges are the vast majority of edges and essentially determine (non-transient) memory requirements.</Paragraph> <Paragraph position="1"> While passive counts depend only on the grammar transform, active counts depend primarily on the encoding for general magnitude, but also on the transform for the details (and exponent effects). Figure 10 shows the total active edges by sentence size for three settings chosen to illustrate the main effects. Total active growth is sub-quadratic for LIST, but has an exponent of up to about 2.4 for the TRIE settings.</Paragraph> <Paragraph position="2"> 11 Note that, over these values of n, even a basic quadratic function with substantial linear terms has a best-fit simple power curve exponent well below 2, for the same reason. Moreover, a curve with a higher best-fit exponent need never actually outgrow one with a lower best-fit exponent.</Paragraph> <Paragraph position="3"> [Figure 10 caption fragment: ... as predicted by our models (right).]</Paragraph> <Paragraph position="4"> To model the active totals, we again begin by modeling the active saturation curves, shown in figure 11. The active saturation for any span is bounded above by C, the number of active grammar states (states in the grammar FSAs which correspond to active edges). For list grammars, this number is the sum of the lengths of all rules in the grammar. For trie grammars, it is the number of unique rule prefixes (including the LHS) in the grammar. For minimized grammars, it is the number of states with outgoing transitions (non-black states in figure 2). The value of C is shown for each setting in figure 12.
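The state counts just described can be sketched as follows (a toy illustration; the dot-position convention, prefixes of length 0 through len(rhs) minus 1, is an assumption, not taken from the paper):

```python
def active_states_list(rules):
    """LIST encoding: every dotted position of every rule is a separate
    state, so the count is the sum of rule lengths."""
    return sum(len(rhs) for lhs, rhs in rules)

def active_states_trie(rules):
    """TRIE encoding: one state per unique (LHS, RHS-prefix) pair, so
    rules sharing a prefix share the corresponding states."""
    prefixes = set()
    for lhs, rhs in rules:
        for dot in range(len(rhs)):
            prefixes.add((lhs, rhs[:dot]))
    return len(prefixes)

# The two NP rules share the DT prefix, so the trie count is smaller.
rules = [("NP", ("DT", "NN")), ("NP", ("DT", "JJ", "NN")),
         ("VP", ("VB", "NP"))]
```

On a treebank-scale grammar, with many thousands of rules sharing short prefixes, the gap between the two counts is what separates the LIST and TRIE columns of figure 12.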
Note that the maximum number of active states is dramatically larger for lists, since common rule prefixes are duplicated many times. For minimized FSAs, the state reduction is even greater. Since states which are earlier in a rule are much more likely to match a span, the fact that tries (and min FSAs) compress early states is particularly advantageous.</Paragraph> <Paragraph position="5"> Unlike passive saturation, which was relatively close to its bound N, active saturation is much farther below C. Furthermore, while passive saturation was relatively constant in span size, at least after a point, active saturation quite clearly grows with span size, even for spans well beyond those shown in figure 11.</Paragraph> <Paragraph position="6"> We now model these active saturation curves.</Paragraph> <Paragraph position="7"> What does it take for a given active state to match a given span? For TRIE and LIST, an active state corresponds to a prefix of a rule and is a mix of POS tags and phrasal categories, each of which must be matched, in order, over that span for that state to be reached. Given the large SCCs seen in section 4.1, phrasal categories, to a first approximation, might as well be wildcards, able to match any span, especially if empties are present. However, the tags are, in comparison, very restricted. Tags must actually match a word in the span.</Paragraph> <Paragraph position="8"> More precisely, consider an active state a in the grammar and a span s. In the TRIE and LIST encodings, there is some, possibly empty, list L of labels that must be matched over s before an active edge with this state can be constructed over that span.12 Assume that the phrasal categories in L can match any span (or any non-zero span in NOEMPTIES).13 Therefore, phrasal categories in L do not constrain whether a can match s. The real issue is whether the tags in L will match words in s.
Assume that a random tag matches a random word with a fixed probability p, independently of where the tag is in the rule and where the word is in the sentence.14 Assume further that, although tags occur more often than categories in rules (63.9% of rule items are tags in the NOTRANSFORM case15), given a fixed number of tags and categories, all permutations are equally likely to appear as rules.16 Under these assumptions, the probability that an active state a is in the treebank grammar will depend only on the number t of tags and the number c of categories in L. Call this pair sig(a) = (t, c) the signature of a. For a given signature sig, let count(sig) be the number of active states in the grammar which have that signature. 12 The essence of the MIN model, which is omitted here, is that states are represented by the &quot;easiest&quot; label sequence which leads to that state.</Paragraph> <Paragraph position="9"> 13 The model for the NOUNARIES cases is slightly more complex, but similar.</Paragraph> <Paragraph position="10"> 14 This is of course false; in particular, tags at the end of rules disproportionately tend to be punctuation tags.</Paragraph> <Paragraph position="11"> 15 Although the present model does not directly apply to the NOUNARIES cases, NOUNARIESLOW is significantly</Paragraph> <Paragraph position="12"> Now, take a state a of signature (t, c) and a span s. If we align the tags in a with words in s and align the categories in a with spans of words in s, then provided the categories align with a non-empty span (for NOEMPTIES) or any span at all (for NOTRANSFORM), the question of whether this alignment of a with s matches is determined entirely by the t tags.
However, with our assumptions, the probability that a randomly chosen set of t tags matches a randomly chosen set of t words is simply p^t.</Paragraph> <Paragraph position="13"> We then have an expression for the chance of matching a specific alignment of an active state to a specific span. Clearly, there can be many alignments which differ only in the spans of the categories, but line up the same tags with the same words. However, there will be a certain number of unique ways in which the words and tags can be lined up between a and s. If we know this number, we can calculate the total probability that there is some alignment which matches. For example, consider the state NP -&gt; NP CC NP . PP (which has signature (1,2); the PP has no effect) over a span of length n, with empties available. The NPs can match any span, so there are n alignments which are distinct from the standpoint of the CC tag: it can be in any position. The chance that some alignment will match is therefore 1 - (1 - p)^n, which, for small p, is roughly linear in n. It should be clear that for an active state like this, the longer the span, the more likely it is that this state will be found over that span.</Paragraph> <Paragraph position="14"> It is unfortunately not the case that all states with the same signature will match a span length with the same probability. For example, the state NP -&gt; NP NP CC . NP has the same signature, but must align the CC with the final element of the span. A state like this will not become more likely (in our model) as span size increases. However, with some straightforward but space-consuming recurrences, we can calculate the expected chance that a random rule of a given signature will match a given span length.
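The 1 - (1 - p)^n figure for the free-tag example can be checked numerically (a sketch of the text's own example; the value of p used for illustration is the tag-word match probability of roughly 1/17.7 estimated below):

```python
def p_state_matches(n, p):
    """Chance that the state NP -> NP CC NP . PP matches some alignment
    over a span of n words: the NPs act as wildcards, so the CC tag may
    sit at any of n positions, each matching independently with
    probability p, giving 1 - (1 - p)**n."""
    return 1.0 - (1.0 - p) ** n

p = 1 / 17.7  # illustrative tag-word match probability
```

For small p the curve is nearly linear in n (doubling the span roughly doubles the match probability), which is exactly the growth in active saturation the model is after; a state that pins its tag to a fixed position stays at probability p regardless of n.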
Since we know how many states have a given signature, we can calculate the total active saturation asat(n) as the sum, over all signatures sig, of count(sig) times the modeled chance that a state with signature sig matches a span of length n.</Paragraph> <Paragraph position="16"> (Footnote 15, continued:) more efficient than NOUNARIESHIGH despite having more active states, largely because using the bottoms of chains increases the frequency of tags relative to categories.</Paragraph> <Paragraph position="17"> 16 This is also false; tags occur slightly more often at the beginnings of rules and less often at the ends.</Paragraph> <Paragraph position="18"> This model has two parameters. First, there is p, which we estimated directly by looking at the expected match between the distribution of tags in rules and the distribution of tags in the Treebank text (which is around 1/17.7). No factor for POS tag ambiguity was used, another simplification.17 Second, there is the map count from signatures to numbers of active states, which was read directly from the compiled grammars.</Paragraph> <Paragraph position="19"> This model predicts the active saturation curves shown to the right in figure 11. Note that the model, though not perfect, exhibits the qualitative differences between the settings, both in magnitudes and exponents.18 In particular: the transform primarily changes the saturation over short spans, while the encoding determines the overall magnitudes. For example, in TRIE-NOEMPTIES the low-span saturation is lower than in TRIE-NOTRANSFORM, since short spans in the former case can match only signatures which have both t and c small, while in the latter only t needs to be small. Therefore, the several hundred states which are reachable only via categories all match every span starting from size 0 for NOTRANSFORM, but are accessed only gradually for NOEMPTIES.
However, for larger spans, the behavior converges to counts characteristic of TRIE encodings.</Paragraph> <Paragraph position="20"> For LIST encodings, the early saturations are huge, due to the fact that most of the states which are available early for trie grammars are precisely the ones duplicated up to thousands of times in the list grammars. However, the additive gain over the initial states is roughly the same for both, as after a few items are specified, the tries become sparse.</Paragraph> <Paragraph position="21"> The actual magnitudes and exponents19 of the saturations are surprisingly well predicted, suggesting that this model captures the essential behavior.</Paragraph> <Paragraph position="22"> These active saturation curves produce the active total curves in figure 10, which are also qualitatively correct in both magnitudes and exponents.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Traversals </SectionTitle> <Paragraph position="0"> Now that we have models for active and passive edges, we can combine them to model traversal counts as well. We assume that the chance for a passive edge and an active edge to combine into a traversal is a single probability representing how likely an arbitrary active state is to have a continuation with a label matching an arbitrary passive state. List rule states have only one continuation, while trie rule states in the branching portion of the trie average about 3.7 (min FSAs 4.2).20 Making another uniformity assumption, we assume that this combination probability is the continuation degree divided by the total number of passive labels, categorical or tag (73). 17 In general, the p we used was lower for not having modeled tagging ambiguity, but higher for not having modeled the fact that the SCCs are not of size 27.</Paragraph> <Paragraph position="1"> 18 And does so without any &quot;tweakable&quot; parameters. 19 Note that the list curves do not compellingly suggest a power law model.</Paragraph> <Paragraph position="2"> [Figure 13 caption fragment: ... predicted by the models presented in the latter part of the paper (right).]</Paragraph> <Paragraph position="3"> In figure 13, we give graphs and exponents of the traversal counts, both observed and predicted, for various settings. Our model correctly predicts the approximate values and qualitative facts, including: For LIST, the observed exponent is lower than for TRIEs, though the total number of traversals is dramatically higher. This is because the active saturation is growing much faster for TRIEs; note that in cases like this the lower-exponent curve will never actually outgrow the higher-exponent curve.</Paragraph> <Paragraph position="4"> Of the settings shown, only TRIE-NOEMPTIES exhibits super-cubic traversal totals. Despite their similar active and passive exponents, TRIE-NOEMPTIES and TRIE-NOTRANSFORM vary in traversal growth due to the &quot;early burst&quot; of active edges which gives TRIE-NOTRANSFORM significantly more edges over short spans than its power law would predict. This excess leads to a sizeable quadratic addend in the number of transitions, causing the average best-fit exponent to drop without greatly affecting the overall magnitudes.</Paragraph> <Paragraph position="5"> Overall, growth of saturation values in span size increases best-fit traversal exponents, while early spikes in saturation reduce them. The traversal exponents therefore range from LIST-NOTRANSFORM at 2.6 to TRIE-NOUNARIESLOW at over 3.8. However, the final performance is more dependent on the magnitudes, which range from LIST-NOTRANSFORM as the worst, despite its exponent, to MIN-NOUNARIESHIGH as the best.
The single biggest factor in the time and traversal performance turned out to be the encoding, which is fortunate because the choice of grammar transform will depend greatly on the application.</Paragraph> <Paragraph position="6"> 20 This is a simplification as well, since the shorter prefixes that tend to have higher continuation degrees are on average also a larger fraction of the active edges.</Paragraph> </Section> </Section> </Paper>