<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1022"> <Title>Coarse-to-fine n-best parsing and MaxEnt discriminative reranking</Title> <Section position="3" start_page="173" end_page="176" type="metho"> <SectionTitle> 2 Recovering the n-best parses using coarse-to-fine parsing </SectionTitle> <Paragraph position="0"> The major difficulty in n-best parsing, compared to 1-best parsing, is dynamic programming. For example, n-best parsing is straight-forward in best-first search or beam search approaches that do not use dynamic programming: to generate more than one parse, one simply allows the search mechanism to create successive versions to one's heart's content. A good example of this is the Roark parser (Roark, 2001), which works left-to-right through the sentence, and abjures dynamic programming in favor of a beam search, keeping some large number of possibilities to extend by adding the next word, and then re-pruning. At the end one has a beam-width's number of best parses (Roark, 2001).</Paragraph> <Paragraph position="1"> The Collins parser (Collins, 1997) does use dynamic programming in its search. That is, whenever a constituent with the same history is generated a second time, it is discarded if its probability is lower than the original version. If the opposite is true, then the original is discarded. This is fine if one only wants the first-best, but obviously it does not directly enumerate the n-best parses.</Paragraph> <Paragraph position="2"> However, Collins (Collins, 2000; Collins and Koo, in submission) has created an n-best version of his parser by turning off dynamic programming (see the user's guide to Bikel's re-implementation of Collins' parser, http://www.cis.upenn.edu/ dbikel/software.html#statparser). As with Roark's parser, it is necessary to add a beam-width constraint to make the search tractable. 
With a beam width of 1000 the parser returns something like a 50-best list (Collins, personal communication), but the actual number of parses returned for each sentence varies. However, turning off dynamic programming results in a loss in efficiency. Indeed, Collins's n-best list of parses for section 24 of the Penn tree-bank has some sentences with only a single parse, because the n-best parser could not find any additional parses.</Paragraph> <Paragraph position="3"> Now there are two known ways to produce n-best parses while retaining the use of dynamic programming: the obvious way and the clever way.</Paragraph> <Paragraph position="4"> The clever way is based upon an algorithm developed by Schwartz and Chow (1990). Recall the key insight in the Viterbi algorithm: in the optimal parse the parsing decisions at each of the choice points that determine a parse must be optimal, since otherwise one could find a better parse. This insight extends to n-best parsing as follows. Consider the second-best parse: if it is to differ from the best parse, then at least one of its parsing decisions must be suboptimal. In fact, all but one of the parsing decisions in the second-best parse must be optimal, and the one suboptimal decision must be the second-best choice at that choice point. Further, the nth-best parse can only involve at most n suboptimal parsing decisions, and all but one of these must be involved in one of the second through the (n-1)th-best parses. Thus the basic idea behind this approach to n-best parsing is to first find the best parse, then find the second-best parse, then the third-best, and so on. The algorithm was originally described for hidden Markov models.</Paragraph> <Paragraph position="5"> Since the first draft of this paper we have become aware of two PCFG implementations of this algorithm (Jimenez and Marzal, 2000; Huang and Chang, 2005). 
The first was tried on relatively small grammars, while the second was implemented on top of the Bikel re-implementation of the Collins parser (Bikel, 2004) and achieved oracle results for 50-best parses similar to those we report below.</Paragraph> <Paragraph position="6"> Here, however, we describe how to find n-best parses in a more straight-forward fashion. Rather than storing a single best parse of each edge, one stores n of them. That is, when using dynamic programming, rather than throwing away a candidate if it scores less than the best, one keeps it if it is one of the top n analyses for this edge discovered so far. This is really very straight-forward. The problem is space. Dynamic programming parsing algorithms for PCFGs require O(m^2) dynamic programming states, where m is the length of the sentence, so an n-best parsing algorithm requires O(nm^2). However, things get much worse when the grammar is bilexicalized. As shown by Eisner (Eisner and Satta, 1999) the dynamic programming algorithms for bilexicalized PCFGs require O(m^3) states, so an n-best parser would require O(nm^3) states. Things become worse still in a parser like the one described in Charniak (2000) because it conditions on (and hence splits the dynamic programming states according to) features of the grandparent node in addition to the parent, thus multiplying the number of possible dynamic programming states even more. Thus nobody has implemented this version.</Paragraph> <Paragraph position="7"> There is, however, one particular feature of the Charniak parser that mitigates the space problem: it is a &quot;coarse-to-fine&quot; parser. 
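The &quot;obvious&quot; scheme just described can be sketched as a toy CKY parser that stores the top-n scoring derivations per (span, label) edge instead of a single best one. The grammar, probabilities, and tuple encoding below are illustrative assumptions, not the parser the paper describes:

```python
import heapq
from collections import defaultdict
from math import log

# Hypothetical toy PCFG in Chomsky normal form (log probabilities).
binary = {
    "S":  [("NP", "VP", log(0.9))],
    "NP": [("D", "N", log(0.6)), ("NP", "PP", log(0.4))],
    "VP": [("V", "NP", log(0.7)), ("VP", "PP", log(0.3))],
    "PP": [("P", "NP", log(1.0))],
}
lexical = {
    "D": {"the": log(1.0)},
    "N": {"dog": log(0.4), "cat": log(0.4), "telescope": log(0.2)},
    "V": {"saw": log(1.0)},
    "P": {"with": log(1.0)},
}

def nbest_cky(words, n=50):
    """CKY that keeps the top-n derivations per (span, label) edge
    rather than throwing away every candidate that is not the best."""
    m = len(words)
    chart = defaultdict(list)  # (i, k, label) -> [(logprob, tree), ...]
    for i, w in enumerate(words):
        for label, lex in lexical.items():
            if w in lex:
                chart[(i, i + 1, label)].append((lex[w], (label, w)))
    for width in range(2, m + 1):
        for i in range(m - width + 1):
            k = i + width
            for label, rules in binary.items():
                cands = []
                for B, C, lp in rules:
                    for j in range(i + 1, k):
                        for lpB, tB in chart[(i, j, B)]:
                            for lpC, tC in chart[(j, k, C)]:
                                cands.append((lp + lpB + lpC, (label, tB, tC)))
                # keep only the n best analyses for this edge
                chart[(i, k, label)] = heapq.nlargest(n, cands, key=lambda x: x[0])
    return sorted(chart[(0, m, "S")], key=lambda x: -x[0])
```

On the classic PP-attachment ambiguity ("the dog saw the cat with the telescope") this returns both parses, best first; the chart's per-edge lists are exactly the O(n) space overhead the text discusses.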
By &quot;coarse-to-fine&quot; we mean that it first produces a crude version of the parse using coarse-grained dynamic programming states, and then builds fine-grained analyses by splitting the most promising of coarse-grained states.</Paragraph> <Paragraph position="8"> A prime example of this idea is from Goodman (1997), who describes a method for producing a simple but crude approximate grammar of a standard context-free grammar. He parses a sentence using the approximate grammar, and the results are used to constrain the search for a parse with the full CFG.</Paragraph> <Paragraph position="9"> He finds that total parsing time is greatly reduced.</Paragraph> <Paragraph position="10"> A somewhat different take on this paradigm is seen in the parser we use in this paper. Here the parser first creates a parse forest based upon a much less complex version of the complete grammar. In particular, it only looks at standard CFG features, the parent and neighbor labels. Because this grammar encodes relatively little state information, its dynamic programming states are relatively coarse and hence there are comparatively few of them, so it can be efficiently parsed using a standard dynamic programming bottom-up CFG parser. However, precisely because this first stage uses a grammar that ignores many important contextual features, the best parse it finds will not, in general, be the best parse according to the finer-grained second-stage grammar, so clearly we do not want to perform best-first parsing with this grammar. Instead, the output of the first stage is a polynomial-sized packed parse forest which records the left and right string positions for each local tree in the parses generated by this grammar. The edges in the packed parse forest are then pruned, to focus attention on the coarse-grained states that are likely to correspond to high-probability fine-grained states. 
The edges are then pruned according to their marginal probability conditioned on the string s being parsed as follows:</Paragraph> <Paragraph position="11"> p(n^i_{j,k} | s) = α(n^i_{j,k}) β(n^i_{j,k}) / p(s)</Paragraph> <Paragraph position="12"> Here n^i_{j,k} is a constituent of type i spanning the words from j to k, α(n^i_{j,k}) is the outside probability of this constituent, and β(n^i_{j,k}) is its inside probability. From the parse forest both α and β can be computed in time proportional to the size of the compact forest. The parser then removes all constituents n^i_{j,k} whose probability falls below some preset threshold.</Paragraph> <Paragraph position="13"> In the version of this parser available on the web, this threshold is on the order of 10^-4.</Paragraph> <Paragraph position="14"> The unpruned edges are then exhaustively evaluated according to the fine-grained probabilistic model; in effect, each coarse-grained dynamic programming state is split into one or more fine-grained dynamic programming states. As noted above, the fine-grained model conditions on information that is not available in the coarse-grained model. This includes the lexical head of a constituent's parent, the part of speech of this head, the parent's and grandparent's category labels, etc. 
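As a schematic illustration of this pruning step, the thresholding can be written as below. The outside (α) and inside (β) tables and p(s) are made-up values for a tiny forest, not the parser's actual computation:

```python
def prune_forest(alpha, beta, p_sentence, threshold=1e-4):
    """Keep only constituents n whose marginal probability
    p(n | s) = alpha(n) * beta(n) / p(s) is at or above the threshold."""
    kept = {}
    for n in alpha:
        marginal = alpha[n] * beta[n] / p_sentence
        if marginal >= threshold:
            kept[n] = marginal
    return kept

# Made-up outside (alpha) and inside (beta) tables for a tiny forest;
# constituents are (label, start, end) triples and p_sentence = beta(root).
alpha = {("S", 0, 2): 1.0, ("NP", 0, 2): 0.2, ("X", 0, 1): 1e-6}
beta = {("S", 0, 2): 0.01, ("NP", 0, 2): 0.004, ("X", 0, 1): 0.001}
kept = prune_forest(alpha, beta, p_sentence=0.01)
```

Note that the root constituent always has marginal 1, since its outside probability is 1 and its inside probability equals p(s).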
The fine-grained states investigated by the parser are constrained to be refinements of the coarse-grained states, which drastically reduces the number of fine-grained states that need to be investigated.</Paragraph> <Paragraph position="15"> It is certainly possible to do dynamic programming parsing directly with the fine-grained grammar, but precisely because the fine-grained grammar conditions on a wide variety of non-local contextual information there would be a very large number of different dynamic programming states, so direct dynamic programming parsing with the fine-grained grammar would be very expensive in terms of time and memory.</Paragraph> <Paragraph position="16"> As the second-stage parser evaluates all the remaining constituents in all of the contexts in which they appear (e.g., what are the possible grandparent labels) it keeps track of the most probable expansion of the constituent in that context, and at the end is able to start at the root and piece together the overall best parse.</Paragraph> <Paragraph position="17"> Now comes the easy part. To create a 50-best parser we simply change the fine-grained version of the 1-best algorithm in accordance with the &quot;obvious&quot; scheme outlined earlier in this section. The first, coarse-grained, pass is not changed, but the second, fine-grained, pass keeps the n-best possibilities at each dynamic programming state, rather than keeping just the first best. When combining two constituents to form a larger constituent, we keep the best 50 of the 2500 possibilities they offer. Naturally, if we</Paragraph> <Paragraph position="19"> The experimental question is whether, in practice, the coarse-to-fine architecture keeps the number of dynamic programming states sufficiently low that space considerations do not defeat us.</Paragraph> <Paragraph position="20"> The answer seems to be yes. We ran the algorithm on section 24 of the Penn WSJ tree-bank using the default pruning settings mentioned above. 
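The &quot;best 50 of the 2500&quot; combination step need not score all n*n pairs: if both children's n-best lists are sorted, a lazy frontier search (in the spirit of later k-best parsing work, not something the paper specifies) enumerates combinations in score order. A sketch with hypothetical scored lists:

```python
import heapq

def combine_nbest(left, right, rule_logprob=0.0, n=50):
    """Combine two sorted n-best lists (best first) into the top-n sums.
    Starting from the (0, 0) pair, each popped pair (i, j) enqueues its
    neighbours (i+1, j) and (i, j+1); monotonicity of the sum guarantees
    the pops come out in descending-score order."""
    seen, out = {(0, 0)}, []
    heap = [(-(left[0][0] + right[0][0] + rule_logprob), 0, 0)]
    while heap and len(out) < n:
        neg, i, j = heapq.heappop(heap)
        out.append((-neg, (left[i][1], right[j][1])))
        for ii, jj in ((i + 1, j), (i, j + 1)):
            if ii < len(left) and jj < len(right) and (ii, jj) not in seen:
                seen.add((ii, jj))
                heapq.heappush(
                    heap, (-(left[ii][0] + right[jj][0] + rule_logprob), ii, jj))
    return out
```

With 50-item children this pops at most 50 pairs and touches roughly 100 of the 2500 candidates, instead of scoring them all.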
Table 1 shows how the number of fine-grained dynamic programming states increases as a function of sentence length for the sentences in section 24 of the Treebank. There are no sentences of length greater than 69 in this section. Columns two to four show the number of sentences in each bucket, their average length, and the average number of fine-grained dynamic programming structures per sentence. The final column gives the value of the function 100 · L^1.5 where L is the average length of sentences in the bucket. Except for bucket 6, which is abnormally low, it seems that this ad hoc function tracks the number of structures quite well. Thus the number of dynamic programming states does not grow as L^2, much less as L^3.</Paragraph> <Paragraph position="21"> To put the number of these structures per sentence in perspective, consider the size of such structures. Each one must contain a probability, the non-terminal label of the structure, and a vector of pointers to its children (an average parent has slightly more than two children). If one were concerned about every byte this could be made quite small. In our implementation probably the biggest factor is the STL overhead on vectors. If we figure we are using, say, 25 bytes per structure, the total space required is only 1.25Mb even for 50,000 dynamic programming states, so it is clearly not worth worrying about the memory required.</Paragraph> <Paragraph position="22"> The resulting n-bests are quite good, as shown in Table 2. (The results are for all sentences of section 23 of the WSJ tree-bank of length ≤ 100.) From the 1-best result we see that the base accuracy of the parser is 89.7%. 2-best and 10-best show dramatic oracle-rate improvements. After that things start to slow down, and we achieve an oracle rate of 0.968 at 50-best. 
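The oracle rate in Table 2 is the f-score obtained by always picking the best parse in each n-best list. A minimal sketch of that computation, using a per-sentence average (a simplification: the paper's Parseval scores aggregate bracket counts at the corpus level) and made-up scores:

```python
def oracle_rate(nbest_fscores):
    """Average of the best per-sentence f-score in each n-best list;
    a macro-averaged simplification of corpus-level Parseval."""
    return sum(max(scores) for scores in nbest_fscores) / len(nbest_fscores)

def first_best_rate(nbest_fscores):
    """Average f-score of the parser's 1-best (first-ranked) parse."""
    return sum(scores[0] for scores in nbest_fscores) / len(nbest_fscores)

# Hypothetical per-parse f-scores for three sentences' n-best lists.
scores = [[0.90, 0.95, 0.88], [0.80, 0.99], [1.00]]
```

The gap between `first_best_rate` and `oracle_rate` is the headroom a reranker can exploit.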
To put this in perspective, Roark (Roark, 2001) reports oracle results of 0.941 (with the same experimental setup) using his parser to return a variable number of parses. For the case cited his parser returns, on average, 70 parses per sentence.</Paragraph> <Paragraph position="23"> Finally, we note that 50-best parsing is only a factor of two or three slower than 1-best.</Paragraph> <Paragraph position="24"> (Charniak (2000) cites an accuracy of 89.5%; fixing a few very small bugs discovered by users of the parser accounts for the difference from the 89.7% reported above.)</Paragraph> </Section> <Section position="4" start_page="176" end_page="177" type="metho"> <SectionTitle> 3 Features for reranking parses </SectionTitle> <Paragraph position="0"> This section describes how each parse y is mapped to a feature vector f(y) = (f1(y),...,fm(y)). Each feature fj is a function that maps a parse to a real number. The first feature f1(y) = log p(y) is the logarithm of the parse probability p according to the n-best parser model. The other features are integer valued; informally, each feature is associated with a particular configuration, and the feature's value fj(y) is the number of times that configuration occurs in the parse. For example, the feature f_(eat,pizza)(y) counts the number of times that a phrase in y headed by eat has a complement phrase headed by pizza.</Paragraph> <Paragraph position="1"> Features belong to feature schemata, which are abstract schemata from which specific features are instantiated. For example, the feature f_(eat,pizza) is an instance of the &quot;Heads&quot; schema. Feature schemata are often parameterized in various ways. For example, the &quot;Heads&quot; schema is parameterized by the type of heads that the feature schema identifies. Following Grimshaw (1997), we associate each phrase with a lexical head and a function head. 
For example, the lexical head of an NP is a noun while the functional head of an NP is a determiner, and the lexical head of a VP is a main verb while the functional head of a VP is an auxiliary verb.</Paragraph> <Paragraph position="2"> We experimented with various kinds of feature selection, and found that a simple count threshold performs as well as any of the methods we tried.</Paragraph> <Paragraph position="3"> Specifically, we ignored all features that did not vary on the parses of at least t sentences, where t is the count threshold. In the experiments described below t = 5, though we also experimented with t = 2.</Paragraph> <Paragraph position="4"> The rest of this section outlines the feature schemata used in the experiments below. The feature schemata used here were developed using the n-best parses provided to us by Michael Collins approximately a year before the n-best parser described here was developed. We used the division into preliminary training and preliminary development data sets described in Collins (2000) while experimenting with feature schemata; i.e., the first 36,000 sentences of sections 2-20 were used as preliminary training data, and the remaining sentences of sections 20 and 21 were used as preliminary development data. It is worth noting that developing feature schemata is much more of an art than a science, as adding or deleting a single schema usually does not have a significant effect on performance, yet the overall impact of many well-chosen schemata can be dramatic.</Paragraph> <Paragraph position="5"> Using the 50-best parser output described here, there are 1,148,697 features that meet the count threshold of at least 5 on the main training data (i.e., Penn treebank sections 2-21). 
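The count-threshold selection (&quot;ignore features that do not vary on the parses of at least t sentences&quot;) can be sketched as follows; the per-parse feature-count dictionaries are a toy representation assumed for illustration:

```python
from collections import defaultdict

def select_features(sentences_features, t=5):
    """Keep a feature only if its count differs across the n-best parses
    of at least t sentences. sentences_features gives, per sentence, a
    list of {feature: count} dicts, one dict per n-best parse."""
    vary_count = defaultdict(int)
    for parses in sentences_features:
        all_feats = set().union(*parses)
        for f in all_feats:
            values = {p.get(f, 0) for p in parses}
            if len(values) > 1:  # the feature varies on this sentence's parses
                vary_count[f] += 1
    return {f for f, c in vary_count.items() if c >= t}
```

A feature that takes the same value on every parse of a sentence cannot help the reranker distinguish those parses, which is why variation, not raw frequency, is what is counted.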
We list each feature schema's name, followed by the number of features in that schema with a count of at least 5, together with a brief description of the instances of the schema and the schema's parameters.</Paragraph> <Paragraph position="6"> CoPar (10) The instances of this schema indicate conjunct parallelism at various different depths.</Paragraph> <Paragraph position="7"> For example, conjuncts which have the same label are parallel at depth 0, conjuncts with the same label and whose children have the same label are parallel at depth 1, etc.</Paragraph> <Paragraph position="8"> CoLenPar (22) The instances of this schema indicate the binned difference in length (in terms of number of preterminals dominated) in adjacent conjuncts in the same coordinated structures, conjoined with a boolean flag that indicates whether the pair is final in the coordinated phrase.</Paragraph> <Paragraph position="9"> RightBranch (2) This schema enables the reranker to prefer right-branching trees. One instance of this schema returns the number of nonterminal nodes that lie on the path from the root node to the right-most non-punctuation preterminal node, and the other instance of this schema counts the number of the other nonterminal nodes in the parse tree.</Paragraph> <Paragraph position="10"> Heavy (1049) This schema classifies nodes by their category, their binned length (i.e., the number of preterminals they dominate), whether they are at the end of the sentence and whether they are followed by punctuation.</Paragraph> <Paragraph position="11"> Neighbours (38,245) This schema classifies nodes by their category, their binned length, and the part of speech categories of the ℓ1 preterminals to the node's left and the ℓ2 preterminals to the node's right. ℓ1 and ℓ2 are parameters of this schema; here ℓ1 = 1 or ℓ1 = 2 and ℓ2 = 1. 
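For concreteness, the two RightBranch instances just described can be computed as below on a toy tuple-encoded tree; the encoding and the punctuation set are assumptions for illustration, not the paper's implementation:

```python
def right_branch_features(tree, punct=frozenset({",", ".", ":", "``", "''"})):
    """Return (a, b) where a is the number of nonterminal nodes on the
    path from the root to the rightmost non-punctuation preterminal,
    and b is the number of all other nonterminal nodes. Trees are
    (label, children...) tuples with (tag, word) preterminals."""
    def is_preterminal(t):
        return len(t) == 2 and isinstance(t[1], str)

    def path_to_rightmost(t):
        # Path of nodes down to the rightmost non-punctuation preterminal.
        if is_preterminal(t):
            return [t] if t[0] not in punct else None
        for child in reversed(t[1:]):
            sub = path_to_rightmost(child)
            if sub is not None:
                return [t] + sub
        return None

    def count_nonterminals(t):
        if is_preterminal(t):
            return 0
        return 1 + sum(count_nonterminals(c) for c in t[1:])

    path = path_to_rightmost(tree) or []
    on_path = sum(1 for t in path if not is_preterminal(t))
    return on_path, count_nonterminals(tree) - on_path
```

With feature weights attached, the first count rewards right-branching structure and the second penalizes everything else, which is how the schema lets the reranker express a right-branching preference.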
Rule (271,655) The instances of this schema are local trees, annotated with varying amounts of contextual information controlled by the schema's parameters. This schema was inspired by a similar schema in Collins and Koo (in submission). The parameters to this schema control whether nodes are annotated with their preterminal heads, their terminal heads and their ancestors' categories. An additional parameter controls whether the feature is specialized to embedded or non-embedded clauses, which roughly corresponds to Emonds' &quot;nonroot&quot; and &quot;root&quot; contexts (Emonds, 1976). NGram (54,567) The instances of this schema are ℓ-tuples of adjacent children nodes of the same parent. This schema was inspired by a similar schema in Collins and Koo (in submission).</Paragraph> <Paragraph position="12"> This schema has the same parameters as the Rule schema, plus the length ℓ of the tuples of children (ℓ = 2 here).</Paragraph> <Paragraph position="13"> Heads (208,599) The instances of this schema are tuples of head-to-head dependencies, as mentioned above. The category of the node that is the least common ancestor of the head and the dependent is included in the instance (this provides a crude distinction between different classes of arguments). The parameters of this schema are whether the heads involved are lexical or functional heads, the number of heads in an instance, and whether the lexical item or just the head's part of speech are included in the instance.</Paragraph> <Paragraph position="14"> LexFunHeads (2,299) The instances of this feature are the pairs of parts of speech of the lexical head and the functional head of nodes in parse trees.</Paragraph> <Paragraph position="15"> WProj (158,771) The instances of this schema are preterminals together with the categories of ℓ of their closest maximal projection ancestors. 
The parameters of this schema control the number ℓ of maximal projections, and whether the preterminals and the ancestors are lexicalized.</Paragraph> <Paragraph position="16"> Word (49,097) The instances of this schema are lexical items together with the categories of ℓ of their immediate ancestor nodes, where ℓ is a schema parameter (ℓ = 2 or ℓ = 3 here). This feature was inspired by a similar feature in Klein and Manning (2003).</Paragraph> <Paragraph position="17"> HeadTree (72,171) The instances of this schema are tree fragments consisting of the local trees that make up the projections of a preterminal node, together with the siblings of such projections. This schema is parameterized by the head type (lexical or functional) used to determine the projections of a preterminal, and whether the head preterminal is lexicalized.</Paragraph> <Paragraph position="18"> NGramTree (291,909) The instances of this schema are subtrees rooted in the least common ancestor of ℓ contiguous preterminal nodes. This schema is parameterized by the number ℓ of contiguous preterminals (ℓ = 2 or ℓ = 3 here) and whether these preterminals are lexicalized.</Paragraph> </Section> <Section position="5" start_page="177" end_page="178" type="metho"> <SectionTitle> 4 Estimating feature weights </SectionTitle> <Paragraph position="0"> This section explains how we estimate the feature weights θ = (θ1,...,θm) for the feature functions f = (f1,...,fm). We use a MaxEnt estimator to find the feature weights θ̂ that minimize the regularized loss, where L_D is the loss function and R is a regularization penalty term:</Paragraph> <Paragraph position="1"> θ̂ = argmin_θ L_D(θ) + R(θ)</Paragraph> <Paragraph position="2"> The training data D = (s1,...,s_n′) is a sequence of sentences and their correct parses y*(s1),...,y*(s_n′). We used the 20-fold cross-validation technique described in Collins (2000) to compute the n-best parses Y(s) for each sentence s in D. 
In general the correct parse y*(s) is not a member of Y(s), so instead we train the reranker to identify one of the best parses Y+(s) = argmax_{y ∈ Y(s)} F_{y*(s)}(y) in the n-best parser's output, where F_{y*}(y) is the Parseval f-score of y evaluated with respect to y*.</Paragraph> <Paragraph position="3"> Because there may not be a unique best parse for each sentence (i.e., |Y+(s)| > 1 for some sentences s) we used the variant of MaxEnt described in Riezler et al. (2002) for partially labelled training data. Recall the standard MaxEnt conditional probability model for a parse y ∈ Y:</Paragraph> <Paragraph position="5"> P_θ(y) = exp(Σ_{j=1}^m θ_j f_j(y)) / Σ_{y′ ∈ Y} exp(Σ_{j=1}^m θ_j f_j(y′))</Paragraph> <Paragraph position="6"> The loss function L_D proposed in Riezler et al. (2002) is just the negative log conditional likelihood of the best parses Y+(s) relative to the n-best parser's output:</Paragraph> <Paragraph position="7"> L_D(θ) = − Σ_{i=1}^{n′} log Σ_{y ∈ Y+(s_i)} P_θ(y), where P_θ is normalized over Y(s_i).</Paragraph> <Paragraph position="8"> The partial derivatives of this loss function, which are required by the numerical estimation procedure, are:</Paragraph> <Paragraph position="9"> ∂L_D/∂θ_j = Σ_{i=1}^{n′} (E_θ[f_j | Y(s_i)] − E_θ[f_j | Y+(s_i)]), where E_θ[f_j | Y] = Σ_{y ∈ Y} f_j(y) P_θ(y) / Σ_{y ∈ Y} P_θ(y).</Paragraph> <Paragraph position="10"> In the experiments reported here, we used a Gaussian or quadratic regularizer R(θ) = c Σ_{j=1}^m θ_j^2, where c is an adjustable parameter that controls the amount of regularization, chosen to optimize the reranker's f-score on the development set (section 24 of the treebank).</Paragraph> <Paragraph position="11"> We used the Limited Memory Variable Metric optimization algorithm from the PETSc/TAO optimization toolkit (Benson et al., 2004) to find the optimal feature weights θ̂ because this method seems substantially faster than comparable methods (Malouf, 2002). The PETSc/TAO toolkit provides a variety of other optimization algorithms and flags for controlling convergence, but preliminary experiments on the Collins' trees with different algorithms and early stopping did not show any performance improvements, so we used the default PETSc/TAO setting for our experiments here.</Paragraph> </Section> </Paper>
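The loss and its gradient can be sketched end-to-end on toy data: the loss is the negative log conditional likelihood of Y+(s) under the n-best distribution, and its gradient is a difference of feature expectations. Everything below (the encoding, dimensions, and values) is illustrative; the regularizer and the PETSc/TAO optimizer are omitted:

```python
import math

def riezler_loss_and_grad(sentences, theta):
    """L(θ) = -Σ_s log Σ_{y in Y+(s)} P_θ(y), with P_θ normalized over
    Y(s), and gradient ∂L/∂θ_j = Σ_s (E_θ[f_j | Y(s)] - E_θ[f_j | Y+(s)]).
    Each sentence is (feats, best): feature vectors for its n-best parses
    and the set of indices of its best (highest f-score) parses."""
    m = len(theta)
    loss, grad = 0.0, [0.0] * m

    def expectations(feats, weights):
        # Feature expectations under weights, plus the normalizer.
        z = sum(weights)
        return [sum(w * f[j] for w, f in zip(weights, feats)) / z
                for j in range(m)], z

    for feats, best in sentences:
        scores = [math.exp(sum(t * fj for t, fj in zip(theta, f)))
                  for f in feats]
        e_all, z_all = expectations(feats, scores)
        best_scores = [scores[i] if i in best else 0.0
                       for i in range(len(feats))]
        e_best, z_best = expectations(feats, best_scores)
        loss -= math.log(z_best / z_all)
        for j in range(m):
            grad[j] += e_all[j] - e_best[j]
    return loss, grad
```

At θ = 0 all parses are equally likely, so for a sentence with two parses of which one is best the loss is log 2, and the gradient pushes the weights toward features of the best parse.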