File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/n06-1022_metho.xml

<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1022">
  <Title>Multilevel Coarse-to-fine PCFG Parsing</Title>
  <Section position="3" start_page="168" end_page="169" type="metho">
    <SectionTitle>
2 Previous Research
</SectionTitle>
    <Paragraph position="0"> Coarse-to-fine search is an idea that has appeared several times in the literature of computational linguistics and related areas. The first appearance of this idea we are aware of is in Maxwell and Kaplan (1993), where a covering CFG is automatically extracted from a more detailed unification grammar and used to identify the possible locations of constituents in the more detailed parses of the sentence. Maxwell and Kaplan use their covering CFG to prune the search of their unification grammar parser in essentially the same manner as we do here, and demonstrate significant performance improvements by using their coarse-to-fine approach.</Paragraph>
    <Paragraph position="1"> The basic theory of coarse-to-fine approximations and dynamic programming in a stochastic framework is laid out in Geman and Kochanek (2001). This paper describes the multilevel dynamic programming algorithm needed for coarse-to-fine analysis (which they apply to decoding rather than parsing), and show how to perform exact coarse-to-fine computation, rather than the heuristic search we perform here.</Paragraph>
    <Paragraph position="2"> A paper closely related to ours is Goodman (1997). In our terminology, Goodman's parser is a two-stage ctf parser. The second stage is a standard tree-bank parser while the first stage is a regular-expression approximation of the grammar. Again, the second stage is constrained by the parses found in the first stage. Neither stage is smoothed. The parser of Charniak (2000) is also a two-stage ctf model, where the first stage is a smoothed Markov grammar (it uses up to three previous constituents as context), and the second stage is a lexicalized Markov grammar with extra annotations about parents and grandparents. The second stage explores all of the constituents not pruned out after the first stage. Related approaches are used in Hall (2004) and Charniak and Johnson (2005).</Paragraph>
    <Paragraph position="3"> A quite different approach to parsing efficiencyistakeninCaraballoandCharniak(1998) null (and refined in Charniak et al. (1998)). Here efficiency is gained by using a standard chart-parsing algorithm and pulling constituents off the agenda according to (an estimate of) their probability given the sentence. This probability is computed by estimating Equation 1:</Paragraph>
    <Paragraph position="5"> It must be estimated because during the bottom-up chart-parsing algorithm, the true outside probability cannot be computed. The results cited in Caraballo and Charniak (1998) cannot be compared directly to ours, but are roughly in the same equivalence class. Those presented in Charniak et al. (1998) are superior, but in Section 5 below we suggest that a combination of the techniques could yield better results still.</Paragraph>
    <Paragraph position="6"> Klein and Manning (2003a) describe efficient A[?] for the most likely parse, where pruning is accomplished by using Equation 1 and a true upper bound on the outside probability. While their maximum is a looser estimate of the outside probability, it is an admissible heuristic and together with an A[?] search is guaranteed to find the best parse first. One question is if the guarantee is worth the extra search required by the looser estimate of the true outside probability.</Paragraph>
    <Paragraph position="7"> Tsuruoka and Tsujii (2004) explore the framework developed in Klein and Manning (2003a), and seek ways to minimize the time required by the heap manipulations necessary in this scheme. They describe an iterative deepening algorithm that does not require a heap. They also speed computation by precomputing more accurate upper bounds on the outside probabilities of various kinds of constituents. They are able to reduce by half the number of constituents required to find the best parse (compared to CKY).</Paragraph>
    <Paragraph position="8"> Most recently, McDonald et al. (2005) have implemented a dependency parser with good accuracy (it is almost as good at dependency parsing as Charniak (2000)) and very impressive speed (it is about ten times faster than Collins (1997) and four times faster than Charniak (2000)). It achieves its speed in part because it uses the Eisner and Satta (1999) algorithm for n3 bilexical parsing, but also because dependency parsing has a much lower grammar constant than does standard PCFG parsing -after all, there are no phrasal constituents to consider. The current paper can be thought of as a way to take the sting out of the grammar constant for PCFGs by parsing first with very few phrasal constituents and adding them only Level: 0 1 2 3  after most constituents have been pruned away.</Paragraph>
  </Section>
  <Section position="4" start_page="169" end_page="170" type="metho">
    <SectionTitle>
3 Multilevel Coarse-to-fine Parsing
</SectionTitle>
    <Paragraph position="0"> We use as the underlying parsing algorithm a reasonably standard CKY parser, modified to allow unary branching rules.</Paragraph>
    <Paragraph position="1"> The complete nonterminal clustering is given in Figure 1. We do not cluster preterminals.</Paragraph>
    <Paragraph position="2"> These remain fixed at all levels to the standard Penn-tree-bank set Marcus et al. (1993).</Paragraph>
    <Paragraph position="3"> Level-0 makes two distinctions, the root node and everybody else. At level 1 we make one further distinction, between phrases that tend to be heads of constituents (NPs, VPs, and Ss) and those that tend to be modifiers (ADJPs, PPs, etc.). Level-2 has a total of five categories: root, things that are typically headed by nouns, those headed by verbs, things headed by prepositions, and things headed by classical modifiers (adjectives, adverbs, etc.). Finally, level 3 is the  classical tree-bank set. As an example, Figure 2 shows the parse for the sentence &amp;quot;He ate at the mall.&amp;quot; at levels 0 to 3.</Paragraph>
    <Paragraph position="4"> During training we create four grammars, one for each level of granularity. So, for example, at level 1 the tree-bank rule</Paragraph>
    <Paragraph position="6"> would be translated into the rule</Paragraph>
    <Paragraph position="8"> That is, each constituent type found in &amp;quot;S -NP VP.&amp;quot;ismappedintoitsgeneralizationatlevel1.</Paragraph>
    <Paragraph position="9"> The probabilities of all rules are computed using maximum likelihood for constituents at that level.</Paragraph>
    <Paragraph position="10"> The grammar used by the parser can best be described as being influenced by four compo- null nents: 1. the nonterminals defined at that level of parsing, 2. the binarization scheme, 3. the generalizations defined over the binarization, and 4. extra annotation to improve parsing accu null racy.</Paragraph>
    <Paragraph position="11"> The first of these has already been covered. We discuss the other three in turn.</Paragraph>
    <Paragraph position="12"> In anticipation of eventually lexicalizing the grammar we binarize from the head out. For example, consider the rule A -a b c d e where c is the head constituent. We binarize this as follows:  too specific, as the binarization introduce a very large number of very specialized phrasal categories (the Ai). Following common practice Johnson (1998; Klein and Manning (2003b) we Markovize by replacing these nonterminals with ones that remember less of the immediate rule context. In our version we keep track of only the parent, the head constituent and the constituent immediately to the right or left, depending on which side of the constituent we are processing. With this scheme the above rules now look like  So, for example, the rule &amp;quot;A -Ad,c e&amp;quot; would have a high probability if constituents of type A, with c as their head, often have d followed by e at their end.</Paragraph>
    <Paragraph position="13"> Lastly, we add parent annotation to phrasal categories to improve parsing accuracy. If we assume that in this case we are expanding a rule for an A used as a child of Q, and a,b,c,d, and e are all phrasal categories, then the above rules  pruned as a function of pruning thresholds for the first 100 sentences of the development corpus Once we have parsed at a level, we evaluate the probability of a constituent p(nki,j  |s) according to the standard inside-outside formula of Equation 1. In this equation nki,j is a constituent of type k spanning the words i to j, and a(*) and b(*) are the outside and inside probabilities of the constituent, respectively. Because we prune at the end each granularity level, we can evaluate the equation exactly; no approximations are needed (as in, e.g., Charniak et al. (1998)).</Paragraph>
    <Paragraph position="14"> During parsing, instead of building each constituent allowed by the grammar, we first test if the probability of the corresponding coarser constituent (which we have from Equation 1 in the previous round of parsing) is greater than a threshold. (The threshold is set empirically based upon the development data.) If it is below the threshold, we do not put the constituent in the chart. For example, before we can use a NP and a VP to create a S (using the rule above), we would first need to check that the probability in the coarser grammar of using the same span HP and HP to create a HP is above the threshold. We use the standard inside-outside formula to calculate this probability (Equation 1). The empirical results below justify our conjecture that there are thresholds that allow significant pruning while leaving the gold constituents  as a function of pruning thresholds for the first 100 sentences of the development corpus</Paragraph>
  </Section>
class="xml-element"></Paper>