<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1022">
  <Title>Multilevel Coarse-to-fine PCFG Parsing</Title>
  <Section position="2" start_page="0" end_page="168" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Reasonably accurate constituent-based parsing is fairly quick these days, if fairly quick means about a second per sentence. Unfortunately, this is still too slow for many applications. In some cases researchers need large quantities of parsed data and do not have the hundreds of machines necessary to parse gigaword corpora in a week or two. More pressingly, in real-time applications such as speech recognition, a parser would be only a part of a much larger system, and the system builders are not keen on giving the parser one of the ten seconds available to process, say, a thirty-word sentence. Even worse, some applications require the parsing of multiple candidate strings per sentence (Johnson and Charniak, 2004) or parsing from a lattice (Hall and Johnson, 2004), and in these applications parsing efficiency is even more important.</Paragraph>
    <Paragraph position="1"> We present here a multilevel coarse-to-fine (mlctf) PCFG parsing algorithm that reduces the complexity of the search involved in finding the best parse. It defines a sequence of increasingly more complex PCFGs, and uses the parse forest produced by one PCFG to prune the search of the next more complex PCFG.</Paragraph>
    <Paragraph position="2"> We currently use four levels of grammars in our mlctf algorithm. The simplest PCFG, which we call the level-0 grammar, contains only one nontrivial nonterminal and is so simple that minimal time is needed to parse a sentence using it. Nonetheless, we demonstrate that it identifies the locations of correct constituents of the parse tree (the &quot;gold constituents&quot;) with high recall. Our level-1 grammar distinguishes only argument from modifier phrases (i.e., it has two nontrivial nonterminals), while our level-2 grammar distinguishes the four major phrasal categories (verbal, nominal, adjectival and prepositional phrases), and level 3 distinguishes all of the standard categories of the Penn treebank.</Paragraph>
    <Paragraph position="3"> The nonterminal categories in these grammars can be regarded as clusters or equivalence classes of the original Penn treebank nonterminal categories. (In fact, we obtain these grammars by relabeling the node labels in the treebank and extracting a PCFG from this relabelled treebank in the standard fashion, but we discuss other approaches below.) We require that the partition of the nonterminals defined by the equivalence classes at level l + 1 be a refinement of the partition defined at level l. This means that each nonterminal category at level l + 1 is mapped to a unique nonterminal category at level l (although in general the mapping is many-to-one, i.e., each nonterminal category at level l corresponds to several nonterminal categories at level l + 1).</Paragraph>
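    <Paragraph> The refinement requirement can be illustrated with a minimal sketch. The cluster names and the LEVELS table below are hypothetical stand-ins chosen for illustration, not the clusters actually used in the paper; only the four-level structure and the requirement that each level l + 1 class has a unique level l parent come from the text.

```python
# Hypothetical label clusters: each dict maps a Penn treebank category
# to its class name at one level. The class names are illustrative only.
LEVELS = [
    {"NP": "P", "VP": "P", "PP": "P", "ADJP": "P"},          # level 0: one class
    {"NP": "ARG", "VP": "ARG", "PP": "MOD", "ADJP": "MOD"},  # level 1: two classes
    {"NP": "N_", "VP": "V_", "PP": "P_", "ADJP": "A_"},      # level 2: four classes
    {"NP": "NP", "VP": "VP", "PP": "PP", "ADJP": "ADJP"},    # level 3: full labels
]


def parent_map(fine, coarse):
    """Map each level l+1 class to its level l class.

    Raises ValueError if some finer class would need two different
    coarser parents, i.e. if the finer partition is not a refinement.
    """
    parents = {}
    for label, fine_class in fine.items():
        coarse_class = coarse[label]
        # setdefault stores the first parent seen and returns the stored one.
        if parents.setdefault(fine_class, coarse_class) != coarse_class:
            raise ValueError(f"{fine_class} maps to more than one coarser class")
    return parents


# One many-to-one map per pair of adjacent levels.
maps = [parent_map(LEVELS[i + 1], LEVELS[i]) for i in range(len(LEVELS) - 1)]
```

    Under these assumed clusters, maps[1] sends both N_ and V_ to ARG, which is the many-to-one behaviour described above.</Paragraph>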
    <Paragraph position="4"> We use the correspondence between categories at different levels to prune possible constituents.</Paragraph>
    <Paragraph position="5"> A constituent is considered at level l + 1 only if the corresponding constituent at level l has a probability exceeding some threshold. Thus, parsing a sentence proceeds as follows. We first parse the sentence with the level-0 grammar to produce a parse forest using the CKY parsing algorithm. Then for each level l + 1 we reparse the sentence with the level l + 1 grammar, using the level l parse forest to prune as described above. As we demonstrate, this leads to considerable efficiency improvements.</Paragraph>
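    <Paragraph> The pruning test described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function prune_filter, the probability table, the parent map, and the threshold value are all hypothetical, and how the level l probabilities are obtained from the parse forest is not specified in this section.

```python
def prune_filter(coarse_probs, parents, threshold):
    """Build a predicate that says whether a level l+1 constituent is considered.

    coarse_probs: dict keyed by (level-l category, start, end) -> probability
    parents:      dict mapping each level l+1 category to its level-l category
    threshold:    assumed cutoff; the value used here is purely illustrative
    """
    def allowed(fine_label, start, end):
        coarse_label = parents[fine_label]
        # A span absent from the coarser chart is treated as probability 0.
        prob = coarse_probs.get((coarse_label, start, end), 0.0)
        return prob > threshold
    return allowed


# Hypothetical level-l probabilities for a three-word sentence.
coarse = {("ARG", 0, 1): 0.9, ("MOD", 1, 3): 0.0001, ("ARG", 0, 3): 1.0}
parents = {"N_": "ARG", "V_": "ARG", "P_": "MOD", "A_": "MOD"}
allowed = prune_filter(coarse, parents, threshold=0.001)

spans = [(0, 1), (1, 3), (0, 3)]
survivors = [(c, i, j) for c in parents for (i, j) in spans if allowed(c, i, j)]
```

    With these made-up numbers, only the N_ and V_ candidates over spans (0, 1) and (0, 3) survive, so the level l + 1 parser would have 4 of the 12 candidate constituents left to consider.</Paragraph>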
    <Paragraph position="6"> The paper proceeds as follows. We next discuss previous work (Section 2). Section 3 outlines the algorithm in more detail. Section 4 presents some experiments showing that the workload (as measured by the total number of constituents processed) is decreased by a factor of ten over standard CKY parsing at the final level. We also discuss some fine points of the results therein. Finally, in Section 5 we suggest that because the search space of mlctf algorithms is, at this point, almost totally unexplored, future work should be able to improve significantly on these results.</Paragraph>
  </Section>
</Paper>