File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/w04-0305_metho.xml

Size: 28,339 bytes

Last Modified: 2025-10-06 14:09:03

<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0305">
  <Title>Lookahead in Deterministic Left-Corner Parsing</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Approximating Optimal
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Deterministic Parsing
</SectionTitle>
      <Paragraph position="0"> The general principles of deterministic parsing, as proposed by Marcus (1980), are that parsing proceeds incrementally from left to right, and that once a parsing decision has been made, it cannot be revoked or overridden by an alternative analysis. We translate the rst principle into the design of a statistical parser by using an incremental generative probability model. Such a model provides us with probabilities for partial parses which generate pre xes of the sentence and which do not depend on the words not in this pre x. We can then translate the second principle into constraints on how a statistical parser chooses which partial parses to pursue further as it searches for the most probable complete parse.</Paragraph>
      <Paragraph position="1"> The principle that decisions cannot be revoked or overridden means that, given a sequence of parser actions a1;:::; ai 1 which we have already chosen, we need to choose a single parser action ai before considering any subsequent parser action ai+1. However, this constraint does not prevent considering the e ects of multiple alternative parser actions for ai before choosing between them. This leaves a great deal of exibility for the design of a deterministic parser, because the set of actions de ned by a deterministic parser does not have to be the same as the basic decisions de ned by our probability model. We can combine any nite sequence of decisions dj;:::; dj+l from our probability model into a single parser action ai.</Paragraph>
      <Paragraph position="2"> This combination allows a deterministic parser to consider the e ects of the entire sequence of decisions dj;:::; dj+l before deciding whether to choose it. Di erent deterministic parser designs will combine the basic decisions into parser actions in di erent ways, thereby imposing di erent constraints on how long a sequence of future decisions dj;:::; dj+l can be considered before choosing a parser action.</Paragraph>
      <Paragraph position="3"> Once we have made a distinction between the basic decisions of the probability model dj and the parser actions ai = dj;:::; dj+l, it is convenient to express the choice of the parse a1;:::; an as a search for the most probable d1;:::; dm, where a1;:::; an = d1;:::; dm. The search incrementally constructs partial parses and prunes this search down to a single partial parse after each complete parser action. In other words, given that the search has so far chosen the partial parse a1;:::; ai 1 = d1;:::; dj 1, the search rst considers all the possible partial parses d1;:::; dj 1; dj;:::; dj+l where there exists an ai = dj;:::; dj+l. The search is then pruned down to only the best d1;:::; dj 1; dj;:::; dj+l from this set, and the search continues with all partial parses containing this pre x. Thus the search is allowed to delay pruning for as many basic decisions as are combined into a single parser action. Rather than considering one single deterministic parser design, in this paper we consider a family of deterministic parser designs. We then determine tight upper bounds on the performance of any deterministic parser in this family. We de ne a family of deterministic parsers by starting with a particular incremental generative probability model, and consider a range of ways to de ne parser action ai as nite sequences dj;:::; dj+l of these basic decisions.</Paragraph>
      <Paragraph position="4"> We de ne the family of parser designs as allowing the combination of any sequence of decisions which occur between the parsing of two words. After a word has been incorporated into the parse, this constraint allows the search to consider all the possible decision sequences leading up to the incorporation of the next word, but not beyond. When the next word is reached, the search must again be pruned down to a single analysis. This is a natural point to prune, because it is the position where new information about the sentence is available. Given this de nition of the family of deterministic parsers and the fact that we are only concerned with an upper bound on a deterministic parser's performance, there is no need to consider parser designs which require more pruning than this, since they will never perform as well as a parser which requires less pruning.</Paragraph>
      <Paragraph position="5"> Unfortunately, allowing the combination of any sequence of decisions which occur between the parsing of two words does not exactly correspond to the constraints on deterministic parsing. This is because we cannot put a nite upper bound on the number of actions which occur between two words. Thus this class of parsers includes non-deterministic parsers, and therefore our performance results represent only an upper bound on the performance which could be achieved by a deterministic parser in the class.</Paragraph>
      <Paragraph position="6"> However, there is good reason to believe this is a tight upper bound. Lexicalized theories of syntax all assume that the amount of information about the syntactic structure contributed by each word is nite, and that all the information in the syntactic structure is contributed by some word. Thus it should possible to distribute all the information about the structure across the parse in such a way that a nite amount falls in between each word. The parsing order we use (a form of left-corner parsing) seems to achieve this fairly well, except for the fact that it uses a stack. Parsing right-branching structures, such as are found in English, results in the stack growing arbitrarily large, and then the whole stack needs to be popped at the end of the sentence. With the exception of these sequences of popping actions, the number of actions which occur between any two words could be bounded. In our training set, the bound on the number of non-popping actions between any two words could be set at just 4.</Paragraph>
      <Paragraph position="7"> In addition to designing parser actions to make deterministic parsing easier, another mechanism which is commonly used in deterministic parser designs is lookahead. With lookahead, information about words which have not yet been incorporated into the parse can be used to decide what action to choose next. We consider models where the lookahead consists of some small xed-length pre x of the un-parsed portion of the sentence, which we call k-word lookahead. This mechanisms is constrained by the requirement that the parser be incremental, since a deterministic parser with k-word lookahead can only provide an interpretation for the portion of the sentence which is k words behind what has been input so far. Thus it is not possible to include the entire unboundedly-long sentence in the lookahead. The family of deterministic parsers with k-word lookahead would include parsers which sometimes choose parser actions without waiting to see all k words (and thus on average allow interpretation sooner), but because here we are only concerned with the optimal performance achievable with a given lookahead, we do not have to consider these alternatives. null The optimal deterministic parser with lookahead will choose the partial parse which is the most likely to lead to the correct complete parse given the previous partial parse plus the k words of lookahead. In other words, we are trying to maximize P(at+1ja1;:::; at; wt+1;:::; wt+k), which is the same as maximizing</Paragraph>
      <Paragraph position="9"> wt+1;:::; wt+k. (Note that any partial parse a1;:::; at generates the words w1;:::; wt, because the optimal deterministic parser designs we are considering all have parser actions which combine the entire portion of a parse between one word and another.) We can compute this probability by summing over all parses which include the partial parse a1;:::; at+1 and which generate the lookahead string wt+1;:::; wt+k.</Paragraph>
      <Paragraph position="11"> where at+1;:::; at+k generates wt+1;:::; wt+k : Because the parser actions are de ned in terms of basic decisions in the probability model, we can compute this sum directly using the probability model. A real deterministic parser cannot actually perform this computation explicitly, because it involves pursuing multiple analyses which are then discarded. But ideally a deterministic parser should compute an estimate which approximates this sum. Thus we can compute the performance of a deterministic parser which makes the ideal use of lookahead by explicitly computing this sum. Again, this will be an upper bound on the performance of a real deterministic parser, but we can reasonably expect that a real deterministic parser can reach performance quite close to this ideal for a small amount of lookahead.</Paragraph>
      <Paragraph position="12"> This approach to lookahead can also be expressed in terms of pruning the search for the best parse. After pruning to a single partial parse a1;:::; at which ends by generating wt, the search is allowed to pursue multiple parses in parallel until they generate the word wt+k. The probabilities for these new partial parses are then summed to get estimates of</Paragraph>
      <Paragraph position="14"> at+1, and these sums are used to choose a single at+1. The search is then pruned by removing all partial parses which do not start with a1;:::; at+1.</Paragraph>
      <Paragraph position="15"> The remaining partial parses are then continued until they generate the word wt+k+1, and their probabilities are summed to decide how to prune to a single choice of at+2.</Paragraph>
      <Paragraph position="16"> By expressing the family of deterministic parsers with lookahead in terms of a pruning strategy on a basic parsing model, we are able to easily investigate the e ects of di erent lookahead lengths on the maximum performance of a deterministic parser in this family. To complete the speci cation of the family of deterministic parsers, we simple have to specify the basic parsing model, as done in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 A Generative Left-Corner
</SectionTitle>
    <Paragraph position="0"> As with several previous statistical parsers (Collins, 1999; Charniak, 2000), we use a generative history-based probability model of parsing. Designing a history-based model of parsing involves two steps, rst choosing a mapping from the set of phrase structure trees to the set of parses, and then choosing a probability model in which the probability of each parser decision is conditioned on the history of previous decisions in the parse. For the model to be generative, these decisions must include predicting the words of the sentence. To support incremental parsing, we want to map phrase structure trees to parses which predict the words of the sentence in their left-to-right order. To support deterministic parsing, we want our parses to specify information about the phrase structure tree at appropriate points in the sentence. For these reasons, we choose a form of left-corner parsing (Rosenkrantz and Lewis, 1970).</Paragraph>
    <Paragraph position="1"> In a left-corner parse, each node is introduced after the subtree rooted at the node's rst child has been fully parsed. Then the subtrees for the node's remaining children are parsed in their left-to-right order. In the form of left-corner parsing we use, parsing a constituent starts by pushing the leftmost word w of the constituent onto the stack with a shift(w) action. Parsing a constituent ends by either introducing the constituent's parent nonterminal (labeled Y ) with a project(Y) action, or attaching to the parent with a attach action.</Paragraph>
    <Paragraph position="2"> More precisely, this parsing strategy is a version of left-corner parsing which rst applies right-binarization to the grammar, as is done in (Manning and Carpenter, 1997) except that we binarize down to nullary rules rather than to binary rules. This means that choosing the children for a node is done one child at a time, and that ending the sequence of children is a separate choice. We also extended the parsing strategy slightly to handle Chomsky adjunction structures (i.e. structures of the form [X [X : : :]</Paragraph>
    <Paragraph position="4"> junction is removed and replaced with a special \modi er&amp;quot; link in the tree (becoming [X : : : [mod Y : : :]]). This means that the parser's set of basic actions includes modify, as well as attach, shift(w), and project(Y). We also compiled some frequent chains of non-branching nodes (such as [S [VP : : :]]) into a single node with a new label (becoming [S-VP : : :]). All these grammar transforms are undone before any evaluation of the output trees is performed.</Paragraph>
    <Paragraph position="5"> Because this mapping from phrase structure trees to sequences of parser decisions is one-toone, nding the most probable phrase structure tree is equivalent to nding the parse d1;:::; dm which maximizes P(d1;:::; dm), as is done in generative models. Because this probability includes the probabilities of the shift(wi) decisions, this is the joint probability of the phrase structure tree and the sentence. The probability model is then de ned by using the chain rule for conditional probabilities to derive the probability of a parse as the multiplication of the probabilities of each decision di conditioned on that decision's prior parse history d1;:::; di 1.</Paragraph>
    <Paragraph position="7"> The parameters of this probability model are the P(dijd1;:::; di 1). Generative models are the standard way to transform a parsing strategy into a probability model, but note that we are not assuming any bound on the amount of information from the parse history which might be relevant to each parameter.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Estimating the Parameters with a
Neural Network
</SectionTitle>
    <Paragraph position="0"> The most challenging problem in estimating P(dijd1;:::; di 1) is that the conditional includes an unbounded amount of information. The parse history d1;:::; di 1 grows with the length of the sentence. In order to apply standard probability estimation methods, we use neural networks to induce nite representations of this sequence, which we will denote h(d1;:::; di 1). The neural network training methods we use try to nd representations which preserve all the information about the sequences which are relevant to estimating the desired probabilities.</Paragraph>
    <Paragraph position="2"> Of the previous work on using neural networks for parsing natural language, by far the most empirically successful has been the work using Simple Synchrony Networks (Henderson, 2003). Like other recurrent network architectures, SSNs compute a representation of an unbounded sequence by incrementally computing a representation of each pre x of the sequence.</Paragraph>
    <Paragraph position="3"> At each position i, representations from earlier in the sequence are combined with features of the new position i to produce a vector of real valued features which represent the pre x ending at i. This representation is called a hidden representation. It is analogous to the hidden state of a Hidden Markov Model. As long as the hidden representation for position i 1 is always used to compute the hidden representation for position i, any information about the entire sequence could be passed from hidden representation to hidden representation and be included in the hidden representation of that sequence.</Paragraph>
    <Paragraph position="4"> When these representations are then used to estimate probabilities, this property means that we are not making any a priori hard independence assumptions.</Paragraph>
    <Paragraph position="5"> The di erence between SSNs and most other recurrent neural network architectures is that SSNs are speci cally designed for processing structures. When computing the history representation h(d1;:::; di 1), the SSN uses not only the previous history representation h(d1;:::; di 2), but also uses history representations for earlier positions which are particularly relevant to choosing the next parser decision di.</Paragraph>
    <Paragraph position="6"> This relevance is determined by rst assigning each position to a node in the parse tree, namely the node which is on the top of the parser's stack when that decision is made. Then the relevant earlier positions are chosen based on the structural locality of the current decision's node to the earlier decisions' nodes. In this way, the number of representations which information needs to pass through in order to ow from history representation i to history representation j is determined by the structural distance between i's node and j's node, and not just the distance between i and j in the parse sequence.</Paragraph>
    <Paragraph position="7"> This provides the neural network with a linguistically appropriate inductive bias when it learns the history representations, as explained in more detail in (Henderson, 2003). The fact that this bias is both structurally de ned and linguistically appropriate is the reason that this parser performs so much better than previous attempts at using neural networks for parsing, such as (Costa et al., 2001).</Paragraph>
    <Paragraph position="8"> Once it has computed h(d1;:::; di 1), the SSN uses standard methods (Bishop, 1995) to estimate a probability distribution over the set of possible next decisions di given these representations. This involves further decomposing the distribution over all possible next parser actions into a small hierarchy of conditional probabilities, and then using log-linear models to estimate each of these conditional probability distributions. The input features for these log-linear models are the real-valued vectors computed by h(d1;:::; di 1), as explained in more detail in (Henderson, 2003).</Paragraph>
    <Paragraph position="9"> As with many other machine learning methods, training a Simple Synchrony Network involves rst de ning an appropriate learning criteria and then performing some form of gradient descent learning to search for the optimum values of the network's parameters according to this criteria. We use the on-line version of Backpropagation to perform the gradient descent. This learning simultaneously tries to optimize the parameters of the output computation and the parameters of the mapping h(d1;:::; di 1). With multi-layered networks such as SSNs, this training is not guaranteed to converge to a global optimum, but in practice a network whose criteria value is close to the optimum can be found.</Paragraph>
    <Paragraph position="10"> 5 Searching for the most probable parse As discussed in section 2, we investigate deterministic parsing by translating the principles of deterministic parsing into properties of the pruning strategy used to search the space of possible parses. The complete parsing system alternates between using the search strategy to decide what partial parse d1;:::; di 1 to pursue further and using the SSN to estimate a probability distribution P(dijd1;:::; di 1) over possible next decisions di. The probabilities P(d1;:::; di) for the new partial parses are then just P(d1;:::; di 1) P(dijd1;:::; di 1). When no pruning applies, the partial parse with the highest probability is chosen as the next one to be extended.</Paragraph>
    <Paragraph position="11"> Even in the non-deterministic version of the parser, we need to prune the search space. This is because the number of possible parses is exponential in the length of the sentence, and we cannot use dynamic programming to compute the best parse e ciently because we do not make any independence assumptions. However, we have found that the search can be drastically pruned without loss in accuracy, using a similar approach to that used here to model deterministic parsing. After the prediction of each word, we prune all partial parses except a xed beam of the most probable partial parses. Due to the use of the above left-corner parsing order, we have found that the beam can be as little as 100 parses without having any measurable e ect on accuracy. Below we will refer to this beam width as the post-word search beam width.</Paragraph>
    <Paragraph position="12"> In addition to pruning after the prediction of each word, we also prune the search space in between two words by limiting its branching factor to at most 5. This, in e ect, just limits the number of labels considered for each new nonterminal. We found that increasing the branching factor had no e ect on accuracy and little e ect on speed.</Paragraph>
    <Paragraph position="13"> For the simulations of deterministic parsers, we always applied both the above pruning strategies, in addition to the deterministic pruning. This non-deterministic pruning reduces the number of partial parses a1;:::; at+1;:::; at+k whose probabilities are included in the sum used to choose at+1 for the deterministic pruning.</Paragraph>
    <Paragraph position="14"> This approximation is not likely to have any signi cant e ect on the choice of at+1, because the probabilities of the partial parses which are pruned by the non-deterministic pruning tend to be very small compared to the most probable alternatives. The non-deterministic pruning also reduces the set of partial parses which are chosen between during the subsequent deterministic pruning. But this undoubtedly has no signi cant e ect, since experimental results have shown that the level of non-deterministic pruning discussed above does not e ect performance even without deterministic pruning.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="80" type="metho">
    <SectionTitle>
6 The Experiments
</SectionTitle>
    <Paragraph position="0"> To investigate the e ects of lookahead on our family of deterministic parsers, we ran empirical experiments on the standard the Penn Treebank (Marcus et al., 1993) datasets. The input to the network is a sequence of tag-word pairs.1 We report results for a vocabulary size of 508 tag-word pairs (a frequency threshold of 200).</Paragraph>
    <Paragraph position="1"> We rst trained a network to estimate the parameters of the basic probability model. We determined appropriate training parameters and network size based on intermediate validation 1We used a publicly available tagger (Ratnaparkhi, 1996) to provide the tags. This tagger is run before the parser, so there may be some information about future words which is available in the disambiguated tag which is not available in the word itself. We don't think this has had a signi cant impact on the results reported here, but currently we are working on doing the tagging internally to the parser to avoid this problem.</Paragraph>
    <Paragraph position="2">  results and our previous experience.2 We trained several networks and chose the best ones based on their validation performance. The best post-word search beam width for the non-deterministic parser was determined on the validation set, which was 100.</Paragraph>
    <Paragraph position="3"> To avoid repeated testing on the standard testing set, we measured the performance of the di erent models on section 0 of the Penn Treebank (which is not included in either the training or validation sets). Standard measures of accuracy for di erent lookahead lengths are plotted in gure 1.3 First we should note that the non-deterministic parser has state-of-the-art accuracy (89.0% F-measure), considering its vocabulary size. A moderately larger vocabulary version (4215 tag-word pairs) of this parser achieves 89.8% F-measure on section 0, where the best current result on the testing set is 90.7% (Bod, 2003).</Paragraph>
    <Paragraph position="4"> As expected, the deterministic parsers do worse than the non-deterministic one, and this di erence becomes less as the lookahead is lengthened. What is surprising about the curves in gure 1 is that there is a very large increase in performance from zero words of lookahead  following the standard criteria in (Collins, 1999). We used the standard training (sections 2{22, 39,832 sentences, 910,196 words) and validation (section 24, 1346 sentence, 31507 words) sets (Collins, 1999). Results of the nondeterministic parser average 0.2% worse on the standard testing set, and average 0.8% better when a larger vocabulary (4215 tag-word pairs) is used.</Paragraph>
    <Paragraph position="5"> (i.e. pruning the search to 1 alternative directly after every word) to one word of lookahead. After one word of lookahead the curves show relatively moderate improvements with each additional word of lookahead, converging to the non-deterministic level, as would be expected.4 But between zero words of lookahead and one word of lookahead there is a 5.6% absolute improvement in F-measure (versus a 0.9% absolute improvement between one and two words of lookahead). In other words, adding the rst word of lookahead results in a 2/3 reduction in the di erence between the deterministic and non-deterministic parser's F-measure, while adding subsequent words results in at most a 1/3 reduction per word.</Paragraph>
  </Section>
  <Section position="7" start_page="80" end_page="80" type="metho">
    <SectionTitle>
7 Discussion
</SectionTitle>
    <Paragraph position="0"> The large improvement in performance which results from adding the rst word of lookahead, as compared to adding the subsequent words, indicates that the rst word of lookahead has a qualitatively di erent e ect on deterministic parsing. We believe that one word of lookahead is both necessary and su cient for a model of deterministic parsing.</Paragraph>
    <Paragraph position="1"> The large gain provided by the rst word of lookahead indicates that this lookahead is necessary for deterministic parsing. Given the fact the with one word of lookahead the F-measure of the deterministic parser is only 2.7% below the maximum possible, it is unlikely that the family of deterministic parsers assumed here is so sub-optimal that the entire 5.6% improvement gained with one word lookahead is simply the result of compensating for limitations in the choice of this family.</Paragraph>
    <Paragraph position="2"> The performance curves in gure 1 also suggest that one word of lookahead is su cient. We believe the gain provided by more than one word of lookahead is the result of compensating for limitations in the family of deterministic parsers assumed here. Any limitations in this family will result in the deterministic search making choices before the necessary disambiguating information is available, thereby leading to additional errors. As the lookahead increases, some previously mistaken choices will become disambiguated by the additional lookahead information, thereby improving performance. In the limit as lookahead increases, the performance of 4Note that when the lookahead length is longer than the longest sentence, the deterministic and non-deterministic parsers become equivalent.</Paragraph>
    <Paragraph position="3"> the deterministic and non-deterministic parsers will become the same, no matter what family of deterministic parsers has been speci ed. The smooth curve of increasing performance as the lookahead is increased above one word is the type of results we would expect if the lookahead were simply correcting mistakes in this way.</Paragraph>
    <Paragraph position="4"> Examples of possible limitations to the family of deterministic parsers assumed here include the choice of the left-corner ordering of parser decisions. The left-corner ordering completely determines when each decision about the phrase structure tree must be made. If the family of deterministic parsers had more exibility in this ordering, then the optimal deterministic parser could use an ordering which was tailored to the statistics of the data, thereby avoiding being forced to make decisions before su cient information is available.</Paragraph>
  </Section>
class="xml-element"></Paper>