<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2089">
  <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. A Best-First Probabilistic Shift-Reduce Parser</Title>
  <Section position="4" start_page="691" end_page="693" type="intro">
    <SectionTitle>
2 Parser Description
</SectionTitle>
    <Paragraph position="0"> Our parser uses an extended version of the basic bottom-up shift-reduce algorithm for constituent structures used in Sagae and Lavie's (2005) deterministic parser. For clarity, we will first describe the deterministic version of the algorithm, and then show how it can be extended into a probabilistic algorithm that performs a best-first search.</Paragraph>
    <Section position="1" start_page="691" end_page="691" type="sub_section">
      <SectionTitle>
2.1 A Shift-Reduce Algorithm for
Deterministic Constituent Parsing
</SectionTitle>
      <Paragraph position="0"> In its deterministic form, our parsing algorithm is the same single-pass shift-reduce algorithm as the one used in the classifier-based parser of Sagae and Lavie (2005). That algorithm, in turn, is similar to the dependency parsing algorithm of Nivre and Scholz (2004), but it builds a constituent tree and a dependency tree simultaneously. The algorithm considers only trees with unary and binary productions. Training the parser with arbitrarily branching trees is accomplished by a simple procedure that transforms those trees into trees with at most binary productions. This is done by converting each production with n children, where n &gt; 2, into n - 1 binary productions.</Paragraph>
      <Paragraph position="1"> This binarization process is similar to the one described by Charniak et al. (1998). Additional non-terminal nodes introduced in this conversion must be clearly marked. Transforming the parser's output back into arbitrarily branching trees is accomplished by the reverse process.</Paragraph>
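The binarization transform and its inverse can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the right-branching orientation and the "*" suffix marking introduced nodes are assumptions for the example (the paper only requires that introduced nodes be clearly marked), and it binarizes one production at a time.

```python
def binarize(label, children):
    """Convert a production with n > 2 children into n - 1 binary
    productions, marking introduced nodes with a "*" suffix."""
    if len(children) <= 2:
        return (label, children)
    head, rest = children[0], children[1:]
    return (label, [head, binarize(label + "*", rest)])

def unbinarize(tree):
    """Reverse the transform by splicing out marked intermediate nodes."""
    label, children = tree
    if isinstance(children, str):          # leaf: (POS-tag, word)
        return tree
    flat = []
    for child in children:
        child = unbinarize(child)
        if isinstance(child[0], str) and child[0].endswith("*"):
            flat.extend(child[1])          # splice marked node's children up
        else:
            flat.append(child)
    return (label, flat)
```

Applying `binarize` and then `unbinarize` to a flat NP such as ("NP", [DT, JJ, NN]) round-trips to the original tree.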
      <Paragraph position="2"> The deterministic parsing algorithm involves two main data structures: a stack S, and a queue W. Items in S may be terminal nodes (part-of-speech-tagged words), or (lexicalized) subtrees of the final parse tree for the input string. Items in W are terminals (words tagged with parts-of-speech) corresponding to the input string. When parsing begins, S is empty and W is initialized by inserting every word from the input string in order, so that the first word is at the front of the queue.</Paragraph>
      <Paragraph position="3"> The algorithm defines two types of parser actions, shift and reduce, explained below: * Shift: A shift action consists only of removing (shifting) the first item (part-of-speech-tagged word) from W (at which point the next word becomes the new first item), and placing it on top of S.</Paragraph>
      <Paragraph position="4"> * Reduce: Reduce actions are subdivided into unary and binary cases. In a unary reduction,  the item on top of S is popped, and a new item is pushed onto S. The new item consists of a tree formed by a non-terminal node with the popped item as its single child. The lexical head of the new item is the same as the lexical head of the popped item. In a binary reduction, two items are popped from S in sequence, and a new item is pushed onto S.</Paragraph>
      <Paragraph position="5"> The new item consists of a tree formed by a non-terminal node with two children: the first item popped from S is the right child, and the second item is the left child. The lexical head of the new item may be the lexical head of its left child, or the lexical head of its right child. If S is empty, only a shift action is allowed. If W is empty, only a reduce action is allowed. If both S and W are non-empty, either shift or reduce actions are possible, and the parser must decide whether to shift or reduce. If it decides to reduce, it must also choose between a unary and a binary reduce, decide what non-terminal should be at the root of the newly created subtree to be pushed onto the stack S, and decide whether the lexical head of the newly created subtree will be taken from the right child or the left child of its root node. Following the work of Sagae and Lavie, we consider the complete set of decisions associated with a reduce action to be part of that reduce action. Parsing terminates when W is empty and S contains only one item; the single item in S is the parse tree for the input string.</Paragraph>
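The deterministic loop described above can be sketched as follows. The `choose_action` callback stands in for whatever mechanism selects parser actions (in the paper, a classifier); its interface and the tuple representation of subtrees are assumptions made for this sketch, not the paper's implementation.

```python
from collections import deque

def parse(tagged_words, choose_action):
    """Single-pass deterministic shift-reduce loop.

    choose_action(S, W) must return ("shift",), ("unary", label), or
    ("binary", label, head_side) with head_side in {"left", "right"}.
    Subtrees are represented as (label, children, lexical_head) tuples.
    """
    S = []                       # stack of subtrees
    W = deque(tagged_words)      # queue of (POS-tag, word) terminals
    while W or len(S) > 1:
        # A shift is forced when S is empty; when W is empty,
        # choose_action must return a reduce.
        action = ("shift",) if not S else choose_action(S, W)
        if action[0] == "shift":
            tag, word = W.popleft()
            S.append((tag, [word], word))          # a terminal heads itself
        elif action[0] == "unary":
            child = S.pop()
            S.append((action[1], [child], child[2]))
        else:                                      # binary reduce
            right, left = S.pop(), S.pop()         # top of stack = right child
            head = left[2] if action[2] == "left" else right[2]
            S.append((action[1], [left, right], head))
    return S[0]
```

For example, a policy that shifts while W is non-empty and then performs a right-headed binary reduce to NP turns "the dog" into a single NP subtree headed by "dog".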
    </Section>
    <Section position="2" start_page="691" end_page="692" type="sub_section">
      <SectionTitle>
2.2 Shift-Reduce Best-First Parsing
</SectionTitle>
      <Paragraph position="0"> A deterministic shift-reduce parser based on the algorithm described in section 2.1 does not handle ambiguity. By choosing a single parser action at each opportunity, the input string is parsed deterministically, and a single constituent structure is built during the parsing process from beginning to end (no other structures are even considered).</Paragraph>
      <Paragraph position="1"> A simple extension to this idea is to eliminate determinism by allowing the parser to choose several actions at each opportunity, creating different paths that lead to different parse trees. This is essentially the difference between deterministic LR parsing (Knuth, 1965) and Generalized-LR parsing (Tomita, 1987; Tomita, 1990). Furthermore, if a probability is assigned to every parser action, the probability of a parse tree can be computed  simply as the product of the probabilities of each action in the path that resulted in that parse tree (the derivation of the tree). This produces a probabilistic shift-reduce parser that resembles a generalized probabilistic LR parser (Briscoe and Carroll, 1993), where probabilities are associated with an LR parsing table. In our case, although there is no LR table, the action probabilities are associated with several aspects of the current state of the parser, which to some extent parallel the information contained in an LR table. Instead of having an explicit LR table and pushing LR states onto the stack, the state of the parser is implicitly defined by the configurations of the stack and queue.</Paragraph>
      <Paragraph position="2"> In a way, there is a parallel between the way modern PCFG-like parsers use Markov grammars as a distribution over possible grammar rules, and the way a statistical model is used in our parser to assign a probability to any transition between parser states (instead of a symbolic LR table).</Paragraph>
      <Paragraph position="3"> Pursuing every possible sequence of parser actions creates a very large space of actions for even moderately sized sentences. To find the most likely parse tree efficiently according to the probabilistic shift-reduce parsing scheme described so far, we use a best-first strategy. This involves an extension of the deterministic shift-reduce algorithm into a best-first shift-reduce algorithm. To describe this extension, we first introduce a new data structure Ti that represents a parser state, which includes a stack Si and a queue Wi. In the deterministic algorithm, we would have a single parser state T that contains S and W. The best-first algorithm, on the other hand, has a heap H containing multiple parser states T1 ... Tn.</Paragraph>
      <Paragraph position="4"> These states are ordered in the heap according to their probabilities, so that the state with the highest probability is at the top. State probabilities are determined by multiplying the probabilities of each of the actions that resulted in that state. Parser actions are determined from and applied to a parser state Ti popped from the top of H. The parser actions are the same as in the deterministic version of the algorithm. When the item popped from the top of the heap H contains a stack Si with a single item and an empty queue (in other words, meets the acceptance criteria for the deterministic version of the algorithm), the item on top of Si is the tree with the highest probability. At that point, parsing terminates if we are searching for the most probable parse. To obtain a list of n-best parses, we simply continue parsing once the first parse tree is found, until either n trees are found, or H is empty.</Paragraph>
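The best-first extension just described can be sketched with a heap keyed on derivation probability (stored negated, since a standard binary heap is a min-heap). Everything here is an illustrative assumption rather than the paper's implementation: `score_actions` stands in for the probabilistic classifier and must propose only legal actions for a state, and the simplified `apply_action` drops lexical heads for brevity.

```python
import heapq
from itertools import count

def apply_action(S, W, action):
    """One shift/reduce step on an immutable (tuple-based) parser state."""
    if action[0] == "shift":
        return S + (W[0],), W[1:]
    if action[0] == "unary":
        return S[:-1] + ((action[1], (S[-1],)),), W
    # binary reduce: the top of the stack becomes the right child
    return S[:-2] + ((action[1], (S[-2], S[-1])),), W

def best_first_parse(tagged_words, score_actions, n_best=1):
    """score_actions(S, W) returns (probability, action) pairs for the
    legal actions in that state.  A state's probability is the product of
    the probabilities of the actions that produced it."""
    tie = count()                 # tie-breaker so heapq never compares states
    H = [(-1.0, next(tie), ((), tuple(tagged_words)))]   # start state, prob 1
    parses = []
    while H and len(parses) < n_best:
        neg_p, _, (S, W) = heapq.heappop(H)
        if not W and len(S) == 1:                        # acceptance criterion
            parses.append((-neg_p, S[0]))
            continue                                     # keep going for n-best
        for p, action in score_actions(S, W):
            heapq.heappush(H, (neg_p * p, next(tie),
                               apply_action(S, W, action)))
    return parses
```

With a scorer that splits probability 0.7/0.3 between two final reduce labels, `n_best=2` returns both parses in decreasing order of probability, illustrating how the n-best list falls out of simply continuing the search.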
      <Paragraph position="5"> We note that this approach does not use dynamic programming, and relies only on the best-first search strategy to arrive at the most probable parse efficiently. Without any pruning of the search space, the distribution of probability mass among different possible actions for a parse state has a large impact on the behavior of the search. We do not use any normalization to account for the size (in number of actions) of different derivations when calculating their probabilities, so it may seem that shorter derivations usually have higher probabilities than longer ones, causing the best-first search to approximate a breadth-first search in practice. However, this is not the case if for a given parser state only a few actions (or, ideally, only one action) have high probability, and all other actions have very small probabilities. In this case, only likely derivations would reach the top of the heap, resulting in the desired search behavior.</Paragraph>
      <Paragraph position="6"> The accuracy of deterministic parsers suggests that these may in fact be the kinds of probabilities a classifier would produce, given features that describe the parser state, and thus the context of the parser action, specifically enough. The experiments described in section 4 support this assumption.</Paragraph>
    </Section>
    <Section position="3" start_page="692" end_page="693" type="sub_section">
      <SectionTitle>
2.3 Classifier-Based Best-First Parsing
</SectionTitle>
      <Paragraph position="0"> To build a parser based on the deterministic algorithm described in section 2.1, a classifier is used to determine parser actions. Sagae and Lavie (2005) built two deterministic parsers this way, one using support vector machines, and one using k-nearest neighbors. In each case, the set of features and classes used with each classifier was the same. Items 1-13 in figure 1 show the features used by Sagae and Lavie. The classes produced by the classifier encode every aspect of a parser action. Classes have one of the following forms:</Paragraph>
      <Paragraph position="2"> SHIFT: denotes a shift action. REDUCE-UNARY-XX: denotes a unary reduce action, where the root of the new subtree pushed onto S is of type XX (where XX is a non-terminal symbol, typically NP, VP, etc.).</Paragraph>
      <Paragraph position="4"> REDUCE-LEFT-XX: denotes a binary reduce action, where the root of the new subtree pushed onto S is of non-terminal type XX. Additionally, the head of the new subtree is the same as the head of the left child of the root node.</Paragraph>
      <Paragraph position="6"> REDUCE-RIGHT-XX: denotes a binary reduce action, where the root of the new subtree pushed onto S is of non-terminal type XX. Additionally, the head of the new subtree is the same as the head of the right child of the root node.</Paragraph>
      <Paragraph position="7"> To implement a parser based on the best-first algorithm, instead of just using a classifier to determine one parser action given a stack and a queue, we need a classification approach that provides us with probabilities for the different parser actions associated with a given parser state. One such approach is maximum entropy classification (Berger et al., 1996), which we use in the form of a library implemented by Tsuruoka and used in his classifier-based parser (Tsuruoka and Tsujii, 2005). We used the same classes and the same features as Sagae and Lavie, plus an additional feature that represents the previous parser action applied to the current parser state (figure 1).</Paragraph>
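Since a maximum entropy classifier's conditional distribution is a softmax over linear feature scores, the way a probability is obtained for every action class can be illustrated as follows. The feature names, weights, and the particular action labels below are invented for the example; this is not the interface of the library actually used.

```python
import math

# Hypothetical action classes of the forms described above.
ACTIONS = ["SHIFT", "REDUCE-UNARY-NP", "REDUCE-LEFT-VP", "REDUCE-RIGHT-NP"]

def action_probabilities(features, weights):
    """P(action | state) = exp(score(action)) / sum_a exp(score(a)),
    where score(action) is a sum of weights for the active binary
    features -- the form a maxent classifier's distribution takes."""
    scores = {a: sum(weights.get((a, f), 0.0) for f in features)
              for a in ACTIONS}
    z = sum(math.exp(s) for s in scores.values())
    return {a: math.exp(s) / z for a, s in scores.items()}
```

The resulting distribution over all action classes is exactly what the best-first search needs: each (probability, action) pair spawns a successor parser state on the heap.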
    </Section>
  </Section>
</Paper>