<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2089">
  <Title>A Best-First Probabilistic Shift-Reduce Parser</Title>
  <Section position="6" start_page="693" end_page="696" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We evaluated our classifier-based best-first parser on the Wall Street Journal corpus of the Penn Tree-bank (Marcus et al., 1993) using the standard split: sections 2-21 were used for training, section 22 was used for development and tuning of parameters and features, and section 23 was used for testing. Every experiment reported here was performed on a Pentium4 3.2GHz with 2GB of RAM.</Paragraph>
    <Paragraph position="1"> Each tree in the training set had empty-node and function tag information removed, and the trees were lexicalized using the same head-table rules as in the Collins (1999) parser (these rules were taken from Bikel's (2002) implementation of the Collins parser). The trees were then converted into trees containing only unary and binary productions, us- null ing the binarization transform described in section 2. Classifier training instances of features paired  with classes (parser actions) were extracted from the trees in the training set, and the total number of training instances was about 1.9 million. It is interesting to note that the procedure of training the best-first parser is identical to the training of a deterministic version of the parser: the deterministic  Let: S(n) denote the nth item from the top of the stack S, and W(n) denote the nth item from the front of the queue W. Features:  S(0) and S(1) 9. The number of lexical items (words) that have been found (so far) to be dependents of the head-words of: S(0), and S(1) 10. The most recently found lexical dependent of the head-word of S(0) that is to the left of S(0)'s head 11. The most recently found lexical dependent of the head-word of S(0) that is to the right of S(0)'s head 12. The most recently found lexical dependent of the head-word of S(1) that is to the left of S(1)'s head 13. The most recently found lexical dependent of the head-word of S(1) that is to the right of S(1)'s head 14. The previous parser action applied to the current parser state  features described in items 1 - 7 are more directly related to the lexicalized constituent trees that are built during parsing, while the features described in items 8 - 13 are more directly related to the dependency structures that are built simultaneously to the constituent structures.  algorithm is simply run over all sentences in the training set, and since the correct trees are known in advance, we can simply record the features and correct parser actions that lead to the construction of the correct tree.</Paragraph>
    <Paragraph position="2"> Training the maximum entropy classifier with such a large number (1.9 million) of training instances and features required more memory than was available (the maximum training set size we were able to train with 2GB of RAM was about 200,000 instances), so we employed the training set splitting idea used by Yamada and Matsumoto (2003) and Sagae and Lavie (2005). In our case, we split the training data according to the part-of-speech (POS) tag of the head-word of the item on top of the stack, and trained each split of the training data separately. At run-time, every trained classifier is loaded, and the choice of classifier to use is made by looking at the head-word of the item on top of the stack in the current parser state. The total training time (a single machine was used and each classifier was trained in series) was slightly under nine hours. For comparison, Sagae and Lavie (2005) report that training support vector machines for one-against-all multi-class classification on the same set of features for their deterministic parser took 62 hours, and training a k-nearest neighbors classifier took 11 minutes.</Paragraph>
    <Paragraph position="3"> When given perfectly tagged text (gold part-of-speech tags extracted from the Penn Treebank), our parser has labeled constituent precision and recall of 89.40% and 88.79% respectively over all sentences in the test set, and 90.01% and 89.32% over sentences with length of at most 40 words.</Paragraph>
    <Paragraph position="4"> These results are at the same level of accuracy as those obtained with other state-of-the-art statistical parsers, although still well below the best published results for this test set (Bod, 2003; Charniak and Johnson, 2005). Although the parser is quite accurate, parsing the test set took 41 minutes. By implementing a very simple pruning strategy, the parser can be made much faster. Pruning the search space is done by only adding a new parser state to the heap if its probability is greater than 1/b of the probability of the most likely state in the heap that has had the same number of parser actions. By setting b to 50, the parser's accuracy is only affected minimally, and we obtain 89.3% precision and 88.7% recall, while parsing the test set in slightly under 17 minutes and taking less than 60 megabytes of RAM. Under the same conditions, but using automatically assigned part-of-speech tags (at 97.1% accuracy) using the SVM-Tool tagger (Gimenez and Marquez, 2004), we obtain 88.1% precision and 87.8% recall. It is likely that the deterioration in accuracy is aggravated by the training set splitting scheme based on POS tags.</Paragraph>
    <Paragraph position="5"> A deterministic version of our parser, obtained by simply taking the most likely parser action as the only action at each step (in other words, by setting b to 1), has precision and recall of 85.4% and 84.8%, respectively (86.5% and 86.0% using gold-standard POS tags). More interestingly, it parses all 2,416 sentences (more than 50,000 words) in only 46 seconds, 10 times faster than the deterministic SVM parser of Sagae and Lavie (2005).</Paragraph>
    <Paragraph position="6"> The parser of Tsuruoka and Tsujii (Tsuruoka and Tsujii, 2005) has comparable speed, but we obtain more accurate results. In addition to being fast, our deterministic parser is also lean, requiring only about 25 megabytes of RAM.</Paragraph>
    <Paragraph position="7"> A summary of these results is shown in table 1, along with the results obtained with other parsers for comparison purposes. The figures shown in table 1 only include experiments using automatically assigned POS tags. Results obtained with gold-standard POS tags are not shown, since they serve little purpose in a comparison with existing parsers. Although the time figures reflect the performance of each parser at the stated level of accuracy, all of the search-based parsers can trade accuracy for increased speed. For example, the Charniak parser can be made twice as fast at the cost of a 0.5% decrease in precision/recall, or ten times as fast at the cost of a 4% decrease in precision/recall (Roark and Charniak, 2002).</Paragraph>
    <Section position="1" start_page="695" end_page="696" type="sub_section">
      <SectionTitle>
4.1 Reranking with the Probabilistic
Shift-Reduce Model
</SectionTitle>
      <Paragraph position="0"> One interesting aspect of having an accurate parsing model that is significantly different from other well-known generative models is that the combination of two accurate parsers may produce even more accurate results. A probabilistic shift-reduce LR-like model, such as the one used in our parser, is different in many ways from a lexicalized PCFG-like model (using markov a grammar), such as those used in the Collins (1999) and Charniak (2000) parsers. In the probabilistic LR model, probabilities are assigned to tree  the test set. We first show results for the parsers described here, then for four of the most accurate or most widely known parsers, for the Ratnaparkhi maximum entropy parser, and finally for three recent classifier-based parsers. For the purposes of direct comparisons, only results obtained with automatically assigned part-of-speech tags are shown (tags are assigned by the parser itself or by a separate part-of-speech tagger). * Times reported by authors running on different hardware. derivations (not the constituents themselves) based on the sequence of parser shift/reduce actions.</Paragraph>
      <Paragraph position="1"> PCFG-like models, on the other hand, assign probabilities to the trees directly. With models that differ in such fundamental ways, it is possible that the probabilities assigned to different trees are independent enough that even a very simple combination of the two models may result in increased accuracy.</Paragraph>
      <Paragraph position="2"> We tested this hypothesis by using the Charniak (2000) parser in n-best mode, producing the top 10 trees with corresponding probabilities. We then rescored the trees produced by the Charniak parser using our probabilistic LR model, and simply multiplied the probabilities assigned by the Charniak model and our LR model to get a combined score for each tree2. On development data this resulted in a 1.3% absolute improvement in f-score over the 1-best trees produced by the Charniak parser. On the test set (WSJ Penn Treebank section 23), this reranking scheme produces precision of 90.9% and recall of 90.7%, for an f-score of 90.8%.</Paragraph>
      <Paragraph position="3"> 2The trees produced by the Charniak parser may include the part-of-speech tags AUX and AUXG, which are not part of the original Penn Treebank tagset. See (Charniak, 2000) for details. These are converted deterministically into the appropriate Penn Treebank verb tags, possibly introducing a small number of minor POS tagging errors. Gold-standard tags or the output of a separate part-of-speech tagger are not used at any point in rescoring the trees.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>