<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1506">
  <Title>Better k-best Parsing</Title>
  <Section position="9" start_page="58" end_page="59" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> We report results from two sets of experiments. For probabilistic parsing, we implemented Algorithms 0, 1, and 3 on top of a widely-used parser (Bikel, 2004) and conducted experiments on parsing efficiency and the quality of the k-best-lists. We also implemented Algorithms 2 and 3 in a parsing-based MT decoder (Chiang, 2005) and report results on decoding speed.</Paragraph>
    <Section position="1" start_page="58" end_page="59" type="sub_section">
      <SectionTitle>
5.1 Experiment 1: Bikel Parser
</SectionTitle>
      <Paragraph position="0"> Bikel's parser (2004) is a state-of-the-art multilingual parser based on lexicalized context-free models (Collins, 2003; Eisner, 2000). It does support k-best parsing, but, following Collins' parse-reranking work (Collins, 2000) (see also Section 5.1.2), it accomplishes this by simply abandoning dynamic programming, i.e., no items are considered equivalent (Charniak and Johnson, 2005).</Paragraph>
      <Paragraph position="1"> Theoretically, the time complexity is exponential in n (the input sentence length) and constant in k, since, without merging of equivalent items, there is no limit on the number of items in the chart. In practice, beam search is used to reduce the observed time.5 But with the standard beam width of 10[?]4, this method becomes prohibitively expensive for n , 25 on Bikel's parser. Collins (2000) used a narrower 10[?]3 beam and further applied a cell limit of 100,6 but, as we will show below, this has a detrimental effect on the quality of the output. We therefore omit this method from our speed comparisons, and use our implementation of Algorithm 0 (nacurrency1 ve) as the baseline.</Paragraph>
      <Paragraph position="2"> We implemented our k-best Algorithms 0, 1, and 3 on top of Bikel's parser and conducted experiments on a 2.4 GHz 64-bit AMD Opteron with 32 GB memory. The program is written in Java 1.5 running on the Sun JVM in server mode with a maximum heap size of 5 GB. For this experiment, we used sections 02 21 of the Penn Tree-bank (PTB) (Marcus et al., 1993) as the training data and section 23 (2416 sentences) for evaluation, as is now standard. We ran Bikel's parser using its settings to emulate Model 2 of (Collins, 2003).</Paragraph>
      <Paragraph position="3">  We tested our algorithms under various conditions. We rst did a comparison of the average parsing time per sentence of Algorithms 0, 1, and 3 on section 23, with k * 10000 for the standard beam of width 10[?]4. Figure 8(a) shows that the parsing speed of Algorithm 3 improved dramatically against the other algorithms and is nearly constant in k, which exactly matches the complexity analysis. Algorithm 1 (k log k) also signi cantly out-performs the baseline nacurrency1 ve algorithm (k2 log k). We also did a comparison between our Algorithm 3 and the Jim*enez and Marzal algorithm in terms of average  heap size. Figure 8(b) shows that for larger k, the two algorithms have the same average heap size, but for smaller k, our Algorithm 3 has a considerably smaller average heap size. This difference is useful in applications where only short k-best lists are needed. For example, McDonald et al. (2005) nd that k = 5 gives optimal parsing accuracy.</Paragraph>
      <Paragraph position="4">  Our efficient k-best algorithms enable us to search over a larger portion of the whole search space (e.g. by less aggressive pruning), thus producing k-best lists with better quality than previous methods. We demonstrate this by comparing our k-best lists to those in (Ratnaparkhi, 1997), (Collins, 2000) and the parallel work by Charniak and Johnson (2005) in several ways, including oracle reranking and average number of found parses.</Paragraph>
      <Paragraph position="5"> Ratnaparkhi (1997) introduced the idea of oracle reranking: suppose there exists a perfect reranking scheme that magically picks the best parse that has the highest F-score among the top k parses for each sentence.</Paragraph>
      <Paragraph position="6"> Then the performance of this oracle reranking scheme is the upper bound of any actual reranking system like (Collins, 2000).As k increases, the F-score is nondecreasing, and there is some k (which might be very large) at which the F-score converges.</Paragraph>
      <Paragraph position="7"> Ratnaparkhi reports experiments using oracle reranking with his statistical parser MXPARSE, which can compute its k-best parses (in his experiments, k = 20).</Paragraph>
      <Paragraph position="8"> Collins (2000), in his parse-reranking experiments, used his Model 2 parser (Collins, 2003) with a beam width of 10[?]3 together with a cell limit of 100 to obtain k-best lists; the average number of parses obtained per sentence was 29.2, the maximum, 101.7 Charniak and Johnson (2005) use coarse-to- ne parsing on top of the Charniak (2000) parser and get 50-best lists for section 23.</Paragraph>
      <Paragraph position="9"> Figure 9(a) compares the results of oracle reranking.</Paragraph>
      <Paragraph position="10"> Collins' curve converges at around k = 50 while ours continues to increase. With a beam width of 10[?]4 and k = 100, our parser plus oracle reaches an F-score of 96.4%, compared to Collins' 94.9%. Charniak and Johnson's work, however, is based on a completely different parser whose 1-best F-score is 1.5 points higher than the 1-bests of ours and Collins', making it difficult to compare in absolute numbers. So we instead compared the relative improvement over 1-best. Figure 9(b) shows that our work has the largest percentage of improvement in terms of F-score when k &gt; 20.</Paragraph>
      <Paragraph position="11"> To further explore the impact of Collins' cell limit on the quality of k-best lists, we plotted average number of parses for a given sentence length (Figure 10). Generally speaking, as input sentences get longer, the number of parses grows (exponentially). But we see that the curve for Collins' k-best list goes down for large k (&gt; 40). We suspect this is due to the cell limit of 100 pruning away potentially good parses too early in the chart. As sentences get longer, it is more likely that a lower-probability parse might contribute eventually to the k-best parses. So we infer that Collins' k-best lists have limited quality for large k, and this is demonstrated by the early convergence of its oracle-reranking score. By comparison, our curves of both beam widths continue to grow with k = 100.</Paragraph>
      <Paragraph position="12"> All these experiments suggest that our k-best parses are of better quality than those from previous k-best parsers, 7The reason the maximum is 101 and not 100 is that Collins merged the 100-best list using a beam of 10!3 with the 1-best list using a beam of 10!4 (Collins, p.c.).</Paragraph>
      <Paragraph position="13">  This work with beam width 10-4 This work with beam width 10-3 (Collins, 2000) with beam width 10-3  ine) on MT decoding task. Average time (both excluding initial 1-best phase) vs. k (log-log).</Paragraph>
      <Paragraph position="14"> and similar quality to those from (Charniak and Johnson, 2005) which has so far the highest F-score after reranking, and this might lead to better results in real parse reranking.</Paragraph>
    </Section>
    <Section position="2" start_page="59" end_page="59" type="sub_section">
      <SectionTitle>
5.2 Experiment 2: MT decoder
</SectionTitle>
      <Paragraph position="0"> Our second experiment was on a CKY-based decoder for a machine translation system (Chiang, 2005), implemented in Python 2.4 accelerated with Psyco 1.3 (Rigo, 2004). We implemented Algorithms 2 and 3 to compute k-best English translations of Mandarin sentences. Because the CFG used in this system is large to begin with (millions of rules), and then effectively intersected with a nite-state machine on the English side (the language model), the grammar constant for this system is quite large. The decoder uses a relatively narrow beam search for efficiency.</Paragraph>
      <Paragraph position="1"> We ran the decoder on a 2.8 GHz Xeon with 4 GB of memory, on 331 sentences from the 2002 NIST MTEval test set. We tested Algorithm 2 for k = 2i,3 * i * 10, and Algorithm 3 (offline algorithm) for k = 2i,3 * i * 20.</Paragraph>
      <Paragraph position="2"> For each sentence, we measured the time to calculate the k-best list, not including the initial 1-best parsing phase. We then averaged the times over our test set to produce the graph of Figure 11, which shows that Algorithm 3 runs an average of about 300 times faster than Algorithm 2. Furthermore, we were able to test Algorithm 3 up to k = 106 in a reasonable amount of time.8 8The curvature in the plot for Algorithm 3 for k &lt; 1000 may be due to lack of resolution in the timing function for short times.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>