<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1066"> <Title>Automatic Compensation for Parser Figure-of-Merit Flaws*</Title> <Section position="3" start_page="513" end_page="513" type="metho"> <SectionTitle> 2 Figure of independent merit </SectionTitle> <Paragraph position="0"> (Caraballo and Charniak, 1998) and [Gold98] use a figure which indicates the merit of a given constituent or edge, relative only to itself and its children but independent of the progress of the parse; we will call this the edge's independent merit (IM).</Paragraph> <Paragraph position="1"> The philosophical backing for this figure is that we would like to rank an edge based on the value</Paragraph> <Paragraph position="2"> $$P(N^i_{j,k} \mid t_{0,n}) \quad (1)$$ </Paragraph> <Paragraph position="3"> where $N^i_{j,k}$ represents an edge of type i (NP, S, etc.), which encompasses words j through k-1 of the sentence, and $t_{0,n}$ represents all n part-of-speech tags, from 0 to n-1. (As in the previous research, we simplify by looking at a tag stream, ignoring lexical information.) Given a few basic independence assumptions (Caraballo and Charniak, 1998), this value can be calculated as</Paragraph> <Paragraph position="4"> $$P(N^i_{j,k} \mid t_{0,n}) = \frac{\beta(N^i_{j,k})\,\alpha(N^i_{j,k})}{P(t_{0,n})} \quad (2)$$ </Paragraph> <Paragraph position="5"> with $\beta$ and $\alpha$ representing the well-known "inside" and "outside" probability functions:</Paragraph> <Paragraph position="6"> $$\beta(N^i_{j,k}) = P(t_{j,k} \mid N^i_{j,k}) \quad (3)$$ $$\alpha(N^i_{j,k}) = P(t_{0,j},\, N^i_{j,k},\, t_{k,n}) \quad (4)$$ </Paragraph> <Paragraph position="7"> Unfortunately, the outside probability is not calculable until after a parse is completed.</Paragraph> <Paragraph position="8"> Thus, the IM is an approximation; if we cannot calculate the full outside probability (the probability of this constituent occurring with all the other tags in the sentence), we can at least calculate the probability of this constituent occurring with the previous and subsequent tag. This approximation, as given in (Caraballo and Charniak, 1998), is</Paragraph> <Paragraph position="9"> $$IM(N^i_{j,k}) = \frac{P(N^i_{j,k} \mid t_{j-1})\,\beta(N^i_{j,k})\,P(t_k \mid N^i_{j,k})}{P(t_{j,k} \mid t_{j-1})\,P(t_k \mid t_{k-1})} \quad (5)$$ </Paragraph> <Paragraph position="10"> Of the five values required, $P(N^i_{j,k} \mid t_{j-1})$, $P(t_k \mid t_{k-1})$, and $P(t_k \mid N^i_{j,k})$ can be observed directly from the training data; the inside probability is estimated using the most probable parse for $N^i_{j,k}$, and the tag sequence probability $P(t_{j,k} \mid t_{j-1})$ is estimated using a bitag approximation.</Paragraph> <Paragraph position="11"> Two different probability distributions are used in this estimate, and the PCFG probabilities in the numerator tend to be a bit lower than the bitag probabilities in the denominator; this is more of a factor in larger constituents, so the figure tends to favour the smaller ones. To adjust the distributions to counteract this effect, we will use a normalisation constant $\eta$ as in [Gold98].</Paragraph> <Paragraph position="12"> Effectively, the inside probability $\beta$ is multiplied by $\eta^{k-j}$, preventing the discrepancy and hence the preference for shorter edges.</Paragraph> <Paragraph position="13"> In this paper we will use $\eta = 1.3$ throughout; this is the factor by which the two distributions differ, and was also empirically shown (in [Gold98]) to be the best tradeoff between number of popped edges and accuracy.</Paragraph>
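<Paragraph position="14"> To make the computation concrete, here is a minimal sketch of how the normalised IM of equation (5) might be assembled from trained statistics. It is illustrative only: the `model` tables, `edge` fields, and the boundary-tag handling are hypothetical stand-ins, since the paper does not specify its implementation.
```python
ETA = 1.3        # normalisation constant eta, as in [Gold98]
BOUNDARY = "<s>" # sentence-boundary tag; an assumption of this sketch

def independent_merit(edge, model, tags):
    """Sketch of the normalised IM of equation (5).

    Hypothetical inputs:
      model.p_nt_after_tag[i][t]  -- P(N^i | t_{j-1})
      model.p_tag_after_nt[t][i]  -- P(t_k | N^i)
      model.bitag[t1][t2]         -- P(t2 | t1)
      edge.inside                 -- beta, estimated from the most
                                     probable parse of the span
    """
    j, k, i = edge.start, edge.end, edge.label
    tag = lambda idx: tags[idx] if 0 <= idx < len(tags) else BOUNDARY

    numerator = (model.p_nt_after_tag[i][tag(j - 1)]
                 * ETA ** (k - j) * edge.inside  # eta^{k-j} counters the short-edge bias
                 * model.p_tag_after_nt[tag(k)][i])

    # Bitag estimate of P(t_{j,k} | t_{j-1}), times the boundary factor P(t_k | t_{k-1}).
    denominator = model.bitag[tag(k - 1)][tag(k)]
    for l in range(j, k):
        denominator *= model.bitag[tag(l - 1)][tag(l)]
    return numerator / denominator
```
</Paragraph>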
</Section> <Section position="4" start_page="513" end_page="514" type="metho"> <SectionTitle> 3 Finding FOM flaws </SectionTitle> <Paragraph position="0"> Clearly, any improvement to be had would need to come through eliminating the incorrect edges before they are popped from the agenda--that is, improving the figure of merit. We observed that the FOMs used tended to cause the algorithm to spend too much time in one area of a sentence, generating multiple parses for the same substring, before it would generate even one parse for another area. The reason for this is that figures of independent merit are frequently good as relative measures for ranking different parses of the same section of the sentence, but not so good as absolute measures for ranking parses of different substrings.</Paragraph> <Paragraph position="1"> For instance, if the word "there" as an NP in "there's a hole in the bucket" had a low probability, it would tend to hold up the parsing of the sentence; since the bitag probability of "there" occurring at the beginning of a sentence is very high, the denominator of the IM would overbalance the numerator. (Note that this is a contrived example--the actual problem cases are more obscure.) Of course, a different figure of independent merit might have different characteristics, but with many of them there will be cases where the figure is flawed, causing a single, vital edge to remain on the agenda while the parser 'thrashes' around in other parts of the sentence with higher IM values.</Paragraph> <Paragraph position="2"> We could characterise this observation as follows: Postulate 1 The longer an edge stays in the agenda without any competitors, the more likely it is to be correct (even if it has a low figure of independent merit).</Paragraph> <Paragraph position="3"> A better figure, then, would take into account whether a given piece of text had already been parsed or not. We took two approaches to finding such a figure.</Paragraph>
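<Paragraph position="4"> For reference, the edge-based best-first architecture that both of our approaches modify can be sketched as below. The grammar and edge interfaces (`lexical_edges`, `extend`, `covers`) are hypothetical names chosen for exposition; what matters is the agenda discipline: the highest-FOM edge is always popped next, which is exactly what lets a flawed FOM strand a vital low-scoring edge while the parser thrashes elsewhere.
```python
import heapq
import itertools

def best_first_parse(tags, grammar, fom):
    """Skeleton of edge-based best-first chart parsing (hypothetical API).

    The agenda is a max-priority queue on the figure of merit; the chart
    holds edges already committed to. Parsing stops at the first edge
    labelled S that covers the whole sentence.
    """
    tie = itertools.count()  # tie-breaker so edges themselves are never compared
    agenda, chart, popped = [], set(), 0

    def push(edge):
        heapq.heappush(agenda, (-fom(edge), next(tie), edge))

    for j, tag in enumerate(tags):           # seed the agenda with lexical edges
        for edge in grammar.lexical_edges(tag, j):
            push(edge)

    while agenda:
        _, _, edge = heapq.heappop(agenda)   # edge with the highest figure of merit
        popped += 1
        if edge.label == "S" and edge.covers(0, len(tags)):
            return edge, popped              # first full parse
        if edge not in chart:
            chart.add(edge)
            for new_edge in grammar.extend(edge, chart):
                push(new_edge)
    return None, popped
```
</Paragraph>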
</Section> <Section position="5" start_page="514" end_page="517" type="metho"> <SectionTitle> 4 Compensating for flaws </SectionTitle> <Section position="1" start_page="514" end_page="515" type="sub_section"> <SectionTitle> 4.1 Experiment 1: Table lookup </SectionTitle> <Paragraph position="0"> In one approach to the problem, we tried to start our program with no extra information and train it statistically to counter the problem mentioned in the previous section.</Paragraph> <Paragraph position="1"> There are four values mentioned in Postulate 1: correctness, time (amount of work done), number of competitors, and figure of independent merit. We defined them as follows:</Paragraph> <Paragraph position="2"> Correctness. The obvious definition is that an edge $N^i_{j,k}$ is correct if a constituent $N_{j,k}$ appears in the parse given in the treebank. There is an unobvious but unfortunate consequence of choosing this definition, however; in many cases (especially with larger constituents), the "correct" rule appears just once in the entire corpus, and is thus considered too unlikely to be chosen by the parser as correct. If the "correct" parse were never achieved, we wouldn't have any statistic at all as to the likelihood of the first, second, or third competitor being better than the others. If we define "correct" for the purpose of statistics-gathering as "in the MAP parse", the problem is diminished. Both definitions were tried for gathering statistics, though of course only the first was used for measuring accuracy of output parses.</Paragraph> <Paragraph position="3"> Work. Here, the most logical measure for amount of work done is the number of edges popped off the agenda. We use it both because it is conveniently processor-independent and because it offers us a tangible measure of perfection (47.5 edges--the average number of edges in the correct parse of a sentence).</Paragraph> <Paragraph position="4"> Competitorship. At the most basic level, the competitors of a given edge $N^i_{j,k}$ would be all those edges $N^{i'}_{m,n}$ such that $m \le j$ and $n \ge k$. Initially we only considered an edge a 'competitor' if it met this definition and was already in the chart; later we tried considering an edge to be a competitor if it had a higher independent merit, no matter whether it be in the agenda or the chart. We also tried a hybrid of the two.</Paragraph> <Paragraph position="5"> Merit. The independent merit of an edge is defined in section 2. Unlike earlier work, which used what we call "Independent Merit" as the FOM for parsing, we use this figure as just one of many sources of information about a given edge.</Paragraph> <Paragraph position="6"> Given our postulate, the ideal figure of merit would be $$P(\mathrm{correct} \mid W, C, IM). \quad (6)$$ We can save information about this probability for each edge in every parse; but to be useful in a statistical model, the IM must first be discretised, and all three prior statistics need to be grouped, to avoid sparse data problems. We bucketed all three logarithmically, with bases 4, 2, and 10, respectively. This gives us the following approximation:</Paragraph> <Paragraph position="7"> $$P(\mathrm{correct} \mid W, C, IM) \approx P(\mathrm{correct} \mid \lfloor\log_4 W\rfloor, \lfloor\log_2 C\rfloor, \lfloor\log_{10} IM\rfloor) \quad (7)$$ </Paragraph> <Paragraph position="8"> To somewhat counteract the effect of discretising the IM figure, each time we needed to calculate a figure of merit, we looked up the table entry on either side of the IM and interpolated. Thus the actual value used as a figure of merit was that given in equation (8), writing $w = \lfloor\log_4 W\rfloor$, $c = \lfloor\log_2 C\rfloor$, $b = \lfloor\log_{10} IM\rfloor$, and $\lambda = \log_{10} IM - b$:</Paragraph> <Paragraph position="9"> $$FOM = (1-\lambda)\,P(\mathrm{correct} \mid w, c, b) + \lambda\,P(\mathrm{correct} \mid w, c, b+1) \quad (8)$$ </Paragraph> <Paragraph position="10"> Each trial consisted of a training run and a testing run. The training runs consisted of using a grammar induced on treebank sections 2-21 to run the edge-based best-first algorithm (with the IM alone as figure of merit) on section 24, collecting the statistics along the way. It seems relatively obvious that each edge should be counted when it is created. But our postulate involves edges which have stayed on the agenda for a long time without accumulating competitors; thus we wanted to update our counts when an edge happened to get more competitors, and as time passed. Whenever the number of edges popped crossed into a new logarithmic bucket (i.e. whenever it passed a power of four), we re-counted every edge in the agenda in that new bucket. In addition, when the number of competitors of a given edge passed a bucket boundary (power of two), that edge would be re-counted. In this manner, we had a count of exactly how many edges--correct or not--had a given IM and a given number of competitors at a given point in the parse.</Paragraph> <Paragraph position="11"> Already at this stage we found strong evidence for our postulate. We were paying particular attention to those edges with a low IM and zero competitors, because those were the edges that were causing problems when the parser ignored them. When, considering this subset of edges, we looked at a graph of the percentage of edges in the agenda which were correct, we saw an increase of orders of magnitude as work increased--see Figure 1.</Paragraph> <Paragraph position="12"> [Figure 1: Proportion of agenda edges correct vs. work.] </Paragraph> <Paragraph position="13"> For the testing runs, then, we used as figure of merit the value in expression (8). Aside from that change, we used the same edge-based best-first parsing algorithm as before. The test runs were all made on treebank section 22, with all sentences longer than 40 words thrown out; thus our results can be directly compared to those in the previous work.</Paragraph> <Paragraph position="14"> We made several trials, using different definitions of 'correct' and 'competitor', as described above. Some performed much better than others, as seen in Table 1, which gives our results, both in terms of accuracy and speed, as compared to the best previous result, given in [Gold98]. The trial descriptions refer back to the multiple definitions given for 'correct' and 'competitor' at the beginning of this section. While our best speed improvement (48.6% of the previous minimum) was achieved with the first run, it is associated with a significant loss in accuracy. Our best results overall, listed in the last row of the table, let us cut the edge count by almost half while reducing labelled precision/recall by only 0.24%.</Paragraph>
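<Paragraph position="15"> A minimal sketch of the table-lookup figure of merit follows, assuming a trained `table` mapping (work bucket, competitor bucket, IM bucket) to the observed fraction of correct edges. The bucket bases 4, 2, and 10 and the interpolation over adjacent IM buckets follow equations (7) and (8); the function names and the fallback for unseen buckets are illustrative assumptions.
```python
import math

def buckets(work, competitors, im):
    """Logarithmic bucketing with bases 4, 2, and 10 (equation 7)."""
    w = int(math.log(work, 4)) if work > 0 else 0
    c = int(math.log(competitors, 2)) if competitors > 0 else -1  # keep zero competitors distinct
    b = math.floor(math.log10(im))  # im < 1, so the IM bucket index is negative
    return w, c, b

def table_fom(im, work, competitors, table):
    """Interpolated estimate of P(correct | W, C, IM) (equation 8)."""
    w, c, b = buckets(work, competitors, im)
    lam = math.log10(im) - b                 # position inside the IM bucket, in [0, 1)
    p_lo = table.get((w, c, b), 0.0)         # fallback value for unseen buckets is an assumption
    p_hi = table.get((w, c, b + 1), p_lo)
    return (1.0 - lam) * p_lo + lam * p_hi
```
</Paragraph>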
</Section> <Section position="2" start_page="515" end_page="517" type="sub_section"> <SectionTitle> 4.2 Experiment 2: Demeriting </SectionTitle> <Paragraph position="0"> We hoped, however, that we might be able to find a way to simplify the algorithm such that it would be easier to implement and/or faster to run, without sacrificing accuracy.</Paragraph> <Paragraph position="1"> To that end, we looked over the data, viewing it as (among other things) a series of "planes" seen by setting the amount of work constant (see Figure 2). Viewed like this, the original algorithm behaves like a scan line, parallel to the competitor axis, scanning for the one edge with the highest figure of (independent) merit. However, one look at Figure 2 dramatically confirms our postulate that an edge with zero competitors can have an IM orders of magnitude lower than an edge with many competitors, and still be more likely to be correct. Effectively, then, under the table lookup algorithm, the scan line is not parallel to the competitor axis, but rather angled so that the low-IM low-competitor items pass the scan before the high-IM high-competitor items. This can be simulated by multiplying each edge's independent merit by a demeriting factor $\delta$ per competitor (thus a total of $\delta^c$). Its exact value would determine the steepness of the scan line.</Paragraph> <Paragraph position="2"> Each trial consisted of one run, an edge-based best-first parse of treebank section 22 (with sentences longer than 40 words thrown out, as before), using the new figure of merit: $$FOM(N^i_{j,k}) = \delta^c \cdot \frac{P(N^i_{j,k} \mid t_{j-1})\,\eta^{k-j}\,\beta(N^i_{j,k})\,P(t_k \mid N^i_{j,k})}{P(t_{j,k} \mid t_{j-1})\,P(t_k \mid t_{k-1})} \quad (9)$$ </Paragraph> <Paragraph position="3"> (Footnote 2: Previous work has shown that the parser performs better if it runs slightly past the first parse; so for every run referenced in this paper, the parser was allowed to run to first parse plus a tenth. All reported final counts for popped edges are thus 1.1 times the count at first parse.)</Paragraph> <Paragraph position="4"> This idea works extremely well. It is, predictably, easier to implement; somewhat surprisingly, though, it actually performs better than the method it approximates. When $\delta = 0.7$, for instance, the accuracy loss is only 0.28%, comparable to the table lookup result, but the number of edges popped drops to just 91.23, or 39.7% of the prior result found in [Gold98]. Using other demeriting factors gives similarly dramatic decreases in edge count, with varying effects on accuracy--see Figures 3 and 4.</Paragraph>
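<Paragraph position="5"> The demerited figure of merit itself reduces to a one-line adjustment of the IM. A sketch, reusing the `independent_merit` function sketched in section 2 together with a hypothetical `count_competitors` helper implementing whichever competitor definition is in force: each competitor multiplies the score by $\delta < 1$, which is what tilts the scan line toward low-competitor edges.
```python
def demerited_fom(edge, model, tags, delta=0.7):
    """Equation (9): the normalised IM scaled by delta^c."""
    c = count_competitors(edge)  # hypothetical helper; see the definitions in section 4.1
    return delta ** c * independent_merit(edge, model, tags)
```
</Paragraph>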
<Paragraph position="6"> It is not immediately clear why demeriting improves performance so dramatically over the table lookup method. One possibility is that the statistical method runs into too many sparse data problems around the fringe of the data set--were we able to use a larger data set, we might see the statistics approach the curve defined by the demeriting. Another is that the bucketing is too coarse, although the interpolation along the independent merit axis would seem to mitigate that problem.</Paragraph> </Section> </Section> </Paper>