<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1010">
  <Title>What is the Minimal Set of Fragments that Achieves Maximal Parse Accuracy?</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The DOP1 Model
</SectionTitle>
    <Paragraph position="0"> To-date, the Data Oriented Parsing model has mainly been applied to corpora of trees whose labels consist of primitive symbols (but see Bod &amp; Kaplan 1998; Bod 2000c, 2001). Let us illustrate the original DOP model presented in Bod (1993), called DOP1, with a simple example.</Paragraph>
    <Paragraph position="1"> Assume a corpus consisting of only two trees:  New sentences may be derived by combining fragments, i.e. subtrees, from this corpus, by means of a node-substitution operation indicated as deg . Node-substitution identifies the leftmost nonterminal frontier node of one subtree with the root node of a second subtree (i.e., the second subtree is substituted on the leftmost nonterminal frontier node of the first subtree). Thus a new sentence such as Mary likes Susan can be derived by combining subtrees from this corpus:  DOP1 computes the probability of a subtree t as the probability of selecting t among all corpus subtrees that can be substituted on the same node as t. This probability is equal to the number of occurrences of t,  |t |, divided by the total number of occurrences of all subtrees t' with the same root label as t. Let r(t) return the root label of t. Then we may write:</Paragraph>
    <Paragraph position="3"> In most applications of DOP1, the subtree probabilities are smoothed by the technique described in Bod (1996) which is based on Good-Turing. (The subtree probabilities are not smoothed by backing off to smaller subtrees, since these are taken into account by the parse tree probability, as we will see.) The probability of a derivation t</Paragraph>
    <Paragraph position="5"> As we have seen, there may be several distinct derivations that generate the same parse tree. The probability of a parse tree T is thus the sum of the probabilities of its distinct derivations. Let t id be the i-th subtree in the derivation d that produces tree T, then the probability of T is given by</Paragraph>
    <Paragraph position="7"> Thus the DOP1 model considers counts of subtrees of a wide range of sizes in computing the probability of a tree: everything from counts of single-level rules to counts of entire trees. This means that the model is sensitive to the frequency of large subtrees while taking into account the smoothing effects of counts of small subtrees.</Paragraph>
    <Paragraph position="8"> Note that the subtree probabilities in DOP1 are directly estimated from their relative frequencies. A number of alternative subtree estimators have been proposed for DOP1 (cf. Bonnema et al 1999), including maximum likelihood estimation (Bod 2000b). But since the relative frequency estimator has so far not been outper formed by any other estimator for DOP1, we will stick to this estimator in the current paper.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="1" type="metho">
    <SectionTitle>
3 Computational Issues
</SectionTitle>
    <Paragraph position="0"> Bod (1993) showed how standard chart parsing techniques can be applied to DOP1. Each corpussubtree t is converted into a context-free rule r where the lefthand side of r corresponds to the root label of t and the righthand side of r corresponds to the frontier labels of t. Indices link the rules to the original subtrees so as to maintain the subtree's internal structure and probability.</Paragraph>
    <Paragraph position="1"> These rules are used to create a derivation forest for a sentence (using a CKY parser), and the most probable parse is computed by sampling a sufficiently large number of random derivations from the forest (&amp;quot;Monte Carlo disambiguation&amp;quot;, see Bod 1998). While this technique has been successfully applied to parsing the ATIS portion in the Penn Treebank (Marcus et al. 1993), it is extremely time consuming. This is mainly because the number of random derivations that should be sampled to reliably estimate the most probable parse increases exponentially with the sentence length (see Goodman 1998). It is therefore questionable whether Bod's sampling technique can be scaled to larger domains such as the WSJ portion in the Penn Treebank.</Paragraph>
    <Paragraph position="2"> Goodman (1996, 1998) showed how DOP1 can be reduced to a compact stochastic context-free grammar (SCFG) which contains exactly eight SCFG rules for each node in the training set trees. Although Goodman's method does still not allow for an efficient computation of the most probable parse (in fact, the problem of computing the most probable parse in DOP1 is NP-hard -see Sima'an 1999), his method does allow for an efficient computation of the &amp;quot;maximum constituents parse&amp;quot;, i.e. the parse tree that is most likely to have the largest number of correct constituents.</Paragraph>
    <Paragraph position="3"> Goodman has shown on the ATIS corpus that the maximum constituents parse performs at least as well as the most probable parse if all subtrees are used. Unfortunately, Goodman's reduction method is only beneficial if indeed all subtrees are used. Sima'an (1999: 108) argues that there may still be an isomorphic SCFG for DOP1 if the corpus-subtrees are restricted in size or lexicalization, but that the number of the rules explodes in that case.</Paragraph>
    <Paragraph position="4"> In this paper we will use Bod's subtree-torule conversion method for studying the impact of various subtree restrictions on the WSJ corpus. However, we will not use Bod's Monte Carlo sampling technique from complete derivation forests, as this turned out to be prohibitive for WSJ sentences. Instead, we employ a Viterbi n-best search using a CKY algorithm and estimate the most probable parse from the 1,000 most probable derivations, summing up the probabilities of derivations that generate the same tree. Although this heuristic does not guarantee that the most probable parse is actually found, it is shown in Bod (2000a) to perform at least as well as the estimation of the most probable parse with Monte Carlo techniques. However, in computing the 1,000 most probable derivations by means of Viterbi it is prohibitive to keep track of all subderivations at each edge in the chart (at least for such a large corpus as the WSJ). As in most other statistical parsing systems we therefore use the pruning technique described in Goodman (1997) and Collins (1999: 263-264) which assigns a score to each item in the chart equal to the product of the inside probability of the item and its prior probability. Any item with a score less than 10 [?]5 times of that of the best item is pruned from the chart.</Paragraph>
    <Paragraph position="5"> 4 What is the Minimal Subtree Set that</Paragraph>
    <Section position="1" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
Achieves Maximal Parse Accuracy?
4.1 The base line
</SectionTitle>
      <Paragraph position="0"> For our base line parse accuracy, we used the now standard division of the WSJ (see Collins 1997, 1999; Charniak 1997, 2000; Ratnaparkhi 1999) with sections 2 through 21 for training (approx. 40,000 sentences) and section 23 for testing (2416 sentences [?] 100 words); section 22 was used as development set. All trees were stripped off their semantic tags, co-reference information and quotation marks. We used all training set subtrees of depth 1, but due to memory limitations we used a subset of the subtrees larger than depth 1, by taking for each depth a random sample of 400,000 subtrees.</Paragraph>
      <Paragraph position="1"> These random subtree samples were not selected by first exhaustively computing the complete set of subtrees (this was computationally prohibit ive). Instead, for each particular depth &gt; 1 we sampled subtrees by randomly selecting a node in a random tree from the training set, after which we selected random expansions from that node until a subtree of the particular depth was obtained. We repeated this procedure 400,000 times for each depth &gt; 1 and [?] 14. Thus no subtrees of depth &gt; 14 were used. This resulted in a base line subtree set of 5,217,529 subtrees which were smoothed by the technique described in Bod (1996) based on Good-Turing. Since our subtrees are allowed to be lexicalized (at their frontiers), we did not use a separate part-of-speech tagger: the test sentences were directly parsed by the training set subtrees. For words that were unknown in our subtree set, we guessed their categories by means of the method described in Weischedel et al. (1993) which uses statistics on word-endings, hyphenation and capitalization. The guessed category for each unknown word was converted into a depth-1 subtree and assigned a probability by means of simple Good-Turing estimation (see Bod 1998).</Paragraph>
      <Paragraph position="2"> The most probable parse for each test sentence was estimated from the 1,000 most probable derivations of that sentence, as described in section 3.</Paragraph>
      <Paragraph position="3"> We used &amp;quot;evalb&amp;quot;  to compute the standard PARSEVAL scores for our parse results. We focus on the Labeled Precision (LP) and Labeled Recall (LR) scores only in this paper, as these are commonly used to rank parsing systems.</Paragraph>
      <Paragraph position="4"> Table 1 shows the LP and LR scores obtained with our base line subtree set, and compares these scores with those of previous stochastic parsers tested on the WSJ (respectively Charniak 1997, Collins 1999, Ratnaparkhi 1999, and Charniak 2000).</Paragraph>
      <Paragraph position="5"> The table shows that by using the base line subtree set, our parser outperforms most previous parsers but it performs worse than the parser in Charniak (2000). We will use our scores of 89.5% LP and 89.3% LR (for test sentences [?] 40 words) as the base line result against which the effect of various subtree restrictions is investigated. While most subtree restrictions diminish the accuracy scores, we will see that there are restrictions that improve our scores, even beyond those of Charniak (2000).</Paragraph>
      <Paragraph position="7"> We will initially study our subtree restrictions only for test sentences [?] 40 words (2245 sentences), after which we will give in 4.6 our results for all test sentences [?] 100 words (2416 sentences). While we have tested all subtree restrictions initially on the development set (section 22 in the WSJ), we believe that it is interesting and instructive to report these subtree restrictions on the test set (section 23) rather than reporting our best result only.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.2 The impact of subtree size
</SectionTitle>
      <Paragraph position="0"> Our first subtree restriction is concerned with subtree size. We therefore performed experiments with versions of DOP1 where the base line subtree set is restricted to subtrees with a certain maximum depth. Table 2 shows the results of these experiments.</Paragraph>
      <Paragraph position="1">  depths (for test sentences [?] 40 words) Our scores for subtree-depth 1 are comparable to Charniak's treebank grammar if tested on word strings (see Charniak 1997). Our scores are slightly better, which may be due to the use of a different unknown word model. Note that the scores consistently improve if larger subtrees are taken into account. The highest scores are obtained if the full base line subtree set is used, but they remain behind the results of Charniak (2000). One might expect that our results further increase if even larger subtrees are used; but due to memory limitations we did not perform experiments with subtrees larger than depth 14.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.3 The impact of lexical context
</SectionTitle>
      <Paragraph position="0"> The more words a subtree contains in its frontier, the more lexical dependencies can be taken into account. To test the impact of the lexical context on the accuracy, we performed experiments with different versions of the model where the base line subtree set is restricted to subtrees whose frontiers contain a certain maximum number of words; the subtree depth in the base line subtree set was not constrained (though no subtrees deeper than 14 were in this base line set). Table 3 shows the results of our experiments.</Paragraph>
      <Paragraph position="1">  We see that the accuracy initially increases when the lexical context is enlarged, but that the accuracy decreases if the number of words in the subtree frontiers exceeds 12 words. Our highest scores of 90.8% LP and 90.5% LR outperform the scores of the best previously published parser by Charniak (2000) who obtains 90.1% for both LP and LR. Moreover, our scores also outperform the reranking technique of Collins (2000) who reranks the output of the parser of Collins (1999) using a boosting method based on Schapire &amp; Singer (1998), obtaining 90.4% LP and 90.1% LR. We have thus found a subtree restriction which does not decrease the parse accuracy but even improves it. This restriction consists of an upper bound of 12 words in the subtree frontiers, for subtrees [?] depth 14. (We have also tested this lexical restriction in combination with subtrees smaller than depth 14, but this led to a decrease in accuracy.)</Paragraph>
    </Section>
    <Section position="4" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.4 The impact of structural context
</SectionTitle>
      <Paragraph position="0"> Instead of investigating the impact of lexical context, we may also be interested in studying the importance of structural context. We may raise the question as to whether we need all unlexicalized subtrees, since such subtrees do not contain any lexical information, although they may be useful to smooth lexicalized subtrees. We accomplished a set of experiments where unlexicalized subtrees of a certain minimal depth are deleted from the base line subtree set, while all lexicalized subtrees up to 12 words are retained.</Paragraph>
      <Paragraph position="1"> depth of deleted  unlexicalized subtrees are retained, but that unlexicalized subtrees larger than depth 6 do not contribute to any further increase in accuracy. On the contrary, these larger subtrees even slightly decrease the accuracy. The highest scores obtained are: 90.8% labeled precision and 90.6% labeled recall. We thus conclude that pure structural context without any lexical information contributes to higher parse accuracy (even if there exists an upper bound for the size of structural context). The importance of structural context is consonant with Johnson (1998) who showed that structural context from higher nodes in the tree (i.e. grandparent nodes) contributes to higher parse accuracy. This mirrors our result of the importance of unlexicalized subtrees of depth 2.</Paragraph>
      <Paragraph position="2"> But our results show that larger structural context (up to depth 6) also contributes to the accuracy.</Paragraph>
    </Section>
    <Section position="5" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.5 The impact of nonheadword dependencies
</SectionTitle>
      <Paragraph position="0"> We may also raise the question as to whether we need almost arbitrarily large lexicalized subtrees (up to 12 words) to obtain our best results. It could be the case that DOP's gain in parse accuracy with increasing subtree depth is due to the model becoming sensitive to the influence of lexical heads higher in the tree, and that this gain could also be achieved by a more compact model which associates each nonterminal with its headword, such as a head-lexicalized SCFG.</Paragraph>
      <Paragraph position="1"> Head-lexicalized stochastic grammars have recently become increasingly popular (see Collins 1997, 1999; Charniak 1997, 2000). These grammars are based on Magerman's head-percolation scheme to determine the headword of each nonterminal (Magerman 1995). Unfortunately this means that head-lexicalized stochastic grammars are not able to capture dependency relations between words that according to Magerman's head-percolation scheme are &amp;quot;nonheadwords&amp;quot; -- e.g. between more and than in the WSJ construction carry more people than cargo where neither more nor than are head-words of the NP constituent more people than cargo. A frontier-lexicalized DOP model, on the other hand, captures these dependencies since it includes subtrees in which more and than are the only frontier words. One may object that this example is somewhat far-fetched, but Chiang (2000) notes that head-lexicalized stochastic grammars fall short in encoding even simple dependency relations such as between left and John in the sentence John should have left . This is because Magerman's head-percolation scheme makes should and have the heads of their respective VPs so that there is no dependency relation between the verb left and its subject John. Chiang observes that almost a quarter of all nonempty subjects in the WSJ appear in such a configuration.</Paragraph>
      <Paragraph position="2"> In order to isolate the contribution of nonheadword dependencies to the parse accuracy, we eliminated all subtrees containing a certain maximum number of nonheadwords, where a nonheadword of a subtree is a word which according to Magerman's scheme is not a headword of the subtree's root nonterminal (although such a nonheadword may of course be a headword of one of the subtree's internal nodes). In the following experiments we used the subtree set for which maximum accuracy was obtained in our previous experiments, i.e.</Paragraph>
      <Paragraph position="3"> containing all lexicalized subtrees with maximally  higher parse accuracy: the difference between using no and all nonheadwords is 1.2% in LP and 1.0% in LR. Although this difference is relatively small, it does indicate that nonheadword dependencies should preferably not be discarded in the WSJ. We should note, however, that most other stochastic parsers do include counts of single nonheadwords: they appear in the backed-off statistics of these parsers (see Collins 1997, 1999; Charniak 1997; Goodman 1998). But our parser is the first parser that also includes counts between two or more nonheadwords, to the best of our knowledge, and these counts lead to improved performance, as can be seen in table 5.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>