File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/c00-1011_metho.xml
Size: 14,999 bytes
Last Modified: 2025-10-06 14:07:09
<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1011"> <Title>Parsing with the Shortest Derivation</Title> <Section position="4" start_page="70" end_page="71" type="metho"> <SectionTitle> 3. Computational Aspects </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="70" end_page="71" type="sub_section"> <SectionTitle> 3.1 Computing the most probable parse </SectionTitle> <Paragraph position="0"> Bed (1993) showed how standard chart parsing techniques can be applied to probabilistic DOP. Each corpus-subtree t is converted into a context-free rule r where the lefthand side <51&quot; r corresponds to tile root label of t and tile righthand side of r corresponds to the fronlier labels of t. Indices link the rules to the original subtrees so as to maintain the sublree's internal structure and probability. These rules are used lO Cl'e~:lte il. derivation forest for a sentenc/2, illld the most p,obable parse is computed by sampling a sufficiently large number of random deriwttions from the forest (&quot;Monte Carlo disamt~iguation&quot;, see Bed 1998; Chappelier & Rajman 2000). While this technique has been sttccessfully applied to parsing lhe ATIS portion in the Penn Treebank (Marcus et al.</Paragraph> <Paragraph position="1"> 1993), it is extremely time consuming. This is mainly because lhe nun/bcr of random derivations thai should be sampled to reliably estimate tile most prolmble parse increases exponentially with the sentence length (see Goodman 1998). It is therefore questionable whether Bed's slunpling teclmique can be scaled to larger corpora such as tile OVIS and the WSJ corpora.</Paragraph> <Paragraph position="2"> Goodman (199g) showed how tile probabilistic I)OP model can be reduced to a compact stochastic context-free grammar (SCFG) which contains exactly eight SCFG rules for each node in the training set trues. Although Goodman's rcductkm method does still not allow for an efficient computation {51 tile most probable parse in DOP (ill fact, the prol~lem of computing the most prolmble parse is NP-hard -- sue Sima'an 1996), his method does allow for an efficient computation o1' the &quot;nmximun~ constituents parse&quot;, i.e., the parse tree that is most likely to have the largest number of correct constitueuts (also called the &quot;labeled recall parse&quot;). Goodman has shown on tile ATIS corpus that the nla.xinltllll constituents parse perfor,ns at least as well as lhe most probable parse if all subtl'ees are used.</Paragraph> <Paragraph position="3"> Unfortunately, Goodman's reduction method remains beneficial only if indeed all treebank subtrces arc used (see Sima'an 1999: 108), while maximum parse accuracy is typically obtained with a snbtree set which is smalle,&quot; than the total set of subtrees (this is probably due to data-sparseness effects -- see Bonnema et al. 1997; Bod 1998; Sima'an 1999).</Paragraph> <Paragraph position="4"> In this paper we will use Bod's subtree-to-rule conversion method for studying the behavior of probabilistic against non-probabilistic DOP for different maximtnn subtree sizes. However, we will not use Bod's Monte Carlo sampling technique from complete derivation forests, as this turns out to be computationally impractical for our larger corpora.</Paragraph> <Paragraph position="5"> Instead, we use a Viterbi n-best search and estimate the most probable parse fi'mn the 1,000 most probable deriwltions, summing up tile probabilities hi' derivations that generate the same tree. 
<Section position="2" start_page="71" end_page="71" type="sub_section"> <SectionTitle> 3.2 Computing the shortest derivation </SectionTitle>
<Paragraph position="0"> As with the probabilistic DOP model, we first convert the corpus-subtrees into rewrite rules. Next, the shortest derivation can be computed in the same way as the most probable derivation (by Viterbi) if we give all rules equal probabilities, in which case the shortest derivation is equal to the most probable derivation. This can be seen as follows: if each rule has a probability p, then the probability of a derivation involving n rules is equal to p^n, and since 0 &lt; p &lt; 1 the derivation with the fewest rules has the greatest probability. In our experiments, we gave each rule a probability mass equal to 1/R, where R is the number of distinct rules derived by Bod's method.</Paragraph>
<Paragraph position="1"> As mentioned above, the shortest derivation may not be unique. In that case we compute all shortest derivations of a sentence and then apply our ranking scheme to these derivations. Note that this ranking scheme does distinguish between subtrees of different root labels, as it ranks the subtrees given their root label. The ranks of the shortest derivations are computed by summing up the ranks of the subtrees they involve. The shortest derivation with the smallest sum of subtree ranks is taken to produce the best parse tree.3</Paragraph> </Section> </Section>
<Section position="5" start_page="71" end_page="71" type="metho"> <SectionTitle> 3 It may happen that different shortest derivations generate </SectionTitle> <Paragraph position="0"> the same tree. We will not distinguish between these cases, however, and compute only the shortest derivation with the highest rank.</Paragraph> </Section>
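Before turning to the experiments, here is a minimal sketch, under our own simplifying assumptions, of the two ingredients just described for the non-probabilistic model: uniform rule probabilities, so that a standard Viterbi search returns a shortest derivation, and tie-breaking by the sum of subtree ranks. The rule extraction, the Viterbi chart itself and the per-root-label rank table are assumed to be given.

```python
def uniform_rule_probabilities(rules):
    """Give every distinct rule the same probability mass 1/R. Under
    Viterbi, the most probable derivation is then a derivation with the
    fewest rules, i.e. a shortest derivation."""
    distinct = set(rules)
    return {rule: 1.0 / len(distinct) for rule in distinct}

def best_shortest_derivation(shortest_derivations, rank):
    """Among all shortest derivations (each a list of subtree identifiers),
    return the one whose subtree ranks sum to the smallest value; `rank`
    maps a subtree to its frequency rank given its root label
    (1 = most frequent)."""
    return min(shortest_derivations,
               key=lambda derivation: sum(rank[subtree] for subtree in derivation))

# Hypothetical usage with two shortest derivations t1+t2 and t3+t4:
derivations = [["t1", "t2"], ["t3", "t4"]]
ranks = {"t1": 1, "t2": 5, "t3": 2, "t4": 3}
best = best_shortest_derivation(derivations, ranks)  # ["t3", "t4"], rank sum 5
```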
<Section position="6" start_page="71" end_page="73" type="metho"> <SectionTitle> 4. Experimental Comparison </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="71" end_page="71" type="sub_section"> <SectionTitle> 4.1 Experiments on the ATIS corpus </SectionTitle>
<Paragraph position="0"> For our first comparison, we used 10 splits from the Penn ATIS corpus (Marcus et al. 1993) into training sets of 675 sentences and test sets of 75 sentences.</Paragraph>
<Paragraph position="1"> These splits were random except for one constraint: that all words in the test set actually occurred in the training set. As in Bod (1998), we eliminated all epsilon productions and all &quot;pseudo-attachments&quot;. As accuracy metric we used the exact match, defined as the percentage of the best parse trees that are identical to the test set parses. Since the Penn ATIS portion is relatively small, we were able to compute the most probable parse both by means of Monte Carlo sampling and by means of Viterbi n-best. Table 1 shows the means of the exact match accuracies for increasing maximum subtree depths (up to depth 6).</Paragraph>
<Paragraph position="2"> The table shows that the two methods for probabilistic DOP score roughly the same: at depth ≤ 6, the Monte Carlo method obtains 84.1% while the Viterbi n-best method obtains 84.0%. These differences are not statistically significant. The table also shows that for small subtree depths the non-probabilistic DOP model performs considerably worse than the probabilistic model. This may not be surprising, since for small subtrees the shortest derivation corresponds to the smallest parse tree, which is known to be a bad prediction of the correct parse tree. Only if the subtrees are larger than depth 4 does the non-probabilistic DOP model score roughly the same as its probabilistic counterpart. At subtree depth ≤ 6, the non-probabilistic DOP model scores 1.5% better than the best score of the probabilistic DOP model, which is statistically significant according to paired t-tests.</Paragraph> </Section>
<Section position="2" start_page="71" end_page="72" type="sub_section"> <SectionTitle> 4.2 Experiments on the OVIS corpus </SectionTitle>
<Paragraph position="0"> For our comparison on the OVIS corpus (Bonnema et al. 1997; Bod 1998) we again used 10 random splits under the condition that all words in the test set occurred in the training set (9000 sentences for training, 1000 sentences for testing). The OVIS trees contain both syntactic and semantic annotations, but no epsilon productions. As in Bod (1998), we treated the syntactic and semantic annotations of each node as one label. Consequently, the labels are very restrictive and collecting statistics over them is difficult. Bonnema et al. (1997) and Sima'an (1999) report that (probabilistic) DOP suffers considerably from data-sparseness on OVIS, yielding a decrease in parse accuracy if subtrees larger than depth 4 are included. Thus it is interesting to investigate how non-probabilistic DOP behaves on this corpus. Table 2 shows the means of the exact match accuracies for increasing subtree depths.</Paragraph>
<Paragraph position="1"> We again see that the non-probabilistic DOP model performs badly for small subtree depths while it outperforms the probabilistic DOP model if the subtrees get larger (in this case for depth > 3). But while the accuracy of probabilistic DOP deteriorates after depth 4, the accuracy of non-probabilistic DOP continues to grow. Thus non-probabilistic DOP seems relatively insensitive to the low frequency of larger subtrees. This property may be especially useful if no meaningful statistics can be collected while sentences can still be parsed by large chunks. At depth ≤ 6, non-probabilistic DOP scores 3.4% better than probabilistic DOP, which is statistically significant using paired t-tests.</Paragraph> </Section>
<Section position="3" start_page="72" end_page="73" type="sub_section"> <SectionTitle> 4.3 Experiments on the WSJ corpus </SectionTitle>
<Paragraph position="0"> Both the ATIS and OVIS corpora represent restricted domains. In order to extend our results to a broad-coverage domain, we also tested the two models on the Wall Street Journal portion of the Penn Treebank (Marcus et al. 1993).</Paragraph>
<Paragraph position="1"> To make our results comparable to others, we did not test on different random splits but used the now standard division of the WSJ, with sections 2-21 for training (approx. 40,000 sentences) and section 23 for testing (see Collins 1997, 1999; Charniak 1997, 2000; Ratnaparkhi 1999); we only tested on sentences ≤ 40 words (2245 sentences). All trees were stripped of their semantic tags, co-reference information and quotation marks.
We used all training set subtrees of depth 1, but due to memory limitations we used a subset of the subtrees larger than depth 1, taking for each depth a random sample of 400,000 subtrees. No subtrees larger than depth 14 were used. This resulted in a total set of 5,217,529 subtrees, which were smoothed by Good-Turing (see Bod 1996). We did not employ a separate part-of-speech tagger: the test sentences were directly parsed by the training set subtrees. For words that were unknown in the training set, we guessed their categories by means of the method described in Weischedel et al. (1993), which uses statistics on word endings, hyphenation and capitalization. The guessed category for each unknown word was converted into a depth-1 subtree and assigned a probability (or frequency for non-probabilistic DOP) by means of simple Good-Turing.</Paragraph>
<Paragraph position="2"> As accuracy metric we used the standard PARSEVAL scores (Black et al. 1991) to compare a proposed parse P with the corresponding correct treebank parse T as follows:

Labeled Precision = (# correct constituents in P) / (# constituents in P)

Labeled Recall = (# correct constituents in P) / (# constituents in T)

A constituent in P is &quot;correct&quot; if there exists a constituent in T of the same label that spans the same words. As in other work, we collapsed ADVP and PRT to the same label when calculating these scores (see Collins 1997; Ratnaparkhi 1999; Charniak 1997). Table 3 shows the labeled precision (LP) and labeled recall (LR) scores for probabilistic and non-probabilistic DOP for six different maximum subtree depths.</Paragraph>
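As an informal illustration of these two measures, the following sketch computes labeled precision and recall for a single sentence; representing a constituent as a (label, start, end) triple is our own assumption, and details such as the ADVP/PRT collapsing are omitted.

```python
def labeled_precision_recall(proposed, gold):
    """Labeled precision and recall for one sentence.

    `proposed` and `gold` are assumed to be sets of constituents, each a
    (label, start, end) triple; a proposed constituent is correct if the
    gold parse contains a constituent with the same label spanning the
    same words.
    """
    correct = len(set(proposed) & set(gold))
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical example: two of the three proposed constituents are correct.
proposed = {("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5)}
gold = {("S", 0, 5), ("NP", 0, 1), ("VP", 2, 5)}
lp, lr = labeled_precision_recall(proposed, gold)  # LP = LR = 2/3
```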
<Paragraph position="3"> The table shows that probabilistic DOP outperforms non-probabilistic DOP for maximum subtree depths 4 and 6, while the models yield rather similar results for maximum subtree depth 8. Surprisingly, the scores of non-probabilistic DOP deteriorate if the subtrees are further enlarged, while the scores of probabilistic DOP continue to grow, up to 89.5% LP and 89.3% LR. These scores are higher than those of several other parsers (e.g. Collins 1997, 99; Charniak 1997), but remain behind the scores of Charniak (2000), who obtains 90.1% LP and 90.1% LR for sentences ≤ 40 words. However, in Bod (2000b) we show that even higher scores can be obtained with probabilistic DOP by restricting the number of words in the subtree frontiers to 12 and restricting the depth of unlexicalized subtrees to 6; with these restrictions an LP of 90.8% and an LR of 90.6% is achieved.</Paragraph>
<Paragraph position="4"> We may raise the question as to whether we actually need these extremely large subtrees to obtain our best results. One could argue that DOP's gain in parse accuracy with increasing subtree depth is due to the model becoming sensitive to the influence of lexical heads higher in the tree, and that this gain could also be achieved by a more compact depth-1 DOP model (i.e. an SCFG) which annotates the nonterminals with headwords. However, such a head-lexicalized stochastic grammar does not capture dependencies between nonheadwords (such as more and than in the WSJ construction carry more people than cargo, where neither more nor than are headwords of the NP constituent more people than cargo), whereas a frontier-lexicalized DOP model using large subtrees does capture these dependencies, since it includes subtrees in which e.g. more and than are the only frontier words. In order to isolate the contribution of nonheadword dependencies, we eliminated all subtrees containing two or more nonheadwords (where a nonheadword of a subtree is a word which is not a headword of the subtree's root nonterminal -- although such a nonheadword may be a headword of one of the subtree's internal nodes). On the WSJ this led to a decrease in LP/LR of 1.2%/1.0% for probabilistic DOP. Thus nonheadword dependencies contribute to higher parse accuracy and should not be discarded.</Paragraph>
<Paragraph position="5"> This goes against the common wisdom that the relevant lexical dependencies can be restricted to the locality of headwords of constituents (as advocated in Collins 1999). It also shows that DOP's frontier lexicalization is a viable alternative to constituent lexicalization (as proposed in Charniak 1997; Collins 1997, 99; Eisner 1997). Moreover, DOP's use of large subtrees makes the model not only more lexically but also more structurally sensitive.</Paragraph> </Section> </Section> </Paper>