<?xml version="1.0" standalone="yes"?>
<Paper uid="J04-4004">
  <Title>Intricacies of Collins' Parsing Model</Title>
  <Section position="12" start_page="503" end_page="506" type="evalu">
    <SectionTitle>
8. Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="503" end_page="504" type="sub_section">
      <SectionTitle>
8.1 Effects of Unpublished Details
</SectionTitle>
      <Paragraph position="0"> In this section we present the results of effectively doing a &amp;quot;clean-room&amp;quot; implementation of Collins' parsing model, that is, using only information available in (Collins 1997, 1999), as shown in Table 4.</Paragraph>
      <Paragraph position="1"> The clean-room model has a 10.6% increase in F-measure error compared to Collins' parser and an 11.0% increase in F-measure error compared to our engine in its complete emulation of Collins' Model 2. This is comparable to the increase in 32 Although we have implemented a version of this type of pruning that limits the number of items that can be collected in any one cell, that is, the maximum number of items that cover a particular span.  Computational Linguistics Volume 30, Number 4 Table 5 Effects of independently removing or changing individual details on overall parsing performance. All reported scores are for sentences of length [?] 40 words. +With beam width =  , processing time was 3.36 times longer than with standard beam (10  error seen when removing such published features as the verb-intervening component of the distance metric, which results in an F-measure error increase of 9.86%, or the subcat feature, which results in a 7.62% increase in F-measure error.  Therefore, while the collection of unpublished details presented in Sections 4-7 is disparate, in toto those details are every bit as important to overall parsing performance as certain of the published features.</Paragraph>
      <Paragraph position="2"> This does not mean that all the details are equally important. Table 5 shows the effect on overall parsing performance of independently removing or changing certain of the more than 30 unpublished details.</Paragraph>
      <Paragraph position="3">  Often, the detrimental effect of a particular change is quite insignificant, even by the standards of the performance-obsessed world of statistical parsing, and occasionally, the effect of a change is not even detrimental at all. That is why we do not claim the importance of any single unpublished detail, but rather that of their totality, given that several of the unpublished details are, most likely, interacting. However, we note that certain individual details, such as the universal p(w|t) model, do appear to have a much more marked effect on overall parsing accuracy than others.</Paragraph>
    </Section>
    <Section position="2" start_page="504" end_page="506" type="sub_section">
      <SectionTitle>
8.2 Bilexical Dependencies
</SectionTitle>
      <Paragraph position="0"> The previous section accounts for the noticeable effects of all the unpublished details of Collins' model. But what of the details that were published? In chapter 8 of his thesis, Collins gives an account on the motivation of various features of his model, including the distance metric, the model's use of subcats (and their interaction with the distance metric), and structural versus semantic preferences. In the discussion of this last issue, Collins points to the fact that structural preferences--which, in his model, are 33 These F-measures and the differences between them were calculated from experiments presented in Collins (1999, page 201); these experiments, unlike those on which our reported numbers are based, were on all sentences, not just those of length [?] 40 words. As Collins notes, removing both the distance metric and subcat features results in a gigantic drop in performance, since without both of these features, the model has no way to encode the fact that flatter structures should be avoided in several crucial cases, such as for PPs, which tend to prefer one argument to the right of their head-children.</Paragraph>
      <Paragraph position="1"> 34 As a reviewer pointed out, the use of the comma constraint is a &amp;quot;published&amp;quot; detail. However, the specifics of how certain commas do not apply to the constraint is an &amp;quot;unpublished detail,&amp;quot; as mentioned in Section 7.2.</Paragraph>
      <Paragraph position="2">  Bikel Intricacies of Collins' Parsing Model Table 6 Number of times our parsing engine was able to deliver a probability for the various levels of back-off of the modifier-word generation model, P</Paragraph>
      <Paragraph position="4"> , when testing on Section 00, having trained on Sections 02-21. In other words, this table reports how often a context in the back-off chain of P  parameters--often provide the right information for disambiguating competing analyses, but that these structural preferences may be &amp;quot;overridden&amp;quot; by semantic preferences. Bilexical statistics (Eisner 1996), as represented by the maximal context of the P</Paragraph>
      <Paragraph position="6"> parameters, serve as a proxy for such semantic preferences, where the actual modifier word (as opposed to, say, merely its part of speech) indicates the particular semantics of its head. Indeed, such bilexical statistics were widely assumed for some time to be a source of great discriminative power for several different parsing models, including that of Collins.</Paragraph>
      <Paragraph position="7"> However, Gildea (2001) reimplemented Collins' Model 1 (essentially Model 2 but without subcats) and altered the P</Paragraph>
      <Paragraph position="9"> parameters so that they no longer had the top level of context that included the headword (he removed back-off level 0, as depicted in Table 1). In other words, Gildea removed all bilexical statistics from the overall model. Surprisingly, this resulted in only a 0.45% absolute reduction in F-measure (3.3% relative increase in error). Unfortunately, this result was not entirely conclusive, in that Gildea was able to reimplement Collins' baseline model only partially, and the performance of his partial reimplementation was not quite as good as that of Collins' parser.</Paragraph>
      <Paragraph position="10">  Training on Sections 02-21, we have duplicated Gildea's bigram-removal experiment, except that our chosen test set is Section 00 instead of Section 23 and our chosen model is the more widely used Model 2. Using the mode that most closely emulates Collins' Model 2, with bigrams, our engine obtains a recall of 89.89% and a precision of 90.14% on sentences of length [?] 40 words (see Table 8, Model M tw,tw ).</Paragraph>
      <Paragraph position="11"> Without bigrams, performance drops only to 89.49% on recall, 89.95% on precision-an exceedingly small drop in performance (see Table 8, Model M tw,t ). In an additional experiment, we have examined the number of times that the parser is able, while decoding Section 00, to deliver a requested probability for the modifier-word generation model using the increasingly less-specific contexts of the three back-off levels. The results are presented in Table 6. Back-off level 0 indicates the use of the full history context, which contains the head-child's headword. Note that probabilities making use of this full context, that is, making use of bilexical dependencies, are available only 1.49% of the time. Combined with the results from the previous experiment, this suggests rather convincingly that such statistics are far less significant than once thought to the overall discriminative power of Collins' models, confirming Gildea's result for Model 2.</Paragraph>
      <Paragraph position="12">  35 The reimplementation was necessarily only partial, as Gildea did not have access to all the unpublished details of Collins' models that are presented in this article. 36 On a separate note, it may come as a surprise that the decoder needs to access more than 219 million probabilities during the course of parsing the 1,917 sentences of Section 00. Among other things, this</Paragraph>
    </Section>
    <Section position="3" start_page="506" end_page="506" type="sub_section">
      <SectionTitle>
8.3 Choice of Heads
</SectionTitle>
      <Paragraph position="0"> If not bilexical statistics, then surely, one might think, head-choice is critical to the performance of a head-driven lexicalized statistical parsing model. Partly to this end, in Chiang and Bikel (2002), we explored methods for recovering latent information in treebanks. The second half of that paper focused on a use of the Inside-Outside algorithm to reestimate the parameters of a model defined over an augmented tree space, where the observed data were considered to be the gold-standard labeled bracketings found in the treebank, and the hidden data were considered to be the headlexicalizations, one of the most notable tree augmentations performed by modern statistical parsers. These expectation maximization (EM) experiments were motivated by the desire to overcome the limitations imposed by the heuristics that have been heretofore used to perform head-lexicalization in treebanks. In particular, it appeared that the head rules used in Collins' parser had been tweaked specifically for the English Penn Treebank. Using EM would mean that very little effort would need to be spent on developing head rules, since EM could take an initial model that used simple heuristics and optimize it appropriately to maximize the likelihood of the unlexicalized (observed) training trees. To test this, we performed experiments with an initial model trained using an extremely simplified head-rule set in which all rules were of the form &amp;quot;if the parent is X, then choose the left/rightmost child.&amp;quot; A surprising side result was that even with this simplified set of head-rules, overall parsing performance still remained quite high. Using our simplified head-rule set for English, our engine in its &amp;quot;Model 2 emulation mode&amp;quot; achieved a recall of 88.55% and a precision of 88.80% for sentences of length [?]40 words in Section 00 (see Table 7). So contrary to our expectations, the lack of careful head-choice is not crippling in allowing the parser to disambiguate competing theories and is a further indication that semantic preferences, as represented by conditioning on a headword, rarely override structural ones.</Paragraph>
    </Section>
    <Section position="4" start_page="506" end_page="506" type="sub_section">
      <SectionTitle>
8.4 Lexical Dependencies Matter
</SectionTitle>
      <Paragraph position="0"> Given that bilexical dependencies are almost never used and have a surprisingly small effect on overall parsing performance, and given that the choice of head is not terribly critical either, one might wonder what power, if any, head-lexicalization is providing.</Paragraph>
      <Paragraph position="1"> The answer is that even when one removes bilexical dependencies from the model, there are still plenty of lexico-structural dependencies, that is, structures being generated conditioning on headwords and headwords being generated conditioning on structures.</Paragraph>
      <Paragraph position="2"> To test the effect of such lexicostructural dependencies in our lexicalized PCFGstyle formalism, we experimented with the removal of the head tag t  parameter class for generating partially lexicalized modifying nonterminals (a nonterminal label and part of speech). P</Paragraph>
      <Paragraph position="4"> is the parameter class that generates the headword of a modifying nonterminal. Together, P</Paragraph>
      <Paragraph position="6"> generate a fully lexicalized modifying nonterminal. The check marks indicate the inclusion of the headword w h and its part of speech t h of the lexicalized head nonterminal H(t  sults are shown in Table 8. Model M tw,tw shows our baseline, and Model M ph,ph shows the effect of removing all dependence on the headword and its part of speech, with the other models illustrating varying degrees of removing elements from the two parameter classes' conditioning contexts. Notably, including the headword w h in or removing it from the P M contexts appears to have a significant effect on overall performance, as shown by moving from Model M tw,t to Model M t,t and from Model M tw,ph to Model M t,ph . This reinforces the notion that particular headwords have structural preferences, so that making the P M parameters dependent on headwords would capture such preferences. As for effects involving dependence on the head tag t  results in a small drop in both recall and precision, whereas making an analogous move from Model M t,t to Model M t,ph results in a drop in recall, but a slight gain in precision (the two moves are analogous in that in both cases, t h is dropped from the context of P</Paragraph>
      <Paragraph position="8"> ). It is not evident why these two moves do not produce similar performance losses, but in both cases, the performance drops are small relative to those observed when eliminating w h from the conditioning contexts, indicating that headwords matter far more than parts of speech for determining structural preferences, as one would expect.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>