<?xml version="1.0" standalone="yes"?> <Paper uid="J04-4004"> <Title>c(c) 2004 Association for Computational Linguistics Intricacies of Collins' Parsing Model</Title> <Section position="5" start_page="481" end_page="481" type="metho"> <SectionTitle> 3. Model Overview </SectionTitle> <Paragraph position="0"> The Collins parsing model decomposes the generation of a parse tree into many small steps, using reasonable independence assumptions to make the parameter estimation problem tractable. Even though decoding proceeds bottom-up, the model is defined in a top-down manner. Every nonterminal label in every tree is lexicalized: the label is augmented to include a unique headword (and that headword's part of speech) that the node dominates. The lexicalized PCFG that sits behind Model 2 has rules of the form</Paragraph> <Paragraph position="2"> from its distinguished head-child, H. In this generative model, first P is generated, then its head-child H, then each of the left- and right-modifying nonterminals are generated from the head outward. The modifying nonterminals L</Paragraph> <Paragraph position="4"> are generated conditioning on P and H, as well as a distance metric (based on what material intervenes between the currently generated modifying nonterminal and H) and an incremental subcategorization frame feature (a multiset containing the arguments of H that have yet to be generated on the side of H in which the currently generated nonterminal falls). Note that if the modifying nonterminals were generated completely independently, the model would be very impoverished, but in actuality, because it includes the distance and subcategorization frame features, the model captures a crucial bit of linguistic reality, namely, that words often have well-defined sets of complements and adjuncts, occurring with some well-defined distribution in the right-hand sides of a (context-free) rewriting system.</Paragraph> <Paragraph position="5"> The process proceeds recursively, treating each newly generated modifier as a parent and then generating its head and modifier children; the process terminates when (lexicalized) preterminals are generated. As a way to guarantee the consistency of the model, the model also generates two hidden +STOP+ nonterminals as the leftmost and rightmost children of every parent (see Figure 7).</Paragraph> </Section> <Section position="6" start_page="481" end_page="483" type="metho"> <SectionTitle> 4. Preprocessing Training Trees </SectionTitle> <Paragraph position="0"> To the casual reader of Collins' thesis, it may not be immediately apparent that there are quite a few preprocessing steps for each annotated training tree and that these steps are crucial to the performance of the parser. We identified 11 preprocessing steps necessary to prepare training trees when using Collins' parsing model: 1. pruning of unnecessary nodes 2. adding base NP nodes (NPBs) 3. &quot;repairing&quot; base NPs 4. adding gap information (applicable to Model 3 only) 5. relabeling of sentences with no subjects (subjectless sentences) 6. removing null elements 7. raising punctuation 8. identification of argument nonterminals 9. stripping unused nonterminal augmentations Computational Linguistics Volume 30, Number 4 10. &quot;repairing&quot; subjectless sentences 11. head-finding The order of presentation in the foregoing list is not arbitrary, as some of the steps depend on results produced in previous steps. 
Also, we have separated the steps into their functional units; an implementation could combine steps that are independent of one another (for clarity, our implementation does not, however). Finally, we note that the final step, head-finding, is actually required by some of the previous steps in certain cases; in our implementation, we selectively employ a head-finding module during the first 10 steps where necessary.</Paragraph> <Section position="1" start_page="482" end_page="482" type="sub_section"> <SectionTitle> 4.1 Coordinated Phrases </SectionTitle> <Paragraph position="0"> A few of the preprocessing steps rely on the notion of a coordinated phrase. In this article, the conditions under which a phrase is considered coordinated are slightly more detailed than is described in Collins' thesis. A node represents a coordinated In the Penn Treebank, a coordinating conjunction is any preterminal node with the label CC. This definition essentially picks out all phrases in which the head-child is truly conjoined to some other phrase, as opposed to a phrase in which, say, there is an initial CC, such as an S that begins with the conjunction but.</Paragraph> </Section> <Section position="2" start_page="482" end_page="482" type="sub_section"> <SectionTitle> 4.2 Pruning of Unnecessary Nodes </SectionTitle> <Paragraph position="0"> As a preprocessing step, pruning of unnecessary nodes simply removes preterminals that should have little or no bearing on parser performance. In the case of the English Treebank, the pruned subtrees are all preterminal subtrees whose root label is one of {'', '', .}. There are two reasons to remove these types of subtrees when parsing the English Treebank: First, in the treebanking guidelines (Bies 1995), quotation marks were given the lowest possible priority and thus cannot be expected to appear within constituent boundaries in any kind of consistent way, and second, neither of these types of preterminals--nor any punctuation marks, for that matter--counts towards the parsing score.</Paragraph> </Section> <Section position="3" start_page="482" end_page="483" type="sub_section"> <SectionTitle> 4.3 Adding Base NP Nodes </SectionTitle> <Paragraph position="0"> An NP is basal when it does not itself dominate an NP; such NP nodes are relabeled NPB. More accurately, an NP is basal when it dominates no other NPs except possessive NPs, where a possessive NP is an NP that dominates POS, the preterminal possessive 4 Our positional descriptions here, such as &quot;posthead but nonfinal,&quot; refer to positions within the list of immediately dominated children of the coordinated phrase node, as opposed to positions within the entire sentence.</Paragraph> <Paragraph position="1"> A nonhead NPB child of NP requires insertion of extra NP.</Paragraph> <Paragraph position="2"> marker for the Penn Treebank. These possessive NPs are almost always themselves base NPs and are therefore (almost always) relabeled NPB.</Paragraph> <Paragraph position="3"> For consistency's sake, when an NP has been relabeled as NPB, a normal NP node is often inserted as a parent nonterminal. This insertion ensures that NPB nodes are always dominated by NP nodes. The conditions for inserting this &quot;extra&quot; NP level are slightly more detailed than is described in Collins' thesis, however. 
The extra NP level is added if one of the following conditions holds: In postprocessing, when an NPB is an only child of an NP node, the extra NP level is removed by merging the two nodes into a single NP node, and all remaining NPB nodes are relabeled NP.</Paragraph> </Section> </Section> <Section position="7" start_page="483" end_page="488" type="metho"> <SectionTitle> 5 Only applicable if relabeling of NPs is performed using a preorder tree traversal. </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="484" end_page="484" type="sub_section"> <SectionTitle> 4.4 Repairing Base NPs </SectionTitle> <Paragraph position="0"> The insertion of extra NP levels above certain NPB nodes achieves a degree of consistency for NPs, effectively causing the portion of the model that generates children of NP nodes to have less perplexity. Collins appears to have made a similar effort to improve the consistency of the NPB model. NPB nodes that have sentential nodes as their final (rightmost) child are &quot;repaired&quot;: The sentential child is raised so that it becomes a new right-sibling of the NPB node (see Figure 3).</Paragraph> <Paragraph position="1"> While such a transformation is reasonable, it is interesting to note that Collins' parser performs no equivalent detransformation when parsing is complete, meaning that when the parser produces the &quot;repaired&quot; structure during testing, there is a spurious NP bracket.</Paragraph> </Section> <Section position="2" start_page="484" end_page="484" type="sub_section"> <SectionTitle> 4.5 Adding Gap Information </SectionTitle> <Paragraph position="0"> The gap feature is discussed extensively in chapter 7 of Collins' thesis and is applicable only to his Model 3. The preprocessing step in which gap information is added locates every null element preterminal, finds its co-indexed WHNP antecedent higher up in the tree, replaces the null element preterminal with a special trace tag, and threads the gap feature in every nonterminal in the chain between the common ancestor of the antecedent and the trace. The threaded-gap feature is represented by appending -g to every node label in the chain. The only detail we would like to highlight here is that an implementation of this preprocessing step should check for cases in which threading is impossible, such as when two filler-gap dependencies cross. An implementation should be able to handle nested filler-gap dependencies, however.</Paragraph> </Section> <Section position="3" start_page="484" end_page="485" type="sub_section"> <SectionTitle> 4.6 Relabeling Subjectless Sentences </SectionTitle> <Paragraph position="0"> The node labels of sentences with no subjects are transformed from S to SG. This step enables the parsing model to be sensitive to the different contexts in which such subjectless sentences occur as compared to normal S nodes, since the subjectless sentences are functionally acting as noun phrases. Collins' example of the letter S. For the Penn Treebank, this defines the set {S, SBAR, SBARQ, SINV, SQ}. 7 Since, as mentioned above, the only time an NPB is merged with its parent is when it is the only child of an NP.</Paragraph> <Paragraph position="1"> frontier of a subtree.</Paragraph> <Paragraph position="2"> illustrates the utility of this transformation. 
However, the conditions under which an S may be relabeled are not spelled out; one might assume that every S whose subject (identified in the Penn Treebank with the -SBJ function tag) dominates a null element should be relabeled SG. In actuality, the conditions are much stricter. An S is relabeled SG when the following conditions hold: * One of its children dominates a null element child marked with -SBJ. * Its head-child is a VP.</Paragraph> <Paragraph position="3"> * No arguments appear prior to the head-child (see Sections 4.9 and 4.11) The latter two conditions appear to be an effort to capture only those subjectless sentences that are based around gerunds, as in the flying planes example.</Paragraph> </Section> <Section position="4" start_page="485" end_page="485" type="sub_section"> <SectionTitle> 4.7 Removing Null Elements </SectionTitle> <Paragraph position="0"> Removing null elements simply involves pruning the tree to eliminate any subtree that dominates only null elements. The special trace tag that is inserted in the step that adds gap information (Section 4.5) is excluded, as it is specifically chosen to be something other than the null-element preterminal marker (which is -NONE- in the Penn Treebank).</Paragraph> </Section> <Section position="5" start_page="485" end_page="486" type="sub_section"> <SectionTitle> 4.8 Raising Punctuation </SectionTitle> <Paragraph position="0"> The step in which punctuation is raised is discussed in detail in chapter 7 of Collins' thesis. The main idea is to raise punctuation--which is any preterminal subtree in which the part of speech is either a comma or a colon--to the highest possible point in the tree, so that it always sits between two other nonterminals. Punctuation that occurs at the very beginning or end of a sentence is &quot;raised away,&quot; that is, pruned. In addition, any implementation of this step should handle the case in which multiple punctuation elements appear as the initial or final children of some node, as well as the more pathological case in which multiple punctuation elements appear along the left or right frontier of a subtree (see Figure 4). Finally, it is not clear what to do with nodes that dominate only punctuation preterminals. Our implementation simply issues a warning in such cases and leaves the punctuation symbols untouched.</Paragraph> <Paragraph position="1"> 8 We assume the G in the label SG was chosen to stand for the word gerund.</Paragraph> </Section> <Section position="6" start_page="486" end_page="488" type="sub_section"> <SectionTitle> 4.9 Identification of Argument Nonterminals </SectionTitle> <Paragraph position="0"> Collins employs a small set of heuristics to mark certain nonterminals as arguments, by appending -A to the nonterminal label. This section reveals three unpublished details about Collins' argument finding: * The published argument-finding rule for PPs is to choose the first nonterminal after the head-child. In a large majority of cases, this marks the NP argument of the preposition. 
The actual rule used is slightly more complicated: The first nonterminal to the right of the head-child that is neither PRN nor a part-of-speech tag is marked as an argument.</Paragraph> <Paragraph position="1"> The nonterminal PRN in the Penn Treebank marks parenthetical expressions, which can occur fairly often inside a PP, as in the phrase on (or above) the desk.</Paragraph> <Paragraph position="2"> * Children that are part of a coordinated phrase (see Section 4.1) are exempt from being relabeled as argument nonterminals.</Paragraph> <Paragraph position="3"> * Head-children are distinct from their siblings by virtue of the head-generation parameter class in the parsing model. In spite of this, Collins' trainer actually does not exempt head-children from being relabeled as arguments (see Figure 5).</Paragraph> <Paragraph position="4"> This step simply involves stripping away all nonterminal augmentations, except those that have been added from other preprocessing steps (such as the -A augmentation for argument labels). This includes the stripping away of all function tags and indices marked by the Treebank annotators.</Paragraph> <Paragraph position="5"> Head moves from right to left conjunct in a coordinated phrase, except when the parent nonterminal is NPB.</Paragraph> <Paragraph position="6"> With arguments identified as described in Section 4.9, if a subjectless sentence is found to have an argument prior to its head, this step detransforms the SG so that it reverts to being an S.</Paragraph> <Paragraph position="7"> 4.12 Head-Finding Head-finding is discussed at length in Collins' thesis, and the head-finding rules used are included in his Appendix A. There are a few unpublished details worth mentioning, however.</Paragraph> <Paragraph position="8"> There is no head-finding rule for NX nonterminals, so the default rule of picking the leftmost child is used.</Paragraph> <Paragraph position="9"> NX nodes roughly represent the N' level of syntax and in practice often denote base NPs. As such, the default rule often picks out a less-thanideal head-child, such as an adjective that is the leftmost child in a base NP. Collins' thesis discusses a case in which the initial head is modified when it is found to denote the right conjunct in a coordinated phrase. That is, if the head rules pick out a head that is preceded by a CC that is non-initial, the head should be modified to be the nonterminal immediately to the left of the CC (see Figure 6). An important detail is that such &quot;head movement&quot; does not occur inside base NPs. That is, a phrase headed by NPB may indeed look as though it constitutes a coordinated phrase--it has a CC that is noninitial but to the left of the currently chosen head--but the currently chosen head should remain chosen.</Paragraph> <Paragraph position="10"> As we shall see, there is exceptional behavior for base NPs in almost every part of the Collins parser.</Paragraph> <Paragraph position="11"> 10 In our first attempt at replicating Collins' results, we simply employed the same head-finding rule for NX nodes as for NP nodes. This choice yields different--but not necessarily inferior--results. 11 In Section 4.1, we defined coordinated phrases in terms of heads, but here we are discussing how the head-finder itself needs to determine whether a phrase is coordinated. 
It does this by considering the potential new choice of head: If the head-finding rules pick out a head that is preceded by a noninitial CC (Jane), will moving the head to be a child to the left of the CC (John) yield a coordinated phrase? If so, then the head should be moved--except when the parent is NPB.</Paragraph> <Paragraph position="12"> vi feature is true when generating right-hand +STOP+ nonterminal, because the NP the will to continue contains a verb.</Paragraph> </Section> </Section> <Section position="8" start_page="488" end_page="490" type="metho"> <SectionTitle> 5. Training </SectionTitle> <Paragraph position="0"> The trainer's job is to decompose annotated training trees into a series of head- and modifier-generation steps, recording the counts of each of these steps. Referring to</Paragraph> <Paragraph position="2"> are generated conditioning on previously generated items, and each of these events consisting of a generated item and some maximal history context is counted. Even with all this decomposition, sparse data are still a problem, and so each probability estimate for some generated item given a maximal context is smoothed with coarser distributions using less context, whose counts are derived from these &quot;top-level&quot; head- and modifier-generation counts.</Paragraph> <Section position="1" start_page="488" end_page="489" type="sub_section"> <SectionTitle> 5.1 Verb Intervening </SectionTitle> <Paragraph position="0"> As mentioned in Section 3, instead of generating each modifier independently, the model conditions the generation of modifiers on certain aspects of the history. One such function of the history is the distance metric. One of the two components of this distance metric is what we will call the &quot;verb intervening&quot; feature, which is a predicate vi that is true if a verb has been generated somewhere in the surface string of the previously generated modifiers on the current side of the head. For example, in Figure 7, when generating the right-hand +STOP+ nonterminal child of the VP, the vi predicate is true, because one of the previously generated modifiers on the right side of the head dominates a verb, continue.</Paragraph> <Paragraph position="1"> More formally, this feature is most easily defined in terms of a recursively defined cv (&quot;contains verb&quot;) predicate, which is true if and only if a node dominates a verb: the vi predicate. This is possible because in a history-based model (cf. Black et al. 1992), anything previously generated--that is, anything in the history--can appear in the conditioning context. Bikel Intricacies of Collins' Parsing Model Referring to (2), we define the verb-intervening predicate recursively on the first-order Markov process generating modifying nonterminals:</Paragraph> <Paragraph position="3"> and similarly for right modifiers.</Paragraph> <Paragraph position="4"> What is considered to be a verb? While this is not spelled out, as it happens, a verb is any word whose part-of-speech tag is one of {VB, VBD, VBG, VBN, VBP, VBZ}. That is, the cv predicate returns true only for these preterminals and false for all other preterminals. Crucially, this set omits MD, which is the marker for modal verbs. Another crucial point about the vi predicate is that it does not include verbs that appear within base NPs. 
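For concreteness, the cv and vi predicates can be sketched as follows. The Tree class is a hypothetical stand-in, and the sketch builds in the two points just noted: MD is not counted as a verb, and material inside base NPs never counts.

```python
from dataclasses import dataclass, field
from typing import List

VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}  # note: MD is deliberately excluded

@dataclass
class Tree:
    label: str                       # nonterminal label or part-of-speech tag
    children: List["Tree"] = field(default_factory=list)
    # a preterminal dominates a single word; words themselves are elided in this sketch

def cv(node: Tree) -> bool:
    """'Contains verb': true iff the node dominates a verb preterminal.
    Base NPs are stipulated never to contain a verb, i.e. cv(NPB) = false."""
    if node.label == "NPB":
        return False
    if not node.children:                        # preterminal
        return node.label in VERB_TAGS
    return any(cv(child) for child in node.children)

def vi(previous_modifiers: List[Tree]) -> bool:
    """'Verb intervening': true iff some previously generated modifier on the
    current side of the head dominates a verb."""
    return any(cv(m) for m in previous_modifiers)
```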
Put another way, in order to emulate Collins' model, we need to amend the definition of cv by stipulating that cv(NPB)=false.</Paragraph> </Section> <Section position="2" start_page="489" end_page="489" type="sub_section"> <SectionTitle> 5.2 Skip Certain Trees </SectionTitle> <Paragraph position="0"> One oddity of Collins' trainer that we mention here for the sake of completeness is that it skips certain training trees. For &quot;odd historical reasons,&quot; the trainer skips all trees with more than 500 tokens, where a token is considered in this context to be a word, a nonterminal label, or a parenthesis. This oddity entails that even some relatively short sentences get skipped because they have lots of tree structure. In the standard Wall Street Journal training corpus, Sections 02-21 of the Penn Treebank, there are 120 such sentences that are skipped. Unless there is something inherently wrong with these trees, one would predict that adding them to the training set would improve a parser's performance. As it happens, there is actually a minuscule (and probably statistically insignificant) drop in performance (see Table 5) when these trees are included.</Paragraph> </Section> <Section position="3" start_page="489" end_page="490" type="sub_section"> <SectionTitle> 5.3 Unknown Words </SectionTitle> <Paragraph position="0"> words occurring less than 5 times in training data, and words in test data which have never been seen in training, are replaced with the 'UNKNOWN' token (page 186).&quot; The frequency below which words are considered unknown is often called the unknown-word threshold. Unfortunately, this term can also refer to the frequency above which words are considered known. As it happens, the unknown-word threshold Collins uses in his parser for English is six, not five.</Paragraph> <Paragraph position="1"> To be absolutely unambiguous, words that occur fewer than six times, which is to say, words that occur five times or fewer, in the data are considered &quot;unknown.&quot; 5.3.2 Not Handled in a Uniform Way. The obvious way to incorporate unknown words into the parsing model, then, is simply to map all low-frequency words in the training data to some special +UNKNOWN+ token before counting top-level events for parameter estimation (where &quot;low-frequency&quot; means &quot;below the unknown-word threshold&quot;). Collins' trainer actually does not do this. Instead, it does not directly modify any of the words in the original training trees and proceeds to break up these unmodified trees into the top-level events. After these events have been collected 13 This phrase was taken from a comment in one of Collins' preprocessing Perl scripts. 14 As with many of the discovered discrepancies between the thesis and the implementation, we determined the different unknown-word threshold through reverse engineering, in this case, through an analysis of the events output by Collins' trainer.</Paragraph> <Paragraph position="2"> Computational Linguistics Volume 30, Number 4 and counted, the trainer selectively maps low-frequency words when deriving counts for the various context (back-off) levels of the parameters that make use of bilexical statistics. If this mapping were performed uniformly, then it would be identical to mapping low-frequency words prior to top-level event counting; this is not the case, however. We describe the details of this unknown-word mapping in Section 6.9.2. 
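For concreteness, the "obvious" scheme discussed above, in which every low-frequency word is mapped before any events are counted, might look like the following sketch (function names are illustrative); as just described, this is not what Collins' trainer actually does.

```python
from collections import Counter
from typing import Iterable, List

UNKNOWN = "+UNKNOWN+"
UNKNOWN_WORD_THRESHOLD = 6   # words seen fewer than 6 times (i.e., 5 or fewer) are "unknown"

def word_counts(training_sentences: Iterable[List[str]]) -> Counter:
    counts = Counter()
    for sentence in training_sentences:
        counts.update(sentence)
    return counts

def map_unknowns(sentence: List[str], counts: Counter) -> List[str]:
    """The 'obvious' scheme: map every low-frequency word to +UNKNOWN+ up front,
    before any top-level events are counted."""
    return [w if counts[w] >= UNKNOWN_WORD_THRESHOLD else UNKNOWN for w in sentence]

# Example: "dog" occurs 5 times and is therefore mapped; "sat" occurs 6 times and is kept.
counts = word_counts([["the", "dog", "barked"]] * 5 + [["the", "cat", "sat"]] * 6)
print(map_unknowns(["the", "dog", "sat"], counts))   # ['the', '+UNKNOWN+', 'sat']
```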
While there is a negligible yet detrimental effect on overall parsing performance when one uses an unknown-word threshold of five instead of six, when this change is combined with the &quot;obvious&quot; method for handling unknown words, there is actually a minuscule improvement in overall parsing performance (see Table 5).</Paragraph> </Section> </Section> <Section position="9" start_page="490" end_page="496" type="metho"> <SectionTitle> 6. Parameter Classes and Their Estimation </SectionTitle> <Paragraph position="0"> All parameters that generate trees in Collins' model are estimates of conditional probabilities. Even though the following overview of parameter classes presents only the maximal contexts of the conditional probability estimates, it is important to bear in mind that the model always makes use of smoothed probability estimates that are the linear interpolation of several raw maximum-likelihood estimates, using various amounts of context (we explore smoothing in detail in Section 6.8).</Paragraph> <Section position="1" start_page="490" end_page="490" type="sub_section"> <SectionTitle> 6.1 Mapped Versions of the Set of Nonterminals </SectionTitle> <Paragraph position="0"> In Sections 4.5 and 4.9, we saw how the raw Treebank nonterminal set is expanded to include nonterminals augmented with -A and -g. Although it is not made explicit in Collins' thesis, Collins' model uses two mapping functions to remove these augmentations when including nonterminals in the history contexts of conditional probabilities.</Paragraph> <Paragraph position="1"> Presumably this was done to help alleviate sparse-data problems. We denote the &quot;argument removal&quot; mapping function as alpha and the &quot;gap removal&quot; mapping function as gamma. For example:</Paragraph> <Paragraph position="3"> Since gap augmentations are present only in Model 3, the gamma function effectively is the identity function in the context of Models 1 and 2.</Paragraph> </Section> <Section position="2" start_page="490" end_page="490" type="sub_section"> <SectionTitle> 6.2 The Head Parameter Class </SectionTitle> <Paragraph position="0"> The head nonterminal is generated conditioning on its parent nonterminal label, as well as the headword and head tag which they share, since parents inherit their lexical head information from their head-children. More specifically, an unlexicalized head nonterminal label is generated conditioning on the fully lexicalized parent nonterminal. We denote the parameter class as follows:</Paragraph> <Paragraph position="2"/> </Section> <Section position="3" start_page="490" end_page="491" type="sub_section"> <SectionTitle> 6.3 The Subcategorization Parameter Class </SectionTitle> <Paragraph position="0"> When the model generates a head-child nonterminal for some lexicalized parent nonterminal, it also generates a kind of subcategorization frame (subcat) on either side of the head-child, with the following maximal context:</Paragraph> <Paragraph position="2"> A fully lexicalized tree. The VP node is the head-child of S.</Paragraph> <Paragraph position="4"> Probabilistically, it is as though these subcats are generated with the head-child, via application of the chain rule, but they are conditionally independent.</Paragraph> <Paragraph position="5"> These subcats may be thought of as lists of requirements on a particular side of a head. 
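Concretely, a subcat can be represented as a small multiset of outstanding requirements that shrinks as arguments are generated. The following is a minimal sketch under that assumption; the class and method names are illustrative.

```python
from collections import Counter

class Subcat:
    """A dynamically updated multiset of outstanding argument requirements
    on one side of a head (e.g., {NP-A} to the left of a VP head)."""

    def __init__(self, requirements=()):
        self._reqs = Counter(requirements)

    def satisfies(self, label: str) -> bool:
        return self._reqs[label] > 0

    def discharge(self, label: str) -> "Subcat":
        """Return the subcat left after generating a modifier with this label."""
        updated = Counter(self._reqs)
        if updated[label] > 0:
            updated[label] -= 1
        return Subcat(+updated)        # '+' drops zero counts

    def is_empty(self) -> bool:
        return not self._reqs

# Example: an S whose head VP requires one NP-A subject to its left and nothing to its right.
left, right = Subcat(["NP-A"]), Subcat()
print(left.discharge("NP-A").is_empty())   # True: the requirement has been satisfied
```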
For example, in Figure 8, after the root node of the tree has been generated (see Section 6.10), the head child VP is generated, conditioning on both the parent label S and the headword of that parent, sat-VBD. Before any modifiers of the head-child are generated, both a left- and right-subcat frame are generated. In this case, the left subcat is {NP-A} and the right subcat is {}, meaning that there are no required elements to be generated on the right side of the head. Subcats do not specify the order of the required arguments. They are dynamically updated multisets: When a requirement has been generated, it is removed from the multiset, and subsequent modifiers are generated conditioning on the updated multiset.</Paragraph> <Paragraph position="6"> The implementation of subcats in Collins' parser is even more specific: Subcats are multisets containing various numbers of precisely six types of items: NP-A, S-A, SBAR-A, VP-A, g, and miscellaneous. The g indicates that a gap must be generated and is applicable only to Model 3. Miscellaneous items include all nonterminals that were marked as arguments in the training data that were not any of the other named types. There are rules for determining whether NPs, Ss, SBARs, and VPs are arguments, and the miscellaneous arguments occur as the result of the argument-finding rule for PPs, which states that the first non-PRN, non-part-of-speech tag that occurs after the head of a PP should be marked as an argument, and therefore nodes that are not one of the four named types can be marked.</Paragraph> </Section> <Section position="4" start_page="491" end_page="496" type="sub_section"> <SectionTitle> 6.4 The Modifying Nonterminal Parameter Class </SectionTitle> <Paragraph position="0"> As mentioned above, after a head-child and its left and right subcats are generated, modifiers are generated from the head outward, as indicated by the modifier nonterminal indices in Figure 1. A fully lexicalized nonterminal has three components: the nonterminal label, the headword, and the headword's part of speech. Fully lexicalized modifying nonterminals are generated in two steps to allow for the parameters to be independently smoothed, which, in turn, is done to avoid sparse-data problems. These two steps estimate the joint event of all three components using the chain rule. In the 15 Using separate steps to generate subcats on either side of the head allows not only for conditional independence between the left and right subcats, but also for these parameters to be separately smoothed from the head-generation parameter.</Paragraph> <Paragraph position="1"> 16 Our parsing engine allows an arbitrary mechanism for storage and discharge of requirements: They can be multisets, ordered lists, integers (simply to constrain the number of requirements), or any other mechanism. The mechanism used is determined at runtime.</Paragraph> <Paragraph position="2"> A tree containing both punctuation and conjunction.</Paragraph> <Paragraph position="3"> first step, a partially lexicalized version of the nonterminal is generated, consisting of the unlexicalized label plus the part of speech of its headword. These partially lexicalized modifying nonterminals are generated conditioning on the parent label, the head label, the headword, the head tag, the current state of the dynamic subcat, and a distance metric. Symbolically, the parameter classes are</Paragraph> <Paragraph position="5"> where [?] 
denotes the distance metric.</Paragraph> <Paragraph position="6"> As discussed above, one of the two components of this distance metric is the vi predicate. The other is a predicate that simply reports whether the current modifier is the first modifier being generated, that is, whether i = highest position in the tree. This means that in some sense, punctuation acts very much like a coordinating conjunction, in that it &quot;conjoins&quot; the two siblings between which it sits. Observing that it might be helpful for conjunctions to be generated conditioning on both of their conjuncts, Collins introduced two new parameter classes in his thesis parser, P</Paragraph> <Paragraph position="8"> As per the definition of a coordinated phrase in Section 4.1, conjunction via a CC node or a punctuation node always occurs posthead (i.e., as a right-sibling of the head). Put another way, if a conjunction or punctuation mark occurs prehead, it is 17 Throughout this article we use the notation L(w, t) i to refer to the three items that constitute a fully lexicalized left-modifying nonterminal, which are the unlexicalized label L</Paragraph> <Paragraph position="10"> to refer to the two items L</Paragraph> <Paragraph position="12"> of a partially lexicalized nonterminal. Finally, when we do not wish to distinguish between a left and right modifier, we use M(w, t)</Paragraph> <Paragraph position="14"> 18 Collins' thesis does not say what the back-off structure of these new parameter classes is, that is, how they should be smoothed. We have included this information in the complete smoothing table in the Appendix.</Paragraph> <Paragraph position="15"> Bikel Intricacies of Collins' Parsing Model not generated via this mechanism.</Paragraph> <Paragraph position="16"> Furthermore, even if there is arbitrary material between the right conjunct and the head, the parameters effectively assume that the left conjunct is always the head-child. For example, in Figure 9, the rightmost NP (bushy bushes) is considered to be conjoined to the leftmost NP (short grass), which is the head-child, even though there is an intervening NP (tall trees). The new parameters are incorporated into the model by requiring that all modifying nonterminals be generated with two boolean flags: coord, indicating that the nonterminal is conjoined to the head via a CC, and punc, indicating that the nonterminal is conjoined to the head via a punctuation mark. When either or both of these flags is true, the intervening punctuation or conjunction is generated via appropriate instances of the P</Paragraph> <Paragraph position="18"> parameter classes.</Paragraph> <Paragraph position="19"> For example, the model generates the five children in Figure 9 in the following order: first, the head-child is generated, which is the leftmost NP (short grass), conditioning on the parent label and the headword and tag. Then, since modifiers are always generated from the head outward, the right-sibling of the head, which is the tall trees NP, is generated with both the punc and CC flags false. Then, the rightmost NP (bushy bushes) is generated with both the punc and CC booleans true, since it is considered to be conjoined to the head-child and requires the generation of an intervening punctuation mark and conjunction. 
Finally, the intervening punctuation is generated conditioning on the parent, the head, and the right conjunct, including the headwords of the two conjoined phrases, and the intervening CC is similarly generated.</Paragraph> <Paragraph position="20"> A simplified version of the probability of generating all these children is summarized as follows:</Paragraph> <Paragraph position="22"> The idea is that using the chain rule, the generation of two conjuncts and that which conjoins them is estimated as one large joint event.</Paragraph> <Paragraph position="23"> This scheme of using flags to trigger the P</Paragraph> <Paragraph position="25"> parameters is problematic, at least from a theoretical standpoint, as it causes the model to be inconsistent. Figure 10 shows three different trees that would all receive the same probability from Collins' model. The problem is that coordinating conjunctions and punctuation are not generated as first-class words, but only as triggered from these punc and coord flags, meaning that the number of such intervening conjunctive items (and the order in which they are to be generated) is not specified. So for a given sentence/tree pair containing a conjunction and/or a punctuation mark, there is an infinite number of similar sentence/tree pairs with arbitrary amounts of &quot;conjunctive&quot; material between the same two nodes. Because all of these trees have the same, nonzero probability, the</Paragraph> <Paragraph position="27"> P(T), where T is a possible tree generated by the model, diverges, meaning the model is inconsistent (Booth and Thompson 1973). Another consequence of not generating posthead conjunctions and punctuation as first-class words is that they 19 In fact, if punctuation occurs before the head, it is not generated at all--a deficiency in the parsing model that appears to be a holdover from the deficient punctuation handling in the model of Collins (1997).</Paragraph> <Paragraph position="28"> 20 In (9), for clarity we have left out subcat generation and the use of Collins' distance metric in the conditioning contexts. We have also glossed over the fact that lexicalized modifying nonterminals are actually generated in two steps, using two differently smoothed parameters.</Paragraph> <Paragraph position="29"> The Collins model assigns equal probability to these three trees.</Paragraph> <Paragraph position="30"> do not count when calculating the head-adjacency component of Collins' distance metric.</Paragraph> <Paragraph position="31"> When emulating Collins' model, instead of reproducing the P</Paragraph> <Paragraph position="33"> parameter classes directly in our parsing engine, we chose to use a different mechanism that does not yield an inconsistent model but still estimates the large joint event that was the motivation behind these parameters in the first place.</Paragraph> <Paragraph position="34"> rather than the dedicated parameter classes P</Paragraph> <Paragraph position="36"> , to estimate the joint event of generating a conjunction (or punctuation mark) and its two conjuncts. The first big change that results is that we treat punctuation preterminals and CCs as first-class objects, meaning that they are generated in the same way as any other modifying nonterminal.</Paragraph> <Paragraph position="37"> The second change is a little more involved. First, we redefine the distance metric to consist solely of the vi predicate. 
Then, we add to the conditioning context a mapped version of the previously generated modifier according to the following Bikel Intricacies of Collins' Parsing Model mapping function:</Paragraph> <Paragraph position="39"> So, the maximal context for our modifying nonterminal parameter class is now defined as follows:</Paragraph> <Paragraph position="41"> where side is a boolean-valued event that indicates whether the modifier is on the left or right side of the head. By treating CC and punctuation nodes as first-class nonterminals and by adding the mapped version of the previously generated modifier, we have, in one fell swoop, incorporated the &quot;no intervening&quot; component of Collins' distance metric (the i = 0 case of the delta function) and achieved an estimate of the joint event of a conjunction and its conjuncts, albeit with different dependencies, that is, a different application of the chain rule. To put this parameterization change in sharp relief, consider the abstract tree structure |P} is being estimated, but with the new method, there is no need to add two new specialized parameter classes, and the new method does not introduce inconsistency into the model. Using less simplification, the probability of generating the five children of Figure 9 is</Paragraph> <Paragraph position="43"> 21 Originally, we had an additional mechanism that attempted to generate punctuation and conjunctions with conditional independence. One of our reviewers astutely pointed out that the mechanism led to a deficient model (the very thing we have been trying to avoid), and so we have subsequently removed it from our model. The removal leads to a 0.05% absolute reduction in F-measure (which in this case is also a 0.05% relative increase in error) on sentences of length [?] 40 words in Section 00 of the Penn Treebank. As this difference is not at all statistically significant (according to a randomized stratified shuffling test [Cohen 1995]), all evaluations reported in this article are with the original model.</Paragraph> <Paragraph position="44"> Computational Linguistics Volume 30, Number 4 As shown in Section 8.1, this new parameterization yields virtually identical performance to that of the Collins model.</Paragraph> </Section> <Section position="5" start_page="496" end_page="496" type="sub_section"> <SectionTitle> 6.6 The Base NP Model: A Model unto Itself </SectionTitle> <Paragraph position="0"> As we have already seen, there are several ways in which base NPs are exceptional in Collins' parsing model. This is partly because the flat structure of base NPs in the Penn Treebank suggested the use of a completely different model by which to generate them. Essentially, the model for generating children of NPB nodes is a &quot;bigrams of nonterminals&quot; model. That is, it looks a great deal like a bigram language model, except that the items being generated are not words, but lexicalized nonterminals.</Paragraph> <Paragraph position="1"> Heads of NPB nodes are generated using the normal head-generation parameter, but modifiers are always generated conditioning not on the head, but on the previously generated modifier. That is, we modify expressions (7) and (8) to be</Paragraph> <Paragraph position="3"> Though it is not entirely spelled out in his thesis, Collins considers the previously generated modifier to be the head-child, for all intents and purposes. 
Thus, the subcat and distance metrics are always irrelevant, since it is as though the current modifier is right next to the head.</Paragraph> <Paragraph position="4"> Another consequence of this is that NPBs are never considered to be coordinated phrases (as mentioned in Section 4.12), and thus CCs dominated by NPB are never generated using a P</Paragraph> </Section> </Section> <Section position="10" start_page="496" end_page="501" type="metho"> <SectionTitle> CC </SectionTitle> <Paragraph position="0"> parameter; instead, they are generated using a normal modifying-nonterminal parameter. Punctuation dominated by NPB,on the other hand, is still, as always, generated via P punc parameters, but crucially, the modifier is always conjoined (via the punctuation mark) to the &quot;pseudohead&quot; that is the previously generated modifier. Consequently, when some right modifier R</Paragraph> <Paragraph position="2"> generated, the previously generated modifier on the right side of the head, R i[?]1 ,is never a punctuation preterminal, but always the previous &quot;real&quot; (i.e., nonpunctuation) preterminal.</Paragraph> <Paragraph position="3"> Base NPs are also exceptional with respect to determining chart item equality, the comma-pruning rule, and general beam pruning (see Section 7.2 for details).</Paragraph> <Section position="1" start_page="496" end_page="497" type="sub_section"> <SectionTitle> 6.7 Parameter Classes for Priors on Lexicalized Nonterminals </SectionTitle> <Paragraph position="0"> Two parameter classes that make their appearance only in Appendix E of Collins' thesis are those that compute priors on lexicalized nonterminals. These priors are used as a crude proxy for the outside probability of a chart item (see Baker [1979] and Lari and Young [1990] for full descriptions of the Inside-Outside algorithm). Previous work (Goodman 1997) has shown that the inside probability alone is an insufficient scoring metric when comparing chart items covering the same span during decoding and that some estimate of the outside probability of a chart item should be factored into the score. A prior on the root (lexicalized) nonterminal label of the derivation forest represented by a particular chart item is used for this purpose in Collins' parser.</Paragraph> <Paragraph position="1"> 22 As described in Bikel (2002), our parsing engine allows easy experimentation with a wide variety of different generative models, including the ability to construct history contexts from arbitrary numbers of previously generated modifiers. The mapping function delta and the transition function tau presented in this section are just two examples of this capability.</Paragraph> <Paragraph position="2"> 23 This is the main reason that the cv (&quot;contains verb&quot;) predicate is always false for NPBs, as that predicate applies only to material that intervenes between the current modifier and the head.</Paragraph> <Paragraph position="3"> 24 Interestingly, unlike in the regular model, punctuation that occurs to the left of the head is generated when it occurs within an NPB. 
Thus, this particular--albeit small--deficiency of Collins' punctuation handling does not apply to the base NP model.</Paragraph> <Paragraph position="4"> Bikel Intricacies of Collins' Parsing Model The prior of a lexicalized nonterminal M(w, t) is broken down into two separate estimates using parameters from two new classes, P</Paragraph> <Paragraph position="6"> class are unsmoothed.</Paragraph> </Section> <Section position="2" start_page="497" end_page="499" type="sub_section"> <SectionTitle> 6.8 Smoothing Weights </SectionTitle> <Paragraph position="0"> Many of the parameter classes in Collins' model--and indeed, in most statistical parsing models--define conditional probabilities with very large conditioning contexts. In this case, the conditioning contexts represent some subset of the history of the generative process. Even if there were orders of magnitude more training data available, the large size of these contexts would cause horrendous sparse-data problems. The solution is to smooth these distributions that are made rough primarily by the abundance of zeros. Collins uses the technique of deleted interpolation, which smoothes the distributions based on full contexts with those from coarser models that use less of the context, by successively deleting elements from the context at each back-off level. As a simple example, the head parameter class smoothes P mate in the back-off chain is computed via maximum-likelihood (ML) estimation, and the overall smoothed estimate with n back-off levels is computed using n[?]1 smoothing weights, denoted l</Paragraph> <Paragraph position="2"> back-off level i is computed via the formula</Paragraph> <Paragraph position="4"> So, for example, with three levels of back-off, the overall smoothed estimate would be defined as</Paragraph> <Paragraph position="6"> Each smoothing weight can be conceptualized as the confidence in the estimate with which it is being multiplied. These confidence values can be derived in a number of sensible ways; the technique used by Collins was adapted from that used in Bikel et al. (1997), which makes use of a quantity called the diversity of the history context (Witten and Bell 1991), which is equal to the number of unique futures observed in training for that history context.</Paragraph> <Paragraph position="7"> 6.8.1 Deficient Model. As previously mentioned, n back-off levels require n[?]1 smoothing weights. Collins' parser effectively uses n weights, because the estimator always Computational Linguistics Volume 30, Number 4 adds an extra, constant-valued estimate to the back-off chain. Collins' parser hardcodes this extra value to be a vanishingly small (but nonzero) &quot;probability&quot; of 10 causes all estimates in the parser to be deficient, as it ends up throwing away probability mass. More formally, the proof leading to equation (17) no longer holds: The &quot;distribution&quot; sums to less than one (there is no history context in the model for which there are 10 possible outcomes).</Paragraph> <Paragraph position="8"> for computing smoothing weights is</Paragraph> <Paragraph position="10"> is the diversity of that context.</Paragraph> <Paragraph position="11"> The multiplicative constant five is used to give less weight to the back-off levels with more context and was optimized by looking at overall parsing performance on the development test set, Section 00 of the Penn Treebank. We call this constant the smoothing factor and denote it as f f . 
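Assembling the published pieces, the smoothing-weight formula (with f = 5) and the deleted-interpolation chain can be sketched as follows; this is a sketch of the formula as published, and the formula actually implemented differs, as described next.

```python
def smoothing_weight(context_count: float, context_diversity: float,
                     smoothing_factor: float = 5.0) -> float:
    """Witten-Bell-style confidence weight for one back-off level:
    lambda = c / (c + f * u), where c is the count of the history context,
    u its diversity (number of distinct futures observed with it), and f = 5."""
    denom = context_count + smoothing_factor * context_diversity
    return context_count / denom if denom > 0.0 else 0.0

def interpolate(ml_estimates, lambdas) -> float:
    """Deleted interpolation of maximum-likelihood estimates e_1..e_n
    (most to least context) using weights lambda_1..lambda_{n-1}:
        e_hat_i = lambda_i * e_i + (1 - lambda_i) * e_hat_{i+1}."""
    smoothed = ml_estimates[-1]                      # start from the coarsest estimate
    for e, lam in zip(reversed(ml_estimates[:-1]), reversed(lambdas)):
        smoothed = lam * e + (1.0 - lam) * smoothed
    return smoothed

# Three back-off levels: e_hat = l1*e1 + (1-l1)*(l2*e2 + (1-l2)*e3)
e = [0.50, 0.30, 0.10]                               # ML estimates, most context first
l = [smoothing_weight(12, 4), smoothing_weight(40, 9)]
print(interpolate(e, l))
```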
As it happens, the actual formula for computing smoothing weights in Collins' implementation is</Paragraph> <Paragraph position="13"> is an unmentioned smoothing term. For every parameter class except the subcat parameter class and P</Paragraph> <Paragraph position="15"> = 0.0. This curiously means that diversity is not used at all when smoothing subcat-generation probabilities. The second case in (19) handles the situation in which the history context was never observed in training, that is, where c</Paragraph> <Paragraph position="17"> = 0, which would yield an undefined value 25 Collins used this technique to ensure that even futures that were never seen with an observed history context would still have some probability mass, albeit a vanishingly small one (Collins, personal communication, January 2003). Another commonly used technique would be to back off to the uniform distribution, which has the desirable property of not producing deficient estimates. As with all of the treebank- or model-specific aspects of the Collins parser, our engine uses equation (16) or (18) depending on the value of a particular run-time setting.</Paragraph> <Paragraph position="18"> 26 The smoothing weights can be viewed as confidence values for the probability estimates with which they are multiplied. The Witten-Bell technique crucially makes use of the quantity n</Paragraph> <Paragraph position="20"> (B) to a possible future. With a little algebraic manipulation, we have</Paragraph> <Paragraph position="22"> = 1, that is, when every future observed in training was unique. This latter case represents when the model is most &quot;uncertain,&quot; in that the transition distribution from ph</Paragraph> <Paragraph position="24"> (B) is uniform and poorly trained (one observation per possible transition). Because these smoothing weights measure, in some sense, the closeness of the observed distribution to uniform, they can be viewed as proxies for the entropy of the distribution p(*|ph</Paragraph> <Paragraph position="26"> parameters are unsmoothed. However, as a result of the deficient estimation method, they still have an associated lambda value, the computation of which, just like the subcat-generation probability estimates, does not make use of diversity.</Paragraph> <Paragraph position="28"> are, respectively, the headword and its part of speech of the nonterminal L</Paragraph> <Paragraph position="30"> . This table is basically a reproduction of the last column of Table 7.1 in Collins' thesis.</Paragraph> <Paragraph position="32"> Our new parameter class for the generation of headwords of modifying nonterminals.</Paragraph> <Paragraph position="34"> (B) has never been observed in training, the smoothed estimate using less context, ph (B), is simply substituted as the &quot;best guess&quot; for the estimate using more context; that is, ~e</Paragraph> <Paragraph position="36"/> </Section> <Section position="3" start_page="499" end_page="501" type="sub_section"> <SectionTitle> 6.9 Modifier Head-Word Generation </SectionTitle> <Paragraph position="0"> As mentioned in Section 6.4, fully lexicalized modifying nonterminals are generated in two steps. First, the label and part-of-speech tag are generated with an instance of P</Paragraph> <Paragraph position="2"> . The back-off contexts for the smoothed estimates of these parameters are specified in Table 1. 
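The back-off structure for modifier-headword generation can also be sketched in code. The field names below are illustrative, and the composition of the first two levels follows Table 1 only approximately; each function maps a full generation event to the tuple of history elements retained at that level.

```python
def context_level_1(ev):
    # full context: modifier label and tag, coord/punc flags, parent label, head label,
    # headword and head tag, distance metric, and the current subcat
    return (ev["L"], ev["t"], ev["coord"], ev["punc"], ev["P"], ev["H"],
            ev["w_head"], ev["t_head"], ev["dist"], ev["subcat"])

def context_level_2(ev):
    # same as level 1 with the headword deleted
    return (ev["L"], ev["t"], ev["coord"], ev["punc"], ev["P"], ev["H"],
            ev["t_head"], ev["dist"], ev["subcat"])

def context_level_3(ev):
    # only the modifier's own part of speech survives: effectively p(w | t)
    return (ev["t"],)
```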
Notice how the last level of back-off is markedly different from the previous two levels in that it removes nearly all the elements of the history: In the face of sparse data, the probability of generating the headword of a modifying nonterminal is conditioned only on its part of speech.</Paragraph> <Paragraph position="3"> order to capture the most data for the crucial last level of back-off, Collins uses words that occur on either side of the headword, resulting in a general estimate ^p(w|t),as</Paragraph> <Paragraph position="5"> ). Accordingly, in our emulation of Collins' model, we replace the left- and right-word parameter classes with a single modifier headword generation parameter class that, as with (11), includes a boolean side component that is deleted from the last level of back-off (see Table 2).</Paragraph> <Paragraph position="6"> Even with this change, there is still a problem. Every headword in a lexicalized parse tree is the modifier of some other headword--except the word that is the head of the entire sentence (i.e., the headword of the root nonterminal). In order to properly duplicate Collins' model, an implementation must take care that the P(w|t) model includes counts for these important headwords.</Paragraph> <Paragraph position="7"> 28 This fact is crucial in understanding how little the Collins parsing model relies on bilexical statistics, as described in Section 8.2 and the supporting experiment shown in Table 6.</Paragraph> <Paragraph position="8"> 29 In our implementation, we add such counts by having our trainer generate a &quot;fake&quot; modifier event in The low-frequency word Fido is mapped to +UNKNOWN+, but only when it is generated, not when it is conditioned upon. All the nonterminals have been lexicalized (except for preterminals) to show where the heads are.</Paragraph> <Paragraph position="9"> 6.9.2 Unknown-Word Mapping. As mentioned above, instead of mapping every low-frequency word in the training data to some special +UNKNOWN+ token, Collins' trainer instead leaves the training data untouched and selectively maps words that appear in the back-off levels of the parameters from the P</Paragraph> <Paragraph position="11"> parameter classes. Rather curiously, the trainer maps only words that appear in the futures of these parameters, but never in the histories. Put another way, low-frequency words are generated as +UNKNOWN+ but are left unchanged when they are conditioned upon. For example, in Figure 11, where we assume Fido is a low-frequency word, the trainer would derive counts for the smoothed parameter</Paragraph> <Paragraph position="13"> However, when collecting events that condition on Fido, such as the parameters</Paragraph> <Paragraph position="15"> the word would not be mapped.</Paragraph> <Paragraph position="16"> This strange mapping scheme has some interesting consequences. First, imagine what happens to words that are truly unknown, that never occurred in the training data. Such words are mapped to the +UNKNOWN+ token outright before parsing. 
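A sketch of the selective mapping just described, with hypothetical helper names: a low-frequency word such as Fido is replaced by +UNKNOWN+ only when it is the generated word (the future), never when it appears in the conditioning context (the history).

```python
UNKNOWN = "+UNKNOWN+"

def map_word(word: str, counts: dict, threshold: int = 6) -> str:
    return UNKNOWN if counts.get(word, 0) < threshold else word

def bilexical_event(modifier_word: str, head_word: str, counts: dict):
    """Derive the (future, history) pair used to count a bilexical event.
    Only the generated word (the future) is subject to unknown-word mapping;
    the conditioning headword is left untouched."""
    future = map_word(modifier_word, counts)   # e.g., 'Fido' -> '+UNKNOWN+'
    history = head_word                        # 'Fido' stays 'Fido' here
    return future, history

counts = {"barks": 100, "Fido": 2}
print(bilexical_event("Fido", "barks", counts))   # ('+UNKNOWN+', 'barks')
print(bilexical_event("barks", "Fido", counts))   # ('barks', 'Fido')
```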
Whenever the parser estimates a probability with such a truly unknown word in the history, it will necessarily throw all probability mass to the backed-off estimate (~e in our earlier notation), since +UNKNOWN+ effectively never occurred in a history context during training.</Paragraph> <Paragraph position="17"> The second consequence is that the mapping scheme yields a &quot;superficient&quot; model, if all other parts of the model are probabilistically sound (which is actually which the observed lexicalized root nonterminal is considered a modifier of +TOP+, the hidden nonterminal that is the parent of the observed root of every tree (see Section 6.10 for details on the +TOP+ nonterminal).</Paragraph> <Paragraph position="18"> 30 The term deficient is used to denote a model in which one or more estimated distributions sums to less than 1. We use the term superficient to denote a model in which one or more estimated distributions sums to greater than 1.</Paragraph> <Paragraph position="19"> not the case here). With a parsing model such as Collins' that uses bilexical dependencies, generating words in the course of parsing is done very much as it is in a bigram language model: Every word is generated conditioning on some previously generated word, as well as some hidden material. The only difference is that the word being conditioned upon is often not the immediately preceding word in the sentence. However, one could plausibly construct a consistent bigram language model that generates words with the same dependencies as those in a statistical parser that uses bilexical dependencies derived from head-lexicalization.</Paragraph> <Paragraph position="20"> Collins (personal communication, January 2003) notes that his parser's unknownword-mapping scheme could be made consistent if one were to add a parameter class that estimated ^p(w|+UNKNOWN+), where w [?] V L [?]{+UNKNOWN+}. The values of these estimates for a given sentence would be constant across all parses, meaning that the &quot;superficiency&quot; of the model would be irrelevant when determining arg max T P(T|S).</Paragraph> <Paragraph position="21"> 6.10 The top Parameter Classes It is assumed that all trees that can be generated by the model have an implicit non-terminal +TOP+ that is the parent of the observed root. The observed lexicalized root nonterminal is generated conditioning on +TOP+ (which has a prior probability of 1.0) using a parameter from the class P TOP . This special parameter class is mentioned in a footnote in chapter 7 of Collins' thesis. There are actually two parameter classes used to generated observed roots, one for generating the partially lexicalized root nonterminal, which we call P</Paragraph> </Section> </Section> <Section position="11" start_page="501" end_page="503" type="metho"> <SectionTitle> TOP NT </SectionTitle> <Paragraph position="0"> , and the other for generating the headword of the entire sentence, which we call P (w|t), which is to say the probability of a word's occurring with a tag in the space of lexicalized nonterminals. This is different from the last level of back-off in the modifier headword parameter classes, which is effectively estimating ^p(w|t) in the space of lexicalized preterminals. The difference is that in the same sentence, the same headword can occur with the same tag in multiple nodes, such as sat in Figure 8, which occurs with the tag VBD three times (instead of just once) in the tree shown there. 
Despite this difference, Collins' parser uses counts from the (shared) last level of back-off of the modifier headword parameter classes.</Paragraph> </Section> </Section> <Section position="11" start_page="501" end_page="503" type="metho"> <SectionTitle> 7. Decoding </SectionTitle> <Paragraph position="0"> Parsing, or decoding, is performed via a probabilistic version of the CKY chart-parsing algorithm. As with normal CKY, even though the model is defined in a top-down, generative manner, decoding proceeds bottom-up. Collins' thesis gives a pseudocode version of his algorithm in an appendix. This section contains a few practical details.</Paragraph> <Section position="1" start_page="502" end_page="502" type="sub_section"> <SectionTitle> 7.1 Chart Item Equality </SectionTitle> <Paragraph position="0"> Since the goal of the decoding process is to determine the maximally likely theory, if during decoding a proposed chart item is equal (or, technically, equivalent) to an item that is already in the chart, the one with the greater score survives. Chart item equality is closely tied to the generative parameters used to construct theories: We want to treat two chart items as unequal if they represent derivation forests that would be considered unequal according to the output elements and conditioning contexts of the parameters used to generate them, subject to the independence assumptions of the model. For example, for two chart items to be considered equal, they must have the same label (the label of the root of their respective derivation forests' subtrees), the same headword and tag, and the same left and right subcat. They must also have the same head label (that is, the label of the head child).</Paragraph> <Paragraph position="1"> If a chart item's root label is an NP node, its head label is most often an NPB node, given the &quot;extra&quot; NP levels that are added during preprocessing to ensure that NPB nodes are always dominated by NP nodes. In such cases, the chart item will contain a back pointer to the chart item that represents the base NP. Curiously, however, Collins' implementation considers the head label of the NP chart item not to be NPB, but rather the head label of the NPB chart item. In other words, to get the head label of an NP chart item, one must &quot;peek through&quot; the NPB and get at the NPB's head label. Presumably, this was done as a consideration for the NPB nodes' being &quot;extra&quot; nodes, in some sense. It appears to have little effect on overall parsing accuracy, however.</Paragraph> </Section> <Section position="2" start_page="502" end_page="503" type="sub_section"> <SectionTitle> 7.2 Pruning </SectionTitle> <Paragraph position="0"> Ideally, every parse theory could be kept in the chart, and when the root symbol had been generated for all theories, the top-ranked one would &quot;win.&quot; In order to speed things up, Collins employs three different types of pruning. The first form of pruning is to use a beam: The chart memoizes the highest-scoring theory in each span, and if a proposed chart item for that span is not within a certain factor of the top-scoring item, it is not added to the chart. Collins reports in his thesis that he uses a beam width of . As it happens, the beam width for his thesis experiments was 10 . Interestingly, there is a negligible difference in overall parsing accuracy when this wider beam is used (see Table 5). An interesting modification to the standard beam in Collins' parser is that for chart items representing NP or NP-A derivations with more than one child, the beam is expanded to be 10 .
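The equivalence check of Section 7.1 and the beam check just described can be combined in a small sketch. The field names are illustrative, and the beam factors are left as arguments rather than fixed to particular values.

import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Signature:
    """What two chart items must share to be considered equivalent."""
    label: str           # root label of the item's derivation
    headword: str
    headtag: str
    left_subcat: tuple   # remaining arguments to the left (a multiset, here sorted)
    right_subcat: tuple  # remaining arguments to the right
    head_label: str      # label of the head child (peeking through NPB)

@dataclass
class ChartItem:
    signature: Signature
    num_children: int
    log_prob: float

def add_to_chart(chart, item):
    """Keep only the higher-scoring of two equivalent items."""
    existing = chart.get(item.signature)
    if existing is None or item.log_prob > existing.log_prob:
        chart[item.signature] = item

def within_beam(item, best_log_prob, beam_factor, np_beam_factor):
    """Reject items whose probability is not within a multiplicative factor of
    the best item covering the same span; the beam is widened for NP or NP-A
    items with more than one child."""
    factor = np_beam_factor if (
        item.signature.label in ("NP", "NP-A") and item.num_children > 1
    ) else beam_factor
    return item.log_prob >= best_log_prob + math.log(factor)

In this parameterization a wider beam corresponds to a smaller factor, so np_beam_factor would be smaller than beam_factor.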
We suspect that Collins made this modification after he added the base NP model, to handle the greater perplexity associated with NPs.</Paragraph> <Paragraph position="1"> The second form of pruning employed is a comma constraint. Collins observed that in the Penn Treebank data, 96% of the time, when a constituent contained a comma, the word immediately following the end of the constituent's span was either a comma or the end of the sentence. So for speed reasons, the decoder rejects all theories that would generate constituents that violate this comma constraint. There is a subtlety to Collins' implementation of this form of pruning, however. Commas are quite common within parenthetical phrases. Accordingly, if a comma in an input sentence occurs after an opening parenthesis and before a closing parenthesis or the end of the sentence, it is not considered a comma for the purposes of the comma constraint. Another subtlety is that the comma constraint should effectively not be employed when pursuing theories of an NPB subtree. As it turns out, using the comma constraint also affects accuracy, as shown in Section 8.1.</Paragraph> <Paragraph position="2"> 31 If one generates commas as first-class words, as we have done, one must take great care in applying this comma constraint, for otherwise, chart items that represent partially completed constituents (i.e., constituents for which not all modifiers have been generated) may be incorrectly rejected. This is especially important for NPB constituents.</Paragraph> <Paragraph position="3"> Table 4 (caption): Overall parsing results using only details found in Collins (1997, 1999). The first two lines show the results of Collins' parser and those of our parser in its &quot;complete&quot; emulation mode (i.e., including unpublished details). All reported scores are for sentences of length ≤ 40 words. LR (labeled recall) and LP (labeled precision) are the primary scoring metrics. CBs is the number of crossing brackets. 0 CBs and ≤ 2 CBs are the percentages of sentences with 0 and ≤ 2 crossing brackets, respectively. F (the F-measure) is the evenly weighted harmonic mean of precision and recall, or 2 · LR · LP/(LR + LP).</Paragraph> <Paragraph position="4"> The final form of pruning employed is rather subtle: Within each cell of the chart that contains items covering some span of the sentence, Collins' parser uses buckets of items that share the same root nonterminal label for their respective derivations. Only the 100 top-scoring items covering the same span with the same nonterminal label are kept in a particular bucket, meaning that if a new item is proposed and there are already 100 items covering the same span with the same label in the chart, it is compared to the lowest-scoring item in the bucket. If it has a higher score, it is added to the bucket and the lowest-scoring item is removed; otherwise, it is not added. Apparently, this type of pruning has little effect, and so we have not duplicated it in our engine.</Paragraph> </Section> <Section position="3" start_page="503" end_page="503" type="sub_section"> <SectionTitle> 7.3 Unknown Words and Parts of Speech </SectionTitle> <Paragraph position="0"> When the parser encounters an unknown word, the first-best tag delivered by Ratnaparkhi's (1996) tagger is used. As it happens, the tag dictionary built up during training contains entries for every word observed, even low-frequency words.
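Concretely, the policy this tag dictionary implies (spelled out in the next paragraph) might be sketched as follows; tag_dictionary and first_best_tag are hypothetical names, not identifiers from either implementation.

def candidate_tags(word, tag_dictionary, first_best_tag):
    """Return the part-of-speech tags with which to seed the chart for a word.

    tag_dictionary maps every word observed in training (including
    low-frequency words) to the set of tags it was observed with;
    first_best_tag is the external tagger's single best tag for this word.
    """
    observed = tag_dictionary.get(word)
    if observed:
        # One chart item per tag observed with this word in training.
        return sorted(observed)
    # Truly unknown word: fall back to the tagger's first-best tag.
    return [first_best_tag]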
This means that during decoding, the output of the tagger is used only for those words that are truly unknown, that is, words that were never observed in training. For all other words, the chart is seeded with a separate item for each tag observed with that word in training.</Paragraph> </Section> </Section> </Paper>