<?xml version="1.0" standalone="yes"?>
<Paper uid="J03-4003">
  <Title>Head-Driven Statistical Models for Natural Language Parsing</Title>
  <Section position="4" start_page="591" end_page="592" type="metho">
    <SectionTitle>
2. Background
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="591" end_page="592" type="sub_section">
      <SectionTitle>
2.1 Probabilistic Context-Free Grammars
</SectionTitle>
      <Paragraph position="0"> Probabilistic context-free grammars are the starting point for the models in this article. For this reason we briefly recap the theory behind nonlexicalized PCFGs, before moving to the lexicalized case.</Paragraph>
      <Paragraph position="1"> Following Hopcroft and Ullman (1979), we define a context-free grammar G as a 4-tuple (N,S, A, R), where N is a set of nonterminal symbols, S is an alphabet, A is a distinguished start symbol in N, and R is a finite set of rules, in which each rule is of the form X - b for some X [?] N, b [?] (N [?]S) [?] . The grammar defines a set of possible strings in the language and also defines a set of possible leftmost derivations under the grammar. Each derivation corresponds to a tree-sentence pair that is well formed under the grammar.</Paragraph>
      <Paragraph position="2"> A probabilistic context-free grammar is a simple modification of a context-free grammar in which each rule in the grammar has an associated probability P(b  |X). This can be interpreted as the conditional probability of X's being expanded using the rule X - b, as opposed to one of the other possibilities for expanding X listed in the grammar. The probability of a derivation is then a product of terms, each term corresponding to a rule application in the derivation. The probability of a given tree-sentence pair (T, S) derived by n applications of context-free rules LHS</Paragraph>
      <Paragraph position="4"> Booth and Thompson (1973) specify the conditions under which the PCFG does in fact define a distribution over the possible derivations (trees) generated by the underlying grammar. The first condition is that the rule probabilities define conditional distributions over how each nonterminal in the grammar can expand. The second is a technical condition that guarantees that the stochastic process generating trees terminates in a finite number of steps with probability one.</Paragraph>
      <Paragraph position="5"> A central problem in PCFGs is to define the conditional probability P(b  |X) for each rule X - b in the grammar. A simple way to do this is to take counts from a treebank and then to use the maximum-likelihood estimates:</Paragraph>
      <Paragraph position="7"> Computational Linguistics Volume 29, Number 4 If the treebank has actually been generated from a probabilistic context-free grammar with the same rules and nonterminals as the model, then in the limit, as the training sample size approaches infinity, the probability distribution implied by these estimates will converge to the distribution of the underlying grammar.</Paragraph>
      <Paragraph position="8">  Once the model has been trained, we have a model that defines P(T, S) for any sentence-tree pair in the grammar. The output on a new test sentence S is the most likely tree under this model,</Paragraph>
      <Paragraph position="10"> The parser itself is an algorithm that searches for the tree, T best , that maximizes P(T, S).</Paragraph>
      <Paragraph position="11"> In the case of PCFGs, this can be accomplished using a variant of the CKY algorithm applied to weighted grammars (providing that the PCFG can be converted to an equivalent PCFG in Chomsky normal form); see, for example, Manning and Sch &amp;quot;utze (1999). If the model probabilities P(T, S) are the same as the true distribution generating training and test examples, returning the most likely tree under P(T, S) will be optimal in terms of minimizing the expected error rate (number of incorrect trees) on newly drawn test examples. Hence if the data are generated by a PCFG, and there are enough training examples for the maximum-likelihood estimates to converge to the true values, then this parsing method will be optimal. In practice, these assumptions cannot be verified and are arguably quite strong, but these limitations have not prevented generative models from being successfully applied to many NLP and speech tasks. (See Collins [2002] for a discussion of other ways of conceptualizing the parsing problem.) In the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993), which is the source of data for our experiments, the rules are either internal to the tree, where LHS is a nonterminal and RHS is a string of one or more nonterminals, or lexical, where LHS is a part-of-speech tag and RHS is a word. (See Figure 1 for an example.)</Paragraph>
    </Section>
    <Section position="2" start_page="592" end_page="592" type="sub_section">
      <SectionTitle>
2.2 Lexicalized PCFGs
</SectionTitle>
      <Paragraph position="0"> A PCFG can be lexicalized  by associating a word w and a part-of-speech (POS) tag t with each nonterminal X in the tree. (See Figure 2 for an example tree.) The PCFG model can be applied to these lexicalized rules and trees in exactly the same way as before. Whereas before the nonterminals were simple (for example, S or NP), they are now extended to include a word and part-of-speech tag (for example, S(bought,VBD) or NP(IBM,NNP)). Thus we write a nonterminal as X(x), where x = &lt;w, t&gt; and X is a constituent label. Formally, nothing has changed, we have just vastly increased the number of nonterminals in the grammar (by up to a factor of |V|x|T|, 2 This point is actually more subtle than it first appears (we thank one of the anonymous reviewers for pointing this out), and we were unable to find proofs of this property in the literature for PCFGs. The rule probabilities for any nonterminal that appears with probability greater than zero in parse derivations will converge to their underlying values, by the usual properties of maximum-likelihood estimation for multinomial distributions. Assuming that the underlying PCFG generating training examples meet both criteria in Booth and Thompson (1973), it can be shown that convergence of rule probabilities implies that the distribution over trees will converge to that of the underlying PCFG, at least when Kullback-Liebler divergence or the infinity norm is taken to be the measure of distance between the two distributions. Thanks to Tommi Jaakkola and Nathan Srebro for discussions on this topic.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="592" end_page="594" type="metho">
    <SectionTitle>
3 We find lexical heads in Penn Treebank data using the rules described in Appendix A of Collins (1999).
</SectionTitle>
    <Paragraph position="0"> The rules are a modified version of a head table provided by David Magerman and used in the parser described in Magerman (1995).</Paragraph>
    <Paragraph position="1">  A nonlexicalized parse tree and a list of the rules it contains. Internal Rules:</Paragraph>
    <Paragraph position="3"> A lexicalized parse tree and a list of the rules it contains.</Paragraph>
    <Paragraph position="4"> where |V |is the number of words in the vocabulary and |T |is the number of part-of-speech tags).</Paragraph>
    <Paragraph position="5"> Although nothing has changed from a formal point of view, the practical consequences of expanding the number of nonterminals quickly become apparent when one is attempting to define a method for parameter estimation. The simplest solution would be to use the maximum-likelihood estimate as in equation (1), for example,  Computational Linguistics Volume 29, Number 4 estimating the probability associated with S(bought,VBD) - NP(week,NN) NP(IBM,NNP)</Paragraph>
    <Paragraph position="7"> But the addition of lexical items makes the statistics for this estimate very sparse: The count for the denominator is likely to be relatively low, and the number of outcomes (possible lexicalized RHSs) is huge, meaning that the numerator is very likely to be zero. Predicting the whole lexicalized rule in one go is too big a step.</Paragraph>
    <Paragraph position="8"> One way to overcome these sparse-data problems is to break down the generation of the RHS of each rule into a sequence of smaller steps, and then to make independence assumptions to reduce the number of parameters in the model. The decomposition of rules should aim to meet two criteria. First, the steps should be small enough for the parameter estimation problem to be feasible (i.e., in terms of having sufficient training data to train the model, providing that smoothing techniques are used to mitigate remaining sparse-data problems). Second, the independence assumptions made should be linguistically plausible. In the next sections we describe three statistical parsing models that have an increasing degree of linguistic sophistication. Model 1 uses a decomposition of which parameters corresponding to lexical dependencies are a natural result. The model also incorporates a preference for right-branching structures through conditioning on &amp;quot;distance&amp;quot; features. Model 2 extends the decomposition to include a step in which subcategorization frames are chosen probabilistically. Model 3 handles wh-movement by adding parameters corresponding to slash categories being passed from the parent of the rule to one of its children or being discharged as a trace.</Paragraph>
  </Section>
  <Section position="6" start_page="594" end_page="606" type="metho">
    <SectionTitle>
3. Three Probabilistic Models for Parsing
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="594" end_page="597" type="sub_section">
      <SectionTitle>
3.1 Model 1
</SectionTitle>
      <Paragraph position="0"> This section describes how the generation of the RHS of a rule is broken down into a sequence of smaller steps in model 1. The first thing to note is that each internal rule in a lexicalized PCFG has the form  ) are left and right modifiers of H. Either n or m may be zero, and n = m = 0 for unary rules. Figure 2 shows a tree that will be used as an example throughout this article. We will extend the left and right sequences to include a terminating STOP symbol, allowing a Markov process to model the left and right sequences. Thus L</Paragraph>
      <Paragraph position="2"> Collins Head-Driven Statistical Models for NL Parsing Note that lexical rules, in contrast to the internal rules, are completely deterministic. They always take the form</Paragraph>
      <Paragraph position="4"> where P is a part-of-speech tag, h is a word-tag pair &lt;w, t&gt; , and the rule rewrites to just the word w. (See Figure 2 for examples of lexical rules.) Formally, we will always take a lexicalized nonterminal P(h) to expand deterministically (with probability one) in this way if P is a part-of-speech symbol. Thus for the parsing models we require the nonterminal labels to be partitioned into two sets: part-of-speech symbols and other nonterminals. Internal rules always have an LHS in which P is not a part-of-speech symbol. Because lexicalized rules are deterministic, they will not be discussed in the remainder of this article: All of the modeling choices concern internal rules.</Paragraph>
      <Paragraph position="5"> The probability of an internal rule can be rewritten (exactly) using the chain rule of probabilities:</Paragraph>
      <Paragraph position="7"> parameter types, respectively.) Next, we make the assumption that the modifiers are generated independently of each other:</Paragraph>
      <Paragraph position="9"> In summary, the generation of the RHS of a rule such as (2), given the LHS, has been decomposed into three steps:  1. Generate the head constituent label of the phrase, with probability</Paragraph>
      <Paragraph position="11"> 2. Generate modifiers to the left of the head with probability producttext</Paragraph>
      <Paragraph position="13"> )=STOP. The STOP symbol is added to the vocabulary of nonterminals, and the model stops  generating left modifiers when the STOP symbol is generated. 3. Generate modifiers to the right of the head with probability producttext</Paragraph>
      <Paragraph position="15"> ) as STOP.</Paragraph>
      <Paragraph position="16"> For example, the probability of the rule S(bought) - NP(week) NP(IBM) VP(bought) would be estimated as</Paragraph>
      <Paragraph position="18"> Computational Linguistics Volume 29, Number 4 In this example, and in the examples in the rest of the article, for brevity we omit the part-of-speech tags associated with words, writing, for example S(bought) rather than S(bought,VBD). We emphasize that throughout the models in this article, each word is always paired with its part of speech, either when the word is generated or when the word is being conditioned upon.</Paragraph>
      <Paragraph position="19"> 3.1.1 Adding Distance to the Model. In this section we first describe how the model can be extended to be &amp;quot;history-based.&amp;quot; We then show how this extension can be utilized in incorporating &amp;quot;distance&amp;quot; features into the model.</Paragraph>
      <Paragraph position="20"> Black et al. (1992) originally introduced history-based models for parsing. Equations (3) and (4) of the current article made the independence assumption that each modifier is generated independently of the others (i.e., that the modifiers are generated independently of everything except P, H, and h). In general, however, the probability of generating each modifier could depend on any function of the previous modifiers, head/parent category, and headword. Moreover, if the top-down derivation order is fully specified, then the probability of generating a modifier can be conditioned on any structure that has been previously generated. The remainder of this article assumes that the derivation order is depth-first: that is, each modifier recursively generates the subtree below it before the next modifier is generated. (Figure 3 gives an example that illustrates this.) The models in Collins (1996) showed that the distance between words standing in head-modifier relationships was important, in particular, that it is important to capture a preference for right-branching structures (which almost translates into a preference for dependencies between adjacent words) and a preference for dependencies not to cross a verb. In this section we describe how this information can be incorporated into model 1. In section 7.2, we describe experiments that evaluate the effect of these features on parsing accuracy.</Paragraph>
      <Paragraph position="21"> Figure 3 A partially completed tree derived depth-first. &amp;quot;????&amp;quot; marks the position of the next modifier to be generated--it could be a nonterminal/headword/head-tag triple, or the STOP symbol. The distribution over possible symbols in this position could be conditioned on any previously generated structure, that is, any structure appearing in the figure.</Paragraph>
      <Paragraph position="22">  (or, for that matter, on any structure previously generated elsewhere in the tree).</Paragraph>
      <Paragraph position="23"> Distance can be incorporated into the model by modifying the independence assumptions so that each modifier has a limited dependence on the previous modifiers:</Paragraph>
      <Paragraph position="25"> are functions of the surface string below the previous modifiers. (See Figure 4 for illustration.) The distance measure is similar to that in Collins (1996), a vector with the following two elements: (1) Is the string of zero length? (2) Does the string contain a verb? The first feature allows the model to learn a preference for right-branching structures. The second feature  allows the model to learn a preference for modification of the most recent verb.</Paragraph>
    </Section>
    <Section position="2" start_page="597" end_page="599" type="sub_section">
      <SectionTitle>
3.2 Model 2: The Complement/Adjunct Distinction and Subcategorization
</SectionTitle>
      <Paragraph position="0"> The tree depicted in Figure 2 illustrates the importance of the complement/adjunct distinction. It would be useful to identify IBM as a subject and Last week as an adjunct (temporal modifier), but this distinction is not made in the tree, as both NPsarein the same position  (sisters to a VP under an S node). From here on we will identify complements  by attaching a -C suffix to nonterminals. Figure 5 shows the tree in Figure 2 with added complement markings.</Paragraph>
      <Paragraph position="1"> A postprocessing stage could add this detail to the parser output, but there are a couple of reasons for making the distinction while parsing. First, identifying complements is complex enough to warrant a probabilistic treatment. Lexical information is needed (for example, knowledge that week is likely to be a temporal modifier). Knowledge about subcategorization preferences (for example, that a verb takes exactly one  subject) is also required. For example, week can sometimes be a subject, as in Last week was a good one, so the model must balance the preference for having a subject against 6 Note that this feature means that dynamic programming parsing algorithms for the model must keep track of whether each constituent does or does not have a verb in the string to the right or left of its head. See Collins (1999) for a full description of the parsing algorithms. 7 In the models described in Collins (1997), there was a third question concerning punctuation: (3) Does the string contain 0, 1, 2 or more than 2 commas? (where a comma is anything tagged as &amp;quot;,&amp;quot; or &amp;quot;:&amp;quot;).  The model described in this article has a cleaner incorporation of punctuation into the generative process, as described in section 4.3.</Paragraph>
      <Paragraph position="2">  Computational Linguistics Volume 29, Number 4 Figure 5 A tree with the -C suffix used to identify complements. IBM and Lotus are in subject and object position, respectively. Last week is an adjunct.</Paragraph>
      <Paragraph position="3"> Figure 6 Two examples in which the assumption that modifiers are generated independently of one another leads to errors. In (1) the probability of generating both Dreyfus and fund as subjects,</Paragraph>
      <Paragraph position="5"> P(VP-C(funding)|VP,VB,was) is a bad independence assumption.</Paragraph>
      <Paragraph position="6"> the relative improbability of week's being the headword of a subject. These problems are not restricted to NPs; compare The spokeswoman said (SBAR that the asbestos was dangerous) with Bonds beat short-term investments (SBAR because the market is down), in which an SBAR headed by that is a complement, but an SBAR headed by because is an adjunct.</Paragraph>
      <Paragraph position="7"> A second reason for incorporating the complement/adjunct distinction into the parsing model is that this may help parsing accuracy. The assumption that complements are generated independently of one another often leads to incorrect parses. (See Figure 6 for examples.)  3.2.1 Identifying Complements and Adjuncts in the Penn Treebank. We add the -C suffix to all nonterminals in training data that satisfy the following conditions: 1. The nonterminal must be (1) an NP, SBAR,orS whose parent is an S; (2) an NP, SBAR, S,orVP whose parent is a VP; or (3) an S whose parent is an SBAR.</Paragraph>
      <Paragraph position="8">  Collins Head-Driven Statistical Models for NL Parsing 2. The nonterminal must not have one of the following semantic tags: ADV, VOC, BNF, DIR, EXT, LOC, MNR, TMP, CLR or PRP. See Marcus et al.</Paragraph>
      <Paragraph position="9"> (1994) for an explanation of what these tags signify. For example, the NP Last week in figure 2 would have the TMP (temporal) tag, and the SBAR in (SBAR because the market is down) would have the ADV (adverbial) tag.</Paragraph>
      <Paragraph position="10"> 3. The nonterminal must not be on the RHS of a coordinated phrase. For  example, in the rule S - SCCS, the two child Ss would not be marked as complements.</Paragraph>
      <Paragraph position="11"> In addition, the first child following the head of a prepositional phrase is marked as a complement.</Paragraph>
      <Paragraph position="12">  training data with the enhanced set of nonterminals, and it might learn the lexical properties that distinguish complements and adjuncts (IBM vs. week,orthat vs. because). It would still suffer, however, from the bad independence assumptions illustrated in Figure 6. To solve these kinds of problems, the generative process is extended to include a probabilistic choice of left and right subcategorization frames:  1. Choose a head H with probability P h (H  |P, h).</Paragraph>
      <Paragraph position="13"> 2. Choose left and right subcategorization frames, LC and RC, with probabilities P</Paragraph>
      <Paragraph position="15"> frame is a multiset  specifying the complements that the head requires in its left or right modifiers.  3. Generate the left and right modifiers with probabilities P</Paragraph>
      <Paragraph position="17"> respectively.</Paragraph>
      <Paragraph position="18"> Thus the subcategorization requirements are added to the conditioning context. As complements are generated they are removed from the appropriate subcategorization multiset. Most importantly, the probability of generating the STOP symbol will be zero when the subcategorization frame is non-empty, and the probability of generating a particular complement will be zero when that complement is not in the subcategorization frame; thus all and only the required complements will be generated. The probability of the phrase S(bought) - NP(week) NP-C(IBM) VP(bought) is</Paragraph>
      <Paragraph position="20"> Here the head initially decides to take a single NP-C (subject) to its left and no complements to its right. NP-C(IBM) is immediately generated as the required subject, and NP-C is removed from LC, leaving it empty when the next modifier, NP(week), is generated. The incorrect structures in Figure 6 should now have low probability, because</Paragraph>
    </Section>
    <Section position="3" start_page="599" end_page="601" type="sub_section">
      <SectionTitle>
3.3 Model 3: Traces and Wh-Movement
</SectionTitle>
      <Paragraph position="0"> Another obstacle to extracting predicate-argument structure from parse trees is whmovement. This section describes a probabilistic treatment of extraction from relative clauses. Noun phrases are most often extracted from subject position, object position, or from within PPs:  (1) The store (SBAR that TRACE bought Lotus) (2) The store (SBAR that IBM bought TRACE) (3) The store (SBAR that IBM bought Lotus from TRACE)  It might be possible to write rule-based patterns that identify traces in a parse tree. We argue again, however, that this task is best integrated into the parser: The task is complex enough to warrant a probabilistic treatment, and integration may help parsing accuracy. A couple of complexities are that modification by an SBAR does not always involve extraction (e.g., the fact (SBAR that besoboru is played with a ball and a bat)), and it is not uncommon for extraction to occur through several constituents (e.g., The changes (SBAR that he said the government was prepared to make TRACE)). One hope is that an integrated treatment of traces will improve the parameterization of the model. In particular, the subcategorization probabilities are smeared by extraction. In examples (1), (2), and (3), bought is a transitive verb; but without knowledge of traces, example (2) in training data will contribute to the probability of bought's being an intransitive verb.</Paragraph>
      <Paragraph position="1"> Formalisms similar to GPSG (Gazdar et al. 1985) handle wh-movement by adding a gap feature to each nonterminal in the tree and propagating gaps through the tree until they are finally discharged as a trace complement (see Figure 7). In extraction cases the Penn Treebank annotation coindexes a TRACE with the WHNP head of the SBAR, so it is straightforward to add this information to trees in training data.</Paragraph>
      <Paragraph position="2">  (1) NP - NP SBAR(+gap) (2) SBAR(+gap) - WHNP S-C(+gap) (3) S(+gap) - NP-C VP(+gap) (4) VP(+gap) - VB TRACE NP  A +gap feature can be added to nonterminals to describe wh-movement. The top-level NP initially generates an SBAR modifier but specifies that it must contain an NP trace by adding the +gap feature. The gap is then passed down through the tree, until it is discharged as a TRACE complement to the right of bought.</Paragraph>
      <Paragraph position="3">  Collins Head-Driven Statistical Models for NL Parsing Given that the LHS of the rule has a gap, there are three ways that the gap can be passed down to the RHS: Head: The gap is passed to the head of the phrase, as in rule (3) in Figure 7. Left, Right: The gap is passed on recursively to one of the left or right modifiers of the head or is discharged as a TRACE argument to the left or right of the head. In rule (2) in Figure 7, it is passed on to a right modifier, the S complement. In rule (4), a TRACE is generated to the right of the head VB. We specify a parameter type P g (G|P, h, H) where G is either Head, Left, or Right. The generative process is extended to choose among these cases after generating the head of the phrase. The rest of the phrase is then generated in different ways depending on how the gap is propagated. In the Head case the left and right modifiers are generated as normal. In the Left and Right cases a +gap requirement is added to either the left or right SUBCAT variable. This requirement is fulfilled (and removed from the subcategorization list) when either a trace or a modifier nonterminal that has the +gap feature, is generated. For example, rule (2) in Figure 7, SBAR(that)(+gap) -</Paragraph>
      <Paragraph position="5"> In rule (2), Right is chosen, so the +gap requirement is added to RC. Generation of S-C(bought)(+gap) fulfills both the S-C and +gap requirements in RC. In rule (4), Right is chosen again. Note that generation of TRACE satisfies both the NP-C and +gap subcategorization requirements.</Paragraph>
      <Paragraph position="6"> 4. Special Cases: Linguistically Motivated Refinements to the Models Sections 3.1 to 3.3 described the basic framework for the parsing models in this article. In this section we describe how some linguistic phenomena (nonrecursive NPs and coordination, for example) clearly violate the independence assumptions of the general models. We describe a number of these special cases, in each instance arguing that the phenomenon violates the independence assumptions, then describing how the model can be refined to deal with the problem.</Paragraph>
    </Section>
    <Section position="4" start_page="601" end_page="603" type="sub_section">
      <SectionTitle>
4.1 Nonrecursive NPs
</SectionTitle>
      <Paragraph position="0"> We define nonrecursive NPs (from here on referred to as base-NPs and labeled NPB rather than NP)asNPs that do not directly dominate an NP themselves, unless the dominated NP is a possessive NP (i.e., it directly dominates a POS-tag POS). Figure 8 gives some examples. Base-NPs deserve special treatment for three reasons: * The boundaries of base-NPs are often strongly marked. In particular, the start points of base-NPs are often marked with a determiner or another  Computational Linguistics Volume 29, Number 4 Figure 8 Three examples of structures with base-NPs.</Paragraph>
      <Paragraph position="1"> distinctive item, such as an adjective. Because of this, the probability of generating the STOP symbol should be greatly increased when the previous modifier is, for example, a determiner. As they stand, the independence assumptions in the three models lose this information. The probability of NPB(dog) - DT(the) NN(dog) would be estimated as</Paragraph>
      <Paragraph position="3"> In making the independence assumption</Paragraph>
      <Paragraph position="5"> the model will fail to learn that the STOP symbol is very likely to follow a determiner. As a result, the model will assign unreasonably high probabilities to NPs such as [NP yesterday the dog] in sentences such as Yesterday the dog barked.</Paragraph>
      <Paragraph position="6"> * The annotation standard in the treebank leaves the internal structure of base-NPs underspecified. For example, both pet food volume (where pet modifies food and food modifies volume) and vanilla ice cream (where both vanilla and ice modify cream) would have the structure NPB - NN NN NN.</Paragraph>
      <Paragraph position="7"> Because of this, there is no reason to believe that modifiers within NPBs are dependent on the head rather than the previous modifier. In fact, if it so happened that a majority of phrases were like pet food volume, then conditioning on the previous modifier rather than the head would be preferable.</Paragraph>
      <Paragraph position="8"> * In general it is important (in particular for the distance measure to be effective) to have different nonterminal labels for what are effectively different X-bar levels. (See section 7.3.2 for further discussion.) For these reasons the following modifications are made to the models: * The nonterminal label for base-NPs is changed from NP to NPB. For consistency, whenever an NP is seen with no pre- or postmodifiers, an NPB level is added. For example, [S [NP the dog] [VP barks] ] would be transformed into [S [NP [NPB the dog] ] [VP barks ] ]. These &amp;quot;extra&amp;quot; NPBs are removed before scoring the output of the parser against the treebank.</Paragraph>
      <Paragraph position="9"> 11 For simplicity, we give probability terms under model 1 with no distance variables; the probability terms with distance variables, or for models 2 and 3, will be similar, but with the addition of various pieces of conditioning information.</Paragraph>
      <Paragraph position="10">  Collins Head-Driven Statistical Models for NL Parsing * The independence assumptions are different when the parent nonterminal is an NPB. Specifically, equations (5) and (6) are modified to</Paragraph>
      <Paragraph position="12"> The modifier and previous-modifier nonterminals are always adjacent, so the distance variable is constant and is omitted. For the purposes of this model, L  ) are defined to be H(h). The probability of the previous example is now</Paragraph>
      <Paragraph position="14"> (STOP|NPB,DT,the) will be very close to one.</Paragraph>
    </Section>
    <Section position="5" start_page="603" end_page="604" type="sub_section">
      <SectionTitle>
4.2 Coordination
</SectionTitle>
      <Paragraph position="0"> Coordination constructions are another example in which the independence assumptions in the basic models fail badly (at least given the current annotation method in the treebank). Figure 9 shows how coordination is annotated in the treebank.</Paragraph>
      <Paragraph position="1">  To use an example to illustrate the problems, take the rule NP(man) - NP(man) CC(and) NP(dog), which has probability</Paragraph>
      <Paragraph position="3"> The independence assumptions mean that the model fails to learn that there is always exactly one phrase following the coordinator (CC). The basic probability models will give much too high probabilities to unlikely phrases such as NP - NP CC or NP NP CC NP NP. For this reason we alter the generative process to allow generation of both the coordinator and the following phrase in one step; instead of just generating a nonterminal at each step, a nonterminal and a binary-valued coord flag are generated.</Paragraph>
      <Paragraph position="4"> coord = 1 if there is a coordination relationship. In the generative process, generation of a coord = 1 flag along with a modifier triggers an additional step in the generative  Computational Linguistics Volume 29, Number 4 process, namely, the generation of the coordinator tag/word pair, parameterized by the P cc parameter. For the preceding example this would give probability</Paragraph>
      <Paragraph position="6"> Note the new type of parameter, P cc , for the generation of the coordinator word and POS tag. The generation of coord=1 along with NP(dog) in the example implicitly requires generation of a coordinator tag/word pair through the P cc parameter. The generation of this tag/word pair is conditioned on the two words in the coordination dependency (man and dog in the example) and the label on their relationship (NP,NP,NP in the example, representing NP coordination).</Paragraph>
      <Paragraph position="7"> The coord flag is implicitly zero when normal nonterminals are generated; for example, the phrase S(bought) - NP(week) NP(IBM) VP(bought) now has probability</Paragraph>
      <Paragraph position="9"/>
    </Section>
    <Section position="6" start_page="604" end_page="605" type="sub_section">
      <SectionTitle>
4.3 Punctuation
</SectionTitle>
      <Paragraph position="0"> This section describes our treatment of &amp;quot;punctuation&amp;quot; in the model, where &amp;quot;punctuation&amp;quot; is used to refer to words tagged as a comma or colon. Previous work--the generative models described in Collins (1996) and the earlier version of these models described in Collins (1997)--conditioned on punctuation as surface features of the string, treating it quite differently from lexical items. In particular, the model in Collins (1997) failed to generate punctuation, a deficiency of the model. This section describes how punctuation is integrated into the generative models.</Paragraph>
      <Paragraph position="1"> Our first step is to raise punctuation as high in the parse trees as possible. Punctuation at the beginning or end of sentences is removed from the training/test data altogether.</Paragraph>
      <Paragraph position="2">  All punctuation items apart from those tagged as comma or colon (items such as quotation marks and periods, tagged &amp;quot;&amp;quot;or . ) are removed altogether. These transformations mean that punctuation always appears between two nonterminals, as opposed to appearing at the end of a phrase. (See Figure 10 for an example.) Figure 10 A parse tree before and after punctuation transformations.</Paragraph>
      <Paragraph position="3"> 13 As one of the anonymous reviewers of this article pointed out, this choice of discarding the sentence-final punctuation may not be optimal, as the final punctuation mark may well carry useful information about the sentence structure.</Paragraph>
      <Paragraph position="4">  Collins Head-Driven Statistical Models for NL Parsing Punctuation is then treated in a very similar way to coordination: Our intuition is that there is a strong dependency between the punctuation mark and the modifier generated after it. Punctuation is therefore generated with the following phrase through a punc flag that is similar to the coord flag (a binary-valued feature equal to one if a punctuation mark is generated with the following phrase).</Paragraph>
      <Paragraph position="5"> Under this model, NP(Vinken) - NPB(Vinken) ,(,) ADJP(old) would have</Paragraph>
      <Paragraph position="7"> is a new parameter type for generation of punctuation tag/word pairs. The generation of punc=1 along with ADJP(old) in the example implicitly requires generation of a punctuation tag/word pair through the P p parameter. The generation of this tag/word pair is conditioned on the two words in the punctuation dependency (Vinken and old in the example) and the label on their relationship (NP,NPB,ADJP in the example.)</Paragraph>
    </Section>
    <Section position="7" start_page="605" end_page="605" type="sub_section">
      <SectionTitle>
4.4 Sentences with Empty (PRO) Subjects
</SectionTitle>
      <Paragraph position="0"> Sentences in the treebank occur frequently with PRO subjects that may or may not be controlled: As the treebank annotation currently stands, the nonterminal is S whether or not a sentence has an overt subject. This is a problem for the subcategorization probabilities in models 2 and 3: The probability of having zero subjects, P</Paragraph>
      <Paragraph position="2"> verb), will be fairly high because of this. In addition, sentences with and without subjects appear in quite different syntactic environments. For these reasons we modify the nonterminal for sentences without subjects to be SG (see figure 11). The resulting model has a cleaner division of subcategorization: P</Paragraph>
      <Paragraph position="4"> ({NP-C}|SG, VP, verb)=0. The model will learn probabilistically the environments in which S and SG are likely to appear.</Paragraph>
    </Section>
    <Section position="8" start_page="605" end_page="606" type="sub_section">
      <SectionTitle>
4.5 A Punctuation Constraint
</SectionTitle>
      <Paragraph position="0"> As a final step, we use the rule concerning punctuation introduced in Collins (1996) to impose a constraint as follows. If for any constituent Z in the chart Z - &lt;..X Y..&gt; two of its children X and Y are separated by a comma, then the last word in Y must be directly followed by a comma, or must be the last word in the sentence. In training data 96% of commas follow this rule. The rule has the benefit of improving efficiency by reducing the number of constituents in the chart. It would be preferable to develop a probabilistic analog of this rule, but we leave this to future research.</Paragraph>
      <Paragraph position="1"> Figure 11 (a) The treebank annotates sentences with empty subjects with an empty -NONE- element under subject position; (b) in training (and for evaluation), this null element is removed; (c) in models 2 and 3, sentences without subjects are changed to have a nonterminal SG.</Paragraph>
      <Paragraph position="2">  Computational Linguistics Volume 29, Number 4 Table 1 The conditioning variables for each level of back-off. For example, P</Paragraph>
      <Paragraph position="4"/>
    </Section>
  </Section>
  <Section position="7" start_page="606" end_page="607" type="metho">
    <SectionTitle>
5. Practical Issues
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="606" end_page="606" type="sub_section">
      <SectionTitle>
5.1 Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> Table 1 shows the various levels of back-off for each type of parameter in the model.</Paragraph>
      <Paragraph position="1"> Note that we decompose P</Paragraph>
      <Paragraph position="3"> , c and p are the coord and punc flags associated with the nonterminal, and [?] is the distance measure) into the product</Paragraph>
      <Paragraph position="5"> These two probabilities are then smoothed separately. Eisner (1996b) originally used POS tags to smooth a generative model in this way. In each case the final estimate is  are maximum-likelihood estimates with the context at levels 1, 2, and 3 in the table, and l  . The coefficient five was chosen to maximize accuracy on the development set, section 0 of the treebank (in practice it was found that any value in the range 2-5 gave a very similar level of performance).</Paragraph>
    </Section>
    <Section position="2" start_page="606" end_page="607" type="sub_section">
      <SectionTitle>
5.2 Unknown Words and Part-of-Speech Tagging
</SectionTitle>
      <Paragraph position="0"> All words occurring less than six times  in training data, and words in test data that have never been seen in training, are replaced with the UNKNOWN token. This allows the model to handle robustly the statistics for rare or new words. Words in test data that have not been seen in training are deterministically assigned the POS tag that is assigned by the tagger described in Ratnaparkhi (1996). As a preprocessing step, the 14 In Collins (1999) we erroneously stated that all words occuring less than five times in training data were classified as &amp;quot;unknown.&amp;quot; Thanks to Dan Bikel for pointing out this error.  Collins Head-Driven Statistical Models for NL Parsing tagger is used to decode each test data sentence. All other words are tagged during parsing, the output from Ratnaparkhi's tagger being ignored. The POS tags allowed for each word are limited to those that have been seen in training data for that word (any tag/word pairs not seen in training would give an estimate of zero in the P</Paragraph>
      <Paragraph position="2"> distributions). The model is fully integrated, in that part-of-speech tags are statistically generated along with words in the models, so that the parser will make a statistical decision as to the most likely tag for each known word in the sentence.</Paragraph>
    </Section>
    <Section position="3" start_page="607" end_page="607" type="sub_section">
      <SectionTitle>
5.3 The Parsing Algorithm
</SectionTitle>
      <Paragraph position="0"> The parsing algorithm for the models is a dynamic programming algorithm, which is very similar to standard chart parsing algorithms for probabilistic or weighted grammars. The algorithm has complexity O(n  ), where n is the number of words in the string. In practice, pruning strategies (methods that discard lower-probability constituents in the chart) can improve efficiency a great deal. The appendices of Collins (1999) give a precise description of the parsing algorithms, an analysis of their computational complexity, and also a description of the pruning methods that are employed. See Eisner and Satta (1999) for an O(n  ) algorithm for lexicalized grammars that could be applied to the models in this paper. Eisner and Satta (1999) also describe an O(n  ) algorithm for a restricted class of lexicalized grammars; it is an open question whether this restricted class includes the models in this article.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>