<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1003">
  <Title>Three Generative, Lexicalised Models for Statistical Parsing</Title>
  <Section position="4" start_page="0" end_page="20" type="metho">
    <SectionTitle>
2 The Three Parsing Models
2.1 Model 1
</SectionTitle>
    <Paragraph position="0"> In general, a statistical parsing model defines the conditional probability, P(T | S), for each candidate parse tree T for a sentence S. The parser itself is an algorithm which searches for the tree, T_best, that maximises P(T | S). A generative model uses the observation that maximising P(T, S) is equivalent to maximising P(T | S): since P(T | S) = P(T, S) / P(S) and P(S) is constant, maximising P(T, S) / P(S) is equivalent to maximising P(T, S). P(T, S) is then estimated by attaching probabilities to a top-down derivation of the tree. In a PCFG, for a tree derived by n applications of context-free re-write rules LHS_i -&gt; RHS_i, 1 &lt;= i &lt;= n,

        P(T, S) = prod_{i=1..n} P(RHS_i | LHS_i)

    The re-write rules are either internal to the tree, where LHS is a non-terminal and RHS is a string of one or more non-terminals; or lexical, where LHS is a part-of-speech tag and RHS is a word.</Paragraph>
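    <Paragraph> As an illustrative sketch (not from the paper or its implementation), the PCFG factorisation above can be computed by walking a tree and multiplying rule probabilities. The rule_probs table, the tuple-based tree encoding and the probability values are hypothetical.

        import math

        # Hypothetical rule probabilities P(RHS | LHS): internal rules map a label to a
        # tuple of child labels, lexical rules map a POS tag to a word.
        rule_probs = {
            ("S", ("NP", "VP")): 0.9,
            ("NP", ("NNP",)): 0.3,
            ("VP", ("VBD", "NP")): 0.4,
            ("NNP", "Marks"): 0.01,
            ("NNP", "Brooks"): 0.01,
            ("VBD", "bought"): 0.02,
        }

        def log_prob(tree):
            """Return log P(T, S) as the sum of log rule probabilities over the derivation."""
            label, children = tree
            if isinstance(children, str):                 # lexical rule: tag to word
                return math.log(rule_probs[(label, children)])
            rhs = tuple(child[0] for child in children)   # internal rule: label to child labels
            return math.log(rule_probs[(label, rhs)]) + sum(log_prob(c) for c in children)

        # A small tree for "Marks bought Brooks".
        tree = ("S", [("NP", [("NNP", "Marks")]),
                      ("VP", [("VBD", "bought"), ("NP", [("NNP", "Brooks")])])])
        print(math.exp(log_prob(tree)))</Paragraph>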
    <Paragraph position="7"> A PCFG can be lexicalised [2] by associating a word w and a part-of-speech (POS) tag t with each non-terminal X in the tree. Thus we write a non-terminal as X(x), where x = (w, t), and X is a constituent label. Each rule now has the form [3]:

        P(h) -&gt; L_n(l_n) ... L_1(l_1) H(h) R_1(r_1) ... R_m(r_m)    (3)

    H is the head-child of the phrase, which inherits the head-word h from its parent P. L_1...L_n and R_1...R_m are left and right modifiers of H. Either n or m may be zero, and n = m = 0 for unary rules. Figure 1 shows a tree which will be used as an example throughout this paper.</Paragraph>
    <Paragraph position="10"> The addition of lexical heads leads to an enormous number of potential rules, making direct estimation of P(RHS | LHS) infeasible because of sparse data problems. We decompose the generation of the RHS of a rule such as (3), given the LHS, into three steps -- first generating the head, then making the independence assumptions that the left and right modifiers are generated by separate 0th-order Markov processes [4]:  1. Generate the head constituent label of the phrase, with probability P_H(H | P, h).</Paragraph>
    <Paragraph position="11"> 2. Generate modifiers to the right of the head with probability prod_{i=1..m+1} P_R(R_i(r_i) | P, h, H). R_{m+1}(r_{m+1}) is defined as STOP -- the STOP symbol is added to the vocabulary of non-terminals, and the model stops generating right modifiers when it is generated.</Paragraph>
    <Paragraph position="12"> [2] We find lexical heads in Penn treebank data using rules which are similar to those used by (Magerman 95; Jelinek et al. 94).</Paragraph>
    <Paragraph position="13"> [3] With the exception of the top rule in the tree, which has the form TOP -&gt; H(h).</Paragraph>
    <Paragraph position="14"> [4] An exception is the first rule in the tree, TOP -&gt; H(h), which has probability P_TOP(H, h | TOP).  3. Generate modifiers to the left of the head with probability prod_{i=1..n+1} P_L(L_i(l_i) | P, h, H), where L_{n+1}(l_{n+1}) is defined as STOP.</Paragraph>
    <Paragraph position="16"> For example, the probability of the rule S(bought) -&gt; NP(week) NP(Marks) VP(bought) is estimated as

        P_H(VP | S, bought) x P_L(NP(Marks) | S, VP, bought) x P_L(NP(week) | S, VP, bought) x P_L(STOP | S, VP, bought) x P_R(STOP | S, VP, bought)</Paragraph>
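    <Paragraph> The Model 1 decomposition can be sketched as a single scoring function; this is a minimal illustration which assumes that P_H, P_L and P_R are smoothed probability lookups estimated elsewhere, and the function and argument names are ours, not the paper's.

        import math

        STOP = ("STOP", None)   # STOP is added to the vocabulary of non-terminals

        def model1_rule_log_prob(parent, head_word, head_label, left_mods, right_mods,
                                 P_H, P_L, P_R):
            """Score P(RHS | LHS) as P_H times the two 0th-order Markov modifier processes."""
            total = math.log(P_H(head_label, parent, head_word))
            # Right modifiers, generated outward from the head and terminated by STOP.
            for mod in list(right_mods) + [STOP]:
                total += math.log(P_R(mod, parent, head_word, head_label))
            # Left modifiers, likewise terminated by STOP.
            for mod in list(left_mods) + [STOP]:
                total += math.log(P_L(mod, parent, head_word, head_label))
            return total

    For the example rule above, left_mods would be [("NP", ("Marks", "NNP")), ("NP", ("week", "NN"))] (ordered outward from the head) and right_mods would be empty.</Paragraph>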
    <Paragraph position="18"> We have made the 0th-order Markov assumptions

        P_L(L_i(l_i) | H, P, h, L_1(l_1)...L_{i-1}(l_{i-1})) = P_L(L_i(l_i) | H, P, h)
        P_R(R_i(r_i) | H, P, h, R_1(r_1)...R_{i-1}(r_{i-1})) = P_R(R_i(r_i) | H, P, h)

    but in general the probabilities could be conditioned on any of the preceding modifiers. In fact, if the derivation order is fixed to be depth-first -- that is, each modifier recursively generates the sub-tree below it before the next modifier is generated -- then the model can also condition on any structure below the preceding modifiers. For the moment we exploit this by making the approximations

        P_L(L_i(l_i) | H, P, h, L_1(l_1)...L_{i-1}(l_{i-1})) = P_L(L_i(l_i) | H, P, h, distance_l(i-1))
        P_R(R_i(r_i) | H, P, h, R_1(r_1)...R_{i-1}(r_{i-1})) = P_R(R_i(r_i) | H, P, h, distance_r(i-1))</Paragraph>
    <Paragraph position="22"> where distance_l and distance_r are functions of the surface string from the head word to the edge of the constituent (see figure 2). The distance measure is the same as in (Collins 96), a vector with the following 3 elements: (1) is the string of zero length? (allowing the model to learn a preference for right-branching structures); (2) does the string contain a verb? (allowing the model to learn a preference for modification of the most recent verb); (3) does the string contain 0, 1, 2 or more than 2 commas? (where a comma is anything tagged as &amp;quot;,&amp;quot; or &amp;quot;:&amp;quot;). [Figure 2 caption: distance is a function of the surface string from the word after h to the last word of R2, inclusive. In principle the model could condition on any structure dominated by H, R1 or R2.]</Paragraph>
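    <Paragraph> A minimal sketch of the 3-element distance vector, assuming Penn Treebank tags and treating any VB* tag as a verb; the function name and interface are illustrative assumptions, not the paper's code.

        def distance_features(surface_tags):
            """Distance measure over the surface string between the head and the
            edge of the constituent built so far (only the tags are needed here)."""
            is_zero_length = (len(surface_tags) == 0)
            contains_verb = any(tag.startswith("VB") for tag in surface_tags)
            commas = sum(1 for tag in surface_tags if tag in (",", ":"))
            comma_bucket = min(commas, 3)      # 0, 1, 2, or "more than 2"
            return (is_zero_length, contains_verb, comma_bucket)

        # Example: a string tagged JJ NN , between the head and the next modifier.
        print(distance_features(["JJ", "NN", ","]))   # (False, False, 1)</Paragraph>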
    <Section position="1" start_page="17" end_page="18" type="sub_section">
      <SectionTitle>
2.2 Model 2: The complement/adjunct distinction and subcategorisation
</SectionTitle>
      <Paragraph position="0"> The tree in figure 1 is an example of the importance of the complement/adjunct distinction. It would be useful to identify &amp;quot;Marks&amp;quot; as a subject, and &amp;quot;Last week&amp;quot; as an adjunct (temporal modifier), but this distinction is not made in the tree, as both NPs are in the same position (sisters to a VP under an S node). From here on we will identify complements by attaching a &amp;quot;-C&amp;quot; suffix to non-terminals -- figure 3 gives an example tree.</Paragraph>
      <Paragraph position="1"> [Figure 3 caption (fragment): ... &amp;quot;Marks&amp;quot; and &amp;quot;Brooks&amp;quot; are in subject and object position respectively. &amp;quot;Last week&amp;quot; is an adjunct.]</Paragraph>
      <Paragraph position="2"> A post-processing stage could add this detail to the parser output, but we give two reasons for making the distinction while parsing: First, identifying complements is complex enough to warrant a probabilistic treatment. Lexical information is needed</Paragraph>
      <Paragraph position="3"> -- for example, knowledge that &amp;quot;week&amp;quot; is likely to be a temporal modifier. Knowledge about subcategorisation preferences -- for example that a verb takes exactly one subject -- is also required. These problems are not restricted to NPs, compare &amp;quot;The spokeswoman said (SBAR that the asbestos was dangerous)&amp;quot; vs. &amp;quot;Bonds beat short-term investments (SBAR because the market is down)&amp;quot;, where an SBAR headed by &amp;quot;that&amp;quot; is a complement, but an SBAR headed by &amp;quot;because&amp;quot; is an adjunct. The second reason for making the complement/adjunct distinction while parsing is that it may help parsing accuracy. The assumption that complements are generated independently of each other often leads to incorrect parses -- see figure 4 for further explanation.</Paragraph>
      <Paragraph position="4"> Identification of Complements and Adjuncts in the Penn Treebank  We add the &amp;quot;-C&amp;quot; suffix to all non-terminals in training data which satisfy the following conditions (a code sketch of this marking step follows the list):  1. The non-terminal must be: (1) an NP, SBAR, or S whose parent is an S; (2) an NP, SBAR, S, or VP whose parent is a VP; or (3) an S whose parent is an SBAR.</Paragraph>
      <Paragraph position="5"> 2. The non-terminal must not have one of the following semantic tags: ADV, VOC, BNF, DIR, EXT, LOC, MNR, TMP, CLR or PRP. See (Marcus et al. 94) for an explanation of what these tags signify. For example, the NP &amp;quot;Last week&amp;quot; in figure 1 would have the TMP (temporal) tag; and the SBAR in &amp;quot;(SBAR because the market is down)&amp;quot; would have the ADV (adverbial) tag. In addition, the first child following the head of a prepositional phrase is marked as a complement.</Paragraph>
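      <Paragraph> The marking conditions above can be sketched as follows; the node interface (a label, its parent's label, and the set of Treebank function tags on the node) is an assumption for illustration, and the separate rule for the first child after the head of a prepositional phrase is not shown.

          COMPLEMENT_PARENTS = {
              "S":    {"NP", "SBAR", "S"},        # NP, SBAR or S under an S
              "VP":   {"NP", "SBAR", "S", "VP"},  # NP, SBAR, S or VP under a VP
              "SBAR": {"S"},                      # S under an SBAR
          }
          EXCLUDED_TAGS = {"ADV", "VOC", "BNF", "DIR", "EXT",
                           "LOC", "MNR", "TMP", "CLR", "PRP"}

          def mark_complement(label, parent_label, semantic_tags):
              """Append "-C" to `label` when both conditions above are satisfied."""
              eligible = label in COMPLEMENT_PARENTS.get(parent_label, set())
              if eligible and semantic_tags.isdisjoint(EXCLUDED_TAGS):
                  return label + "-C"
              return label

          # "Marks" (an NP under S, no excluded tag) becomes NP-C;
          # "Last week" (an NP-TMP under S) stays NP.
          assert mark_complement("NP", "S", set()) == "NP-C"
          assert mark_complement("NP", "S", {"TMP"}) == "NP"</Paragraph>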
      <Paragraph position="6"> Subcategorisation Frames  The model could be retrained on training data with the enhanced set of non-terminals, and it might learn the lexical properties which distinguish complements and adjuncts (&amp;quot;Marks&amp;quot; vs &amp;quot;week&amp;quot;, or &amp;quot;that&amp;quot; vs. &amp;quot;because&amp;quot;). However, it would still suffer from the bad independence assumptions illustrated in figure 4. To solve these kinds of problems, the generative process is extended to include a probabilistic choice of left and right subcategorisation frames:  1. Choose a head H with probability P_H(H | P, h).  2. Choose left and right subcat frames, LC and RC, with probabilities P_LC(LC | P, H, h) and P_RC(RC | P, H, h). Each subcat frame is a multiset [6] specifying the complements which the head requires in its left or right modifiers.</Paragraph>
      <Paragraph position="7"> 3. Generate the left and right modifiers with probabilities P_L(L_i, l_i | H, P, h, distance_l(i-1), LC) and P_R(R_i, r_i | H, P, h, distance_r(i-1), RC) respectively. Thus the subcat requirements are added to the conditioning context. As complements are generated they are removed from the appropriate subcat multiset. Most importantly, the probability of generating the STOP symbol will be 0 when the subcat frame is non-empty, and the probability of generating a complement will be 0 when it is not in the subcat frame; thus all and only the required complements will be generated.</Paragraph>
      <Paragraph position="8"> [Figure 4 caption: Two examples where the assumption that modifiers are generated independently of each other leads to errors. In (1) the probability of generating both &amp;quot;Dreyfus&amp;quot; and &amp;quot;fund&amp;quot; as subjects, P(NP-C(Dreyfus) | S,VP,was) * P(NP-C(fund) | S,VP,was), is unreasonably high. (2) is similar: P(NP-C(bill), VP-C(funding) | VP,VB,was) = P(NP-C(bill) | VP,VB,was) * P(VP-C(funding) | VP,VB,was) is a bad independence assumption.]</Paragraph>
      <Paragraph position="9"> The probability of the phrase S(bought) -&gt; NP(week) NP-C(Marks) VP(bought) is now:

        P_H(VP | S, bought) x P_LC({NP-C} | S, VP, bought) x P_RC({} | S, VP, bought) x P_L(NP-C(Marks) | S, VP, bought, {NP-C}) x P_L(NP(week) | S, VP, bought, {}) x P_L(STOP | S, VP, bought, {}) x P_R(STOP | S, VP, bought, {})</Paragraph>
      <Paragraph position="11"> Here the head initially decides to take a single NP-C (subject) to its left, and no complements to its right. NP-C(Marks) is immediately generated as the required subject, and NP-C is removed from LC, leaving it empty when the next modifier, NP(week), is generated. The incorrect structures in figure 4 should now have low probability because P_LC({NP-C,NP-C} | S,VP,bought) and P_RC({NP-C,VP-C} | VP,VB,was) are small.</Paragraph>
      <Paragraph position="12"> [6] A multiset, or bag, is a set which may contain duplicate non-terminal labels.</Paragraph>
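      <Paragraph> A sketch of a single modifier-generation step under Model 2, showing how the subcat multiset makes STOP impossible while requirements remain and blocks complements that are not in the frame; P_mod and the context argument are assumed stand-ins for the smoothed parameters of section 3.

          import math
          from collections import Counter

          def modifier_log_prob(mod_label, subcat, P_mod, context):
              """One left or right modifier step; returns (log-prob, updated subcat).

              `subcat` is a Counter such as Counter({"NP-C": 1}); `context` stands
              for (P, H, h, distance)."""
              is_complement = mod_label.endswith("-C")
              if mod_label == "STOP" and sum(subcat.values()) != 0:
                  return float("-inf"), subcat      # cannot stop with unfulfilled requirements
              if is_complement and subcat[mod_label] == 0:
                  return float("-inf"), subcat      # complement not licensed by the frame
              new_subcat = subcat.copy()
              if is_complement:
                  new_subcat[mod_label] -= 1        # requirement fulfilled, remove it
              return math.log(P_mod(mod_label, context, tuple(sorted(subcat.items())))), new_subcat</Paragraph>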
    </Section>
    <Section position="2" start_page="18" end_page="20" type="sub_section">
      <SectionTitle>
2.3 Model 3: Traces and Wh-Movement
</SectionTitle>
      <Paragraph position="0"> Another obstacle to extracting predicate-argument structure from parse trees is wh-movement. This section describes a probabilistic treatment of extraction from relative clauses. Noun phrases are most often extracted from subject position, object position, or from within PPs:  Example 1 The store (SBAR which TRACE bought Brooks Brothers)  Example 2 The store (SBAR which Marks bought TRACE)  Example 3 The store (SBAR which Marks bought Brooks Brothers from TRACE)  It might be possible to write rule-based patterns which identify traces in a parse tree. However, we argue again that this task is best integrated into the parser: the task is complex enough to warrant a probabilistic treatment, and integration may help parsing accuracy. A couple of complexities are that modification by an SBAR does not always involve extraction (e.g., &amp;quot;the fact (SBAR that besoboru is played with a ball and a bat)&amp;quot;), and it is not uncommon for extraction to occur through several constituents, (e.g., &amp;quot;The changes (SBAR that he said the government was prepared to make TRACE)&amp;quot;).</Paragraph>
      <Paragraph position="1"> [Figure 5 caption (fragment): ... initially generates an SBAR modifier, but specifies that it must contain an NP trace by adding the +gap feature. The gap is then passed down through the tree, until it is discharged as a TRACE complement to the right of bought.]</Paragraph>
      <Paragraph position="2"> The second reason for an integrated treatment of traces is to improve the parameterisation of the model. In particular, the subcategorisation probabilities are smeared by extraction. In examples 1, 2 and 3 above 'bought' is a transitive verb, but without knowledge of traces example 2 in training data will contribute to the probability of 'bought' being an intransitive verb.</Paragraph>
      <Paragraph position="3"> Formalisms similar to GPSG (Gazdar et al. 85) handle NP extraction by adding a gap feature to each non-terminal in the tree, and propagating gaps through the tree until they are finally discharged as a trace complement (see figure 5). In extraction cases the Penn treebank annotation co-indexes a TRACE with the WHNP head of the SBAR, so it is straightforward to add this information to trees in training data.</Paragraph>
      <Paragraph position="4"> Given that the LHS of the rule has a gap, there are 3 ways that the gap can be passed down to the RHS: Head The gap is passed to the head of the phrase, as in rule (3) in figure 5.</Paragraph>
      <Paragraph position="5"> Left, Right The gap is passed on recursively to one of the left or right modifiers of the head, or is discharged as a trace argument to the left/right of the head. In rule (2) it is passed on to a right modifier, the S complement. In rule (4) a trace is generated to the right of the head VB.</Paragraph>
      <Paragraph position="6"> We specify a parameter P_G(G | P, h, H) where G is either Head, Left or Right. The generative process is extended to choose between these cases after generating the head of the phrase. The rest of the phrase is then generated in different ways depending on how the gap is propagated: In the Head case the left and right modifiers are generated as normal. In the Left, Right cases a gap requirement is added to either the left or right SUBCAT variable. This requirement is fulfilled (and removed from the subcat list) when a trace or a modifier non-terminal which has the +gap feature is generated. For example, Rule (2), SBAR(that)(+gap) -&gt; WHNP(that) S-C(bought)(+gap), would have probability

        P_H(WHNP | SBAR, that) x P_G(Right | SBAR, WHNP, that) x P_LC({} | SBAR, WHNP, that) x P_RC({S-C} | SBAR, WHNP, that) x P_R(S-C(bought)(+gap) | SBAR, WHNP, that, {S-C, +gap}) x P_L(STOP | SBAR, WHNP, that, {}) x P_R(STOP | SBAR, WHNP, that, {})</Paragraph>
      <Paragraph position="8"> In rule (2) Right is chosen, so the +gap requirement is added to RC. Generation of S-C(bought)(+gap) fulfills both the S-C and +gap requirements in RC. In rule (4) Right is chosen again. Note that generation of TRACE satisfies both the NP-C and +gap subcat requirements.</Paragraph>
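      <Paragraph> A rough sketch of the gap-propagation choice in Model 3, written as a sampling step; P_G, the subcat representation and the "+gap" bookkeeping are illustrative assumptions and are simplified relative to the paper.

          import random

          def propagate_gap(parent, head_word, head_label, left_subcat, right_subcat, P_G):
              """Choose Head, Left or Right for a +gap parent and thread the gap accordingly."""
              options = ["Head", "Left", "Right"]
              weights = [P_G(g, parent, head_word, head_label) for g in options]
              choice = random.choices(options, weights=weights)[0]
              if choice == "Head":
                  head_label = head_label + "(+gap)"        # gap passed to the head child
              elif choice == "Left":
                  left_subcat = left_subcat + ["+gap"]      # discharged later on the left
              else:
                  right_subcat = right_subcat + ["+gap"]    # discharged later on the right
              return head_label, left_subcat, right_subcat</Paragraph>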
    </Section>
  </Section>
  <Section position="5" start_page="20" end_page="20" type="metho">
    <SectionTitle>
3 Practical Issues
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
3.1 Smoothing and Unknown Words
</SectionTitle>
      <Paragraph position="0"> Table 1 shows the various levels of back-off for each type of parameter in the model. Note that we decompose P_L(L_i(lw_i, lt_i) | P, H, w, t, Δ, LC) (where lw_i and lt_i are the word and POS tag generated with non-terminal L_i, and Δ is the distance measure) into the product P_L1(L_i(lt_i) | P, H, w, t, Δ, LC) x P_L2(lw_i | L_i, lt_i, P, H, w, t, Δ, LC), and then smooth these two probabilities separately (Jason Eisner, p.c.). In each case [7] the final estimate is

        e = λ1 e1 + (1 - λ1)(λ2 e2 + (1 - λ2) e3)

    where e1, e2 and e3 are maximum-likelihood estimates with the context at levels 1, 2 and 3 in the table, and λ1, λ2 and λ3 are smoothing parameters where 0 ≤ λi ≤ 1. All words occurring less than 5 times in training data, and words in test data which</Paragraph>
      <Paragraph position="2"> have never been seen in training, are replaced with the &amp;quot;UNKNOWN&amp;quot; token. This allows the model to robustly handle the statistics for rare or new words.</Paragraph>
      <Paragraph position="3"> [7] Except cases L2 and R2, which have 4 levels, so that e = λ1 e1 + (1 - λ1)(λ2 e2 + (1 - λ2)(λ3 e3 + (1 - λ3) e4)).</Paragraph>
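      <Paragraph> The three-level interpolation can be written directly; this is a sketch in which the lambda weights are supplied by hand rather than estimated as in the paper.

          def smoothed_estimate(e1, e2, e3, lambda1, lambda2):
              """e = lambda1*e1 + (1 - lambda1)*(lambda2*e2 + (1 - lambda2)*e3),
              where e1..e3 are maximum-likelihood estimates at back-off levels 1..3."""
              return lambda1 * e1 + (1 - lambda1) * (lambda2 * e2 + (1 - lambda2) * e3)

          # Example: a rare context backs off mostly to the coarser levels.
          print(smoothed_estimate(0.0, 0.12, 0.05, lambda1=0.2, lambda2=0.6))</Paragraph>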
    </Section>
    <Section position="2" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
3.2 Part of Speech Tagging and Parsing
</SectionTitle>
      <Paragraph position="0"> Part-of-speech tags are generated along with the words in this model. When parsing, the POS tags allowed for each word are limited to those which have been seen in training data for that word. For unknown words, the output from the tagger described in (Ratnaparkhi 96) is used as the single possible tag for that word. A CKY-style dynamic-programming chart parser is used to find the maximum probability tree for each sentence (see figure 6).</Paragraph>
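      <Paragraph> A sketch of the tag-dictionary restriction described above; tag_dict, candidate_tags and the example tags are illustrative assumptions, and tagger_tag stands in for the single tag proposed by the external tagger.

          from collections import defaultdict

          # Hypothetical dictionary built from training data: word to set of observed POS tags.
          tag_dict = defaultdict(set)
          tag_dict["bought"].update({"VBD", "VBN"})

          def candidate_tags(word, tagger_tag):
              """Tags the parser may consider for `word` when building the chart."""
              return tag_dict[word] if word in tag_dict else {tagger_tag}

          print(candidate_tags("bought", "NN"))      # known word: tags seen in training
          print(candidate_tags("besoboru", "NN"))    # unknown word: the tagger's tag</Paragraph>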
    </Section>
  </Section>
</Paper>