<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0317">
  <Title>Attaching Multiple Prepositional Phrases: Generalized Backed-off Estimation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> There has recently been considerable interest in the use of lexically-based statistical techniques to resolve prepositional phrase attachments. To our knowledge, however, these investigations have only considered the problem of attaching the first PP, i.e., in a [V NP PP] configuration. In this paper, we consider one technique which has been successfully applied to this problem, backed-off estimation, and demonstrate how it can be extended to deal with the problem of multiple PP attachment. Multiple PP attachment introduces two related problems: sparser data (since multiple PPs are naturally rarer), and greater syntactic ambiguity (more attachment configurations which must be distinguished). We present an algorithm which solves this problem through re-use of the relatively rich data obtained from first PP training, in resolving subsequent PP attachments.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Ambiguity is the most specific feature of natural languages: it sets them apart from programming languages and is at the root of the difficulty of the parsing enterprise, pervading languages at all levels: lexical, morphological, syntactic, semantic and pragmatic. Unless clever techniques are developed to deal with ambiguity, the number of possible parses for an average sentence (20 words) is simply intractable. In the case of prepositional phrases, the number of possible analyses follows the Catalan number series, and thus grows exponentially with the number of prepositional phrases (Church and Patil, 1982). One of the most interesting topics of current debate is the use of frequency information for automatic syntactic disambiguation.</Paragraph>
    <Paragraph position="1"> As argued in many pieces of work in the AI tradition (Marcus, 1980; Crain and Steedman, 1985; Altmann and Steedman, 1988; Hirst, 1987), the exact solution of the disambiguation problem requires complex reasoning and high level syntactic and semantic knowledge. However, current work in part-of-speech tagging has succeeded in showing that it is possible to carve out one particular subproblem and solve it by approximation, using statistical techniques, independently of the other levels of computation.
In this paper we consider the problem of prepositional phrase (PP) ambiguity. While there have been a number of recent studies concerning the use of statistical techniques for resolving single PP attachments, i.e. in constructions of the form [V NP PP], we are unaware of published work which applies these techniques to the more general, and pathological, problem of multiple PPs, e.g. [V NP PP1 PP2 ...]. In particular, the multiple PP attachment problem results in sparser data which must be used to resolve greater ambiguity: a strong test for any probabilistic approach.</Paragraph>
    <Paragraph position="2"> We begin with an overview of techniques which have been used for PP attachment disambiguation, and then consider how one of the most successful of these, the backed-off estimation technique, can be applied to the general problem of multiple PP attachment.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="149" type="metho">
    <SectionTitle>
2 Existing Models of Attachment
</SectionTitle>
    <Paragraph position="0"> Attempts to resolve the problem of PP attachment in computational linguistics are numerous, but the problem is hard and the success rate typically depends on the domain of application. Historically, attempts to resolve the problem completely, using heuristics developed with typical AI techniques (Jensen and Binot, 1987; Marcus, 1980; Crain and Steedman, 1985; Altmann and Steedman, 1988), have given way to attempts to solve the problem by less expensive means, even if only approximately.</Paragraph>
    <Paragraph position="1"> As shown by many psycholinguistic and practical studies (Ford et al., 1982; Taraban and McClelland, 1988; Whittemore et al., 1990), lexical information is one of the main cues to PP attachment disambiguation.
In one of the earliest attempts to resolve the problem of PP attachment ambiguity using lexical measures, Hindle and Rooth (1993) show that a measure of mutual information limited to lexical association can correctly resolve 80% of the cases of PP attachment ambiguity, confirming the initial hypothesis that lexical information, in particular co-occurrence frequency, is central in determining the choice of attachment.
The same conclusion is reached by Brill and Resnik (1994). They apply transformation-based learning (Brill, 1993) to the problem of learning different patterns of PP attachment. After acquiring 471 patterns of PP attachment, the parser can correctly resolve approximately 80% of the ambiguity. If word classes (Resnik, 1993) are taken into account, only 266 rules are needed to perform at 80% accuracy.
Magerman and Marcus (1991) report 54/55 correct PP attachments for Pearl, a probabilistic chart parser with Earley-style prediction that integrates lexical co-occurrence knowledge into a probabilistic context-free grammar. The probabilities of the rules are conditioned on the parent rule and on the tri-gram centered at the first input symbol that would be covered by the rule. Even though the parser has been tested only in the direction-giving domain, where the behaviour of prepositions is very consistent, it shows that a mixture of lexical and structural information is needed to solve the problem successfully.</Paragraph>
    <Paragraph position="2"> Collins and Brooks (1995) propose a 4-gram model for PP disambiguation which exploits backed-off estimation to smooth null events (see next section).</Paragraph>
    <Paragraph position="3"> Their model achieves 84.5% accuracy. The authors point out that prepositions are the most informative element in the tuple, and that taking low frequency events into account improves performance by several percentage points. In other words, in solving the PP attachment problem, backing off is advantageous only when the tuple being tested is absent from the training set (i.e., it has zero counts).</Paragraph>
    <Paragraph position="4"> Moreover, tuples that contain prepositions are the most informative.</Paragraph>
    <Paragraph position="5"> The second result is roughly confirmed by Brill and Resnik (1994), ignoring the importance of n2 when it is a temporal modifier, such as yesterday or today. In their work, the top 20 transformations learned are primarily based on specific prepositions.</Paragraph>
  </Section>
  <Section position="6" start_page="149" end_page="150" type="metho">
    <SectionTitle>
3 Back-off Estimation
</SectionTitle>
    <Paragraph position="0"> The PP attachment model presented by Collins and Brooks (1995) determines the most likely attachment for a particular prepositional phrase by estimating the probability of the attachment. We let C represent the attachment event, where C = 1 indicates that the PP attaches to the verb, and C = 2 indicates attachment to the object NP. The attachment is conditioned by the relevant head words of the VP, a 4-gram.</Paragraph>
    <Paragraph position="1"> * Tuple format: (C, v, n1, p, n2)
* So: John read [[the article] [about the budget]]
* Is encoded as: (2, read, article, about, budget)
Using a simple maximum likelihood approach, the best attachment for a particular input tuple (v, n1, p, n2) can now be determined from the training data as $\arg\max_i \hat{p}(C = i \mid v, n_1, p, n_2)$, where $\hat{p}(C = i \mid v, n_1, p, n_2) = \frac{f(i, v, n_1, p, n_2)}{f(v, n_1, p, n_2)}$ (1)</Paragraph>
    <Paragraph position="3"> Here f denotes the frequency with which a particular tuple occurs. Thus, we can estimate the probability for each configuration $1 \le i \le 2$ by counting the number of times the four head words were observed in that configuration, and dividing it by the total number of times the 4-tuple appeared in the training set.</Paragraph>
    <Paragraph position="4"> While the above equation is perfectly valid in theory, sparse data means it is rather less useful in practice. That is, for a particular sentence containing a PP attachment ambiguity, it is very likely that we will never have seen the precise (v, n1, p, n2) quadruple before in the training data, or that we will have seen it only rarely [footnote 1]. To address this problem, they employ backed-off estimation when zero counts occur in the training data. Thus if f(v, n1, p, n2) is zero, they 'back-off' to an alternative estimation of $\hat{p}$ which relies on 3-tuples rather than 4-tuples:</Paragraph>
    <Paragraph position="6"> $\hat{p}(C = i \mid v, n_1, p, n_2) = \frac{f(i, v, n_1, p) + f(i, v, p, n_2) + f(i, n_1, p, n_2)}{f(v, n_1, p) + f(v, p, n_2) + f(n_1, p, n_2)}$ (2)
Similarly, if no 3-tuples exist in the training data, they back-off further:</Paragraph>
    <Paragraph position="8"> The above equations incorporate the proposal by Collins and Brooks that only tuples including the preposition should be considered, following their results that the preposition is the most informative lexical item. Using this technique, Collins and Brooks achieve an overall accuracy of 84.5%.</Paragraph>
    <Paragraph position="9"> Footnote 1: Though, as Collins and Brooks point out, this is less of an issue since even low counts are still useful.</Paragraph>
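    <Paragraph> As an illustration of equations (1) and (2), the following minimal Python sketch builds the frequency tables from encoded training tuples and backs off from the 4-tuple to the three preposition-bearing 3-tuples. The names LEVELS, count_tuples and estimate, and the toy training tuples, are our own assumptions for illustration and are not taken from Collins and Brooks (1995); the further back-off levels are deliberately left out. <![CDATA[
```python
from collections import Counter

# Slot patterns per back-off level; every pattern keeps the preposition "p".
LEVELS = [
    [("v", "n1", "p", "n2")],                                 # equation (1)
    [("v", "n1", "p"), ("v", "p", "n2"), ("n1", "p", "n2")],  # equation (2)
]

def _keys(slot_sets, v, n1, p, n2):
    words = {"v": v, "n1": n1, "p": p, "n2": n2}
    # The slot pattern is part of each key, so counts for (v, p, n2) and
    # (n1, p, n2) can never be confused with one another.
    return [(slots, tuple(words[s] for s in slots)) for slots in slot_sets]

def count_tuples(training):
    """Build f(i, ...) and f(...) from (c, v, n1, p, n2) training tuples."""
    joint, marginal = Counter(), Counter()
    for c, v, n1, p, n2 in training:
        for slot_sets in LEVELS:
            for key in _keys(slot_sets, v, n1, p, n2):
                joint[(c, key)] += 1
                marginal[key] += 1
    return joint, marginal

def estimate(joint, marginal, c, v, n1, p, n2):
    """Return (p_hat(C = c | v, n1, p, n2), level used), or (None, None)
    when neither the 4-tuple nor any of the 3-tuples was seen in training."""
    for level, slot_sets in enumerate(LEVELS, start=1):
        keys = _keys(slot_sets, v, n1, p, n2)
        denominator = sum(marginal[k] for k in keys)
        if denominator > 0:
            numerator = sum(joint[(c, k)] for k in keys)
            return numerator / denominator, level
    return None, None

# Toy data, invented for illustration only (1 = VP, 2 = NP attachment).
training = [
    (2, "read", "article", "about", "budget"),
    (1, "read", "book", "in", "library"),
    (2, "wrote", "article", "about", "budget"),
    (1, "read", "report", "about", "budget"),
]
joint, marginal = count_tuples(training)
print(estimate(joint, marginal, 2, "read", "article", "about", "budget"))
print(estimate(joint, marginal, 2, "read", "paper", "about", "budget"))
```
]]> On this toy data the first query is answered by equation (1), since the exact quadruple was seen once with NP attachment, while the second has no 4-tuple count and is answered by equation (2) from the two (read, about, budget) observations.</Paragraph>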
  </Section>
  <Section position="7" start_page="150" end_page="151" type="metho">
    <SectionTitle>
4 The Multiple PP Attachment
</SectionTitle>
    <Paragraph position="0"> Previous work has focussed on the problem of single PP attachment, in configurations of the form [V NP PP] where both the NP and the PP are assumed to be attached within the VP. The algorithm presented in the previous section, for example, simply determines the most likely attachment event (to NP or VP) based on the supervised training provided by a parsed corpus. The broader value of this approach, however, remains suspect until it can be demonstrated to apply more generally. We now consider how this approach, and the use of lexical statistics in general, might be naturally extended to handle the more difficult problem of multiple PP attachment. In particular, we investigate the PP attachment problem in cases containing two PPs, [V NP PP1 PP2], and three PPs, [V NP PP1 PP2 PP3], with a view to determining whether n-gram based parse disambiguation models which use the backed-off estimate can be usefully applied. Multiple PP attachment presents two challenges to the approach:
1. For a single PP, the model must make a choice between two structures. For multiple PPs, the space of possible structural configurations increases dramatically, placing increased demands on the disambiguation technique.</Paragraph>
    <Paragraph position="1"> 2. Multiple PP structures are less frequent, and contain more words, than single PP structures.</Paragraph>
    <Paragraph position="2"> This substantially increases the sparse data problems when compared with the single PP attachment case.</Paragraph>
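    <Paragraph> To make the growth in the first challenge concrete, the short Python sketch below tabulates Catalan numbers, the series cited from Church and Patil (1982) in the Introduction. It assumes that n PPs correspond to the Catalan number with index n + 1; we adopt this indexing only because it reproduces the 2, 5 and 14 configurations used in this paper for one, two and three PPs. The function names are ours. <![CDATA[
```python
from math import comb

def catalan(n: int) -> int:
    """n-th Catalan number via the closed form (2n choose n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

def attachment_configurations(num_pps: int) -> int:
    """Assumed number of structures for [V NP PP1 ... PPn]: Catalan(n + 1)."""
    return catalan(num_pps + 1)

for n in range(1, 7):
    print(n, attachment_configurations(n))
```
]]> Under this assumption the count rises from 14 structures for three PPs to 42 for four and 429 for six, which is why the amount of data available per configuration shrinks so quickly.</Paragraph>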
    <Section position="1" start_page="150" end_page="150" type="sub_section">
      <SectionTitle>
4.1 Materials and Method
</SectionTitle>
      <Paragraph position="0"> To carry out the investigation, training and test data were obtained from the Penn Treebank, using the tgrep tools to extract tuples for 1-PP, 2-PP, and 3-PP cases. For the single PP study, VP attachment was coded as 1 and NP attachment was coded as 2. A database of quadruples of the form (configuration, v, n, p) was then created. The table below shows the two configurations and their frequencies in the corpus.</Paragraph>
      <Paragraph position="2"> The same procedure was used to create a database of 6-tuples (configuration, v, n1, p1, n2, p2) for the attachment of 2 PPs. The value of the configuration varies over a range 1..5, corresponding to the 5 grammatical structures possible for 2 PPs, shown and exemplified below with their counts in the corpus [footnote 2].
1. The agency said it will keep the debt under review for possible further downgrade.</Paragraph>
      <Paragraph position="3"> 2. Penney decided to extend its involvement with the service for at least five years.</Paragraph>
      <Paragraph position="4"> 3. The bill was then sent back to the House to resolve the question of how to address budget limits on credit allocations for the Federal Housing Administration.</Paragraph>
      <Paragraph position="5"> 4. Sears officials insist they don't intend to abandon the everyday pricing approach in the face of the poor results.</Paragraph>
      <Paragraph position="6"> 5. Mr. Ridley hinted at this motive in answering questions from members of Parliament after his announcement.
Finally, a database of 8-tuples (configuration, v, n1, p1, n2, p2, n3, p3) was created for 3 PPs. The value of the configuration varies over a range 1..14, corresponding to the 14 structures possible for 3 PPs, shown in Table 1 with their counts in the corpus.
The above datasets were then split into training and test sets by automatically extracting stratified samples. For PP1, we extracted a test set of about 5% of the quadruples (1014/19963). We then created a test set for PP2 which is a subset of the PP1 test set, and approximately 10% of the 2 PP tuples (464/4683). Similarly, the test set for PP3 is a subset of the PP2 test set, and approximately 10% of the 3 PP tuples (94/907). It is important that the test sets are subsets to ensure that, e.g., a PP2 test case doesn't appear in the PP1 training set, since the PP1 data is used by our algorithm to estimate PP2 attachment, and similarly for the PP3 test set.
Footnote 2: We did not consider the left-recursive NP structure for the 2 PP (or indeed 3 PP) cases. Checking the frequency of their occurrences revealed that there were only ...</Paragraph>
    </Section>
    <Section position="2" start_page="150" end_page="151" type="sub_section">
      <SectionTitle>
4.2 Does Distance Matter?
</SectionTitle>
      <Paragraph position="0"> In exploring multiple PP attachment, it seems natural to investigate the effects of the distance of the PP from the verb. The following table reports accuracy of noun-attachment when the attachment decision is conditioned only on the preposition and on the distance, in other words, when estimating $\hat{p}(1 \mid p, d)$, where 1 is the coding of the attachment to the noun, p is the preposition and $d \in \{1, 2, 3\}$ [footnote 3].
It can be seen from these figures that conditioning the attachment on both preposition and distance results in only a minor improvement in performance, mostly because separating the biases according to preposition distance increases the sparse data problem. It must be noted, however, that counts show a steady increase in the proportion of low attachments for PPs further from the verb, as shown in the table below. The simplest explanation of this fact is that more (inherently) noun-attaching prepositions must be occurring in 2nd and 3rd positions. This predicts that the distribution of preposition occurrences changes from PP1 to PP3, with an increase in the proportion of low attaching PPs.</Paragraph>
      <Paragraph position="1"> Overall, ignoring position yields 41.3% correct configurations, while using position yields 45% correct attachments.</Paragraph>
      <Paragraph position="2">  Having established that the distance parameter is not as influential a factor as we hypothesized, we exploit the observation that attachment preferences do not significantly change depending on the distance of the PP from the verb. In the following section, we discuss an extension of the back-off estimation model that capitalizes on this property.</Paragraph>
      <Paragraph position="3"> Footnote 3: ... testing on the training data. Moreover, we are only considering 2 attachment possibilities for each preposition: either it attaches to the verb or it attaches to the lowest noun.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="151" end_page="153" type="metho">
    <SectionTitle>
5 The Generalized Backed-Off Algorithm
</SectionTitle>
    <Paragraph position="0"> The algorithm for attaching the first preposition is almost identical to that of Collins and Brooks (1995), and we follow them in including only tuples which contain the preposition. We do not, however, use the final noun (following the preposition) in any of our tuples, thus basing our model of PP1 on three, rather than four, head words.</Paragraph>
    <Paragraph position="1">  The most likely configuration is: $\arg\max_i \hat{p}_1(C_2 = i \mid v, n, p)$, where $1 \le i \le 2$
1. IF $f(v, n, p) &gt; 0$ THEN $\hat{p}_1(i \mid v, n, p) = \frac{f(i, v, n, p)}{f(v, n, p)}$
2. ELSEIF $f(v, p) + f(n, p) &gt; 0$ THEN $\hat{p}_1(i \mid v, n, p) = \frac{f(i, v, p) + f(i, n, p)}{f(v, p) + f(n, p)}$
3. ELSEIF $f(p) &gt; 0$ THEN $\hat{p}_1(i \mid v, n, p) = \frac{f(i, p)}{f(p)}$
4. ELSE $\hat{p}_1(1 \mid v, n, p) = 0$, $\hat{p}_1(2 \mid v, n, p) = 1$
In this case i denotes the attachment configuration: i = 1 is VP attachment, i = 2 is NP attachment. The subscript on $C_2$ is used simply to make clear that C has 2 possible values. In the subsequent algorithms, $C_5$ and $C_{14}$ are used to indicate the larger sets of configurations.</Paragraph>
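    <Paragraph> The four steps for the first PP can be sketched in Python as follows. This is a minimal illustration under our own assumptions rather than the implementation used in the experiments: the names train, p_hat_vp and attach, the key layout of the count table, the toy examples, and the choice to resolve an exact 0.5 estimate in favour of NP attachment are all hypothetical. <![CDATA[
```python
from collections import Counter

def train(examples):
    """Count f(...) and f(i, ...) from (c, v, n, p) triples; c is 1 (VP) or 2 (NP)."""
    f = Counter()
    for c, v, n, p in examples:
        for key in (("vnp", v, n, p), ("vp", v, p), ("np", n, p), ("p", p)):
            f[key] += 1          # f(v,n,p), f(v,p), f(n,p), f(p)
            f[(c,) + key] += 1   # f(i,v,n,p), f(i,v,p), f(i,n,p), f(i,p)
    return f

def p_hat_vp(f, v, n, p):
    """Return (estimate of p1(C = 1 | v, n, p), number of observations used)."""
    levels = (
        [("vnp", v, n, p)],            # step 1
        [("vp", v, p), ("np", n, p)],  # step 2
        [("p", p)],                    # step 3
    )
    for keys in levels:
        denominator = sum(f[k] for k in keys)
        if denominator > 0:
            return sum(f[(1,) + k] for k in keys) / denominator, denominator
    return 0.0, 0                      # step 4: default to NP attachment

def attach(f, v, n, p):
    """Return 1 for VP attachment, 2 for NP attachment."""
    prob_vp, _ = p_hat_vp(f, v, n, p)
    # Assumption: an exact tie at 0.5 goes to NP, mirroring the step 4 default.
    return 1 if prob_vp > 0.5 else 2

# Toy triples, invented for illustration only.
f = train([
    (1, "keep", "debt", "under"),
    (1, "keep", "rates", "under"),
    (2, "extend", "involvement", "with"),
    (2, "resolve", "question", "of"),
    (2, "address", "limits", "on"),
])
print(attach(f, "keep", "loans", "under"))   # decided at step 2
print(attach(f, "see", "man", "through"))    # unseen preposition, step 4
```
]]> Returning the number of observations alongside the estimate is a design choice made here because the later tie-break compares preferences by the amount of evidence behind them.</Paragraph>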
    <Paragraph position="2"> The algorithm used to handle the cases containing 2 PPs is shown in Figure 1, where j ranges over the five possible attachment configurations outlined above.</Paragraph>
    <Paragraph position="3"> The first three steps use the standard backed-off estimation, again including only those tuples containing both prepositions. However, after backing-off to three elements, we abandon the standard backed-off estimation technique. The combination of sparse data and too few lexical heads renders backed-off estimation ineffective. Rather, we propose a technique which makes use of the richer data available from the PP1 training set. Our hypothesis is that this information will be useful in determining the attachments of subsequent PPs as well. This is motivated by our observations, reported in the previous section, that the distribution of high-low attachments for specific prepositions did not vary significantly for PPs further from the verb. The Competitive Backed-Off Estimate procedure, presented below, operates by initially fixing the configuration of the first preposition (to either the VP or the direct object NP), and then considers how the second preposition would be optimally attached into the configuration.</Paragraph>
    <Paragraph position="4">
1. $C_2'$ is the most likely configuration for PP1, $\arg\max_i \hat{p}_1(C_2' = i \mid v, n_1, p_1)$
2. $C_2''$ is the preferred configuration for PP2 w.r.t. n2, $\arg\max_i \hat{p}_1(C_2'' = i \mid v, n_2, p_2)$
3. $C_2'''$ is the preferred configuration for PP2 w.r.t. n1, $\arg\max_i \hat{p}_1(C_2''' = i \mid v, n_1, p_2)$
4. Find Best Configuration
First we determine $C_2'$, on which depends the attachment of p1. We then determine $C_2''$, which indicates the preference for p2 to attach to the VP or to n2, and $C_2'''$, which is the preference for p2 to attach to the VP or to n1. Given the preferred configurations $C_2'$, $C_2''$, and $C_2'''$, we now must determine the best of the five possible configurations, $C_5$.
The tests 1 to 5 simply use the attachment values $C_2'$, $C_2''$, and $C_2'''$ to determine $C_5$: the best configuration. In the final instance, step 6, where $C_2''$ indicates a preference for n2 attachment and $C_2'''$ indicates a preference for n1 attachment, a tie-break is necessary to determine which noun to attach to. As a first approximation, we use the frequency of occurrence used in determining these preferences, rather than the probability for each preference. That is, we favour the bias for which there is more evidence, though whether this is optimal remains an empirical question. For example, if $C_2''$ is based on 4 observations, and $C_2'''$ is based on 7, then the $C_2'''$ preference is considered stronger.</Paragraph>
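    <Paragraph> The tie-break in the final step can be sketched as below. The Preference record and the function name tie_break are hypothetical, the probabilities are invented, and the handling of equal evidence (keeping the n2 preference) is an assumption that the text does not settle; only the 4-versus-7 observation counts come from the example above. The sketch covers the tie-break alone, not tests 1 to 5. <![CDATA[
```python
from typing import NamedTuple

class Preference(NamedTuple):
    site: str           # where this pairwise comparison wants PP2 to attach
    probability: float  # estimate behind the preference
    evidence: int       # number of training observations behind the estimate

def tie_break(pref_n2: Preference, pref_n1: Preference) -> Preference:
    """Choose between two noun preferences by the amount of evidence."""
    if pref_n1.evidence > pref_n2.evidence:
        return pref_n1
    # Assumption (not stated in the text): equal evidence keeps the n2 preference.
    return pref_n2

# The n2 preference has the higher probability but rests on fewer observations.
prefers_n2 = Preference("n2", 1.00, 4)
prefers_n1 = Preference("n1", 0.71, 7)
print(tie_break(prefers_n2, prefers_n1).site)
```
]]> Here the n1 preference wins because it rests on 7 observations rather than 4, even though its estimated probability is lower, which is exactly the case where an evidence-based and a probability-based tie-break would disagree.</Paragraph>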
    <Paragraph position="5"> Having constructed the algorithm to determine the best configuration for 2 PPs, we can similarly generalize the algorithm to handle three. In this case k denotes one of fourteen possible attachment configurations shown earlier. The pseudo code for procedure B3 is shown below, simplified for reasons of space.</Paragraph>
    <Paragraph position="6"> Procedure B3 The most likely configuration is: $\arg\max_k \hat{p}_3(C_{14} = k \mid v, n_1, p_1, n_2, p_2, n_3, p_3)$, where $1 \le k \le 14$
2. ELSE Try backing-off to 6 or 5 items ...</Paragraph>
    <Paragraph position="7"> 3. ELSE Competitive Backed-off Estimate:
(a) Use Procedure B2 to determine $C_5$, the configuration of p1 and p2
(b) Compute $C_2'$, $C_2''$, $C_2'''$, the preferred attachment of p3 w.r.t. n1, n2, n3 respectively
(c) Determine the best configuration
Again, we back-off up to two times, always including tuples which contain the three prepositions. After this, backing-off becomes unstable, so we use the Competitive Backed-off Estimate, as above, but scaled up to handle the three prepositions and fourteen possible configurations.</Paragraph>
  </Section>
</Paper>