<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2018">
  <Title>A Maximum-Entropy-Inspired Parser *</Title>
  <Section position="3" start_page="0" end_page="132" type="metho">
    <SectionTitle>
2 The Generative Model
</SectionTitle>
    <Paragraph position="0"> The model assigns a probability to a parse by a top-down process of considering each constituent c in ~r and for each c first guessing the pre-terminal of c, t(c) (t for &amp;quot;tag&amp;quot;), then the lexical head of c, h(c), and then the expansion of c into further constituents c(c). Thus the probability of a parse is given by the equation</Paragraph>
    <Paragraph position="2"> where l(c) is the label of c (e.g., whether it is a noun phrase (np), verb-phrase, etc.) and H(c)is the relevant history of c -- information outside c that our probability model deems important in determining the probability in question. Much of the interesting work is determining what goes into H(c). Whenever it is clear to which constituent we are referring we omit the (c) in, e.g., h(c). In this notation the above equation takes the following form:</Paragraph>
    <Paragraph position="4"> Next we describe how we assign a probability to the expansion e of a constituent. In Section 5 we present some results in which the possible expansions of a constituent are fixed in advanced by extracting a tree-bank grammar \[3\] from the training corpus. The method that gives the best results, however, uses a Markov grammar -- a method for assigning probabilities to any possible expansion using statistics gathered from the training corpus \[6,10,15\]. The method we use follows that of \[10\]. In this scheme a traditional probabilistic context-free grammar (PCFG) rule can be thought of as consisting of a left-hand side with a label l(e) drawn from the non-terminal symbols of our grammar, and a right-hand side that is a sequence of one or more such symbols. (We assume that all terminal symbols are generated by rules of the form &amp;quot;preterm -+ word' and we treat these as a special case.) For us the non-terminal symbols are those of the tree-bank, augmented by the symbols aux and auxg, which have been assigned deterministically to certain auxiliary verbs such as &amp;quot;have&amp;quot; or &amp;quot;having&amp;quot;. For each expansion we distinguish one of the right-hand side labels as the &amp;quot;middle&amp;quot; or &amp;quot;head&amp;quot; symbol M(c). M(c) is the constituent from which the head lexical item h is obtained according to deterministic rules that pick the head of a constituent from among the heads of its children. To the left of M is a sequence of one or more left labels Li (c) including the special termination symbol A, which indicates that there are no more symbols to the left, and similarly for the labels to the right, Ri(c).</Paragraph>
    <Paragraph position="5"> Thus an expansion e(c) looks like:</Paragraph>
    <Paragraph position="7"> The expansion is generated by guessing first M, then in order L1 through Lm+t (= A), and similarly for R1 through R,+~.</Paragraph>
    <Paragraph position="8"> In a pure Markov PCFG we are given the left-hand side label l and then probabilisticaily generate the right-hand side conditioning on no information other than I and (possibly) previously generated pieces of the right-hand side itself. In the simplest of such models, a zero-order Markov grammar, each label on the right-hand side is generated conditioned only on l -that is, according to the distributions p(Li I l), p(M l I), and p(Ri l l).</Paragraph>
    <Paragraph position="9"> More generally, one can condition on the m previously generated labels, thereby obtaining an mth-order Markov grammar. So, for example, in a second-order Markov PCFG, L2 would be conditioned on L1 and M. In our complete model, of course, the probability of each label in the expansions is also conditioned on other material as specified in Equation 1, e.g., p(e I l, t, h, H). Thus we would use p(L2 I L1, M, l, t, h, H). Note that the As on both ends of the expansion in Expression 2 are conditioned just like any other label in the expansion.</Paragraph>
  </Section>
  <Section position="4" start_page="132" end_page="134" type="metho">
    <SectionTitle>
3 Maximum-Entropy-Inspired Parsing
</SectionTitle>
    <Paragraph position="0"> The major problem confronting the author of a generative parser is what information to use to condition the probabilities required in the model, and how to smooth the empirically obtained probabilities to take the sting out of the sparse data problems that are inevitable with even the most modest conditioning. For example, in a second-order Markov grammar we conditioned the L2 label according to the distribution p(L2 I Lt,M,I,t,h,H). Also, remember that H is a placeholder for any other information beyond the constituent e that may be useful in assigning c a probability.</Paragraph>
    <Paragraph position="1"> In the past few years the maximum entropy, or log-linear, approach has recommended itself to probabilistic model builders for its flexibility and its novel approach to smoothing \[1,17\]. A complete review of log-linear models is beyond the scope of this paper. Rather, we concentrate on the aspects of these models that most directly influenced the model presented here.</Paragraph>
    <Paragraph position="2"> To compute a probability in a log-linear model one first defines a set of &amp;quot;features&amp;quot;, functions from the space of configurations over which one is trying to compute probabilities to integers that denote the number of times some pattern occurs in the input. In our work we assume that any feature can occur at most once, so features are boolean-valued: 0 if the pattern does not occur, 1 if it does.</Paragraph>
    <Paragraph position="3"> In the parser we further assume that features are chosen from certain feature schemata and that every feature is a boolean conjunction of sub-features. For example, in computing the probability of the head's pre-terminal t we might want a feature schema f(t, l) that returns  label of c = l, and zero otherwise. This feature is obviously composed of two sub-features, one recognizing t, the other 1. If both return 1, then the feature returns 1.</Paragraph>
    <Paragraph position="4"> Now consider computing a conditional probability p(a I H) with a set of features fl... fj that connect a to the history H. In a log-linear model the probability function takes the following form:</Paragraph>
    <Paragraph position="6"> Here the Ai are weights between negative and positive infinity that indicate the relative importance of a feature: the more relevant the feature to the value of the probability, the higher the absolute value of the associated X. The function Z(H), called the partition function, is a normalizing constant (for fixed H), so the probabilities over all a sum to one.</Paragraph>
    <Paragraph position="7"> Now for our purposes it is useful to rewrite this as a sequence of multiplicative functions gi(a,H) for 0 &lt; i &lt; j: p(a I H)= go(a,H)gl(a,H) ...gj(a,H). (4) Here go(a,H) = 1/Z(H) and gi(a,H) = e'~(a'n)f~(a'H). The intuitive idea is that each factor gi is larger than one if the feature in question makes the probability more likely, one if the feature has no effect, and smaller than one if it makes the probability less likely.</Paragraph>
    <Paragraph position="8"> Maximum-entropy models have two benefits for a parser builder. First, as already implicit in our discussion, factoring the probability computation into a sequence of values corresponding to various 'tfeatures&amp;quot; suggests that the probability model should be easily changeable -- just change the set of features used. This point is emphasized by Ratnaparkhi in discussing his parser \[17\]. Second, and this is a point we have not yet mentioned, the features used in these models need have no particular independence of one another. This is useful if one is using a log-linear model for smoothing. That is, suppose we want to compute a conditional probability p(a \] b,c), but we are not sure that we have enough examples of the conditioning event b, c in the training corpus to ensure that the empirically obtained probability/~(a \[ b, c) is accurate. The traditional way to handle this is also to compute/~(a I b), and perhaps iS(a I c) as well, and take some combination of these values as one's best estimate for p(a I b, c). This method is known as &amp;quot;deleted interpolation&amp;quot; smoothing. In max-entropy models one can simply include features for all three events fl(a, b, c), f2(a, b), and f3(a, c) and combine them in the model according to Equation 3, or equivalently, Equation 4. The fact that the features are very far from independent is not a concern.</Paragraph>
    <Paragraph position="9"> Now let us note that we can get an equation of exactly the same form as Equation 4 in the following fashion: p(alb, c)p(alb, c,d) p(alb, c,d)=p(alb)-~alb) p(alb, c) (5) Note that the first term of the equation gives a probability based upon little conditioning information and that each subsequent term is a number from zero to positive infinity that is greater or smaller than one if the new information being considered makes the probability greater or smaller than the previous estimate.</Paragraph>
    <Paragraph position="10"> As it stands, this last equation is pretty much content-free. But let us look at how it works for a particular case in our parsing scheme. Consider the probability distribution for choosing the pre-terminal for the head of a constituent.</Paragraph>
    <Paragraph position="11"> In Equation I we wrote this as p(t I l, H). As we discuss in more detail in Section 5, several different features in the context surrounding c are useful to include in H: the label, head pre-terminal and head of the parent of c (denoted as lv, tv, hp), the label of c's left sibling (lb for &amp;quot;before&amp;quot;), and the label of the grandparent of c (la). That is, we wish to compute p(t I l, lv, tv, lb, lg, by). We can now rewrite this in the form of Equation 5 as follows: p(t I 1, Iv, tv, lb, IQ, hv) = p(t l t)P(t l t, tv) P(t l t, tv, tv) p(t l t, tp, tv, tb) p(t l l) p(t l l, lp) p(t l t, tp, tp) P(t l t'Iv'tv'Ib'Ig)p(t l t'Ip'tv'Ib'Ig'hP). (6) p(t I z, t,, t,, lb) p(t I t, l,, t,, lb, t,) Here we have sequentially conditioned on steadily increasing portions of c's history. In many cases this is clearly warranted. For example, it does not seem to make much sense to condition on, say, h v without first conditioning on tp. In other cases, however, we seem  to be conditioning on apples and oranges, so to speak. For example, one can well imagine that one might want to condition on the parent's lexical head without conditioning on the left sibling, or the grandparent label. One way to do this is to modify the simple version shown in Equation 6 to allow this: p(t I l, l., b, h,) = p(t t l)P(t l l, lv) P(t l l, lp, tv) P(t l l, lv, tp, lb) p(t i l ) p(t l l ,lp) p(t l l ,lv,tv) p(t I l, lp, tp, p(t I l, t,,, p(t I l, lp, tp) p(t I l, tp, (7) Note the changes to the last three terms in Equation 7. Rather than conditioning each term on the previous ones, they are now conditioned only on those aspects of the history that seem most relevant. The hope is that by doing this we will have less difficulty with the splitting of conditioning events, and thus somewhat less difficulty with sparse data.</Paragraph>
    <Paragraph position="12"> We make one more point on the connection of Equation 7 to a maximum entropy formulation. Suppose we were, in fact, going to compute a true maximum entropy model based upon the features used in Equation 7, fl(t,l),f2(t,l, lp),f3(t,l, lv) .... This requires finding the appropriate his for Equation 3, which is accomplished using an algorithm such as iterative scaling \[11\] in which values for the Ai are initially &amp;quot;guessed&amp;quot; and then modified until they converge on stable values. With no prior knowledge of values for the )q one traditionally starts with )~i = 0, this being a neutral assumption that the feature has neither a positive nor negative impact on the probability in question.</Paragraph>
    <Paragraph position="13"> With some prior knowledge, non-zero values can greatly speed up this process because fewer iterations are required for convergence. We comment on this because in our example we can substantially speed up the process by choosing values picked so that, when the maximum-entropy equation is expressed in the form of Equation 4, the gi have as their initial values the values of the corresponding terms in Equation 7. (Our experience is that rather than requiring 50 or so iterations, three suffice.) Now we observe that if we were to use a maximum-entropy approach but run iterative scaling zero times, we would, in fact, just have Equation 7.</Paragraph>
    <Paragraph position="14"> The major advantage of using Equation 7 is that one can generally get away without computing the partition function Z(H). In the simple (content-free) form (Equation 6), it is clear that Z(H) = 1. In the more interesting version, Equation 7, this is not true in general, but one would not expect it to differ much from one, and we assume that as long as we are not publishing the raw probabilities (as we would be doing, for example, in publishing perplexity results) the difference from one should be unimportant. As partition-function calculation is typically the major on-line computational problem for maximum-entropy models, this simplifies the model significantly.</Paragraph>
    <Paragraph position="15"> Naturally, the distributions required by Equation 7 cannot be used without smoothing. In a pure maximum-entropy model this is done by feature selection, as in Ratnaparkhi's maximum-entropy parser \[17\]. While we could have smoothed in the same fashion, we choose instead to use standard deleted interpolation.</Paragraph>
    <Paragraph position="16"> (Actually, we use a minor variant described in \[4\].)</Paragraph>
  </Section>
  <Section position="5" start_page="134" end_page="135" type="metho">
    <SectionTitle>
4 The Experiment
</SectionTitle>
    <Paragraph position="0"> We created a parser based upon the maximum-entropy-inspired model of the last section, smoothed using standard deleted interpolation.</Paragraph>
    <Paragraph position="1"> As the generative model is top-down and we use a standard bottom-up best-first probabilistic chart parser \[2,7\], we use the chart parser as a first pass to generate candidate possible parses to be evaluated in the second pass by our probabilistic model. For runs with the generative model based upon Markov grammar statistics, the first pass uses the same statistics, but conditioned only on standard PCFG information.</Paragraph>
    <Paragraph position="2"> This allows the second pass to see expansions not present in the training corpus.</Paragraph>
    <Paragraph position="3"> We use the gathered statistics for all observed words, even those with very low counts, though obviously our deleted interpolation smoothing gives less emphasis to observed probabilities for rare words. We guess the preterminals of words that are not observed in the training data using statistics on capitalization, hyphenation, word endings (the last two letters), and the probability that a given pre-terminal is realized using a previously unobserved word.</Paragraph>
    <Paragraph position="4"> As noted above, the probability model uses  ous work five smoothed probability distributions, one each for L~, M, Ri, t, and h. The equation for the (unsmoothed) conditional probability distribution for t is given in Equation 7. The other four equations can be found in a longer version of this paper available on the author's website (www.cs.brown.edu/~.,ec). L and R are conditioned on three previous labels so we are using a third-order Markov grammar. Also, the label of the parent constituent Ip is conditioned upon even when it is not obviously related to the further conditioning events. This is due to the importance of this factor in parsing, as noted in, e.g., \[14\].</Paragraph>
    <Paragraph position="5"> In keeping with the standard methodology \[5, 9,10,15,17\], we used the Penn Wall Street Journal tree-bank \[16\] with sections 2-21 for training, section 23 for testing, and section 24 for development (debugging and tuning).</Paragraph>
    <Paragraph position="6"> Performance on the test corpus is measured using the standard measures from \[5,9,10,17\].</Paragraph>
    <Paragraph position="7"> In particular, we measure labeled precision (LP) and recall (LR), average number of crossbrackets per sentence (CB), percentage of sentences with zero cross brackets (0CB), and percentage of sentences with &lt; 2 cross brackets (2CB). Again as standard, we take separate measurements for all sentences of length &lt;_ 40 and all sentences of length &lt; 100. Note that the definitions of labeled precision and recall are those given in \[9\] and used in all of the previous work. As noted in \[5\], these definitions typically give results about 0.4% higher than the more obvious ones. The results for the new parser as well as for the previous top-three individual parsers on this corpus are given in Figure 1.</Paragraph>
    <Paragraph position="8"> As is typical, all of the standard measures tell pretty much the same story, with the new parser outperforming the other three parsers. Looking in particular at the precision and recall figures, the new parser's give us a 13% error reduction over the best of the previous work, Co1199 \[9\].</Paragraph>
  </Section>
class="xml-element"></Paper>