
<?xml version="1.0" standalone="yes"?>
<Paper uid="J97-4005">
  <Title>Stochastic Attribute-Value Grammars</Title>
  <Section position="4" start_page="599" end_page="604" type="metho">
    <SectionTitle>
2. Stochastic Context-Free Grammars
</SectionTitle>
    <Paragraph position="0"> Let us begin by examining stochastic context-free grammars (SCFGs) and asking why the natural extension of SCFG parameter estimation to attribute-value grammars fails.</Paragraph>
    <Paragraph position="1"> A point of terminology: I will use the term grammar to refer to an unweighted grammar, be it a context-free grammar or attribute-value grammar. A grammar equipped with weights (and other periphenalia as necessary) I will refer to as a model. Occasionally I will also use model to refer to the weights themselves, or the probability distribution they define.</Paragraph>
    <Paragraph position="2"> Throughout we will use the following stochastic context-free grammar for illustrative purposes. Let us call the underlying grammar GI and the grammar equipped with weights as shown, MI:  1. S-+AA fll = 1/2 2. S-+B f12 = 1/2 3. A--+a f13 = 2/3 4. A--+b f14 = 1/3 5. B--+ a a f15 = 1/2 6. B --+ b b f16 = 1/2  The probability of a given tree is computed as the product of probabilities of rules used in it. For example: Let x be the tree in Figure 2 and let ql be the probability distribution over trees defined by model M1. Then: 1 2 2 2 ql(x) = ill. fiB&amp;quot; ~3 = ~&amp;quot; 5&amp;quot; ~ = In parsing, we use the probability distribution ql (x) defined by model M1 to disambiguate: the grammar assigns some set of trees {Xl ..... Xn} to a sentence or, and we  Computing the probability of a parse tree.</Paragraph>
    <Paragraph position="3"> choose that tree xi that has greatest probability ql (Xi)&amp;quot; The issue of efficiently computing the most-probable parse for a given sentence has been thoroughly addressed in the literature. The standard parsing techniques can be readily adapted to the random-field models to be discussed below, so I simply refer the reader to the literature. Instead, I concentrate on parameter estimation, which, for attribute-value grammars, cannot be accomplished by standard techniques.</Paragraph>
    <Paragraph position="4"> By parameter estimation we mean determining values for the weights ft. In order for a stochastic grammar to be useful, we must be able to compute the correct weights, where by correct weights we mean the weights that best account for a training corpus. The degree to which a given set of weights accounts for a training corpus is measured by the similarity between the distribution q(x) determined by the weights fl and the distribution of trees x in the training corpus.</Paragraph>
    <Section position="1" start_page="600" end_page="602" type="sub_section">
      <SectionTitle>
2.1 The Goodness of a Model
</SectionTitle>
      <Paragraph position="0"> The distribution determined by the training corpus is known as the empirical distribution. For example, suppose we have a training corpus containing twelve trees of the four types from L(G1) shown in Figure 3, where c(x) is the count of how often the</Paragraph>
      <Paragraph position="2"> An empirical distribution. There are twelve parse trees of four distinct types.</Paragraph>
      <Paragraph position="3"> tree (type) x appears in the corpus, and/3(.) is the empirical distribution, defined as:</Paragraph>
      <Paragraph position="5"> In comparing a distribution q to the empirical distribution \]~, we shall actually measure dissimilarity rather than similarity. Our measure for dissimilarity of distributions</Paragraph>
      <Paragraph position="7"> The divergence between ~ and q at point x is the log of the ratio of ~(x) to q(x). The overall divergence between ~ and q is the average divergence, where the averaging is over tree (tokens) in the corpus; i.e., point divergences In(~(x)/q(x)) are weighted by ~(x) and summed.</Paragraph>
      <Paragraph position="8"> For example, let ql be, as before, the distribution determined by model M1. Table 1 shows ql, P, the ratio ql (X)/\])(X), and the weighted point divergence ~(x) ln(~(x)/ql (x)). The sum of the fourth column is the KL divergence D(~llql ) between ~ and ql. The third column contains ql(x)/~(x) rather than ~(x)/ql (x) so that one can see at a glance whether ql(x) is too large (&gt; 1) or too small (&lt; 1). The total divergence D(~\]lql ) = 0.32. One set of weights is better than another if its divergence from the empirical distribution is less. For example, let us consider a different set of weights for grammar G1. Let M' be G1 with weights (1/2,1/2,1/2,1/2,1/2,1/2), and let q' be the probability distribution determined by Mq Then the computation of the KL divergence is as in  distribution ql is a better distribution than q', in the sense that ql is more similar (less dissimilar) to the empirical distribution than q~ is.</Paragraph>
      <Paragraph position="9"> One reason for adopting minimal KL divergence as a measure of goodness is that minimizing KL divergence maximizes likelihood. The likelihood of distribution q is the probability of the training corpus according to q:</Paragraph>
      <Paragraph position="11"> Abney Stochastic Attribute-Value Grammars Since log is monotone increasing, maximizing likelihood is equivalent to maximizing log likelihood:</Paragraph>
      <Paragraph position="13"> The expression on the right-hand side is -1/N times the cross entropy of q with respect to ~, hence maximizing log likelihood is equivalent to minimizing cross entropy.</Paragraph>
      <Paragraph position="14"> Finally, D(~llq) is equal to the cross entropy of q less the entropy of ~, and the entropy of is constant with respect to q; hence minimizing cross entropy (maximizing likelihood) is equivalent to minimizing divergence.</Paragraph>
    </Section>
    <Section position="2" start_page="602" end_page="604" type="sub_section">
      <SectionTitle>
2.2 The ERF Method
</SectionTitle>
      <Paragraph position="0"> For stochastic context-free grammars, it can be shown that the ERF method yields the best model for a given training corpus. First, let us introduce some terminology and notation. With each rule i in a stochastic context-free grammar is associated a weight fli and a functionj~(x) that returns the number of times rule i is used in the derivation of tree x. For example, consider the tree in Figure 2, repeated here in Figure 4 for convenience: Rule 1 is used once and rule 3 is used twice; accordingly fl(x) = 1,</Paragraph>
      <Paragraph position="2"> We use the notation p\[yq to represent the expectation off under probability distribution p; that is, p\[yq -- ~x p(x)f(x). The ERF method instructs us to choose the weight fli for rule i proportional to its empirical expectation ~\[f;\]. Algorithmically, we compute the expectation of each rule's frequency, and normalize among rules with the same left-hand side.</Paragraph>
      <Paragraph position="3"> To illustrate, let us consider corpus (2.1) again. The expectation of each rule frequencyy~ is a sum of terms ~(x)fi(x). These terms are shown for each tree, in Table 3. For example, in tree xl, rule 1 is used once and rule 3 is used twice. The empirical probability of xl is 1/3, so Xl'S contribution to \]~\[fl\] is 1/3.1, and its contribution to \]~\[f3\] is 1/3.2. The weight fli is obtained from p\[fi\] by normalizing among rules with the same left-hand side. For example, the expected rule frequencies/~\[fl\] and \]~\[f2\] of rules with left-hand side S already sum to 1, so they are adopted without change as fll and f12. On the other hand, the expected rule frequencies \])\[fs\] and/)\[f6\] for rules with left-hand side B sum to 1/2, not 1, so they are doubled to yield weights t55 and t56. It should be observed that the resulting weights are precisely the weights of model M1.</Paragraph>
      <Paragraph position="4"> It can be proven that the ERF weights are the best weights for a given context-free grammar, in the sense that they define the distribution that is most similar to the empirical distribution. That is, if fl are the ERF weights (for a given grammar),</Paragraph>
      <Paragraph position="6"> defining distribution q, and fl' defining q~ is any set of weights such that q ~ q', then D(~\]\]q) &lt; D(fii\[q').</Paragraph>
      <Paragraph position="7"> One might expect the best weights to yield D(fi\[\]q) = 0, but such is not the case. We have just seen, for example, that the best weights for grammar G1 yield distribution ql, yet D(/~\]\]ql) = 0.32 &gt; 0. A closer inspection of the divergence calculation in Table 1 reveals that ql is sometimes less than ~, but never greater than ~. Could we improve the fit by increasing ql? For that matter, how can it be that ql is never greater than fi? As probability distributions, ql and/3 should have the same total mass, namely, one.</Paragraph>
      <Paragraph position="8"> Where is the missing mass for ql? The answer is of course that ql and /3 are probability distributions over L(G1), but not all of L(G1) appears in the corpus. Two trees are missing, and they account for the missing mass. These two trees are given in Figure 5. Each of these trees has  The trees from L(G1) that are missing in the training corpus.</Paragraph>
      <Paragraph position="9"> probability 0 according to ~ (hence they can be ignored in the divergence calculation), but probability 1/9 according to ql.</Paragraph>
      <Paragraph position="10"> Intuitively, the problem is this: The distribution ql assigns too little weight to trees xl and x2, and too much weight to the &amp;quot;missing&amp;quot; trees; call them x5 and x6. Yet exactly the same rules are used in x5 and x6 as are used in xl and x2. Hence there is no way to increase the weight for trees Xl and x2, improving their fit to ~, without simultaneously increasing the weight for Xs and x6, making their fit to ~ worse. The distribution ql is the best compromise possible.</Paragraph>
      <Paragraph position="11"> To say it another way, our assumption that the corpus was generated by a context-free grammar means that any context dependencies in the corpus must be accidental, the result of sampling noise. There is indeed a dependency in the corpus in Figure 3: in the trees where there are two A's, the A's always rewrite the same way. If the corpus was generated by a stochastic context-free grammar, then this dependency is accidental.</Paragraph>
      <Paragraph position="12"> This does not mean that the context-free assumption is wrong. If we generate twelve trees at random from ql, it would not be too surprising if we got the corpus in</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="604" end_page="604" type="metho">
    <SectionTitle>
Abney Stochastic Attribute-Value Grammars
</SectionTitle>
    <Paragraph position="0"> impossible for the resulting empirical distribution to match the distribution ql. But as the corpus size increases, the fit between ~ and ql becomes ever better.</Paragraph>
  </Section>
  <Section position="6" start_page="604" end_page="607" type="metho">
    <SectionTitle>
3. Attribute-Value Grammars
</SectionTitle>
    <Paragraph position="0"> But what if the dependency in corpus (3) is not accidental? What if we wish to adopt a grammar that imposes the constraint that both A's rewrite the same way? We can impose such a constraint by means of an attribute-value grammar.</Paragraph>
    <Paragraph position="1"> We may formalize an attribute-value grammar as a context-free grammar with attribute labels and path equations. An example is the following grammar; let us call it G2:  1. S--* I:A2:A /1 1) =/2 1) 2. S --* I:B 3. A ~ l:a 4. A--* l:b 5. B --* l:a 6. B --* l:b  Generating a dag. The grammar used is G2. node labeled with the start category of G2, namely, S. A node x is expanded by choosing a rule that rewrites the category of x. In this case, we choose rule 1 to expand the root node. Rule 1 instructs us to create two children, both labeled A. The edge to the first child is labeled 1 and the edge to the second child is labeled 2. The constraint (1 1) = (2 1) indicates that the 1 child of the 1 child of x is identical to the 1 child of the 2 child of x. We create an unlabeled node to represent this grandchild of x and direct appropriately labeled edges from the children, yielding (b). We proceed to expand the newly introduced nodes. We choose rule 3 to expand the first A node. In this case, a child with edge labeled 1 already exists, so we use it rather than creating a new one. Rule 3 instructs us to label this child a, yielding (c). Now we expand the second A node. Again we choose rule 3. We are instructed to label the 1 child a, but it already has that label, so we do not need to do anything. Finally, in (d), the only remaining node is the bottom-most node, labeled a. Since its label is a terminal category, it does not need to be expanded, and we are done. Let us back up to (c) again. Here we were free to choose rule 4 instead of rule 3 to expand the right-hand A node. Rule 4 instructs us to label the I child b, but we cannot, inasmuch as it is already labeled a. The derivation fails, and no dag is generated.  Computational Linguistics Volume 23, Number 4 The language L(G2) is the set of dags produced by successful derivations, as shown in Figure 7. (The edges of the dags should actually be labeled with l's and 2's, but I  The language generated by G2.</Paragraph>
    <Paragraph position="2"> have suppressed the edge labels for the sake of perspicuity.)</Paragraph>
    <Section position="1" start_page="605" end_page="606" type="sub_section">
      <SectionTitle>
3.1 AV Grammars and the ERF Method
</SectionTitle>
      <Paragraph position="0"> Now we face the question of how to attach probabilities to grammar G2. The natural extension of the method we used for context-free grammars is the following: Associate a weight with each of the six rules of grammar G2. For example, let M2 be the model consisting of G2 plus weights (ill ..... /36) = (1/2,1/2, 2/3,1/3,1/2,1/2). Let C/2(x) be the weight that M2 assigns to dag x; it is defined to be the product of the weights of the rules used to generate x. For example, the weight C/2(xl) assigned to tree xl of  Observe that C/2(xa) = fllfl 2, which is to say, fl/l(x,)fl/~(x,) Moreover, since fl0 1, 1 3 &amp;quot; it does not hurt to include additional factors fl:(xl) for those i where y~(xl) = 0. That is, we can define the dag weight C/ corresponding to rule weights fl = (ill ..... fin) generally as:</Paragraph>
      <Paragraph position="2"> The next question is how to estimate weights. Let us consider what happens when we use the ERF method. Let us assume a corpus distribution for the dags in Figure 7 analogous to the distribution in Figure 3:</Paragraph>
      <Paragraph position="4"> Using the ERF method, we estimate rule weights as in Table 4. This table is identical to the one given earlier in the context-free case. We arrive at the same weights M2 we considered above, defining dag weights C/2(x).</Paragraph>
      <Paragraph position="6"/>
    </Section>
    <Section position="2" start_page="606" end_page="607" type="sub_section">
      <SectionTitle>
3.2 Why the ERF Method Fails
</SectionTitle>
      <Paragraph position="0"> But at this point a problem arises: ~2 is not a probability distribution. Unlike in the context-free case, the four dags in Figure 7 constitute the entirety of L(G2). This time, there are no missing dags to account for the missing probability mass.</Paragraph>
      <Paragraph position="1"> There is an obvious &amp;quot;fix&amp;quot; for this problem: we can simply normalize 62. We might define the distribution q for an AV grammar with weight function ~b as: q(X)=z~(X) where Z is the normalizing constant: xEL(G) In particular, for ~2, we have Z = 2/9 + 1/18 + 1/4 + 1/4 = 7/9. Dividing ~2 by 7/9</Paragraph>
      <Paragraph position="3"> On the face of it, then, we can transplant the methods we used in the context-free case to the AV case and nothing goes wrong. The only problem that arises (@ not summing to one) has an obvious fix (normalization).</Paragraph>
      <Paragraph position="4"> However, something has actually gone very wrong. The ERF method yields the best weights only under certain conditions that we inadvertently violated by changing L(G) and re-apportioning probability via normalization. In point of fact, we can easily see that the ERF weights in Table 4 are not the best weights for our example grammar. Consider the alternative model M* given in Figure 9, defining probability distribution q*.</Paragraph>
      <Paragraph position="5">  Computational Linguistics Volume 23, Number 4 side sum to one. The reader can verify that ** sums to Z = 3+v~ and that q, is: 3</Paragraph>
      <Paragraph position="7"> In short, in the AV case, the ERF weights do not yield the best weights. This means that the ERF method does not converge to the correct weights as the corpus size increases. If there are genuine dependencies in the grammar, the ERF method converges systematically to the wrong weights. Fortunately, there are methods that do converge to the right weights. These are methods that have been developed for random fields.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="607" end_page="613" type="metho">
    <SectionTitle>
4. Random Fields
</SectionTitle>
    <Paragraph position="0"> A random field defines a probability distribution over a set of labeled graphs f~ called configurations. In our case, the configurations are the dags generated by the grammar, i.e., f~ = L(G). The weight assigned to a configuration is the product of the weights assigned to selected features of the configuration. We use the notation:</Paragraph>
    <Paragraph position="2"> where fli is the weight for feature i and f/(.) is its frequency function, that is, fi(x) is the number of times that feature i occurs in configuration x. (For most purposes, a feature can be identified with its frequency function; I will not always make a careful distinction between them.) I use the term feature here as it is used in the machine learning and statistical pattern recognition literature, not as in the constraint grammar literature, where feature is synonymous with attribute. In my usage, dag edges are labeled with attributes, not features. Features are rather like geographic features of dags: a feature is some larger or smaller piece of structure that occurs--possibly at more than one place---in a dag.</Paragraph>
    <Paragraph position="3"> The probability of a configuration (that is, a dag) is proportional to its weight, and is obtained by normalizing the weight distribution.</Paragraph>
    <Paragraph position="5"> If we identify the features of a configuration with local trees equivalently, with applications of rewrite rules--the random field model is almost identical to the model we considered in the previous section. There are two important differences. First, we no longer require weights to sum to one for rules with the same left-hand side.</Paragraph>
    <Paragraph position="6"> Second, the model does not require features to be identified with rewrite rules. We use the grammar to define the set of configurations f~ = L(G), but in defining a probability distribution over L(G), we can choose features of dags however we wish.</Paragraph>
    <Paragraph position="7"> Let us consider an example. Let us continue to assume grammar G2 generating the language in Figure 7, and let us continue to assume the empirical distribution in (1). But now rather than taking rule applications to be features, let us adopt the two features in Figure 10. For purpose of illustration, take feature 1 to have weight fll = v~ and feature 2 to have weight f12 -- 3/2. The functions fl and f2 represent the frequencies of features 1 and 2, respectively, as in Figure 11. We are able to exactly</Paragraph>
    <Paragraph position="9"> The frequencies (number of instances) of features 1 and 2 in dags generated by G2, and the computation of dag weights ~ and dag probabilities q.</Paragraph>
    <Paragraph position="10"> recreate the empirical distribution using fewer features than before. Intuitively, we need only use as many features as are necessary to distinguish among trees that have different empirical probabilities.</Paragraph>
    <Paragraph position="11"> This added flexibility is welcome, but it does make parameter estimation more involved. Now we must not only choose values for weights, we must also choose the features that weights are to be associated with. We would like to do both in a way that permits us to find the best model, in the sense of the model that minimizes the  Feature Selection. Consider every feature that might be added to field Mt and choose the best one.</Paragraph>
    <Paragraph position="12"> Weight Adjustment. Readjust weights for all features. The result is field Mt+l.</Paragraph>
    <Paragraph position="13"> Iterate until the field cannot be improved.</Paragraph>
    <Paragraph position="14"> For the sake of concreteness, let us take features to be labeled subdags. In step 2 of the algorithm we do not consider every conceivable labeled subdag, but only the atomic (i.e., single-node) subdags and those complex subdags that can be constructed by combining features already in the field or by combining a feature in the field with some atomic feature. We also limit our attention to features that actually occur in the training corpus.</Paragraph>
    <Paragraph position="15"> In our running example, the atomic features are as shown in Figure 12. Features can be combined by adding connecting arcs, as shown in Figure 13, for example.  Combining features to create more complex features.</Paragraph>
    <Section position="1" start_page="609" end_page="610" type="sub_section">
      <SectionTitle>
5.1 The Null Field
</SectionTitle>
      <Paragraph position="0"> Field induction begins with the null field. With the corpus we have been assuming, the null field takes the form in Figure 14. No dag x has any features, so C/(x) = I\]i fl~(x) is a</Paragraph>
      <Paragraph position="2"> product of zero terms, and hence has value 1. As a result, q is the uniform distribution.</Paragraph>
      <Paragraph position="3"> The Kullback-Leibler divergence D (/~ llq) is 0.03. The aim of feature selection is to choose a feature that reduces this divergence as much as possible.</Paragraph>
      <Paragraph position="4"> The astute reader will note that there is a problem with the null field if L(G) is infinite. Namely, it is not possible to have a uniform probability mass distribution over an infinite set. If each dag in an infinite set of dags is assigned a constant nonzero probability e, then the total probability is infinite, no matter how small e is. There are a couple of ways of dealing with the problem. The approach that DD&amp;L adopt is to assume a consistent prior distribution p(k) over graph sizes k, and a family of random fields qk representing the conditional probability q(x I k); the probability of a tree is then p(k)q(x I k). All the random fields have the same features and weights, differing only in their normalizing constants.</Paragraph>
      <Paragraph position="5"> I will take a somewhat different approach here. As sketched at the beginning of section 3, we can generate dags from an AV grammar much as proposed by Brew and Eisele. If we ignore failed derivations, the process of dag generation is completely analogous to the process of tree generation from a stochastic CFG--indeed, in the limiting case in which none of the rules contain constraints, the grammar is a CFG.</Paragraph>
      <Paragraph position="6"> To obtain an initial distribution, we associate a weight with each rule, the weights for rules with a common left-hand side summing to one. The probability of a dag is proportional to the product of weights of rules used to generate it. (Renormalization is necessary because of the failed derivations.) We estimate weights using the ERF method: we estimate the weight of a rule as the relative frequency of the rule in the training corpus, among rules with the same left-hand side.</Paragraph>
      <Paragraph position="7"> The resulting initial distribution (the ERF distribution) is not the maximum-likelihood distribution, as we know. But it can be taken as a useful first approximation. Intuitively, we begin with the ERF distribution and construct a random field to take</Paragraph>
    </Section>
    <Section position="2" start_page="610" end_page="610" type="sub_section">
      <SectionTitle>
Abney Stochastic Attribute-Value Grammars
</SectionTitle>
      <Paragraph position="0"> account of context dependencies that the ERF distribution fails to capture, incrementally improving the fit to the empirical distribution.</Paragraph>
      <Paragraph position="1"> In this framework, a model consists of: (1) An AV grammar G whose purpose is to define a set of dags L(G). (2) A set of initial weights 0 attached to the rules  of G. The weight of a dag is the product of weights of rules used in generating it. Discarding failed derivations and renormalizing yields the initial distribution po(x). (3) A set of features fl .... ,fn with weights fll .... , fin to define the field distribution</Paragraph>
      <Paragraph position="3"/>
    </Section>
    <Section position="3" start_page="610" end_page="611" type="sub_section">
      <SectionTitle>
5.2 Feature Selection
</SectionTitle>
      <Paragraph position="0"> At each iteration, we select a new feature f by considering all atomic features, and all complex features that can be constructed from features already in the field. Holding the weights constant for all old features in the field, we choose the best weight fl forf (how fl is chosen will be discussed shortly), yielding a new distribution qfi,/. The score for feature f is the reduction it permits in D(pl\[qold), where qold is the old field. That is, the score for f is D(~llqold ) -- D(~llqfi,f ). We compute the score for each candidate feature and add to the field that feature with the highest score.</Paragraph>
      <Paragraph position="1"> To illustrate, consider the two atomic features a and B. Given the null field as old field, the best weight for a is fl = 7/5, and the best weight for B is fl ~- 1. This yields q and D(/S\[~) as in Figure 15. The better feature is a, and a would be added to the field  Comparing features, qa is the best (minimum-divergence) distribution that can be generated by adding the feature &amp;quot;a&amp;quot; to the field, and qB is the best distribution generable by adding the feature &amp;quot;B'.</Paragraph>
      <Paragraph position="2"> if these were the only two choices.</Paragraph>
      <Paragraph position="3"> Intuitively, a is better than B because a permits us to distinguish the set {xl, X3} from the set {x2, x4}; the empirical probability of the former is 1/3+1/4 -- 7/12 whereas the empirical probability of the latter is 5/12. Distinguishing these sets permits us to model the empirical distribution better (since the old field assigns them equal probability, counter to the empirical distribution). By contrast, the feature B distinguishes the set {xl, x2} from {x3, x4}. The empirical probability of the former is 1/3 + 1/6 = 1/2 and the empirical probability of the latter is also 1/2. The old field models these probabilities exactly correctly, so making the distinction does not permit us to improve on the old field. As a result, the best weight we can choose for B is 1, which is equivalent to not having the feature B at all.</Paragraph>
    </Section>
    <Section position="4" start_page="611" end_page="611" type="sub_section">
      <SectionTitle>
5.3 Selecting the Initial Weight
</SectionTitle>
      <Paragraph position="0"> DD&amp;L show that there is a unique weight fl that maximizes the score for a new feature f (provided that the score for f is not constant for all weights). Writing q~ for the distribution that results from assigning weight fl to feature f, fl is the solution to the equation</Paragraph>
      <Paragraph position="2"> Intuitively, we choose the weight such that the expectation of f under the resulting new field is equal to its empirical expectation.</Paragraph>
      <Paragraph position="3"> Solving equation (2) for fl is easy if L(G) is small enough to enumerate. Then the sum over L(G) that is implicit in qfl \[f\] can be expanded out, and solving for fl is simply a matter of arithmetic. Things are a bit trickier if L(G) is too large to enumerate. DD&amp;L show that we can solve equation (2) if we can estimate qold\[f = k\] for k from 0 to the maximum value off in the training corpus. (See Appendix 1 for details.) We can estimate qold\[f = k\] by means of random sampling. The idea is actually rather simple: to estimate how often the feature appears in &amp;quot;the average dag,&amp;quot; we generate a representative mini-corpus from the distribution qold and count. That is, we generate dags at random in such a way that the relative frequency of dag x is qold(X) (in the limit), and we count how often the feature of interest appears in dags in our generated mini-corpus.</Paragraph>
      <Paragraph position="4"> The application that DD&amp;L consider is the induction of English orthographic constraints, that is, inducing a field that assigns high probability to &amp;quot;English-sounding&amp;quot; words and low probability to non-English-sounding words. For this application, Gibbs sampling is appropriate. Gibbs sampling does not work for the application to AV grammars, however. Fortunately, there is an alternative random sampling method we can use: Metropolis-Hastings sampling. We will discuss the issue in some detail shortly.</Paragraph>
    </Section>
    <Section position="5" start_page="611" end_page="612" type="sub_section">
      <SectionTitle>
5.4 Readjusting Weights
</SectionTitle>
      <Paragraph position="0"> When a new feature is added to the field, the best value for its initial weight is chosen, but the weights for the old features are held constant. In general, however, adding the new feature may make it necessary to readjust weights for all features. The second half of the IIS algorithm involves finding the best weights for a given set of features.</Paragraph>
      <Paragraph position="1"> The method is very similar to the method for selecting the initial weight for a new feature. Let (fl .... , fin) be the old weights for the features. We wish to compute &amp;quot;increments&amp;quot; (61,..., 6,) to determine a new field with weights (61fll,..., 6,ft,). Consider the equation</Paragraph>
      <Paragraph position="3"> where f#(x) = y'~if/(x) is the total number of features of dag x. The reason for the factor 6 f# is a bit involved. Very roughly, we would like to choose weights so that the expectation offi under the new field is equal to/5\[f/\]. Now qnew(X) is:</Paragraph>
      <Paragraph position="5"> where we factor Z as ZaZ~, for Zfl the normalization constant in qold- Hence, qnew \[f/\] = qold\[d-Ji I-Ij6fj;x\] * Now there are two problems with this expression: it requires us to compute Za, which we are not able to do, and it requires us to determine weights</Paragraph>
    </Section>
    <Section position="6" start_page="612" end_page="612" type="sub_section">
      <SectionTitle>
Abney Stochastic Attribute-Value Grammars
</SectionTitle>
      <Paragraph position="0"> ~j for all the features simultaneously, not just the weight ~i for feature i. We might consider approximating qnew\[fi\] by ignoring the normalization factor and assuming that all features have the same weight as feature i. Since \]-Ij 6~ (x) = 6//'(x), we arrive at the expression on the left-hand side of equation (3).</Paragraph>
      <Paragraph position="1"> One might expect the approximation just described to be rather poor, but it is proven in Della Pietra, Della Pietra, and Lafferty (1995) that solving equation (3) for 6i (for each i) and setting the new weight for feature i to ~ifli is guaranteed to improve the model. This is the real justification for equation (3), and the reader is referred to Della Pietra, Della Pietra, and Lafferty (1995) for details.</Paragraph>
      <Paragraph position="2"> Solving (3) yields improved weights, but it does not necessarily immediately yield the globally best weights. We can obtain the globally best weights by iterating. Set fli *- 6ifli, for all i, and solve equation (3) again. Repeat until the weights no longer change.</Paragraph>
      <Paragraph position="3"> As with equation (2), solving equation (3) is straightforward if L(G) is small enough to enumerate, but not if L(G) is large. In that case, we must use random sampling. We generate a representative mini-corpus and estimate expectations by counting in the mini-corpus. (See Appendix 2.)</Paragraph>
    </Section>
    <Section position="7" start_page="612" end_page="613" type="sub_section">
      <SectionTitle>
5.5 Random Sampling
</SectionTitle>
      <Paragraph position="0"> We have seen that random sampling is necessary both to set the initial weight for features under consideration and to adjust all weights after a new feature is adopted.</Paragraph>
      <Paragraph position="1"> Random sampling involves creating a corpus that is representative of a given model distribution q(x). To take a very simple example, a fair coin can be seen as a method for sampling from the distribution q in which q(H) = 1/2, q(T) = 1/2. Saying that a corpus is representative is actually not a comment about the corpus itself but the method by which it was generated: a corpus representative of distribution q is one generated by a process that samples from q. Saying that a process M samples from q is to say that the empirical distributions of corpora generated by M converge to q in the limit. For example, if we flip a fair coin once, the resulting empirical distribution over (H, T) is either (1, 0) or (0,1), not the fair-coin distribution (1/2,1/2). But as we take larger and larger corpora, the resulting empirical distributions converge to (1/2,1/2).</Paragraph>
      <Paragraph position="2"> An advantage of SCFGs that random fields lack is the transparent relationship between an SCFG defining a distribution q and a sampler for q. We can sample from q by performing stochastic derivations: each time we have a choice among rules expanding a category X, we choose rule X --* ~i with probability fli, where fli is the weight of rule X--* G Now we can sample from the initial distribution p0 by performing stochastic derivations. At the beginning of Section 3, we sketched how to generate dags from an AV grammar G via nondeterministic derivations. We defined the initial distribution in terms of weights ~ attached to the rules of G. We can convert the nondeterministic derivations discussed at the beginning of Section 3 into stochastic derivations by choosing rule X --* ~i with probability ~i when expanding a node labeled X. Some derivations fail, but throwing away failed derivations has the effect of renormalizing the weight function, so that we generate a dag x with probability p0 (x), as desired.</Paragraph>
      <Paragraph position="3"> The Metropolis-Hastings algorithm provides us with a means of converting the sampler for the initial distribution po(x) into a sampler for the field distribution q(x). Generally, let p(.) be a distribution for which we have a sampler. We wish to construct a sample xl ..... xN from a different distribution q(.). Assume that items xl .... , x, are already in the sample, and we wish to choose xn+l. The sampler for p(.) proposes a new item y. We do not simply add y to the sample--that would give us a sample  Computational Linguistics Volume 23, Number 4 from p(.)--but rather we make a stochastic decision whether to accept the proposal y or reject it. If we accept y, it is added to the sample (Xn+l = y), and if we reject y, then Xn is repeated (Xn+l = xn).</Paragraph>
      <Paragraph position="4"> The acceptance decision is made as follows: If p(y) &gt; q(y), then y is overrepresented among the proposals. We can quantify the degree of overrepresentation as p(y)/q(y). The idea is to reject y with a probability corresponding to its degree of overrepresentation. However, we do not consider the absolute degree of overrepresentation, but rather the degree of overrepresentation relative to x,. (If y and Xn are equally overrepresented, there is no reason to reject y in favor of xn.) That is, we consider the value p(y)/q(y) _ p(y)q(xn) r= p(x,)/q(xn) p(xn)q(y) If r &lt;_ 1, then y is underrepresented relative to x,, and we accept y with probability one. If r &gt; 1, then we accept y with a probability that diminishes as r increases: specifically, with probability 1/r. In brief, the acceptance probability of y is A(y \] x,) = min(1,1/r). It can be shown that proposing items with probability p(.) and accepting them with probability A(. \] x,) yields a sampler for q(.). (See, for example, Winkler \[1995\]). 2 The acceptance probability A(y \] xn) reduces in our case to a particularly simple form. If r &lt; 1 then A(y \] x) = 1. Otherwise, writing ~b(x) for the &amp;quot;field weight&amp;quot; \[Ii fl~lxl,</Paragraph>
      <Paragraph position="6"/>
    </Section>
  </Section>
  <Section position="8" start_page="613" end_page="613" type="metho">
    <SectionTitle>
6. Final Remarks
</SectionTitle>
    <Paragraph position="0"> In summary, we cannot simply transplant CF methods to the AV grammar case. In particular, the ERF method yields correct weights only for SCFGs, not for AV grammars.</Paragraph>
    <Paragraph position="1"> We can define a probabilistic version of AV grammars with a correct weight-selection method by going to random fields. Feature selection and weight adjustment can be accomplished using the IIS algorithm. In feature selection, we need to use random sampling to find the initial weight for a candidate feature, and in weight adjustment we need to use random sampling to solve the weight equation. The random sampling method that DD&amp;L used is not appropriate for sets of dags, but we can solve that problem by using the Metropolis-Hastings method instead.</Paragraph>
    <Paragraph position="2"> Open questions remain. First, random sampling is notorious for being slow, and it remains to be shown whether the approach proposed here will be practicable. I expect practicability to be quite sensitive to the choice of grammar--the more the grammar's</Paragraph>
  </Section>
  <Section position="9" start_page="613" end_page="614" type="metho">
    <SectionTitle>
2 The Metropolis-Hastings acceptance probability is usually given in the form
</SectionTitle>
    <Paragraph position="0"> ( ly ly, A(y \[ x) = min 1,~r(x)g(x,y)\] in which 7r is the distribution we wish to sample from (q, in our notation) and g(x, y) is the proposal probability: the probability that the input sampler will propose y if the previous configuration was x. The case we consider is a special case in which the proposal probability is independent of x: the proposal probability g(x, y) is, in our notation, p(y).</Paragraph>
    <Paragraph position="1"> The original Metropolis algorithm is also a special case of the Metropolis-Hastings algorithm, in which the proposal probability is symmetric, that is, g(x, y) = g(y, x). The acceptance function then reduces to rain(l, ~r(y)/Tr(x)), which is rain(l, q(y)/q(x)) in our notation. I mention this only to point out that it is a different special case. Our proposal probability is not symmetric, but rather independent of the previous configuration, and though our acceptance function reduces to a form (4) that is similar to the original Metropolis acceptance function, it is not the same: in general, (b(y)/(b(x) =7/= q(y)/q(x).</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML