<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1036">
  <Title>Dynamic programming for parsing and estimation of stochastic unification-based grammars</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Maxwell and Kaplan packed representations
</SectionTitle>
    <Paragraph position="0"> representations This section characterises the properties of uni cation grammars and the Maxwell and Kaplan packed parse representations that will be important for what follows. This characterisation omits many details about uni cation grammars and the algorithm by which the packed representations are actually constructed; see Maxwell III and Kaplan (1995) for details. null A parse generated by a uni cation grammar is a nite subset of a set F of features. Features are parse fragments, e.g., chart edges or arcs from attribute-value structures, out of which the packed representations are constructed. For this paper it does not matter exactly what features are, but they are intended to be the atomic entities manipulated by a dynamic programming parsing algorithm. A grammar de nes a set of well-formed or grammatical parses. Each parse ! 2 is associated with a string of words Y (!) called its yield. Note that except for trivial grammars F and are in nite.</Paragraph>
    <Paragraph position="1"> If y is a string, then let (y) = f! 2 jY (!) = yg and F(y) = S!2 (y)ff 2 !g. That is, (y) is the set of parses of a string y and F(y) is the set of features appearing in the parses of y. In the grammars of interest here (y) and hence also F(y) are nite.</Paragraph>
    <Paragraph position="2"> Maxwell and Kaplan's packed representations often provide a more compact representation of the set of parses of a sentence than would be obtained by merely listing each parse separately. The intuition behind these packed representations is that for most strings y, many of the features in F(y) occur in many of the parses (y). This is often the case in natural language, since the same substructure can appear as a component of many different parses.</Paragraph>
    <Paragraph position="3"> Packed feature representations are de ned in terms of conditions on the values assigned to a vector of variables X. These variables have no direct linguistic interpretation; rather, each different assignment of values to these variables identi es a set of features which constitutes one of the parses in the packed representation. A condition a on X is a function from X to f0;1g. While for uniformity we write conditions as functions on the entire vector X, in practice Maxwell and Kaplan's approach produces conditions whose value depends only on a few of the variables in X, and the ef ciency of the algorithms described here depends on this.</Paragraph>
    <Paragraph position="4"> A packed representation of a nite set of parses is a quadruple R = (F0;X;N; ), where: F0 F(y) is a nite set of features, X is a nite vector of variables, where each variable X' ranges over the nite set X', N is a nite set of conditions on X called the no-goods,2 and is a function that maps each feature f 2 F0 to a condition f on X.</Paragraph>
    <Paragraph position="5"> A vector of values x satis es the no-goods N iff N(x) = 1, where N(x) = Q 2N (x). Each x that satis es the no-goods identi es a parse !(x) = ff 2 F0j f(x) = 1g, i.e., ! is the set of features whose conditions are satis ed by x. We require that each parse be identi ed by a unique value satisfying 2The name no-good comes from the TMS literature, and was used by Maxwell and Kaplan. However, here the no-goods actually identify the good variable assignments.</Paragraph>
    <Paragraph position="6"> the no-goods. That is, we require that:</Paragraph>
    <Paragraph position="8"> Finally, a packed representation R represents the set of parses (R) that are identi ed by values that satisfy the no-goods, i.e., (R) = f!(x)jx 2</Paragraph>
    <Paragraph position="10"> Maxwell III and Kaplan (1995) describes a parsing algorithm for uni cation-based grammars that takes as input a string y and returns a packed representation R such that (R) = (y), i.e., R represents the set of parses of the string y. The SUBG parsing and estimation algorithms described in this paper use Maxwell and Kaplan's parsing algorithm as a subroutine.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Stochastic Unification-Based Grammars
</SectionTitle>
    <Paragraph position="0"> This section reviews the probabilistic framework used in SUBGs, and describes the statistics that must be calculated in order to estimate the parameters of a SUBG from parsed training data.</Paragraph>
    <Paragraph position="1"> For a more detailed exposition and descriptions of regularization and other important details, see Johnson et al. (1999).</Paragraph>
    <Paragraph position="2"> The probability distribution over parses is de ned in terms of a nite vector g = (g1;::: ;gm) of properties. A property is a real-valued function of parses . Johnson et al. (1999) placed no restrictions on what functions could be properties, permitting properties to encode arbitrary global information about a parse. However, the dynamic programming algorithms presented here require the information encoded in properties to be local with respect to the features F used in the packed parse representation. Speci cally, we require that properties be dened on features rather than parses, i.e., each feature f 2 F is associated with a nite vector of real values (g1(f);::: ;gm(f)) which de ne the property functions for parses as follows:</Paragraph>
    <Paragraph position="4"> That is, the property values of a parse are the sum of the property values of its features. In the usual case, some features will be associated with a single property (i.e., gk(f) is equal to 1 for a speci c value of k and 0 otherwise), and other features will be associated with no properties at all (i.e., g(f) = 0).</Paragraph>
    <Paragraph position="5"> This requires properties be very local with respect to features, which means that we give up the ability to de ne properties arbitrarily. Note however that we can still encode essentially arbitrary linguistic information in properties by adding specialised features to the underlying uni cation grammar. For example, suppose we want a property that indicates whether the parse contains a reduced relative clauses headed by a past participle (such garden path constructions are grammatical but often almost incomprehensible, and alternative parses not including such constructions would probably be preferred). Under the current de nition of properties, we can introduce such a property by modifying the underlying uni cation grammar to produce a certain diacritic feature in a parse just in case the parse actually contains the appropriate reduced relative construction. Thus, while properties are required to be local relative to features, we can use the ability of the underlying uni cation grammar to encode essentially arbitrary non-local information in features to introduce properties that also encode non-local information. null A Stochastic Uni cation-Based Grammar is a triple (U;g; ), where U is a uni cation grammar that de nes a set of parses as described above, g = (g1;::: ;gm) is a vector of property functions as just described, and = ( 1;::: ; m) is a vector of non-negative real-valued parameters called property weights. The probability P (!) of a parse ! 2 is:</Paragraph>
    <Paragraph position="7"> Intuitively, if gj(!) is the number of times that prop-erty j occurs in ! then j is the 'weight' or 'cost' of each occurrence of property j and Z is a normalising constant that ensures that the probability of all parses sums to 1.</Paragraph>
    <Paragraph position="8"> Now we discuss the calculation of several important quantities for SUBGs. In each case we show that the quantity can be expressed as the value that maximises a product of functions or else as the sum of a product of functions, each of which depends on a small subset of the variables X. These are the kinds of quantities for which dynamic programming graphical model algorithms have been developed.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 The most probable parse
</SectionTitle>
      <Paragraph position="0"> In parsing applications it is important to be able to extract the most probable (or MAP) parse ^!(y) of string y with respect to a SUBG. This parse is:</Paragraph>
      <Paragraph position="2"> Given a packed representation (F0;X;N; ) for the parses (y), let ^x(y) be the x that identi es ^!(y).</Paragraph>
      <Paragraph position="3"> Since W (^!(y)) &gt; 0, it can be shown that:</Paragraph>
      <Paragraph position="5"> pends on exactly the same variables in X as f does.</Paragraph>
      <Paragraph position="6"> As (3) makes clear, nding ^x(y) involves maximising a product of functions where each function depends on a subset of the variables X. As explained below, this is exactly the kind of maximisation that can be solved using graphical model techniques.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Conditional likelihood
</SectionTitle>
      <Paragraph position="0"> We now turn to the estimation of the property weights from a training corpus of parsed data D = (!1;::: ;!n). As explained in Johnson et al. (1999), one way to do this is to nd the that maximises the conditional likelihood of the training corpus parses given their yields. (Johnson et al. actually maximise conditional likelihood regularized with a Gaussian prior, but for simplicity we ignore this here). If yi is the yield of the parse !i, the conditional likelihood of the parses given their yields is:</Paragraph>
      <Paragraph position="2"> Then the maximum conditional likelihood estimate</Paragraph>
      <Paragraph position="4"> problems, but since (yi) (the set of parses for yi) can be large, calculating Z ( (yi)) by enumerating each ! 2 (yi) can be computationally expensive.</Paragraph>
      <Paragraph position="5"> However, there is an alternative method for calculating Z ( (yi)) that does not involve this enumeration. As noted above, for each yield yi;i = 1;::: ;n, Maxwell's parsing algorithm returns a packed feature structure Ri that represents the parses of yi, i.e.,  (yi) = (Ri). A derivation parallel to the one for (3) shows that for R = (F0;X;N; ):</Paragraph>
      <Paragraph position="7"> (This derivation relies on the isomorphism between parses and variable assignments in (1)). It turns out that this type of sum can also be calculated using graphical model techniques.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Conditional Expectations
</SectionTitle>
      <Paragraph position="0"> In general, iterative numerical procedures are required to nd the property weights that maximise the conditional likelihood LD( ). While there are a number of different techniques that can be used, all of the ef cient techniques require the calculation of conditional expectations E [gkjyi] for each prop-erty gk and each sentence yi in the training corpus, where:</Paragraph>
      <Paragraph position="2"> For example, the Conjugate Gradient algorithm, which was used by Johnson et al., requires the calculation not just of LD( ) but also its derivatives</Paragraph>
      <Paragraph position="4"> We have just described the calculation of LD( ), so if we can calculate E [gkjyi] then we can calculate the partial derivatives required by the Conjugate Gradient algorithm as well.</Paragraph>
      <Paragraph position="5"> Again, let R = (F0;X;N; ) be a packed representation such that (R) = (yi). First, note that (2) implies that:</Paragraph>
      <Paragraph position="7"> Note that P(f! : f 2 !gjyi) involves the sum of weights over all x 2 X subject to the conditions that N(x) = 1 and f(x) = 1. Thus P(f! : f 2 !gjyi) can also be expressed in a form that is easy to evaluate using graphical techniques.</Paragraph>
      <Paragraph position="9"/>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Graphical model calculations
</SectionTitle>
    <Paragraph position="0"> In this section we brie y review graphical model algorithms for maximising and summing products of functions of the kind presented above. It turns out that the algorithm for maximisation is a generalisation of the Viterbi algorithm for HMMs, and the algorithm for computing the summation in (5) is a generalisation of the forward-backward algorithm for HMMs (Smyth et al., 1997). Viewed abstractly, these algorithms simplify these expressions by moving common factors over the max or sum operators respectively. These techniques are now relatively standard; the most well-known approach involves junction trees (Pearl, 1988; Cowell, 1999). We adopt the approach approach described by Geman and Kochanek (2000), which is a straightforward generalization of HMM dynamic programming with minimal assumptions and programming overhead. However, in principle any of the graphical model computational algorithms can be used.</Paragraph>
    <Paragraph position="1"> The quantities (3), (4) and (5) involve maximisation or summation over a product of functions, each of which depends only on the values of a subset of the variables X. There are dynamic programming algorithms for calculating all of these quantities, but for reasons of space we only describe an algorithm for nding the maximum value of a product of functions. These graph algorithms are rather involved. It may be easier to follow if one reads Example 1 before or in parallel with the de nitions below.</Paragraph>
    <Paragraph position="2"> To explain the algorithm we use the following notation. If x and x0 are both vectors of length m then x =j x0 iff x and x0 disagree on at most their jth components, i.e., xk = x0k for k = 1;::: ;j 1;j + 1;::: m. If f is a function whose domain is X, we say that f depends on the set of variables d(f) = fXjj9x;x0 2 X;x =j x0;f(x) 6= f(x0)g.</Paragraph>
    <Paragraph position="3"> That is, Xj 2 d(f) iff changing the value of Xj can change the value of f.</Paragraph>
    <Paragraph position="4"> The algorithm relies on the fact that the variables in X = (X1;::: ;Xn) are ordered (e.g., X1 precedes X2, etc.), and while the algorithm is correct for any variable ordering, its ef ciency may vary dramatically depending on the ordering as described below. Let H be any set of functions whose domains are X. We partition H into disjoint subsets H1;::: ;Hn+1, where Hj is the subset of H that depend on Xj but do not depend on any variables ordered before Xj, and Hn+1 is the subset of H that do not depend on any variables at all (i.e., they are constants).3 That is, Hj = fH 2 HjXj 2 d(H);8i &lt; j Xi 62 d(H)g and Hn+1 = fH 2 Hjd(H) = ;g.</Paragraph>
    <Paragraph position="5"> As explained in section 3.1, there is a set of functions A such that the quantities we need to calculate have the general form:</Paragraph>
    <Paragraph position="7"> Mmax is the maximum value of the product expression while ^x is the value of the variables at which the maximum occurs. In a SUBG parsing application ^x identi es the MAP parse.</Paragraph>
    <Paragraph position="8"> 3Strictly speaking this does not necessarily de ne a partition, as some of the subsets Hj may be empty.</Paragraph>
    <Paragraph position="9"> The procedure depends on two sequences of functions Mi;i = 1;::: ;n + 1 and Vi;i = 1;::: ;n.</Paragraph>
    <Paragraph position="10"> Informally, Mi is the maximum value attained by the subset of the functions A that depend on one of the variables X1;::: ;Xi, and Vi gives information about the value of Xi at which this maximum is attained. null To simplify notation we write these functions as functions of the entire set of variables X, but usually depend on a much smaller set of variables. The Mi are real valued, while each Vi ranges over Xi.</Paragraph>
    <Paragraph position="11"> Let M = fM1;::: ;Mng. Recall that the sets of functions A and M can be both be partitioned into disjoint subsets A1;::: ;An+1 and M1;::: ;Mn+1 respectively on the basis of the variables each Ai and Mi depend on. The de nition of the Mi and  The de nition of Mi in (8) may look circular (since M appears in the right-hand side), but in fact it is not. First, note that Mi depends only on variables ordered after Xi, so if Mj 2 Mi then j &lt; i. More  Thus we can compute the Mi in the order M1;::: ;Mn+1, inserting Mi into the appropriate set Mk, where k &gt; i, when Mi is computed.</Paragraph>
    <Paragraph position="12"> We claim that Mmax = Mn+1. (Note that Mn+1 and Mn are constants, since there are no variables ordered after Xn). To see this, consider the tree T whose nodes are the Mi, and which has a directed edge from Mi to Mj iff Mi 2 Mj (i.e., Mi appears in the right hand side of the de nition (8) of Mj). T has a unique root Mn+1, so there is a path from every Mi to Mn+1. Let i j iff there is a path from Mi to Mj in this tree. Then a simple induction shows that Mj is a function from d(Mj) to a maximisation over each of the variables Xi where i j of Qi j;A2Ai A.</Paragraph>
    <Paragraph position="13"> Further, it is straightforward to show that Vi(^x) = ^xi (the value ^x assigns to Xi). By the same arguments as above, d(Vi) only contains variables ordered after Xi, so Vn = ^xn. Thus we can evaluate the Vi in the order Vn;::: ;V1 to nd the maximising assignment ^x.</Paragraph>
    <Paragraph position="14"> Example 1 Let X = f X1; X2; X3; X4; X5; X6; X7g and set A = fa(X1;X3); b(X2;X4); c(X3;X4;X5); d(X4;X5); e(X6;X7)g. We can represent the sharing of variables in Aby means of a undirected graph GA, where the nodes of GA are the variables X and there is an edge in GA connecting Xi to Xj iff 9A 2 A such that both Xi;Xj 2 d(A).</Paragraph>
    <Paragraph position="15"> GA is depicted below.</Paragraph>
    <Paragraph position="17"> Starting with the variable X1, we compute M1 and V1:</Paragraph>
    <Paragraph position="19"> We now proceed to the variable X2.</Paragraph>
    <Paragraph position="21"> Since M1 belongs to M3, it appears in the de nition of M3.</Paragraph>
    <Paragraph position="23"> Note that M5 is a constant, re ecting the fact that in GA the node X5 is not connected to any node ordered after it.</Paragraph>
    <Paragraph position="25"> The second component is de ned in the same way:</Paragraph>
    <Paragraph position="27"> The maximum value for the product M8 = Mmax is de ned in terms of M5 and M7.</Paragraph>
    <Paragraph position="29"> Finally, we evaluate V7;::: ;V1 to nd the maximising assignment ^x.</Paragraph>
    <Paragraph position="31"> We now brie y consider the computational complexity of this process. Clearly, the number of steps required to compute each Mi is a polynomial of order jd(Mi)j+1, since we need to enumerate all possible values for the argument variables d(Mi) and for each of these, maximise over the set Xi. Further, it is easy to show that in terms of the graph GA, d(Mj) consists of those variables Xk;k &gt; j reachable by a path starting at Xj and all of whose nodes except the last are variables that precede Xj.</Paragraph>
    <Paragraph position="32"> Since computational effort is bounded above by a polynomial of order jd(Mi)j+ 1, we seek a variable ordering that bounds the maximum value of jd(Mi)j.</Paragraph>
    <Paragraph position="33"> Unfortunately, nding the ordering that minimises the maximum value of jd(Mi)j is an NP-complete problem. However, there are several ef cient heuristics that are reputed in graphical models community to produce good visitation schedules. It may be that they will perform well in the SUBG parsing applications as well.</Paragraph>
  </Section>
class="xml-element"></Paper>