<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2190">
  <Title>Conditions on Consistency of Probabilistic Tree Adjoining Grammars*</Title>
  <Section position="5" start_page="1164" end_page="1165" type="metho">
    <SectionTitle>
3 Applying probability measures to
Tree Adjoining Languages
</SectionTitle>
    <Paragraph position="0"> To gain some intuition about probability assignments to languages, let us take for example, a language well known to be a tree adjoining language: null</Paragraph>
    <Paragraph position="2"> It seems that we should be able to use a function C/ to assign any probability distribution to the strings in L(G) and then expect that we can assign appropriate probabilites to the adjunctions in G such that the language generated by G has the same distribution as that given by C/. However a function C/ that grows smaller by repeated multiplication as the inverse of an exponential function cannot be matched by any TAG because of the constant growth property of TAGs (see (Vijay-Shanker, 1987), p. 104). An example of such a function C/ is a simple Poisson distribution (2), which in fact was also used as the counterexample in (Booth and Thompson, 1973) for CFGs, since CFGs also have the constant growth property.</Paragraph>
    <Paragraph position="4"> This shows that probabilistic TAGs, like CFGs, are constrained in the probabilistic languages that they can recognize or learn. As shown above, a probabilistic language can fail to have a generating probabilistic TAG.</Paragraph>
    <Paragraph position="5"> The reverse is also true: some probabilistic TAGs, like some CFGs, fail to have a corresponding probabilistic language, i.e. they are not consistent. There are two reasons why a probabilistic TAG could be inconsistent: &amp;quot;dirty&amp;quot; grammars, and destructive or incorrect probability assignments.</Paragraph>
    <Paragraph position="6"> &amp;quot;Dirty&amp;quot; grammars. Usually, when applied to language, TAGs are lexicalized and so probabilities assigned to trees are used only when the words anchoring the trees are used in a derivation. However, if the TAG allows non-lexicalized trees, or more precisely, auxiliary trees with no yield, then looping adjunctions which never generate a string are possible. However, this can be detected and corrected by a simple search over the grammar. Even in lexicalized grammars, there could be some auxiliary trees that are assigned some probability mass but which can never adjoin into another tree.</Paragraph>
    <Paragraph position="7"> Such auxiliary trees are termed unreachable and techniques similar to the ones used in detecting unreachable productions in CFGs can be used here to detect and eliminate such trees.</Paragraph>
    <Paragraph position="8"> Destructive probability assignments.</Paragraph>
    <Paragraph position="9"> This problem is a more serious one, and is the main subject of this paper. Consider the probabilistic TAG shown in (3) 2.</Paragraph>
    <Paragraph position="11"> Consider a derivation in this TAG as a generative process. It proceeds as follows: node $1 in tl is rewritten as t2 with probability 1.0. Node $2 in t2 is 99 times more likely than not to be rewritten as t2 itself, and similarly node $3 is 49 times more likely than not to be rewritten as t2.</Paragraph>
    <Paragraph position="12"> This however, creates two more instances of $2 and $3 with same probabilities. This continues, creating multiple instances of t2 at each level of the derivation process with each instance of t2 creating two more instances of itself. The grammar itself is not malicious; the probability assignments are to blame. It is important to note that inconsistency is a problem even though for any given string there are only a finite number of derivations, all halting. Consider the probability mass function (pmf) over the set of all derivations for this grammar. An inconsistent grammar would have a pmfwhich assigns a large portion of probability mass to derivations that are non-terminating. This means there is a finite probability the generative process can enter a generation sequence which has a finite probability of non-termination.</Paragraph>
  </Section>
  <Section position="6" start_page="1165" end_page="1168" type="metho">
    <SectionTitle>
4 Conditions for Consistency
</SectionTitle>
    <Paragraph position="0"> A probabilistic TAG G is consistent if and only if:</Paragraph>
    <Paragraph position="2"> where Pr(v) is the probability assigned to a string in the language. If a grammar G does not satisfy this condition, G is said to be inconsistent. null To explain the conditions under which a probabilistic TAG is consistent we will use the TAG 2The subscripts are used as a simple notation to uniquely refer to the nodes in each elementary tree. They are not part of the node label for purposes of adjunction.  in (5) as an example.</Paragraph>
    <Paragraph position="4"> From this grammar, we compute a square matrix A4 which of size IVI, where V is the set of nodes in the grammar that can be rewritten by adjunction. Each AzIij contains the expected value of obtaining node Xj when node Xi is rewritten by adjunction at each level of a TAG derivation. We call Ad the stochastic expectation matrix associated with a probabilistic  We then write a matrix N which has \[I U A\[ rows and IV\[ columns. An element Nij is 1.0 if node Xj is a node in tree ti.</Paragraph>
    <Paragraph position="6"> Then the stochastic expectation matrix A4 is simply the product of these two matrices.</Paragraph>
    <Paragraph position="7"> 3Note that P is not a row stochastic matrix. This is an important difference in the construction of .h4 for TAGs when compared to CFGs. We will return to this point in SS5.</Paragraph>
    <Paragraph position="9"> By inspecting the values of A4 in terms of the grammar probabilities indicates that .h4ij contains the values we wanted, i.e. expectation of obtaining node Aj when node Ai is rewritten by adjunction at each level of the TAG derivation process.</Paragraph>
    <Paragraph position="10"> By construction we have ensured that the following theorem from (Booth and Thompson, 1973) applies to probabilistic TAGs. A formal justification for this claim is given in the next section by showing a reduction of the TAG derivation process to a multitype Galton-Watson branching process (Harris, 1963). Theorem 4.1 A probabilistic grammar is consistent if the spectral radius p(A4) &lt; 1, where ,h,4 is the stochastic expectation matrix computed from the grammar. (Booth and Thompson, 1973; Soule, 1974) This theorem provides a way to determine whether a grammar is consistent. All we need to do is compute the spectral radius of the square matrix A4 which is equal to the modulus of the largest eigenvalue of . If this value is less than one then the grammar is consistent 4. Computing consistency can bypass the computation of the eigenvalues for A4 by using the following theorem by Ger~gorin (see (Horn and Johnson, 1985; Wetherell, 1980)).</Paragraph>
    <Paragraph position="11"> Theorem 4.2 For any square matrix .h4, p(.M) &lt; 1 if and only if there is an n &gt; 1 such that the sum of the absolute values of the elements of each row of .M n is less than one. Moreover, any n' &gt; n also has this property. (GerSgorin, see (Horn and Johnson, 1985; Wetherell, 1980)) 4The grammar may be consistent when the spectral radius is exactly one, but this case involves many special considerations and is not considered in this paper. In practice, these complicated tests are probably not worth the effort. See (Harris, 1963) for details on how this special case can be solved.</Paragraph>
    <Paragraph position="12">  This makes for a very simple algorithm to check consistency of a grammar. We sum the values of the elements of each row of the stochastic expectation matrix A4 computed from the grammar. If any of the row sums are greater than one then we compute A42, repeat the test and compute :~422 if the test fails, and so on until the test succeeds 5. The algorithm does not halt ifp(A4) _&gt; 1. In practice, such an algorithm works better in the average case since computation of eigenvalues is more expensive for very large matrices. An upper bound can be set on the number of iterations in this algorithm. Once the bound is passed, the exact eigenvalues can be computed.</Paragraph>
    <Paragraph position="13"> For the grammar in (5) we computed the following stochastic expectation matrix:  each row must be less than one, we compute the power matrix ,~v/2. However, the sum of one of the rows is still greater than 1. Continuing we  This time all the row sums are less than one, hence p(,~4) &lt; 1. So we can say that the grammar defined in (5) is consistent. We can confirm this by computing the eigenvalues for A4 which are 0, 0, 0.6, 0 and 0.1, all less than 1.</Paragraph>
    <Paragraph position="14"> Now consider the grammar (3) we had considered in Section 3. The value of .PS4 for that grammar is computed to be:</Paragraph>
    <Paragraph position="16"> SWe compute A422 and subsequently only successive powers of 2 because Theorem 4.2 holds for any n' &gt; n.</Paragraph>
    <Paragraph position="17"> This permits us to use a single matrix at each step in the algorithm.</Paragraph>
    <Paragraph position="18"> The eigenvalues for the expectation matrix M computed for the grammar (3) are 0, 1.97  To show that Theorem 4.1 in Section 4 holds for any probabilistic TAG, it is sufficient to show that the derivation process in TAGs is a Galton-Watson branching process.</Paragraph>
    <Paragraph position="19"> A Galton-Watson branching process (Harris, 1963) is simply a model of processes that have objects that can produce additional objects of the same kind, i.e. recursive processes, with certain properties. There is an initial set of objects in the 0-th generation which produces with some probability a first generation which in turn with some probability generates a second, and so on. We will denote by vectors Z0, Z1, Z2,... the 0-th, first, second, ... generations. There are two assumptions made about Z0, Z1, Z2,...: . The size of the n-th generation does not influence the probability with which any of the objects in the (n + 1)-th generation is produced. In other words, Z0, Z1,Z2,...</Paragraph>
    <Paragraph position="20"> form a Markov chain.</Paragraph>
    <Paragraph position="21"> . The number of objects born to a parent object does not depend on how many other objects are present at the same level.</Paragraph>
    <Paragraph position="22"> We can associate a generating function for each level Zi. The value for the vector Zn is the value assigned by the n-th iterate of this generating function. The expectation matrix A4 is defined using this generating function.</Paragraph>
    <Paragraph position="23"> The theorem attributed to Galton and Watson specifies the conditions for the probability of extinction of a family starting from its 0-th generation, assuming the branching process represents a family tree (i.e, respecting the conditions outlined above). The theorem states that p(.~4) &lt; 1 when the probability of extinction is  level 0 level 1 level 2 level 3 level 4 (6) .s (~) The assumptions made about the generating process intuitively holds for probabilistic TAGs. (6), for example, depicts a derivation of the string a2a2a2a2a3a3al by a sequence of adjunctions in the grammar given in (5) 6. The parse tree derived from such a sequence is shown in Fig. 7. In the derivation tree (6), nodes in the trees at each level i axe rewritten by adjunction to produce a level i + 1. There is a final level 4 in (6) since we also consider the probability that a node is not rewritten further, i.e. Pr(A ~-~ nil) for each node A.</Paragraph>
    <Paragraph position="24"> We give a precise statement of a TAG derivation process by defining a generating function for the levels in a derivation tree. Each level i in the TAG derivation tree then corresponds to Zi in the Maxkov chain of branching pro- null are node addresses where each tree has adjoined into its parent. Recall the definition of node addresses in Section 2.</Paragraph>
    <Paragraph position="25"> cesses. This is sufficient to justify the use of Theorem 4.1 in Section 4. The conditions on the probability of extinction then relates to the probability that TAG derivations for a probabilistic TAG will not recurse infinitely. Hence the probability of extinction is the same as the probability that a probabilistic TAG is consistent. null For each Xj E V, where V is the set of nodes in the grammar where adjunction can occur, we define the k-argument adjunction generating \]unction over variables si,..., Sk corresponding to the k nodes in V.</Paragraph>
    <Paragraph position="27"> where, rj (t) = 1 iff node Xj is in tree t, rj (t) = 0 otherwise.</Paragraph>
    <Paragraph position="28"> For example, for the grammar in (5) we get the following adjunction generating functions taking the variable sl, s2, 83, 84, 85 to represent the nodes A1, A2, B1, A3, B2 respectively.</Paragraph>
    <Paragraph position="30"> The n-th level generating function Gn(sl,...,sk) is defined recursively as follows. null</Paragraph>
    <Paragraph position="32"> For the grammar in (5) we get the following level generating functions.</Paragraph>
    <Paragraph position="34"> Examining this example, we can express Gi(s1,...,Sk) as a sum Di(sl,...,Sk) + Ci, where Ci is a constant and Di(.) is a polynomial with no constant terms. A probabilistic TAG will be consistent if these recursive equations terminate, i.e. iff limi+ooDi(sl, . . . , 8k) --+ 0 We can rewrite the level generation functions in terms of the stochastic expectation matrix Ad, where each element mi, j of .A4 is computed as follows (cf. (Booth and Thompson, 1973)).</Paragraph>
    <Paragraph position="36"> The limit condition above translates to the condition that the spectral radius of 34 must be less than 1 for the grammar to be consistent.</Paragraph>
    <Paragraph position="37"> This shows that Theorem 4.1 used in Section 4 to give an algorithm to detect inconsistency in a probabilistic holds for any given TAG, hence demonstrating the correctness of the algorithm. null Note that the formulation of the adjunction generating function means that the values for C/(X ~4 nil) for all X E V do not appear in the expectation matrix. This is a crucial difference between the test for consistency in TAGs as compared to CFGs. For CFGs, the expectation matrix for a grammar G can be interpreted as the contribution of each non-terminal to the derivations for a sample set of strings drawn from L(G). Using this it was shown in (Chaudhari et al., 1983) and (SPSnchez and Bened~, 1997) that a single step of the inside-outside algorithm implies consistency for a probabilistic CFG. However, in the TAG case, the inclusion of values for C/(X ~-+ nil) (which is essentim if we are to interpret the expectation matrix in terms of derivations over a sample set of strings) means that we cannot use the method used in (8) to compute the expectation matrix and furthermore the limit condition will not be convergent.</Paragraph>
  </Section>
class="xml-element"></Paper>