<?xml version="1.0" standalone="yes"?>
<Paper uid="E95-1012">
  <Title>Stochastic HPSG</Title>
  <Section position="4" start_page="0" end_page="83" type="metho">
    <SectionTitle>
2 Probabilistic interpretation of PCFGs
</SectionTitle>
    <Paragraph position="0"> PCFGs We review the standard probabilistic interpretation of PCFGs 1 A PCFG is a four-tuple &lt; W,N, N1,R &gt; , where W is a Set of terminal symbols {wl,..., w~}, N is a set of non-terminal symbols {N1,...,N~}, N1 is the starting symbol and R is a set of rules of the form N ~ ~ (J, where (J is a string of terminals and non-terminals. Each rule has a probability P(N i --~ ~J) and the probabilities for all the rules that expand a given non-terminal must sum to one. We associate probabilities with partial phrase markers, which are sets of terminal and non-terminal nodes generated by beginning from the starting node successively expanding non-terminal leaves of the partial tree. Phrase markers are those partial phrase markers which have no non-terminal leaves. Probabilities are assigned by the following inductive definition:</Paragraph>
    <Paragraph position="2"> partial phrase marker which differs from it only in that a single non-terminal node N k in T has been expanded to ~'~ in T ', then</Paragraph>
    <Paragraph position="4"> In this definition R acts as a specification of the accessibility relationships which can hold between nodes of the trees admitted by the grammar. The rule probabilities specify the cost of 1 Our description is closely based on that given by Charniak(Charniak, 1993, p. 52 if)  making particular choices about the way in which the rules develop. It is going to turn out that an exactly analogous system of accessibility relations is present in the probabilistic type hierarchies which we define later.</Paragraph>
    <Paragraph position="5"> Limitations of PCFGs The definition of PCFGs implies that the probability of a phrase marker depends only on the choice of rules used in expanding non-terminal nodes. In particular, the probability does not depend on the order in which the rules are applied. This has the arguably unwelcome consequence that PCFGs are unable to make certain discriminations between trees which differ only in their configuration 2. The models developed in this paper build in similar independence assumptions. A large part of the art of probabilistic language modelling resides in the management of the trade-off between descriptive power (which has the merit of allowing us to make the discriminations which we want) and independence assumptions (which have the merit of making training practical by allowing us to treat similar situations as equivalent).</Paragraph>
    <Paragraph position="6"> The crucial advantage of PCFGs over CFGs is that they can be trained and/or learned from corpora. Readers for whom this fact is unfamiliar are referred to Charniak's textbook (Charniak, 1993, Chapter 7). We do not have space to recapitulate the discussion of training which can be found there. We do however illustrate the outcome of training.</Paragraph>
    <Section position="1" start_page="83" end_page="83" type="sub_section">
      <SectionTitle>
2.1 Applying a PCFG to a simple corpus
</SectionTitle>
      <Paragraph position="0"> Consider the simple grammar in figure 1 and its training against the corpus in figure 2. Since there are 3 plural sentences and only 2 singular sentences, the optimal set of parameters will reflect the distribution found in the corpus, as shown in figure 3 One might have hoped that the ratio P(np-sing\[np)/P(np-pl\[np) would be 2/3, but it is instead V/-~. This is a consequence of the assumption of independence. Effectively the algorithm is ascribing the difference in distribution of singular and plural sentences to the joint effect of two independent decisions. What we would really like it to do is to recognize that the two apparently independent decisions are (in effect) one and the same. Also, because the grammar has no means of enforcing number agreement, the system systematically prefers plurals to singulars, even when doing this will lead to agreement clashes. Thus &amp;quot;buses stop&amp;quot; has estimated 0.55 x 0.55 = 0.3025, &amp;quot;bus stop&amp;quot; and &amp;quot;buses stops&amp;quot; both have probability 0.55 x 0.45 = 0.2475 and &amp;quot;bus stops&amp;quot; has probability 0.45 x 0.45 = 0.2025. This behaviour is clearly unmotivated by the corpus, and arises ~The most obvious case is prepositional-phrase attachment.</Paragraph>
      <Paragraph position="1"> purely because of the inadequacy of the probabilistic model.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="83" end_page="83" type="metho">
    <SectionTitle>
3 Probabilistic type hierarchies
</SectionTitle>
    <Paragraph position="0"> ALE signatures Carpenter's ALE (Carpenter, 1993) allows the user to define the type hierarchy of a grammar by writing a collection of clauses which together denote an inheritance hierarchy, a set of features and a set of appropriateness conditions. An example of such a hierarchy is given in ALE syntax in figure 4.</Paragraph>
    <Paragraph position="1"> What the ALE signature tells us The inheritance information tells us that a sign is a forced choice between a sentence and a phrase, that a phrase is a forced choice between a noun-phrase (np) and a verb-phrase (vp) and that number values (num) are partitioned into singular (sing) and plural (pl). The features which are defined are left,right, and nura, and the appropriateness information says that the feature num introduces a new instance of the type num on all phrases, and that left and right introduce np and vp respectively on sentences.</Paragraph>
    <Paragraph position="2"> The parallel with PCFGs The parallel which makes it possible to apply the PCFG training scheme almost unchanged is that the sub-types of a given super-type partition the feature structures of that type in just the same way that the different rules which expand a given non-terminal N of the PCFG partition the space of trees whose top-most node is N. Equally, the features defined in the hierarchy act as an accessibility relation between nodes in a way which is for our purposes entirely equivalent to the way in which the right hand sides of the rules introduce new nodes into partial phrase markers 3. The hierarchy in figure 4 is related to but not isomorphic with the grammar in figure 1.</Paragraph>
    <Paragraph position="3"> One difference is that num is explicitly introduced as a feature in the hierarchy, where at is only implicitly present in the original grammar.</Paragraph>
    <Paragraph position="4"> The other difference is the use of left and right as models of the dominance relationships between nodes.</Paragraph>
  </Section>
  <Section position="6" start_page="83" end_page="87" type="metho">
    <SectionTitle>
4 A probabilistic interpretation of typed feature-structures
</SectionTitle>
    <Paragraph position="0"> typed feature-structures For our purposes, a probabilistic type hierarchy (PTH) is a four-tuple &lt; MT, NT, NT1, I &gt; where MT is a set of maximal types 4 {t 1 .... ,to~},</Paragraph>
    <Paragraph position="2"> NT1 is the starting symbol and I is a set of introduction relationships of the form (T ~ ~ TJ) ~k, where ~J is a multiset of maximal and non-maximal types. Each introduction relationship has a probability P((T i ~ TJ) --+ ~k) and the probabilities for all the introduction relationships that apply to a given non-maximal type must sum to one.</Paragraph>
    <Paragraph position="3"> As things stand this definition is nearly isomorphic to that given for PCFGs, with the major differences being two changes which move us from rules to introduction relationships. Firstly, we relax the stipulation that the items on the right hand side of the rules are strings, allowing them instead to be multisets. Secondly, we introduce an additional term in the head of introduction rules to signal the fact that when we apply a particular introduction relationship to a node we also specialize the type of the node by picking exactly one of the direct subtypes of its current type. Finally, we need to deal with the case where TJ is non-maximal. This is simply achieved by defining the iterated introduction relationships from T i as being those corresponding to the chains of introduction relationships from T i which refine the type to a maximal type. In the probabilistic type hierarchy, it is the iterated introduction relationships which correspond to the context-free rewrite rules of a PCFG. A useful side-effect of this is that we can preserve the invariant that all types except those at the fringe of the structure are maximal.</Paragraph>
    <Paragraph position="4"> The hierarchy whose ALE syntax is given in figure 4 is captured in the new notation by figure 5 We associate probabilities with feature structures, which are sets of maximal and non-maximal nodes generated by beginning from the starting node and successively expanding non-maximal leaves of the partial tree. Maximally specified lealure slruclures are those feature structures which have only maximal leaves. Probabilities are assigned by the following inductive definition:</Paragraph>
    <Paragraph position="6"> feature structure which differs from it only in that a single non-maximal node NT k of type To k in F has been refined to type T1 k expanded to ~'~ in F', then P(F') = P(F) x</Paragraph>
    <Paragraph position="8"> Modulo notation, this definition is identical to the one given earlier for PCFGs. Given the correspondence between the definitions of a PTH and a PCFG it should be apparent that the training methods which apply to one can equally be used with the other. We will shortly provide an example. Because we have not yet treated the crucial matter of re-entrancy, it would be inappropriate to call what we so far have stochastic HPSG, so we refer to it as stochastic HPSG-.</Paragraph>
    <Paragraph position="9"> mum amounts of information possible.</Paragraph>
    <Section position="1" start_page="85" end_page="87" type="sub_section">
      <SectionTitle>
4.1 Using stochastic HPSG- with the
corpus
</SectionTitle>
      <Paragraph position="0"> Using the hierarchy in figure 4 the analyses of the five sentences from figure 2 are as in figure 6.</Paragraph>
      <Paragraph position="1"> Training is a matter of counting the transitions which are found the observed results, then using counts to refine initial estimates of the probabilities of particular transitions. This is entirely analogous to what went on with PCFGs. The results of training are essentially identical to those given earlier, with the optimal assignment being as shown in figure 7. At this point we have provided a system which allows us to use feature structures instead of PCFGs, but we have not yet dealt with the question of re-entrancy, which forms a crucial part of the expressive power of typed feature structures. We will return to this shortly, but first we consider the detailed implications of what we have done so far. The similarities between these results and those in figure 3 * We still model the distribution observed in the corpus by assuming two independent decisions. null * We still get a strange ranking of the parses, which favours number disagreement,in spite of the fact that the grammar which generated the corpus enforces number agreement.</Paragraph>
      <Paragraph position="2"> The differences between these results and the earlier ones are: * The hierarchy uses bot rather than s as its start symbol. The probabilities tell us that the corpus contains no free-standing structures of type num.</Paragraph>
      <Paragraph position="3"> * The zero probability of sign ~ phrase codifies a similar observation that there are no free-standing structures with type phrase.</Paragraph>
      <Paragraph position="4"> * Since items of type phrase are never introduced at that type, but only in the form of sub-types, there are no transitions from phrase in the corpus. Therefore the initial estimates of the probabilities of such transitions are unaffected by training.</Paragraph>
      <Paragraph position="5"> * In the PCFG the symmetry between the expansions of np and vp to singular and plural variants is implicit, whereas in the PTH the distribution of singular and plural variants is encoded at a single location, namely that at which num is refined.</Paragraph>
      <Paragraph position="6"> The independence assumption which is built into the training algorithm is that types are to be refined according to the same probability distribution irrespective of the context in which they are expanded. We have already seen a consequence of this: the PTH lumps together all occasions where num is expanded, irrespective of whether the enclosing context is np or vp. For the moment we are prepared to tolerate this because:</Paragraph>
      <Paragraph position="8"> {sentence, np, vp, sing, pl} {bot, sign, phrase, num}</Paragraph>
      <Paragraph position="10"> * Clarity: The decisions which we have made lead to a system with a clear probabilistic semantics. null * Trainability: the number of parameters  which must be estimated for a grammar is a linear function of the size of the type hierarchy null * Easy extensibility: There is a clear route to a more finely grained account if we allow the expansion probabilities to be conditioned on surrounding context. This would increase the number of parameters to be estimated, which may or may not prove to be a problem.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="87" end_page="88" type="metho">
    <SectionTitle>
5 Adding re-entrancies
</SectionTitle>
    <Paragraph position="0"> We now turn to an extension of the system which takes proper account of re-entrancies in the structure. The essence of our approach is to define a stochastic procedure which simultaneously expands the nodes of the tree in the way outlined above and guesses the pattern of re-entrancies which relate them. It pays to stipulate that the structures which we build are fully inequated in the sense defined by Carpenter (Carpenter, 1992, p120).</Paragraph>
    <Paragraph position="1"> The essential insight is that the choice of a fully inequated feature structure involving a set of nodes is the same thing as the choice of an arbitrary equivalence relation over these nodes, and this is in turn equivalent to the choice of a partition of the set of nodes into a set of non-empty sets. These sets of nodes are equivalence classes. The standard reeursive procedure for generating partitions of k + 1 elements is to non-deterministically add the k + lthq node to each of the equivalence classes of each of the partitions of k nodes, and also to nondeterministically consider the new node as a singleton set. The basis of the stochastic procedure for generating fullyinequated feature structures is to interleave the generation of equivalence classes with the expansion from the initial node as described above.</Paragraph>
    <Paragraph position="2"> For the purposes of the expansion algorithm, a fully inequated feature structure consists of a feature tree (as before) and an equivalence relation 5 over all the maximal nodes in that tree. The task of the algorithm is to generate all such structures and to equip them with probabilities. We proceed as in the case without re-entrancy, except that we only ever expand sub-trees in the case where the new node begins a new equivalence class. This avoids the double counting which was a problem earlier.</Paragraph>
    <Paragraph position="3"> The remaining task is that of assigning scores to equivalence relations. We do not have a fully sat5Since maximal types are mutually inconsistent, this equivalence relation can be efficiently represented by a associating a separate partition with each maximal type isfactory solution to this problem. The reason for this is that we would ideally like to assign probabilities to intermediate structures in such a way that the probabilities of fully expanded structures are independent of the route by which they were arrived at. This can be done, and the method which we adopt has the merit of simplicity.</Paragraph>
    <Section position="1" start_page="87" end_page="87" type="sub_section">
      <SectionTitle>
5.1 Scoring re-entrancies
</SectionTitle>
      <Paragraph position="0"> We associate a single probabilistic parameter P(T=) with each type T, and derive the probability of the structure in which a particular pairwise equation of-nodes in type T have been equated by multiplying the probability of the structure in which no decision has been made by P(T=).</Paragraph>
      <Paragraph position="1"> We derive the probability of the corresponding inequated structure by multiplying by 1 - P(T=) in an entirely analogous way. This ensures that the probabilities of the equated and inequated extensions of the original structure sum to the original probability. The cost is a deficiency in modelling, since this takes no account of the fact that token identity of nodes is transitive, which are generated. As things stand the stochastic procedure is free to generate structures where nl ~ n2, n2 - n3 but nl 7~ n3, which are not in fact legal feature structures. This leads to distortions of the probability estimates since the training algorithm spends part of its probability mass on impossible structures.</Paragraph>
    </Section>
    <Section position="2" start_page="87" end_page="88" type="sub_section">
      <SectionTitle>
5.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> Even a crude account of re-entrancy is better than completely ignoring the issue, and the one proposed gets the right result for cases of double counting such as those discussed above, but it should be obvious that there is room for improvement in the treatment which we provide. Intuitively what is required is a parametrisable means of distributing probability mass among the distinct equivalence relations which extend the current structure. One attractive possibility would be to enumerate the relations which can be obtained by adding the current node to the various different equivalence classes which are available, apply some scoring function to each class, and then normalize such that the total score over all alternatives is one. But this might introduce unpleasant dependencies of the probabilities of feature structures on the order in which the stochastic procedure chooses to expand nodes, because the normalisation is carried out before we have full knowledge of the equivalence classes with which the current node might become associated. It may be that an appropriate choice of scoring function will circumvent this difficulty, but this is left as a matter for further research.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>