<?xml version="1.0" standalone="yes"?>
<Paper uid="C90-2071">
  <Title>Probabilistic Unification-Based Integration Of Syntactic and Semantic Preferences For Nominal Compounds</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> In this paper, we describe a probabilistic framework for unification-based grammars that facilitates'integrating syntactic a~ld sem~mtic constraints and preferences. We share many of the concerns found in recent work on massively-parallel language interpret'ation models, although the proposal reflects our belief in the value of a higher-level account that is not stated in terms of distributed'computati0n. We also feel that inadequate learning theories severely limit ex'isting massively-parallel language interpretation models. A learning theory is not only interesting in its own right, but must underlie,any quantitative account of language interpretation, because the complexity of interaction between constraints and preferences makes ad hoc trial-and-error strategies for picking numbers infeasible, particula~'ly for semantics in realistically-sized do~ fire,ins.</Paragraph>
    <Paragraph position="1"> Introduction Massively-parallel models of language interpretation ~:including markeropassing models and neural networks of both the connectionist and PDP (parallel distributed processing) variety--have provoked some fundamental questions about the limits of symbolic, logic- or rule-based frameworks. Traditional frameworks have difficulty integrating preferences in the presence of complex dependency relationships, in analyzing ambiguous phrases, for example, semantic information should sometimes override syntactic prefcrenccs, and vice versa. Such interactions can take place at different levels within a phrase's constituent structure, even for a single analysis. Massiw;ly-parallel models excel at integrating different sources of preferences in a natural, intuitive *Many thanks to Robert Wilensky and Charles Fillmore for helpful discussions, and to Hans Karlgren and Nigel Ward for constructive suggestions On drafts. This research was sponsored in part by the Defense Advanced Research Projects Agency (DoD), monitored by the Space and Naval Warfare Systems Command under N00039-88-C-0292, the Office of Naval Research under contract N00014-89-J-3205, and the Sloan Foundation under grant 86-10-3.</Paragraph>
    <Paragraph position="2"> fashion; for example, connectionist models simply translate dependency constraints into excitatory or inhibitory links in relaxation networks (Waltz &amp; Pollack 1985). Furthermore, massively-parallel models have shown remarkable ability to compute complex semantic preferences.</Paragraph>
    <Paragraph position="3"> We argue that it is possible and desirable to give a more meaningful account of preference integration at a higher level, without resort to distributed algorithms. One could say that we are interested in characterizing the nature of the preferences, rather than how they might be efficiently computed. We do not claim that all properties of massively-parallel models can or should be described at this level. However, few language interpretation models take advantage of those properties that can only be characterized at the distributed level.</Paragraph>
    <Paragraph position="4"> We also propose a quantitative theory that assigns an interpretation to the numbers used in our model. A quantitative theory explains the numbers' significance by defining the procedure by which the model--in principle, at least--can learn the numbers. Much of the mystique of neural networks is due to their potential learning properties, but surprisingly, few PDP and no connectionist models of language interpretation that we know of specify quantitative theories, even though numbers must be used to run the models. Without a quantitative theoretical basis, it seems unlikely that the network structures will generalize much beyond the particular hand-coded examples, if for no other reason than the inamense room for variation in constructing such networks. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Case Study: Nominal Compounds
</SectionTitle>
      <Paragraph position="0"> Nominal compounds exemplify the sort of phenomena modeled by interacting preferences. Nouns themselves are often homonymous--is dream stale a sleep condition or California?---necessitating lexical ambiguity resolution. Structural ambiguity resolution required for nested nominal compounds, which have more than one parse; consider \[baby pool\] lable versus baby \[pool tabk\]. Lexicalized nominal compounds necessitate syntactic preferences, while semantic preferences are needed to guide semantic composition tasks like frame selection and case/role binding, as nominal compounds nearly al-' ways have multiple possible meanings. Traditionally, linguists have only classified nominal come  pounds according to broad criteria such as part-whole or source-result relationships (Jespersen 1946; Quirk et al. 1985); several large-scale studies have provided somewhat finer-grained classifications on the order of a dozen classes (Lees 1963; Levi 1978; Warren 1978). However, the emphasis has been on predicting the possible meanings of a compound, rather than predicting its preferred meaning. An exception is Leonard's (1984) rule-based model which, howew~r, only produces fairly coarse interpretations with medium (76%) accuracy.</Paragraph>
      <Paragraph position="1"> We distinguish three major classes of nominal compounds: lexicalized (conventional), such as clock radio; identificative, such as clock gears; and creative, such as clock table. Both identificative and creative compounds are novel in Downing's (1977) sense; they differ in that an identificative compound serves to identify a known (but hitherto unnamed) semantic category, whereas to interpret a creative compound requires constructing a new semantic structure. There is a bias to use the most specific pre-existing categories that match the compound being analyzed, syntactic or semantic. Precedence is given to a conventional parse if one exists, then a parse with an identificative interpretation, and lastly a parse with a creative interpretation. However, this &amp;quot;Maximal Conventionality Principle&amp;quot; can easily be overruled by global considerations arising from the embedding phrase and context. Figure 1 shows examples where two conventional compounds compete, and where global considerations cause an identificative compound to be preferred over a competing conventional compound. These cases require integration of quantitative syntactic and semantic preferences, since non-quantitative integration schemes (e.g., Marcus 1980; Hirst 1987; Lytinen 1986) do not discriminate adequately between the alternative analyses.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
What Do Massively-Parallel Models
Really Say?
</SectionTitle>
      <Paragraph position="0"> One use of massive parallelism is to evaluate the similarity or compatibility between two concepts in order to generate semantic preferences. Similarity evaluators usually employ PDP networks where semantic concepts are internally represented as distributed activation patterns over a set of ~microfeaturcs'. Conceptually, the network in Figure 2a gives a similarity metric between a given concept and every other concept, computed as the weighted sum of shared microfeatures. 1 Likewise, the hidden layer in Figure 2b computes the goodness of every possible relation between the given pair of nouns. In non-massively-parallel terms, what such nets do is capture statistical dependencies between concepts, down to the granularity of the chosen &amp;quot;microfeatures'. A probabilistic feature-structure formalism employing the same granularity of features should be able to capture the same dependencies.</Paragraph>
      <Paragraph position="1"> Connectionist models are often used to integrate syntactic and semantic preferences front different information sources (Cottrell 1984, 1985; Wermter  1989b; Wermter &amp; Lehnert 1989). Nodes represent t Ignoring Bookman's persistent activation, which simulates recency-based contextual priming.</Paragraph>
      <Paragraph position="2">  hypotheses about word senses, parse structures, or role bindings; links represent either supportive or inhibitory dependencies between hypotheses. The links constrain the network so that activation propagation causes the net to relax into a state where the hypotheses are all consistent with one another.</Paragraph>
      <Paragraph position="3"> ~.I'he most severe problem with these models is the ~rbitariness of the numbers used; Cottrell, for examp\]e, admits &amp;quot;weight twiddling&amp;quot; and notes that lack of formal analysis hampers determination of parameters. In other words, although the networks settle itlto consistent states, there is no principle determinlag the probability of each state.</Paragraph>
      <Paragraph position="4"> McClelland &amp; Kawamoto's (1986) PDP model learns how :syntactic (word-order) cues affect semantic frame/case selection, yielding more principled preference integration. Like the PDP similarity evaluators, however, the information encoded in the network and its weights is not easily comprehended.</Paragraph>
      <Paragraph position="5"> Previous non-massively-parallel proposals for quantitative preference integration have used non-probabilistic evidence combination schemes. Schubert (1986) suggests summing &amp;quot;potentials&amp;quot; up the phrase-structure trees; these potentials derive from salience, logical-form typicality, and semantic typicality conditions. McDonald's (1982) noun compound interpretation model also sums different sources of syntactic, semantic, and contextual evidence. Though qualitatively appealing, additive calculi are liable to count the same evidence more than once, and use arbitrary evidence weighting schemes, making it impossible to construct a model that works for all cases. Hobbs el al. (1988) propose a theorem-proving lnodel that integrates syntactic constraints with variable-cost abductive semantic and pragmatic a~sumptions. The danger of these non-probabilistic approaches, as with connectionist preference integrator's, is that the use of poorly defined &amp;quot;magic num- null We are primarily concerned here with the following p,:oblem: given a nominal compound, determine the ranking of its possible interpretations from most to least likely. The problem can be formulated in terms of unification. Unification-based formalisms provide an elegant means of describing the information structures used to construct interpretations. Lexical and structural ambiguity resolution, as well as semantic composition, are readily characterized as choices between alternative sequences of unification operations. A key feature of unification--especially important foJ: preference integration--is its neutrality with respect to control, i.c., there is no inherent bias in the order of unifications, and thus, no bias as to which choices take precedence ovcr others. Although nominal compound interpretation involves lcxical and st t'uctural ambiguity resolution and semantic comp()sition, it is not a good idea to centralize control around any single isolated task, because there is too much interaction. For example, the frame selection problem affects lexical arnbiguity resolution (consider the special case where the frame selected is that signified by the lexical item). Likewise, frame selection and case/role binding are two aspects of the same semantic composition problem, and structural ambiguity resolution depends largely on preferences in semantic composition.</Paragraph>
      <Paragraph position="6"> Thus we turn to unification for a clean formulation of the problem. Three classes of feature-structures are used: syntactic, semantic, and constructions. The construction is defined in Fillmore's (1988) Construction Grammar as &amp;quot;a pairing of a syntactic pattern with a meaning structure&amp;quot;; they are similar to signs in HPSG (Pollard &amp; Sag 1987) and pattern-concept pairs (Wilensky &amp; Arens 1980; Wilensky et al. 1988). Figure 3 shows a sample construction containing both syntactic and semantic feature-structures. 2 Typed feature-structures are used: the value of the special feature TYPE is a type in a multiple-inheritance type hierarchy, and two TYPE values unify only if they are not disjoint.</Paragraph>
      <Paragraph position="7"> This allows (1) easy transformation from semantic feature-structures to more convenient frame-based semantic network representations, and (2) efficient encoding of partially redundant lexical/syntactic categories using inheritance (see, for example, Pollard &amp; Sag 1987; Jurafsky 1990). Our notation is chosen for generality; the exact encoding of signification relationships is inessential to our purpose here.</Paragraph>
      <Paragraph position="8">  Given a nominal compound (of arbitary length), an intevpretalion is defined as an instantiated construction--including all the syntactic, semantic, and sub-construction f-structures--such that the syntactic structure parses the nominal compound, and the semantic structure is consistent with all the (sub-)constructions. Figure 4 shows an interpretation of afternoon rest. Given this framework, lexical ambiguity resolution is the selection of a particular sub-construction for a lexical item that matches more than one construction, structural ambiguity resolution is the selection between alternative syntactic fstructures, and semantic composition is the selection between alternative semantic f-structures. In each case we must be be able to compare alternative interpretations and determine the best.</Paragraph>
      <Paragraph position="9"> Before discussing how to compare interpretations, let us briefly consider the sort of information available. We extend the unification paradigm with a function f that returns the relative frequency of any category in the type hierarchy, normalized so that for any category cat, f(cat) = P\[cat(x)\] where x is a  terpretation of &amp;quot;afternoon rest&amp;quot;.</Paragraph>
      <Paragraph position="10"> random variable ranging over all categories. For semantic categories, this provides the means of encoding typicality information. For syntactic categories and constructions, this provides a means of encoding information about degrees of lexicalization. Since f is defined in terms of relative frequency, there is a learning procedure for f: given a training set of correct interpretations, we need only count the instances of each category (and then normalize).</Paragraph>
      <Paragraph position="11"> The probabilistic goodness metric for an interpretation is defined as follows: the goodness of an interpretation is the probability of the entire construction given the input words of the nominal compound, e.g., P\[+c:)l + s1, +82\] = P\[ NN-constrl(ig)l &amp;quot;afternoon&amp;quot;(ix)^ &amp;quot;rest&amp;quot;(i2)\]. The metric is global, in that for any set of alternative interpretations, the most likely interpretation is that with the highest metric.</Paragraph>
      <Paragraph position="12"> As a simplified example of computing the metric, suppose the feature graph of Figure 4 constituted a complete dependency graph containing all candida~ hypotheses (actually an unrealistic assumption since this would preclude any alternative interpretations).</Paragraph>
      <Paragraph position="13"> For each pair of connected nodes, the conditional probability of the child, given the ancestor, is given by the ratio of their relative frequencies (Figure 5a).</Paragraph>
      <Paragraph position="14"> The metric only requires computing the probability of c9 (Figure 5b). 3 Nodes are clustered into multi-valued compound variables as necessary to eliminate loops, to ensure counting any piece of evidence only once (Figure 5c).</Paragraph>
      <Paragraph position="15"> The conditional probability vectors P\[+c91zi\] and P\[zll + sl, +s2\] are computed using the disjunctive interaction model: 4</Paragraph>
      <Paragraph position="17"> &amp;quot;csl + 81, +s2\] ....</Paragraph>
      <Paragraph position="18"> Finally, we compute P\[+c91 + s1,+s2\] by conditioning on the compound variable Z and taking the weighted average of P\[+cglZ, +sl, +s2\] over all states of Z:</Paragraph>
      <Paragraph position="20"> Both syntactic and semantic preferences are taken into account. The influence of semantic preferences is encoded in the conditional probabilities P\[+cg\] + c7\] and P\[+cgl + cs\]J The loops in the original dependency graph correspond to support for the interpretation via both syntactic and semantic paths.</Paragraph>
      <Paragraph position="21"> A more complex example demonstrating structural ambiguity resolution is shown in Figure 6; here an afternoon rest schema produces a semantic preference that overrides a syntactic preference arising from weak lexicalization of the nominal compound rest area. 6 A major unsolved problem with this approach is specificity selection. This is a well-known trade-off in classification models: the more general the interpretation, the higher its probability is; whereas the more specific the interpretation, the greater its utility and the more informative it is. The probabilistic goodness metric does not help when comparing two interpretations whose only difference is that one is more general than the other. 7 In our initial studies we attempted to handle this trade-off using thresholded marker-passing techniques (Wu 1987, 1989), but we are currently investigating a stronger utility used to complete the probability model in cases where is infeasible to gather or store full conditional probability matrices for all input combinations (see Pearl 1988). Heavily biased conditional probability matrices that cannot be satisfactorily approximated by disjunctive interaction can sometimes be handled by forming additional categories. The apparent schema-organization of human memory may well arise for the same reason.</Paragraph>
      <Paragraph position="22"> ~These conditional probabilities cannot be derived solely from frequency counts since c9 is an instance of a novel category--the category of &amp;quot;afternoon rest&amp;quot; constructions denoting a nap--with zero previous frequency. Instead, the conditional probabilities P\[+c9\] + c~\] and P\[+cgl + cs\] are a function of the ancestral conditional</Paragraph>
      <Paragraph position="24"> theory to complement the probabilistic model, incorporating both explicit invariant biases and probabilistically learned utility expectations. It is not yet clear whether we shall also need to incorporate pragmatic utility expectations in the constructions.</Paragraph>
      <Paragraph position="25"> For methodological reasons we have deliberately impoverished the statistical database, by depriving the model of all information except for category frequencies, relying upon disjunctive interaction to complete the probability model. This limitation on the complexity of statistical information is too restrictive; disjunctive interaction cannot satisfactorily approximate cases where P\[-{-c3lCl, (:21 ~, 1 - P\[-c3lcl\]. P\[ czlq\].</Paragraph>
      <Paragraph position="26"> Such cases appear to arise often; for example, the presence of two nouns, rather than one, increases the probability of a compound by a much greater factor than modeled by disjunctive interaction. We intend to test variants of the model empirically on a corpus of nominal compounds, with randomly selected training sets; the restrictions on complexity of conditional probability information will be relaxed depending upon the resulting prediction accuracy.</Paragraph>
      <Paragraph position="27"> Conclusion We have suggested extending unification-based formalisrrLs to express the sort of interacting preferences used in massively-parallel language models, using probabilistic techniques. In this way, quantitative claims that remain hidden in many massively-parallel models can be made more explicit; moreover, the numbers and the calculus are motivated by a reasonable assumption about language learning. We hope to see increased use of pr0babilistic models rather than arbitrary calculi in language research: Charniak &amp; Goldman's (1989) recent analysis of probabilities in semantic story structnres is a promising development in this direction. Stoleke (1989) transformed a unification grammar into a connectionist framework (albeit without preferences); we have taken the opposite tack. Many linguists have acknowledged the need to extend their frameworks to handle statistically-based syntactic and semantic judgements (e.g., Karlgren 1974; Ford et al. 1982, p. 745), but only in passing, largely, we suspect, due to the unavailability of adequate representational tools. Because our proposal makes direct use of traditional unification-based structures, larger grammars should be easy to construct and 5 417 incorporate; because of the direct correspondence to semantic net representations, complex semantic models of the type found in AI work may be more readily exploited.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>