File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/84/j84-3001_metho.xml

Size: 48,780 bytes

Last Modified: 2025-10-06 14:11:34

<?xml version="1.0" standalone="yes"?>
<Paper uid="J84-3001">
  <Title>On the Mathematical Properties of Linguistic Theories I</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Preliminary Definitions
</SectionTitle>
    <Paragraph position="0"> We assume that the reader is familiar with the basic definitions of regular, context-free (CF), context-sensitive (CS), recursive, and recursively enumerable (r.e.) languages, as well as with their acceptors (see Hopcroft and Ullman 1979). We will be much concerned with the problem of recognizing whether a string is contained in a given language (the recognition problem) and with that of l This research was sponsored in part by the National Science and Engineering Research Council of Canada under Grant A9285. It was made possible in part by a gift from the Systems Development Foundation. An earlier version of this paper appeared in the Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics, Cambridge, MA, June 1983.</Paragraph>
    <Paragraph position="1"> I would like to thank Bob Berwick, Alex Borgida, Jim Hoover, Aravind Joshi, Lauri Karttunen, Fernando Pereira, Stanley Peters, Peter Sells, Hans Uszkoreit, and the referees for their suggestions.</Paragraph>
    <Paragraph position="2"> ZAlthough we will not examine them here, formal studies of other syntactic theories have been undertaken: e.g. Warren (1979) for Montague's PTQ (1973). Pereira and Shieber (1984) use techniques from the denotational semantics of programming languages to investi- null tation in linguistics has not always been beneficial. Some pseudoformal arguments against rival theories were unquestionably accepted by an audience that did not always have the mathematical sophistication to be critical. For example, Postal's claim (1964b) that two-level stratificational grammars generated only context-free languages was based on an imprecise definition by its proponents, as well as by the failure to see that among the more precise definitions were many very powerful ones. Copyright 1985 by the Association for Computational Linguistics. Permission to copy without fee all or part of this material is granted provided that the copies are not made for direct commercial advantage and the CL reference and this copyright notice are included on the first page. To copy otherwise, or to republish, requires a fee and/or specific permission.</Paragraph>
    <Paragraph position="3"> 0362-613X/84/030165-12503.00 Computational Linguistics, Volume 10, Numbers 3-4, July-December 1984 165 C. Raymond Perrault On the Mathematical Properties of Linguistic Theories generating one (or all) derivations of the string (the parsing problem).</Paragraph>
    <Paragraph position="4"> Some elementary definitions from complexity theory may be useful. Further details may be found in Aho et al. (1974). Complexity theory is the study of the resources required by algorithms, usually space and time.</Paragraph>
    <Paragraph position="5"> Let f(x) be a function, say, the recognition function for a language L. The most interesting results we could obtain regarding f would be a lower bound on the resources needed to compute f on a machine of a given architecture, say, avon Neumann computer or a parallel array of neurons. These results over whole classes of machines are very difficult to obtain, and none of any significance exist for parsing problems.</Paragraph>
    <Paragraph position="6"> Restricting ourselves to a specific machine model and an algorithm M for f, we can ask about the cost (e.g., in time or space) c(x) of executing M on a specific input x.</Paragraph>
    <Paragraph position="7"> Typically, c is too fine-grained to be useful: what one studies instead is a function c w whose argument is an integer n denoting the size of the input to M, and which gives some measure of the cost of processing inputs of length n. Complexity theorists have been most interested in the asymptotic behaviour of Cw, i.e., the behaviour of c w as n gets large.</Paragraph>
    <Paragraph position="8"> If one is interested in upper bounds on the behaviour of M, one usually defines Cw(n) as the maximum of c(x) over all inputs x of size n. This is called the worst-case complexity function for M. Other definitions are possible: for example, one can define the expected complexity function Ce(n) for M as the average of c(x) over all inputs of length n. c e might be more useful than c w if one had an idea as to the distribution of possible inputs to M. Not only are realistic distributions rarely available, but the introduction of probabilistic considerations makes the study of expected complexity technically more difficult than that of worst-case complexity. For a given problem, expected and worst-case measures may be quite different. 4 It is quite difficult to get detailed descriptions of Cw; for many purposes, however, a cruder estimate is sufficient. The next abstraction involves &amp;quot;.lumping&amp;quot;. classes of c w functions into simpler ones that demonstrate their asymptotic behaviour more clearly and are easier to manipulate. This is the purpose of O-notation (read &amp;quot;bigoh notation&amp;quot;). Let f(n) and g(n) be two functions. Function f is said to be O(g) if a constant multiple of g is an upper bound for f, for all but a finite number of values of n. More precisely, f is O(g) if there is are constants K and n o such that for all n &gt; no, fin) &lt; K * g(n).</Paragraph>
    <Paragraph position="9"> Given an algorithm M, we will say that M is TIME(g) or, equivalently, that its worst-case time complexity is O(g) if the worst-case time cost function Cw(n) for M is O(g). 5 This merely says that almost all inputs to M of size n can be processed in time at most a constant times g(n).</Paragraph>
    <Paragraph position="10"> It does not say that all inputs require g(n) time, or Ruzzo machine that implements f. Also, if two algorithms A i and A 2 are available for a function f, and if their worst-case complexity can be given respectively as O(g) and O(g), and gl &lt;- g2, it may still be true that for a large number of cases (maybe even all those likely to be encountered in practice), A 2 will be the preferable algorithm simply because the constant K 1 for gl may be much larger than is K 2 for g2&amp;quot; A parsing-related example is given in Section 3.</Paragraph>
    <Paragraph position="11"> In examining known results pertaining to the recognition complexity of various theories, it is useful to consider how robust they are in the face of changes in the machine model from which they were derived. These models can be divided into two classes: sequential and parallel. Sequential models (Aho et al. 1974) include the familiar single- and multitape Turing machines (TM) as well as random-access machines (RAM) and random-access stored-program machines (RASP). A RAM is like a TM except that its working memory is random-access rather than sequential. A RASP is like a RAM but stores its program in its memory. Of all these models, the RASP is most like a yon Neumann computer.</Paragraph>
    <Paragraph position="12"> All these sequential models can simulate one another in ways that do not require great changes in time complexity. For example, a k-tape Turing Machine that runs in time O(t) can be simulated by a RAM in time O(t log0, conversely, a RAM running in O(t) can be simulated by a k-tape TM in time O(t2).. In fact, all the familiar sequential models are polynomially related: they can simulate one another with at most a polynomial loss in efficiency. 6 Thus, if a syntactic model is known to have a difficult recognition problem when implemented on one sequential model, execution of an equivalent algorithm on another sequential machine will not be much easier.</Paragraph>
    <Paragraph position="13"> Transforming a sequential algorithm to one on a parallel machine with a fixed number K of processors provides at most a factor K improvement in speed. More interesting results are obtained when the number of processors is allowed to grow with the size of the problem, e.g., with the length of the string to be parsed. These processors can be viewed as connected together in a circuit, with inputs entering at one end and outputs being produced at the other. The depth of the circuit, or the maximum number of processors that data must be passed through from input to output, corresponds to the parallel time required to complete the computation. A problem that has a solution on a sequential machine in polynomial time and in space s will have a solution on a parallel machine with a polynomial number of processors and circuit depth (and hence parallel time) O(s2). This means that algorithms with sequential solutions requiring small space (such as deterministic CSLs) have fast parallel solutions.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Hoare's Quicksort algorithm, for example, has expected time complex-
</SectionTitle>
    <Paragraph position="0"> ity of O(n logn) and worst-case complexity of O(n2), using notation defined in the next paragraph.</Paragraph>
    <Paragraph position="1"> Similarly, let M be SPACE(g) if the worst-case space complexity of M  is O(g).</Paragraph>
    <Paragraph position="2"> 6 RAMs and RASPs are allowed to store arbitrarily large numbers in their registers. These results assume that the cost of performing elementary operations on those numbers is proportional to their length, i.e. to their logarithm.</Paragraph>
    <Paragraph position="3"> 166 Computational Linguistics, Volume 10, Numbers 3-4, July-December 1984. C. Raymond Perrault On the Mathematical Properties of Linguistic Theories For a comprehensive survey of parallel computation, see Cook (1981).</Paragraph>
    <Paragraph position="4"> 3. Context-Free Languages Recognition techniques for context-free languages are well known (Aho and Ullman 1972). The so-called CKY or &amp;quot;dynamic programming&amp;quot; method is attributed by Hays (1962) to J. Cocke; it was discovered independently by Kasami (1965) and Younger (1967), who showed it to be O(n3). It requires the grammar to be in Chomsky  Normal Form, and putting an arbitrary grammar in CNF may square its size. Berwick and Weinberg (1982) point out that, since the complexity of parsing algorithms is generally at least linearly dependent on the size of the grammar, this requirement may make CKY less than optimal for parsing short sentences.</Paragraph>
    <Paragraph position="5"> Earley's algorithm recognizes strings in arbitrary CFGs in time O(n 3) and space O(n2), and in time O(n e) for unambiguous CFGs. Graham, Harrison, and Ruzzo (1980) offer an algorithm that unifies CKY and Earley's algorithm (1970), and discuss implementation details. Valiant (1975) showed how to interpret the CKY algorithm as the finding of the transitive closure of a matrix and thus reduced CF recognition to matrix multiplication, for which subcubic algorithms exist. Because of the enormous constants of proportionality associated with this method, it is not likely to be of much practical use, either an implementation method or as a &amp;quot;psychologically realistic&amp;quot; model.</Paragraph>
    <Paragraph position="6"> Ruzzo (1979) has shown how CFLs can be recognized by Boolean circuits of depth O(log(n)2), and therefore that parallel recognition can be accomplished in time O(log(n)2). The required circuit size is polynomial in n. So as not to be mystified by the upper bounds on CF recognition, it is useful to remember that no known CFL requires more than linear time, nor is there even a nonconstructive proof of the existence of such a language.</Paragraph>
    <Paragraph position="7"> This is also a good place to recall the difference between recognition and parsing: if parsing requires that distinct structures be produced for all parses, it will be TIME(2n), since in some grammars sentences of length n may have 2 n parses (Church and Patil 1982). For an empirical comparison of various parsing methods, see Slocum (1981).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. Transformational Grammar
</SectionTitle>
    <Paragraph position="0"> From its earliest days, discussions of transformational grammar (TG) have included consideration of matters computational.</Paragraph>
    <Paragraph position="1"> Peters and Ritchie (1973a) provided some the first nontrivial results regarding the generative power of TGs. Their model reflects the Aspects version quite faithfully, including transformations that move and add constituents, and delete them subject to recoverability. All transformations are obligatory, and applied cyclically from the bottom up. They show that every r.e. set can be generated by applying a set of transformations to a context-sensitive base. The proof is quite simple: the right-hand sides of the type-0 rules that generate the r.e. set are padded with a new &amp;quot;blank&amp;quot; symbol to make them at least as long as their left-hand sides. Rules are added to allow the blank symbols to commute with all others. These context-sensitive rules are then used as the base of a TG whose only transformation deletes the blank symbols.</Paragraph>
    <Paragraph position="2"> Thus, if the transformational formalism itself is supposed to characterize the grammatical strings of possible natural languages, then the only languages being excluded by the formalism are those that are not enumerable under any model of computation. The characterization assumption is further discussed in Section 9. At the expense of a considerably more intricate argument, the previous result can be strengthened (Peters and Ritchie 1971) to show that every r.e. set can be generated by a context-free based TG, as long as a filter - an intersection with a regular set - can be applied to the phrase-markers produced by the transformations. In fact, the base grammar can be independent of the language being generated. The proof involves the simulation of a TM by a TG. The transformations first generate an &amp;quot;input tape&amp;quot; for the TM being simulated, then apply the TM productions, one per cycle of the grammar.</Paragraph>
    <Paragraph position="3"> The filter ensures that the base grammar will generate just as many S nodes as necessary to generate the input string and do the simulation. In this case too, if the transformational formalism is supposed to characterize the possible natural languages, the universal base hypothesis (Peters and Ritchie 1969), according to which all natural languages can be generated from the same base grammar, is empirically vacuous: any recursively enumerable language can.</Paragraph>
    <Paragraph position="4"> Following Peters and Ritchie's work, several attempts were made to find a restricted form of the transformational model that is descriptively adequate, yet whose generated languages are recursiVe (see, for example, LaPointe 1977). Since a key part of the proof in Peters and Ritchie (1971) involves the user of a filter on the final derivation trees, Peters and Ritchie (1973c) examined the consequences of forbidding final filtering. They show that, if S is the recursive symbol in the CF base, the generated language L is predictably enumerable and exponentially bounded. A language L is predictably enumerable if there is an &amp;quot;easily&amp;quot; computable function t(n) that gives an upper bound on the number of tape squares needed by its enumerating TM to enumerate the first n elements of L. L is exponentially bounded if there is a constant K such that, for every string x in L, there is another string t . x m L whose length is at most K times the length of x.</Paragraph>
    <Paragraph position="5"> The class of nonfiltering languages is quite unusual, including all the CFLs (obviously), but also properly intersecting the CSLs, the recursive languages, and the r.e. languages.</Paragraph>
    <Paragraph position="6"> The source of nonrecursivity in transformationally generated languages is that transformations can delete large parts of the tree, thus producing surface trees that are arbitrarily smaller than the deep structure trees they Computational Linguistics, Volume 10, Numbers 3-4, July-December 1984 167 C. Raymond Perrault On the Mathematical Properties of Linguistic Theories were derived from. This is what Chomsky's &amp;quot;recoverability of deletions&amp;quot; condition was meant to avoid. In his thesis, Petrick (1965) defines the following condition on transformational derivations: a derivation satisfies the terminal-length-increasing condition if the length of the yield of any subtree u, resulting from the application of the transformational cycle to a subtree t, is greater than the length of the yield of any subtree u r resulting from the application of the cycle to a subtree t r of t.</Paragraph>
    <Paragraph position="7"> Petrick shows that, if all recursion in the base grammar &amp;quot;passes through S&amp;quot; and all derivations satisfy the terminal-length-increasing condition, then the generated language is recursive. Using a slightly more restricted model of transformations Rounds (1973) strengthens this result by showing that the resulting languages are in fact context-sensitive.</Paragraph>
    <Paragraph position="8"> In an unpublished paper, Myhill shows that, if Petrick's condition is weakened to terminal-length-nondecreasing, the resulting languages can be recognized in space that is at most exponential in the length of the input. This implies that recognition can be done in at most double-exponential time, but Rounds (1975) proves that not only can recognition be done in exponential time, but that every language recognizable in exponential time can be generated by a TG satisfying the terminal-lengthnondecreasing condition and recoverability of deletions. This is a very strong result, because of the closure properties of the class of exponential-time languages&gt; To see why this is so requires a few more definitions.</Paragraph>
    <Paragraph position="9"> Let P be the class of all languages that can be recognized in polynomial time on a deterministic TM, and NP the class of all languages that can be recognized in polynomial time on a nondeterministic TM. P is obviously contained in NP, but the converse is not known, although there is much evidence that it is false.</Paragraph>
    <Paragraph position="10"> There is a class of problems, the so-called NP-eomplete problems, which are in NP and &amp;quot;as difficult&amp;quot; as any other problems in NP in the following sense: if any of them could be shown to be in P, all the problems in NP would also be in P. One way to show that a language L is NP-complete is to show that L is in NP and that every other language L o in NP can be polynomially transformed into L, - i.e., that there is a deterministic TM, operating in polynomial time, that will transform an input w to L into an input w o to L o such that w is in L if and only if w o is in L o. In practice, to show that a language is NP-complete, one shows that it is in NP and that some already known NP-complete language can be polynomially transformed into it.</Paragraph>
    <Paragraph position="11"> All the known NP-complete languages can be recognized in exponential time on a deterministic machine, and none have been shown to be recognizable in less than exponential time. Thus, since the restricted transformational languages of Rounds characterize the exponential languages, if all of them were to be in P, P would be equal to NP. Putting it another way, if P is not equal to NP, some transformational languages (even those satisfying the terminal-length-nonincreasing condition) have no &amp;quot;tractable&amp;quot; (i.e., polynomial-time) recognition procedures on any deterministic TM. It should be noted that this result also holds for all the other known sequential models of computation, as they are all polynomially related, and even for parallel machines with as many as a polynomial number of processors.</Paragraph>
    <Paragraph position="12"> All the results outlined so far in this section are inspired by the model of transformational grammar presented in Aspects. More recent versions of the theory are substantially different, primarily in that most of the constructions handled in terms of deletions from the base trees are now handled using traces (i.e., constituents with no lexical material) indexed to other constituents. In his contribution to this issue (p. 189), Berwick presents a formalization of the theory of Government and Binding (GB) and some of its consequences. The formalization is unusual in that it reduces grammaticality to well-formedness conditions on what he calls annotated surface structures. From these conditions, two results follow. One is that for every GB grammar G there is a constant K such that for every string w in L(G) and for every annotated surface structure s whose yield is w, the number of nodes in s is bounded by K*length(w). This, of course, ensures that the L(G) is recursive. The second result is that GB languages all have the linear growth or arithmetic growth property: for every sufficiently long string w in a GB language L there is another string w p in L which is at most K symbols shorter than w.</Paragraph>
    <Paragraph position="13"> A few comments about Berwick's formalization and results are in order. To begin with, the formalization is clearly a quite radical simplification of current practice among GB practitioners, as it does not reflect D-structure, LF, or PF, nor case theory, the theta-eriterion, and control theory. Thus, in its current form, the formalization does not include the machinery necessary to account for passives and raising. It also assumes that X-bar theory limits the base to trees generated by CFGs with no useless nonterminals and no cycles, except presumably through the S and NP nodes. This excludes accounts of stacked adjectives, as in the white speckled shaggy Pekingese, and of stacked relative clauses.</Paragraph>
    <Paragraph position="14"> We suspect that most of these features could be added to the formalization without affecting either result, and that it is extremely useful to have even a first approximation of one to work with. Although Berwick is mute on the subject, we conjecture that recognition in the model he gives can be done in polynomial time. What is less clear is what will happen to recognition complexity under models that include the other constraints.</Paragraph>
    <Paragraph position="15"> Berwick's result about the linear growth property has no immediate functional consequence for complexity or even for weak generative capacity. It is presented as a property that natural languages seem to have and thus that should be predicted by the linguistic model.</Paragraph>
    <Paragraph position="16">  168 Computational Linguistics, Volume 10, Numbers 3-4, July-December 1984 C. Raymond Perrault On the Mathematical Properties of Linguistic Theories 5. LexicaI-Functional Grammar  In part, transformational grammar seeks to account for a range of constraints or dependencies within sentences.</Paragraph>
    <Paragraph position="17"> Of particular interest are subcategorization, predicate-argument dependencies, and long-distance dependencies, such as wh-movement. Several recent theories suggest different ways of accounting for these dependencies, but without making use of transformations. We examine three of these in the next several sections: lexical-functional grammar, generalized phrase structure grammar, and tree adjunct grammar.</Paragraph>
    <Paragraph position="18"> In the lexical-functional grammar (LFG) of Kaplan and Bresnan (1982), two levels of syntactic structure are postulated: constituent and functional. All the work done previously by transformations is instead encoded both in the lexicon and in links established between nodes in the constituent and functional structures.</Paragraph>
    <Paragraph position="19"> The languages generated by LFGs, or LFLs, are CSLs and properly include the CFLs (Kaplan and Bresnan 1982). Berwick (1982) shows that a set of strings whose recognition problem is known to be NP-complete, namely, the set of satisfiable Boolean formulas, is an LFL.</Paragraph>
    <Paragraph position="20"> Therefore, as was the case for Rounds's restricted class of TGs, if P is not equal to NP, then some languages generated by LFGs do not have polynomial-time recognition algorithms. Indeed only the &amp;quot;basic&amp;quot; parts of the LFG mechanism are necessary to the reduction. This includes mechanisms necessary for feature agreement, for forcing verbs to take certain cases, and for allowing lexical ambiguity. Thus, no simple change in the formalism is likely to avoid the combinatorial consequences of the full mechanism. It should be noted that the c-structures and f-structures necessary to make satisfiable Boolean formulas into an LFL are not much larger than the strings themselves; the complexity comes in finding the assignment of truth-values to the variables. In his paper in this issue (p. 189), Berwick argues that the complexity of LFLs stems from their ability to unify trees of arbitrary size, and that such a mechanism does not exist in GB. However, the recognition complexity of GB languages, as formalized in Berwick (1984) or in more &amp;quot;faithful&amp;quot; models, remains open, and may arise from other constraints.</Paragraph>
    <Paragraph position="21"> Both Berwick and Roach have examined the relation between LFG and the class of languages generated by indexed grammars (Aho 1968), a class known to be a proper subset of the CSLs, but including some NP-complete languages (Rounds 1973). They claim (personal communication) that the indexed languages are a proper subset of the LFLs.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6. Generalized Phrase Structure Grammar
</SectionTitle>
    <Paragraph position="0"> In a series of papers, Gerald Gazdar and his colleagues (1982) have argued for a joint account of syntax and semantics that is like LFG in eschewing the use of transformations, but unlike it in positing only one level of syntactic description. The syntactic apparatus is based on a nonstandard interpretation of phrase-structure rules and on the use of metarules. The formal consequences of both these devices have been investigated.</Paragraph>
    <Paragraph position="1"> 6.1. Node admissibility There are two ways of interpreting the function of CF rules. The first, and most common, is to treat them as rules for rewriting strings. Derivation trees can then be seen as canonical representatives of classes of derivations producing the same string, differing only in the order in which the same productions are applied.</Paragraph>
    <Paragraph position="2"> The second interpretation of CF rules is as constraints on derivation trees: a legal derivation tree is one in which each node is &amp;quot;admitted' by a rule, i.e., each node dominates a sequence of nodes in a manner sanctioned by a rule. For CF rules, the two interpretations obviously generate the same strings and the same set of trees.</Paragraph>
    <Paragraph position="3"> Following a suggestion of McCawley's, Peters and Ritchie (1973b) showed that, if one considered context-sensitive rules from the node-admissibility point of view, the languages defined were still CF. Thus, for example, the use of CS rules in the base to impose subcategorization restrictions does not increase the weak generative capacity of the base component. (For some different restrictions of context-sensitive rules that guarantee that only CFLs will be generated, see Baker (1972).) Rounds (1970b) gives a simpler proof of Peters and Ritchie's node admissibility result, using the techniques from tree-automata theory, a generalization to trees of finite state automata theory for strings. Just as a finite-state automaton (FSA) accepts a string by reading it one character at a time, changing its state at each transition, a finite-state tree automaton (FSTA) traverses trees, propagating states. The top-down FSTA &amp;quot;attaches&amp;quot; a starting state (from a finite set) to the root of the tree. Transitions are allowed by productions of the form (q, a, n) =&gt; (q, ..... q,,) such that if state q is being applied to a node labeled a and dominating n descendants, then state qi should be applied to its ith descendant. Acceptance occurs if all leaves of the tree end up labeled with states in the accepting subset. The bottom-up FSTA is similar: starting states are attached to the leaves of the tree and the productions are of the form (a, n, (q, ..... q) =&gt; q) indicating that, if a node labeied a dominates n descendants, each labeled with states ql to q,e then node a gets labeled with state q. Acceptance occurs when the root is labeled by a state from the subset of accepting states.</Paragraph>
    <Paragraph position="4"> As is the case with FSAs, FSTAs of both varieties can be either deterministic or nondeterministic. A set of trees is said to be recognizable if it is accepted by a nondeterministic bottom-up FSTA. Once again, as with FSAs, any set of trees accepted by a nondeterministic bottom-up FSTA is accepted by a deterministic bottom-up FSTA, but the result does not hold for top-down Computational Linguistics, Volume 10, Numbers 3-4, July-December 1984 169 C. Raymond Perrault On the Mathematical Properties of Linguistic Theories FSTA, even though the recognizable sets are exactly the languages recognized by nondeterministic top-down FSTAs.</Paragraph>
    <Paragraph position="5"> A set of trees is local if it is the set of derivation trees of a CF grammar. Clearly, every local set is recognizable by a one-state bottom-up FSTA that checks at each node to verify that it satisfies a CF production. Furthermore, the yield of a recognizable set of trees (the set of strings it generates) is CF. Not all recognizable sets are local: an example is the set of trees that satisfies the constraints of X-bar theory and the 0-criterion. However, they can all be mapped into local sets by a simple homomorphic mapping. 7 Rounds's proof (1970a) that CS rules under node admissibility generate only CFLs involves showing that the set of trees accepted by the rules is recognizable - i.e., that there is a nondeterministic bottom-up FSTA that can check at each node that some node admissibility condition holds there. This requires confirming that the &amp;quot;strictly context-free&amp;quot; part of the rule holds and that a proper analysis of the tree passing through the node satisfies the &amp;quot;context-sensitive&amp;quot; part of the rule. Joshi and Levy (1977) strengthened Peters and Ritchie's result by showing that the node admissibility conditions could also include arbitrary Boolean combinations of dominance conditions: a node could specify a bounded set of labels that must occur either immediately above it along a path to the root, or immediately below it on a path to the frontier.</Paragraph>
    <Paragraph position="6"> In general, the CF grammars constructed in the proof of weak equivalence to the CS grammars under node admissibility are much larger than the original, and not useful for practical recognition. Joshi, Levy, and Yueh (1981), however, show how Earley's algorithm can be extended to a parser that uses the local constraints directly. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2. Metarules
</SectionTitle>
      <Paragraph position="0"> The second important mechanism used by Gazdar (1982) is metarules, or rules that apply to rules to produce other rules. Using standard notation for CF rules, one example of a metarule that could replace the Apects transformation known as &amp;quot;particle movement&amp;quot; is V-~ VNPtX =&gt; V-~ VPtN\[-PRO\]X The symbol X here behaves like variables in structural analyses of Aspects transformations. If such variables are restricted to being used as abbreviations, that is, if they are allowed to range only from a finite subset of strings over the vocabulary, then closing the grammar under the metarules produces only a finite set of derived rules; and thus the generative power of the formalism is not increased. If, on the other hand, X is allowed to range over strings of unbounded length, as are the essential variables of transformational theory, then the consequences are less clear. It is well known, for example, that, if the 7 This mapping is a bottom-up finite-state tree transducer that simply labels each node with the state the recognizing bottom-up FSTA would have been in at that node.</Paragraph>
      <Paragraph position="1"> right-hand sides of phrase structure rules are allowed to be arbitrary regular expressions, the generated languages are still context-free. Might something like this not be happening with essential variables in metarules? It turns out that such is not the case.</Paragraph>
      <Paragraph position="2"> The formal consequences of the presence of essential variables in metarules depend on the presence of another device, the so-called phantom categories. It may be convenient in formulating metarules to allow, in the left-hand sides of rules, occurrences of syntactic categories that are never introdu~ced by the grammar, i.e., that never appear in the right-hand sides of rules. In standard CFLs, these are called useless categories,&amp;quot; rules containing them can simply be dropped, with no change in weak generative capacity. Not so with metarules: it is possible for metarules to be used to rewrite rules containing phantom categories into rules without them. Such a device was proposed at one time as a way to implement passives in the GPSG framework.</Paragraph>
      <Paragraph position="3"> Uszkoreit and Peters (1983) have shown that essential variables in metarules are powerful devices indeed: CF grammars with metarules that use at most one essential variable and allow phantom categories can generate all recursively enumerable sets. Even if phantom categories are banned, some nonrecursive sets can be generated as long as the use of at least one essential variable is allowed.</Paragraph>
      <Paragraph position="4"> Two constraints on metarules have been proposed to restrict the generative capacity of metarule systems.</Paragraph>
      <Paragraph position="5"> Gazdar (1982) has suggested replacing essential variables by abbreviative ones, i.e. variables that can only range over a finite set of (predetermined) alternatives.</Paragraph>
      <Paragraph position="6"> Shieber et al. (1983) argue that a generalization is lost in so doing, in the sense that the class of instantiations of the variable must be defined bye extension rather than by intension. Given the alternative, this seems a small price to pay.</Paragraph>
      <Paragraph position="7"> The other constraint, suggested by Gazdar and Pullum (1982), is finite closure of the metarule derivation process: no metarule is allowed to apply more than once in the derivation of a rule. Shieber et al. (1983) present several examples, namely the treatment of discontinuous noun phrases in Walpiri, adverb distribution in German, and causatives in Japanese, that cannot be handled under the finite closure constraint.</Paragraph>
      <Paragraph position="8"> It should be noted that other ways of using one grammar to generate the rules of another have been proposed. VanWijngaarden (1969), for example, presented a scheme in which one grammar's sentences are the rules of another. Greibach (1974) gives some of its properties.</Paragraph>
      <Paragraph position="9"> 7. Tree Adjunct Grammar The tree adjunct grammars (TAG) of Joshi and his colleagues (1982, 1984) provide a different way of accounting for syntactic dependencies. A TAG consists of two finite sets of finite trees, the centre trees and the adjunct trees.</Paragraph>
      <Paragraph position="10"> 170 Computational Linguistics, Volume 10, Numbers 3-4, July-December 1984 C. Raymond Perrault On the Mathematical Properties of Linguistic Theories The centre trees correspond to the surface structures of the &amp;quot;kernel&amp;quot; sentences of the languages. The root of the adjunct trees is labelled with a nonterminal symbol that also appears exactly once on the frontier of the tree. All other frontier nodes are labelled with terminal symbols. Derivations in TAGs are defined by repeated application of the adjunction operation. If c is a centre tree containing an occurrence of a nonterminal A, and a * is an adjunct tree whose root (and one node n on the frontier) is labelled A, then the adjunction of a to c is performed by &amp;quot;detaching&amp;quot; from c the subtree t rooted at A, attaching a in its place, and reattaching t at node n.</Paragraph>
      <Paragraph position="11"> Adjunction may then be seen as a tree analogue of a context-free derivation for strings (Rounds 1970a). The string languages obtained by taking the yields of the tree languages generated by TAGs are called tree adjunct languages (TAL).</Paragraph>
      <Paragraph position="12"> In TAGs, all long-distance dependencies are the result of adjunctions separating nodes that at one point in the derivation were &amp;quot;close&amp;quot;. Both crossing and noncrossing dependencies can be represented (Joshi 1983)). The formal properties of TALs are fully discussed by Joshi, Levy, and Takahashi (1975); Joshi and Levy (1982); and Yokomori and Joshi (to appe~ar). Of particular interest are the following.</Paragraph>
      <Paragraph position="13"> TALs properly contain the CFLs and are properly contained in the indexed languages, which in turn are properly contained in the CSLs. Although the indexed languages contain NP-complete languages, TALs are much better behaved: Joshi and Yokomori report (personal communication) an O(n 4) recognition algorithm and conjecture that an O(n 3) bound may be possible.</Paragraph>
      <Paragraph position="14"> 8. Stratificational Grammar The constituent and functional structures of LFG, the metarules of GPSG, the constraints on deep and surface structures in TG, and the two-level grammars of van Wijngaarden are all different ways in which syntactic constraints can be distributed across more than one structure. The Stratificational Grammar (SG) of Lamb and Gleason (Lamb 1966, Gleason 1964) is yet another.</Paragraph>
      <Paragraph position="15"> SG postulates the existence of several coupled components, known as strata; phonology, morphology, syntax, and semology are examples of linguistic strata. Each stratum specifies a set of correct structures, and an utterance has a representative structure at each stratum. The strata are linearly ordered and constrained b.9 a realization relation.</Paragraph>
      <Paragraph position="16"> Following Gleason's model, Borgida (1983) defines the realization relation so that it couples the application of specific pairs of productions (or sequences of productions) in the different grammars. Note that this is a generalization of the pairing of syntactic and semantic rules suggested by Montague, for example.</Paragraph>
      <Paragraph position="17"> With any derivation in a rewrite grammar, one can associate a string of the productions used in the derivation. If a canonical order is imposed on the derivations for example, that the leftmost nonterminal must be the next one to be expanded - a unique string of productions can be associated with each derivation tree.</Paragraph>
      <Paragraph position="18"> A two-level stratifieational grammar consists of two rewrite grammars G 1 and G 2, called tactics, with sets of productions Pi and P2, respectively, and a realization relation R, which is a finite set of pairs, each consisting of a string of productions of P1 and a string of productions of P2. A derivation D~ in G~ is realized by a derivation D 2 in G 2 if the strings of productions sz and s 2 associated with D~ and D 2 can be decomposed into substrings s~=ur.u&amp;quot; and s2=vr..ve respectively, such that R(u,,v,), for all i from 1 to n. The language generated by a two-level SG is the set of string generated derivations in G 2 that realize derivations in Grextended to more than two strata.</Paragraph>
      <Paragraph position="19"> Because the realization relation binds derivations, it is the strong generative capacity of the tactics that determines the languages generated. Borgida (1983) studied the languages of two-level SGs as the strong generative capacity of the tactics is systematically varied. Some of his results are unexpected. All r.e. languages can be generated by two-level SGs with CF tactics. On the other hand, if the upper tactics are restricted to being right-recursive, only CFLs can be generated, even with type 0 lower tactics. If the grammars are restricted to have no length-decreasing rules, the languages describable by SGs lie in the class of quasi-real time languages, defined as recognizable by nondeterministic TMs in linear time.</Paragraph>
      <Paragraph position="20"> The principal feature of SGs that accounts for high generative power is the presence of left recursion in the tactics: to escape from the regular languages, one needs left recursion on at least one stratum; to escape context-free languages, two non-right-recursive strata are needed. These results apply to SGs with arbitrary number of strata. null</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
9. Seeking Significance
</SectionTitle>
    <Paragraph position="0"> How, then, can metatheoretical results be useful in selecting among syntactic theories? The obvious route, of course, is to claim that the computationally most restrictive theory is preferable. However, this comparison is useful only if the theories to be compared rest on a number of shared assumptions and observations concerning the scope of the syntax, the computational properties of the human processor and the relation between the processor and the syntactic theory.</Paragraph>
    <Paragraph position="1"> In this section, we first briefly consider the assumption of common syntactic coverage and the computational consequences of theory decomposition. We then ask how metatheoretical results can be used first as lower bounds and then as upper bounds on acceptable theories.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
9.1. Coverage
</SectionTitle>
      <Paragraph position="0"> Competing linguistic theories must obviously agree on the burden of their respective syntactic components. We consider here one example of a constraint for which two analyses have been presented, one purportedly completely syntactic, and the other partly semantic. The Computational Linguistics, Volume 10, Numbers 3-4, July-December 1984 171 C. Raymond Perrault On the Mathematical Properties of Linguistic Theories problem at hand is the distribution of the so-called polarity-sensitive items, such as any and the metaphorical sense of lift a finger. Simply put, these terms need to appear within the scope of a polarity reverser, such as not, or rarely. The question is: how are scope and polarity reverser defined? In Linebarger's syntactic analysis (1980), the scope relation is defined on the logical forms of the government and binding theory (GB): An item is in the immediate scope of NOT if (1) it occurs only in the proposition which is the entire scope of NOT and (2) within the proposition there are no logical elements intervening between it and NOT.</Paragraph>
      <Paragraph position="1"> In this analysis, scope and intervening must be defined configurationally, and one assumes that logical element is defined in the lexicon. Note that not is the only lexical element that can be a license. Linebarger assumes that all other cases are, strictly speaking, ill formed and salvaged only by the availability of an implicature which can be formalized to contain the polarity items in the appropriate relation to NOT. (Ladusaw 1983) Ladusaw's analysis (1979), within the framework of Montague grammar, is in three parts:  1. A negative polarity item will be acceptable only if it is in the scope of a polarity-reversing expression.</Paragraph>
      <Paragraph position="2"> 2. For any two expressions a and /3, constituents of a sentence S, a is in the scope of/3 with respect to a composition structure of S, S t, iff the interpretation of a is used in the formulation of the argument of/3's interpretation in S t .</Paragraph>
      <Paragraph position="3"> 3. An expression D is a polarity reverser with respect to an interpretation function ,~ if and only if, for all expressions X and y,s</Paragraph>
      <Paragraph position="5"> In (1), &amp;quot;acceptable&amp;quot; is predicated of negative polarity items; these are clearly parts of surface structures, and thus syntactic objects. The condition on acceptability is in terms of scope and polarity-reversing expression. In (3), polarity reverser is applied to syntactic objects and defined in terms of their denotations. In (2) a is in the scope of/3. is defined again of syntactic objects a and/3, but in terms of the function that interprets the structure they occur in, not of their denotations. So the condition applies to syntactic structures, but is defined in terms of the denotations of parts of that structure and in terms of the interpretation function itself. Although it would be satisfying to do so, there appears to be no natural way to recast Ladusaw's constraint as one that is fully semantic, namely, by making the interpretation function partial (i.e., in a way that allows John knows anything to be grammatical but uninterpretable) because the definition of scope is in terms of the interpretation function, not the denotations themselves. We seem condemned to straddle the fence on this one.</Paragraph>
      <Paragraph position="6"> Thus we have here one theory that deals, completely within the syntactic domain, only with the license not, and another that accounts for a much broader range of licenses by imposing on syntactic structures conditions defined in terms of their interpretations and of the interpretation function itself. They are computationally incomparable.</Paragraph>
      <Paragraph position="7"> We close this section with an aside on the separation of constraints. Constraint separation can occur in two ways. In the case of polarity-sensitive items, it takes place across the syntax-semantics boundary. In several syntactic theories, such as GB and LFG, it can also occur within the syntactic theory itself: grammaticality in LFG, for example, is defined in terms of the existence of pairs of appropriately related constituent and functional structures. null In general, the class resulting from the intersection of the separated classes will be at least as large as either of them: e.g., the intersection of two CFLs is not always a CFL. More interesting is the fact that separation sometimes has beneficial computational effects. Consider, for example, the constraint in many programming languages that variables can only occur in the scope of a declaration for them. This constraint cannot be imposed by a CFG but can be by an indexed grammar, at the cost of a dramatic increase in recognition complexity. In practice, however, the requirement is simply not checked by the parser, which only recognizes CFLs. The declaration conditions are checked separately by a process that traverses the parse tree. In this case, the overall recognition complexity remains some low-order polynomial. It is not clear to me whether one wants to consider the declaration requirement syntactic or not. The point is that, in this case, the &amp;quot;unified account&amp;quot; is more general, and computationally more onerous, than the modular one.</Paragraph>
      <Paragraph position="8"> Some arguments of this kind can be found in Berwick and Weinberg (1982).</Paragraph>
      <Paragraph position="9"> 9.2. Metatheoretical results as lower bounds The first use of formal results is to argue that a theory should be rejected if it is insufficiently powerful to account for observed constraints. Chomsky used this strategy initially against finite-state grammars 9 and then against CFGs. It obviously first requires extracting from empirical observation (and decisions about idealization) what the minimal generative capacity and recognition complexity of actual languages are. Several arguments have been made against the weak generative adequacy of CFGs. The best known of these are Bar-Hillel's claim (1961) based on the occurrence of respectively and Postal's (1964a) on nominalization in Mohawk. Higginbotham (1984) claims non-context-freeness for English</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>