<?xml version="1.0" standalone="yes"?> <Paper uid="E95-1017"> <Title>Incremental Interpretation of Categorial Grammar*</Title> <Section position="3" start_page="119" end_page="121" type="metho"> <SectionTitle> 2 Applicative Categorial Grammar </SectionTitle> <Paragraph position="0"> Applicative Categorial Grammar is the most basic form of Categorial Grammar, with just a single combination rule corresponding to function application. It was first applied to linguistic description by Ajdukiewicz and Bar-Hillel in the 1950s.</Paragraph> <Paragraph position="1"> Although it is still used for linguistic description (e.g. Bouma and van Noord, 1994), it has been somewhat overshadowed in recent years by HPSG (Pollard and Sag 1994) and by Lambek Categorial Grammars (Lambek 1958). It is therefore worth giving some brief indications of how it fits in with these developments.</Paragraph> <Paragraph position="2"> The first directed Applicative CG was proposed by Bar-Hillel (1953). Functional types included a list of arguments to the left, and a list of arguments to the right. Translating Bar-Hillel's notation into a feature-based notation similar to that in HPSG (Pollard and Sag 1994), we obtain the following category for a ditransitive verb such as put:

[x: s, l: ⟨np⟩, r: ⟨np, pp⟩]

</Paragraph> <Paragraph position="4"> The list of arguments to the left is gathered under the feature l, and those to the right, an np and a pp in that order, under the feature r.</Paragraph> <Paragraph position="5"> Bar-Hillel employed a single application rule, which corresponds to the following:

Ln ... L1  [x: X, l: ⟨L1, ..., Ln⟩, r: ⟨R1, ..., Rm⟩]  R1 ... Rm  →  X

The result was a system which comes very close to the formalised dependency grammars of Gaifman (1965) and Hays (1964). The only real difference is that Bar-Hillel allowed arguments to themselves be functions. For example, an adverb such as slowly could be given a type whose left list itself contains a functional (verb phrase) category, along the lines of [4]:

[x: s, l: ⟨[x: s, l: ⟨np⟩, r: ⟨⟩], np⟩, r: ⟨⟩]

</Paragraph> <Paragraph> Footnote 4: This simplifies the notation of Bar-Hillel, who used a slightly problematic 'double slash' notation for functions of functions.</Paragraph> <Paragraph position="6"> An unfortunate aspect of Bar-Hillel's first system was that the application rule only ever resulted in a primitive type. Hence, arguments with functional types had to correspond to single lexical items: there was no way to form the type np\s [5] for a non-lexical verb phrase such as likes Mary. Rather than adapting the Application Rule to allow functions to be applied to one argument at a time, Bar-Hillel's second system (often called AB Categorial Grammar, or Ajdukiewicz/Bar-Hillel CG; Bar-Hillel 1964) adopted a 'Curried' notation, and this has been adopted by most CGs since. To represent a function which requires an np on the left, and an np and a pp to the right, there is a choice of the following three types using Curried notation:

np\((s/pp)/np)    (np\(s/pp))/np    ((np\s)/pp)/np

</Paragraph> <Paragraph> Footnote 5: np\s is Lambek notation (Lambek 1958).</Paragraph> <Paragraph position="7"> Most CGs either choose the third of these (to give a vp structure), or include a rule of Associativity which means that the types are interchangeable (in the Lambek Calculus, Associativity is a consequence of the calculus, rather than being specified separately).</Paragraph> <Paragraph position="8"> The main impetus to change Applicative CG came from the work of Ades and Steedman (1982).</Paragraph> <Paragraph position="9"> Ades and Steedman noted that the use of function composition allows CGs to deal with unbounded dependency constructions. Function composition enables a function to be applied to its argument even if that argument is incomplete, e.g.

s/pp + pp/np → s/np

This allows peripheral extraction, where the 'gap' is at the start or the end of e.g. a relative clause.</Paragraph>
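<Paragraph> For concreteness, the slash types and the two operations just mentioned can be transcribed directly into code. This is an illustrative sketch only (the paper contains no such code, and all class and function names here are our assumptions):

from dataclasses import dataclass

@dataclass(frozen=True)
class Prim:
    name: str                      # a primitive type such as s, np, pp
    def __str__(self): return self.name

@dataclass(frozen=True)
class Slash:
    res: object                    # X in X/Y
    arg: object                    # Y in X/Y
    def __str__(self): return f"({self.res}/{self.arg})"

def apply_fwd(f, a):
    """Forward application: X/Y + Y => X."""
    return f.res if isinstance(f, Slash) and f.arg == a else None

def compose_fwd(f, g):
    """Forward composition: X/Y + Y/Z => X/Z (the rule used for
    peripheral extraction in the text)."""
    if isinstance(f, Slash) and isinstance(g, Slash) and f.arg == g.res:
        return Slash(f.res, g.arg)
    return None

s, np, pp = Prim('s'), Prim('np'), Prim('pp')
print(compose_fwd(Slash(s, pp), Slash(pp, np)))   # prints (s/np)

</Paragraph>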
<Paragraph position="10"> Variants of the composition rule were proposed in order to deal with non-peripheral extraction, but this led to unwanted effects elsewhere in the grammar (Bouma 1987). Subsequent treatments of non-peripheral extraction based on the Lambek Calculus (where standard composition is built in: it is a rule which can be proven from the calculus) have either introduced an alternative to the forward and backward slashes, i.e. / and \ for normal arguments and ? for wh-arguments (Moortgat 1988), or have introduced so-called modal operators on the wh-argument (Morrill et al. 1990). Both techniques can be thought of as marking the wh-arguments as requiring special treatment, and therefore do not lead to unwanted effects elsewhere in the grammar.</Paragraph> <Paragraph position="12"> However, there are problems with having just composition, the most basic of the non-applicative operations. In CGs which contain functions of functions (such as very, or slowly), the addition of composition adds both new analyses of sentences and new strings to the language. This is because composition can be used to form a function, which can then be used as an argument to a function of a function. For example, if the two types n/n and n/n are composed to give the type n/n, then this can be modified by an adjectival modifier of type (n/n)/(n/n). Thus the noun very old dilapidated car can get the unacceptable bracketing [[very [old dilapidated]] car]. Associative CGs with Composition, or the Lambek Calculus, also allow strings such as boy with the to be given the type n/n, predicting very boy with the car to be an acceptable noun. Although individual examples might be possible to rule out using appropriate features, it is difficult to see how to do this in general whilst retaining a calculus suitable for incremental interpretation.</Paragraph> <Paragraph position="13"> If wh-arguments need to be treated specially anyway (to deal with non-peripheral extraction), and if composition as a general rule is problematic, this suggests we should perhaps return to grammars which use just Application as a general operation, but have a special treatment for wh-arguments. Using the non-Curried notation of Bar-Hillel, it is more natural to use a separate wh-list than to mark wh-arguments individually. For example, the category appropriate for relative clauses with a noun phrase gap would be:

[x: s, l: ⟨⟩, r: ⟨⟩, wh: ⟨np⟩]

It is then possible to specify operations which act as purely applicative operations with respect to the left and right argument lists, but more like composition with respect to the wh-list. This is very similar to the way in which wh-movement is dealt with in GPSG (Gazdar et al. 1985) and HPSG, where wh-arguments are treated using slash mechanisms or feature inheritance principles which correspond closely to function composition.</Paragraph> <Paragraph position="14"> Given that our arguments have produced a categorial grammar which looks very similar to HPSG, why not use HPSG rather than Applicative CG? The main reason is that Applicative CG is a much simpler formalism, which can be given a very simple syntax-semantics interface, with function application in syntax mapping to function application in semantics [6, 7]. This in turn makes it relatively easy to provide proofs of soundness and completeness for an incremental parsing algorithm.</Paragraph> <Paragraph> Footnote 6: One area where application-based approaches to semantic combination gain in simplicity over unification-based approaches is in providing semantics for functions of functions. Moore (1989) provides a treatment of functions of functions in a unification-based approach, but only by explicitly incorporating lambda expressions. Pollard and Sag (1994) deal with some functions of functions, such as non-intersective adjectives, by explicit set construction.</Paragraph> <Paragraph> Footnote 7: As discussed above, wh-movement requires something more like composition than application. A simple syntax-semantics interface can be retained if the same operation is used in both syntax and semantics. Wh-arguments can be treated as similar to other arguments, i.e. as lambda abstracted in the semantics. For example, the fragment John found a woman who Mary can be given the semantics λP.∃x. woman(x) & found(john, x) & P(mary, x), where P is a function from a left argument Mary of type e and a wh-argument, also of type e.</Paragraph>
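<Paragraph> The wh-list treatment described above can be sketched as follows. This is our own illustration, not a definition from the paper: apply_right_wh is one way of realising an operation that is applicative on the r-list but composition-like on the wh-list, in that the argument's unfilled wh-requirements are inherited by the result (as with GPSG/HPSG slash propagation):

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Cat:
    x: str                        # primitive result type
    l: Tuple['Cat', ...] = ()     # arguments expected to the left
    r: Tuple['Cat', ...] = ()     # arguments expected to the right
    wh: Tuple['Cat', ...] = ()    # wh (gap) requirements

def apply_right_wh(f: Cat, a: Cat):
    """Consume the first right argument applicatively, but pass the
    argument's wh-list up to the result (composition-like behaviour)."""
    if not f.r:
        return None
    spec = f.r[0]
    if (a.x, a.l, a.r) == (spec.x, spec.l, spec.r):
        return Cat(f.x, f.l, f.r[1:], f.wh + a.wh)
    return None

NP = Cat('np')
gap_s = Cat('s', wh=(NP,))                  # the relative-clause body category above
thinks = Cat('s', l=(NP,), r=(Cat('s'),))   # hypothetical s-embedding verb entry
print(apply_right_wh(thinks, gap_s))        # an s with l:<np> and wh:<np>: the gap percolates

</Paragraph>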
<Paragraph position="15"> Ultimately, some of the techniques developed here should be able to be extended to more complex formalisms such as HPSG.</Paragraph> </Section> <Section position="4" start_page="121" end_page="121" type="metho"> <SectionTitle> 3 AB Categorial Grammar with Associativity (AACG) </SectionTitle> <Paragraph position="0"> In this section we define a grammar similar to Bar-Hillel's first grammar. However, unlike Bar-Hillel, we allow one argument to be absorbed at a time.</Paragraph> <Paragraph position="1"> The resulting grammar is equivalent to AB Categorial Grammar plus associativity.</Paragraph> <Paragraph position="2"> The categories of the grammar are defined as follows: 1. If X is a syntactic type (e.g. s, np), then [x: X, l: ⟨⟩, r: ⟨⟩] is a category. 2. If X is a syntactic type, and L and R are lists of categories, then [x: X, l: L, r: R] is a category.</Paragraph> <Paragraph position="3"> Application to the right is defined by the rule [8]:

[x: X, l: L, r: ⟨A⟩·R]   A   →   [x: X, l: L, r: R]

Application to the left is defined by the rule:

A   [x: X, l: ⟨A⟩·L, r: R]   →   [x: X, l: L, r: R]

</Paragraph> <Paragraph> Footnote 8: · is list concatenation, e.g. ⟨np⟩·⟨s⟩ equals ⟨np, s⟩.</Paragraph>
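<Paragraph> The two application rules transcribe directly into code. The following is an illustrative sketch under the obvious representation (ours, not the paper's implementation):

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Cat:
    x: str                      # primitive syntactic type
    l: Tuple['Cat', ...] = ()   # categories expected to the left
    r: Tuple['Cat', ...] = ()   # categories expected to the right

def apply_right(f: Cat, a: Cat):
    """[x:X, l:L, r:<A>.R]  A  =>  [x:X, l:L, r:R]"""
    return Cat(f.x, f.l, f.r[1:]) if f.r and f.r[0] == a else None

def apply_left(a: Cat, f: Cat):
    """A  [x:X, l:<A>.L, r:R]  =>  [x:X, l:L, r:R]"""
    return Cat(f.x, f.l[1:], f.r) if f.l and f.l[0] == a else None

NP, PP = Cat('np'), Cat('pp')
put = Cat('s', l=(NP,), r=(NP, PP))   # the ditransitive entry from section 2
step1 = apply_right(put, NP)          # 'put it'            -> [s, l:<np>, r:<pp>]
step2 = apply_right(step1, PP)        # 'put it there'      -> [s, l:<np>, r:<>]
step3 = apply_left(NP, step2)         # 'John put it there' -> s
assert step3 == Cat('s')              # one argument absorbed at a time

</Paragraph>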
<Paragraph position="4"> The basic grammar provides some spurious derivations, since sentences such as John likes Mary can be bracketed as either ((John likes) Mary) or (John (likes Mary)). However, we will see that these spurious derivations do not translate into spurious ambiguity in the parser, which maps from strings of words directly to semantic representations.</Paragraph> </Section> <Section position="5" start_page="121" end_page="124" type="metho"> <SectionTitle> 4 An Incremental Parser </SectionTitle> <Paragraph position="0"> Most parsers which work left to right along an input string can be described in terms of state transitions, i.e. by rules which say how the current parsing state (e.g. a stack of categories, or a chart) can be transformed by the next word into a new state. Here this will be made particularly explicit, with the parser described in terms of just two rules which take a state and a new word and create a new state [9]. There are two unusual features. Firstly, there is nothing equivalent to a stack mechanism: at all times the state is characterised by a single syntactic type and a single semantic value, not by some stack of semantic values or syntax trees which are waiting to be connected together. Secondly, all transitions between states occur on the input of a new word: there are no 'empty' transitions (such as the reduce step of a shift-reduce parser).</Paragraph> <Paragraph> Footnote 9: This approach is described in greater detail in Milward (1994), where parsers are specified formally in terms of their dynamics.</Paragraph> <Paragraph position="1"> The two rules, which are given in Figure 1 [10], are difficult to understand in their most general form. Here we will work up to the rules gradually, by considering which kinds of rules we might need in particular instances. Consider the following pairing of sentence fragments with their simplest possible types:

John : s/(np\s)    John likes : s/np    John likes Sue : s

Now consider taking each type as a description of the state that the parser is in after absorbing the fragment. We obtain a sequence of transitions as follows:

"John"  "likes"  "Sue":   s/s → s/(np\s) → s/np → s

</Paragraph> <Paragraph> Footnote 10: Li, Ri and Hi are lists of categories; li and ri are lists of variables of the same length as the corresponding lists.</Paragraph> <Paragraph position="5"> If an embedded sentence such as John likes Sue is a mapping from an s/s to an s, this suggests that it might be possible to treat all sentences as mapping from some category expecting an s to that category, i.e. from X/s to X. Similarly, all noun phrases might be treated as mappings from an X/np to an X.</Paragraph> <Paragraph position="6"> Now consider individual transitions. The simplest of these is where the type of argument expected by the state is matched by the next word, i.e.

"Sue":   s/np → s,   where Sue: np

This can be generalised to the following rule, which is similar to Function Application in standard CG [11]:

"W":   X/Y → X,   where W: Y

</Paragraph> <Paragraph> Footnote 11: It differs in not being a rule of grammar: here the functor is a state category and the argument is a lexical category. In standard CG function application, the functor and argument can correspond to a word or a phrase.</Paragraph> <Paragraph position="9"> A similar transition occurs for likes. Here an np\s was expected, but likes only provides part of this: it requires an np to the right to form an np\s. Thus after likes is absorbed the state category will need to expect an np. The rule required is similar to Function Composition in CG, i.e.

"W":   X/Y → X/Z,   where W: Y/Z

Considering this informally in terms of tree structures, what is happening is the replacement of an empty node in a partial tree by a second partial tree. [tree diagrams omitted]</Paragraph> <Paragraph position="14"> The two rules specified so far need to be further generalised to allow for the case where a lexical item has more than one argument (e.g. if we replace likes by a di-transitive such as gives or a tri-transitive such as bets). This is relatively trivial using a non-curried notation similar to that used for AACG. What we obtain is the single rule of State-Application, which corresponds to application when the list of arguments, R1, is empty, to function composition when R1 is of length one, and to n-ary composition when R1 is of length n. The only change needed from AACG notation is the inclusion of an extra feature list, the h-list, which stores information about which arguments are waiting for a head (the reasons for this will be explained later).</Paragraph>
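<Paragraph> As a sketch of State-Application (our simplification of the rule in Figure 1, which is not reproduced here; the h-lists are omitted and all names are assumed): the state expects an argument Y next, and the word supplies a Y that is itself still missing right arguments R1, which are prepended to what the state expects:

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Cat:
    x: str
    l: Tuple['Cat', ...] = ()
    r: Tuple['Cat', ...] = ()

def state_application(state: Cat, word: Cat):
    """State [X, l:L, r:<Y>.R2] + word [Y.x, l:Y.l, r:R1]
       => state [X, l:L, r:R1.R2].
       R1 empty: application; length one: composition; length n: n-ary."""
    if not state.r:
        return None
    y = state.r[0]
    if y.r == () and word.x == y.x and word.l == y.l:
        return Cat(state.x, state.l, word.r + state.r[1:])
    return None

NP = Cat('np')
state = Cat('s', r=(Cat('s', l=(NP,)),))   # after 'John': s/(np\s)
likes = Cat('s', l=(NP,), r=(NP,))         # lexical entry for 'likes'
state = state_application(state, likes)    # -> s/np
state = state_application(state, NP)       # 'Sue' -> s
assert state == Cat('s')

</Paragraph>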
<Paragraph position="17"> The lexicon is identical to that for a standard AACG, except for having h-lists which are always set to empty.</Paragraph> <Paragraph position="18"> Now consider the first transition. Here a sentence was expected, but what was encountered was a noun phrase, John. The appropriate rule in CG notation would be:

"W":   X/Y → X/(Z\Y),   where W: Z

This rule states that if we are looking for a Y and get a Z, we then look for a Y which is missing a Z. In tree structure terms we have: [tree diagrams omitted]</Paragraph> <Paragraph position="20"> The rule of State-Prediction is obtained by further generalising to allow the lexical item to have missing arguments, and for the expected argument to have missing arguments.</Paragraph> <Paragraph position="21"> State-Application and State-Prediction together provide the basis of a sound and complete parser [12]. Parsing of sentences is achieved by starting in a state expecting a sentence, and applying the rules non-deterministically as each word is input. A successful parse is achieved if the final state expects no more arguments. As an example, reconsider the string John likes Sue. The sequence of transitions corresponding to John likes Sue being a sentence is given in Figure 2.</Paragraph> <Paragraph> Footnote 12: The parser accepts the same strings as the grammar and assigns them the same semantic values. This is slightly different from the standard notion of soundness and completeness of a parser, where the parser accepts the same strings as the grammar and assigns them the same syntax trees.</Paragraph> <Paragraph position="22"> The transition on encountering John is deterministic: State-Application cannot apply, and State-Prediction can only be instantiated one way. The result is a new state expecting an argument which, given an np, could give an s, i.e. an np\s.</Paragraph> <Paragraph position="24"> The transition on input of likes is non-deterministic. State-Application can apply, as in Figure 2. However, State-Prediction can also apply, and can be instantiated in four ways (these correspond to different ways of cutting up the left and right subcategorisation lists of the lexical entry likes, i.e. splitting each of the lists ⟨np⟩ as either ⟨np⟩·⟨⟩ or ⟨⟩·⟨np⟩).</Paragraph> <Paragraph position="25"> One possibility corresponds to the prediction of an s\s modifier, a second to the prediction of an (np\s)\(np\s) modifier (i.e. a verb phrase modifier), a third to there being a function which takes the subject and the verb as separate arguments, and the fourth corresponds to there being a function which requires an s/np argument. The second of these is perhaps the most interesting, and is given in Figure 3. It is the choice of this particular transition at this point which allows verb phrase modification, and hence, assuming the next word is Sue, an implicit bracketing of the string fragment as (John (likes Sue)). Note that if State-Application is chosen, or the first of the State-Prediction possibilities, the fragment John likes Sue retains a flat structure. If there is to be no modification of the verb phrase, no verb phrase structure is introduced. This relates to there being no spurious ambiguity: each choice of transition has semantic consequences; each choice affects whether a particular part of the semantics is to be modified or not.</Paragraph>
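<Paragraph> State-Prediction in its simplest instantiation can be sketched as follows (again our reconstruction without h-lists; the full rule of Figure 1 also splits the word's own argument lists, which is what yields the four instantiations for likes discussed above). Cat and state_application repeat the previous sketch so that this block runs on its own:

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Cat:
    x: str
    l: Tuple['Cat', ...] = ()
    r: Tuple['Cat', ...] = ()

def state_application(state: Cat, word: Cat):
    if not state.r:
        return None
    y = state.r[0]
    if y.r == () and word.x == y.x and word.l == y.l:
        return Cat(state.x, state.l, word.r + state.r[1:])
    return None

def state_prediction(state: Cat, word: Cat):
    """Looking for Y and getting W:Z, look instead for a Y missing a Z:
       X/Y + W:Z => X/(Z\\Y), with Z\\Y in non-curried form."""
    if not state.r:
        return None
    y = state.r[0]
    new_arg = Cat(y.x, (word,) + y.l, y.r)
    return Cat(state.x, state.l, (new_arg,) + state.r[1:])

NP = Cat('np')
state = Cat('s', r=(Cat('s'),))           # initial state: expecting an s (s/s)
state = state_prediction(state, NP)       # 'John'  -> s/(np\s)
state = state_application(state, Cat('s', l=(NP,), r=(NP,)))  # 'likes' -> s/np
state = state_application(state, NP)      # 'Sue'   -> s
assert state == Cat('s')                  # final state expects no more arguments

</Paragraph>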
<Paragraph position="26"> Finally, it is worth noting why it is necessary to use h-lists. These are needed to distinguish between cases of real functional arguments (of functions of functions) and functions formed by State-Prediction. Consider the following trees, where the np\s node is empty. [tree diagrams omitted] Both trees have the same syntactic type; however, in the first case we want to allow for there to be an s\s modifier of the lower s, but not in the second. The headed list distinguishes between the two cases, with only the first having an np on its h-list.</Paragraph> </Section> <Section position="6" start_page="124" end_page="125" type="metho"> <SectionTitle> 5 Parsing Lexicalised Grammars </SectionTitle> <Paragraph position="0"> When we consider full sentence processing, as opposed to incremental processing, the use of lexicalised grammars has a major advantage over the use of more standard rule-based grammars. In processing a sentence using a lexicalised formalism we do not have to look at the grammar as a whole, but only at the grammatical information indexed by each of the words. Thus increases in the size of a grammar don't necessarily affect efficiency of processing, provided the increase in size is due to the addition of new words, rather than increased lexical ambiguity. Once the full set of possible lexical entries for a sentence is collected, they can, if required, be converted back into a set of phrase structure rules (which should correspond to a small subset of the rule-based formalism equivalent to the whole lexicalised grammar), before being parsed with a standard algorithm such as Earley's (Earley 1970).</Paragraph> <Paragraph position="1"> In incremental parsing we cannot predict which words will appear in the sentence, so we cannot use the same technique. However, if we are to base a parser on the rules given above, it would seem that we gain further: instead of grammatical information being localised to the sentence as a whole, it is localised to a particular word in its particular context. There is no need to consider a pp as a start of a sentence if it occurs at the end, even if there is a verb with an entry which allows for a subject pp.</Paragraph> <Paragraph position="2"> However, there is a major problem. As we noted in the last paragraph, it is the nature of parsing incrementally that we don't know what words are to come next. But here the parser doesn't even use the information that the words are to come from a lexicon for a particular language. For example, given an input of 3 nps, the parser will happily create a state expecting 3 nps to the left. This might be a likely state for, say, a head-final language, but an unlikely state for a language such as English. Note that incremental interpretation will be of no use here, since the semantic representation should be no more or less plausible in the different languages. In practical terms, a naive interactive parallel Prolog implementation on a current workstation fails to be interactive in a real sense after about 8 words [13].</Paragraph> <Paragraph> Footnote 13: This result should however be treated with some caution: in this implementation there was no attempt to perform any packing of different possible transitions, and the algorithm has exponential complexity. In contrast, a packed recogniser based on a similar, but much simpler, incremental parser for Lexicalised Dependency Grammar has O(n³) time complexity (Milward 1994) and good practical performance, taking a couple of seconds on 30 word sentences.</Paragraph> <Paragraph position="3"> What seems to be needed is some kind of language tuning [14]. This could be in the nature of fixed restrictions to the rules, e.g. for English we might rule out uses of prediction when a noun phrase is encountered and two already exist on the left list.</Paragraph> <Paragraph> Footnote 14: The usage of the term language tuning is perhaps broader here than its use in the psycholinguistic literature, where it refers to different structural preferences between languages, e.g. for high versus low attachment (Mitchell et al. 1992).</Paragraph> <Paragraph position="4"> A more appealing alternative is to base the tuning on statistical methods. This could be achieved by running the parser over corpora to provide probabilities of particular transitions given particular words. These transitions would capture the likelihood of a word having a particular part of speech, and the probability of a particular transition being performed with that part of speech.</Paragraph>
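<Paragraph> One way to picture such tuning is to score each candidate transition with a corpus-derived probability and keep only the best states at each word. This is a purely hypothetical sketch (the paper reports no such implementation); transitions and prob are assumed interfaces:

from heapq import nlargest

def step_beam(states, word, transitions, prob, n=10):
    """Advance a beam of (state, logprob) pairs over one input word.
    'transitions' is a list of transition functions such as
    State-Application and the State-Prediction instantiations;
    'prob' is an assumed model giving log P(transition | word)."""
    successors = []
    for state, logp in states:
        for rule in transitions:
            new_state = rule(state, word)
            if new_state is not None:
                successors.append((new_state, logp + prob(rule, word)))
    # keep the n highest-ranked paths (the 'ranked parallel' strategy below)
    return nlargest(n, successors, key=lambda pair: pair[1])

</Paragraph>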
<Paragraph position="5"> There has already been some early work done on providing statistically based parsing using transitions between recursively structured syntactic categories (Tugwell 1995) [15]. Unlike a simple Markov process, there are a potentially infinite number of states, so there is inevitably a problem of sparse data. It is therefore necessary to make various generalisations over the states, for example by ignoring the R2 lists.</Paragraph> <Paragraph position="8"> The full processing model can then be either serial, exploring the most highly ranked transitions first (but allowing backtracking if the semantic plausibility of the current interpretation drops too low), or ranked parallel, exploring just the n paths ranked highest according to the transition probabilities and semantic plausibility.</Paragraph> </Section> </Paper>