<?xml version="1.0" standalone="yes"?>
<Paper uid="J93-1002">
  <Title>Generalized Probabilistic LR Parsing of Natural Language (Corpora) with Unification-Based Grammars</Title>
  <Section position="2" start_page="0" end_page="26" type="metho">
    <SectionTitle>
1. Wide-Coverage Parsing of Natural Language
</SectionTitle>
    <Paragraph position="0"> The task of syntactically analyzing substantial corpora of naturally occurring text and transcribed speech has become a focus of recent work. Analyzed corpora would be of great benefit in the gathering of statistical data regarding language use, for example to train speech recognition devices, in more general linguistic research, and as a first step toward robust wide-coverage semantic interpretation. The Alvey Natural Language Tools (ANLT) system is a wide-coverage lexical, morphological, and syntactic analysis system for English (Briscoe et al. 1987). Previous work has demonstrated that the ANLT system is, in principle, able to assign the correct parse to a high proportion of English noun phrases drawn from a variety of corpora. The goal of the work reported here is to develop a practical parser capable of returning probabilistically highly ranked analyses (from the usually large number of syntactically legitimate possibilities) for material drawn from a specific corpus on the basis of minimal (supervised) training and manual modification.</Paragraph>
    <Paragraph position="1"> * University of Cambridge, Computer Laboratory, Pembroke Street, Cambridge, CB2 3QG, UK. (Tel. +44-223-334600), (ejb / jac @cl.cam.ac.uk). © 1993 Association for Computational Linguistics. Computational Linguistics, Volume 19, Number 1. The first issue to consider is what the analysis will be used for and what constraints this places on its form. The corpus analysis literature contains a variety of proposals, ranging from part-of-speech tagging to assignment of a unique, sophisticated syntactic analysis. Our eventual goal is to recover a semantically and pragmatically appropriate syntactic analysis capable of supporting semantic interpretation. Two stringent requirements follow immediately: firstly, the analyses assigned must determinately represent the syntactic relations that hold between all constituents in the input; secondly, they must be drawn from an a priori defined, well-formed set of possible syntactic analyses (such as the set defined by a generative grammar). Otherwise, semantic interpretation of the resultant analyses cannot be guaranteed to be (structurally) unambiguous, and the semantic operations defined (over syntactic configurations) cannot be guaranteed to match and yield an interpretation. These requirements immediately suggest that approaches that recover only lexical tags (e.g. de Rose 1988) or a syntactic analysis that is the 'closest fit' to some previously defined set of possible analyses (e.g. Sampson, Haigh, and Atwell 1989) are inadequate (taken alone). Pioneering approaches to corpus analysis proceeded on the assumption that computationally tractable generative grammars of sufficiently general coverage could not be developed (see, for example, papers in Garside, Leech, and Sampson 1987). However, the development of wide-coverage declarative and computationally tractable grammars makes this assumption questionable. For example, the ANLT word and sentence grammar (Grover et al.
1989; Carroll and Grover 1989) consists of an English lexicon of approximately 40,000 lexemes and a 'compiled' fixed-arity term unification grammar containing around 700 phrase structure rules. Taylor, Grover, and Briscoe (1989) demonstrate that an earlier version of this grammar was capable of assigning the correct analysis to 96.8% of a corpus of 10,000 noun phrases extracted (without regard for their internal form) from a variety of corpora. However, although Taylor, Grover, and Briscoe show that the ANLT grammar has very wide coverage, they abstract away from issues of lexical idiosyncrasy by forming equivalence classes of noun phrases and parsing a single token of each class, and they do not address the issues of 1) tuning a grammar to a particular corpus or sublanguage, 2) selecting the correct analysis from the set licensed by the grammar, and 3) providing reliable analyses of input outside the coverage of the grammar. Firstly, it is clear that the vocabulary, idiom, and conventionalized constructions used in, say, legal language and dictionary definitions will differ in terms of both the range and frequency of the words and constructions deployed. Secondly, Church and Patil (1982) demonstrate that for a realistic grammar parsing realistic input, the set of possible analyses licensed by the grammar can be in the thousands. Finally, it is extremely unlikely that any generative grammar will ever be capable of correctly analyzing all naturally occurring input, even when tuned for a particular corpus or sublanguage (if only because of the synchronic idealization implicit in the assumption that the set of grammatical sentences of a language is well formed). In this paper, we describe our approach to the first and second problems and make some preliminary remarks concerning the third (far harder) problem.
    Our approach to grammar tuning is based on a semi-automatic parsing phase during which additions to the grammar are made manually and statistical information concerning the frequency of use of grammar rules is acquired. Using this statistical information and modified grammar, a breadth-first probabilistic parser is constructed. The latter is capable of ranking the possible parses identified by the grammar in a useful (and efficient) manner. However, (unseen) sentences whose correct analysis is outside the coverage of the grammar remain a problem. The feasibility and usefulness of our approach has been investigated in a preliminary way by analyzing a small corpus of noun definitions drawn from the Longman Dictionary of Contemporary English (LDOCE) (Procter 1978). This corpus was chosen because the vocabulary employed is restricted (to approximately 2,000 morphemes), average definition length is about 10 words (with a maximum of around 30), and each definition is independent, allowing us to ignore phenomena such as ellipsis. In addition, the language of definitions represents a recognizable sublanguage, allowing us to explore the task of tuning a general purpose grammar. The results reported below suggest that probabilistic information concerning the frequency of occurrence of syntactic rules correlates in a useful (though not absolute) way with the semantically and pragmatically most plausible analysis.</Paragraph>
    <Paragraph position="2"> In Section 2, we briefly review extant work on probabilistic approaches to corpus analysis and parsing and argue the need for a more refined probabilistic model to distinguish distinct derivations. Section 3 discusses work on LR parsing of natural language and presents our technique for automatic construction of LR parsers for unification-based grammars. Section 4 presents the method and results for constructing a LALR(1) parse table for the ANLT grammar and discusses these in the light of both computational complexity and other empirical results concerning parse table size and construction time. Section 5 motivates our interactive and incremental approach to semi-automatic production of a disambiguated training corpus and describes the variant of the LR parser used for this task. Section 6 describes our implementation of a breadth-first LR parser and compares its performance empirically to a highly optimized chart parser for the same grammar, suggesting that (optimized) LR parsing is more efficient in practice for the ANLT grammar despite exponential worst case complexity results. Section 7 explains the technique we employ for deriving a probabilistic version of the LR parse table from the training corpus, and demonstrates that this leads to a more refined and parse-context-dependent probabilistic model capable of distinguishing derivations that in a probabilistic context-free model would be equally probable. Section 8 describes and presents the results of our first experiment parsing LDOCE noun definitions, and Section 9 draws some preliminary conclusions and outlines ways in which the work described should be modified and extended.</Paragraph>
  </Section>
  <Section position="3" start_page="26" end_page="30" type="metho">
    <SectionTitle>
2. Probabilistic Approaches to Parsing
</SectionTitle>
    <Paragraph position="0"> In the field of speech recognition, statistical techniques based on hidden Markov modeling are well established (see e.g. Holmes 1988:129f for an introduction). The two main algorithms utilized are the Viterbi (1967) algorithm and the Baum-Welch algorithm (Baum 1972). These algorithms provide polynomial solutions to the tasks of finding the most probable derivation for a given input and a stochastic regular grammar, and of performing iterative re-estimation of the parameters of a (hidden) stochastic regular grammar by considering all possible derivations over a corpus of inputs, respectively. Baker (1982) demonstrates that Baum-Welch re-estimation can be extended to context-free grammars (CFGs) in Chomsky Normal Form (CNF). Fujisaki et al. (1989) demonstrate that the Viterbi algorithm can be used in conjunction with the CYK parsing algorithm and a CFG in CNF to efficiently select the most probable derivation of a given input. Kupiec (1991) extends Baum-Welch re-estimation to arbitrary (non-CNF) CFGs. Baum-Welch re-estimation can be used with restricted or unrestricted grammars/models in the sense that some of the parameters corresponding to possible productions over a given (non-)terminal category set/set of states can be given an initial probability of zero. Unrestricted grammars/models quickly become impractical because the number of parameters requiring estimation becomes large and these algorithms are polynomial in the length of the input and number of free parameters.</Paragraph>
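As a concrete illustration of the Viterbi/CYK combination mentioned above, the following sketch finds the most probable derivation for a CNF PCFG. The grammar, probabilities, and function name are invented for this example and are not drawn from the systems cited.

```python
from collections import defaultdict

def viterbi_cyk(words, lexical, binary, start="S"):
    """Most probable derivation for a CNF PCFG (Viterbi over a CYK chart).

    lexical: {(A, word): p} for rules A -> word
    binary:  {(A, B, C): p} for rules A -> B C
    Returns (probability, tree), or (0.0, None) if there is no parse.
    """
    n = len(words)
    # best[i][j][A] = (prob, tree) for the best A spanning words[i:j]
    best = [[defaultdict(lambda: (0.0, None)) for _ in range(n + 1)]
            for _ in range(n + 1)]
    for i, w in enumerate(words):
        for (cat, word), p in lexical.items():
            if word == w and p > best[i][i + 1][cat][0]:
                best[i][i + 1][cat] = (p, (cat, w))
    for span in range(2, n + 1):          # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):     # split point
                for (a, b, c), p in binary.items():
                    pb, tb = best[i][k][b]
                    pc, tc = best[k][j][c]
                    if pb and pc:
                        q = p * pb * pc
                        if q > best[i][j][a][0]:
                            best[i][j][a] = (q, (a, tb, tc))
    return best[0][n][start]
```

The chart stores only the best-scoring subtree per category and span, which is what distinguishes Viterbi decoding from summing over all derivations.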
    <Paragraph position="1"> Typically, in applications of Markov modeling in speech recognition, the derivation used to analyze a given input is not of interest; rather what is sought is the best (most likely) model of the input. In any application of these or similar techniques to parsing, though, the derivation selected is of prime interest. Baum (1972) proves that Baum-Welch re-estimation will converge to a local optimum in the sense that the initial probabilities will be modified to increase the likelihood of the corpus given the grammar and 'stabilize' within some threshold after a number of iterations over the training corpus. However, there is no guarantee that the global optimum will be found, and the a priori initial probabilities chosen are critical for convergence on useful probabilities (e.g. Lari and Young 1990). The main application of these techniques to written input has been in the robust lexical tagging of corpora with part-of-speech labels (e.g. Garside, Leech, and Sampson 1987; de Rose 1988; Meteer, Schwartz, and Weischedel 1991; Cutting et al. 1992).</Paragraph>
    <Paragraph position="2"> Fujisaki et al. (1989) describe a corpus analysis experiment using a probabilistic CNF CFG containing 7550 rules on a corpus of 4206 sentences (with an average sentence length of approximately 11 words). The unsupervised training process involved automatically assigning probabilities to each CF rule on the basis of its frequency of occurrence in all possible analyses of each sentence of the corpus. These probabilities were iteratively re-estimated using a variant of the Baum-Welch algorithm, and the Viterbi algorithm was used in conjunction with the CYK parsing algorithm to efficiently select the most probable analysis after training. Thus the model was restricted in that many of the possible parameters (rules) defined over the (non-)terminal category set were initially set to zero and training was used only to estimate new probabilities for a set of predefined rules. Fujisaki et al. suggest that the stable probabilities will model semantic and pragmatic constraints in the corpus, but this will only be so if these correlate with the frequency of rules in correct analyses, and also if the 'noise' in the training data created by the incorrect parses is effectively factored out. Whether this is so will depend on the number of 'false positive' examples with only incorrect analyses, the degree of heterogeneity in the training corpus, and so forth. Fujisaki et al. report some results based on testing the parser on the corpus used for training.</Paragraph>
    <Paragraph position="3"> In 72 out of 84 sentences examined, the most probable analysis was also the correct analysis. Of the remainder, 6 were false positives and did not receive a correct parse, while the other 6 did but it was not the most probable. A success rate (per sentence) of 85% is apparently impressive, but it is difficult to evaluate properly in the absence of full details concerning the nature of the corpus. For example, if the corpus contains many simple and similar constructions, unsupervised training is more likely to converge quickly on a useful set of probabilities.</Paragraph>
    <Paragraph position="4"> Sharman, Jelinek, and Mercer (1990) conducted a similar experiment with a grammar in ID/LP format (Gazdar et al. 1985; Sharman 1989). ID/LP grammars separate the two types of information encoded in CF rules--immediate dominance and immediate precedence--into two rule types that together define a CFG. This allows probabilities concerning dominance, associated with ID rules, to be factored out from those concerning precedence, associated with LP rules. In this experiment, a supervised training regime was employed. A grammar containing 100 terminals and 16 nonterminals and initial probabilities based on the frequency of ID and LP relations was extracted from a manually parsed corpus of about one million words of text. The resulting probabilistic ID/LP grammar was used to parse 42 sentences of 30 words or less drawn from the same corpus. In addition, lexical syntactic probabilities were integrated with the probability of the ID/LP relations to rank parses. Eighteen of the parses were identical to the original manual analyses, while a further 19 were 'similar,' yielding a success rate of 88%. What is noticeable about this experiment is that the results are no better than Fujisaki et al.'s unsupervised training experiment discussed above, despite the use of supervised training and a more sophisticated grammatical model. It is likely that these differences derive from the corpus material used for training and testing, and that the results reported by Fujisaki et al. will not be achieved with all corpora.</Paragraph>
    <Paragraph position="5"> Pereira and Schabes (1992) report an experiment using Baum-Welch re-estimation to infer a grammar and associated rule probabilities from a category set containing 15 nonterminals and 48 terminals, corresponding to the Penn Treebank lexical tagset (Santorini 1990). The training data was 770 sentences, represented as tag sequences, drawn from the treebank. They trained the system in an unsupervised mode and also in a 'semi-supervised' mode, in which the manually parsed version of the corpus was used to constrain the set of analyses used during re-estimation. In supervised training, analyses were accepted if they produced bracketings consistent, but not necessarily identical, with those assigned manually. They demonstrate that in supervised mode, training not only converges faster but also results in a grammar in which the most probable analysis is compatible with the manually assigned analysis of further test sentences drawn from the treebank in a much greater percentage of cases--78% as opposed to 35%. This result indicates very clearly the importance of supervised training, particularly in a context where the grammar itself is being inferred in addition to the probability of individual rules.</Paragraph>
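Pereira and Schabes' acceptance criterion (bracketings 'consistent but not necessarily identical' with the manual ones) is usually read as the absence of crossing brackets. A minimal sketch of that check, assuming spans are given as (start, end) word-index pairs:

```python
def consistent(candidate, reference):
    """True iff no candidate span crosses a reference span.

    Two spans cross when they overlap only partially, so neither
    contains the other; nested or disjoint spans are compatible.
    """
    def crosses(a, b):
        return ((b[0] > a[0] and a[1] > b[0] and b[1] > a[1]) or
                (a[0] > b[0] and b[1] > a[0] and a[1] > b[1]))
    return not any(crosses(c, r) for c in candidate for r in reference)
```

Under this reading, an analysis that adds extra (nested) brackets is still accepted, which is why supervised training here is less strict than requiring identity with the treebank.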
    <Paragraph position="6"> In our work, we are concerned to utilize the existing wide-coverage ANLT grammar; therefore, we have concentrated initially on exploring how an adequate probabilistic model can be derived for a unification-based grammar and trained in a supervised mode to effectively select useful analyses from the large space of syntactically legitimate possibilities. There are several inherent problems with probabilistic CFG (including ID/LP)-based systems. Firstly, although CFG is an adequate model of the majority of constructions occurring in natural language (Gazdar and Mellish 1989), it is clear that wide-coverage CFGs will need to be very large indeed, and this will lead to difficulties of (manual) development of consistent grammars and, possibly, to computational intractability at parse time (particularly during the already computationally expensive training phase). Secondly, associating probabilities with CF rules means that information about the probability of a rule applying at a particular point in a parse derivation is lost. This leads to complications distinguishing the probability of different derivations when the same rule can be applied several times in more than one way. Grammar 1 below is an example of a probabilistic CFG, in which each production is associated with a probability and the probabilities of all rules expanding a given nonterminal category sum to one.</Paragraph>
    <Paragraph position="7">  The probability of a particular parse is the product of the probabilities of each rule used in the derivation. Thus the probability of parse a) in Figure 1 is 0.0336. The probabilities of parses b) and c), however, must be identical (0.09), because the same rule is applied twice in each case. Similarly, the probabilities of d) and e) are also identical (0.09) for essentially the same reason. However, these rules are natural treatments of noun compounding and prepositional phrase (PP) attachment in English, and the different derivations correlate with different interpretations. For example, b) would be an appropriate analysis for toy coffee grinder, while c) would be appropriate for cat food tin, and each of d) and e) yields one of the two possible interpretations of the man in the park with the telescope. We want to keep these structural configurations probabilistically distinct in case there are structurally conditioned differences in their frequency of occurrence; as would be predicted, for example, by the theory of parsing strategies (e.g.</Paragraph>
    <Paragraph position="8"> Frazier 1988). Fujisaki et al. (1989) propose a rather inelegant solution for the noun compound case, which involves creating 5582 instances of 4 morphosyntactically identical rules for classes of word forms with distinct bracketing behavior in noun-noun compounds. However, we would like to avoid enlarging the grammar and eventually to integrate probabilistic lexical information with probabilistic structural information in a more modular fashion.</Paragraph>
    <Paragraph position="9"> Probabilistic CFGs also will not model the context dependence of rule use; for example, an NP is more likely to be expanded as a pronoun in subject position than elsewhere (e.g. Magerman and Marcus 1991), but only one global probability can be associated with the relevant CF production. Thus the probabilistic CFG model predicts (incorrectly) that a) and f) will have the same probability of occurrence. These considerations suggest that we need a technique that allows use of a more adequate grammatical formalism than CFG and a more context-dependent probabilistic model.</Paragraph>
    <Paragraph position="10"> Our approach is to use the LR parsing technique as a natural way to obtain a finite-state representation of a non-finite-state grammar incorporating information about parse context. In the following sections, we introduce the LR parser and in Section 8 we demonstrate that LR parse tables do provide an appropriate amount of contextual information to solve the problems described above.</Paragraph>
  </Section>
  <Section position="4" start_page="30" end_page="38" type="metho">
    <SectionTitle>
3. LR Parsing in a Unification-Based Grammar Framework
</SectionTitle>
    <Paragraph position="0"> The heart of the LR parsing technique is the parse table construction algorithm, which is the most complex and computationally expensive aspect of LR parsing. Much of the attraction of the technique stems from the fact that the real work takes place in a precompilation phase and the run time behavior of the resulting parser is relatively simple and directed. An LR parser finds the 'rightmost derivation in reverse' for a given string and CF grammar. The precompilation process results in a parser control mechanism that enables the parser to identify the 'handle,' or appropriate substring in the input to reduce, and the appropriate rule of the grammar with which to perform the reduction. The control information is standardly encoded as a parse table with rows representing parse states, and columns terminal and nonterminal symbols of the grammar. This representation defines a finite-state automaton. Figure 2 gives the LALR(1) parse table for Grammar 1. (LALR(1) is the most commonly used variant of LR since it usually provides the best trade-off between directed rule invocation and parse table size.) If the grammar is in the appropriate LR class (a stronger restriction than being an unambiguous CFG), the automaton will be deterministic; however, some algorithms for parse table construction are also able to build nondeterministic automata containing action conflicts for ambiguous CFGs. Parse table construction is discussed further in Section 4.</Paragraph>
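To make the run-time side concrete, here is a toy deterministic LR driver over a hand-built SLR(1) table for the tiny bracket grammar S -> ( S ) and S -> x. This is an illustration only; it is not the Grammar 1 table of Figure 2, and the states and actions were worked out by hand for this example:

```python
# ACTION[state][token] is ('shift', state), ('reduce', lhs, rhs_len),
# or ('accept',); GOTO[state][nonterminal] gives the state after a reduce.
ACTION = {
    0: {'(': ('shift', 2), 'x': ('shift', 3)},
    1: {'$': ('accept',)},
    2: {'(': ('shift', 2), 'x': ('shift', 3)},
    3: {')': ('reduce', 'S', 1), '$': ('reduce', 'S', 1)},  # S -> x
    4: {')': ('shift', 5)},
    5: {')': ('reduce', 'S', 3), '$': ('reduce', 'S', 3)},  # S -> ( S )
}
GOTO = {0: {'S': 1}, 2: {'S': 4}}

def lr_parse(tokens):
    """Deterministic LR driver: the reductions it performs, in order,
    spell out the rightmost derivation in reverse."""
    stack, reductions = [0], []
    toks = list(tokens) + ['$']
    i = 0
    while True:
        act = ACTION[stack[-1]].get(toks[i])
        if act is None:
            raise ValueError("parse error at token %r" % toks[i])
        if act[0] == 'accept':
            return reductions
        if act[0] == 'shift':
            stack.append(act[1])
            i += 1
        else:  # reduce: pop |rhs| states, then goto on the mother category
            _, lhs, n = act
            del stack[-n:]
            reductions.append(lhs)
            stack.append(GOTO[stack[-1]][lhs])
```

The directedness mentioned above shows up in the table lookup: at every step a single indexed access tells the parser whether to shift, which rule to reduce by, or to signal an error.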
    <Section position="1" start_page="31" end_page="32" type="sub_section">
      <SectionTitle>
3.1 Creating LR Parse Tables from Unification Grammars
</SectionTitle>
      <Paragraph position="0"> Tomita (1987) describes a system for nondeterministic LR parsing of context-free grammars consisting of atomic categories, in which each CF production may be augmented with a set of tests (which perform similar types of operations to those available in a unification grammar). At parse time, whenever a sequence of constituents is about to be reduced into a higher-level constituent using a production, the augmentation associated with the production is invoked to check syntactic or semantic constraints such as agreement, pass attribute values between constituents, and construct a representation of the higher-level constituent. (This is the standard approach to parsing with attribute grammars.) The parser is driven by an LR parse table; however, the table is constructed solely from the CF portion of the grammar, and so none of the extra information embodied in the augmentations is taken into account during its construction. Thus the predictive power of the parser to select the appropriate rule given a specific parse history is limited to the CF portion of the grammar, which must be defined manually by the grammar writer. This requirement places a greater load on the grammar writer and is inconsistent with most recent unification-based grammar formalisms, which represent grammatical categories entirely as feature bundles (e.g. Gazdar et al. 1985; Pollard and Sag 1987; Zeevat, Calder, and Klein 1987). In addition, it violates the principle that grammatical formalisms should be declarative and defined independently of parsing procedure, since different definitions of the CF portion of the grammar will, at least, affect the efficiency of the resulting parser and might, in principle, lead to nontermination on certain inputs in a manner similar to that described by Shieber (1985).</Paragraph>
      <Paragraph position="1"> In what follows, we will assume that the unification-based grammars we are considering are represented in the ANLT object grammar formalism (Briscoe et al. 1987).</Paragraph>
      <Paragraph position="2"> This formalism is a notational variant of Definite Clause Grammar (e.g. Pereira and Warren 1980), in which rules consist of a mother category and one or more daughter categories, defining possible phrase structure configurations. Categories consist of sets of feature name-value pairs, with the possibility of variable values, which may be bound within a rule, and of category-valued features. Categories are combined using fixed-arity term unification (Prolog-style). The results and techniques we report below should generalize to many other unification-based formalisms. An example of a possible ANLT object grammar rule is:</Paragraph>
      <Paragraph position="4"> This rule provides a (simple) analysis of the structure of English clauses, corresponding to S --&gt; NP VP, using a feature system based loosely on that of GPSG (Gazdar et al.</Paragraph>
      <Paragraph position="5"> 1985). In Tomita's LR parsing framework, each such rule must be manually converted into a rule of the following form in which some subpart of each category has been replaced by an atomic symbol.</Paragraph>
      <Paragraph position="7"> However, it is not obvious which features should be so replaced--why not include BAR and CASE? It will be difficult for the grammar writer to make such substitutions in a consistent way, and still more difficult to make them in an optimal way for the purposes of LR parsing, since both steps involve consideration and comparison of all the categories mentioned in each rule of the grammar.</Paragraph>
      <Paragraph position="8"> Constructing the LR parse table directly and automatically from a unification grammar would avoid these drawbacks. In this case, the LR parse table would be based on complex categories, with unification of complex categories taking the place of equality of atomic ones in the standard LR parse table construction algorithm (Osborne 1990; Nakazawa 1991). However, this approach is computationally prohibitively expensive: Osborne (1990:26) reports that his implementation (in HP Common Lisp on a Hewlett Packard 9000/350) takes almost 24 hours to construct the LR(0) states for a unification grammar of just 75 productions.</Paragraph>
    </Section>
    <Section position="2" start_page="32" end_page="38" type="sub_section">
      <SectionTitle>
3.2 Constructing a CF Backbone from a Unification Grammar
</SectionTitle>
      <Paragraph position="0"> Our approach, described below, not only extracts unification information from complex categories, but is computationally tractable for realistic sized grammars and also safe from inconsistency. We start with a unification grammar and automatically construct a CF 'backbone' of rules containing categories with atomic names and an associated 'residue' of feature name-value pairs. Each backbone grammar rule is generally in direct one-to-one correspondence with a single unification grammar rule. The LR parse table is then constructed from the CF backbone grammar. The parser is driven by this table, but in addition when reducing a sequence of constituents the parser performs the unifications specified in the relevant unification grammar rule to form the category representing the higher-level constituent, and the derivation fails if one of the unifications fails. Our parser is thus similar to Tomita's (1987), except that it performs unifications rather than invoking CF rule augmentations; however, the main difference between our approach and Tomita's is the way in which the CF grammar that drives the parser comes into being.</Paragraph>
      <Paragraph position="1"> Even though a unification grammar will be, at best, equivalent to a very large (and at worst, if features are employed in recursive or cyclic ways, possibly infinite) set of atomic-category CF productions, in practice we have obtained LR parsers that perform well from backbone grammars containing only about 30% more productions than the original unification grammar. The construction method ensures that for any given grammar the CF backbone captures at least as much information as the optimal CFG that contains the same number of rules as the unification grammar. Thus the construction method guarantees that the resulting LR parser will terminate and will be as predictive as the source grammar in principle allows.</Paragraph>
      <Paragraph position="2"> Building the backbone grammar is a two-stage process:
1. Compute the largest maximally specific set (in terms of subsumption) of disjoint categories covering the whole grammar and assign to each category a distinct atomic category name. That is:
   initialize disjoint-set to be empty;
   for each category C in grammar:
      let disjoint-merge be the categories in disjoint-set which unify with C;
      if disjoint-merge is empty
         then add C to disjoint-set;
         else replace all elements of disjoint-merge in disjoint-set with the single most specific category which subsumes C and all categories in disjoint-merge;
   assign a distinct name to each category in disjoint-set.</Paragraph>
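Step 1 can be sketched as follows, simplifying categories to flat feature dicts in which a missing feature stands in for a variable (unconstrained) value; the ANLT itself uses fixed-arity term unification, so this is an approximation:

```python
def unify(c1, c2):
    """Simplified unifiability test on flat feature dicts:
    a missing feature is unconstrained; a value clash means failure."""
    return all(c1[f] == c2[f] for f in c1 if f in c2)

def generalize(cats):
    """Most specific category subsuming all of cats: keep only the
    feature-value pairs on which every category agrees."""
    first, rest = cats[0], cats[1:]
    return {f: v for f, v in first.items()
            if all(c.get(f) == v for c in rest)}

def disjoint_set(categories):
    """Merge mutually unifiable categories until all are pairwise
    disjoint, following the pseudocode above."""
    disjoint = []
    for c in categories:
        merge = [d for d in disjoint if unify(c, d)]
        for d in merge:
            disjoint.remove(d)
        disjoint.append(generalize([c] + merge) if merge else c)
    return disjoint
```

With two NP categories, one fully specified for CASE and one leaving it open, the CASE distinction collapses, while two V categories with clashing SUBCAT values stay apart, mirroring the Figure 4 behavior described below.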
      <Paragraph position="5"/>
      <Paragraph position="7"> [Figure 5: Backbone grammar corresponding to the object grammar.]
2. For each unification grammar rule, create a backbone grammar rule containing atomic categories, each atomic category being the name assigned to the category in the disjoint category set that unifies with the corresponding category in the unification grammar rule:
   for each rule K of form C1 --&gt; C2 ... Cn in unification grammar
      add a rule B of form B1 --&gt; B2 ... Bn to backbone grammar
      where Bi is the name assigned to the (single) category in disjoint-set which unifies with Ci, for i = 1 ... n.</Paragraph>
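Step 2 can be sketched in the same simplified flat-feature-dict setting; the category names and feature values below are invented stand-ins for the contents of Figures 3-5:

```python
def unifies(c1, c2):
    # Flat feature dicts; a missing feature is unconstrained.
    return all(c1[f] == c2[f] for f in c1 if f in c2)

def backbone_rules(unification_rules, named_disjoint_set):
    """Map each unification rule C1 -> C2 ... Cn to an atomic-named
    backbone rule B1 -> B2 ... Bn via the unique unifying member of
    the disjoint set.

    named_disjoint_set: {atomic name: category dict}
    unification_rules:  lists of category dicts, mother first.
    """
    def name_of(cat):
        names = [n for n, d in named_disjoint_set.items() if unifies(cat, d)]
        assert len(names) == 1, "disjoint set must cover the grammar uniquely"
        return names[0]
    return [[name_of(c) for c in rule] for rule in unification_rules]
```

Because the disjoint set is pairwise non-unifiable, each rule category matches exactly one named category, which is what makes the mapping well defined.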
      <Paragraph position="8"> For example, for the rules in Figure 3 (corresponding loosely to S --&gt; NP VP, VP --&gt; Vi and VP --&gt; Vt NP), step 1 would create the disjoint-set shown in Figure 4. (Note that the value for CASE on the NP categories in the grammar has 'collapsed' down to a variable, but that the two V categories remain distinct.) [Figure 6: Backbone parse tree for either kim or lee or sandy using rule N2 --&gt; N2 \[CONJ EITHER\], N2 \[CONJ OR\] +.]</Paragraph>
      <Paragraph position="9"> Figure 5 shows the backbone rules that would be built in step 2.</Paragraph>
      <Paragraph position="10"> Algorithms for creating LR parse tables assume that the terminal vocabulary of the grammar is distinct from the nonterminal one, so the procedure described above will not deal properly with a unification grammar rule whose mother category is assumed elsewhere in the grammar to be a lexical category. The modification we make is to automatically associate two different atomic categories, one terminal and one nonterminal, with such categories, and to augment the backbone grammar with a unary rule expanding the nonterminal category to the terminal.</Paragraph>
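The terminal/nonterminal modification just described can be sketched as follows; the "_nt" naming scheme and the example rules are made up for this illustration:

```python
def split_lexical_mothers(rules, lexical_cats):
    """For each lexical (terminal) category that also appears as a rule
    mother, introduce a nonterminal twin plus a unary bridge rule
    twin -> terminal, and use the twin wherever the category occurs in
    a rule, so terminal and nonterminal vocabularies stay disjoint."""
    doubled = {lhs for lhs, _ in rules if lhs in lexical_cats}
    def rename(c):
        return c + "_nt" if c in doubled else c
    out = [(c + "_nt", [c]) for c in sorted(doubled)]  # unary bridges
    out += [(rename(lhs), [rename(d) for d in rhs]) for lhs, rhs in rules]
    return out
```

After the transformation, the original lexical symbol appears only on the right-hand side of its bridge rule, so the table-construction algorithm sees it purely as a terminal.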
      <Paragraph position="11"> Two other aspects of the ANLT grammar formalism require further minor elaborations to the basic algorithm: firstly, a rule may introduce a gap by including the feature specification \[NULL +\] on the gapped daughter--for each such daughter an extra rule is added to the backbone grammar expanding the gap category to the null string; secondly, the formalism allows Kleene star and plus operators (Gazdar et al. 1985); in the ANLT grammar these operators are utilized in rules for coordination. A rule containing Kleene star daughters is treated as two rules: one omitting the daughters concerned and one with the daughters being Kleene plus. A new nonterminal category is created for each distinct Kleene plus category, and two extra rules are added to the backbone grammar to form a right-branching binary tree structure for it; a parser can easily be modified to flatten this out during processing into the intended flat sequence of categories. Figure 6 gives an example of what such a backbone tree looks like. Grammars written in other, more low-level unification grammar formalisms, such as PATR-II (Shieber 1984), commonly employ treatments of the type just described to deal with phenomena such as gapping, coordination, and compounding. However, this method both allows the grammar writer to continue to use the full facilities of the ANLT formalism and allows the algorithmic derivation of an appropriate backbone grammar to support LR parsing.</Paragraph>
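The Kleene star and plus treatment just described can be sketched as follows; the rule representation and the "_plus" naming are assumptions of this illustration:

```python
def expand_kleene(rules):
    """Eliminate Kleene operators from backbone rules.

    A daughter is either a plain category name, ("*", C), or ("+", C).
    Star: emit one rule omitting the daughter and one with it as plus.
    Plus: introduce a fresh nonterminal C_plus with the right-branching
    rules  C_plus -> C C_plus  and  C_plus -> C.
    """
    out, plus_seen, work = [], set(), list(rules)
    while work:
        lhs, rhs = work.pop(0)
        star = next((i for i, d in enumerate(rhs)
                     if isinstance(d, tuple) and d[0] == "*"), None)
        if star is not None:
            cat = rhs[star][1]
            work.append((lhs, rhs[:star] + rhs[star + 1:]))            # omit
            work.append((lhs, rhs[:star] + [("+", cat)] + rhs[star + 1:]))
            continue
        new_rhs = []
        for d in rhs:
            if isinstance(d, tuple):                    # ("+", C)
                cat = d[1]
                new_rhs.append(cat + "_plus")
                if cat not in plus_seen:                # two extra rules, once
                    plus_seen.add(cat)
                    out.append((cat + "_plus", [cat, cat + "_plus"]))
                    out.append((cat + "_plus", [cat]))
            else:
                new_rhs.append(d)
        out.append((lhs, new_rhs))
    return out
```

A parser using these rules would flatten the right-branching C_plus spine back into the intended flat daughter sequence during processing, as the text notes.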
      <Paragraph position="12"> The major task of the backbone grammar is to encode sufficient information (in the atomic-categoried CF rules) from the unification grammar to constrain the application of the latter's rules at parse time. The nearly one-to-one mapping of unification grammar rules to backbone grammar rules described above works quite well for the ANLT grammar, with only a couple of exceptions that create spurious shift-reduce conflicts during parsing, resulting in an unacceptable degradation in performance. The phenomena concerned are coordination and unbounded dependency constructions.</Paragraph>
      <Paragraph position="13"> In the ANLT grammar three very general rules are used to form nominal, adjectival, and prepositional phrases following a conjunction; the categories in these rules lead to otherwise disjoint categories for conjuncts being merged, giving rise to a set of overly general backbone grammar rules. For example, the rule in the ANLT grammar for forming a noun phrase conjunct introduced by a conjunction is N2\[CONJ @con\] --&gt; \[SUBCAT @con, CONJN +\], H2.</Paragraph>
      <Paragraph position="14"> The variable value for the CONJ feature in the mother means that all N2 categories specified for this feature (e.g. N2 \[CONJ EITHER\], N2 \[CONJ NULL\]) are generalized to the same category. This results in the backbone rules, when parsing either kim or lee helps, being unable, after forming a N2 \[CONJ EITHER\] for either kim, to discriminate between the alternatives of preparing to iterate this constituent (as in the phrase kim, lee, or sandy helps where kim would be N2 \[CONJ NULL\]), or shifting the next word or to start a new constituent. We solve this problem by declaring CONJ to be a feature that may not have a variable value in an element of the disjoint category set. This directs the system to expand out each unification grammar rule that has a category containing this feature with a variable value into a number of rules fully specified for the feature, and to create backbone rules for each of these. There are eight possible values for CONJ in the grammar, so the general rule for forming a nominal conjunct given above, for example, ends up being represented by a set of eight specialized backbone grammar rules.</Paragraph>
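The expanding-out step can be sketched as follows. The value names below are invented for illustration (the paper only states that CONJ has eight values), and the dict-based category encoding is a simplification of the ANLT formalism.

```python
from itertools import product

# Hypothetical value set: eight values, names invented for the example.
CONJ_VALUES = ['NULL', 'AND', 'OR', 'BUT', 'EITHER', 'NEITHER', 'BOTH', 'NOR']

def expand_rule(rule, feature, values):
    """rule: list of categories (dicts mapping feature -> value); a value
    beginning with '@' is a variable, and categories sharing the same
    variable name receive the same expanded value."""
    variables = sorted({cat[feature] for cat in rule
                        if str(cat.get(feature, '')).startswith('@')})
    expanded = []
    for assignment in product(values, repeat=len(variables)):
        binding = dict(zip(variables, assignment))
        new_rule = []
        for cat in rule:
            new_cat = dict(cat)
            if str(new_cat.get(feature, '')).startswith('@'):
                new_cat[feature] = binding[new_cat[feature]]
            new_rule.append(new_cat)
        expanded.append(new_rule)
    return expanded
```

Applied to a rule whose mother and first daughter share the variable `@con`, this yields eight fully specified rules, mirroring the eight specialized backbone rules described above.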
      <Paragraph position="15"> In the grammar, unbounded dependency constructions (UBCs) are analyzed by propagating the preposed constituent through the parse tree as the value of the SLASH feature, to link it with the 'gap' that appears in the constituent's normal position. All nonlexical major categories contain the feature, rules in the grammar propagating it between mother and a single daughter; other daughters are marked \[SLASH \[NOSLASH +\] \] indicating that the daughter is not 'gapped.' Backbone grammar construction would normally lose the information in the unification grammar about where gaps are allowed to occur, significantly degrading the performance of a parser. To carry the information over into the backbone we declare that wherever SLASH occurs with a variable value, the value should be expanded out into two values: \[NOSLASH +\], and a notional value unifying with anything except \[NOSLASH +\]. We have also experimented with a smaller grammar employing 'gap threading' (e.g. Pereira and Shieber 1987), an alternative treatment of UBCs. We were able to use the same techniques for expanding out and inference on the values of the (in this case atomic) features used for threading the gaps to produce a backbone grammar (and parse table) that had the same constraining power with respect to gaps as the original grammar.</Paragraph>
      <Paragraph position="16"> To date, we have not attempted to compute CF backbones for grammars written in formalisms with minimal phrase structure components and (almost) completely general categories, such as HPSG (Pollard and Sag 1987) and UCG (Zeevat, Calder, and Klein 1987); more extensive inference on patterns of possible unification within nested categories and appropriate expanding-out of the categories concerned would be necessary for an LR parser to work effectively. This and other areas of complexity in unification-based formalisms need further investigation before we can claim to have developed a system capable of producing a useful LR parse table for any unification-based grammar. In particular, declaring certain category-valued features so that they cannot take variable values may lead to nontermination in the backbone construction for some grammars. However, it should be possible to restrict the set of features that are considered in category-valued features in an analogous way to Shieber's (1985) restrictors for Earley's (1970) algorithm, so that a parse table can still be constructed.

Ted Briscoe and John Carroll Generalized Probabilistic LR Parsing

4. Building LR Parse Tables for Large NL Grammars

The backbone grammar generated from the ANLT grammar is large: it contains almost 500 distinct categories and more than 1600 productions. When we construct the LALR(1) parse table, we therefore require an algorithm with practical time and space requirements. In the LR parsing literature there are essentially two approaches to constructing LALR(1) parse tables. One approach is graph-based (DeRemer and Pennello 1982), transforming the parse table construction problem to a set of well-known directed graph problems, which in turn are solvable by efficient algorithms. Unfortunately this approach does not work for grammars that are not LR(k) for any k (DeRemer and Pennello 1982:633), for example, ambiguous grammars.</Paragraph>
We therefore broadly follow the alternative approach of Aho, Sethi, and Ullman (1986), but with a number of optimizations:</Paragraph>
      <Paragraph position="18"> Constructing the LR(0) sets of items: we compute LR(0) states containing only kernel items (the item \[S' --&gt; . S\], where S' is the start symbol, and all items that have a symbol to the left of the dot), since nonkernel items can be cached in a table and retrieved only if needed. Being able to partition the items in this way is especially useful with the ANLT grammar, since the mean number of kernel items in each LR(0) set is about 9, whereas the mean number of nonkernel items per state is more than 400.</Paragraph>
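The kernel-only construction can be illustrated with a toy CF grammar. The representations below (a grammar as a dict and an item as a `(lhs, rhs, dot)` triple) are simplifications for the sketch, not the authors' implementation.

```python
# LR(0) construction keeping only kernel items per state; the nonkernel
# items are derived from the kernel on demand (and could be cached).
def closure_nonkernel(kernel, grammar):
    """All nonkernel items (B, rhs, 0) predicted by a kernel: for every
    nonterminal B appearing just after a dot, add B's initial items."""
    pending = list(kernel)
    nonkernel = set()
    while pending:
        lhs, rhs, dot = pending.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for alt in grammar[rhs[dot]]:
                item = (rhs[dot], tuple(alt), 0)
                if item not in nonkernel:
                    nonkernel.add(item)
                    pending.append(item)
    return nonkernel

def goto_kernel(kernel, symbol, grammar):
    """Kernel of the successor state on `symbol`: advance the dot over
    matching kernel and (recomputed) nonkernel items."""
    items = set(kernel) | closure_nonkernel(kernel, grammar)
    return frozenset((lhs, rhs, dot + 1) for lhs, rhs, dot in items
                     if dot < len(rhs) and rhs[dot] == symbol)
```

Only the small kernels need to be stored per state, which is what makes the 9-versus-400 item split above pay off.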
      <Paragraph position="19"> Computing the LALR(1) lookaheads for each item: the conventional approach is to compute the LR(1) closure of each kernel item in order to determine the lookaheads that are generated spontaneously and those that propagate from other items. However, in an initial implementation we found that the LR(1) closure operation as described by Aho et al. was too expensive to be practicable for the number and size of LR(0) states we deal with, even with schemes for caching the closures of nonkernel items once they had been computed. Instead, we have moved to an algorithm devised by Kristensen and Madsen (1981), which avoids performing the LR(1) closure operation. The crucial advantage of this algorithm is the ability, at any stage in the computation, to tell whether the calculation of the lookahead set for a particular item has been completed, is underway, or has not yet started. This means that even partially computed lookahead sets can be cached (with the computation yet to be done explicitly marked), and that items whose lookahead sets are found to subsume those of others are able to just copy the results from the subsumed sets.</Paragraph>
      <Paragraph position="20"> Constructing the parse table: the LALR(1) parse table is derived straightforwardly from the lookahead sets, although to keep the size of the parse table within reasonable bounds we chose appropriate data structures to represent the goto entries and shift and reduce actions. For the ANLT backbone grammar there are approximately 150,000 goto entries (nonterminal--state pairs), 440,000 shift actions (terminal--state pairs), and 670,000 reduce actions (terminal--rule-number pairs); however, of the goto entries only 2,600 are distinct and of the shift actions only 1,100 are distinct; most states contain just reduce or just shift actions, and in any one state very few different rules are involved in reduce actions. 1 Taking advantage of the characteristics of this distribution, in each state we represent (in Common Lisp)

(a) a set of goto entries as a list of (nonterminal--state) conses sorted into a canonical order, with list elements and tails of lists shared where possible between states,

(b) a set of shift actions as a list containing a single (large) integer (the list shared when possible between states), where if the state shifts to state s on lookahead t, the element indexed by t in an auxiliary array will contain s together with a number n, and bit n in the binary representation of the integer will be 1,

(c) a set of reduce actions as, for each rule involved, a cons whose second element is the rule number and whose first is a bit-vector (shared when possible between states) whose nth bit is 1 if the reduce should occur with the nth terminal as lookahead,

(d) an accept action as a cons with the first element being the lookahead symbol.</Paragraph>
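The shift-action encoding in (b) can be sketched as follows, in Python rather than the paper's Common Lisp. The auxiliary array is simplified here to a dictionary holding a list of (successor, bit) entries per terminal, so that the sketch also covers terminals that shift to different successors from different states.

```python
def encode_shifts(shifts, aux, next_bit):
    """shifts: terminal -> successor state, for one parse-table state.
    aux: shared table, terminal -> list of (successor, bit) entries.
    next_bit: one-element list used as a mutable bit counter."""
    mask = 0
    for terminal, state in sorted(shifts.items()):
        entries = aux.setdefault(terminal, [])
        for succ, bit in entries:
            if succ == state:          # reuse an existing (successor, bit)
                break
        else:                          # allocate a fresh bit index
            bit = next_bit[0]
            next_bit[0] += 1
            entries.append((state, bit))
        mask |= 1 << bit               # set bit n in the state's big integer
    return mask

def shift_state(mask, aux, terminal):
    """Successor state on `terminal`, or None (a faithful error entry)."""
    for succ, bit in aux.get(terminal, []):
        if mask & (1 << bit):
            return succ
    return None
```

Returning `None` for absent entries mirrors the faithful error representation discussed below: failures are detected immediately rather than masked by convenient nonerror entries.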
      <Paragraph position="21"> For the grammars we have investigated, this representation achieves a similar order of space saving to the comb vector representation suggested by Aho, Sethi, and Ullman (1986:244ff) for unambiguous grammars (see Klein and Martin \[1989\] for a survey of representation techniques). The parse table for the ANLT grammar occupies approximately 360 Kbytes of memory, and so represents each action (shift, reduce, or goto) in an average of less than 2.3 bits. In contrast to conventional techniques, though, we maintain a faithful representation of the parse table, not replacing error entries with more convenient nonerror ones in order to save extra space. Our parsers are thus able to detect failures as soon as theoretically possible, an important efficiency feature when parsing nondeterministically with ambiguous grammars, and a time-saving feature when parsing interactively with them (see next section).</Paragraph>
      <Paragraph position="22"> Table 1 compares the size of the LALR(1) parse table for the ANLT grammar with others reported in the literature. From these figures, the ANLT grammar is more than twice the size of Tomita's (combined morphological and syntactic) grammar for Japanese (Tomita 1987:45). The grammar itself is about one order of magnitude bigger than that of a typical programming language, but the LALR(1) parse table, in terms of number of actions, is two orders of magnitude bigger. Although Tomita (1984:357) anticipates LR parsing techniques being applied to large NL grammars written in formalisms such as GPSG, the sizes of parse tables for such grammars grow more rapidly than he predicts. However, for large real-world NL grammars such as the ANLT, the table size is still quite manageable despite Johnson's (1989) worst-case complexity result of the number of LR(0) states being exponential on grammar size (leading to a parser with exponentially bad time performance). We have, therefore, not found it necessary to use Schabes' (1991a) LR-like tables (with number of states guaranteed to be polynomial even in the worst case).</Paragraph>
      <Paragraph position="23"> 1 Of the 3,710 states, 2,200 contain at least 1 action conflict, with a median of 34 conflicts per state. There are a total of 230,000 shift-reduce conflicts and 220,000 reduce-reduce conflicts, fairly uniformly distributed across the terminal lookahead symbols. In half of the latter conflicts, the rules involved have an identical number of daughters. One implication of this finding is that an approach to conflict resolution such as that of Shieber (1983), where reduce-reduce conflicts are resolved in favor of the longer reduction, may not suffice to select a unique analysis for realistic NL grammars.

As might be expected, and as Table 2 illustrates, parse table construction for large grammars is CPU-intensive. As a rough guide, Grosch (1990) quotes LALR(1) table construction for a grammar for Modula-2 as taking from about 5 to 50 seconds, so scaling up two orders of magnitude, our timings for the ANLT grammar fall in the expected region.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="38" end_page="39" type="metho">
    <SectionTitle>
5. Interactive Incremental Deterministic Parsing
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="38" end_page="39" type="sub_section">
      <SectionTitle>
5.1 Constructing a Disambiguated Training Corpus
</SectionTitle>
      <Paragraph position="0"> The major problem with attempting to employ a disambiguated training corpus is to find a way of constructing this corpus in an error-free and resource-efficient fashion.</Paragraph>
      <Paragraph position="1"> Even manual assignment of lexical categories is slow, labor-intensive, and error-prone.</Paragraph>
      <Paragraph position="2"> The greater complexity of constructing a complete parse makes the totally manual approach very unattractive, if not impractical: Sampson (1987:83) reports that it took 2 person-years to produce the 'LOB tree bank' of 50,000 words. Furthermore, in that project, no attempt was made to ensure that the analyses were well formed with respect to a generative grammar. Attempting to manually construct analyses consistent with a grammar of any size and sophistication would place an enormous additional load on the analyst. Leech and Garside (1991) discuss the problems that arise in manual parsing of corpora concerning accuracy and consistency of analyses across time and analyst, the labor-intensive nature of producing detailed analyses, and so forth. They advocate an approach in which simple 'skeleton' parses are produced by hand from previously tagged material, with checking for consistency between analysts. These skeleton analyses can then be augmented automatically with further information implicit in the lexical tags.

2 Figures given by Klein and Martin (1989).
3 Grammar from Spector (1983) with optionality expanded out; statistics taken from a parse table constructed by the second author.

While this approach may well be the best that can be achieved</Paragraph>
      <Paragraph position="3"> with fully manual techniques, it is still unsatisfactory in several respects. Firstly, the analyses are crude, while we would like to automatically parse with a grammar capable of assigning sophisticated semantically interpretable ones; but it is not clear how to train an existing grammar with such unrelated analyses. Secondly, the quality of any grammar obtained automatically from the parsed corpus is likely to be poor because of the lack of any rigorous checks on the form of the skeleton parses. Such a grammar might, in principle, be trained from the parsed corpus, but there are still likely to be small mismatches between the actual analysis assigned manually and any assigned automatically. For these reasons, we decided to attempt to produce a training corpus using the grammar that we wished ultimately to train. As long as the method employed ensured that any analysis assigned was a member of the set defined by the grammar, these problems during training should not arise.</Paragraph>
      <Paragraph position="4"> Following our experience of constructing a substantial lexicon for the ANLT grammar from unreliable and indeterminate data (Carroll and Grover 1989), we decided to construct the disambiguated training corpus semi-automatically, restricting manual interaction to selection between alternatives defined by the ANLT grammar. One obvious technique would be to generate all possible parses with a conventional parser and to have the analyst select the correct parse from the set returned (or reject them all).</Paragraph>
      <Paragraph position="5"> However, this approach places a great load on the analyst, who will routinely need to examine large numbers of parses for given sentences. In addition, computation of all possible analyses is likely to be expensive and, in the limit, intractable.</Paragraph>
      <Paragraph position="6"> Briscoe (1987) demonstrates that the structure of the search space in parse derivations makes a left-to-right, incremental mode of parse selection most efficient. For example, in noun compounds analyzed using a recursive binary-branching rule (N --&gt; N N) the number of analyses correlates with the Catalan series (Church and Patil 1982), 4 so a 3-word compound has 2 analyses, 4 has 5, 5 has 14, 9 has 1430, and so forth. However, Briscoe (1987:154f) shows that with a simple bounded context parser (with one word lookahead) set up to request help whenever a parse indeterminacy arises, it is possible to select any of the 14 analyses of a 5-word compound with a maximum of 5 interactions and any of the 1430 analyses of a 9-word compound with around 13 interactions. In general, resolution of the first indeterminacy in the input will rule out approximately half the potential analyses, resolution of the next, half of the remaining ones, and so on. For 'worst case' CF ambiguities (with O(n^3) complexity) this approach to parse selection appears empirically to involve numbers of interactions that increase at little more than linear rate with respect to the length of the input. It is possible to exploit this insight in two ways. One method would be to compute all possible analyses represented as a (packed) parse forest and ask the user to select between competing subanalyses that have been incorporated into a successful analysis of the input. In this way, only genuine global syntactic ambiguities would need to be considered by the user. However, the disadvantage of this approach is that it relies on a prior (and perhaps CPU-intensive) on-line computation of the full set of analyses. The second method involves incremental interaction with the parser during the parse to guide it through the search space of possibilities. 
This has the advantage of being guaranteed to be computationally tractable but the potential disadvantage of requiring the user to resolve many local syntactic ambiguities that will not be</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="39" end_page="43" type="metho">
    <SectionTitle>
4 The nth Catalan number is given by
</SectionTitle>
    <Paragraph position="0"> C(n) = (2n)! / (n! (n+1)!) </Paragraph>
    <Paragraph position="2"> incorporated into a successful analysis. Nevertheless, using LR techniques this problem can be minimized and, because we do not wish to develop a system that must be able to compute all possible analyses (at some stage) in order to return the most plausible one, we have chosen the latter incremental method.</Paragraph>
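The Catalan-series counts cited in Section 5.1 (2, 5, 14, and 1430 analyses for 3-, 4-, 5-, and 9-word compounds) follow directly from the footnoted formula:

```python
# An n-word compound analyzed with the binary rule N -> N N has C(n-1)
# analyses, where C is the Catalan series.
from math import comb

def catalan(n):
    return comb(2 * n, n) // (n + 1)

# {3: 2, 4: 5, 5: 14, 9: 1430}
analyses_per_length = {n: catalan(n - 1) for n in (3, 4, 5, 9)}
```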
    <Section position="1" start_page="40" end_page="43" type="sub_section">
      <SectionTitle>
5.2 The Interactive LR Parsing System
</SectionTitle>
      <Paragraph position="0"> The interactive incremental parsing system that we implemented asks the user for a decision at each choice point during the parse. However, to be usable in practice, such a system must avoid, as far as possible, presenting the user with spurious choices that could be ruled out either by using more of the left context or by looking at words yet to be parsed. Our approach goes some way to addressing these points, since the parser is as predictive as the backbone grammar and LR technique allow, and the LALR(1) parse table allows one word lookahead to resolve some ambiguities (although, of course, the resolution of a local ambiguity may potentially involve an unlimited amount of lookahead; e.g. Briscoe 1987:125ff). In fact, LR parsing is the most effectively predictive parsing technique for which an automatic compilation procedure is known, but this is somewhat undermined by our use of features, which will block some derivations so that the valid prefix property will no longer hold (e.g. Schabes 1991b). Extensions to the LR technique, for example those using LR-regular grammars (Culik and Cohen 1973; Bermudez 1991), might be used to further cut down on interactions; however, computation of the parse tables to drive such extended LR parsers may prove intractable for large NL grammars (Hektoen 1991).</Paragraph>
      <Paragraph position="1"> An LR parser faces an indeterminacy when it enters a state in which there is more than one possible action, given the current lookahead. In a particular state there cannot be more than one shift or accept action, but there can be several reduce actions, each specifying a reduction with a different rule. When parsing, each shift or reduce choice must lead to a different final structure, and so the indeterminacy represents a point of syntactic ambiguity (although it may not correspond to a genuinely global syntactic ambiguity in the input, on account of the limited amount of lookahead).</Paragraph>
      <Paragraph position="2"> In the ANLT grammar and lexicon, lexical ambiguity is at least as pervasive as structural ambiguity. A naive implementation of an interactive LR parser would ask the user the correct category for each ambiguous word as it was shifted; many open-class words are assigned upwards of twenty lexical categories by the ANLT lexicon with comparatively fine distinctions between them, so this strategy would be completely impracticable. To avoid asking the user about lexical ambiguity, we use the technique of preterminal delaying (Shieber 1983), in which the assignment of an atomic preterminal category to a lexical item is not made until the choice is forced by the use of a particular production in a later reduce action. After shifting an ambiguous lexical item, the parser enters a state corresponding to the union of states that would be entered on shifting the individual lexical categories. (Each union of states will in practice be small, since a large union would imply that the current context was completely failing to constrain the following input.) Since, in general, several unification grammar categories for a single word may be subsumed by a single atomic preterminal category, we extend Shieber's technique so that it deals with a grammar containing complex categories by associating a set of alternative analyses with each state (not just one), and letting the choice between them be forced by later reduce actions, just as with atomic preterminal categories.</Paragraph>
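The state-union step of preterminal delaying can be sketched as follows; the `goto` table format is an assumption made for the illustration, not taken from the ANLT system.

```python
# After shifting a word with several possible preterminals, the parser
# sits in the union of the states it would have reached on each category.
def shift_ambiguous(state_set, categories, goto):
    """state_set: frozenset of LR states; categories: possible preterminal
    categories of the next word; goto: dict (state, category) -> state."""
    successors = {goto[(q, c)] for q in state_set for c in categories
                  if (q, c) in goto}
    return frozenset(successors)
```

A later reduce action, by selecting a particular production, implicitly selects which member of the union (and hence which preterminal) was the right one.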
      <Paragraph position="3"> In order not to overload the user with spurious choices concerning local ambiguities, the parser does not request help immediately after it reaches a parse action conflict. Instead the parser pursues each option in a limited breadth-first fashion and only requests help with analysis paths that remain active. In our current system this type of lookahead is limited to up to four indeterminacies ahead. Such checking is cheap in terms of machine resources and very effective in cutting down both the number of choice points the user is forced to consider and also the average number of options in each one. Table 3 shows the reduction in user interaction achieved by increasing the amount of lookahead in our system. Computation of the backbone grammar generates extra rules (as previously described, to deal with lexical categories used as rule mothers and with daughters specified to be repeatable an indefinite number of times) that do not correspond directly to single unification grammar rules. At choice points, reductions involving these rules are not presented to the user; instead the system applies the reductions automatically, proceeding until the next shift action or choice point is reached, including these options in those presented to the user.</Paragraph>
      <Paragraph position="4"> The final set of measures taken to reduce the amount of interaction required with the user is to ask if the phrase being parsed contains one or more gaps or instances of coordination before presenting choices involving either of these phenomena, blocking consideration of rules on the basis of the presence of particular feature-value pairs.</Paragraph>
      <Paragraph position="5"> The resulting parse tree is displayed with category aliases substituted for the actual complex categories.</Paragraph>
      <Paragraph position="6"> The requests for manual selection of the analysis path are displayed to the analyst in as terse a manner as possible, and require knowledge of the ANLT grammar and lexicon to be resolved effectively. Figure 8 summarizes the amount of interaction required in the experiment reported below for parsing a set of 150 LDOCE noun definitions with the ANLT grammar. To date, the largest number of interactions we have observed for a single phrase is 55 for the (30-word) LDOCE definition for youth hostel: a hostel for usu young people walking around country areas on holiday for which they pay small amounts of money to the youth hostels association or the international yha.</Paragraph>
      <Paragraph position="7"> Achieving the correct analysis interactively took the first author about 40 minutes (including the addition of two lexical entries). Definitions of this length will often have many hundreds or even thousands of parses; computing just the parse forest for this definition takes of the order of two hours of CPU time (on a DEC 3100 Unix workstation). Since in a more general corpus of written material the average sentence length is likely to be 30--40 words, this example illustrates clearly the problems with any approach based on post hoc on-line selection of the correct parse. However, using the incremental approach to semi-automatic parsing we have been able to demonstrate that the correct analysis is among this set. Furthermore, a probabilistic parser such as the one described later may well be able to compute this analysis in a tractable fashion by extracting it from the parse forest. (To date, the largest example for which we have been able to compute all analyses had approximately 2500 parses).</Paragraph>
      <Paragraph position="8"> The parse histories resulting from semi-automatic parsing are automatically stored and can be used to derive the probabilistic information that will guide the parser after training. We return to a discussion of the manner in which this information is utilized in Section 7.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="43" end_page="52" type="metho">
    <SectionTitle>
6. Non-Deterministic LR Parsing with Unification Grammars
</SectionTitle>
    <Paragraph position="0"> As well as building an interactive parsing system incorporating the ANLT grammar (described above), we have implemented a breadth-first, nondeterministic LR parser for unification grammars. This parser is integrated with the Grammar Development Environment (GDE; Carroll et al. 1988) in the ANLT system, and provided as an alternative parser for use with stable grammars for batch parsing of large bodies of text. The existing chart parser, although slower, has been retained since it is more suited to grammar development, because of the speed with which modifications to the grammar can be compiled and its better debugging facilities (Boguraev et al. 1988).</Paragraph>
    <Paragraph position="1"> Our nondeterministic LR parser is based on Kipps' (1989) reformulation of Tomita's (1987) parsing algorithm and uses a graph-structured stack in the same way. Our parser is driven by the LALR(1) state table computed from the backbone grammar, but in addition on each reduction the parser performs the unifications appropriate to the unification grammar version of the backbone rule involved. The analysis being pursued fails if one of the unifications fails. The parser performs sub-analysis sharing (where if two or more trees have a common sub-analysis, that sub-analysis is represented only once), and local ambiguity packing (in which sub-analyses that have the same top node and cover the same input have their top nodes merged, being treated by higher level structures as a single sub-analysis). However, we generalize the technique of atomic category packing described by Tomita, driven by atomic category names, to complex feature-based categories following Alshawi (1992): the packing of sub-analyses is driven by the subsumption relationship between the feature values in their top nodes. An analysis is only packed into one that has already been found if its top node is subsumed by, or equal to, that of the one already found. An analysis, once packed, will thus never need to be unpacked during parsing (as in Tomita's system) since the value of each feature will always be uniquely determined.</Paragraph>
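The subsumption test driving this packing can be sketched for categories represented as nested feature dictionaries; this encoding is an illustration, not the ANLT's actual feature-structure representation.

```python
def subsumes(general, specific):
    """True iff every feature of `general` appears in `specific` with a
    value it subsumes (atomic values must match exactly)."""
    for feature, value in general.items():
        if feature not in specific:
            return False
        other = specific[feature]
        if isinstance(value, dict):
            if not (isinstance(other, dict) and subsumes(value, other)):
                return False
        elif value != other:
            return False
    return True

def can_pack(existing_top, new_top):
    # Pack the new subanalysis under the existing node only if the
    # existing top node subsumes (or equals) the new one.
    return subsumes(existing_top, new_top)
```

Because packing only ever goes from more specific into more general nodes, the value of each feature on a packed node remains uniquely determined, which is why no unpacking is needed during parsing.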
    <Paragraph position="2"> Our use of local ambiguity packing does not in practice seem to result in exponentially bad performance with respect to sentence length (cf. Johnson 1989) since we have been able to generate packed parse forests for sentences of over 30 words having many thousands of parses. We have implemented a unification version of Schabes' (1991a) chart-based LR-like parser (which is polynomial in sentence length for CF grammars), but experiments with the ANLT grammar suggest that it offers no practical advantages over our Tomita-style parser, and Schabes' table construction algorithm yields less fine-grained and, therefore, less predictive parse tables. Nevertheless, searching the parse forest exhaustively to recover each distinct analysis proved computationally intractable for sentences over about 22 words in length. Wright, Wrigley, and Sharman (1991) describe a Viterbi-like algorithm for unpacking parse forests containing probabilities of (sub-)analyses to find the n-best analyses, but this approach does not generalize (except in a heuristic way) to our approach in which unification failure on the different extensions of packed nodes (resulting from differing super- or sub-analyses) cannot be computed 'locally.' In subsequent work (Carroll and Briscoe 1992) we have developed such a heuristic technique for best-first search of the parse forest which, in practice, makes the recovery of the most probable analyses much more efficient (allowing analysis of sentences containing over 30 words).</Paragraph>
    <Paragraph position="3"> We noticed during preliminary experiments with our unification LR parser that it was often the case that the same unifications were being performed repeatedly, even during the course of a single reduce action. The duplication was happening in cases where two or more pairs of states in the graph-structured stack had identical complex categories between them (for example due to backbone grammar ambiguity). During a reduction with a given rule, the categories between each pair of states in a backwards traversal of the stack are collected and unified with the appropriate daughters of the rule. Identical categories appearing here between traversed pairs of states leads to duplication of unifications. By caching unification results we eliminated this wasted effort and improved the initially poor performance of the parser by a factor of about three.</Paragraph>
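The caching scheme can be sketched as a wrapper over an arbitrary unification function. This is an illustration only; keying the cache on object identity assumes, as in the parser described, that categories are not destructively modified between lookups.

```python
# During a single reduce action the same pair of categories may be unified
# repeatedly, so results (including failures) are memoized.
def make_cached_unify(unify):
    cache = {}
    def cached(cat1, cat2):
        key = (id(cat1), id(cat2))   # identity, not structure: cheap to hash
        if key not in cache:
            cache[key] = unify(cat1, cat2)
        return cache[key]
    return cached
```

Caching failures is as important as caching successes here, since a failed unification on a shared pair of stack categories would otherwise be recomputed on every traversal that touches it.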
    <Paragraph position="4"> As for actual parse times, Table 4 compares those for the GDE chart parser, the semi-automatic, user-directed LR parser, and the nondeterministic LR parser. Our general experience is that although the nondeterministic LR parser is only around 30-50% faster than the chart parser, it often generates as little as a third the amount of garbage. (The relatively modest speed advantage compared with the substantial space saving appears to be due to the larger overheads involved in LR parsing). Efficient use of space is obviously an important factor for practical parsing of long and ambiguous texts.</Paragraph>
    <Paragraph position="5"> 7. LR Parsing with Probabilistic Disambiguation
Several researchers (Wright and Wrigley 1989; Wright 1990; Ng and Tomita 1991; Wright, Wrigley, and Sharman 1991) have proposed using LR parsers as a practical method of parsing with a probabilistic context-free grammar. This approach assumes that probabilities are already associated with a CFG and describes techniques for distributing those probabilities around the LR parse table in such a way that a probabilistic ranking of alternative analyses can be computed quickly at parse time, and probabilities assigned to analyses will be identical to those defined by the original probabilistic CFG. However, our method of constructing the training corpus allows us to associate probabilities with an LR parse table directly, rather than simply with rules of the grammar. An LR parse state encodes information about the left and right context of the current parse. Deriving probabilities relative to the parse context will allow the probabilistic parser to distinguish situations in which identical rules reapply in different ways across different derivations or apply with differing probabilities in different contexts.</Paragraph>
    <Paragraph position="6"> Semi-automatic parsing of the training corpus yields a set of LR parse histories that are used to construct the probabilistic version of the LALR(1) parse table. The parse table is a nondeterministic finite-state automaton so it is possible to apply Markov modeling techniques to the parse table (in a way analogous to their application to lexical tagging or CFGs). Each row of the parse table corresponds to the possible transitions out of the state represented by that row, and each transition is associated with a particular lookahead item and a parse action. Nondeterminism arises when more than one action, and hence transition, is possible given a particular lookahead item. The most straightforward technique for associating probabilities with the parse table is to assign a probability to each action in the action part of the table (e.g. Wright 1990). 5 If probabilities are associated directly with the parse table rather than derived from a probabilistic CFG or equivalent global pairing of probabilities to rules, then the resulting probabilistic model will be more sensitive to parse context. For example, in a derivation for the sentence he loves her using Grammar 1, the distinction between reducing the first pronoun and the second pronoun to NP, using rule 5 (NP --&gt; ProNP), can be maintained in terms of the different lookahead items paired with the reduce actions relating to this rule (in state 5 of the parse table in Figure 2); in the first case, the lookahead item will be Vi, and in the second $ (the end of sentence marker). 
However, this approach does not make maximal use of the context encoded into a transition in the parse table, and it is possible to devise situations in which the reduction of a pronoun in subject position and elsewhere would be indistinguishable in terms of lookahead alone; for example, if we added appropriate rules for adverbs to Grammar 1, then this reduction would be possible with lookahead Adv in sentences such as he passionately loves her and he loves her passionately.</Paragraph>
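A minimal sketch of this straightforward scheme, with invented training data (the state, lookahead, and action names below are illustrative, not the actual ANLT parse-table encoding): counts of (lookahead, action) pairs are collected per state and normalized so that each state's transition probabilities sum to one.

```python
from collections import Counter, defaultdict

# Invented training steps: each records (state, lookahead, action)
# for one transition taken during semi-automatic parsing.
training_steps = [
    (5, "Vi", "reduce 5"),   # ProNP reduced to NP in subject position
    (5, "Vi", "reduce 5"),
    (5, "$", "reduce 5"),    # ProNP reduced to NP in object position
]

counts = defaultdict(Counter)   # state -> counts of (lookahead, action)
for state, lookahead, action in training_steps:
    counts[state][(lookahead, action)] += 1

# Convert counts to probabilities so that the probabilities of the
# transitions out of each state sum to one.
probs = {state: {t: n / sum(c.values()) for t, n in c.items()}
         for state, c in counts.items()}
print(probs[5][("Vi", "reduce 5")])   # 2/3: subject reductions dominate
```

Because the reduce action is keyed on the lookahead as well as the state, the two uses of rule 5 receive separate probabilities, which a plain probabilistic CFG could not distinguish.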
    <Paragraph position="7"> A slightly less obvious approach is to further subdivide reduce actions according to the state reached after the reduce action has applied. This state is used together with the resultant nonterminal to define the state transition in the goto part of the parse table. Thus, this move corresponds to associating probabilities with transitions in the automaton rather than with actions in the action part of the table. For example, a reduction of pronoun to NP in subject position in the parse table for Grammar 1 in Figure 2 always results in the parser returning to state 0 (from which the goto table deterministically prescribes a transition to state 7 with nonterminal NP). Reduction to NP of a pronoun in object position always results in the parser returning to state 11.</Paragraph>
    <Paragraph position="8"> Thus training on a corpus with more subject than nonsubject pronominal NPs will now result in a probabilistic preference for reductions that return to 'pre-subject' states with 'post-subject' lookaheads. Of course, this does not mean that it is impossible to devise grammars containing reductions that cannot be kept distinct in this way yet might, in principle, have different frequencies of occurrence. However, this approach appears to be the natural stochastic, probabilistic model that emerges when using an LALR(1) table. Any further sensitivity to context would require sensitivity to patterns in larger sections of a parse derivation than can be defined in terms of such a table.</Paragraph>
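The finer-grained scheme can be sketched by extending the transition key with the state the parser returns to after a reduce. The counts below are invented; states 0 and 11 echo the subject vs. object pronoun example above.

```python
from collections import Counter

# A reduce transition is identified by (state, lookahead, action,
# state-returned-to), so reductions sharing state and lookahead can
# still be distinguished by where the parser lands.
transitions = Counter({
    (5, "Adv", "reduce 5", 0): 8,    # subject position: return to state 0
    (5, "Adv", "reduce 5", 11): 2,   # object position: return to state 11
})

total = sum(transitions.values())
probs = {t: n / total for t, n in transitions.items()}
# Lookahead alone (Adv in both cases) cannot separate the two
# reductions, but the state returned to can:
print(probs[(5, "Adv", "reduce 5", 0)])    # 0.8
print(probs[(5, "Adv", "reduce 5", 11)])   # 0.2
```

With a majority of subject pronouns in training, the reduction returning to the 'pre-subject' state 0 gets the higher probability even when the lookahead (here Adv) is identical in both positions.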
    <Paragraph position="9"> The probabilities required to create the probabilistic version of the parse table can be derived from the set of parse histories resulting from the training phase described in Section 5, by computing the frequency with which each transition from a particular state has been taken and converting these to probabilities such that the probabilities assigned to each transition from a given state sum to one. 5 In our implementation, the probabilities are actually stored separately from the parse table to ensure that otherwise-sharable transitions in the table can still be represented compactly even if their probabilities differ.</Paragraph>
    <Paragraph position="10"> Figure 9. A probabilistic version of the parse table for Grammar 1.</Paragraph>
    <Paragraph position="11"> In Figure 9 we show a probabilistic LALR(1) parse table for Grammar 1 derived from a simple, partial (and artificial) training phase. In this version of the table a probability is associated with each shift action in the standard way, but separate probabilities are associated with reduce actions, depending on the state reached after the action; for example, in state 4 with lookahead N, the probability of reducing with rule 10 is 0.17 if the state reached is 3 and 0.22 if the state reached is 5. The actions that have no associated probabilities are ones that have not been utilized during the training phase; each is assigned a smoothed probability that is the reciprocal of the result of adding one to the total number of observations of actions actually taken in that state. Differential probabilities are thus assigned to unseen events in a manner analogous to the Good-Turing technique. For this reason, the explicit probabilities for each row add up to less than one. The goto part of the table is not shown because it is always deterministic and, therefore, we do not associate probabilities with goto transitions.</Paragraph>
    <Paragraph position="13"> Figure 10. Parse derivations for the winter holiday camp closed.</Paragraph>
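The smoothing rule can be illustrated as follows, with invented counts; treating observed actions as scored out of one plus the total number of observations is our reading of why the explicit row probabilities sum to less than one.

```python
# Invented action counts for one state's row of the action table.
observed = {"shift 2": 6, "reduce 10": 3}   # actions seen in training
total = sum(observed.values())              # 9 observations in this state

def action_prob(action):
    # An unseen action gets the reciprocal of one plus the total number
    # of observations; observed actions are scored out of the same
    # total, so the explicit probabilities sum to just under one.
    return observed.get(action, 1) / (total + 1)

print(action_prob("shift 2"))     # 0.6
print(action_prob("reduce 10"))   # 0.3
print(action_prob("reduce 4"))    # unseen: 1/(9+1) = 0.1
```

The explicit entries sum to 0.9, leaving the remaining mass to be shared among actions never taken in this state during training.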
    <Paragraph position="15"> The difference between our approach and one based on probabilistic CFG can be brought out by considering various probabilistic derivations using the probabilistic parse table for Grammar 1. Assuming that we are using probabilities simply to rank parses, we can compute the total probability of an analysis by multiplying together the probabilities of each transition we take during its derivation. In Figure 10, we give the two possible complete derivations for a sentence such as the winter holiday camp closed consisting of a determiner, three nouns, and an intransitive verb. The ambiguity concerns whether the noun compound is left- or right-branching, and, as we saw in Section 2, a probabilistic CFG cannot distinguish these two derivations. The probability of each step can be read off the action table and is shown after the lookahead item in the figure.</Paragraph>
    <Paragraph position="16"> In step 8 a shift-reduce conflict occurs so the stack 'splits' while the left- and right-branching analyses of the noun compound are constructed. The a) branch corresponds  Ted Briscoe and John Carroll Generalized Probabilistic LR Parsing to the right-branching derivation and the product of the probabilities is 4.6 x 10 -8, while the product for the left-branching b) derivation is 5.1 x 10 -7. Since the table was constructed from parse histories with a preponderance of left-branching structures this is the desired result. In practice, this technique is able to distinguish and train accurately on 3 of the 5 possible structures for a 4-word noun-noun compound; but it inaccurately prefers a completely left-branching analysis over structures of the form ((n n)(n n)) and ((n (nn)) n). Once we move to 5-word noun-noun compounds, performance degrades further. However, this level of performance on such structural configurations is likely to be adequate, because correct resolution of most ambiguity in such constructions is likely to be dominated by the actual lexical items that occur in individual texts. Nevertheless, if there are systematic structural tendencies evident in corpora (for example, Frazier's \[1988\] parsing strategies predict a preference for left-branching analyses of such compounds), then the probabilistic model is sensitive enough to discriminate them. 6 In practice, we take the geometric mean of the probabilities rather than their product to rank parse derivations. Otherwise, it would be difficult to prevent the system from always developing a bias in favor of analyses involving fewer rules or equivalently 'smaller' trees, almost regardless of the training material. Of course, the need for this step reflects the fact that, although the model is more context-dependent than probabilistic CFG, it is by no means a perfect probabilistic model of NL. 
7 For example, the stochastic nature of the model and the fact that the entire left context of a parse derivation is not encoded in LR state information means that the probabilistic model cannot take account of, say, the pattern of resolution of earlier conflicts in the current derivation. Another respect in which the model is approximate is that we are associating probabilities with the context-free backbone of the unification grammar.</Paragraph>
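The effect of ranking by geometric mean rather than raw product can be seen with invented transition probabilities:

```python
import math

# A derivation with more transitions multiplies more probability terms,
# so the raw product systematically favors 'smaller' trees regardless
# of how good each individual transition is.
bigger_tree = [0.7, 0.6, 0.5, 0.5]   # four transitions
smaller_tree = [0.4, 0.4]            # two transitions

def product(ps):
    return math.prod(ps)

def geometric_mean(ps):
    return math.prod(ps) ** (1.0 / len(ps))

# By raw product the shorter derivation wins simply because it
# multiplies fewer terms...
print(product(smaller_tree) > product(bigger_tree))                # True
# ...while the geometric mean compares per-transition quality.
print(geometric_mean(bigger_tree) > geometric_mean(smaller_tree))  # True
```

Here the product prefers the smaller tree (0.16 vs. 0.105), while the geometric mean prefers the bigger tree whose individual transitions are more probable (about 0.57 vs. 0.4).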
    <Paragraph position="17"> Successful unification of features at parse time does not affect the probability of a (partial) analysis, while unification failure, in effect, sets the probability of any such analysis to zero. As long as we only use the probabilistic model to rank successful analyses, this is not particularly problematic. However, parser control regimes that attempt some form of best-first search using probabilistic information associated with transitions might not yield the desired result given this property. For example, it is not possible to use Viterbi-style optimization of search for the maximally probable parse because this derivation may contain a sub-analysis that will be pruned locally before a subsequent unification failure renders the current most probable analysis impossible.</Paragraph>
    <Paragraph position="18"> In general, the current breadth-first probabilistic parser is more efficient than its nonprobabilistic counterpart described in the previous section. In contrast to the parser described by Ng and Tomita (1991), our probabilistic parser is able to merge (state and stack) configurations and in all cases still maintain a full record of all the probabilities computed up to that point, since it associates probabilities with partial analyses of the input so far rather than with nodes in the graph-structured stack. We are currently experimenting with techniques for probabilistically unpacking the packed parse forest to recover the first few most probable derivations without the need for exhaustive search or full expansion.</Paragraph>
    <Paragraph position="19"> 6 Although we define our probabilistic model relative to the LR parsing technique, it is likely that there is an equivalent encoding in purely grammatical terms. In general our approach corresponds to making the probability of rule application conditional on other rules having applied during the parse derivation (e.g. Magerman and Marcus 1991) and the lexical category of the next word; for example, it would be possible to create a grammatical representation of the probabilistic model that emerges from an LR(0) table by assigning a set of probabilities associated with rule numbers to each right-hand side category in each rule of a CFG that would encode the probability of a rule being used to expand that category in that context. 7 Magerman and Marcus (1991) argue that it is reasonable to use the geometric mean when computing the probability of two or more sub-analyses because the independence assumptions that motivate using products do not hold for such an approximate model. In Carroll and Briscoe (1992) we present a more motivated technique for normalizing the probability of competing sub-analyses in the parse forest.</Paragraph>
    <Paragraph position="20"> 8. Parsing LDOCE Noun Definitions
In order to test the techniques and ideas described in previous sections, we undertook a preliminary experiment using a subset of LDOCE noun definitions as our test corpus.</Paragraph>
    <Paragraph position="21"> (The reasons for choosing this corpus are discussed in the introduction.) A corpus of approximately 32,000 noun definitions was created from LDOCE by extracting the definition fields and normalizing the definitions to remove punctuation, font control information, and so forth. 8 A lexicon was created for this corpus by extracting the appropriate lemmas and matching these against entries in the ANLT lexicon. The 10,600 resultant entries were loaded into the ANLT morphological system (Ritchie et al. 1987) and this sublexicon and the full ANLT grammar formed the starting point for the training process.</Paragraph>
    <Paragraph position="22"> A total of 246 definitions, selected without regard for their syntactic form, were parsed semi-automatically using the parser described in Section 5. During this process, further rules and lexical entries were created for some definitions that failed to parse. Of the total number, 150 were successfully parsed and 63 lexical entries and 14 rules were added. Some of the rules required reflected general inadequacies in the ANLT grammar; for example, we added rules to deal with new partitives and prepositional phrase and verb complementation. However, 7 of these rules cover relatively idiosyncratic properties of the definition sublanguage; for example, the postmodification of pronouns by relative clause and prepositional phrase in definitions beginning something that .... that of..., parenthetical phrases headed by adverbs, such as the period... esp the period, and coordinations without explicit conjunctions ending with etc., and so forth. Further special rules will be required to deal with brackets in definitions to cover conventions such as a man (monk) or woman (nun) who lives in a monastery, which we ignored for this test. Nevertheless, the number of new rules required is not great and the need for most was identified very early in the training process. Lexical entries are more problematic, since there is little sign that the number of new entries required will tail off. However, many of the entries required reflect systematic inadequacies in the ANLT lexicon rather than idiosyncrasies of the corpus. It took approximately one person-month to produce this training corpus. As a rough guide, it takes an average of 15 seconds to resolve a single interaction with the parser. However, the time a parse takes can often be lengthened by incorrect choices (and the consequent need to back up manually) and by the process of adding lexical entries and occasional rules.</Paragraph>
    <Paragraph position="23"> The resultant parse histories were used to construct the probabilistic parser (as described in the previous section). This parser was then used to reparse the training corpus, and the most highly ranked analyses were automatically compared with the original parse histories. We have been able to reparse in a breadth-first fashion all but 3 of the 150 definitions that were parsed manually. 9 (These three are each over 25 words in length.) 8 The corpus contains about 17,000 unique headwords and 13,500 distinct word forms in the definitions. Its perplexity (PP) measures based on bigram and trigram word models and an estimate of an infinite model were PP(2) = 104, PP(3) = 41, and PP(inf) = 8 (Sharman 1991).</Paragraph>
    <Paragraph position="24"> 9 The results we report here are from using the latest versions of the ANLT grammar and LR parsing system. Briscoe and Carroll (1991) report an earlier version of this experiment using different versions of the grammar and parser in which results differed in minor ways. Carroll and Briscoe (1992) report a third version of the experiment in which results were improved slightly through the use of a better normalization and parse forest unpacking technique.</Paragraph>
    <Paragraph position="25"> There are 22 definitions one word in length: all of these trivially receive correct analyses. There are 89 definitions between two and ten words in length inclusive (mean length 6.2). Of these, in 68 cases the correct analysis (as defined by the training corpus) is also the most highly ranked. In 13 of the 21 remaining cases the correct analysis is the second or third most highly ranked analysis. Looking at these 21 cases in more detail, in 8 there is an inappropriate structural preference for 'low' or 'local' attachment (see Kimball 1973), in 4, an inappropriate preference for compounds, and in 6 of the remaining 9 cases, the highest ranked result contains a misanalysis of a single constituent two or three words in length. If these results are interpreted in terms of a goodness of fit measure such as that of Sampson, Haigh, and Atwell (1989), the measure would be better than 96%. If we take correct parse/sentence as our measure then the result is 76%. For definitions longer than 10 words this latter figure tails off, mainly due to misapplication of such statistically induced, but nevertheless structural, attachment preferences. Figure 11 summarizes these results.</Paragraph>
    <Paragraph position="26"> We also parsed a further 55 LDOCE noun definitions not drawn from the training corpus, each containing up to 10 words (mean length 5.7). Of these, in 41 cases the correct parse is the most highly ranked, in 6 cases it is the second or third most highly ranked, and in the remaining 8 cases it is not in the first three analyses. This yields a correct parse/sentence measure of 75%. Examination of the failures again reveals that a preference for local attachment of postmodifiers accounts for 5 cases, a preference for compounds for 1, and the misanalysis of a single constituent for 2. The others are mostly caused by the lack of lexical entries with appropriate SUBCAT features. In Figure 12 we show the analysis for the unseen definition of affectation, which has 20 parses of which the most highly ranked is correct.</Paragraph>
    <Paragraph position="27">  Parse tree for a person or thing that supports or helps.</Paragraph>
    <Paragraph position="28"> Figure 13 shows the highest-ranked analysis assigned to one definition of aid. This is an example of a false positive which, in this case, is caused by the lack of a lexical entry for support as an intransitive verb. Consequently, the parser finds, and ranks highest, an analysis in which supports and helps are treated as transitive verbs forming verb phrases with object NP gaps, and that supports or helps as a zero relative clause with that analyzed as a prenominal subject--compare a person or thing that that supports or helps. It is difficult to fault this analysis and the same is true for the other false positives we have looked at. Such false positives present the biggest challenge to the type of system we are attempting to develop. One hopeful sign is that the analyses assigned such examples appear to have low probabilities relative to most probable correct analyses of other examples. However, considerably more data will be required before we can decide whether this trend is robust enough to provide the basis for automatic identification of false positives.</Paragraph>
    <Paragraph position="29"> Using a manually disambiguated training corpus and manually tuned grammar appears feasible with the definitions sublanguage. Results comparable to those obtained by Fujisaki et al. (1989) and Sharman, Jelinek, and Mercer (1990) are possible on the basis of a quite modest amount of manual effort and a very much smaller training corpus, because the parse histories contain little 'noise' and usefully reflect the semantically and pragmatically appropriate analysis in the training corpus, and because the number of failures of coverage was reduced to some extent by adding the rules specifically motivated by the training corpus. Unlike Fujisaki et al. or Sharman, Jelinek, and Mercer, we did not integrate information about lexemes into the rule probabilities or make use of lexical syntactic probability. It seems likely that the structural preference for local attachment might be overruled in appropriate contexts if lexeme (or better, word sense) information were taken into account. The slightly worse results (relative to mean definition length) obtained for the unseen data appear to be caused more by the nonexistence of a correct analysis in a number of cases, rather than by a marked decline in the usefulness of the rule probabilities. This again highlights the need to deal effectively with examples outside the coverage of the grammar.</Paragraph>
  </Section>
</Paper>