<?xml version="1.0" standalone="yes"?> <Paper uid="P96-1030"> <Title>FAST PARSING USING PRUNING AND GRAMMAR SPECIALIZATION</Title> <Section position="3" start_page="0" end_page="223" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Suppose that we have a general grammar for English, or some other natural language; by this, we mean a grammar which encodes most of the important constructions in the language, and which is intended to be applicable to a large range of different domains and applications. The basic question attacked in this paper is the following one: can such a grammar be concretely useful if we want to process input from a specific domain? In particular, how can a parser that uses a general grammar achieve a level of efficiency that is practically acceptable? The central problem is simple to state. By the very nature of its construction, a general grammar allows a great many theoretically valid analyses of almost any non-trivial sentence. However, in the context of a specific domain, most of these will be extremely implausible, and can in practice be ignored.</Paragraph> <Paragraph position="1"> If we want efficient parsing, we want to be able to focus our search on only a small portion of the space of theoretically valid grammatical analyses.</Paragraph> <Paragraph position="2"> One possible solution is of course to dispense with the idea of using a general grammar, and simply code a new grammar for each domain. Many people do this, but one cannot help feeling that something is being missed; intuitively, there are many domain-independent grammatical constraints, which one would prefer only to need to code once.</Paragraph> <Paragraph position="3"> In the last ten years, there have been a number of attempts to find ways to automatically adapt a general grammar and/or parser to the sub-language defined by a suitable training corpus. For example, (Briscoe and Carroll, 1993) train an LR parser based on a general grammar to be able to distinguish between likely and unlikely sequences of parsing actions; (Andry et al., 1994) automatically infer sortal constraints, that can be used to rule out otherwise grammatical constituents; and (Grishman et al., 1984) describes methods that reduce the size of a general grammar to include only rules actually useful for parsing the training corpus.</Paragraph> <Paragraph position="4"> The work reported here is a logical continuation of two specific strands of research aimed in this general direction. The first is the popular idea of statistical tagging e.g. (DeRose, 1988; Cutting et al., 1992; Church, 1988). Here, the basic idea is that a given small segment S of the input string may have several possible analyses; in particular, if S is a single word, it may potentially be any one of several parts of speech. However, if a substantial training corpus is available to provide reasonable estimates of the relevant parameters, the immediate context surrounding S will usually make most of the locally possible analyses of S extremely implausible.</Paragraph> <Paragraph position="5"> In the specific case of part-of-speech tagging, it is well-known (DeMarcken, 1990) that a large proportion of the incorrect tags can be eliminated &quot;safely&quot;~ i.e. 
with very low risk of eliminating correct tags.</Paragraph> <Paragraph position="6"> In the present paper, the statistical tagging idea is generalized to a method called &quot;constituent pruning&quot;; this acts on local analyses of phrases normally larger than single-word units.</Paragraph> <Paragraph position="7"> Constituent pruning is a bottom-up approach, and is complemented by a second, top-down, method based on Explanation-Based Learning (EBL; (Mitchell et al., 1986; van Harmelen and Bundy, 1988)). This part of the paper is essentially an extension and generalization of the line of work described in (Rayner, 1988; Rayner and Samuelsson, 1990; Samuelsson and Rayner, 1991; Rayner and Samuelsson, 1994; Samuelsson, 1994b). Here, the basic idea is that grammar rules tend in any specific domain to combine much more frequently in some ways than in others. Given a sufficiently large corpus parsed by the original, general, grammar, it is possible to identify the common combinations of grammar rules and &quot;chunk&quot; them into &quot;macro-rules&quot;. The result is a &quot;specialized&quot; grammar; this has a larger number of rules, but a simpler structure, allowing it in practice to be parsed very much more quickly using an LR-based method (Samuelsson, 1994a). The coverage of the specialized grammar is a strict subset of that of the original grammar; thus any analysis produced by the specialized grammar is guaranteed to be valid in the original one as well. The practical utility of the specialized grammar is largely determined by the loss of coverage incurred by the specialization process. The two methods, constituent pruning and grammar specialization, are combined as follows. The rules in the original, general, grammar are divided into two sets, called phrasal and non-phrasal respectively. Phrasal rules, the majority of which define non-recursive noun phrase constructions, are used as they are; non-phrasal rules are combined using EBL into chunks, forming a specialized grammar which is then compiled further into a set of LR tables. Parsing proceeds by interleaving constituent creation and deletion. First, the lexicon and morphology rules are used to hypothesize word analyses.</Paragraph> <Paragraph position="8"> Constituent pruning then removes all sufficiently unlikely edges. Next, the phrasal rules are applied bottom-up, to find all possible phrasal edges, after which unlikely edges are again pruned. Finally, the specialized grammar is used to search for full parses.</Paragraph> <Paragraph position="9"> The scheme is fully implemented within a version of the Spoken Language Translator system (Rayner et al., 1993; Agnäs et al., 1994), and is normally applied to input in the form of small lattices of hypotheses produced by a speech recognizer.</Paragraph> <Paragraph position="10"> The rest of the paper is structured as follows. Section 2 describes the constituent pruning method. Section 3 describes the grammar specialization method, focusing on how the current work extends and improves on previous results. Section 4 describes experiments where the constituent pruning/grammar specialization method was used on sets of previously unseen speech data. Section 5 concludes and sketches further directions for research, which we are presently in the process of investigating.</Paragraph>
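As a compact summary of the processing order just described, the following Python sketch shows how the stages are composed. All names are placeholders rather than parts of the actual SLT implementation, and the pruning fractions are the ones reported in Section 2.1 below.

```python
def analyse(word_lattice, lexical_analysis, prune, phrasal_parse, full_parse):
    """Schematic composition of the processing stages described above.
    The four stage functions are passed in as placeholders standing for the
    corresponding components of the actual system."""
    chart = lexical_analysis(word_lattice)  # lexicon + morphology hypothesize word edges
    prune(chart, 1.0 / 20)                  # first pruning pass over lexical edges
    phrasal_parse(chart)                    # phrasal rules applied bottom-up
    prune(chart, 1.0 / 150)                 # second pruning pass over all edges
    return full_parse(chart)                # specialized (EBL) grammar, compiled to LR tables
```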
</Section> <Section position="4" start_page="223" end_page="225" type="metho"> <SectionTitle> 2 Constituent Pruning </SectionTitle> <Paragraph position="0"> Before both the phrasal and full parsing stages, the constituent table (henceforth, the chart) is pruned to remove edges that are relatively unlikely to contribute to correct analyses.</Paragraph> <Paragraph position="1"> For example, after the string &quot;Show flight D L three one two&quot; is lexically analysed, edges for &quot;D&quot; and &quot;L&quot; as individual characters are pruned because another edge, derived from a lexical entry for &quot;D L&quot; as an airline code, is deemed far more plausible.</Paragraph> <Paragraph position="2"> Similarly, edges for &quot;one&quot; as a determiner and as a noun are pruned because, when flanked by two other numbers, &quot;one&quot; is far more likely to function as a number.</Paragraph> <Paragraph position="3"> Phrasal parsing then creates a number of new edges, including one for &quot;flight D L three one two&quot; as a noun phrase. This edge is deemed far more likely to serve as the basis for a correct full parse than any of the edges spanning substrings of this phrase; those edges, too, are therefore pruned. As a result, full parsing is very quick, and only one analysis (the correct one) is produced for the sentence. In the absence of pruning, processing takes over eight times as long and produces 37 analyses in total.</Paragraph> <Section position="1" start_page="223" end_page="224" type="sub_section"> <SectionTitle> 2.1 The pruning algorithm </SectionTitle> <Paragraph position="0"> Our algorithm estimates the probability of correctness of each edge: that is, the probability that the edge will contribute to the correct full analysis of the sentence (assuming there is one), given certain lexical and/or syntactic information about it. Values for each criterion (a selection of pieces of information about the edge) are derived from training corpora by maximum likelihood estimation followed by smoothing. That is, our estimate for the probability that an edge with property P is correct is (modulo smoothing) simply the number of times edges with property P occur in correct analyses in training divided by the number of times such edges are created during the analysis process in training.</Paragraph> <Paragraph position="1"> The current criteria are: * The left bigram score: the probability of correctness of an edge considering only the following data about it: - its tag (corresponding to its major category symbol plus, for a few categories, some additional distinctions derived from feature values); - for a lexical edge, its word or semantic word class (words with similar distributions, such as city names, are grouped into classes to overcome data sparseness); or for a phrasal edge, the name of the final (topmost) grammar rule that was used to create it; - the tag of a neighbouring edge immediately to its left. If there are several left neighbours, the one giving the highest probability is used.</Paragraph> <Paragraph position="2"> * The right bigram score: as above, but considering right neighbours.</Paragraph> <Paragraph position="3"> * The unigram score: the probability of correctness of an edge considering only the tree of grammar rules, with words or word classes at the leaves, that gave rise to it. For a lexical edge, this reduces to its word or word class, and its tag.</Paragraph>
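As an illustration of how these per-criterion correctness probabilities might be estimated from a parsed training corpus, the following self-contained Python sketch computes relative frequencies from toy counts. The property keys and the add-one style smoothing are illustrative assumptions; the paper does not specify the smoothing scheme actually used.

```python
from collections import Counter

def estimate_correctness(created, correct, smoothing=1.0):
    """For each property value P seen in training, estimate
    P(edge is correct | edge has property P) as (roughly) the number of edges
    with property P occurring in correct analyses divided by the number of
    such edges created during analysis.  The add-one style smoothing is an
    illustrative assumption."""
    return {p: (correct[p] + smoothing) / (created[p] + 2.0 * smoothing)
            for p in created}

# Toy counts for the left bigram criterion: the property of an edge is
# (its tag, its word or word class or topmost rule, tag of a left neighbour).
created = Counter({("number", "number_class", "number"): 60,
                   ("det", "one", "number"): 40})
correct = Counter({("number", "number_class", "number"): 55,
                   ("det", "one", "number"): 1})
print(estimate_correctness(created, correct))
# "one" read as a determiner after a number gets a probability near zero,
# so such edges will tend to be pruned, as in the example above.
```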
<Paragraph position="4"> Other criteria, such as trigrams and finer-grained tags, are obviously worth investigating, and could be applied straightforwardly within the framework described here.</Paragraph> <Paragraph position="5"> The minimum score derived from any of the criteria applied is deemed initially to be the score of the constituent. That is, an assumption of full statistical dependence (Yarowsky, 1994), rather than the more common full independence, is made. (If events E1, E2, ..., En are fully independent, then the joint probability P(E1 ∧ ... ∧ En) is the product P(E1)...P(En); but if they are maximally dependent, it is the minimum of these values.) Of course, neither assumption is any more than an approximation to the truth; but assuming dependence has the advantage that the estimate of the joint probability depends much less strongly on n, and so estimates for alternative joint events can be directly compared, without any possibly tricky normalization, even if they are composed of different numbers of atomic events. This property is desirable: different (sub-)paths through a chart may span different numbers of edges, and one can imagine evaluation criteria which are only defined for some kinds of edge, or which often duplicate information supplied by other criteria. Taking minima means that an edge can be pruned because it scores poorly on a single criterion, regardless of other, possibly good scores assigned to it by other criteria. This fits in with the fact that on the basis of local information alone it is not usually possible to predict with confidence that a particular edge is highly likely to contribute to the correct analysis (since global factors will also be important), but it often is possible to spot highly unlikely edges. In other words, our training procedure yields far more probability estimates close to zero than close to one.</Paragraph> <Paragraph position="6"> When recognizer output is being processed, however, the estimate from each criterion is in fact multiplied by a further estimate derived from the acoustic score of the edge: that is, the score assigned by the speech recognizer to the best-scoring sentence hypothesis containing the word or word string for the edge in question. Multiplication is used here because acoustic and lexicosyntactic likelihoods for a word or constituent would appear to be more nearly fully independent than fully dependent, being based on very different kinds of information.</Paragraph> <Paragraph position="7"> Next, account is taken of the connectivity of the chart. Each vertex of the chart is labelled with the score of the best path through the chart that visits that vertex. In accordance with the dependence assumption, the score of a path is defined as the minimum of the scores of its component edges. Then the score of each edge is recalculated to be the minimum of its existing score and the scores of its start and end vertices, on the grounds that a constituent, however intrinsically plausible, is not worth preserving if it does not occur on any plausible paths.</Paragraph> <Paragraph position="8"> Finally, a pruning threshold is calculated as the score of the best path through the chart multiplied by a certain fraction. For the first pruning phase we use 1/20, and for the second, 1/150, although performance is not very sensitive to this. Any constituents scoring less than the threshold are pruned out.</Paragraph>
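To make the connectivity and thresholding steps concrete, here is a small self-contained Python sketch of the procedure just described. The chart representation, the numeric scores and the example edges are invented for illustration (the real system works on lattices and on its own chart structures); each edge's score is assumed to be the minimum of its per-criterion estimates, as above.

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    start: int    # left chart vertex
    end: int      # right chart vertex (start < end)
    label: str    # word or category, for readability only
    score: float  # minimum of the per-criterion correctness estimates

def prune_chart(edges, n_vertices, fraction):
    """Rescore each edge by the best path through the chart that uses it
    (path score = minimum of its edge scores), then drop every edge scoring
    below fraction * (score of the best complete path)."""
    INF = math.inf
    # forward[v]: best (maximin) score of any path from vertex 0 to v.
    forward = [-INF] * n_vertices
    forward[0] = INF
    for v in range(1, n_vertices):
        forward[v] = max((min(forward[e.start], e.score)
                          for e in edges if e.end == v), default=-INF)
    # backward[v]: best score of any path from v to the final vertex.
    backward = [-INF] * n_vertices
    backward[-1] = INF
    for v in range(n_vertices - 2, -1, -1):
        backward[v] = max((min(backward[e.end], e.score)
                           for e in edges if e.start == v), default=-INF)
    # Vertex score: best path through the chart visiting that vertex.
    vertex_score = [min(f, b) for f, b in zip(forward, backward)]
    threshold = forward[-1] * fraction     # best complete path, scaled
    return [e for e in edges
            if min(e.score, vertex_score[e.start], vertex_score[e.end]) >= threshold]

# Toy chart for "show D L three": the single-letter readings of "D" and "L"
# score poorly and are pruned; the airline-code reading survives.
edges = [Edge(0, 1, "show", 0.9),
         Edge(1, 2, "D", 0.01), Edge(2, 3, "L", 0.01),
         Edge(1, 3, "D L (airline code)", 0.8),
         Edge(3, 4, "three", 0.9)]
print([e.label for e in prune_chart(edges, 5, fraction=1.0 / 20)])
```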
</Section> <Section position="2" start_page="224" end_page="225" type="sub_section"> <SectionTitle> 2.2 Relation to other pruning methods </SectionTitle> <Paragraph position="0"> As the example above suggests, judicious pruning of the chart at appropriate points can greatly restrict the search space and speed up processing. Our method has points of similarity with some very recent work in Constraint Grammar, and is an alternative to several other, related schemes.</Paragraph> <Paragraph position="1"> Firstly, as remarked earlier, it generalizes tagging: it not only adjudicates between possible labels for the same word, but can also use the existence of a constituent over one span of the chart as justification for pruning another constituent over another span, normally a subsumed one, as in the &quot;D L&quot; example. This is especially true in the second stage of pruning, when many constituents of different lengths have been created. Furthermore, it applies equally well to lattices, rather than strings, of words, and can take account of acoustic plausibility as well as syntactic considerations.</Paragraph> <Paragraph position="2"> Secondly, our method is related to beam search (Woods, 1985). In beam search, incomplete parses of an utterance are pruned or discarded when, on some criterion, they are significantly less plausible than other, competing parses. This pruning is fully interleaved with the parsing process. In contrast, our pruning takes place only at certain points: currently before parsing begins, and between the phrasal and full parsing stages. Potentially, as with any generate-and-test algorithm, this can mean efficiency is reduced: some paths will be explored that could in principle be pruned earlier. However, as the results in Section 4 below will show, this is not in practice a serious problem, because the second pruning phase greatly reduces the search space in preparation for the potentially inefficient full parsing phase. Our method has the advantage, compared to beam search, that there is no need for any particular search order to be followed; when pruning takes place, all constituents that could have been found at the stage in question are guaranteed already to exist.</Paragraph> <Paragraph position="3"> Thirdly, our method is a generalization of the strategy employed by (McCord, 1993). McCord interleaved parsing with pruning in the same way as we do, but only compared constituents over the same span and with the same major category. Our comparisons are more global and therefore can result in more effective pruning.</Paragraph> </Section> </Section> <Section position="5" start_page="225" end_page="226" type="metho"> <SectionTitle> 3 Grammar specialization </SectionTitle> <Paragraph position="0"> As described in Section 1 above, the non-phrasal grammar rules are subjected to two phases of processing. In the first, &quot;EBL learning&quot; phase, a parsed training corpus is used to identify &quot;chunks&quot; of rules, which are combined by the EBL algorithm into single macro-rules. In the second phase, the resulting set of &quot;chunked&quot; rules is converted into LR table form, using the method of (Samuelsson, 1994a).</Paragraph> <Paragraph position="1"> There are two main parameters that can be adjusted in the EBL learning phase. Most simply, there is the size of the training corpus; a larger training corpus means a smaller loss of coverage due to grammar specialization.
(Recall that grammar specialization in general trades coverage for speed.) Secondly, there is the question of how to select the rule-chunks that will be turned into macro-rules. At one limit, the whole parse-tree for each training example is turned into a single rule, resulting in a specialized grammar all of whose derivations are completely &quot;flat&quot;. These grammars can be parsed extremely quickly, but the coverage loss is in practice unacceptably high, even with very large training corpora.</Paragraph> <Paragraph position="2"> At the opposite extreme, each rule-chunk consists of a single rule-application; this yields a specialized grammar identical to the original one. The challenge is to find an intermediate solution, which specializes the grammar non-trivially without losing too much coverage.</Paragraph> <Paragraph position="3"> Several attempts to find good &quot;chunking criteria&quot; are described in the papers by Rayner and Samuelsson quoted above. In (Rayner and Samuelsson, 1994), a simple scheme is given, which creates rules corresponding to four possible units: full utterances, recursive NPs, PPs, and non-recursive NPs.</Paragraph> <Paragraph position="4"> A more elaborate scheme is given in (Samuelsson, 1994b), where the &quot;chunking criteria&quot; are learned automatically by an entropy-minimization method; the results, however, do not appear to improve on the earlier ones. In both cases, the coverage loss due to grammar specialization was about 10 to 12% using training corpora with about 5,000 examples.</Paragraph> <Paragraph position="5"> In practice, this is still unacceptably high for most applications.</Paragraph> <Paragraph position="6"> Our current scheme is an extension of the one from (Rayner and Samuelsson, 1994), where the rule-chunks are trees of non-phrasal rules whose roots and leaves are categories of the following possible types: full utterances, utterance units, imperative VPs, NPs, relative clauses, VP modifiers and PPs.</Paragraph> <Paragraph position="7"> The resulting specialized grammars are forced to be non-recursive, with derivations being a maximum of six levels deep. This is enforced by imposing the following dominance hierarchy between the possible categories:</Paragraph> <Paragraph position="9"> The precise definition of the rule-chunking criteria is quite simple, and is reproduced in the appendix.</Paragraph> <Paragraph position="10"> Note that only the non-phrasal rules are used as input to the chunks from which the specialized grammar rules are constructed. This has two important advantages. Firstly, since all the phrasal rules are excluded from the specialization process, the coverage loss associated with missing combinations of phrasal rules is eliminated. As the experiments in the next section show, the resulting improvement is quite substantial. Secondly, and possibly even more importantly, the number of specialized rules produced by a given training corpus is approximately halved. The most immediate consequence is that much larger training corpora can be used before the specialized grammars produced become too large to be handled by the LR table compiler. If both phrasal and non-phrasal rules are used, we have been unable to compile tables for rules derived from training sets of over 6,000 examples (the process was killed after running for about six hours on a Sun Sparc 20/HS21, SpecINT92=131.2). Using only non-phrasal rules, compilation of the tables for a 15,000 example training set required less than two CPU-hours on the same machine.</Paragraph>
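To illustrate the chunking step, the following simplified Python sketch cuts a toy parse tree of rule applications at a fixed set of categories and flattens each chunk into a macro-rule. The category names, the tree and the cut set are invented for illustration; the real criteria are those reproduced in the appendix, and the real system of course operates on full grammar rules rather than bare category symbols.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Categories at which parse trees are cut into chunks (toy names standing in
# for the units listed above: utterances, utterance units, imperative VPs,
# NPs, relative clauses, VP modifiers and PPs).
CUT_CATEGORIES = {"utterance", "utterance_unit", "imp_vp", "np",
                  "rel_clause", "vp_mod", "pp"}

@dataclass
class Node:
    cat: str                  # category of this constituent
    children: List["Node"]    # empty for frontier (lexical or phrasal) edges

def chunk(tree: Node) -> List[Tuple[str, List[str]]]:
    """Cut a parse tree at the categories in CUT_CATEGORIES and return the
    resulting flattened macro-rules as (lhs_category, rhs_categories) pairs."""
    rules = []

    def frontier(node):
        # Collect the daughters of one chunk, stopping at frontier edges and
        # at cut categories; recursively chunk below each cut point.
        rhs = []
        for child in node.children:
            if not child.children or child.cat in CUT_CATEGORIES:
                rhs.append(child.cat)
                if child.children and child.cat in CUT_CATEGORIES:
                    rules.append((child.cat, frontier(child)))
            else:
                rhs.extend(frontier(child))   # flatten intermediate categories
        return rhs

    rules.append((tree.cat, frontier(tree)))
    return rules

# Toy tree for an utterance like "show the flights to Boston".
tree = Node("utterance", [
    Node("imp_vp", [
        Node("v", []),
        Node("np", [
            Node("det", []),
            Node("nbar", [Node("n", []),
                          Node("pp", [Node("p", []), Node("np", [])])]),
        ]),
    ]),
])
for lhs, rhs in chunk(tree):
    print(lhs, "->", " ".join(rhs))
# The "np" macro-rule comes out flattened to "det n pp": the intermediate
# nbar node is absorbed into the chunk.
```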
</Section> </Paper>