<?xml version="1.0" standalone="yes"?>
<Paper uid="J93-1002">
  <Title>Generalized Probabilistic LR Parsing of Natural Language (Corpora) with Unification-Based Grammars</Title>
  <Section position="8" start_page="52" end_page="55" type="concl">
    <SectionTitle>
9. Conclusions and Further Work
</SectionTitle>
    <Paragraph position="0"> The system that we have developed offers partial and practical solutions to two of the three problems of corpus analysis we identified in the introduction. The problem of tuning an existing grammar to a particular corpus or sublanguage is addressed partly by manual extensions to the grammar and lexicon during the semi-automatic training phase and partly by use of statistical information regarding frequency of rule use gathered during this phase. The results of the experiment reported in the last section suggest that syntactic peculiarities of a sublanguage or corpus surface quite rapidly, so that manual additions to the grammar during the training phase are practical. However, lexical idiosyncrasies are far less likely to be exhausted during the training phase, suggesting that it will be necessary to develop an automatic method of dealing with them. In addition, the current system does not take account of differing frequencies of occurrence of lexical entries; for example, in the LOB corpus the verb believe occurs with a finite sentential complement in 90% of citations, although it is grammatical with at least five further patterns of complementation. This type of lexical information, which will very likely vary between sublanguages, should be integrated into the probabilistic model. This will be straightforward in terms of the model, since it merely involves associating probabilities with each distinct lexical entry for a lexeme and carrying these forward in the computation of the likelihood of each parse.</Paragraph>
    <Paragraph position="1"> However, the acquisition of the statistical information from which these probabilities can be derived is more problematic. Existing lexical taggers are unable to assign tags that reliably encode subcategorization information. It seems likely that automatic acquisition of such information must await successful techniques for robust parsing of, at least, phrases in corpora (though Brent \[1991\] claims to be able to recognize some subcategorization patterns using large quantities of untagged text).</Paragraph>
    <Paragraph position="2"> The task of selecting the correct analysis from the set licensed by the grammar is also partially solved by the system. It is clear from the results of the preliminary experiment reported in the previous section that it is possible to make the semantically and pragmatically correct analysis highly ranked, and even most highly ranked in many cases, just by exploiting the frequency of occurrence of the syntactic rules in the training data. However, it is also clear that this approach will not succeed in all cases; for example, in the experiment the system appears to have developed a preference for local attachment of prepositional phrases (PPs), which is inappropriate in a  Computational Linguistics Volume 19, Number 1 significant number of cases. It is not surprising that probabilities based solely on the frequency of syntactic rules are not capable of resolving this type of ambiguity; in an example such as John saw the man on Monday again it is the temporal interpretation of Monday that favors the adverbial interpretation (and thus nonlocal attachment). Such examples are syntactically identical to ones such as John saw the man on the bus again, in which the possibility of a locative interpretation creates a mild preference for the adjectival reading and local attachment. To select the correct analysis in such cases it will be necessary to integrate information concerning word sense collocations into the probabilistic analysis. In this case, we are interested in collocations between the head of a PP complement, a preposition and the head of the phrase being postmodified. In general, these words will not be adjacent in the text, so it will not be possible to use existing approaches unmodified (e.g. Church and Hanks 1989), because these apply to adjacent words in unanalyzed text. Hindle and Rooth (1991) report good results using a mutual information measure of collocation applied within such a structurally defined context, and their approach should carry over to our framework straightforwardly. null One way of integrating 'structural' collocational information into the system presented above would be to make use of the semantic component of the (ANLT) grammar. This component pairs logical forms with each distinct syntactic analysis that represent, among other things, the predicate-argument structure of the input. In the resolution of PP attachment and similar ambiguities, it is 'collocation' at this level of representation that appears to be most relevant. Integrating a probabilistic ranking of the resultant logical forms with the probabilistic ranking of the distinct syntactic analyses presents no problems, in principle. However, once again, the acquisition of the relevant statistical information will be difficult, because it will require considerable quantities of analyzed text as training material. One way to ameliorate the problem might be to reduce the size of the 'vocabulary' for which statistics need to be gathered by replacing lexical items with their superordinate terms (or a disjunction of such terms in the case of ambiguity). Copestake (1990, 1992) describes a program capable of extracting the genus term of a definition from an LDOCE definition, resolving the sense of such terms, and constructing hierarchical taxonomies of the resulting word senses. Taxonomies of this form might be used to replace PP complement heads and postmodified heads in corpus data with a smaller number of superordinate concepts.</Paragraph>
    <Paragraph position="3"> This would make the statistical data concerning trigrams of head-preposition-head less sparse (cf. Gale and Church 1990) and easier to gather from a corpus. Nevertheless, it will only be possible to gather such data from determinately syntactically analyzed material.</Paragraph>
    <Paragraph position="4"> The third problem of dealing usefully with examples outside the coverage of the grammar even after training is not addressed by the system we have developed. Nevertheless, the results of the preliminary experiment for unseen examples indicate that it is a significant problem, at least with respect to lexical entries. A large part of the problem with such examples is identifying them automatically. Some such examples will not receive any parse and will, therefore, be easy to spot. Many, though, will receive incorrect parses (one of which will be automatically ranked as the most probable) and can, therefore, only be identified manually (or perhaps on the basis of relative improbability). Jensen et al. (1983) describe an approach to parsing such examples based on parse 'fitting' or rule 'relaxation' to deal with 'ill-formed' input. An approach of this type might work with input that receives no parse, but cannot help with the identification of those that only receive an incorrect one. In addition, it involves annotating each grammar rule about what should be relaxed and requires that semantic interpretation can be extended to 'fitted' or partial parses (e.g. Pollack and Pereira 1988).</Paragraph>
    <Paragraph position="5">  Ted Briscoe and John Carroll Generalized Probabilistic LR Parsing Sampson, Haigh, and Atwell (1989) propose a more thorough-going probabilistic approach in which the parser uses a statistically defined measure of 'closest fit' to the set of analyses contained in a 'tree bank' of training data to assign an analysis.</Paragraph>
    <Paragraph position="6"> This approach attempts to ensure that analyses of new data will conform as closely as possible to existing ones, but does not require that analyses assigned are well formed with respect to any given generative grammar implicit in the tree bank analyses.</Paragraph>
    <Paragraph position="7"> Sampson, Haigh, and Atwell report some preliminary results for a parser of this type that uses the technique of simulated annealing to assign the closest fitting analysis on the basis of initial training on the LOB treebank and automatic updating of its statistical data on the basis of further parsed examples. Sampson, Haigh, and Atwell give their results in terms of a similarity measure with respect to correct analyses assigned by hand. For a 13-sentence sample the mean similarity measure was 80%, and only one example received a fully correct analysis. These results suggest that the technique is not reliable enough for practical corpus analysis, to date. In addition, the analyses assigned, on the basis of the LOB treebank scheme, are not syntactically determinate (for example, syntactic relations in unbounded dependency constructions are not represented).</Paragraph>
    <Paragraph position="8"> A more promising approach with similar potential robustness would be to infer a probabilistic grammar using Baum-Welch re-estimation from a given training corpus and predefined category set, following Lari and Young (1990) and Pereira and Schabes (1992). This approach has the advantage that the resulting grammar defines a well-defined set of analyses for which rules of compositional interpretation might be developed. However, the technique is limited in several ways; firstly, such grammars are restricted to small (maximum about 15 nonterminal) CNF CFGs because of the computational cost of iterative re-estimation with an algorithm polynomial in sentence length and nonterminal category size; and secondly, because some form of supervised training will be essential if the analyses assigned by the grammar are to be linguistically motivated. Immediate prospects for applying such techniques to realistic NL grammars do not seem promising--the ANLT backbone grammar discussed in Section 4 contains almost 500 categories. However, Briscoe and Waegner (1992) describe an experiment in which, firstly, Baum-Welch re-estimation was used in conjunction with other more linguistically motivated constraints on the class of grammars that could be inferred, such as 'headedness'; and secondly, initial probabilities were heavily biased in favor of manually coded, linguistically highly plausible rules. This approach resulted in a simple tag sequence grammar often able to assign coherent and semantically/pragmatically plausible analyses to tag sequences drawn from the Spoken English Corpus. By combining such techniques and relaxing the CNF constraint, for example, by adopting the trellis algorithm version of Baum-Welch re-estimation (Kupiec 1991), it might be possible to create a computationally tractable system operating with a realistic NL grammar that would only infer a new rule from a finite space of linguistically motivated possibilities in the face of parse failure or improbability. In the shorter term, such techniques combined with simple tag sequence grammars might yield robust phrase-level 'skeleton' parsers that could be used as corpus analysis tools.</Paragraph>
    <Paragraph position="9"> The utility of the system reported here would be considerably improved by a more tractable approach to probabilistically unpacking the packed parse forest than exhaustive search. Finding the n-best analyses would allow us to recover analyses for longer sentences where a parse forest is constructed and would make the approach generally more efficient. Carroll and Briscoe (1992) present a heuristic algorithm for parse forest unpacking that interleaves normalization of competing sub-analyses with best-first extraction of the n most probable analyses. Normalization of competing sub-analyses with respect to the longest derivation both allows us to prune the search probabilisti- null Computational Linguistics Volume 19, Number 1 cally and to treat the probability of analyses as the product of the probability of their sub-analyses, without biasing the system in favor of shorter derivations. This modified version of the system presented here is able to return analyses for sentences over 31 words in length, yields slightly better results on a replication of the experiment reported in Section 8, and the resultant parser is approximately three times faster at returning the three highest-ranked parsers than that presented here.</Paragraph>
    <Paragraph position="10"> In conclusion, the main positive points of the paper are that 1) LR parse tables can be used to define a more context-dependent and adequate probabilistic model of NL, 2) predictive LR parse tables can be constructed automatically from unification-based grammars in standard notation, 3) effective parse table construction and representation techniques can be defined for realistically sized ambiguous NL grammars, 4) semi-automatic LR based parse techniques can be used to efficiently construct training corpora, and 5) the LR parser and ANLT grammar jointly define a useful probabilistic model into which probabilities concerning lexical subcategorization and structurally defined word sense collocations could be integrated.</Paragraph>
  </Section>
class="xml-element"></Paper>