<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1010">
  <Title>Learning Stochastic Categorial Grammars</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Stochastic context free grammars (SCFGs), which are standard context free grammars extended with a probabilistic interpretation of the generation of strings, have been shown to model some sources with hidden branching processes more efficiently than stochastic regular grammars (Lari and Young, 1990). Furthermore, SCFGs can be automatically estimated using the Inside-Outside algorithm, which is guaranteed to produce a SCFG that is (locally) optimal (Baker, 1990). Hence, SCFGs appear to be suitable formalisms for the estimation of widecovering grammars, capable of being used as part of a system that assigns logical forms to sentences.</Paragraph>
    <Paragraph position="1"> Unfortunately, from a Natural Language Processing perspective, SCFGs are not appropriate grammars to learn. Firstly, as Collins demonstrates (Collins, 1996), accurate parse selection, which is important for ambiguity resolution, requires lexical statistics. SCFGs, as standardly used in the Inside-Outside algorithm, are in Chomsky Normal Form (CNF), which restricts rules to being at most binary branching. Such rules are not lexicalised, and hence, to lexicalise (CNF) CFGs requires adding a complex statistical model that simulates the projection of head items up the parse tree. Given the embryonic status of grammatical statistical models and the difficulties of accurately estimating the parameters of such a model, it seems more prudent to prefer whenever possible simpler statistical models with fewer parameters, and treat lexicalisation as part of the grammatical formalism, and not as part of the statistical framework (for example (Schabes, 1992)). Secondly, (stochastic) CFGs are well-known as being linguistically inadequate formalisms for problems such as non-constituent coordination.</Paragraph>
    <Paragraph position="2"> Hence, a learner using a SCFG will not have an appropriate formalism with which to construct an adequate grammar.</Paragraph>
    <Paragraph position="3"> Stochastic categorial grammars (SCGs), which are classical categorial grammars extended with a probabilistic component, by contrast, have a grammatical component that is naturally lexicalised. Furthermore, Combinatory Categorial Grammars have been shown to account elegantly for problematic areas of syntax such as non-constituent co-ordination (Steedman, 1989), and so it seems likely that SCGs, when suitably extended, will be able to inherit this linguistic adequacy. We therefore believe that SCGs are more useful formalisms for statistical language learning than SCFGs. Future work will reinforce the differences between SCFGs and SCGS, but in this paper, we instead concentrate upon the estimation of SCGs.</Paragraph>
    <Paragraph position="4"> Stochastic grammars (of all varieties) are usually estimated using the Maximum Likelihood Principle, which assumes an indifferent prior probability distribution. When there is sufficient training material, Maximum Likelihood Estimation (MLE) produces good results. More usually however, with many thousands of parameters to estimate, there will be insufficient training material for MLE to produce an optimal solution. If, instead, an informative prior is used in place of the indifferent prior, better results can be achieved. In this paper we show how using an informative prior probability distribution Osborne 8~ Briscoe 80 Stochastic Categorial Grammars Miles Osborne and Ted Briscoe (1997) Learning Stochastic Categorial Grammars. In T.M. Ellison (ed.) CoNLL97: Computational Natural Language Learning, ACL pp 80-87. (~) 1997 Association for Computational Linguistics leads to the estimation of a SCG that is more accurate than a SCG estimated using an indifferent prior. We use the Minimum Description Length Principle (MDL) as the basis of our informative prior. To our knowledge, we know of no other papers comparing MDL to MLE using naturally occurring data and learning probabilistic grammars. For example, Stolcke's MDL-based learner was trained using artificial data (Stolcke, 1984); Chen's similar learner mixes smoothing techniques with MDL, thereby obfuscating the difference between MDL and MLE (Chen, 1996).</Paragraph>
    <Paragraph position="5"> The structure of the rest of this paper is as follows.</Paragraph>
    <Paragraph position="6"> In section 2 we introduce SCGs. We then in section 3 present a problem facing most statistical learners known as over\]itting. Section 4 gives an overview of the MDL principle, which we use to deal with overfitting1; in section 5 we present our learner. Following this, in section 6 we give some experiments comparing use of MDL, with a MLE-style learner.</Paragraph>
    <Paragraph position="7"> The paper ends with some brief comments.</Paragraph>
  </Section>
class="xml-element"></Paper>