<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1010">
  <Title>Learning Stochastic Categorial Grammars</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Grammar formalism and statistical models
</SectionTitle>
    <Paragraph position="0"> statistical models An SCG is a classical categorial grammar (one using just functional application, see, for example, Wood (Wood, 1993)) such that each category is augmented with a probability, which is used to model the choices made when constructing a parse. Categories are conditioned on the lexical item they are assigned to. More formally, a categorial lexicon G is the tuple (A, C, V, L), where:  learning can be found in (Rissanen and Ristad, 1994, de Marcken, 1996).</Paragraph>
    <Paragraph position="1"> A categorial grammar consists of a categorial lexicon augmented with the rule of left functional application: a b\a ⇒ b, and the rule of right functional application: b/a a ⇒ b.</Paragraph>
    <Paragraph position="2"> A probabilistic categorial grammar is a categorial grammar such that the sum of the probabilities of all derivations is one. In our variant of a categorial grammar there are no variables in categories, directional information is encoded into each category, and we only use functional application, so the derivation of any sentence follows mechanically from the assignment of categories to lexical items. The choices available when parsing with a categorial grammar therefore arise solely from the particular assignment of categories to each lexical item. Within a stochastic process, probabilities model these choices, so in a stochastic categorial grammar we need to ensure that the probabilities of all categories assigned to a particular lexical item sum to one. That is, for all categories c in lexicon C assigned to lexical item w:</Paragraph>
    <Paragraph position="3"> \sum_{c} \frac{f(c)}{\sum_{x} f(x)} = 1 </Paragraph>
    <Paragraph position="4"> for each distinct category c, occurring with frequency f(c), that can be assigned to lexical item w, and for all categories x, with frequency f(x), that can also be assigned to w.</Paragraph>
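    <Paragraph position="4a"> As a minimal illustration of this normalisation (a sketch in Python; the function name and toy data are ours, not the original implementation), the probability of a category given a lexical item can be estimated by relative frequency, so that the category probabilities for each item sum to one:
      def estimate_category_probs(lexical_counts):
          """lexical_counts maps each lexical item w to a dict of category: frequency f(c)."""
          probs = {}
          for w, cat_counts in lexical_counts.items():
              total = sum(cat_counts.values())  # sum over all categories x assignable to w of f(x)
              probs[w] = {c: f / total for c, f in cat_counts.items()}
          return probs

      # Toy example: the tag 'nns' assigned the category 'nns' three times and 'nns\\jj' once.
      print(estimate_category_probs({'nns': {'nns': 3, 'nns\\jj': 1}}))
      # {'nns': {'nns': 0.75, 'nns\\jj': 0.25}}
    </Paragraph>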
    <Paragraph position="5"> For the derivation space actually to sum to one, all possible assignments of categories to lexical items must be legal. Clearly, only assignments of categories to lexical items that combine to form a valid parse constitute legal category assignments, and so there will be a probability loss: the sum over all derivations will be less than or equal to one. We can either scale the probabilities so that the derivations do sum to one, or alternatively assume that (illegal) assignments of categories are never seen, giving a zero probability to the illegal category assignments while leaving the relative probabilities between the legal category assignments unaffected. (Thanks to Eirik Hektoen for pointing this out.) Because categories are normalised with respect to the lexical item they are associated with, the resulting statistical model is lexicalised. However, in this paper we learn lexica of part-of-speech tag sequences, and not lexica for actual words. That is, the set of simple categories is taken as being a part-of-speech tag set and the set of words is also a set of</Paragraph>
    <Paragraph position="6"> part-of-speech tags. This greatly reduces the number of parameters to be acquired, compared with the case in which the lexicon contained a set of words, but incurs the obvious cost of a loss of accuracy. In future experiments, we plan to learn fully-lexicalised SCGs.</Paragraph>
    <Paragraph position="7"> Having introduced SCGs, we now turn to the problem of overfitting.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Overfitting
</SectionTitle>
    <Paragraph position="0"> Bayesian inference forms the basis of many popular language learning systems, examples of which include the Baum-Welch algorithm for estimating hidden Markov models (Baum, 1972) and the Inside-Outside algorithm for estimating CFGs (Baker, 1990). As is well known, Bayes' theorem takes the following form:</Paragraph>
    <Paragraph position="1"> P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)} </Paragraph>
    <Paragraph position="2"> Here, the term P(H) is the prior probability, P(D | H) is the likelihood probability, and P(H | D) is the posterior probability. The prior probability of H can be interpreted as quantifying one's belief in H. If the prior is accurate, hypotheses that are closer to the target hypothesis will have a higher prior probability assigned to them than hypotheses that are further away from the target hypothesis. The likelihood probability describes how well the training material can be encoded in the hypothesis. For example, one would hope that the training corpus would receive a high likelihood probability, but a set of ungrammatical sentences would receive a low likelihood probability. Finally, the posterior probability can be considered to be the combination of these two probability distributions: we prefer hypotheses that accord with our prior belief in them (have a high prior probability) and model the training material well (have a high likelihood probability). When learning in a Bayesian framework, we try to find some hypothesis that maximises the posterior probability. For example, we might try to find some maximally probable grammar H given some corpus D.</Paragraph>
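    <Paragraph position="2a"> A toy numeric sketch of how the posterior combines the two distributions (the figures are invented purely for illustration); since P(D) is constant across hypotheses, unnormalised posteriors suffice for comparison:
      # Two hypothetical grammars scored against the same corpus D.
      candidates = {
          'compact_grammar': {'prior': 0.7, 'likelihood': 1e-12},
          'overfitted_grammar': {'prior': 0.3, 'likelihood': 2e-12},
      }
      posteriors = {h: v['prior'] * v['likelihood'] for h, v in candidates.items()}
      print(max(posteriors, key=posteriors.get))  # 'compact_grammar' wins despite its lower likelihood
    </Paragraph>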
    <Paragraph position="3"> The usual setting is for the learner to assume an uninformative (indifferent) prior, yielding maximum likelihood estimation (MLE).</Paragraph>
    <Paragraph position="4"> Usually, with sufficient data, MLE gives good results.</Paragraph>
    <Paragraph position="5"> However, with insufficient data, which is the standard case when there are many thousands of parameters to estimate, MLE, unless checked, will lead to the estimation of a large theory whose probability mass is concentrated upon the training set, with a consequential poor prediction of future, unseen events. This problem is known as over-fitting. Over-fitting affects all Bayesian learners that assume an uninformative prior and are given insufficient training data. An over-fitted theory poorly predicts future events not seen in the training set. Clearly, good prediction of unseen events is the central task of language learners, and so steps need to be taken to avoid over-fitting.</Paragraph>
    <Paragraph position="6"> Over-fitting is generally tackled in two ways: * Restrict the learner such that it cannot express the maximally likely hypothesis, given some hypothesis language.</Paragraph>
    <Paragraph position="7"> * Smooth the resulting parameters in the hope that they back off from the training data and apportion more of the probability mass to account for unseen material.</Paragraph>
    <Paragraph position="8"> Examples of the first approach can be seen most clearly with the usage of CNF grammars by the Inside-Outside algorithm (Pereira and Schabes, 1992, Lari and Young, 1990). A grammar in CNF does not contain rules of an arbitrary arity, and so when learning CNF grammars, the Inside-Outside algorithm cannot find the maximal likelihood estimation of some training set. The problem with this language restriction is that there is no a priori reason why one should settle with any particular limit on rule arity; some grammars mainly contain binary rules, but others (for example those implicitly within tree-banks) sometimes contain rules with many right-hand side categories. Any language restriction, in lieu of some theory of rule arity, must remain ad hoc. Note that SCGs, whilst assigning binary branching trees to sentences, contain categories that may naturally be of an arbitrary length, without violating linguistic intuitions about what constitutes a plausible analysis of some sentence.</Paragraph>
    <Paragraph position="9"> Examples of the second approach can be found in language modelling (for example, (Church and Gale, 1991; Katz, 1987)). Smoothing a probability distribution tends to make it 'closer' (reduces the Kullback-Leibler distance) to some other probability distribution (for example, the uniform distribution).</Paragraph>
    <Paragraph position="10"> Unfortunately, there is no guarantee that this other distribution is closer to the target probability distribution than was the original, un-smoothed distribution, and so smoothing cannot be relied upon always to improve upon the un-smoothed theory. Smoothing is also a post-hoc operation, unmotivated by details of what is actually being learnt or by properties (problems) of the estimation process. Instead of selecting some language restriction or resorting to smoothing, a better solution to the over-fitting problem would be to use an informative prior. One such prior is in terms of theory minimisation, the pursuit of which leads to the Minimum Description Length Principle (MDL) (Rissanen, 1989).</Paragraph>
    <Paragraph position="11"> In this paper we demonstrate that using MDL gives better results than using an uninformative prior. Elsewhere, we demonstrated that (Good-Turing) smoothing does improve upon the accuracy of an SCG estimated using MLE, but that the best results were still obtained when using MDL (Osborne, 1997).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The MDL Principle
</SectionTitle>
    <Paragraph position="0"> Learning can be viewed as compression of the training data in terms of a compact hypothesis. It can be shown that, under very general assumptions, the hypothesis with the minimal, or nearly minimal, complexity that is consistent with the training data will, with high probability, predict future observations well (Blumer et al., 1987). One way of finding a good hypothesis is to use a prior that favours hypotheses that are consistent with the training data but have minimal complexity. That is, the prior should be construed in terms of how well the hypothesis can be compressed (since significant compression is equivalent to a low stochastic complexity).</Paragraph>
    <Paragraph position="1"> We can compress the hypothesis by replacing it with code words, such that when measured in bits of information, the total length of the encoding is less than, or equal to, the length of the hypothesis, also measured in bits. To achieve this aim, objects in the hypothesis that occur frequently should be assigned shorter code words than objects that occur infrequently. Let l(H) be the total length of the code words for some set of objects H, as assigned by some optimal coding scheme. It turns out that:</Paragraph>
    <Paragraph position="2"> P(H) = 2^{-l(H)} </Paragraph>
    <Paragraph position="3"> can be used as a prior probability for H. The smaller l(H), the greater the compression, and so the higher the prior probability.</Paragraph>
    <Paragraph position="4"> There is an equivalence between description lengths, as measured in bits, and probabilities: the Shannon Complexity of some object x, with probability P(x), is -\log(P(x)) (all logarithms are to the base 2). This gives the minimal number of bits required to encode some object. Hence, we can give a description length to both the prior and likelihood probabilities. Using these description lengths, we have the MDL Principle: we should select some hypothesis H that minimises the sum of the two description lengths, -\log(P(H)) - \log(P(D \mid H)). The first part of this sum says prefer hypotheses that are compact; the second part says prefer hypotheses that fit the data well. Both aspects of a theory are taken into consideration to arrive at a proper balance between overly favouring a compact hypothesis (which will model the training data badly) and overly favouring the likelihood probability (which leads to overfitting). To use the MDL principle when learning grammar, we need to compute the prior and likelihood probabilities. One way to compute the prior is as follows.</Paragraph>
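    <Paragraph position="4a"> A minimal sketch of MDL model selection (ours, not the system described below; the candidate figures are hypothetical), choosing the hypothesis with the smallest total description length l(H) - log P(D | H) in bits:
      import math

      def total_description_length(prior_bits, data_log_prob):
          """l(H) plus -log2 P(D | H); data_log_prob is a natural-log likelihood."""
          return prior_bits - data_log_prob / math.log(2)

      # Hypothetical candidates: (l(H) in bits, ln P(D | H)).
      candidates = {'small_lexicon': (120.0, -410.0), 'large_lexicon': (540.0, -395.0)}
      costs = {n: total_description_length(b, ll) for n, (b, ll) in candidates.items()}
      print(min(costs, key=costs.get), costs)  # the smaller lexicon wins here
    </Paragraph>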
    <Paragraph position="5"> We give each category r in lexicon H an encoding probability P(r). If r was used f(r) times in the parse trees of the training set,</Paragraph>
    <Paragraph position="6"> P(r) = \frac{f(r)}{\sum_{x \in H} f(x)} </Paragraph>
    <Paragraph position="7"> That is, categories used frequently in the training set have a high probability, and categories used infrequently have a low probability.</Paragraph>
    <Paragraph position="8"> The intuition behind this particular coding scheme is to imagine that we are transmitting, in the shortest possible way, a set of parse trees across some channel. We conceptually use a two-part, dictionary-based coding scheme: one part for word-category pairs with their associated code words, and another part for an encoding of the trees in terms of the code words. Since the total length of the encoding of the trees will be much larger than the total length of the word-category pairs and associated code words, we can assume the dictionary length is just a constant, smaller than the total length of the encoded parse trees, and just consider, without an undue loss in accuracy, the cost of transmitting the trees. Hence, when we evaluate various lexica, we determine how much it costs to transmit the training material in terms of the particular dictionary-based encoding of the lexicon in question. Equation 5 is used to give the length, in bits, of the code word we would assign to each category in a parse tree.</Paragraph>
    <Paragraph position="9"> Our encoding scheme treats each category as being independent and clearly, we could have used more of the context within the parse trees to construct a more efficient encoding scheme (see, for example (Ristad and Thomas, 1995)). For the purposes of this paper, our simple encoding scheme is sufficient.</Paragraph>
    <Paragraph position="10"> The length of a lexicon is the sum of the lengths of all the categories used in the grammar:</Paragraph>
    <Paragraph position="11"> l(H) = \sum_{r \in H} -\log(P(r)) </Paragraph>
    <Paragraph position="12"> The likelihood probability, P(D | H), is defined as simply the product of the probabilities of the categories used to parse the corpus.</Paragraph>
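    <Paragraph position="12a"> The following sketch (our illustration; the frequencies are invented) computes the encoding probabilities, the lexicon length l(H), and the log likelihood of a sequence of categories used in the parses:
      import math

      def category_probs(freqs):
          """P(r) = f(r) divided by the summed frequency of all categories in the lexicon."""
          total = sum(freqs.values())
          return {r: f / total for r, f in freqs.items()}

      def lexicon_length_bits(freqs):
          """l(H): sum of the code-word lengths -log2 P(r) over the categories in the lexicon."""
          return sum(-math.log2(p) for p in category_probs(freqs).values())

      def corpus_log_likelihood(freqs, parsed_categories):
          """log2 P(D | H): the product of the probabilities of the categories used in the parses."""
          probs = category_probs(freqs)
          return sum(math.log2(probs[c]) for c in parsed_categories)

      freqs = {'prp': 10, '(vbp\\prp)/nns': 4, 'jj': 6, 'nns\\jj': 6}
      print(lexicon_length_bits(freqs))
      print(corpus_log_likelihood(freqs, ['prp', '(vbp\\prp)/nns', 'jj', 'nns\\jj']))
    </Paragraph>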
    <Paragraph position="13"> We approximate the probability of the data, P(D), using a linearly interpolated trigram model (Jelinek, 1990). Our trigram model is used to assign probabilities to substrings: substrings denoting phrases will be assigned higher probabilities than substrings that do not form natural phrases. It should be pointed out that most work in statistical language learning ignores P(D). However, the implementation reported in this paper is greedy, and tries to build parse trees for sentences incrementally. Hence, we need to determine if the substring dominated by a local tree forms a phrase (has a high P(D)), and is not some non-phrasal word grouping (has a low P(D)). Clearly, using trigrams as an approximation of P(D) may undermine the estimation process. In our more recent, non-greedy work, we can, and do, ignore P(D), and so do not resort to using the trigram model.</Paragraph>
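    <Paragraph position="13a"> A sketch of the kind of linearly interpolated trigram score we have in mind (the interpolation weights and back-off floors here are illustrative assumptions, not the values used in the reported experiments):
      def interpolated_trigram_prob(tags, unigram, bigram, trigram, lambdas=(0.1, 0.3, 0.6)):
          """Probability of a tag substring under a linearly interpolated trigram model.
          unigram, bigram and trigram are dictionaries of estimated probabilities."""
          l1, l2, l3 = lambdas
          prob = 1.0
          for i, t in enumerate(tags):
              p1 = unigram.get(t, 1e-6)
              p2 = bigram.get((tags[i - 1], t), 0.0) if i >= 1 else 0.0
              p3 = trigram.get((tags[i - 2], tags[i - 1], t), 0.0) if i >= 2 else 0.0
              prob *= l1 * p1 + l2 * p2 + l3 * p3
          return prob
    </Paragraph>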
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Implementation
</SectionTitle>
    <Paragraph position="0"> Having shown how MDL can be applied to the estimation of SCGs, we now turn to a description of an implemented system. We learn categorial grammars in a greedy, bottom-up, incremental manner.</Paragraph>
    <Paragraph position="1"> In summary: * For each part-of-speech tag sequence in some corpus, we create a labelled binary tree spanning that sequence.</Paragraph>
    <Paragraph position="2"> * We then read off from the tree those categories that would have generated that tree in the first place, placing them in the lexicon for subsequent usage.</Paragraph>
    <Paragraph position="3"> In more detail, to create a labelled binary tree, we firstly assign unary trees to each tag in the tag sequence. As far as the current implementation is concerned, the only element in a unary local tree is the tag. For example, assuming the following tagged sentence: We_prp love_vbp categorial_jj grammars_nns, we would generate the forest of local trees: (prp) (vbp) (jj) (nns). We ignore words and only work with the part-of-speech tags.</Paragraph>
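    <Paragraph position="3a"> A small sketch of this first step (the tuple-based tree representation is an assumption of the illustration, not a description of the original code):
      def initial_forest(tagged_sentence):
          """Strip the words and build one unary local tree per part-of-speech tag."""
          tags = [token.rsplit('_', 1)[1] for token in tagged_sentence.split()]
          return [(tag,) for tag in tags]

      print(initial_forest("We_prp love_vbp categorial_jj grammars_nns"))
      # [('prp',), ('vbp',), ('jj',), ('nns',)]
    </Paragraph>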
    <Paragraph position="4"> Next, we consider all pairwise ways of joining adjacent local trees together. For example, given the previous forest of local trees, we would consider joining the following local trees together:</Paragraph>
    <Paragraph position="5"> (prp) with (vbp), (vbp) with (jj), and (jj) with (nns). </Paragraph>
    <Paragraph position="6"> Each putative local tree is evaluated using Bayes' theorem: the prior is taken as being the probability assigned to an encoding of just the categories contained within the local tree (with respect to all the categories in the lexicon); the likelihood is taken as being the geometric mean of the probabilities of the categories contained within the local tree; the probability of the data is taken as being the probability assigned by the n-gram model to the tag sequence dominated by that local tree. The mother of a local tree is defined using a small table of what constitutes a mother given possible heads. Mothers are always either the left or right daughter, representing either left or right functional application.</Paragraph>
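    <Paragraph position="6a"> A sketch of how such an evaluation might be put together (our reconstruction of the components named above; the smoothing floor and the trigram_prob callable are assumptions):
      import math

      def score_local_tree(categories, lexicon_probs, tag_span, trigram_prob):
          """Combine the three components described above for one putative local tree."""
          # Prior: probability of an encoding of just these categories, i.e. two raised to
          # minus the summed code-word lengths.
          prior = 2 ** -sum(-math.log2(lexicon_probs.get(c, 1e-6)) for c in categories)
          # Likelihood: geometric mean of the category probabilities.
          likelihood = math.prod(lexicon_probs.get(c, 1e-6) for c in categories) ** (1 / len(categories))
          # Probability of the data: the trigram model's probability for the dominated tag span.
          p_data = trigram_prob(tag_span)
          return prior * likelihood / p_data  # Bayes' theorem: prior times likelihood over P(D)
    </Paragraph>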
    <Paragraph position="7"> After evaluating each putative local tree, the tree with the highest posterior probability is chosen.</Paragraph>
    <Paragraph position="8"> This tree replaces the two local trees from which it was created.</Paragraph>
    <Paragraph position="9"> Continuing our example, if we assume the putative local tree: (nns (jj) (nns)) has a higher posterior probability than the putative local tree: (vbp (vbp) (jj)), we would replace the local trees: (jj) (nns) with the local tree: (nns (jj) (nns)). The whole process of tree evaluation, selection and replacement is then repeated until a single tree remains. To read categories off a labelled local tree, the following recursive process is applied: * The category of the root of a tree is the category dominating that tree.</Paragraph>
    <Paragraph position="10"> * Given a local tree of the form (A (A B)), the category assigned to the daughter node labelled A is α/B, where α is the category assigned to the root of the tree. The category assigned to node B is B.</Paragraph>
    <Paragraph position="11"> * Given a local tree of the form (A (B A)), the category assigned to the daughter node labelled A is α\B, where α is the category assigned to the root of the tree. The category assigned to node B is B.</Paragraph>
    <Paragraph position="12"> Note other methods of reading categories off a tree might exist. We make no claim that this is necessarily the best method.</Paragraph>
    <Paragraph position="13"> So, if we assume the following tree: (vbp (prp) (vbp (vbp) (nns (jj) (nns)))), we would extract the following categories: prp for the tag prp, (vbp\prp)/nns for the tag vbp, jj for the tag jj, and nns\jj for the tag nns. With each category, we also keep a frequency count of the number of times that category was added to the lexicon. This frequency information is used to estimate the probabilities of the lexicon.</Paragraph>
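    <Paragraph position="13a"> A sketch of the read-off procedure applied to this example (our reconstruction; trees are nested tuples in which a leaf is (tag,) and an internal node is (mother, left, right)):
      def wrap(cat):
          """Parenthesise complex categories so that slashes group correctly."""
          return '(' + cat + ')' if ('/' in cat or '\\' in cat) else cat

      def read_off(tree, root_category=None):
          """Return (tag, category) pairs for the leaves of a labelled binary tree."""
          mother = tree[0]
          alpha = root_category if root_category is not None else mother
          if len(tree) == 1:                # a unary local tree: just the tag
              return [(mother, alpha)]
          left, right = tree[1], tree[2]
          if left[0] == mother:             # (A (A B)): the head is the left daughter
              b = right[0]
              return read_off(left, wrap(alpha) + '/' + b) + read_off(right, b)
          else:                             # (A (B A)): the head is the right daughter
              b = left[0]
              return read_off(left, b) + read_off(right, wrap(alpha) + '\\' + b)

      tree = ('vbp', ('prp',), ('vbp', ('vbp',), ('nns', ('jj',), ('nns',))))
      print(read_off(tree))
      # [('prp', 'prp'), ('vbp', '(vbp\\prp)/nns'), ('jj', 'jj'), ('nns', 'nns\\jj')]
    </Paragraph>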
    <Paragraph position="14"> Finally, when learning, we ignore sentences shorter than three words (these are likely to be ungrammatical fragments), or, for computational reasons, sentences longer than 50 words.</Paragraph>
  </Section>
</Paper>