<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0708">
  <Title>MDL-based DCG Induction for NP Identification</Title>
  <Section position="4" start_page="62" end_page="64" type="metho">
    <SectionTitle>
3 Estimation Details
</SectionTitle>
    <Paragraph position="0"> Estimation of a model, given a training set of'sentences and an associated set of manually constructed parses, consists of four steps: probabilistic modelling of DCGs, model construction, search (constrained by parsed corpora) , and model estimation. We now explain these steps in turn.</Paragraph>
    <Paragraph position="1"> Probabilistic Modelling of DCGs DCGs in our approach are modelled in terms of a compression-based prior probability and a SCFG-based likelihood probability. The prior assigns high probability to compact models, and low probabilities to verbose, idiomatic models. As such, it favours simple grammars over more complex possibilities. The likelihood probability describes how well we can encode the training set in terms of the model. We now turn to the specifcation of likelihood and prior probabilities in our system.</Paragraph>
    <Paragraph position="2"> Likelihood Probability To specify a likelihood probability for DCGs, we have opted to use a SCFG, which consists of a set of context free grammar rules along with an associated set of parameters \[6\]. Each parameter models the way we might expand non-terminals in a top-down derivation process, and within a SCFG, we associate one such parameter with each distinct context free rule. However, DCG rules are feature-based, and so not directly equivalent to simple context free rules. In order to define a SCFG over DCG rules, we need to interpret them in a context-free manner. One way to achieve this is as follows. For each category in the grammar that is distinct in terms of features, invent an atomic non-terminal symbol. With these atomic symbols, create a SCFG by mapping each category in a DCG rule to an atomic symbol, yielding a context free (backbone) grammar, and with this grammar, specify a SCFG, /Vii. Naturally, this is not the most accurate probabilistic model for feature-based grammars, but for the interim, is sufficient (see Abney for a good discussion of how one might define a more accurate probabilistic model for feature-based grammars \[1\]).</Paragraph>
    <Paragraph position="3"> SCFGs are standardly defined as follows. Let P(A -~ (~ \] A) be the probability of expanding (backbone) non-terminal symbol A with the (backbone) rule A --+ o when deriving some sentence si. The probability of the jth derivation of si is defined as the product of the probabilities of all backbone rules used in that derivation. That is, if derivation j followed from an application of the rules A~ -~ c~ .... , A~ ~ a~,</Paragraph>
    <Paragraph position="5"> The probability of a seiatence is then defined as the sum of the probabilities of all n ways we can derive it:</Paragraph>
    <Paragraph position="7"> Having modelled DCGs as SCFGs, we can immediately specify the likelilmod probability of Mi generating a sample of sentences so... sn, as:</Paragraph>
    <Paragraph position="9"> This treats each sentence as being independently generated from each other sentence.</Paragraph>
    <Paragraph position="10"> Prior Probability Specifying a prior for DCGs amounts to encoding the rules and the associated parameters. We encode DCG rules in terms of an integer giving the length, in categories, of the rule (requiring log* (n) bits, where log* is Rissannen's encoding scheme for integers), and a list of that many encoded categories. Each category consists of a list of features, drawn from a finite set of features, and to each feature there is a value. In general, each feature will have a separate set of possible values. Within manually written DCGs, the way a feature is assigned a value is sensitive to the position, in a rule, of the category containing the feature in question. Hence, if we number the categories of a rule. we can work out the probability that a particular feature, in a given category, will take a certain value.</Paragraph>
    <Paragraph position="11"> Let P(v I fO be the probability that feature f takes the value v, in category i of all rules in the grammar.</Paragraph>
    <Paragraph position="12"> Each value can now be encoded with a prefix code of -log(P(v Ifi) bits in length. Encoding a category simply amounts to a (fixed length) sequence of such encoded features, assuming some canonical ordering upon features. Note we do not learn lexical entries and so not not need to encode them.</Paragraph>
    <Paragraph position="13"> To encode the model parameters, we simply use Rissannen's prefix coding scheme for integers to encode a rule's frequency. We do not directly encode probabilities. since these will be inaccurate when the frequency, used to estimate that probability, is low. Rissanuen's scheme has the property that small integers are assigned shorter codes than longer integers. In our context, this will favour low frequencies over highcr ones, which is undesirable, given the fact that we want, for estimation accuracy, to favour higher frequencies. Hence, instead of encoding an integer i in log*(/) bits (as, for example, Keller and Lutz roughly do \[18\]), we encode it in log*(Z - i) bits, where Z is a mlmber larger than any frequency. This will mean that higher frequencies are assigned shorter code words, as intended.</Paragraph>
    <Paragraph position="14"> The prior probability of a model )IL, containing a DCG G and an associated parameter set is:</Paragraph>
    <Paragraph position="16"> is the description length of the paranaeters. C is a constant ensuring that the prior sums to one; F is the set of features used to describe categories: \] r \] is the length of a DCG rule r seen f(r) times.</Paragraph>
    <Paragraph position="17"> Apart from being a prior over DCG rules, our scheme has the pleasing property that it assigns longer code words to rules containing categories in unlikely positions than to rules containing categories in expected positions. For example, our scheme would assign a longer list of code words to the categories expressing a rule such as Det -~ Det NP than to the list of categories expressing a rule such as NP --+ Det NP. Also, our coding scheme favours shorter rules than longer rules, which is desirable, given the fact that, generally speaking. rules in natural language grammars tend to be short.</Paragraph>
    <Paragraph position="18">  ! Posterior Probability In summary, the probability of a model, given a set of training examples is: P(M, I So... s,) = \[2- U#{M' )+'P{ M' D +C\].H~.=O PS(s~IM.)</Paragraph>
    <Paragraph position="20"/>
    <Section position="1" start_page="63" end_page="64" type="sub_section">
      <SectionTitle>
Model Construction
</SectionTitle>
      <Paragraph position="0"> For lack of space, we only sketch our model construction strategy. In brief, our approach is (largely) monotonic, in that we extend an old model with new rules constructed from pieces of manually written rules, such that the new and old rules result in a parse for the current sentence (and the old rules alone fail to parse the current sentence). In more detail, whenever we fail to parse a sentence with the manually written DCG, we use an optimised chart parser \[10\] to construct all local trees licensed by the manually written grammar. We next consider ways adjacent local trees licensed by manually written DCG rules may be combined into larger local trees (ie invent rules whose right hand side consists of categories that spell out the mother categories of these adjacent local trees; the left hand side will be one of these right-hand side categories, with the possibility of having its bar level raised). The parser packs all local trees in space polynomial with respect to sentence length. If, within self-imposed bounds upon computation (for example, limiting the number of local trees joined together), we succeed in constructing at least one tree that spans the entire sentence, we can build a new model by extracting all new rules seen in that tree and adding them to the old model.</Paragraph>
      <Paragraph position="1"> Note that periodically, it is useful to prune (and renormalise) the model of rules seen only once in the previously encountered sequence of sentences. Such pruned rules are likely to have arisen either due to marked constructions, noise in the training material, or rules that appeared to be promising, but did not receive any subsequent support and as such have little predictive utilit.v. Pruning the model is a non-monotonic operation and hard to formally justify, but nevertheless useful.</Paragraph>
      <Paragraph position="2"> Whitten, Cleary and Bell also comment upon the usefulness of resetting the model \[3\].</Paragraph>
      <Paragraph position="3"> Our model construction approach has the following properties: rules constructed all encode a version of X-Syntax, which weakly constrains the space of possible rules \[8, 21\]; analyses produced using manually written rules are favoured over those produced using learnt rules (by virtue of computation being resourcebounded): this mirrors the empirical fact that when extending manually written grammars, only a few. rules are necessary, and those required generally 'join' together local trees generated by manually written rules.</Paragraph>
      <Paragraph position="4">  Search Model construction may produce an exponential number of parses for a sentence and for computational reasons, we are unable to evaluate all the models encoded within these parses. We therefore use a probabilistic unpacking strategy that efficiently selects the n most likely parses, where n is much less thml the total number of parses possible for some sentence \[11\]. There is insufficient space here to describe how we rank parses, but the underlying parse selection model is based upon the SCFG used to evaluate models. Currently. it is not lexicalised, so parse selection performance is subject to the well-known limitations of non-lexicalised SCFGs \[5\].</Paragraph>
      <Paragraph position="5"> Whilst estimating models, we simultaneously estimate the parse selection model in terms of tile parse used to produce the model picked. Probabilistic unpacking is crucial to our approach, and it is this that makes our learner computationally feasible.</Paragraph>
      <Paragraph position="6"> After extracting n parses, we can then go on to construct k I models, evaluate their posterior probabilities, and then select the model that maximises this term.</Paragraph>
      <Paragraph position="7"> However, as was shown by P-t-S, when training material consists of just (limited quantities) of raw text. classical, single model parameter estimation often results in a model that produces worse parse selection results than when the estimation process is constrained to only consider derivations compatible with the parsecl corpora.</Paragraph>
      <Paragraph position="8"> In our context, we can use parsed corpora to constrain both parameter estimation and also model selection.</Paragraph>
      <Paragraph position="9"> We simply re-rank the n parses produced during model construction using a tree similarity metric that compares how 'close' an automatically constructed parse is to a manually written parse, and take the q parses that all minimise the metric and are all scored equally well \[17\]. From these q parses we can then build models as usual. When q = 1, there is no need to rely upon MDL-based model selection. Otherwise. when q is greater than one, we have a set of parses, all equally consistent with the manually created tree, and so fall-back upon the usual model selection strate~.v. Our use of parsed corpora differs from P+S's in that we use it as a soft constraint: we may still keep parses even if they violate constraints in the manually constructed parse tree. The reason for this decision is that we do not construct all possible parses for a sentence., and so at times may not produce a parse consistent with a manually created parse. Also, it is not clear whether parsed corpora is sufficiently reliable for it to be trusted absolutely. Clearly there will be a link between the amount of information present in the parsed corpora mid the quality of the estimated model. In the experimental ~k may be less than or equal to n. depending upon which independence assumptions are made by&amp;quot; the model section of this paper, we consider this issue.</Paragraph>
      <Paragraph position="10"> Estimation When computing a model's posterior probability, we estimate the description length of features, P(v \] f,), the model parameters, P(A -+ a I A) and the likelihood probability, P(so... sn \]Mi)-The feature description length is estimated by counting the number of times a given feature takes some value in some category position, and applying a maximal likelihood estimator to arrive at a probability. The model parameters are estimated by counting the number of times a given backbone rule was seen in the previous n parses just produced, and then again using a maximal likelihood estimator to produce a probability. We estimate, as we cannot afford to globally recompute, the likelihood probability using the following approximations. Only a fixed number of n previously seen sentences that cannot be parsed using the manually written rules are considered in the likelihood computation. We assume that the parses of these sentences remains constant across alternative models, but the derivation probabilities might vary. We also assume that the string probability of each sentence is reasonably well approximated by a single parse.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="64" end_page="64" type="metho">
    <SectionTitle>
4 The Grammar
</SectionTitle>
    <Paragraph position="0"> The grammar we extend with learning, (called the Tag Sequence Grammar \[7\], or TSG for short) was developed with regard to coverage, and when compiled consists of 455 object rules. It does not parse sequences of words directly, but instead assigns derivations to sequences of part-of-speech tags (using the CLAWS2 tagset \[4\]).</Paragraph>
    <Paragraph position="1"> The grammar is relatively shallow, (for example, it does not fully analyse unbounded dependencies) but it does make an attempt to deal with common constructions, such as dates or names, commonly found in corpora, but of little theoretical interest. Furthermore, it integrates into the syntax a text grammar, grouping utterances into units that reduce the overall ambiguity.</Paragraph>
    <Paragraph position="2"> For the experiments reported here, we manually extended TSG with four extra rules. These extra rules dealt with obvious oversights when parsing the WSJ.</Paragraph>
  </Section>
  <Section position="6" start_page="64" end_page="65" type="metho">
    <SectionTitle>
5 Related Work
</SectionTitle>
    <Paragraph position="0"> Our approach is closely related to Stolcke's model merging work \[29\]. Apart from differences in prior and likelihood computation, the main divergence is that our work is motivated by the need to deal with undergeneration in broad-coverage, manually written natural language grammars (for example \[15\]). Although we do not go into the issues here, estimation of rules missing from such g-rammars is different from estimating grammars  ab initio. This is because rules missing from any realistic grammar are all likely to have a low frequency in any given corpus, and so will be harder to differentiate from competing, incorrect rules purely on the basis of statistical properties alone. We know of no other work reporting automated extension of broad-coverage grammars using MDL and parsed corpora.</Paragraph>
    <Paragraph position="1"> One of the anonymous reviewers wanted to know how our work related to Explanation-Based Learning (EBL) \[20\]. EBL is not concerned with induction of rules: it deals with finding more efficient ways to use existing rules. For example, in NLP, EBL has been used to reduce the time taken to parse sentences with large grammars (\[28\]). EBL does not extend the coverage of any given grammar, unlike our approach. In our opinion, it would be better to view our learner as all Inductive Logic Programming system specialised for DCG induction. null</Paragraph>
  </Section>
class="xml-element"></Paper>