<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0740">
  <Title>Incorporating Linguistics Constraints into Inductive Logic Programming</Title>
  <Section position="5" start_page="184" end_page="186" type="metho">
    <SectionTitle>
3 Using linguistic constraints
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="184" end_page="185" type="sub_section">
      <SectionTitle>
3.1 Simple filter constraints
</SectionTitle>
      <Paragraph position="0"> The user never sees naive rules; most are filtered out as linguistically implausible and those that survive have generally become specialised.</Paragraph>
      <Paragraph position="1"> Our basic motto is: constrain early, constrain tightly. The aim is that no linguistically implausible rule is ever added to the set of candidate rules. This allows an incremental approach to implementing the constraints. On observing a linguistically implausible rule in the candidate  set, we have to specify what makes it implausible and then express this as a constraint in Prolog. In this way, we build up a set of filters which get rid of linguistically implausible naive rules as soon as they are produced.</Paragraph>
      <Paragraph position="2"> Table 1 lists the constraints currently used.</Paragraph>
      <Paragraph position="3"> The Head features and Gap threading constraints are discussed later. RHS length simply limits the number of constituents on the RHS of a rule to some small user-defined integer (in the experiments described here it was equal to 4). LHS # RHS filters out rules with a single daughter which is the same category as the mother. Head OK filters out rules', where the LHS has a head category which is not found on the RHS. The last three constraints in Table 1 act on the LHS of potential rules (i.e. needs), filtering out, respectively, sigma categories, categories which do not appear as the LHS of existing rules (and so are probably lexical) and s (sentence) categories.</Paragraph>
    </Section>
    <Section position="2" start_page="185" end_page="186" type="sub_section">
      <SectionTitle>
3.2 Gap threading and head feature
constraints
</SectionTitle>
      <Paragraph position="0"> Gap-threading is a technique originating with Pereira's 'extraposition grammars' (Pereira, 1981). It is an implementation technique commonly used for dealing with movement phenomena in syntax, as illustrated by a Wh-question like What does Smith own _?, where the Wh-word is logically associated with the gap marked '2.</Paragraph>
      <Paragraph position="1"> There are three components to this type of analysis. Firstly, one rule must introduce the 'moved' constituent. This rule also sets up an expectation for a gap of the same type as the moved constituent elsewhere in the sentence.</Paragraph>
      <Paragraph position="2"> This expectation is coded as a set of features, or in our case, a single tuple-valued feature with 'GapIn' and 'GapOut' values. By setting the value of the 'GapIn' feature to be that of (a copy of) the moved constituent, and GapOut to be some null marker (here, ng= nogap) we can enforce that expectation. Secondly, rules which do not involve gaps directly pass the value of the GapIn and GapOut values along their daughters (this is the 'threading' part) making sure that the gap value is threaded everywhere that a gap is permitted to occur linguistically. Thirdly, there are rules which rewrite the type of constituent which can be moved as the empty string, discharging the 'gap' expectation. Example rules of all three types are as follows:</Paragraph>
      <Paragraph position="4"> an S which must contain an associated NP gap agreeing in number etc. Rule (ii) passes the gap feature from the mother VP along the daughters that can in principle contain a gap. Rule (iii) rewrites an NP whose gap value indicates that a moved element precedes it as the empty string. Rules of these three types conspire to ensure that a moved constituent is associated with exactly one gap.</Paragraph>
      <Paragraph position="5"> Constituents which cannot contain a gap associated with a moved element outside the constituent identify the In and Out values of the gap feature, and so a usual NP rule might be of the form: np: \[gap(G,G)\] -&gt; det: \[...\] n: \[...\] In a sentence containing no gaps the value of In and Out will be ng everywhere.</Paragraph>
      <Paragraph position="6"> Naive rules will not necessarily fall into one of the three categories above, because the categories that make up their components will have been instantiated in various possibly incomplete ways. Thus in Fig 3 the gaps values in the mother are (ng,ng), and those in the daughters are separately threaded (A,A) and (B,B).  We apply various checks and filters to candidate rules to ensure that the logic of the gap feature instantiations is consistent with the linguistic principles embodied in the gap threading analysis.</Paragraph>
      <Paragraph position="7"> The gap threading logic is tested as follows.</Paragraph>
      <Paragraph position="8"> Firstly, rules are checked to see whether they match the general pattern of the three types above, gap-introduction, gap-threading, or gapdischarge rules. Secondly, in each of the three cases, the values of the gap features are checked to ensure they match the relevant schematic examples above.</Paragraph>
      <Paragraph position="9"> The most frequently postulated type of rule is a gap threading rule. The rule in Fig 3 has the general shape of such a rule but the feature values do not thread in the appropriate way and so it will be in effect unified with a template that makes this happen. The effect here will actually be to instantiate all In and Out values to ng, thus specialising the rule. Hypothesised rules where the values are all variables will get the In and Out values unified analogously to the example threading rule (ii) above. Hypothesised rules where the gap values are not variables are checked to see that they are subsumed by the appropriate schema: thus all the different threading patterns in Fig 4 would be substitution instances of the pattern imposed by the example threading rule (ii). At the later generalisation stage the correct variable threading regime should be the only one consistent with all the observed instantiation patterns.</Paragraph>
      <Paragraph position="10"> Y. \[ng/ng,ng/ng, ng/ng\] . 1 \[all, big, companies, wrote, a, report, quickly\]. %\[np/ng,np/ng,ng/ng\]. 2 \[what,dont,all,big,companies,read, with,a,machine\].</Paragraph>
      <Paragraph position="11"> ~\[np/ng,np/np,np/ng\]. 3 \[what,dont,all,big,companies,read, a,report,with\].</Paragraph>
      <Paragraph position="12"> ~\[np/np,np/np,np/np\]. 4 \[what,dont,all,big,companies,read, a,report,quickly,from\].</Paragraph>
      <Paragraph position="13">  patterns of gap threading Our constraints on head feature agreement are similar to the gap threading constraints.</Paragraph>
      <Paragraph position="14"> The specialised version of the naive rule in Fig 3 is displayed in Fig 5. Note that although the rule in Fig 5 is not incorrect, it is overly specific, applying only to mor=pl, aux=n where there is no gap to thread. We now consider how to generalise rules.</Paragraph>
      <Paragraph position="15"> vp : \[gaps= \[ng: \[\] , ng: \[\] \] ,mor=pl, aux=n\] ==&gt; \[vp: \[gaps= \[ng: \[\] ,ng: \[\] \] ,mor=pl, aux=n\], mod: \[gaps= \[ng: \[\] ,ng : \[\] \] , of=vp, type=n\] \] Figure 5: VP ~ VP MOD rule specialised to meet head and gap constraints</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="186" end_page="187" type="metho">
    <SectionTitle>
4 Generalisation operators
</SectionTitle>
    <Paragraph position="0"> In this section, we show how to generate grammar rules by generalising overly specific rules using the VP -+ VP MOD running example.</Paragraph>
    <Paragraph position="1"> Our target is to generate the missing grammar rule displayed in Fig 1. We will use the artificial dataset given in Fig 4 which displays 4 different patterns of gap threading. From the first three sentences we generate the expected overly specific grammar rules which correspond to the three patterns of gap threading. These axe given, in abbreviated form, in Fig 6. We use least general generalisation (lgg)</Paragraph>
    <Paragraph position="3"/>
    <Paragraph position="5"> as our basic generalisation operator. This is implemented (for terms) in the Sicstus terms library built-in term_subsumer/3. Lgg operates on the compiled form of the rules (such as the cmp_synrule/3 unit clause displayed in Fig 5),  not the human-readable form as in Fig 6. The lgg of the first two rules produces the following rule (translated back into human-readable form): vp: \[gaps= \[_282, ng: \[\] \], mor=or (inf, p1), aux=n\]</Paragraph>
    <Paragraph position="7"> rood: \[gaps= \[ng: \[\] ,ng: \[\] \], of=or (nora, vp), type=n\] \] The lgg of this rule with the third is:</Paragraph>
    <Paragraph position="9"> This rule covers the first three sentences but is not general enough to cope with the situation where the gap is not discharged on the mother VP--a pattern present in the fourth sentence.</Paragraph>
    <Paragraph position="10"> Unfortunately, the fourth sentence needs to use the missing rule twice to get a parse, and it is a fundamental limitation of our approach that a missing rule can only be recovered from a failed parse if it is required only once. Note that to induce a rule we only need one sentence where the rule is needed oncc our assumption is that in real (large) training datasets there will be enough sentences for this to be true for any missing grammar rule.</Paragraph>
    <Paragraph position="11"> Although this assumption seems reasonable, we have decided to experiment with a generalisation operator, which is helpful when the assumption does not hold true. A rule with a context-free skeleton of VP -+ VP MOD MOD is generated from the fourth sentence. This corresponds to the two applications of the target VP --+ VP MOD rule. The rule we have, can be derived by having the target rule resolve on itself. It follows that we can inductively generate the target rule from VP ---+ VP MOD MOD by implementing a special inverse resolution operator which produces the most specific clause C2 from a clause C1, when C1 can be produced by C2 resolving with itself. Applying this operator to the VP ~ VP MOD MOD rule renders: vp : \[gaps= \[np, _342\] ,mor=inf, aux=n\] ==&gt; \[vp: \[gaps= \[np ,np\] ,mor=inf, aux=n\] , mod: \[gaps= \[np, _342\], of=or (nom, vp), type=n\] \] 'Lggifying' this latest rule with the lgg of the 3 other rules finally generates a grammar rule with the correct gap threading, which we display in Fig 7 as it appears to the user (with a few added line breaks). However, this rule is not general enough simply because our training data is not general enough. Adding in the sentences All big companies will write a report quickly, All big companies have written a report quickly and All big companies wrote a report incredibly generates a more general version covering these various cases. However, there is still a problem because our induced rule allows the modifier to be modifying either a nom or a vp (represented by the term f (0,_280,_280,1) in the compiled form), where the correct rule allows the modifier to modify an s or a vp (represented by the term f (0,0,._280,1) in the compiled form). This is because our constraints still need to be improved.</Paragraph>
    <Paragraph position="13"> rood: \[gaps= \[_366, _368\], of=or (nora, vp), type=n\] \]</Paragraph>
  </Section>
  <Section position="7" start_page="187" end_page="188" type="metho">
    <SectionTitle>
5 Two experiments
</SectionTitle>
    <Paragraph position="0"> Our experiments consist of (i) randomly generating 50 sentences from a grammar, (ii) deleting some grammar rules and (iii) seeing whether we can recover the missing grammar rules using the 50 sentences. Our approach is interactive with the user making the final decision on which hypothesised rules to add to the grammar. Hypothesised rules are currently ordered by coverage and presented to the user in that order. In  our artificial experiments the earlier the missing rule is presented to the user the more successful the experiment.</Paragraph>
    <Paragraph position="1"> In the first experiment we deleted the VP --~</Paragraph>
    <Paragraph position="3"> After generalisation of naive rules, the rule with the largest cover was np: \[gaps= \[ng: \[\] ,ng: \[\]\] ,mor=or(pl,s3), type=_414, case=_415\] ==&gt; \[det : \[type=or (n, q) ,mor=_405\], nom : \[mor=or (pl, s3) \] \] which is over-general since the morphology feature of the determiner is not constrained to equal that of the mother. However, the third most general rule covered 24 sentences and was: np : \[gaps= \[ng : \[\] ,ng : \[\] \] ,mot=or (pl, s3), type=n, case=_442\] ==&gt; \[det : \[type=n, mot=or (pl, s3) \] , nora: \[mor=or (pl, s3) \] \] which does have agreement on morphology.</Paragraph>
    <Paragraph position="4"> Committing to this latter rule by asserting it as a grammar rule, removing newly parsable sentences and re-generating rules produced a vp ==&gt; \[vp,mod\] rules which was more general in terms of morphology than the one in Fig 7, but less general in terms of gap threading. This just reflects the sensitivity of our learning strategy on the particular types of sentences in the training data.</Paragraph>
    <Paragraph position="5"> In a second experiment, we deleted the rules: nom_nom_mod syn nom: \[mor=A\]==&gt; \[nom: \[mot=A\] , rood: \[gaps= \[ng: \[1 ,ng : \[\] \] , of=nora,</Paragraph>
    <Paragraph position="7"> vp:\[gaps=A,mor=B,aux=_\]\].</Paragraph>
    <Paragraph position="8"> Our algorithm failed to recover the s_aux np_vp rule but did find close approximations to the other two rules:</Paragraph>
    <Paragraph position="10"> type=n, case=_407\] \] nora: \[mor=or (pl, s3)\] ==&gt; \[nom: \[mor=or (pl, s3) \] , rood: \[gaps = \[_339, _339\], of=or (nom,vp),type=or (n,q) \] \]</Paragraph>
  </Section>
  <Section position="8" start_page="188" end_page="190" type="metho">
    <SectionTitle>
6 Related work
</SectionTitle>
    <Paragraph position="0"> The strong connections between proving and parsing axe well known (Shieber et al., 1995), so it is no surprise that we find related methods in both ILP and computational linguistics. In ILP the notion of inducing clauses to fix a failed proof, which is the topic of Section 2, is very old dating from the seminal work of Shapiro (1983).</Paragraph>
    <Paragraph position="1"> In NLP, Mellish (1989) presents a method for repairing failed parses in a relatively efficient way based on the fact that, after a failed parse, the information in the chart is sufficient for us to be able to determine what constituents would have allowed the parse to go through if they had been found.</Paragraph>
    <Section position="1" start_page="188" end_page="189" type="sub_section">
      <SectionTitle>
6.1 Related work in ILP
</SectionTitle>
      <Paragraph position="0"> The use of abduction to repair proofs/paxses has been extensively researched in ILP as has the importance of abduction for multiple predicate learning. De Raedt (1992), for example, notes that &amp;quot;Roughly speaking, combining abduction with single predicate-leaxning leads to multiple concept-leaxning&amp;quot;. This paper, where abduction is used to learn, say, verb phrases and noun phrases from examples of sentences is an example of this. Recent work in this vein includes (Muggleton and Bryant, 2000) and the papers in (Flach and Kakas, 2000).</Paragraph>
      <Paragraph position="1"> Amongst this work a particularly relevant paper for us is (Wirth, 1988). Wirth's Learning  by Failure to Prove (LFP) approach finds missing clauses by constructing partial proof trees (PPTs) and hence diagnosing the source of incompleteness. A clause representing the PPT is constructed (called the resolvent of the PPT) as is an approximation to the resolvent of the complete proof tree. Inverse resolution is then applied to these two clauses to derive the missing clause. Wirth explains his method by way of a small context-free DCG completion problem.</Paragraph>
      <Paragraph position="2"> Our approach is similar to Wirth's in the dependence on abduction to locate the source of proof (i.e. parse) failure. Also both methods use a meta interpreter to construct partial proofs. In our case the meta-interpreter is the chart parser augmented with the generation of needs and the partial proof is represented by the chart augmented with the needs. In Wirth's work the resolvent of the PPT represents the partial proof and a more general purpose meta-interpreter is used. (We conjecture that our tabular representation has a better chance of scaling up for real applications.) Thirdly, both methods are interactive. Translating his approach to the language of this paper, Wirth asks the user to verify that proposed needed atoms (our needed edges) are truly needed. The user also has to evaluate the final hypothesised rules. We prefer to have the user only perform the latter task, but the advantage of Wirth's approach is that the user can constrain the search at an earlier stage. Wirth defends an interactive approach on the grounds that &amp;quot;A system that learn\[s\] concepts or rules from looking at the world is useless as long as the results are not verified because a user who feels responsible for his knowledge base rarely use these concepts or rules&amp;quot;.</Paragraph>
      <Paragraph position="3"> In contrast to (Cussens and Puhnan, 2000) we now search bottom-up for our rules. This is because the rules we are searching for are near the bottom of the search space, and also because bottom-up searching effects a more constrained, example-driven search. Bottom-up search has been used extensively in ILP. For example, the GOLEM algorithm (Muggleton and Feng, 1990) used relative least general generalisation (rlgg). However, bottom-up search is rare in modern ILP implementations. This is primarily because the clauses produced can be unmanageably large, particularly when generalisation is performed relative to background knowledge, as with rlgg. Having grammar rules encoded as unit clauses alleviates this problem as does our decision to use lgg rather than rlgg.</Paragraph>
      <Paragraph position="4"> Zelle and Mooney (1996) provides a bridge between ILP and NLP inductive methods.</Paragraph>
      <Paragraph position="5"> Their CHILL algorithm is a specialised ILP system that learns control rules for a shift-reduce parser. The connection with the approach presented here (and that of Wirth) is that intermediate stages of a proof/parse are represented and then examined to find appropriate rules. In CHILL these intermediate stages are states of a shift-reduce parser.</Paragraph>
    </Section>
    <Section position="2" start_page="189" end_page="190" type="sub_section">
      <SectionTitle>
6.2 Related work in NLP
</SectionTitle>
      <Paragraph position="0"> Most work on grammar induction has taken place using formalisms in which categories are atomic: context-free grammars, categorial grammars, etc. Few attempts have been made at rule induction using a rich unification formalism. Two lines of work that are exceptions to this, and thus comparable to our own, are that of Osborne and colleagues; and the work of the SICS group using SRI's Core Language Engine and similar systems.</Paragraph>
      <Paragraph position="1"> Osborne (1999) argues (correctly) that the hypothesis space of grammars is sufficiently large that some form of bias is required. The current paper is concerned with methods for effecting what is known as declarative bias in the machine learning literature, i.e. hard constraints that reduce the size of the hypothesis space. Osborne, on the other hand, uses the Minimum Description Length (MDL) principle to effect a preferential (soft) bias towards smaller grammars. His approach is incremental and the induction of new rules is triggered by an unparsable sentence as follows: 1. Candidate rules are generated where the daughters are edges in the chart after the failed parse, and the mother is one of these daughters, possibly with its bar level raised.</Paragraph>
      <Paragraph position="2"> 2. The sentence is parsed and for each successful parse, the set of candidate rules used in that parse constitutes a model.</Paragraph>
      <Paragraph position="3"> 3. The 'best' model is found using a Minimum Description Length approach and is added to the existing grammar.</Paragraph>
      <Paragraph position="4">  So Osborne, like us, uses the edges in the chart after a failed parse to form the daughters of hypothesised rules. The mothers, though, are not found by abduction as in our case, also there is no subsequent generalisation step.</Paragraph>
      <Paragraph position="5"> Unlike us Osborne induces a probabilistic grammar. When candidate rules are added, probabilities are renormalised and the n most likely parses are found. If annotated data is being used, models that produce parses inconsistent with this data are rejected. In (Osborne, 1999), the DCG is mapped to a SCFG to compute probabilities, in very recent work a stochastic attribute-value grammar is used (Osborne, 2000). Giving the increasing sophistication of probabilistic linguistic models (for example, Collins (1997) has a statistical approach to learning gap-threading rules) a probabilistic extension of our work is attractive--it will be interesting to see how far an integration of 'logical' and statistical can go.</Paragraph>
      <Paragraph position="6"> Thalmann and Samuelsson (1995) describe a scheme which combines robust parsing and rule induction for unification grammars. They use an LR parser, whose states and actions are augmented so as to try to recover from situations that in a standard LR parser would result in an error. The usual actions of shift, reduce, and accept are augmented by hypothesised shift: shift a new item on to the stack even if no such action is specified in that state hypothesised unary reduce: reduce a symbol Y as if there was a rule X -~ Y, where the value of X is not yet determined.</Paragraph>
      <Paragraph position="7"> hypothesised binary reduce: reduce a symbols Y Z as if there was a rule X ~ Y Z, where the value of X is not yet determined.</Paragraph>
      <Paragraph position="8"> The value of the X symbol is determined by the next possibilities for reduction.</Paragraph>
      <Paragraph position="9"> To illustrate, consider the grammar</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="190" end_page="190" type="metho">
    <SectionTitle>
1 S → NP VP
2 NP → Name
3 VP → Vi
</SectionTitle>
    <Paragraph position="0"> and a sentence 'John snores loudly'.</Paragraph>
    <Paragraph position="1"> Assume that all the words are known (though this is not necessary for their method). The sequence of events will be:</Paragraph>
  </Section>
  <Section position="10" start_page="190" end_page="191" type="metho">
    <SectionTitle>
S[NP [VP VP Adv]]
</SectionTitle>
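    <Paragraph> Treating John as a Name, snores as a Vi and loudly as an Adv, a plausible rendering of the sequence (a sketch consistent with the stage numbers discussed below) is: (1) shift John; (2) reduce with rule 2, giving NP; (3) shift snores; (4) reduce with rule 3, giving VP; (5) hypothesised shift of loudly; (6) hypothesised binary reduce of VP Adv to X; (7) reduce with rule 1, instantiating X to VP. The resulting analysis is S[NP [VP VP Adv]].</Paragraph>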
    <Paragraph position="0"> After stage 4 we could reduce with 1 but this would not lead to an accepting state. Instead we perform a hypothesised shift at stage 5 followed by a hypothesised binary reduce with X VP Adv in stage 6. Next we reduce with rule 1 which instantiates X to VP and we have a complete parse provided we hypothesise the rule VP ~ VP Adv.</Paragraph>
    <Paragraph position="1"> Two more hypothesised actions are used to account for gap threading: hypothesised move: put the current symbol on a separate movement stack (i.e. hypothesise that this constituent has been fronted) hypothesised fill: move the top of the movement stack to to top of the main stack These actions have costs associated with them and a control regime so that the 'cheapest' analysis will always be preferred. An analysis which uses none of the new actions will be cost-free. Unary reduction is more expensive than binary reduction because the consequent unary rules may lead to cycles, and such rules are often redundant.</Paragraph>
    <Paragraph position="2"> These actions hypothesise only the context free backbone of the rules. Feature principles analogous to those we described above are used, along with hand editing, to get the final form of the hypothesised rule. Presumably the information hypothesised by the move and fill operations as to be translated somehow into the gap threading notation which is also used by their formalism. No details are given of the results of this system, nor any empirical evaluation.</Paragraph>
    <Paragraph position="3"> This work shares many of the goals of the approach we describe, in particular the use of explicit encoding of background knowledge of feature principles. The main difference is that the technique they describe only hypothesises the context free backbone of the necessary rules, whereas in our approach the feature structures are also hypothesised simultaneously.</Paragraph>
    <Paragraph position="4"> Asker et al. (1992) also describe a method for inducing new lexical entries when extending  coverage of a unification grammar to a new domain, a task which is also related to our work in that they are using a full unification formalism and using partial analyses to constrain hypotheses. Firstly, they use 'explanation based generalisation' to learn a set of sentence templates for those sentences in the new corpus that can be successfully analysed. This process essentially takes commonly occurring trees and 'flattens' them, abstracting over the content words in them. Secondly they use these templates to analyse those sentences from the new corpus which contain unknown words, treatimg the entries implied by the templates for these words as provisionally correct. Finally these inferred entries are checked against a set of hand-coded 'paradigm' entries, and when all the entries corresponding to a paradigm have been found, a new canonical lexical entry for this word is created from the paradigm.</Paragraph>
    <Paragraph position="5"> Again, no results are evaluation are given, but it is clear that this method is likely to yield similar results to our own for inference of lexical entries.</Paragraph>
  </Section>
class="xml-element"></Paper>