<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0705">
  <Title>The μ-TBL System: Logic Programming Tools for Transformation-Based Learning</Title>
  <Section position="2" start_page="0" end_page="40" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> The #u-TBL system represents an attempt to use the search and database capabilities of the Prolog programming language to implement a generalized form of transformation-based learning. In the true spirit of logic-programming, the implementation is 'derived' from a declarative, logical interpretation of transformation rules. The #-TBL system recognizes four kinds of rules, that can be used to implement various kinds of disambiguators, including Constraint Grammar disambiguators as well as more traditional 'Brill-taggers'. Results from a number of experiments and benchmarks are presented which show that the system is both flex&amp;quot; ible and efficient.</Paragraph>
    <Paragraph position="1"> Introduction Since Eric Brill first introduced the method of Transformation-Based Learning (TBL) it has been used to learn rules for many natural language processing tasks, such as part-of-speech tagging \[Brill, 1995\], PP-attachment disambiguation \[Brill and Resnik, 1994\], text chunking \[Ramshaw and Marcus, 1995\], spelling correction \[Mangu and Brill, 1997\], dialogue act tagging \[Samuel et al., 1998\] and ellipsis resolution \[Hardt, 1998\]. Thus, TBL has proved very useful, in many different ways, and is likely to continue to do so in the future.</Paragraph>
    <Paragraph position="2"> Moreover, since Brill generously made his own TBL implementation publicly available, l many researchers in need of all off-the-shelf retrainable part-of-speech tagger have found what they were looking for. However, although very useful, Brill's original implementation is somewhat opaque, templates are not compositional, IThroughout this paper, when referring to Brill's TBL implementation, it is always his contextual-rule-learner implemented in C - that I have in mind. &amp;quot;It is available from http://www, cs. jhu. edu/~brill/, along with ~veral other learners and utility programs.</Paragraph>
    <Paragraph position="3"> and they are hard-wired into the program. Therefore, the program is difficult to modify and extend. What is more, it is fairly slow.</Paragraph>
    <Paragraph position="4"> This paper is dedicated to the design and implementation of an alternative transformation-based learner system, called &amp;quot;the #-TBL system&amp;quot; (pronounced &amp;quot;mutable&amp;quot;). The p-TBL system is designed to be theoretically transparent, flexible and efficient. Transparency is achieved by performing a 'logical reconstruction' of TBL, and by deriving the system from there. Flexibility is achieved through the use of a compositional rule and template formalism, and 'pluggable' &amp;quot;algorithms. As for the implementation, it turns out that transformation-based learning can be implemented very straightforwardly in a logic programming language such as Prolog. Efficient indexing of data, unification and backtracking search, as well as established Prolog programming techniques for building rule compilers and meta-interpreters, contribute to the making of a logically transparent, easily extendible, and fairly efficient system. 2 The content of the paper is presented in a bottom-up fashion, starting from the semantics of transformation rules. First, I show that, contrary to what is often assumed, transformation rules can be given a declarative, logical interpretation. I then introduce the IL-TBL system, which in a manner of speaking is derived from this interpretation of rules. The template compiler, a part of the system which translates templates into efficient Prolog programs, is described, and by w~&amp;quot; of examples it is shown how a particular combination of training data and templates may be 'queried' from the Prolog prompt. Next, a number of variants of all-solutions predicates are specified, that deal with notions such as scores, rankings and thresholds. Since they appear to be independently useful - even useful outside TBL &amp;quot;The ~-TBL system is available from http://~w, ling. gu. se/-~lager/mutbl, html.</Paragraph>
    <Paragraph position="6"> they belong in a separate library. By combining predicates from these code libraries, a number of TBL-like algorithms are assembled, and benchmarks are run that show the/~-TBL system to be quite efficient. Finally, a small experiment using transformation-based learning to induce Constraint Grammars from corpora is performed. null The Semantics of Transformation Rules The object of TBL is to learn an ordered sequence of transformation rules. The p-TBL system supports four kinds of transformation rules.</Paragraph>
    <Paragraph position="7"> Replacement rules dictate when - based on the context - one feature value for a word should be replaced with another feature value. An example would be &amp;quot;replace tag vb with nn if the Word immediately to the left has a tag dt&amp;quot;. Here is how this rule is represented in the #-TBL system's compositional rule/template formalism: tag:vb&gt;nn &lt;- tag:dr@\[-1\].</Paragraph>
    <Paragraph position="8"> This is of course the exact counterpart of the transformation rule in Brill's original framework.</Paragraph>
    <Paragraph position="9"> Addition rules specify when a feature value should be added to a word. An example would be &amp;quot;add tag nn to a word if the word immediately to the left has a tag tit&amp;quot;: tag:0&gt;nn &lt;- tag:dr@\[-1\].</Paragraph>
    <Paragraph position="10"> Note that a feature value is actually added to a word only if it not already there.</Paragraph>
    <Paragraph position="11"> Deletion rules dictate when a feature value should be removed from a word. An example would be &amp;quot;remove tag vb from a word if the word immediately to the left has a tag dt&amp;quot;: tag:vb&gt;0 &lt;- tag:dt@\[-1\].</Paragraph>
    <Paragraph position="12"> Reduction rules reduce the set of feature values for a word with a certain value. An example would be &amp;quot;reduce a word's tag values with tag vb if the word immediately to the left has a tag dr&amp;quot;: tag:vb&gt;l &lt;- tag:dr@I-I\].</Paragraph>
    <Paragraph position="13"> An important difference between deletion rules and reduction rules is that the latter will only remove a feature value from a word if it is not the last value for that feature. If vb is the last value the above rule is not applicable and the reduction will not take place. This should remind us of the kind of constraints that are central to the so called reductionistic approach to disambiguation. as represented by for example Constraint Gram- null mar \[Karlsson et al., 1995\]). Constraint grammars may indeed be possible to learn in the IL-TBL system, as I will show towards the end of this paper.</Paragraph>
    <Paragraph position="14"> In the p-TBL system's rule formalism, conditions may refer to different symbol features, and complex conditions may be composed from simpler ones. For example, here is a rule saying &amp;quot;replace the tag for adverb with the tag for adjective, if the current word is &amp;quot;only&amp;quot;, and if the previous tag, or the tag before that, is a determiner tag.&amp;quot;: tag:ab&gt;jj &lt;- wd:only@\[O\] k tag:dt@\[-1,-2\].</Paragraph>
    <Paragraph position="15"> Ill this paper, I will break with the tradition to think about transformation rules in exclusively procedural terms, and instead try to think about them in declarative and logical terms. Transformation rules (partially) describe an ordered sequence of pairs of symbols, which I will refer to as a relation. Such a relation form training data for a TBL system. Here is a simple (and unrealistically small) example: dt vb nn dt vb kn dt vb ab dt vb dt nn vb dt nn kn dt jj kn dt nn The sequence formed by the upper elements of the pairs will be referred to as Sl, and the sequence formed by the lower elements as Sn. Such sequences can be. modelled by means of two sets of clauses, which relate positions in the sequences to symbol feature values: $1 (1, dr) Sl (2, vb) $1 (3, nn) ... S1 (11, vb) S. (I, dr) S~ (2, nn) S.(3, vb) ... S. (11, nn) A central point in this paper is the suggestion that the declarative semantics of transformation rules can be captured by rule formulas in the form of universally quantified implications, and that, for example, the meanings of the four very simple rules shown previously are captured by the following formulas:</Paragraph>
    <Paragraph position="17"> Rule formulas as such will not be put to any direct computational use, but the notion of a rule formula provides a starting point, from which computational tools can be derived.</Paragraph>
    <Paragraph position="18"> A rule instance is a rule formula in which every variable has been replaced with a constant. Now, we may define the notions of positive and negative instances of rule formulas (and thus indirectly of transformation rules). A positive rule instance is a rule instance where the mltecedent and the consequent are both true.</Paragraph>
    <Paragraph position="19"> Thus, the following formula is a positive instance of the formula corresponding to the simple replacement rule above: Sl(2, vb) A (1 = 2 - 1) A S,(I,dt) --+ S,(2, nn) A negative instance of a rule is a rule instance where the antecedent is true but where the consequent is false, for example:</Paragraph>
    <Paragraph position="21"> Note that Brilrs notion of a neutral instance of a rule, i.e. an instance of a rule that replaces an incorrect tag with another incorrect tag, is a negative instance in my terminology. (In practice, this does not seem to matter much, as I will show later.) We now define two important rule evaluation measures. The score of a rule is the number of its positive instances minus the number of its negative instances: sco~e(R) =1 pos(R) 1 - I neg(R) I The accuracy of a rule is its number of positive instances divided by the total number of instances of the rule:</Paragraph>
    <Paragraph position="23"> The notion of rule accuracy is well-known in rule induction and inductive logic programming, and towards the end of this paper we will see that it may have a role to play in the context of transformation-based learning too.</Paragraph>
    <Paragraph position="24"> An Overview of the #-TBL System Through the use of unification and a particular search strategy (backtracking), a logic programming environment such as Prolog implements a constructive kind of inference which allows us to define predicates that are able to recognize, generate and search for positive and negative instances of transformation rules. Furthermore, a layer of recta-logical predicates provides a way to collect and count such instances, and thus a way to calculate the score and accuracy for any rule. Therefore, in a logic programming framework, transformation-based learning can be implemented in a very clear and simple way.</Paragraph>
    <Paragraph position="25"> However, for such an implementation to become useful. we have to think about efficiency. Among other things, we need to think about how we index our training data. Assuming the part-of-speech tagging task. corpus data can be represented by&amp;quot; means of three &amp;quot;kinds of clauses:  wd (P,N) is true iff the word W is at position P in the corpus tag(P,A) is true iff the word at position P in the corpus is tagged A tag(A,B,P) is true iff the word at, P is tagged A and the correct tag for the word at P is B Although this representation may seem a bit redundant, it provides exactly the kind of indexing into the data that is needed. 3 A decent Prolog system can deal with millions of such clauses.</Paragraph>
    <Paragraph position="26"> Rules that can be learned in TBL are instances of templates, such as &amp;quot;replace tag A with B if tho symbol (e.g. the word) immediately to the left has tag C, where A, B and C are variables. Here is how we write this template in the p-TBL system:</Paragraph>
    <Paragraph position="28"> The term to the left of # is a unique identifier for the template. A template instance is a template in which every variable in the identifier has been replaced by a constant. If we strip the identifier we end up with a transformation rule again. The instantiated identifier uniquely identifies that rule.</Paragraph>
    <Paragraph position="29"> Positive instances of rules that are instancbs of the above template can be efficiently recognized, generated and searched for, by means of the following clause: positive (t3(A,B,C)) :tag(A,B,PO), Pl is PO-I, tag(Pl,C).</Paragraph>
    <Paragraph position="30"> Negative instances are handled as follows: negative (t3 (A,B, C) ) :tag(A,X,PO), dif(X,B), P1 is PO-I,taE(PI,C).</Paragraph>
    <Paragraph position="31"> It should be clear how these clauses use the representation described above, and that they respect the semantics exemplified in the previous section. Clauses corresponding to other templates and other types of rules can be defined accordingly.</Paragraph>
    <Paragraph position="32"> Tied to each template is also an update proce.dure that will apply rules that are instances of this template, and thus update sequences, by replacing feature values with other feature values, adding to the feature values, or removing from them. For example:</Paragraph>
    <Paragraph position="34"> To write clauses such as these by hand for large sets of templates would be tedious and prone to errors and omissions. Fortunately, since the formalism is compositional, it is easy to write a template compiler that generates them automatically. The #u-TBL system uses well-known Prolog compiler writing techniques to expand templates written in the compositional high-level notation into clauses that can be run as programs. Thus, the convenience and flexibility of a high-level notation for templates and rules does not compromise performance.</Paragraph>
    <Paragraph position="35"> A template grammar defines the exact relation between a template and a set of clauses. As an illustration, the following grammar rules are used to expand a template into a Prolog clause defining positive/l, negative/1 and apply/l, for that template:</Paragraph>
    <Paragraph position="37"> \[GI\], cond(Cs,P), \[retract(Gl), retract(G2), assert(G3), assert(G4)\].</Paragraph>
    <Paragraph position="38"> cond((C&amp;Cs),P) --&gt; cond(C,P), cond(Cs,P).</Paragraph>
    <Paragraph position="39"> cond(FA@Pos,PO) --&gt; pos(Pos,PO,P), feat(FA,P). pos(Pos,PO,P) --&gt; \[member(Offset,Pos), P is PO+Offset\]. feat(F:A,P)--&gt; {G =.. \[F,P,A\]}, \[G\]. A modern Prolog system will compile the resulting clauses all the way down to machine code. Thus. a TBL-system implemented in Prolog can be quite efficient. null  The p-TBL Template Compiler When a file containing transformation rules is consulted or compiled, each transformation rule is expanded into several Prolog clauses) As a result of this, a large number of predicates becomes available, some of which are documented in Figure 1.</Paragraph>
    <Paragraph position="40"> Using the predicates generated by the template compiler, the training data in combination with the tem4Also. if the user does not provide them, template ideatlfiers m'e constructed automatically.</Paragraph>
    <Paragraph position="41"> pair(?a,?B) pair(?h,?B,?P) A ~ aligned with B at a position P in the current data. positive (?RuleID) positive (?RuleID, ?a, ?B) positive (?RuleID, ?A, ?B, ?P) RuleID names a rule which has a positive instance in the current data at a position P, whero A is aligned with B. The rule is an instance of a template, which is identified by the functor of RuleID. A call to this predicate usually has many solutions, az~,l tile order in which solutions are returned on backtracking is determined by the order in which templates are presented to the system, and the order of symbols in the training data.</Paragraph>
    <Paragraph position="42">  plates may be queried. By backtracking through the solutions to a call to positive/1 we may for example verify that there are ten ways to instantiate our example template in our example data (for space reasons, I show only the first three solutions):</Paragraph>
    <Paragraph position="44"> Alternatively, we might be interested only in instances where the aligned feature values (h and B) are different, and there are six of those: a Sdif/2 is a built-in predicate in SICStus Prolog. A call to dif (X,Y) constrains X and Y to represent different terms. Calls to dif/2 either succeed~ fail. or are blocked depending oil whether X and Y are sufficiently instantiated.</Paragraph>
    <Paragraph position="45">  l 7- dif(A,B), positive(ID,A,B), ID # Rule.</Paragraph>
    <Paragraph position="46"> Or, we might be interested only in template instances where the aligned symbols have feature values nn and vb, respectively. There is only one such rule:</Paragraph>
    <Paragraph position="48"> Sometimes, a random sample of a positive rule might be more useful:</Paragraph>
    <Paragraph position="50"> As for negative instances, we may want to know if the rule tag:vb&gt;nn &lt;- tag:dr@\[-1\] has any negative instances in the training data, and indeed there is one at position 8, where vb is aligned with jj rather than nn:</Paragraph>
    <Paragraph position="52"> Library ranking is a package for scoring and ranking rules. It was written for the specific purpose of scoring transformation rules in the context of TBL, but is likely to be more generally useful, hence deserving its status as a libra~'y. The basic notions are defined as follows: A score is an integer &gt; 0 A ranking entry is a pair S-R such that S is a score and R is a rule A ranking is an ordered sequence of ranking entries where each rule occurs only once.</Paragraph>
    <Paragraph position="53"> The score of a rule is determined by counting the solutions returned by goals containing the rule (or rather its ID). Thus. many predicates in library ranking are meta-predicates that work much the same way as the so called all-solutions predicates that are built into Prolog. Figure 2 lists some of the predicates available in library ranking.</Paragraph>
    <Paragraph position="54"> The library encapsulates some of Brill's own w~'s of optimizing transformation-based learning - optimizations which are possible to perform for the ranking of rules in general.</Paragraph>
    <Paragraph position="55"> The predicates in library ranking interact in a straightforward way with the predicates generated by the template compiler, as the following examples will show. Here. for instance, is how we compute (and print) a ranking on the basis of the goal invoh'ing a call to positive/3:  count (?R, +Goal ,-N) Binds N to the number of solutions for Goal. un\]ess it fails for lack of solutions. If there are uninsta~itiated variables in Goal, then a call to countl3 may backtrack, generating alternative values for N corresponding to different instantiations of the free variables of Goal. Defined as:</Paragraph>
    <Paragraph position="57"> Computes the score for each instance of R and ranks the instances. However, instances with scores less than the score threshold (ST) are not ranked. Defined as: rank (R, Goal, ST, Rnkng) :setof (N-R, (count (., Goal, N), N&gt;ST), Rnkng0), reverse (RnkngO, Rnkng).</Paragraph>
    <Paragraph position="58"> penalize (?R, +Goal, +Rnkng, +ST, +AT, -NewRnkng) Re-ranks the rules ill Rnkng by subtracting from their scores, giving a new ranking NewRnkng. However. any rule with a score &lt; ST or all accuracy &lt; AT Is just dropped.</Paragraph>
    <Paragraph position="59"> at_position (+N, +Rnkng, -Rule, -Score) Retrieves the Nth rule in the ranking, and its score. highscore (?R, +PGoal, +NGoal, +ST, +AT, ?WR, ?WRS) Among the different instances of R, NR is the rule with the highest score (i.e. the 'winning rule'), and NRS is its score, defined as the number of solutions to the goal PGoal minus the number of solutions to NGoal. However. if NRS &lt; ST, or if no rule clears the accuracy threshold (AT), highscore/7 fails. Works a~ if defined by: highscore (R,PGoal,NGoal, ST, AT, WR, ~IRS) -rank (R, PGoal, ST, Rnkng), penalize (R, NGoal, Rnkng, ST, AT, NewRnkng), at _pos it ion ( 1, NewRnkng, WR, WRS).</Paragraph>
    <Paragraph position="60"> Ill fact, highscore/7 is implemented in a more efficient way. It keeps track of a leading rule and its score, and thus only has to generate and count solutions to NGoal for rules for which the number of positive instances Is greater than the score for the leading lule. Moreover, the score threshold (ST) for the counting of solutions of NGoal can be set to the nmnber of solutions to PGoal minus the score for the leading rule.</Paragraph>
    <Paragraph position="62"> This concludes the demonstration of how the template compiler and the ranking library allows a particular combination of templates and training data to be interactively explored from the Prolog prompt.</Paragraph>
    <Paragraph position="63"> Simple TBL Full transformation-based learning is just a small snippet of code away. Given corpus data, templates and values for the thresholds (ST and AT), the predicate tbl/3 implements learning of a sequence of rules:</Paragraph>
    <Paragraph position="65"> This predicate, defined entirely in terms of predicates generated by the template compiler and predicates from library ranking, combines all the important principles of TBL into a complete learning program, that repeatedly instantiates rule templates in training data, scores rules on the basis of counts of positive and negative instances of them, selects the highest scoring rule on the basis of this ranking, and applies it to the training data.</Paragraph>
    <Paragraph position="66"> Consider our small example once again. Here are the three rules learned (with the score threshold set to 1) tag:vb&gt;nn &lt;- tag:dt~\[-1\].</Paragraph>
    <Paragraph position="67"> tag:ab&gt;kn &lt;- tag:nn@\[-1\].</Paragraph>
    <Paragraph position="68"> tag:nn&gt;vb &lt;- tag:nn@\[-l\].</Paragraph>
    <Paragraph position="69"> and here are the transformations that the upper sequence of the training data goes through, when the rules are applied in the given order: dt vb nn dt vb kn dt vb ab dt vb dt nn nn dt nn kn dt nn ab dt nn dt nn nn dt nn kn dt nn kn dt nn d~ nn vb dt nn kn dt nn kn dt nn It is interesting to regard what is happening hei-e as a decomposition of a relation S1-S,, into a number of re- null lations SI-S.&gt;_ o... o S,-I-S,, corresponding to a number of rules R1 o ... o R,_,.</Paragraph>
    <Paragraph position="70"> In general, for such a decomposition S,-S, = Si-S,+l o S,+I-Sn, it holds that if a rule R, has P positive and N negative instances in S,-S,, then (i) R, will have P + N positive and no negative instances in Ss-Si+l, and (ii) Ri will have no positive nor negative instances in S,+I-S,. Clearly, (i) follows from the fact that the update procedure associated with R, changed the negative instances of Ri in S,-S, into positive ones, and (ii) from the fact that the antecedent of R, must be false in S,+l'Sn. As a corollary to (ii) it follows that Ri will not be selected next.</Paragraph>
    <Paragraph position="71"> For each step, as long as P &gt; ST + N, then S, will become more similar to Sn. Note. in our example, that there is one rule, tag:nn&gt;jj &lt;- tag:dr@\[-1\], that would remove the only remaining difference between $4 and S,. However, this rule also has three negative instances, and thus the rule gets a score below tim threshold.</Paragraph>
    <Section position="1" start_page="38" end_page="40" type="sub_section">
      <SectionTitle>
Scaling Up
</SectionTitle>
      <Paragraph position="0"> Program 1 is indeed small, simple and transparent. But what about efficiency? How well does it scale up to handle real world tasks, such as part-of-speech tagging? In one small test the learner was operating on annotated Swedish corpora 6 of three different sizes, with 23 different tags, and the 26 templates that BriU uses in his distribution:</Paragraph>
      <Paragraph position="2"> pus (SUC). Here is a key to the part-of-speech tags appearing in the present paper: ma = noun, vb = verb, pp = preposition, pra = proper name. dt= determiner, pn = pronoun, ie = infinitive marker, sn = subjunction, jj = adjective,</Paragraph>
      <Paragraph position="4"> Below, I show the first thirteen rules, as they are reported by the p-TBL system (during training on the 30kw corpus). Each rule is preceded by its score (first column), and by its accuracy (second column).</Paragraph>
      <Paragraph position="5">  Note that the actual accuracy of a learned rule can sometimes be well below 1.00. (The accuracy threshold was set to 0.5 in this experiment.) The sequence of rules works well anyway, since the damage done by an incorrect rule can be repaired by rules later in the sequence. (In fact, a small experiment confirmed that the setting of the accuracy threshold to 1.00 generates a tagger which performs less well.) For each corpus, the accuracy of the learned sequence of rules was measured on a test corpus consisting of 40,000 words, with an initial-state accuracy of 93.3%. The system was running on a Sun Ultra Enterprise 3000 with a 250Mhz processor. Table 1 summarizes the results of the tests:  The performance of the p-TBL system was compared with Brill's learner running on the same machine, with the same templates, score thresholds and training data. Table 2 gives the figures.</Paragraph>
      <Paragraph position="6"> These tests verify that the program works as expected, and also that it is quite efficient, despite its small size and simple design. In fact. the tests show that p-TBL learner is an order of magnitude faster than Brill's original learner for this particular task.</Paragraph>
      <Paragraph position="7">  TBL h la Eric Brill This algorithm is perhaps tile one that resembles Brill's&amp;quot; own algorithm the most. It differs fi'om the siml)le algorithm in that to learn one rule, it ranks the error types that occur in the trairting data (using rank/4 from library ranking to do so), and then it searches top-to-bottom in this ranking, entry by entry, for a rule which fixes the type of error recorded by the entry, always keeping track of a leading rule and its score. When the score for a ranking entry drops below the leading rule's score, the search is abandoned, and the leader is declared winner. This effectively prunes the search space without losing completeness, and it also saves a lot of memory, since only rules for one kind of error at a time have to be held in memory.</Paragraph>
      <Paragraph position="8">  The benchmark results, using the same setup as with the simple algorithm, are shown in Table 3.</Paragraph>
      <Paragraph position="9"> As can be seen from Table 3, the optimized algorithm is significantly faster than the simple one, and it uses less memory. However, as pointed out in \[Ramshaw and Marcus, 1995\], the effect of this particular optimization method depends on the size of the  tag set. The larger the tag set, the more benefit we can expect. Thus, we can expect to see even greater improvements for many learning tasks.</Paragraph>
      <Paragraph position="10"> Note also that in contrast with the simple algorithm, this algorithm uses Brill's notion of negative rule instance. The call negative (g, A, A) ensures that neutral instances are not counted as negative. However, it appears that the way negative instances are counted does not matter much, at least not for this application. The rules look pretty much the same as the rules generated by Brill's learner, and in fact, the first ten rules are identical.</Paragraph>
      <Paragraph position="11"> Monte Carlo TBL The original TBL algorithm suffers from the fact that the number of candidate rules to consider grows very fast with the number of rule templates, and in practice only a small number of templates can be handled. \[Samuel et al., 1998\] presents a novel twist to the algorithm, in order to solve this problem. The idea is to randomly sample from the space of possible rules, rather than generating them all. The better the rule is, the greater the chance that it is included in the sample. Thus, the system is likely to find the best rules first. An implementation of this algorithm can be assembled by replacing the definition of learn_one/6 in Program 2 with the following definition:  That is, highscore/7 picks tile rules that it evaluates from the (here) 16 rules that are sampled. The amount of work that highscore/7 has to perform, and the memory requirements, no longer depends on how many templates there are.</Paragraph>
      <Paragraph position="12"> To test the algorithm, the system was run with 260 templates with the 60,000 word corpus, and a comparison was made with the optimized algorithm. The outcome of this experiment is reported in Table 4.</Paragraph>
      <Paragraph position="13">  the '~ la Brill algorithm.</Paragraph>
      <Paragraph position="14"> As can be seen from the table, although the other algorithm did not perform too bad with 260 templates, the 'lazy' algorithm was an order of magllitude faster. Accuracy was not compromised, although the number of rules grew.</Paragraph>
      <Paragraph position="15"> As a sidenote, let me describe a convenient /L-TBL system feature which makes it possible to train with very many templates without actually writing them 'all down. Instead of loading a set of templates into the system, the user may load a couple of template declarations, which, in terms of 'window' sizes and ranges of relative positions over which windows 'slide', constrain the relation between templates and clauses, defined by the template grammar. Constrained in this way, the grammar Call be used to generate templates. Without going into any further details, let me just show the declarations which causes the system to generate the 260 templates used above: * - head(tag:A&gt;B).</Paragraph>
      <Paragraph position="16"> :- window_size(tag,3).</Paragraph>
      <Paragraph position="17"> &amp;quot;- window_size(wd,2).</Paragraph>
      <Paragraph position="18"> :- range(tag,\[-3,-2,-l,l,2,3\]).</Paragraph>
      <Paragraph position="19"> :- range(wd,\[-2,-1,1,2\]).</Paragraph>
      <Paragraph position="20"> &amp;quot;- anchors(\[-t,O,1\]).</Paragraph>
    </Section>
    <Section position="2" start_page="40" end_page="40" type="sub_section">
      <SectionTitle>
Learning Constraint Grammars
</SectionTitle>
      <Paragraph position="0"> In another experiment, the #-TBL system was run with a number of templates for reduction rules, in order to see if something resembling a Constraint Grammar could be induced from training data. Each word token in a training corpus of 30,000 words was assigned the set of part-of-speech tags that it can have according to a lexicon. The training data also indicated which member of this set was the correct one.</Paragraph>
      <Paragraph position="1"> The system was run with the following four templates: null tag:A&gt;l &lt;- unique(tag:C@\[-l\]).</Paragraph>
      <Paragraph position="2"> tag:A&gt;l &lt;- unique (tag : C@ \[l\] ) .</Paragraph>
      <Paragraph position="3"> tag:A&gt;l &lt;- wd:C@\[O\] &amp; unique(tag:D@\[-l\]).</Paragraph>
      <Paragraph position="4"> tag:h&gt;l &lt;- wd:C@\[O\] k unique(tag:DO\[l\]).</Paragraph>
      <Paragraph position="5"> The use of the unique/1 wrapper in the conditions of the rules has the effect that a rule will trigger only if the assignments of tags to words in the relevant surroundings ar e non-mnbiguous. (As Karlsson et al. (1995) put it, the rules are run in &amp;quot;careful application mode&amp;quot;.) As mentioned earlier, replacement rules do not have to be very accurate: if a rule early in a sequence of replacement rules makes some errors, the errors can often be 'fixed' by rules later in the sequence. By contrast, in a sequence of reduction rules there are no rules that can add tags once they have been removed. Therefore, in order to maximize the accuracy of the whole sequence of rules, it must be induced under a validation bias which sees to it that each rule is as accurate as possible. In the #-TBL system, this is taken care of by&amp;quot; setting the accuracy threshold to a very high value. However, a sequence of rules induced in this way will typically leave many words with more than one tag. If we want instead to minimize the tags per words ratio, the accuracy threshold can be set to a lower value, but then a lower tagging accuracy will naturally result. In the experiment, the accuracy threshold was set to 0.99 (which still allows for a bit of noise in the data) and to 0.85. Program 2 was used.</Paragraph>
      <Paragraph position="6"> Below, I show the first ten rules that were learned by the system (with the accuracy threshold set to 0.99):  451 1.00 tag:rg&gt;l &lt;451 1.00 tag:pl&gt;l &lt;274 1.00 tag:pn&gt;l &lt;274 0.99 tag:ab&gt;l &lt;230 1.00 tag:pl&gt;l &lt;222 1.00 tag:ab&gt;l &lt;221 1.00 tag:rg&gt;l &lt;219 1.00 tag:p1&gt;1 &lt;200 1.00 tag:p1&gt;1 &lt;166 0.99 tag:rg&gt;l &lt;-</Paragraph>
      <Paragraph position="8"> wd: en@ \[0\] &amp; unique (nn@ \[i\] ).</Paragraph>
      <Paragraph position="9"> wd: en@ \[0\] &amp; unique (nn@ \[i\] ).</Paragraph>
      <Paragraph position="10"> unique(dl@ \[-I\] ).</Paragraph>
      <Paragraph position="12"> The induced sequences of rules were tested on a corpus of 11,000 words. Both the accuracy and the tags per word ratio in the test corpus were measured7 The initial tags per word ratio in the test corpus was 1.35.</Paragraph>
      <Paragraph position="13"> The results of the tests are given in Table 5.</Paragraph>
      <Paragraph position="14">  TA word is deemed to be accurately tagged if the correct tag is an element in the set of tags that the word has been assigned.</Paragraph>
      <Paragraph position="15"> These results are promising. But before it would be fair to compare with other methods for inducing Constraint Grammars from annotated corpora, e.g. the methods described ill \[Samuelsson et al., 1996\] or in \[Lindberg and Eineborg, 1998\], it remains to determine the optimal set of templates and the optimal settings of the accuracy threshold. Very likely, the learning process (applied to the learning of reduction rules) can also be optimized for speed. In short, a lot more has to be done, but at least this section has shown how easily an experiment like this can be set up in the #-TBL environment. null Summary and Conclusions The #-TBL system is not just a re-implementation of original TBL in another programming language.</Paragraph>
      <Paragraph position="16"> Rather it should be seen as all attempt to use tile reasoning and database capabilities of Prolog to do TBL ill a more high-level way. The #-TBL system is: General - The system supports four types of rules by means of which not only traditional 'Brill-taggers', but also Constraint Grammar disambiguators, are possible to train.</Paragraph>
      <Paragraph position="17"> Easily extendible - Through its support of a compositional rule/template formalism and 'pluggable' algorithms, the system can easily be tailored to different learning tasks.</Paragraph>
      <Paragraph position="18"> Transparent - Rules have a declarative, logical semantics which, among other things, has proved to be of great value during the implementation work.</Paragraph>
      <Paragraph position="19"> Efficient - A number of benchmarks have been run which show that the system is fairly efficient - an order of magnitude faster than Brill's contextual-rule learner.</Paragraph>
      <Paragraph position="20"> Interactive - Prolog is all interactive language and this is something that the #-TBL system inherits.</Paragraph>
      <Paragraph position="21"> Small - Thanks to the choice of implementation language, the system's code base can be kept quite small. Indeed, a 'light' version of the #-TBL system., consisting of just one page of Prolog code, has been implemented \[Lager, 1999\].</Paragraph>
      <Paragraph position="22"> In short, the #-TBL system is a powerful environment in which to experiment with transformation-based learning.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>