<?xml version="1.0" standalone="yes"?> <Paper uid="W95-0107"> <Title>Text Chunking using Transformation-Based Learning</Title> <Section position="5" start_page="85" end_page="86" type="metho"> <SectionTitle> 3 The Transformation-based Learning Paradigm </SectionTitle> <Paragraph position="0"> As shown in Fig. 1, transformation-based learning starts with a supervised training corpus that specifies the correct values for some linguistic feature of interest, a baseline heuristic for predicting initial values for that feature, and a set of rule templates that determine a space of possible transformational rules. The patterns of the learned rules match particular combinations of features in the neighborhood surrounding a word, and their action is to change the system's current guess as to the feature for that word.</Paragraph> <Paragraph position="1"> To learn a model, one first applies the baseline heuristic to produce initial hypotheses for each site in the training corpus. At each site where this baseline prediction is not correct, the templates are then used to form instantiated candidate rules with patterns that test selected features in the neighborhood of the word and actions that correct the currently incorrect tag assignment. This process eventually identifies all the rule candidates generated by that template set that would have a positive effect on the current tag assignments anywhere in the corpus.</Paragraph> <Paragraph position="2"> Those candidate rules are then tested against the rest of the corpus to determine the number of locations at which they would cause negative changes. One of those rules whose net score (positive changes minus negative changes) is maximal is then selected, applied to the corpus, and also written out as the first rule in the learned sequence. This entire learning process is then repeated on the transformed corpus: deriving candidate rules, scoring them, and selecting one with the maximal positive effect.
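The pass just described (generate candidates at error sites, score each by its net benefit, then select and apply a best-scoring rule) can be sketched as follows. This is a simplified illustration rather than the authors' implementation; the (pattern, from-tag, to-tag) rule representation and the function name are our own.

```python
def learn_one_pass(tags, gold, candidates):
    """Select and apply the best-scoring transformation (toy sketch).

    tags: current tag guesses (initialized by the baseline heuristic);
    gold: the correct tags from the supervised training corpus;
    candidates: rules as (pattern, from_tag, to_tag) triples, where
    pattern(tags, i) tests features in the neighborhood of position i.
    """
    def net_score(rule):
        pattern, old, new = rule
        net = 0
        for i in range(len(tags)):
            if tags[i] == old and pattern(tags, i):
                # positive changes minus negative changes
                net += 1 if gold[i] == new else -1
        return net

    best = max(candidates, key=net_score)
    if net_score(best) > 0:
        pattern, old, new = best
        for i in range(len(tags)):       # apply the winning rule, left to right
            if tags[i] == old and pattern(tags, i):
                tags[i] = new
        return best
    return None
```

Iterating this pass on the transformed corpus until no candidate has a positive net score yields the ordered rule sequence.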
This process is iterated, leading to an ordered sequence of rules, with rules discovered first ordered before those discovered later. The predictions of the model on new text are determined by beginning with the baseline heuristic prediction and then applying each rule in the learned rule sequence in turn.</Paragraph> </Section> <Section position="6" start_page="86" end_page="88" type="metho"> <SectionTitle> 4 Transformational Text Chunking </SectionTitle> <Paragraph position="0"> This section discusses how text chunking can be encoded as a tagging problem that can be conveniently addressed using transformational learning. We also note some related adaptations in the procedure for learning rules that improve its performance, taking advantage of ways in which this task differs from the learning of part-of-speech tags.</Paragraph> <Section position="1" start_page="86" end_page="86" type="sub_section"> <SectionTitle> 4.1 Encoding Choices </SectionTitle> <Paragraph position="0"> Applying transformational learning to text chunking requires that the system's current hypotheses about chunk structure be represented in a way that can be matched against the pattern parts of rules. One way to do this would be to have patterns match tree fragments and actions modify tree geometries, as in Brill's transformational parser (1993a).
In this work, we have found it convenient to do so by encoding the chunking using an additional set of tags, so that each word carries both a part-of-speech tag and also a &quot;chunk tag&quot; from which the chunk structure can be derived.</Paragraph> <Paragraph position="1"> In the baseNP experiments aimed at non-recursive NP structures, we use the chunk tag set {I, O, B}, where words marked I are inside some baseNP, those marked O are outside, and the B tag is used to mark the leftmost item of a baseNP which immediately follows another baseNP.</Paragraph> <Paragraph position="2"> In these tests, punctuation marks were tagged in the same way as words.</Paragraph> <Paragraph position="3"> In the experiments that partitioned text into N and V chunks, we use the chunk tag set {BN, N, BV, V, P}, where BN marks the first word and N the succeeding words in an N-type group while BV and V play the same role for V-type groups. Punctuation marks, which are ignored in Abney's chunk grammar, but which the Treebank data treats as normal lexical items with their own part-of-speech tags, are unambiguously assigned the chunk tag P. Items tagged P are allowed to appear within N or V chunks; they are irrelevant as far as chunk boundaries are concerned, but they are still available to be matched against as elements of the left hand sides of rules.</Paragraph> <Paragraph position="4"> Encoding chunk structure with tags attached to words rather than non-recursive bracket markers inserted between words has the advantage that it limits the dependence between different elements of the encoded representation. While brackets must be correctly paired in order to derive a chunk structure, it is easy to define a mapping that can produce a valid chunk structure from any sequence of chunk tags; the few hard cases that arise can be handled completely locally.
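As a concrete illustration, such a mapping for the baseNP tag set can be written in a few lines; the sketch below uses our own naming and is not code from the paper, but it implements the local repair described in the text.

```python
def baseNP_chunks(chunk_tags):
    """Map a sequence over {I, O, B} to baseNP spans (half-open indices).

    Any tag sequence yields a valid chunking: a B that does not directly
    follow a chunk-internal tag behaves like an I, so the one hard case
    is repaired locally.
    """
    chunks, start, prev = [], None, 'O'
    for i, tag in enumerate(chunk_tags):
        if tag == 'B' and prev == 'O':
            tag = 'I'                      # hard case: B after O acts as I
        if tag == 'O':
            if start is not None:
                chunks.append((start, i))  # close any open chunk
            start = None
        elif tag == 'B':
            if start is not None:
                chunks.append((start, i))  # B separates two abutting baseNPs
            start = i
        elif start is None:                # an I opening a fresh chunk
            start = i
        prev = tag
    if start is not None:
        chunks.append((start, len(chunk_tags)))
    return chunks
```

For example, the tag sequence I I B I O I decodes to three chunks: words 0-1, words 2-3, and word 5.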
For example, in the baseNP tag set, whenever a B tag immediately follows an O, it must be treated as an I, and, in the partitioning chunk tag set, wherever a V tag immediately follows an N tag without any intervening BV, it must be treated as a BV.</Paragraph> </Section> <Section position="2" start_page="86" end_page="87" type="sub_section"> <SectionTitle> 4.2 Baseline System </SectionTitle> <Paragraph position="0"> Transformational learning begins with some initial &quot;baseline&quot; prediction, which here means a baseline assignment of chunk tags to words. Reasonable suggestions for baseline heuristics after a text has been tagged for part-of-speech might include assigning to each word the chunk tag that it carried most frequently in the training set, or assigning each part-of-speech tag the chunk tag that was most frequently associated with that part-of-speech tag in the training set. We tested both approaches, and the baseline heuristic using part-of-speech tags turned out to do better, so it was the one used in our experiments. The part-of-speech tags used by this baseline heuristic, and then later also matched against by transformational rule patterns, were derived by running the raw texts in a prepass through Brill's transformational part-of-speech tagger (Brill, 1993c).</Paragraph> </Section> <Section position="3" start_page="87" end_page="88" type="sub_section"> <SectionTitle> 4.3 Rule Templates </SectionTitle> <Paragraph position="0"> In transformational learning, the space of candidate rules to be searched is defined by a set of rule templates that each specify a small number of particular feature sets as the relevant factors that a rule's left-hand-side pattern should examine, for example, the part-of-speech tag of the word two to the left combined with the actual word one to the left.
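The template just mentioned can be pictured as a list of (feature stream, offset) pairs; the sketch below, in our own hypothetical encoding, shows how such a template is instantiated at a mis-tagged site to yield one candidate rule, with ZZZ standing in as a dummy value for positions beyond the sentence.

```python
def instantiate(template, site, streams, correct_tag):
    """Build one candidate rule from a template at a mis-tagged site.

    template: (feature, offset) pairs, e.g. (('pos', -2), ('word', -1))
    for the example in the text.  streams maps 'word', 'pos', and 'chunk'
    to the parallel sequences for the sentence.  The rule pairs the
    tested (feature, offset, value) triples with the corrective action
    of changing the current chunk tag to the correct one.
    """
    n = len(streams['word'])
    pattern = []
    for feature, offset in template:
        j = site + offset
        # range() membership also rejects negative j, which plain list
        # indexing would silently wrap around; ZZZ marks out-of-bounds.
        value = streams[feature][j] if j in range(n) else 'ZZZ'
        pattern.append((feature, offset, value))
    return (tuple(pattern), streams['chunk'][site], correct_tag)
```

Collecting these instantiated rules over every currently mis-tagged site produces the candidate set for a learning pass.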
In the preliminary scan of the corpus for each learning pass, it is these templates that are applied to each location whose current tag is not correct, generating a candidate rule that would apply at least at that one location, matching those factors and correcting the chunk tag assignment.</Paragraph> <Paragraph position="1"> When this approach is applied to part-of-speech tagging, the possible sources of evidence for templates involve the identities of words within a neighborhood of some appropriate size and their current part-of-speech tag assignments. In the text chunking application, the tags being assigned are chunk structure tags, while the part-of-speech tags are a fixed part of the environment, like the lexical identities of the words themselves. This additional class of available information causes a significant increase in the number of reasonable templates if templates for a wide range of the possible combinations of evidence are desired. The distributed version of Brill's tagger (Brill, 1993c) makes use of 26 templates, involving various mixes of word and part-of-speech tests on neighboring words. Our tests were performed using 100 templates; these included almost all of Brill's combinations, and extended them to include references to chunk tags as well as to words and part-of-speech tags.</Paragraph> <Paragraph position="2"> The set of 100 rule templates used here was built from repetitions of 10 basic patterns, shown on the left side of Table 2 as they apply to words. The same 10 patterns can also be used to match against part-of-speech tags, encoded as P0, P-1, etc. (In other tests, we have explored mixed templates that match against both word and part-of-speech values, but no mixed templates were used in these experiments.) These 20 word and part-of-speech patterns were then combined with each of the 5 different chunk tag patterns shown on the right side of the table.
The cross product of the 20 word and part-of-speech patterns with the 5 chunk tag patterns determined the full set of 100 templates used.</Paragraph> <Paragraph position="3"> word 1 to left; word 1 to right; current word and word to left; current word and word to right; word to left and word to right; two words to left; two words to right; word 1 or 2 or 3 to left; word 1 or 2 or 3 to right</Paragraph> </Section> </Section> <Section position="7" start_page="88" end_page="89" type="metho"> <SectionTitle> 5 Algorithm Design Issues </SectionTitle> <Paragraph position="0"> The large increase in the number of rule templates in the text chunking application when compared to part-of-speech tagging pushed the training process against the available limits in terms of both space and time, particularly when combined with the desire to work with the largest possible training sets. Various optimizations proved to be crucial to make the tests described feasible.</Paragraph> <Section position="1" start_page="88" end_page="88" type="sub_section"> <SectionTitle> 5.1 Organization of the Computation </SectionTitle> <Paragraph position="0"> One change in the algorithm is related to the smaller size of the tag set. In Brill's tagger (Brill, 1993c), an initial calculation in each pass computes the confusion matrix for the current tag assignments and sorts the entries of that [old-tag x new-tag] matrix, so that candidate rules can then be processed in decreasing order of the maximum possible benefit for any rule changing, say, old tag I to new tag J.
The search for the best-scoring rule can then be halted when a cell of the confusion matrix is reached whose maximum possible benefit is less than the net benefit of some rule already encountered.</Paragraph> <Paragraph position="1"> The power of that approach is dependent on the fact that the confusion matrix for part-of-speech tagging partitions the space of candidate rules into a relatively large number of classes, so that one is likely to be able to exclude a reasonably large portion of the search space. In a chunk tagging application, with only 3 or 4 tags in the effective tagset, this approach based on the confusion matrix offers much less benefit.</Paragraph> <Paragraph position="2"> However, even though the confusion matrix does not usefully subdivide the space of possible rules when the tag set is this small, it is still possible to apply a similar optimization by sorting the entire list of candidate rules on the basis of their positive scores, and then processing the candidate rules (which means determining their negative scores and thus their net scores) in order of decreasing positive scores. By keeping track of the rule with maximum benefit seen so far, one can be certain of having found one of the globally best rules when one reaches candidate rules in the sorted list whose positive score is not greater than the net score of the best rule so far.</Paragraph> </Section> <Section position="2" start_page="88" end_page="89" type="sub_section"> <SectionTitle> 5.2 Indexing Static Rule Elements </SectionTitle> <Paragraph position="0"> In earlier work on transformational part-of-speech tagging (Ramshaw and Marcus, 1994), we noted that it is possible to greatly speed up the learning process by constructing a full, bidirectional index linking each candidate rule to those locations in the corpus at which it applies and each location in the corpus to those candidate rules that apply there. 
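The cutoff over positive-score-sorted candidates described above can be sketched as follows; this is a simplified sketch with hypothetical scoring callbacks, since a rule's net score can never exceed its positive score.

```python
def best_rule(candidates, positive_score, net_score):
    """Return a rule of maximal net score without scoring every candidate.

    positive_score(r) counts the sites where r helps; net_score(r) is the
    expensive quantity that also subtracts the sites where r hurts.
    Candidates are processed in decreasing order of positive score, and
    the search stops once no remaining candidate could possibly beat the
    best net score found so far.
    """
    ranked = sorted(candidates, key=positive_score, reverse=True)
    best, best_net = None, 0
    for rule in ranked:
        if best_net >= positive_score(rule):
            break                          # no later candidate can win
        score = net_score(rule)
        if score > best_net:
            best, best_net = rule, score
    return best
```

In practice this means the negative scores are computed only for a prefix of the sorted candidate list.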
Such an index allows the process of applying rules to be performed without having to search through the corpus. Unfortunately, such complete indexing proved to be too costly in terms of physical memory to be feasible in this application.</Paragraph> <Paragraph position="1"> However, it is possible to construct a limited index that lists for each candidate rule those locations in the corpus at which the static portions of its left-hand-side pattern match. Because this index involves only the stable word identity and part-of-speech tag values, it does not require updating; thus it can be stored more compactly, and it is also not necessary to maintain back pointers from corpus locations to the applicable rules. This kind of partial static index proved to be a significant advantage in the portion of the program where candidate rules with relatively high positive scores are being tested to determine their negative scores, since it avoids the necessity of testing such rules against every location in the corpus.</Paragraph> </Section> <Section position="3" start_page="89" end_page="89" type="sub_section"> <SectionTitle> 5.3 Heuristic Disabling of Unlikely Rules </SectionTitle> <Paragraph position="0"> We also investigated a new heuristic to speed up the computation: After each pass, we disable all rules whose positive score is significantly lower than the net score of the best rule for the current pass. A disabled rule is then reenabled whenever enough other changes have been made to the corpus that it seems possible that the score of that rule might have changed enough to bring it back into contention for the top place. 
This is done by adding some fraction of the changes made in each pass to the positive scores of the disabled rules, and reenabling rules whose adjusted positive scores come within a threshold of the net score of the successful rule on some pass.</Paragraph> <Paragraph position="1"> Note that this heuristic technique introduces some risk of missing the actual best rule in a pass, due to its being incorrectly disabled at the time. However, empirical comparisons between runs with and without rule disabling suggest that conservative use of this technique can produce an order of magnitude speedup while imposing only a very slight cost in terms of suboptimality of the resulting learned rule sequence.</Paragraph> </Section> </Section> <Section position="8" start_page="89" end_page="91" type="metho"> <SectionTitle> 6 Results </SectionTitle> <Paragraph position="0"> The automatic derivation of training and testing data from the Treebank analyses allowed for fully automatic scoring, though the scores are naturally subject to any remaining systematic errors in the data derivation process as well as to bona fide parsing errors in the Treebank source.</Paragraph> <Paragraph position="1"> Table 3 shows the results for the baseNP tests, and Table 4 shows the results for the partitioning chunks task. Since training set size has a significant effect on the results, values are shown for three different training set sizes. (The test set in all cases was 50K words. Training runs were halted after the first 500 rules; rules learned after that point affect relatively few locations in the training set and have only a very slight effect for good or ill on test set performance.) The first line in each table gives the performance of the baseline system, which assigned a baseNP or chunk tag to each word on the basis of the POS tag assigned in the prepass.
Performance is stated in terms of recall (percentage of correct chunks found) and precision (percentage of chunks found that are correct), where both ends of a chunk had to match exactly for it to be counted. The raw percentage of correct chunk tags is also given for each run, and for each performance measure, the relative error reduction compared to the baseline is listed. The partitioning chunks do appear to be somewhat harder to predict than baseNP chunks. The higher error reduction for the former is partly due to the fact that the part-of-speech baseline for that task is much lower.</Paragraph> <Section position="1" start_page="90" end_page="91" type="sub_section"> <SectionTitle> 6.1 Analysis of Initial Rules </SectionTitle> <Paragraph position="0"> To give a sense of the kinds of rules being learned, the first 10 rules from the 200K baseNP run are shown in Table 5. It is worth glossing the rules, since one of the advantages of transformation-based learning is exactly that the resulting model is easily interpretable. In the first of the baseNP rules, adjectives (with part-of-speech tag JJ) that are currently tagged I but that are followed by words tagged O have their tags changed to O. In Rule 2, determiners that are preceded by two words both tagged I have their own tag changed to B, marking the beginning of a baseNP that happens to directly follow another. (Since the tag B is only used when baseNPs abut, the baseline system tags determiners as I.) Rule 3 takes words which immediately follow determiners tagged I that in turn follow something tagged O and changes their tag to also be I. Rules 4-6 are similar to Rule 2, marking the initial words of baseNPs that directly follow another baseNP. Rule 7 marks conjunctions (with part-of-speech tag CC) as I if they follow an I and precede a noun, since such conjunctions are more likely to be embedded in a single baseNP than to separate two baseNPs, and Rules 8 and 9 do the same.
(The word &quot;&amp;&quot; in rule 8 comes mostly from company names in the Wall St. Journal source data.) Finally, Rule 10 picks up cases like &quot;including about four million shares&quot; where &quot;about&quot; is used as a quantifier rather than a preposition.</Paragraph> <Paragraph position="2"> A similar list of the first ten rules for the chunk task can be seen in Table 6. To gloss a few of these, in the first rule here, determiners (with part-of-speech tag DT), which usually begin N chunks and thus are assigned the baseline tag BN, have their chunk tags changed to N if they follow a word whose tag is also BN. In Rule 2, sites currently tagged N but which fall at the beginning of a sentence have their tags switched to BN. (The dummy tag Z and word ZZZ indicate that the locations one to the left are beyond the sentence boundaries.) Rule 3 changes N to BN after a comma (which is tagged P), and in Rule 4, locations tagged BN are switched to BV if the following location is tagged V and has the part-of-speech tag VB.</Paragraph> </Section> </Section> <Section position="9" start_page="91" end_page="92" type="metho"> <SectionTitle> Table 6 (residue): columns Pass, Old Tag, Context, New Tag; old tags by pass: 1. BN 2. N 3. N 4. BN 5. N 6. N 7. BV 8. V 9. BV 10. BN </SectionTitle> <Paragraph position="0"> </Paragraph> <Paragraph position="2"/> <Section position="1" start_page="91" end_page="91" type="sub_section"> <SectionTitle> 6.2 Contribution of Lexical Templates </SectionTitle> <Paragraph position="0"> The fact that this system includes lexical rule templates that refer to actual words sets it apart from approaches that rely only on part-of-speech tags to predict chunk structure. To explore how much difference in performance those lexical rule templates make, we repeated the above test runs omitting templates that refer to specific words.
The results for these runs, in Tables 7 and 8, suggest that the lexical rules improve performance on the baseNP chunk task by about 1% (roughly 5% of the overall error reduction) and on the partitioning chunk task by about 5% (roughly 10% of the error reduction). Thus lexical rules appear to be making a limited contribution in determining baseNP chunks, but a more significant one for the partitioning chunks.</Paragraph> </Section> <Section position="2" start_page="91" end_page="92" type="sub_section"> <SectionTitle> 6.3 Frequent Error Classes </SectionTitle> <Paragraph position="0"> A rough hand categorization of a sample of the errors from a baseNP run indicates that many fall into classes that are understandably difficult for any process using only local word and part-of-speech patterns to resolve. The most frequent single confusion involved words tagged VBG and VBN, whose baseline prediction given their part-of-speech tag was O, but which also occur frequently inside baseNPs. The system did discover some rules that allowed it to fix certain classes of VBG and VBN mistaggings, for example, rules that retagged VBNs as I when they preceded an NN or NNS tagged I. However, many also remained unresolved, and many of those appear to be cases that would require more than local word and part-of-speech patterns to resolve.</Paragraph> <Paragraph position="1"> The second most common class of errors involved conjunctions, which, combined with the former class, make up half of all the errors in the sample. The Treebank tags the words &quot;and&quot; and frequently &quot;,&quot; with the part-of-speech tag CC, which the baseline system again predicted would fall most often outside of a baseNP. However, the Treebank parses do also frequently classify conjunctions of Ns or NPs as a single baseNP, and again there appear to be insufficient clues in the word and tag contexts for the current system to make the distinction.
Frequently, in fact, the actual choice of structure assigned by the Treebank annotators seemed largely dependent on semantic indications unavailable to the transformational learner.</Paragraph> </Section> </Section> <Section position="10" start_page="92" end_page="92" type="metho"> <SectionTitle> 7 Future Directions </SectionTitle> <Paragraph position="0"> We are planning to explore several different paths that might increase the system's power to distinguish the linguistic contexts in which particular changes would be useful. One such direction is to expand the template set by adding templates that are sensitive to the chunk structure. For example, instead of referring to the word two to the left, a rule pattern could refer to the first word in the current chunk, or the last word of the previous chunk. Another direction would be to enrich the vocabulary of chunk tags, so that they could be used during the learning process to encode contextual features for use by later rules in the sequence.</Paragraph> <Paragraph position="1"> We would also like to explore applying these same kinds of techniques to building larger scale structures, in which larger units are assembled or predicate/argument structures derived by combining chunks. One interesting direction here would be to explore the use of chunk structure tags that encode a form of dependency grammar, where the tag &quot;N+2&quot; might mean that the current word is to be taken as part of the unit headed by the N two words to the right.</Paragraph> </Section> </Paper>