<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0607"> <Title>Applying Extrasentential Context To Maximum Entropy Based Tagging With A Large Semantic And Syntactic Tagset</Title>
<Section position="4" start_page="46" end_page="46" type="metho"> <SectionTitle> [Figure 1 fragment] </SectionTitle>
<Paragraph position="0"> [Residue of Figure 1: a numbered key of example tags (1 NP1LOCNM, 2 JJSYSTEM, 3 VVDINCHOATIVE, 4 IIDESPITE, 5 DD, 6 PN1PERSON, 8 IIATSTANDIN, 9 IIFROMSTANDIN, 10 NNUNUM) from two sample sentences tagged with the indicated tagset.]</Paragraph> </Section>
<Section position="5" start_page="46" end_page="46" type="metho"> <SectionTitle> </SectionTitle>
<Paragraph position="1"> In what follows, Section 2 provides a basic overview of the tagging approach used (a maximum entropy tagging model employing constraints equivalent to those of the standard hidden Markov model). Section 3 discusses and offers examples of the sorts of extrasententially-based semantic constraints that were added to the basic tagging model. Section 4 describes the experiments we performed. Section 5 details our experimental results. Section 6 glances at projected future research, and concludes.</Paragraph> </Section>
<Section position="6" start_page="46" end_page="46" type="metho"> <SectionTitle> 2 Tagging Model 2.1 ME Model </SectionTitle>
<Paragraph position="0"> Our tagging model is a maximum entropy (ME) model of the following form:</Paragraph>
<Paragraph position="1"> p(t \mid h) = \frac{p_0(t)}{Z(h)} \prod_{k} \alpha_k^{f_k(h,t)}, \qquad Z(h) = \sum_{t'=1}^{L} p_0(t') \prod_{k} \alpha_k^{f_k(h,t')} </Paragraph>
<Paragraph position="2"> where, for a history h and candidate tag t: - L is the number of tags in our tag set; - \alpha_k is the weight of trigger f_k; - f_k are trigger functions and f_k \in \{0, 1\}; - p_0 is the default tagging model (in our case, the uniform distribution, since all of the information in the model is specified using ME constraints).</Paragraph>
<Paragraph position="3"> The model we use is similar to that of (Ratnaparkhi, 1996). Our baseline model shares the following features with this tagging model; we will call this set of features the basic n-gram tagger constraints: 1. w = X & t = T 2. t_{-1} = X & t = T 3. t_{-2} t_{-1} = XY & t = T where: - w is the word whose tag we are predicting; - t is the tag we are predicting; - t_{-1} is the tag to the left of tag t; - t_{-2} is the tag to the left of tag t_{-1}. Our baseline model differs from Ratnaparkhi's in that it does not use any information about the occurrence of words in the history or their properties (other than in constraint 1). Our model exploits the same kind of tag-n-gram information that forms the core of many successful tagging models, for example, (Kupiec, 1992), (Merialdo, 1994), (Ratnaparkhi, 1996). We refer to this type of tagger as a tag-n-gram tagger.</Paragraph>
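<Paragraph> As an editorial illustration (not part of the original paper), the following minimal Python sketch shows how the ME distribution above can be evaluated: binary trigger functions fire for a (history, tag) pair and contribute their multiplicative weights, while the uniform default model supplies p_0. The toy tagset, trigger names, and weight values are invented for illustration.

TAGSET = ["NN1PERSON", "VVDINCHOATIVE", "IIDESPITE"]  # toy tagset, so L = 3

def triggers(history, tag):
    """Return the names of the binary trigger functions that fire for (history, tag).
    Both triggers below are hypothetical instances of the basic n-gram constraints."""
    active = []
    if history["word"] == "lawyer" and tag == "NN1PERSON":
        active.append("w=lawyer&t=NN1PERSON")      # constraint type 1 (word/tag)
    if history["t_minus_1"] == "AT1" and tag == "NN1PERSON":
        active.append("t-1=AT1&t=NN1PERSON")       # constraint type 2 (tag bigram)
    return active

def p_tag_given_history(history, weights):
    """p(t|h) = p0(t) * prod_k alpha_k^{f_k(h,t)} / Z(h), with p0 uniform over the tagset."""
    p0 = 1.0 / len(TAGSET)
    unnormalised = {}
    for tag in TAGSET:
        score = p0
        for name in triggers(history, tag):
            score *= weights.get(name, 1.0)        # alpha_k = 1.0 means "no effect"
        unnormalised[tag] = score
    z = sum(unnormalised.values())
    return {tag: score / z for tag, score in unnormalised.items()}

weights = {"w=lawyer&t=NN1PERSON": 9.0, "t-1=AT1&t=NN1PERSON": 2.5}
print(p_tag_given_history({"word": "lawyer", "t_minus_1": "AT1"}, weights))
</Paragraph>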
<Section position="1" start_page="46" end_page="46" type="sub_section"> <SectionTitle> 2.2 Trigger selection </SectionTitle>
<Paragraph position="0"> We use mutual information (MI) to select the most useful trigger pairs (for more details, see (Rosenfeld, 1996)). That is, we use the following formula to gauge a feature's usefulness to the model: MI(s;t) = P(s,t) \log \frac{P(t \mid s)}{P(t)} + P(\bar{s},t) \log \frac{P(t \mid \bar{s})}{P(t)} + P(s,\bar{t}) \log \frac{P(\bar{t} \mid s)}{P(\bar{t})} + P(\bar{s},\bar{t}) \log \frac{P(\bar{t} \mid \bar{s})}{P(\bar{t})} where: - t is the tag we are predicting; - s can be any kind of triggering feature.</Paragraph>
<Paragraph position="1"> For each of our trigger predictors, s is defined below: Bigram and trigram triggers: s is the presence of a particular tag as the first tag in the bigram pair, or the presence of two particular tags (in a particular order) as the first two tags of a trigram triple. In this case, t is the presence of a particular tag in the final position in the n-gram. Extrasentential tag triggers: s is the presence of a particular tag in the extrasentential history.</Paragraph>
<Paragraph position="2"> Question triggers: s is the boolean answer to a question.</Paragraph>
<Paragraph position="3"> This method has the advantage of finding good candidates quickly, and the disadvantage of ignoring any duplication of information in the features it selects. A more principled approach is to select features by actually adding them one-by-one into the ME model (Della Pietra et al., 1997); however, using this approach is very time-consuming and we decided on the MI approach for the sake of speed.</Paragraph>
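<Paragraph> The following minimal Python sketch (an editorial illustration, not the authors' code) computes the MI score above for one candidate trigger pair from a 2x2 table of training counts; the counts and the example events are invented.

import math

def trigger_mi(n_both, n_s_only, n_t_only, n_neither):
    """Average MI between binary events s and t from a 2x2 table of training counts:
    n_both = s and t co-occur; n_s_only = s without t;
    n_t_only = t without s;    n_neither = neither occurs."""
    total = n_both + n_s_only + n_t_only + n_neither
    mi = 0.0
    for s_present, t_present, joint in [(True, True, n_both), (True, False, n_s_only),
                                        (False, True, n_t_only), (False, False, n_neither)]:
        if joint == 0:
            continue
        p_joint = joint / total
        p_s = (n_both + n_s_only) / total if s_present else (n_t_only + n_neither) / total
        p_t = (n_both + n_t_only) / total if t_present else (n_s_only + n_neither) / total
        mi += p_joint * math.log(p_joint / (p_s * p_t))  # equals the four-term sum above
    return mi

# Invented counts: e.g. s = "a PERSON-family tag occurred in the remote history",
# t = "the tag being predicted is NN1PERSON". Candidate triggers are ranked by this score.
print(trigger_mi(n_both=120, n_s_only=300, n_t_only=80, n_neither=9500))
</Paragraph>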
</Section> </Section>
<Section position="7" start_page="46" end_page="48" type="metho"> <SectionTitle> 3 The Constraints </SectionTitle>
<Paragraph position="0"> To understand what extrasentential semantic constraints were added to the base tagging model in the current experiments, one needs some familiarity with the ATR General English Tagset. For detailed presentations, see (Black et al., 1998; Black et al., 1996). An aperçu can be gained, however, from Figure 1, which shows two sample sentences from the ATR Treebank (and originally from a Chinese take-out food flier), tagged with respect to the ATR General English Tagset.</Paragraph>
<Paragraph position="1"> Each verb, noun, adjective and adverb in the ATR tagset includes a semantic label, chosen from 42 noun/adjective/adverb categories and 29 verb/verbal categories, some overlap existing between these category sets. Proper nouns, plus certain adjectives and certain numerical expressions, are further categorized via an additional 35 &quot;proper-noun&quot; categories. These semantic categories are intended for any &quot;Standard-American-English&quot; text, in any domain. Sample categories include: &quot;physical.attribute&quot; (nouns/adjectives/adverbs), &quot;alter&quot; (verbs/verbals), &quot;interpersonal.act&quot; (nouns/adjectives/adverbs/verbs/verbals), &quot;orgname&quot; (proper nouns), and &quot;zipcode&quot; (numericals). They were developed by the ATR grammarian and then proven and refined via day-in-day-out tagging for six months at ATR by two human &quot;treebankers&quot;, then via four months of tagset-testing-only work at Lancaster University (UK) by five treebankers, with daily interactions among treebankers, and between the treebankers and the ATR grammarian. The semantic categorization is, of course, in addition to an extensive syntactic classification, involving some 165 basic syntactic tags.</Paragraph>
<Paragraph position="2"> Starting with a basic tag-n-gram tagger trained to tag raw text with respect to the ATR General English Tagset, then, we added constraints defined in terms of &quot;tag families&quot;. A tag family is the set of all tags sharing a given semantic category. For instance, the tag family &quot;MONEY&quot; contains common nouns, proper nouns, adjectives, and adverbs, the semantic component of whose tags within the ATR General English Tagset is &quot;money&quot;: 500-stock, Deposit, TOLL-FREE, inexpensively, etc.</Paragraph>
<Paragraph position="3"> One class of constraints consisted of the presence, within the 6 sentences (from the same document) preceding the current sentence, of one or more instances of a given tag family. (A 6-sentence window was determined to be optimal for this task in (Black et al., 1998).) This type of constraint came in two varieties: either including, or excluding, the words within the sentence of the word being tagged. Where these intrasentential words were included, they consisted of the set of words preceding the word being tagged, within its sentence.</Paragraph>
<Paragraph position="4"> A second class of constraints added to the requirements of the first class the representation, within the past 6 sentences, of related tag families. Boolean combinations of such events defined this group of constraints. An example is as follows: (a) an instance either of the tag family &quot;person&quot; or of the tag family &quot;personal attribute&quot; (or both) occurs within the 6 sentences preceding the current one; or else (b) an instance of the tag family &quot;person&quot; occurs in the current sentence, to the left of the word being tagged; or, finally, both (a) and (b) occur.</Paragraph>
<Paragraph position="5"> A third class of constraints had to do with the specific word being tagged. In particular, the word being classified is required to belong to a set of words which have been tagged at least once, in the training treebank, with some tag from a particular tag family; and which, further, always shared the same basic syntax in the training data. For instance, consider the words &quot;currency&quot; and &quot;options&quot;. Not only have they both been tagged at least once in the training set with some member of the tag family &quot;MONEY&quot; (as well, it happens, as with tags from other tag families); but in addition they both occur in the training set only as nouns.</Paragraph>
<Paragraph position="6"> Therefore these two words would occur on a list named &quot;MONEY nouns&quot;, and when an instance of either of these words is being tagged, the constraint &quot;MONEY nouns&quot; is satisfied.</Paragraph>
<Paragraph position="7"> A fourth and final class of constraints combines the first or the second class, above, with the third class. E.g. it is both the case that some avatar of the tag family &quot;MONEY&quot; has occurred within the last 6 sentences to the left; and that the word being tagged satisfies the constraint &quot;MONEY nouns&quot;. The advantage of this sort of composite constraint is that it is focused, and likely to be helpful when it does occur. The disadvantage is that it is unlikely to occur very often. On the other hand, constraints of the first, second, and third classes, above, are more likely to occur, but are less focused and therefore less obviously helpful.</Paragraph>
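<Paragraph> The following minimal Python sketch (an editorial illustration, not the authors' code) shows the shape of a fourth-class composite trigger: it fires only when a tag from the &quot;MONEY&quot; family occurs in the 6 preceding sentences and the word being tagged is on the &quot;MONEY nouns&quot; list. The family test, the list contents, and the example tags (e.g. NN1MONEY, AT1) are illustrative stand-ins, not the actual ATR resources.

MONEY_NOUNS = {"currency", "options", "deposit"}   # hypothetical list contents

def in_money_family(tag):
    # Illustrative test only: assumes the semantic label is the suffix of the tag string.
    return tag.endswith("MONEY")

def money_composite_trigger(word, remote_history_tags):
    """Fourth-class constraint sketch: remote_history_tags are the tags of the
    6 sentences preceding the current one (correctly labelled, per Section 4.2)."""
    family_seen = any(in_money_family(tag) for tag in remote_history_tags)
    return family_seen and word.lower() in MONEY_NOUNS

print(money_composite_trigger("options", ["NN1MONEY", "AT1", "NN1PERSON"]))  # True
print(money_composite_trigger("lawyer", ["NN1MONEY", "AT1"]))                # False
</Paragraph>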
</Section>
<Section position="8" start_page="48" end_page="82425" type="metho"> <SectionTitle> 4 The Experiments </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="48" end_page="49" type="sub_section"> <SectionTitle> 4.1 The Four Models </SectionTitle>
<Paragraph position="0"> To evaluate the utility of long-range semantic context we performed four separate experiments. All of the models in the experiments include the basic ME tag-n-gram tagger constraints listed in section 2. The models used in our experiments are as follows: (1) The first model is a model consisting ONLY of these basic ME tag-n-gram tagger constraints. This model represents the baseline model.</Paragraph>
<Paragraph position="1"> (2) The second model consists of the baseline model together with constraints representing extrasentential tag triggers. This experiment measures the effect of employing the triggers specified in (Black et al., 1998) --i.e. the presence (or absence) in the previous 6 sentences of each tag in the tagset, in turn-- to assist a real tagger, as opposed to simply measuring their mutual information. In other words, we are measuring the contribution of this long-range information over and above a model which uses local tag-n-grams as context, rather than measuring the gain over a naive model which does not take context into account, as was the case with the mutual information experiments in (Black et al., 1998).</Paragraph>
<Paragraph position="2"> (3) The third model consists of the baseline model together with the four classes of more sophisticated question-based triggers defined in the previous section.</Paragraph>
<Paragraph position="3"> (4) The fourth model consists of the baseline model together with both the long-range tag trigger constraints and the question-based trigger constraints.</Paragraph>
<Paragraph position="4"> We chose the model underlying a standard tag-n-gram tagger as the baseline because it represents a respectable tagging model which most readers will be familiar with. The ME framework was used to build the models since it provides a principled manner in which to integrate the diverse sources of information needed for these experiments.</Paragraph> </Section>
<Section position="2" start_page="49" end_page="82425" type="sub_section"> <SectionTitle> 4.2 Experimental Procedure </SectionTitle>
<Paragraph position="0"> The performance of each of the tagging models is measured on a 53,000-word test treebank hand-labelled to an accuracy of over 97% (Black et al., 1996; Black et al., 1998). We measure the model performance in terms of the perplexity of the tag being predicted. This measurement gives an indication of how useful the features we supply could be to an n-gram tagger when it consults its model to obtain a probability distribution over the tagset for a particular word.</Paragraph>
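<Paragraph> The following minimal Python sketch (an editorial illustration, not the paper's evaluation code) shows one standard way to compute the perplexity of the tag being predicted: the exponential of the average negative log probability the model assigns to the correct tag of each test word, given its correctly labelled history. The input probabilities are invented.

import math

def tag_perplexity(correct_tag_probs):
    """correct_tag_probs: the model's probability for the correct tag of each
    test word, given its (correctly labelled) history."""
    avg_neg_log = -sum(math.log(p) for p in correct_tag_probs) / len(correct_tag_probs)
    return math.exp(avg_neg_log)

# A model assigning the correct tag probability 0.25 at every word has perplexity 4.0.
print(tag_perplexity([0.25, 0.25, 0.25, 0.25]))
</Paragraph>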
<Paragraph position="1"> Since our intention is to gauge the usefulness of long-range context, we measure the performance improvement with respect to correctly (very accurately) labelled context. We chose to do this to isolate the effect of the correct markup of the history on tagging performance (i.e. to measure the performance gain in the absence of noise from the tagging process itself).</Paragraph>
<Paragraph position="2"> Earlier experiments using predicted tags in the history showed that at current levels of tagging accuracy for this tagset, these predicted tags yielded very little benefit to a tagging model.</Paragraph>
<Paragraph position="3"> However, removing the noise from these tags showed clearly that improvement was possible from this information. As a consequence, we chose to investigate in the absence of noise, so that we could see the utility of exploiting the history when labelled with syntactic/semantic tags.</Paragraph>
<Paragraph position="4"> The resulting measure is an idealization of a component of a real tagging process, and is a measure of the usefulness of knowing the tags in the history. In order to make the comparisons between models fair, we use correctly-labelled history in the n-gram components of our models as well as for the long-range triggers. As a consequence of this, no search is necessary.</Paragraph>
<Paragraph position="5"> The number of possible triggers is obviously very large and needs to be limited for reasons of practicability. The number of triggers used for these experiments is shown in Table 2. Using these limits we were able to build each model in around one week on a 600MHz DEC-alpha.</Paragraph>
<Paragraph position="6"> The constraints were selected by mutual information. Thus, as an example, the 82425 question trigger constraints shown in Table 2 represent the 82425 question trigger constraints with the highest mutual information.</Paragraph>
<Paragraph position="7"> The improved iterative scaling technique (Della Pietra et al., 1997) was used to train the parameters in the ME model.</Paragraph>
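<Paragraph> For reference (an editorial addition; the update is not spelled out in the paper), the improved iterative scaling step of (Della Pietra et al., 1997), written for a conditional model with log-weights \lambda_k = \log \alpha_k, updates \lambda_k \leftarrow \lambda_k + \delta_k at each iteration, where \delta_k solves

\sum_{h} \tilde{p}(h) \sum_{t} p_{\lambda}(t \mid h)\, f_k(h,t)\, e^{\delta_k f^{\#}(h,t)} = \sum_{h,t} \tilde{p}(h,t)\, f_k(h,t), \qquad f^{\#}(h,t) = \sum_{k} f_k(h,t),

with \tilde{p} denoting empirical distributions over the training data. </Paragraph>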
</Section> </Section>
<Section position="9" start_page="82425" end_page="82425" type="metho"> <SectionTitle> 5 The Results </SectionTitle>
<Paragraph position="0"> Table 4 shows the perplexity of each of the four models on the test set.</Paragraph>
<Paragraph position="1"> The maximum entropy framework adopted for these experiments virtually guarantees that models which utilize more information will perform as well as or better than models which do not include this extra information. Therefore, it comes as no surprise that all models improve upon the baseline model, since every model effectively includes the baseline model as a component. However, despite promising results when measuring mutual information gain (Black et al., 1998), the baseline model combined only with extrasentential tag triggers reduced perplexity by just a modest 7.6%. The explanation for this is that the information these triggers provide is already present to some degree in the n-grams of the tagger and is therefore redundant.</Paragraph>
<Paragraph position="2"> In spite of this, when long-range information is captured using more sophisticated, linguistically meaningful questions generated by an expert grammarian (as in experiment 3), the perplexity reduction is a more substantial 19.4%. The explanation for this lies in the fact that these question-based triggers are much more specific. The simple tag-based triggers will be active much more frequently and often inappropriately. The more sophisticated question-based triggers are less of a blunt instrument. As an example, constraints from the fourth class (described in the constraints section of this paper) are likely to be active only for words able to take the particular tag the constraint was designed to apply to. In effect, tuning the ME constraints has recovered much ground lost to the n-grams in the model.</Paragraph>
<Paragraph position="3"> The final experiment shows that using all the triggers reduces perplexity by 21.4%. This is a modest improvement over the results obtained in experiment 3. This suggests that even though this long-range trigger information is less useful, it is still providing some additional information to the more sophisticated question-based triggers.</Paragraph>
<Paragraph position="4"> Table 3 shows the five constraints with the highest mutual information for the tag NN1PERSON (singular common noun of person, e.g. lawyer, friend, niece). All five of these constraints happen to fall within the twenty-five constraints of any type with the highest mutual information with their predicted tags. Within Table 3, &quot;full history&quot; refers to the previous 6 sentences as well as the previous words in the current sentence, while &quot;remote history&quot; indicates only the previous 6 sentences. A &quot;person word&quot; is any word in the tag family &quot;person&quot;, hence adjectives, adverbs, and both common and proper nouns of person. Similarly, a &quot;personal attribute word&quot; is any word in the tag family &quot;personal attribute&quot;, e.g. left-wing, liberty, courageously.</Paragraph> </Section> </Paper>