<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2140"> <Title>Feature Lattices for Maximum Entropy Modelling</Title> <Section position="2" start_page="0" end_page="848" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Maximum entropy modelling has recently been introduced to the NLP community and has proved to be an expressive and powerful framework.</Paragraph> <Paragraph position="1"> A maximum entropy model fits a set of pre-defined constraints and assumes maximum ignorance about everything that is not subject to these constraints, assigning such cases the most uniform distribution; the most uniform distribution is the one with maximum entropy. Because of its ability to handle overlapping features, the maximum entropy framework provides a principled way to incorporate information from multiple knowledge sources. It is superior to linear interpolation and the Katz back-off method, which are traditionally used for this purpose.</Paragraph> <Paragraph position="2"> (Rosenfeld, 1996) evaluates in detail a maximum entropy language model which combines unigrams, bigrams, trigrams and long-distance trigger words, and provides a thorough analysis of the merits of the approach.</Paragraph> <Paragraph position="4"> The iterative scaling algorithm (Darroch&Ratcliff, 1972), applied for the parameter estimation of maximum entropy models, computes a set of feature weights ($\lambda$s) which ensure that the model fits the reference distribution and does not make spurious assumptions (as required by the maximum entropy principle) about events beyond the reference distribution.</Paragraph> <Paragraph position="5"> It does not, however, guarantee that the features employed by the model are good features and that the model is useful. Thus the most important part of model building is the feature selection procedure. The key idea of feature selection is that if we notice an interaction between certain features, we should build a more complex feature which accounts for this interaction. The newly added feature should improve the model: its Kullback-Leibler divergence from the reference distribution should decrease, and the conditional maximum entropy model will then also have the greatest log-likelihood (L) value. The basic feature induction algorithm presented in (Della Pietra et al., 1995) starts with an empty feature space and iteratively tries all possible feature candidates. These candidates are either atomic features or complex features produced as a combination of an atomic feature with the features already selected into the model's feature space. For every feature from the candidate feature set the algorithm prescribes to compute the maximum entropy model using the iterative scaling algorithm described above, and to select the feature which most reduces the Kullback-Leibler divergence or, equivalently, most increases the log-likelihood of the model. This approach, however, is not computationally feasible, since iterative scaling is computationally expensive and computing models for many candidate features many times is unrealistic. To make feature ranking computationally tractable, a simplified process was proposed in (Della Pietra et al., 1995) and (Berger et al., 1996): at the feature ranking stage, when adding a new feature to the model, all previously computed parameters are kept fixed and thus we have to fit only one new constraint, imposed by the candidate feature. Then, after the best-ranked feature has been established, it is added to the feature space and the weights for all the features are recomputed. This approach finds good features relatively fast, but it does not guarantee that at every step we add the best feature, because when a new feature is added to the model all of its parameters can change. A sketch of this single-constraint ranking step is given below.</Paragraph>
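As an illustration, the following is a minimal sketch - not the authors' code - of this single-constraint ranking step: the weights of the already selected features are frozen and only the weight of the candidate feature is fitted, here by a one-dimensional golden-section search over a toy event space; the candidate with the largest log-likelihood gain wins. The event representation, the search bounds and the toy counts are assumptions of the sketch.

import math

def active(feature, event):
    # binary indicator: the feature (a set of atomic features) fires on the event
    return feature <= event

def log_likelihood(empirical, features, lambdas):
    # sum_x p~(x) * log p(x) for a log-linear model over the observed event space
    total = sum(empirical.values())
    weights = {x: math.exp(sum(l for f, l in zip(features, lambdas) if active(f, x)))
               for x in empirical}
    z = sum(weights.values())
    return sum((c / total) * math.log(weights[x] / z) for x, c in empirical.items())

def gain(empirical, features, lambdas, candidate, lo=-5.0, hi=5.0, steps=60):
    # best log-likelihood improvement obtainable by adding `candidate`
    # while keeping all previously computed lambdas fixed
    base = log_likelihood(empirical, features, lambdas)
    ll = lambda alpha: log_likelihood(empirical, features + [candidate], lambdas + [alpha])
    phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(steps):                       # golden-section search for the best weight
        c, d = b - phi * (b - a), a + phi * (b - a)
        if ll(c) >= ll(d):
            b = d
        else:
            a = c
    return ll((a + b) / 2) - base

# toy usage: events are sets of atomic features A, B, C with observed counts
empirical = {frozenset("A"): 3, frozenset("AB"): 5, frozenset("BC"): 2, frozenset("C"): 4}
selected, lambdas = [frozenset("A")], [0.3]
candidates = [frozenset("B"), frozenset("AB"), frozenset("C")]
best = max(candidates, key=lambda cand: gain(empirical, selected, lambdas, cand))
print("best candidate:", "".join(sorted(best)))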
<Paragraph position="6"> In this paper we present a novel approach to feature selection for maximum entropy models. Our approach uses a feature collocation lattice and selects candidate features without resorting to iterative scaling.</Paragraph> </Section> <Section position="3" start_page="848" end_page="849" type="metho"> <SectionTitle> 2 Feature Collocation Lattice </SectionTitle> <Paragraph position="0"> We start the modelling process by building a sample space $\omega$ to train our model on. The sample space consists of observed events of interest mapped to a set of atomic features $T$ which we should define beforehand. Thus every observation from the sample space is a binary vector of atomic features: if an observation includes a certain feature, its corresponding bit in the vector is turned on (set to 1), otherwise it is 0.</Paragraph> <Paragraph position="1"> When we have a set of atomic features $T$ and a training sample of configurations $\omega$, we can build the feature collocation lattice. This collocation lattice represents, in fact, the factorial constraint space ($\chi$) for the maximum entropy model and at the same time contains all seen and logically implied configurations ($\omega^{+}$). Formally, the feature collocation lattice is a triple $(\theta, \subseteq, \xi^{\omega})$, where $\theta$ is the set of nodes of the lattice, which corresponds to the union of the feature space of the maximum entropy model and the configuration space: $\theta = \chi \cup \theta(\omega)$. In fact, the nodes of the lattice ($\theta$) have a dual interpretation - on one hand they act as mapped configurations from the extended configuration space ($\omega^{+}$), and on the other hand they act as features from the constraint space ($\chi$); $\subseteq$ is a partial ordering (inclusion) on the nodes, and we need the indicator function $f_{\theta_k}(\theta_i)$ to flag whether the relation $\subseteq$ holds between node $\theta_k$ and node $\theta_i$: $f_{\theta_k}(\theta_i) = 1$ if $\theta_k \subseteq \theta_i$ and $0$ otherwise; $\xi^{\omega}$ is a set of configuration frequency counts of the nodes ($\theta$) of the lattice, which represent how many times we saw a particular configuration in our training samples.</Paragraph> <Paragraph position="6"> Because of the dual interpretation of the nodes, a node can also be associated with its feature frequency count, i.e. the number of times we see this feature combination anywhere in the lattice. The feature frequency of a node is then $\xi^{\chi}(\theta_k) = \sum_{\theta_i \in \theta} f_{\theta_k}(\theta_i) \cdot \xi^{\omega}_{\theta_i}$, which is the sum of all the configuration frequency counts ($\xi^{\omega}$) of the descendant nodes.</Paragraph> <Paragraph position="7"> Suppose we have a lattice of nodes A, B, [AB] with the obvious relations $A \subseteq [AB]$ and $B \subseteq [AB]$, while A and B are not related to each other. The configuration frequency $\xi^{\omega}_{A}$ will be the number of times we saw A but not [AB], and the feature frequency of A will then be $\xi^{\chi}_{A} = \xi^{\omega}_{A} + \xi^{\omega}_{AB}$, i.e. the number of times we saw A in all the nodes.</Paragraph>
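As a concrete illustration, the following minimal sketch (ours, not the paper's) encodes the lattice triple: nodes are sets of atomic features, the ordering is set inclusion, each node carries a configuration frequency, and the feature frequency is the sum defined above. The class and variable names are assumptions of the sketch.

from collections import defaultdict

class CollocationLattice:
    def __init__(self):
        # configuration frequency xi^omega: how often each node was observed as-is
        self.config_freq = defaultdict(int)            # frozenset of atoms -> count

    def indicator(self, node_k, node_i):
        # f_{theta_k}(theta_i) = 1 iff theta_k is a sub-configuration of theta_i
        return 1 if node_k <= node_i else 0

    def feature_freq(self, node_k):
        # xi^chi(theta_k) = sum_i f_{theta_k}(theta_i) * xi^omega(theta_i)
        return sum(self.indicator(node_k, node_i) * count
                   for node_i, count in self.config_freq.items())

# the A / B / [AB] example from the text
lattice = CollocationLattice()
lattice.config_freq[frozenset("A")] = 3      # saw A (but not [AB]) 3 times
lattice.config_freq[frozenset("B")] = 2      # saw B (but not [AB]) 2 times
lattice.config_freq[frozenset("AB")] = 4     # saw the full collocation [AB] 4 times

print(lattice.feature_freq(frozenset("A")))  # xi^chi_A = 3 + 4 = 7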
<Paragraph position="8"> When we construct the feature collocation lattice from a set of samples, each sample represents a feature configuration which we must add to the lattice as a node ($\theta_k$). To support generalizations over the domain we also add to the lattice the nodes which are shared parts with other nodes in the lattice: all sub-configurations of a newly added configuration which are its intersections with the other nodes. We increment the configuration frequency ($\xi^{\omega}$) of a node each time we see this particular configuration in full in the training samples. For example, if a configuration [ABCD] comes from a training sample and is not yet in the lattice, we create a node [ABCD] and set its configuration frequency $\xi^{\omega}_{[ABCD]}$ to 1. If by that time there is a node [ABDE] in the lattice, we also create the node [ABD], relate it to the nodes [ABCD] and [ABDE] and set its configuration frequency to 0. If [ABCD] had already existed in the lattice, we would simply have incremented its configuration frequency: $\xi^{\omega}_{[ABCD]} \leftarrow \xi^{\omega}_{[ABCD]} + 1$. Thus in the feature lattice we have nodes with non-zero configuration frequencies, which we call reference nodes, and nodes with zero configuration frequencies, which we call latent or hidden nodes. Reference nodes actually represent the observed configuration space ($\omega$). Hidden nodes are never observed on their own but only as parts of the reference nodes, and represent possible generalizations about the domain: low-complexity constraints ($\chi$) and logically possible configurations ($\omega^{+}$).</Paragraph> <Paragraph position="9"> This method of building the feature collocation lattice ensures that along with true observations it contains hidden nodes which can provide generalizations about the domain. At the same time there is no over-generation of hidden nodes: no logically impossible feature combinations and no hidden nodes without generalization power are included. A sketch of this insertion step is given below.</Paragraph>
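The following fragment is a minimal, single-pass sketch of this insertion step, not the authors' implementation. It uses a plain dictionary from nodes to configuration frequencies; intersections among the newly created hidden nodes themselves are not recomputed here.

def add_configuration(config_freq, sample):
    # config_freq: dict mapping frozensets of atomic features to configuration counts
    node = frozenset(sample)
    # hidden nodes: non-empty shared parts of the new node and the existing nodes
    for other in list(config_freq):
        shared = node & other
        if shared and shared not in config_freq and shared != node and shared != other:
            config_freq[shared] = 0        # latent/hidden node, never observed on its own
    # reference node: increment the configuration frequency of the full observation
    config_freq[node] = config_freq.get(node, 0) + 1

# the [ABCD] / [ABDE] example from the text
config_freq = {frozenset("ABDE"): 1}
add_configuration(config_freq, "ABCD")
for node, count in sorted(config_freq.items(), key=lambda item: "".join(sorted(item[0]))):
    print("".join(sorted(node)), count)    # ABCD 1, ABD 0, ABDE 1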
</Section> <Section position="4" start_page="849" end_page="851" type="metho"> <SectionTitle> 3 Feature Selection </SectionTitle> <Paragraph position="0"> After we have constructed the feature collocation lattice $(\theta, \subseteq, \xi^{\omega})$ from a set of samples - we will call it the empirical lattice - we try to estimate which features do and which do not contribute to the frequency distribution on the reference nodes, so that only the predictive features are retained in the lattice. The optimized feature space can be seen as a feature lattice defined over the empirical feature lattice: $\theta' \subseteq \theta$, and initially it is empty: $\theta' = \emptyset$. We build the optimized lattice by incrementally adding a feature (atomic or complex) from the empirical lattice, together with the nodes which are the minimal collocations of this feature with the nodes already included into the optimized lattice. The necessity to add the collocations comes from the fact that the features (or nodes) can overlap with each other and we want to have a unique node for such an overlap. So if the optimized feature lattice contains just one feature A, then when we add the feature B we also have to add the collocation [AB] if it exists in the empirical lattice. The configuration frequency of a node in the optimized lattice ($\zeta^{\omega}$) can then be computed as $\zeta^{\omega}_{\theta'_k} = \xi^{\omega}_{\theta'_k} + \sum \xi^{\omega}_{\theta_i}$, where the sum runs over the nodes $\theta_i \supset \theta'_k$ which do not belong to the optimized lattice themselves and for which there is no higher node of the optimized lattice related to them.</Paragraph> <Paragraph position="1"> Thus a node in the optimized lattice takes all the configuration frequencies ($\xi^{\omega}$) of itself and of the related nodes above it, provided these nodes do not belong to the optimized lattice themselves and there is no higher node in the optimized lattice related to them.</Paragraph> <Paragraph position="2"> Figure 1 shows how the configuration frequencies in the optimized lattice are redistributed when adding a new feature. First the lattice is empty. When we add the feature A to the optimized lattice (Figure 1.a), because no other features are present in the optimized lattice, it takes all the configuration frequencies of the nodes where we see the feature A: $\zeta^{\omega}_{A} = \xi^{\omega}_{A} + \xi^{\omega}_{AB} + \xi^{\omega}_{AC} + \xi^{\omega}_{ABC}$.</Paragraph> <Paragraph position="3"> Figure 1.b represents the situation when we add the feature B to the optimized lattice which already includes the feature A. Apart from the node B we also add the collocation of the nodes A and B to the optimized lattice. Now we have to redistribute the configuration frequencies in the optimized lattice. The configuration frequency of the node A now becomes the number of times we see the feature A but not the feature combination AB: $\zeta^{\omega}_{A} = \xi^{\omega}_{A} + \xi^{\omega}_{AC}$. The configuration frequency of the node B will be the number of times we see the node B but not the node AB: $\zeta^{\omega}_{B} = \xi^{\omega}_{B} + \xi^{\omega}_{BC}$. The configuration frequency of the node AB will be $\zeta^{\omega}_{AB} = \xi^{\omega}_{AB} + \xi^{\omega}_{ABC}$. When we add the feature C to the optimized lattice (Figure 1.c) we produce a fully saturated lattice identical to the empirical lattice, since the node C will collocate with the node A producing AC and with the node B producing BC. These nodes will in their turn collocate with each other and with the node AB, producing the node ABC.</Paragraph> <Paragraph position="4"> Figure 1: Redistribution of the configuration frequencies in the optimized lattice when adding new nodes. Case a) stands for adding the feature A to the empty lattice, case b) for adding the feature B to the lattice with the feature A, and case c) for adding the feature C to the lattice with the atomic features A and B and their collocations. The unfilled nodes stand for the nodes in the empirical lattice which have no reference in the optimized lattice. The nodes in bold stand for the nodes decided by the optimized lattice (i.e. they can be assigned non-default probabilities).</Paragraph> <Paragraph position="5"> During the optimized lattice construction all the features (atomic and complex) from the empirical lattice compete, and we include the one which results in an optimized lattice with the smallest divergence $D(\tilde{p} \,\|\, p')$ (equation ??) and therefore with the greatest log-likelihood $L = \sum_{\theta_i \in \theta} \tilde{p}(\theta_i) \log p'(\theta_i)$, where $\tilde{p}(\theta_i)$ is the reference probability of the $i$-th node on the empirical lattice, $\tilde{p}(\theta_i) = \frac{\xi^{\omega}_{\theta_i}}{N}$ with $N = \sum_{\theta_j \in \theta} \xi^{\omega}_{\theta_j}$ (2), and $p'(\theta_i)$ is the probability assigned to the $i$-th node using only the nodes included into the optimized lattice.</Paragraph> <Paragraph position="6"> The optimized lattice assigns to a node in the empirical lattice the probability of its most specific sub-node from the optimized lattice. For reference nodes which do not have any sub-nodes in the optimized lattice (undecided nodes), according to the maximum entropy principle we assign the uniform probability of making an arbitrary prediction.</Paragraph> <Paragraph position="7"> For instance, for the example in Figure 1.b the optimized lattice includes only three nodes (A, B and AB), but there is just one undecided node (C), which is not shown in bold. So every decided node receives the probability of its most specific sub-node among A, B and AB, while the undecided node C receives the uniform probability; N is the total count on the empirical lattice, calculated as shown in equation 2. In this way we select features from the initial set of candidate features without resorting to iterative scaling; a sketch of this ranking step is given below.</Paragraph>
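The fragment below is a minimal sketch of this ranking step, not the authors' implementation. It redistributes the configuration frequencies onto a tentative optimized lattice and derives a probability for every empirical node from its most specific optimized sub-node; how the redistributed mass is shared among the nodes a sub-node decides, and the uniform treatment of the undecided nodes, are simplifying assumptions of this sketch, and the closure of the optimized lattice under collocations is omitted for brevity.

import math

def owner(node, optimized):
    # most specific sub-node of `node` inside the optimized lattice, or None (ties broken arbitrarily)
    subs = [o for o in optimized if o <= node]
    return max(subs, key=len) if subs else None

def optimized_distribution(empirical, optimized):
    # p'(theta_i) for every node of the empirical lattice
    n = float(sum(empirical.values()))
    owners = {node: owner(node, optimized) for node in empirical}
    zeta = {o: 0 for o in optimized}            # redistributed configuration frequencies
    decided_by = {o: 0 for o in optimized}      # how many empirical nodes an optimized node decides
    for node, cnt in empirical.items():
        o = owners[node]
        if o is not None:
            zeta[o] += cnt
            decided_by[o] += 1
    undecided = [node for node in empirical if owners[node] is None]
    undecided_mass = sum(empirical[node] for node in undecided) / n
    p = {}
    for node in empirical:
        o = owners[node]
        p[node] = (zeta[o] / n / decided_by[o]) if o is not None else undecided_mass / len(undecided)
    return p

def divergence(empirical, optimized):
    # D(p~ || p') over the reference nodes of the empirical lattice
    n = float(sum(empirical.values()))
    p_opt = optimized_distribution(empirical, optimized)
    return sum((cnt / n) * math.log((cnt / n) / p_opt[node])
               for node, cnt in empirical.items() if cnt > 0)

def best_candidate(empirical, optimized, candidates):
    # candidate whose addition yields the smallest divergence from the reference distribution
    return min(candidates, key=lambda c: divergence(empirical, optimized | {c}))

# toy empirical lattice over atoms A, B, C (configuration frequencies)
empirical = {frozenset("A"): 4, frozenset("B"): 3, frozenset("C"): 1,
             frozenset("AB"): 6, frozenset("AC"): 2, frozenset("BC"): 1,
             frozenset("ABC"): 5}
optimized = {frozenset("A")}
candidates = set(empirical) - optimized
print("best candidate:", "".join(sorted(best_candidate(empirical, optimized, candidates))))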
<Paragraph position="8"> When we add the features to the optimized lattice in this way, some candidate features might not contribute sufficiently to the probability distribution on the lattice. For instance, in the example presented in Figure 1, after we added the feature [B] (case b) the only remaining undecided node was [C]. If the node [C] is truly hidden (i.e. it does not have its own observation frequency) and all other nodes are optimally decided, there is no point in adding the node [C] into the lattice, and instead of having 9 nodes we will have only 3. Another consideration which we apply during the lattice building is to penalize the development of low-frequency (but not zero-frequency) nodes by smoothing their frequency counts.</Paragraph> <Paragraph position="9"> For high-frequency nodes this smoothing is very minor, but for nodes with frequencies below the two thresholds used in the smoothing the penalty is considerable. This favours nodes which do not create sparse collocations with other nodes.</Paragraph> <Paragraph position="10"> The described method is similar in spirit to the method of word-trigger incorporation into a trigram model suggested in (Rosenfeld, 1996): if a trigram predicts well enough, there is no need for an additional trigger. The main difference is that we do not recompute the maximum entropy model every time but use our own frequency redistribution method over the collocation lattice. This is the crucial difference which makes a tremendous saving in time. We also do not require a newly added feature to be either atomic or a collocation of an atomic feature with a feature already included into the model, as was proposed in (Della Pietra et al., 1995) and (Berger et al., 1996). All the features are created equal and the model should decide on the level of granularity by itself.</Paragraph> </Section> <Section position="5" start_page="851" end_page="852" type="metho"> <SectionTitle> 4 Model Generalization </SectionTitle> <Paragraph position="0"> After we have chosen a subset of features for our model, we restrict our feature lattice to the optimized lattice. Now we can compute the maximum entropy model, taking the reference probabilities (which are the configuration probabilities on the optimized lattice) as defined above.</Paragraph> <Paragraph position="1"> The nodes from the optimized lattice serve both as possible domain configurations and as potential constraint features for our model. We, however, want to constrain only the nodes with reliable statistics on them, in order not to overfit the model. This in its turn takes off a certain computational load, since we expect a considerable number of fragmented (simply infrequent) nodes in the optimized lattice; this comes from the requirement to build all the collocations when we add a new node. Although many top-level nodes will not be constrained, the information from such infrequent nodes is not lost completely - it contributes to more general nodes, since for every constrained node we marginalize over all its unconstrained descendants (more specific nodes). Thus as possible constraints for the model we consider only those nodes from the optimized lattice whose feature frequency counts, marginalized over the responses, are greater than a certain threshold, e.g. $\sum_{y} \xi^{\chi}_{(\theta_i, y)} > 5$; a sketch of this criterion is given below.</Paragraph>
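A minimal sketch of this selection criterion follows, under the assumption that the response is represented as just another atomic feature of a node; the helper names and the toy counts are illustrative, not taken from the paper.

def feature_freq(node, config_freq):
    # xi^chi: total count of all configurations that contain `node`
    return sum(c for other, c in config_freq.items() if node <= other)

def constrained_nodes(context_nodes, responses, config_freq, threshold=5):
    # keep a (context, response) node only if the context is frequent enough across
    # all responses -- so reliable negative evidence is kept as well
    keep = []
    for ctx in context_nodes:
        marginal = sum(feature_freq(ctx | {y}, config_freq) for y in responses)
        if marginal > threshold:
            keep.extend(ctx | {y} for y in responses)
    return keep

# toy usage: responses Y / N, contexts are sets of atomic features
config_freq = {frozenset({"A", "Y"}): 9, frozenset({"A", "B", "Y"}): 3,
               frozenset({"A", "N"}): 1, frozenset({"B", "N"}): 2}
print(constrained_nodes([frozenset("A"), frozenset("B")], ["Y", "N"], config_freq))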
<Paragraph position="2"> This consideration is slightly different from the one suggested in (Ristad, 1996), where it was proposed to unconstrain nodes with infrequent joint feature frequency counts. Thus if we saw a certain feature configuration, say, 5,000 times and it always gave a single response, we suggest to constrain as well the observation that we never saw this configuration with the other responses. If we applied the suggestion of (Ristad, 1996) and cut nodes out on the basis of the joint frequency, we would lose this negative evidence, which is quite reliable judging by the total frequency of the observation.</Paragraph> <Paragraph position="3"> Initially we constrain all the nodes which satisfy the above requirement. In order to generalize and simplify our maximum entropy model, we unconstrain the most specific features, compute a new simplified maximum entropy model, and if it still predicts well, we repeat the process. So our aim is to remove from the constraints as many top-level nodes as possible without losing the model's fitness to the reference distribution ($\tilde{p}$) of the optimized feature lattice. The necessary condition for a node to be taken as a candidate to unconstrain is that this node should not have any constrained nodes above it. There is also a natural ranking for the candidate nodes: the closer the weight ($\lambda$) of such a node is to 1, the less important it is for the model. We can set a certain threshold on the weights, so that all the candidate nodes whose $\lambda$s differ from 1 by less than this threshold are unconstrained in one go. Therefore we do not have to use iterative scaling for feature ranking and apply it only for model recomputation, possibly unconstraining several feature configurations (nodes) at once. This method, in fact, resembles the Backward Sequential Search (BSS) proposed in (Pedersen&Bruce, 1997) for decomposable models. There is also a significant reduction in computational load, since the generalized smaller model deviates from the previous larger model only in a small number of constraints. So we use the parameters of that larger model (rather than the uniform distribution prescribed in step 1 of the Improved Iterative Scaling algorithm) as the initial values for the iterative scaling algorithm. This proved to decrease the number of required iterations by about tenfold, which makes a tremendous saving in time.</Paragraph> <Paragraph position="4"> There can be many possible criteria for when to stop the generalization algorithm. The simplest one is just to set a predefined threshold on the deviation $D(\tilde{p} \,\|\, p)$ of the generalized model from the reference distribution. (Pedersen&Bruce, 1997) suggest using Akaike's Information Criterion (AIC) to judge the acceptability of a new model: AIC rewards good model fit and penalizes models with high complexity, measured in the number of features. We adopted the stop condition suggested in (Berger et al., 1996) - the maximization of the likelihood on a cross-validation set of samples which is unseen during parameter estimation. A sketch of one generalization step is given below.</Paragraph>
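Below is a minimal sketch of the candidate selection for one generalization step, not the authors' code: among the constrained nodes, those with no constrained node above them and with a weight close enough to 1 are released in one go. The subsequent refit by iterative scaling, warm-started from the previous weights, is not shown, and the tolerance value and data layout are assumptions of the sketch.

def unconstrain_step(constrained, lambdas, tol=0.05):
    # constrained: set of frozenset nodes; lambdas: node -> fitted weight
    top_level = [n for n in constrained
                 if not any(n < other for other in constrained)]   # no constrained node above
    to_release = [n for n in top_level if abs(lambdas[n] - 1.0) < tol]
    return constrained - set(to_release), to_release

# toy usage over atomic features A, B, C, D
constrained = {frozenset("A"), frozenset("AB"), frozenset("ABC"), frozenset("BD")}
lambdas = {frozenset("A"): 2.4, frozenset("AB"): 1.7,
           frozenset("ABC"): 1.02, frozenset("BD"): 0.97}
remaining, released = unconstrain_step(constrained, lambdas)
print([sorted(n) for n in released])   # ABC and BD are top-level with weights close to 1
# the lambdas of `remaining` would now seed the iterative scaling for the smaller model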
</Section> <Section position="6" start_page="852" end_page="853" type="metho"> <SectionTitle> 5 Application: Fullstop Problem </SectionTitle> <Paragraph position="0"> Sentence boundary disambiguation has recently gained some attention in the language engineering community. It is required for most text processing tasks, such as tagging, parsing and parallel corpora alignment, and, as it turns out, it is a non-trivial task in itself. A period can act as the end of a sentence or be part of an abbreviation, but when an abbreviation is the last word in a sentence, the period denotes the end of the sentence as well. The simplest "period-space-capital_letter" approach works well for simple texts but is rather unreliable for texts with many proper names and abbreviations at the end of a sentence, as, for instance, in the Wall Street Journal (WSJ) corpus (Marcus et al., 1993).</Paragraph> <Paragraph position="1"> One well-known trainable system, SATZ, is described in (Palmer&Hearst, 1997). It uses a neural network with two layers of hidden units. It was trained on the most probable parts-of-speech of the three words before and the three words after the period, using 573 samples from the WSJ corpus. It was then tested on 27,294 unseen sentences from the same corpus and achieved a 1.5% error rate. Another automatically trainable system is described in (Reynar&Ratnaparkhi, 1997). This system is similar to ours in the model choice - it uses the maximum entropy framework. It was trained on two different feature sets and scored a 1.2% error rate on the corpus-tuned feature set and a 2% error rate on a more portable feature set.</Paragraph> <Paragraph position="2"> The features themselves were words and their classes in the immediate context of the period mark. (Reynar&Ratnaparkhi, 1997) do not report the number of features utilized by their model and do not describe their approach to feature selection, but judging by the time their system was trained (18 minutes, according to a personal communication) it did not aim to produce the best performing feature set but estimated a given one.</Paragraph> <Paragraph position="3"> To tackle this problem we applied our method to a maximum entropy model which used a lexicon of words associated with one or more categories from the set: abbreviation, proper noun, content word, closed-class word. This model employed atomic features such as the lexicon information for the words before and after the period, their capitalization and spellings (a schematic example of such a sample encoding is sketched below).</Paragraph> <Paragraph position="4"> For training we collected from the WSJ corpus 51,000 samples of the form (Y, F..F) and (N, F..F), where Y stands for the end of a sentence, N stands for otherwise, and the Fs stand for the atomic features of the model. We started to build the model with the 238 most frequent atomic features, which gave us a collocation lattice of 8,245 nodes in 8 minutes of processor time on five SUN Ultra-1 workstations working in parallel by means of multi-threading and Remote Process Communication. When we applied the feature selection algorithm (section 3), we boiled the lattice down to 769 nodes in 53 minutes.</Paragraph> <Paragraph position="5"> Then, constraining all the nodes, we compiled a maximum entropy model in about 15 minutes, and then, using the constraint removal process, in two hours we boiled the constraint space down to 283 constraints. In this set only 31 atomic features remained. This model achieved the best performance on a specified cross-validation set.</Paragraph>
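The fragment below is a purely illustrative sketch of how a (Y, F..F) / (N, F..F) sample might be assembled from the context of a period. The lexicon entries, feature names and the helper function are hypothetical and are not the feature set actually used in the paper; only the four lexicon categories come from the text above.

HYPOTHETICAL_LEXICON = {"mr.": {"abbreviation"}, "inc.": {"abbreviation"},
                        "smith": {"proper_noun"}, "the": {"closed_class"},
                        "share": {"content_word"}}

def atomic_features(word_before, word_after):
    # map the words around the period to a binary configuration of atomic features
    feats = set()
    for tag in HYPOTHETICAL_LEXICON.get(word_before.lower(), {"unknown"}):
        feats.add("prev_" + tag)
    for tag in HYPOTHETICAL_LEXICON.get(word_after.lower(), {"unknown"}):
        feats.add("next_" + tag)
    if word_after[:1].isupper():
        feats.add("next_capitalized")
    if word_before.endswith("."):
        feats.add("prev_ends_with_period")
    return frozenset(feats)

# two toy samples: (response, atomic feature configuration)
print(("N", sorted(atomic_features("Mr.", "Smith"))))
print(("Y", sorted(atomic_features("share", "The"))))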
<Paragraph position="6"> For the evaluation we used the same 27,294 sentences as in (Palmer&Hearst, 1997), which were also used by (Reynar&Ratnaparkhi, 1997) in the evaluation of their system (we would like to thank David Palmer for making his test data available to us). These sentences, of course, were not seen at the training phase of our model. Our model achieved 99.2477% accuracy, which is the highest quoted score on this test set known to the authors.</Paragraph> <Paragraph position="7"> We attribute this to the fact that, although we started with roughly the same atomic features as (Reynar&Ratnaparkhi, 1997), our system created complex features with higher prediction power.</Paragraph> </Section> </Paper>