<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0121"> <Title>Collocation Lattices and Maximum Entropy Models</Title> <Section position="2" start_page="0" end_page="217" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Maximum entropy modelling has recently been introduced to the NLP community and has proved to be an expressive and powerful framework. The maximum entropy model is a model which fits a set of pre-defined constraints and assumes maximum ignorance about everything which is not subject to these constraints, thus assigning such cases the most uniform distribution. The most uniform distribution has the maximum entropy, and the model is chosen according to:</Paragraph>
<Paragraph position="1"> p^* = \arg\max_{p \in C} H(p), \qquad H(p) = -\sum_{\omega \in W} p(\omega) \log p(\omega), where C is the set of distributions which satisfy the imposed constraints.</Paragraph>
<Paragraph position="2"> For instance, if we want to disambiguate the part of speech of a word and we have observed that 50% of the time a noun is preceded by a determiner and 30% of the time it is preceded by an adjective, we can state these observations as constraints to the model. Our model will then have to choose a probability distribution for parts of speech which agrees with our observations and which assigns equal probabilities to all the cases when a word which can be a noun is preceded by neither a determiner nor an adjective.</Paragraph>
<Paragraph position="3"> One of the most popular maximum entropy distributions is known as the Gibbs distribution. It defines a model as a set of Lagrange multipliers (Z, λ_0..λ_n) and has an exponential form:</Paragraph>
<Paragraph position="4"> p(\omega_i) = \frac{1}{Z} \prod_{j=0}^{n} \lambda_j^{f_{\chi_j}(\omega_i)} </Paragraph>
<Paragraph position="5"> where: * ω_i is a combination of instantiated atomic features from T. We will call ω_i a configuration from the configuration space W. The configuration space W includes not only observed configurations but all configurations possible in the domain, many of which might never have been observed; * χ_j is a constraint from the constraint space X imposed on the model. It is essentially a combination of instantiated atomic features too. We can look at the constraints as the features employed by the model 1. In the rest of the paper we will use the terms constraint feature, feature and constraint interchangeably; * f_{χ_j}(ω_i) is the indicator function which indicates whether or not the j-th constraint (χ_j) is active for the configuration ω_i. This function takes two values: 1 if the constraint is active and 0 otherwise; * Z is the normalization constant which ensures that the probabilities sum to one over the configuration space W.</Paragraph>
1 Note here the difference between atomic features and constraint features: constraint features consist of atomic features, but we can have a set of constraints which does not include some or even all atomic features per se but only their combinations.
<Paragraph position="7"> Apart from being a distribution of maximum entropy, this distribution also possesses a very important property of model decomposition. For instance, if our atomic feature set T includes word length (1, 2, 3, 4, 5..), capitalization (Cap) and whether the word ends with a full stop (fstop), and we want to obtain the probability of the spelling of the word &quot;Mr.&quot;, we will have the following equation:</Paragraph>
<Paragraph position="8"> p(\text{Mr.}) = \frac{1}{Z} \lambda_3 \lambda_{Cap} \lambda_{fstop} </Paragraph>
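To make this computation concrete, here is a minimal Python sketch of the product-of-multipliers form above; the multiplier values, the toy vocabulary standing in for the configuration space W, and all function names are illustrative assumptions rather than anything taken from the paper.

    # Minimal sketch of the Gibbs model above; all values are invented for
    # illustration and are not the paper's estimates.
    ATOMIC = {
        "len_3": lambda w: len(w) == 3,      # word length is 3
        "cap":   lambda w: w[:1].isupper(),  # word is capitalised
        "fstop": lambda w: w.endswith("."),  # word ends with a full stop
    }

    # One Lagrange multiplier per constraint feature (hypothetical values).
    LAMBDA = {"len_3": 1.7, "cap": 2.1, "fstop": 3.0}

    def unnormalised(word):
        """Product of multipliers raised to the indicator values f_chi(w)."""
        score = 1.0
        for name, indicator in ATOMIC.items():
            if indicator(word):        # f_chi(w) = 1
                score *= LAMBDA[name]  # multiply in lambda_chi
        return score

    # A toy stand-in for the configuration space W, needed only to compute Z.
    W = ["Mr.", "mr.", "the", "Cat", "ran", "Dr."]
    Z = sum(unnormalised(w) for w in W)

    # p(Mr.) = (1/Z) * lambda_len3 * lambda_cap * lambda_fstop
    print(round(unnormalised("Mr.") / Z, 3))

If a complex collocation feature such as (3, Cap) were added to ATOMIC and LAMBDA with its own multiplier, the same loop would simply multiply in one more excess factor, which is the point made next.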
<Paragraph position="9"> There is nothing interesting about the model above, since it constrains only atomic, non-overlapping, i.e. independent, features. The main strength of the Gibbs distribution is that it can handle complex overlapping features and therefore account for feature interaction. For instance, we might notice that capitalization, when it is seen particularly with words of length 3, has a distribution different from its general one. To model this observation we can introduce a complex feature (i.e. we set a complex constraint) which is a logical conjunction, or collocation, of the two atomic features 3 and Cap. Now our model will predict the probability for the word &quot;Mr.&quot; as:</Paragraph>
<Paragraph position="10"> p(\text{Mr.}) = \frac{1}{Z} \lambda_3 \lambda_{Cap} \lambda_{(3,Cap)} \lambda_{fstop} </Paragraph>
<Paragraph position="11"> The important thing to note here is that we still constrain the atomic features λ_3 and λ_Cap together with their collocation feature λ_(3,Cap). So λ_(3,Cap) carries only the excess weight which differentiates p(3, Cap) from the product of p(3) and p(Cap), which would be the case if there were no feature interaction. Such decomposition of complex features into simpler ones provides an elegant way of representing cases with interactions of many overlapping features of high complexity.</Paragraph>
<Paragraph position="12"> Because of its ability to handle overlapping features, the maximum entropy framework provides a better way to incorporate multiple knowledge sources than linear interpolation and the Katz back-off method (Katz 1987), which have traditionally been used for this purpose. Rosenfeld 1996 evaluates in detail a maximum entropy model which combines unigrams, bigrams, trigrams and long-distance trigger words for the prediction of the next word. Linear interpolation combines such knowledge sources simply by weighting them as P_{combined} = \sum_{i=1}^{k} \lambda_i P_i for k knowledge sources. It does not, however, model the interaction between different knowledge sources 2 and only provides the best weights for them under the assumption of independence. The back-off method, in fact, does not combine different knowledge sources but rather ranks them. This allows for using the most informative method first and backing off to a less informative method if there is not enough information for a more informative one. For instance, we can try to use the trigram model first, and only when there is no suitable trigram known to the model do we back off to the bigram model. Like linear interpolation, the back-off method does not account for possible interactions between different knowledge sources, which can lead to overestimation of some events.</Paragraph>
<Paragraph position="13"> The maximum entropy framework naturally combines the good sides of the two methods and at the same time accounts for the interactions between features. Every knowledge source produces a set of constraints which are used together with constraints from the other knowledge sources, so no interpolation is needed. Simple and complex features together with their overlaps are naturally incorporated into the model and all the interactions are naturally accounted for. Because of the decompositional nature of the maximum entropy model, it can act as a back-off model too: overlapping simpler features naturally coexist with more complex ones, and the weights of the complex features are just the excess by which they differ from their constituent simpler features.</Paragraph>
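As a point of comparison, the following sketch (again an illustration under assumed weights and probability values, not the paper's models) shows the two traditional combination schemes discussed above; real Katz back-off additionally applies discounting and back-off weights, which are omitted here.

    # Two traditional ways of combining knowledge sources for next-word
    # prediction; the probabilities and weights below are invented.
    def linear_interpolation(p_uni, p_bi, p_tri, weights=(0.1, 0.3, 0.6)):
        """P_combined = sum_i lambda_i * P_i: fixed weights, and no
        modelling of interaction between the knowledge sources."""
        l_uni, l_bi, l_tri = weights
        return l_uni * p_uni + l_bi * p_bi + l_tri * p_tri

    def backoff(p_tri, p_bi, p_uni):
        """Rank the knowledge sources and use the most informative
        estimate that exists (simplified: no discounting)."""
        for p in (p_tri, p_bi, p_uni):
            if p is not None:
                return p
        return 0.0

    print(round(linear_interpolation(0.01, 0.05, 0.20), 3))  # 0.136
    print(backoff(None, 0.05, 0.01))  # no trigram observed -> bigram, 0.05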
<Paragraph position="14"> Applying the exponential distribution discussed above, the maximum entropy approach developed in Della Pietra et al. 1995 defines a framework for the selection of the best performing constraint set and for the estimation of the weights (λs) for these constraints. The model induction procedure has two parts: feature selection and parameter estimation, both of which agree with the principle of maximum entropy. In this paper we present a novel approach to feature selection for maximum entropy models. This approach requires less computational load than the one developed in Della Pietra et al. 1995, at the price of not yet being suitable for building models with a very large (hundreds of thousands) set of parameters. We also propose a slight modification to the process of parameter estimation for conditional maximum entropy models. Our method uses assumptions similar to Berger et al. 1996 but is naturally suitable for distributed parallel computations.</Paragraph> </Section> </Paper>