<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2601"> <Title>Maximum Entropy Tagging with Binary and Real-Valued Features</Title>
<Section position="3" start_page="0" end_page="3" type="metho"> <SectionTitle> 3 Standard Binary Features </SectionTitle>
<Paragraph position="0"> Binary features are indicator functions of specified events of the sample space $X \times Y$. Hence, they take value 1 if the event occurs and 0 otherwise. For the sake of notation, the feature name denotes the type of event, while the index specifies its parameters. For example, $\mathrm{Orth}_{\mathrm{person},\mathrm{Cap},-1}(x,y)$ corresponds to an orthographic feature which is active if and only if the class at time $t$ is person and the word at time $t-1$ in the context starts with a capitalized letter.</Paragraph>
<Section position="1" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 3.1 Atomic Features </SectionTitle>
<Paragraph position="0"> Lexical features These features model co-occurrences of classes and single words of the context. Lexical features are defined on a window of $\pm 2$ positions around the current word. They are denoted by the name Lex and indexed with the triple $(c,w,d)$, which fixes the current class, i.e. $c_t = c$, and the identity and offset of the word in the context, i.e. $w_{t+d} = w$. Formally, the feature is computed by: $\mathrm{Lex}_{c,w,d}(x,y) \triangleq \delta(c_t = c) \cdot \delta(w_{t+d} = w)$.</Paragraph>
<Paragraph position="1"> For example, the lexical feature for the word Verona at position $t$ with tag loc (location) is: $\mathrm{Lex}_{loc,Verona,0}(x,y) = \delta(c_t = loc) \cdot \delta(w_t = Verona)$.</Paragraph>
<Paragraph position="3"> Lexical features might introduce data sparseness in the model, given that in real texts an important fraction of words occur only once. In other words, many words in the test set will have no corresponding feature-parameter pairs estimated on the training data. To cope with this problem, all words observed only once in the training data were mapped into the special symbol oov.</Paragraph>
<Paragraph position="4"> Syntactic features These features model co-occurrences of the current class with part-of-speech or chunk tags at a specific position in the context. Syntactic features are denoted by the name Syn and indexed with a 4-tuple $(c,\mathrm{Pos},p,d)$ or $(c,\mathrm{Chnk},p,d)$, which fixes the class $c_t$, the considered syntactic information, and the tag and offset within the context. Formally, these features are computed by: $\mathrm{Syn}_{c,\mathrm{Pos},p,d}(x,y) \triangleq \delta(c_t = c) \cdot \delta(\mathrm{Pos}_{t+d} = p)$, and analogously for chunk tags.</Paragraph>
<Paragraph position="8"> Orthographic features These features model co-occurrences of the current class with surface characteristics of words of the context, e.g. whether a specific word in the context starts with a capitalized letter (IsCap) or is fully capitalized (IsCAP). In this framework, only capitalization information is considered. Analogously to syntactic features, orthographic features are defined as follows: $\mathrm{Orth}_{c,\mathrm{Cap},d}(x,y) \triangleq \delta(c_t = c) \cdot \mathrm{IsCap}(w_{t+d})$.</Paragraph>
<Paragraph position="10"> Dictionary features These features check whether specific positions in the context contain words occurring in some prepared list. This type of feature is particularly relevant for tasks such as NER, in which gazetteers of proper names can be used to improve coverage of the training data. Atomic dictionary features are defined as follows: $\mathrm{Dict}_{c,L,d}(x,y) \triangleq \delta(c_t = c) \cdot \mathrm{InList}(L, w_{t+d})$, where $L$ is a specific pre-compiled list, and InList is a function which returns 1 if the specified word matches one of the multi-word entries of list $L$, and 0 otherwise.</Paragraph>
<Paragraph position="13"> Transition features Transition features model Markov dependencies between the current tag and a previous tag. They are defined as follows: $\mathrm{Trans}_{c,c',d}(x,y) \triangleq \delta(c_t = c) \cdot \delta(c_{t+d} = c')$, where $d$ is a negative offset.</Paragraph>
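<Paragraph> To make the mechanics of these indicator functions concrete, the following minimal Python sketch implements the atomic feature templates above. The data layout (a dict of parallel word and POS sequences) and all identifiers are illustrative assumptions rather than the paper's implementation; InList is simplified to plain gazetteer membership and boundary handling is omitted.
```python
# Atomic binary features of Section 3.1 as closures over (x, y, t):
# x holds the observed context (parallel sequences), y the tag sequence,
# and t is the current position. Each feature returns 1 or 0.

def delta(cond):
    """Indicator: 1 if the condition holds, 0 otherwise."""
    return 1 if cond else 0

def lex(c, w, d):                 # Lex_{c,w,d}
    return lambda x, y, t: delta(y[t] == c) * delta(x["words"][t + d] == w)

def syn_pos(c, p, d):             # Syn_{c,Pos,p,d}
    return lambda x, y, t: delta(y[t] == c) * delta(x["pos"][t + d] == p)

def orth_cap(c, d):               # Orth_{c,Cap,d}
    return lambda x, y, t: delta(y[t] == c) * delta(x["words"][t + d][:1].isupper())

def dict_feat(c, gazetteer, d):   # Dict_{c,L,d} (single-word membership only)
    return lambda x, y, t: delta(y[t] == c) * delta(x["words"][t + d] in gazetteer)

def trans(c, c_prev, d=-1):       # Trans_{c,c',d} with a negative offset d
    return lambda x, y, t: delta(y[t] == c) * delta(y[t + d] == c_prev)

# Example: Lex_{loc,Verona,0} fires when "Verona" is tagged loc.
x = {"words": ["arriving", "in", "Verona"], "pos": ["VBG", "IN", "NNP"]}
y = ["O", "O", "loc"]
print(lex("loc", "Verona", 0)(x, y, 2))   # -> 1
```
</Paragraph>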
</Section>
<Section position="2" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 3.2 Complex Features </SectionTitle>
<Paragraph position="0"> More complex events are defined by combining two or more atomic features in one of two ways.</Paragraph>
<Paragraph position="1"> Product features take the intersection of the corresponding atomic events. Vector features consider all possible outcomes of the component features. For instance, the product of three atomic lexical features, with class $c$, offsets $-2,-1,0$, and words $v_{-2}, v_{-1}, v_0$, is: $\mathrm{Lex}_{c,v_{-2},v_{-1},v_0}(x,y) \triangleq \prod_{d=-2}^{0} \mathrm{Lex}_{c,v_d,d}(x,y)$.</Paragraph>
<Paragraph position="4"> Vector features obtained from three dictionary features with the same class $c$, list $L$, and offsets $-1,0,+1$, respectively, are indexed over all possible binary outcomes $b_{-1}, b_0, b_{+1}$ of the single atomic features, i.e.: $\mathrm{Dict}_{c,L,b_{-1},b_0,b_{+1}}(x,y) \triangleq \prod_{d=-1}^{+1} \delta(\mathrm{Dict}_{c,L,d}(x,y) = b_d)$.</Paragraph>
<Paragraph position="6"> Complex features used in the experiments are described in Table 1.</Paragraph>
<Paragraph position="7"> The use of complex features significantly increases the model complexity. Assuming that there are 10,000 words occurring more than once in the training corpus, the above lexical feature potentially adds $O(|C| \cdot 10^{12})$ parameters! As complex binary features might be prohibitive from a computational point of view, real-valued features should be considered as an alternative.</Paragraph>
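<Paragraph> As an illustration of the two combination schemes, the sketch below builds product and vector features out of atomic indicator functions like those sketched in Section 3.1; the helper names are again illustrative assumptions rather than the paper's code.
```python
from itertools import product as cartesian

# Product feature: the conjunction (intersection) of atomic binary features.
def product_feature(atomic_feats):
    def f(x, y, t):
        v = 1
        for g in atomic_feats:
            v *= g(x, y, t)          # 1 only if every component fires
        return v
    return f

# Vector features: one indicator per possible outcome pattern of the
# component features, e.g. (b_-1, b_0, b_+1) for three dictionary features.
def vector_features(atomic_feats):
    feats = {}
    for pattern in cartesian([0, 1], repeat=len(atomic_feats)):
        feats[pattern] = (lambda p: lambda x, y, t:
                          1 if tuple(g(x, y, t) for g in atomic_feats) == p else 0)(pattern)
    return feats  # 2^n binary features indexed by outcome pattern
```
A product of three lexical features over offsets -2,-1,0 therefore fires on one specific trigram-class combination, while the vector construction over three dictionary features yields 2^3 = 8 indicators, one per membership pattern.</Paragraph>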
</Section> </Section>
<Section position="4" start_page="3" end_page="3" type="metho"> <SectionTitle> 4 Real-valued Features </SectionTitle>
<Paragraph position="0"> A binary feature can be seen as a probability measure whose support set is made of a single event. From this point of view, binary features can easily be extended to probability measures defined over larger event spaces. In fact, it is convenient to introduce features which are logarithms of conditional probabilities. It can be shown that in this way the linear constraints of the ME model can be interpreted in terms of Kullback-Leibler distances between the target model and the conditional distributions (Klakow, 1998).</Paragraph>
<Paragraph position="1"> Let $p_1(y \mid x), p_2(y \mid x), \ldots, p_n(y \mid x)$ be $n$ different conditional probability distributions estimated on the training corpus. In our framework, each conditional probability $p_i$ is associated with a feature $f_i$ which is defined over a subspace $[X]_i \times Y$ of the sample space $X \times Y$. Hence, $p_i(y \mid x)$ should be read as a shorthand for $p(y \mid [x]_i)$.</Paragraph>
<Paragraph position="2"> The corresponding real-valued feature is: $f_i(x,y) \triangleq \log p_i(y \mid x)$.</Paragraph>
<Paragraph position="4"> In this way, the ME model in Eq. (3) can be rewritten as:
$$p_{\lambda}(y \mid x) = \frac{\prod_{i=1}^{n} p_i(y \mid x)^{\lambda_i}}{\sum_{y' \in Y} \prod_{i=1}^{n} p_i(y' \mid x)^{\lambda_i}}.$$</Paragraph>
<Paragraph position="6"> According to the formalism adopted in Eq. (4), for each type of binary feature presented so far a corresponding real-valued type can easily be defined. The complete list is shown in Table 2. In general, the context subspace was defined on the basis of the offset parameters of each binary feature. For instance, all lexical features selecting two words at distances $-1$ and $0$ from the current position $t$ are modeled by the conditional distribution $p(c_t \mid w_{t-1}, w_t)$. While distributions of lexical, syntactic and transition features are conditioned on words or tags, dictionary and orthographic features are conditioned on binary variables. An additional real-valued feature that was employed is the so-called prior feature, i.e. the probability of a tag occurring: $\mathrm{Prior}(x,y) \triangleq \log p(c_t)$.</Paragraph>
<Paragraph position="10"> A major effect of using real-valued features is a drastic reduction in the number of model parameters. For example, each of the complex lexical features discussed above introduces just one parameter. Hence, the small number of parameters eliminates the need to smooth the ME estimates.</Paragraph>
<Paragraph position="11"> Real-valued features present some drawbacks.</Paragraph>
<Paragraph position="12"> Their level of granularity, or discrimination, may be much lower than that of their binary variants. Moreover, for many features it may be difficult to compute reliable probability values due to data sparseness.</Paragraph>
<Paragraph position="13"> For the latter issue, smoothing techniques developed for statistical language models can be applied (Manning and Schütze, 1999).</Paragraph>
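<Paragraph> To make the definition concrete, here is a minimal sketch, not the paper's code, of how one such conditional distribution, $p(c_t \mid w_{t-1}, w_t)$, could be estimated from counts and plugged in as a log-probability feature. Add-one smoothing is used purely as a placeholder for the discounting and interpolation smoothing discussed in Section 6, and all identifiers are illustrative.
```python
import math
from collections import Counter, defaultdict

def estimate_lex_distribution(corpus, tagset, alpha=1.0):
    """corpus: list of (words, tags) sentence pairs; returns p(tag | (w_{t-1}, w_t))."""
    ctx_tag = defaultdict(Counter)
    for words, tags in corpus:
        for t, tag in enumerate(tags):
            ctx = (words[t - 1] if t > 0 else "<s>", words[t])
            ctx_tag[ctx][tag] += 1

    def p(tag, ctx):
        counts = ctx_tag.get(ctx, Counter())
        # add-one smoothing as a stand-in for proper LM-style smoothing
        return (counts[tag] + alpha) / (sum(counts.values()) + alpha * len(tagset))
    return p

def lex_log_feature(p):
    """Real-valued feature f(x, y, t) = log p(c_t | w_{t-1}, w_t)."""
    def f(x, y, t):
        ctx = (x["words"][t - 1] if t > 0 else "<s>", x["words"][t])
        return math.log(p(y[t], ctx))
    return f
```
However coarse, a single such feature replaces the very large family of binary lexical indicators over the same context, which is the parameter reduction discussed above.</Paragraph>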
</Section>
<Section position="5" start_page="3" end_page="4" type="metho"> <SectionTitle> 5 Mixed Feature Models </SectionTitle>
<Paragraph position="0"> This work, beyond investigating the use of real-valued features, addresses the behavior of models combining binary and real-valued features. The reason is twofold: on one hand, real-valued features make it possible to capture complex information with fewer parameters; on the other hand, binary features preserve a good level of granularity over salient characteristics. Hence, finding a compromise between binary and real-valued features might help to develop ME models which better trade off complexity against granularity of information.</Paragraph>
</Section>
<Section position="6" start_page="4" end_page="4" type="metho"> <SectionTitle> 6 Parameter Estimation </SectionTitle>
<Paragraph position="0"> From the duality of ME and maximum likelihood (Berger et al., 1996), optimal parameters $\lambda^*$ for model (3) can be found by maximizing the log-likelihood function over a training sample $S = \{(x_t, y_t) : t = 1, \ldots, N\}$:
$$\lambda^* = \arg\max_{\lambda} \sum_{t=1}^{N} \log p_{\lambda}(y_t \mid x_t) \qquad (7)$$</Paragraph>
<Paragraph position="2"> Now, whereas binary features take only two values and do not need any estimation phase, conditional probability features have to be estimated on some data sample. The question arises of how to efficiently use the available training data in order to estimate both the parameters and the feature distributions of the model, while avoiding over-fitting.</Paragraph>
<Paragraph position="3"> Two alternative techniques, borrowed from statistical language modeling, have been considered: the held-out and the leave-one-out methods (Manning and Schütze, 1999).</Paragraph>
<Paragraph position="4"> Held-out method. The training sample $S$ is split into two parts, used respectively to estimate the feature distributions and the ME parameters.</Paragraph>
<Paragraph position="5"> Leave-one-out. ME parameters and feature distributions are estimated over the same sample $S$.</Paragraph>
<Paragraph position="6"> The idea is that for each addend in Eq. (7), the corresponding sample point $(x_t, y_t)$ is removed from the training data used to estimate the feature distributions of the model. In this way, it can be shown that occurrences of novel observations are simulated during the estimation of the ME parameters (Federico and Bertoldi, 2004).</Paragraph>
<Paragraph position="7"> In our experiments, language modeling smoothing techniques (Manning and Schütze, 1999) were applied to estimate the feature distributions $p_i(y \mid x)$.</Paragraph>
<Paragraph position="8"> In particular, smoothing was based on the discounting method of Ney et al. (1994), combined with interpolation with distributions using less context.</Paragraph>
<Paragraph position="9"> Given the small number of smoothing parameters involved, leave-one-out probabilities were approximated by just modifying count statistics on the fly (Federico and Bertoldi, 2004). The rationale is that smoothing parameters do not change significantly after removing a single sample point.</Paragraph>
<Paragraph position="10"> For parameter estimation, the GIS algorithm by Darroch and Ratcliff (1972) was applied. It is known that the GIS algorithm requires the feature functions $f_i(x,y)$ to be non-negative. Hence, features were re-scaled as follows:</Paragraph>
<Paragraph position="12"> where $\epsilon$ is a small positive constant and the denominator is a constant term defined by:</Paragraph>
<Paragraph position="14"> The factor $(1 + \epsilon)$ was introduced to ensure that real-valued features are always positive. This condition is important to let the features reflect the same behavior as the conditional distributions, which assign a positive probability to each event.</Paragraph>
<Paragraph position="15"> It is easy to verify that this scaling operation does not affect the original model but only impacts the GIS calculations. Finally, a slack feature was introduced by the algorithm to satisfy the constraint that all features sum up to a constant value (Darroch and Ratcliff, 1972).</Paragraph>
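<Paragraph> The leave-one-out approximation by on-the-fly count modification can be sketched as follows for a simple count-based conditional distribution. This is only an illustration of the idea under our own assumptions (add-one smoothing, our class and method names), not the estimation code of Federico and Bertoldi (2004).
```python
from collections import Counter

# Leave-one-out for a count-based conditional p(c | ctx): when scoring a
# training point (ctx, c) itself, its own occurrence is removed from the
# counts before computing the probability, simulating an unseen event.
class LeaveOneOutConditional:
    def __init__(self, pairs, tagset, alpha=1.0):
        pairs = list(pairs)              # events (ctx, tag) seen in training
        self.joint = Counter(pairs)
        self.ctx_tot = Counter(ctx for ctx, _ in pairs)
        self.tagset = tagset
        self.alpha = alpha               # add-one style smoothing constant

    def prob(self, tag, ctx, held_out=False):
        num = self.joint[(ctx, tag)]
        den = self.ctx_tot[ctx]
        if held_out:                     # discount the current sample point
            num, den = max(num - 1, 0), max(den - 1, 0)
        return (num + self.alpha) / (den + self.alpha * len(self.tagset))
```
During parameter estimation, prob(..., held_out=True) is used for the training point currently being scored, so that the feature reflects how the distribution behaves on events not seen in its own training counts.</Paragraph>
</Section> </Paper>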