<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2601">
  <Title>Maximum Entropy Tagging with Binary and Real-Valued Features</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The Maximum Entropy (ME) statistical framework (Darroch and Ratcliff, 1972; Berger et al., 1996) has been successfully deployed in several NLP tasks. In recent evaluation campaigns, e.g.</Paragraph>
    <Paragraph position="1"> DARPA IE and CoNLL 2000-2003, ME models reached state-of-the-art performance on a range of text-tagging tasks.</Paragraph>
    <Paragraph position="2"> With few exceptions, best ME taggers rely on carefully designed sets of features. Features correspond to binary functions, which model events, observed in the (annotated) training data and supposed to be meaningful or discriminative for the task at hand. Hence, ME models result in a loglinearcombination ofalarge setoffeatures, whose weights can be estimated by the well known Generalized Iterative Scaling (GIS) algorithm by Darroch and Ratcliff (1972).</Paragraph>
    <Paragraph position="3"> Despite ME theory and its related training algorithm (Darroch and Ratcliff, 1972) do not set restrictions on the range of feature functions1, popular NLP text books (Manning and Schutze, 1999) and research papers (Berger et al., 1996) seem to limit them to binary features. In fact, only recently, log-probability features have been deployed in ME models for statistical machine translation (Och and Ney, 2002).</Paragraph>
    <Paragraph position="4"> This paper focuses on ME models for two text-tagging tasks: Named Entity Recognition (NER) and Text Chuncking (TC). By taking inspiration from the literature (Bender et al., 2003; Borthwick, 1999; Koeling, 2000), a set of standard binary features is introduced. Hence, for each feature type, a corresponding real-valued feature is developed in terms of smoothed probability distributions estimated on the training data. A direct comparison of ME models based on binary, realvalued, and mixed features is presented. Besides, performance on the tagging tasks, complexity and training time by each model are reported. ME estimation with real-valued features is accomplished by combining GIS with the leave-one-out method (Manning and Schutze, 1999).</Paragraph>
    <Paragraph position="5"> Experiments were conducted on two publicly available benchmarks for which performance levelsofmanysystems arepublished ontheWeb. Results show that better ME models for NER and TC can be developed by integrating binary and real- null Given a sequence of words wT1 = w1,...,wT and a set of tags C, the goal of text-tagging is to find a sequence of tags cT1 = c1,...,cT which maximizes the posterior probability, i.e.:</Paragraph>
    <Paragraph position="7"> By assuming a discriminative model, Eq. (1) can be rewritten as follows:</Paragraph>
    <Paragraph position="9"> where p(ct|ct[?]11 ,wT1 ) is the target conditional probability of tag ct given the context (ct[?]11 ,wT1 ), i.e. the entire sequence of words and the full sequence of previous tags. Typically, independence assumptions are introduced in order to reduce the context size. While this introduces some approximations in the probability distribution, it considerably reduces data sparseness in the sampling space. For this reason, the context is limited here to the two previous tags (ct[?]1t[?]2) and to four words around the current word (wt+2t[?]2). Moreover, limiting the context to the two previous tags permits to apply dynamic programming (Bender et al., 2003) to efficiently solve the maximization (2).</Paragraph>
    <Paragraph position="10"> Let y = ct denote the class to be guessed (y [?] Y) at time t and x = ct[?]1t[?]2,wt+2t[?]2 its context (x [?] X). The generic ME model results:</Paragraph>
    <Paragraph position="12"> Thenfeature functions fi(x,y) represent anykind of information about the event (x,y) which can be useful for the classification task. Typically, binary features are employed which model the verification of simple events within the target class and the context.</Paragraph>
    <Paragraph position="13"> InMikheev (1998), binary features fortexttagging are classified into two broad classes: atomic and complex. Atomic features tell information about the current tag and one single item (word ortag) of the context. Complex features result as acombination of two or more atomic features. In this way, if the grouped events are not independent, complex features should capture higher correlations or dependencies, possibly useful to discriminate.</Paragraph>
    <Paragraph position="14"> In the following, a standard set of binary features is presented, which is generally employed fortext-tagging tasks. Thereader familiar withthe topic can directly check this set in Table 1.</Paragraph>
  </Section>
class="xml-element"></Paper>