<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3237">
  <Title>Adaptation of Maximum Entropy Capitalizer: Little Data Can Help a Lot</Title>
  <Section position="4" start_page="3" end_page="3" type="metho">
    <SectionTitle>
3 MEMM for Sequence Labeling
</SectionTitle>
    <Paragraph position="0"> A simple approach to sequence labeling is the maximum entropy Markov model. The model assigns a probability P(T|W) to any possible tag sequence</Paragraph>
    <Paragraph position="2"> for a given word sequence</Paragraph>
    <Paragraph position="4"> . The probability assignment is done according to:</Paragraph>
    <Paragraph position="6"> ) is the conditioning information at position i in the word sequence on which the probability model is built.</Paragraph>
    <Paragraph position="7"> The approach we took is the one in (Ratnaparkhi, 1996), which uses x</Paragraph>
    <Paragraph position="9"> }. We note that the probability model is causal in the sequencing of tags (the probability assignment for t i only depends on previous tags t</Paragraph>
    <Paragraph position="11"> ) which allows for efficient algorithms that search for the most likely tag sequence</Paragraph>
    <Paragraph position="13"> P(T|W) as well as ensures a properly normalized conditional probability model</Paragraph>
    <Paragraph position="15"> using a maximum entropy model. The next section briefly describes the training procedure; for details the reader is referred to (Berger et al., 1996).</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.1 Maximum Entropy State Transition Model
</SectionTitle>
      <Paragraph position="0"> The sufficient statistics that are extracted from the training data are tuples</Paragraph>
      <Paragraph position="2"> the tag assigned in context x</Paragraph>
      <Paragraph position="4"> count with which this event has been observed in the training data. By way of example, the event associated with the first word in the example in Section 2 is (*bdw* denotes a special boundary type):</Paragraph>
      <Paragraph position="6"> suffix1=e suffix2=me suffix3=ime The maximum entropy probability model P(y|x) uses features which are indicator functions of the type:</Paragraph>
      <Paragraph position="8"> Assuming a set of features F whose cardinality is F, the probability assignment is made according to:</Paragraph>
      <Paragraph position="10"> is the set of real-valued model parameters.  We used a simple count cut-off feature selection algorithm which counts the number of occurrences of all features in a predefined set after which it discards the features whose count is less than a pre-specified threshold. The parameter of the feature selection algorithm is the threshold value; a value of 0 will keep all features encountered in the training data.  The model parameters L are estimated such that the model assigns maximum log-likelihood to the training data subject to a Gaussian prior centered</Paragraph>
      <Paragraph position="12"> whose value is determined by line search on development data such that it yields the best tagging accuracy. null</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="3" end_page="5" type="metho">
    <SectionTitle>
4 MAP Adaptation of Maximum Entropy Models
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="3" end_page="4" type="sub_section">
      <Paragraph position="0"> In the adaptation scenario we already have a Max-Ent model trained on the background data and we wish to make best use of the adaptation data by balancing the two. A simple way to accomplish this is to use MAP adaptation using a prior distribution on the model parameters.</Paragraph>
      <Paragraph position="1"> A Gaussian prior for the model parameters L has been previously used in (Chen and Rosenfeld, 2000) for smoothing MaxEnt models. The prior has 0 mean and diagonal covariance: L [?]</Paragraph>
      <Paragraph position="3"> )). In the adaptation scenario, the prior distribution used is centered at the parameter</Paragraph>
      <Paragraph position="5"> The adaptation is performed in stages: * apply feature selection algorithm on adaptation data and determine set of features F  introduced in the model receive 0 weight. The resulting model is thus equivalent with the background model. * train the model such that the regularized log-likelihood of the adaptation training data is maximized. The prior mean is set at L</Paragraph>
      <Paragraph position="7"> tween the parameter vector for the background model and a 0-valued vector of length |F</Paragraph>
      <Paragraph position="9"> |corresponding to the weights for the new features.</Paragraph>
      <Paragraph position="10"> As shown in Appendix A, the update equations are very similar to the 0-mean case:</Paragraph>
      <Paragraph position="12"> The effect of the prior is to keep the model parameters l i close to the background ones. The cost of moving away from the mean for each feature f</Paragraph>
      <Paragraph position="14"> specified by the magnitude of the variance s</Paragraph>
      <Paragraph position="16"> will keep the weight l i close to its mean; a large variance s i will make the regularized log-likelihood (see Eq. 3) insensitive to the prior on l i , allowing the use of the best value l i for modeling the adaptation data. Another observation is that not only the features observed in the adaptation data get updated: even if E</Paragraph>
      <Paragraph position="18"> still get updated if the feature f i triggers for a context x encountered in the adaptation data and some predicted value y -- not necessarily present in the adaptation data in context x.</Paragraph>
      <Paragraph position="19"> In our experiments the variances were tied to</Paragraph>
      <Paragraph position="21"> = s whose value was determined by line search on development data drawn from the adaptation data. The common variance s will thus balance optimally the log-likelihood of the adaptation data with the L  mean values obtained from the background data.</Paragraph>
      <Paragraph position="22"> Other tying schemes are possible: separate values could be used for the F  feature sets, respectively. We did not experiment with various tying schemes although this is a promising research direction.</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
4.1 Relationship with Minimum Divergence
Training
</SectionTitle>
      <Paragraph position="0"> Another possibility to adapt the background model is to do minimum KL divergence (MinDiv) train- null We use A\B to denote set difference.</Paragraph>
      <Paragraph position="1"> ing (Pietra et al., 1995) between the background exponential model B -- assumed fixed -- and an exponential model A built using the F  feature set. It can be shown that, if we smooth the A model with a Gaussian prior on the feature weights that is centered at 0 -- following the approach in (Chen and Rosenfeld, 2000) for smoothing maximum entropy models -- then the MinDiv update equations for estimating A on the adaptation data are identical to the MAP adaptation procedure we proposed  .</Paragraph>
      <Paragraph position="2"> However, we wish to point out that the equivalence holds only if the feature set for the new model  feature set for A -- will not result in an equivalent procedure to ours. In fact, the difference in performance between this latter approach and ours could be quite large since the cardinality of F background is typically several orders of magnitude larger than that of F adapt and our approach also updates the weights corresponding to features in F</Paragraph>
      <Paragraph position="4"> needed to compare the performance of the two approaches. null</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="5" end_page="5" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> The baseline 1-gram and the background MEMM capitalizer were trained on various amounts of WSJ (Paul and Baker, 1992) data from 1987 -- files WS87_{001-126}. The in-domain test data used was file WS94_000 (8.7kwds).</Paragraph>
    <Paragraph position="1"> As for the adaptation experiments, two different sets of BN data were used, whose sizes are summarized in Table 1:</Paragraph>
  </Section>
  <Section position="7" start_page="5" end_page="5" type="metho">
    <SectionTitle>
1. BN CNN/NPR data
</SectionTitle>
    <Paragraph position="0"> The training/development/test partition consisted of a 3-way random split of file BN624BTS. The resulting sets are denoted CNN-trn/dev/tst, respectively.</Paragraph>
  </Section>
  <Section position="8" start_page="5" end_page="6" type="metho">
    <SectionTitle>
2. BN ABC Primetime data
</SectionTitle>
    <Paragraph position="0"> The training set consisted of file BN623ATS, whereas the development/test set consisted of a 2-way random split of file BN624ATS.</Paragraph>
    <Section position="1" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
5.1 In-Domain Experiments
</SectionTitle>
      <Paragraph position="0"> We have proceeded building both 1-gram and MEMM capitalizers using various amounts of background training data. The model sizes for the 1-gram and MEMM capitalizer are presented in Table 2. Count cut-off feature selection has been used  Thanks to one of the anonymous reviewers for pointing out this possible connection.</Paragraph>
      <Paragraph position="1">  for the MEMM capitalizer with the threshold set at 5, so the MEMM model size is a function of the training data. The 1-gram capitalizer used a vocabulary of the most likely 100k wds derived from the training data.</Paragraph>
      <Paragraph position="2"> Model No. Param. (10  rameters) for various amounts of training data We first evaluated the in-domain and out-of-domain relative performance of the 1-gram and the MEMM capitalizers as a function of the amount of training data. The results are presented in Table 3. The MEMM capitalizer performs about 45% better  data for various amounts of training data than the 1-gram one when trained and evaluated on Wall Street Journal text. The relative performance improvement of the MEMM capitalizer over the 1-gram baseline drops to 35-40% when using out-of-domain Broadcast News data. Both models benefit from using more training data.</Paragraph>
    </Section>
    <Section position="2" start_page="5" end_page="6" type="sub_section">
      <SectionTitle>
5.2 Adaptation Experiments
</SectionTitle>
      <Paragraph position="0"> We have then adapted the best MEMM model built on 20Mwds on the two BN data sets (CNN/ABC) and compared performance against the 1-gram and the unadapted MEMM models.</Paragraph>
      <Paragraph position="1"> There are a number of parameters to be tuned on development data. Table 4 presents the variation in model size with different count cut-off values for the feature selection procedure on the adaptation data. As can be seen, very few features are added to the background model. Table 5 presents the variation in log-likelihood and capitalization accuracy on the CNN adaptation training and development data, respectively. The adaptation procedure was found  cut-off threshold used for feature selection on CNN-trn adaptation data; the entry corresponding to the cut-off threshold of 10  represents the number of features in the background model to be insensitive to the number of reestimation iterations, and, more surprisingly, to the number of features added to the background model from the adaptation data, as shown in 5. The most sensitive parameter is the prior variance s  , as shown in Figure 1; its value is chosen to maximize classification accuracy on development data. As expected, low  variance values; log-likelihood and accuracy on adaptation data CNN-trn as well as accuracy on held-out data CNN-dev; the background model results (no new features added) are the entries corresponding to the cut-off threshold of  Finally, Table 6 presents the results on test data for 1-gram, background and adapted MEMM. As can be seen, the background MEMM outperforms the 1-gram model on both BN test sets by about 35-40% relative. Adaptation improves performance even further by another 20-25% relative. Overall, the adapted models achieve 60% relative reduction in capitalization error over the 1-gram baseline on both BN test sets. An intuitively satisfying result is the fact that the cross-test set performance (CNN  and training/development data (- -/- line) capitalization accuracy as a function of the prior variance s  used: ABC and CNN adapted model evaluated on ABC data and the other way around) is worse than the adapted one.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>