<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-2018">
  <Title>A comparison of algorithms for maximum entropy parameter estimation</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Maximum likelihood estimation
</SectionTitle>
    <Paragraph position="0"> Suppose we are given a probability distribution p over a set of events X which are characterized by a d dimensional feature vector function f : X !Rd.</Paragraph>
    <Paragraph position="1"> In addition, we have also a set of contexts W and a function Y which partitions the members of X. In the case of a stochastic context-free grammar, for example, X might be the set of possible trees, the feature vectors might represent the number of times each rule applied in the derivation of each tree, W might be the set of possible strings of words, and Y(w) the set of trees whose yield is w2W. A conditional maximum entropy model qth(xjw) for p has the parametric form (Berger et al., 1996; Chi, 1998; Johnson et al., 1999):</Paragraph>
    <Paragraph position="3"> where th is a d-dimensional parameter vector and thT f (x) is the inner product of the parameter vector and a feature vector.</Paragraph>
    <Paragraph position="4"> Given the parametric form of an ME model in (1), fitting an ME model to a collection of training data entails finding values for the parameter vector th which minimize the Kullback-Leibler divergence between the model qth and the empirical distribu-</Paragraph>
    <Paragraph position="6"> or, equivalently, which maximize the log likelihood:</Paragraph>
    <Paragraph position="8"> The gradient of the log likelihood function, or the vector of its first derivatives with respect to the pa-</Paragraph>
    <Paragraph position="10"> Since the likelihood function (2) is concave over the parameter space, it has a global maximum where the gradient is zero. Unfortunately, simply setting G(th) = 0 and solving for th does not yield a closed form solution, so we proceed iteratively. At each step, we adjust an estimate of the parameters th(k) to a new estimate th(k+1) based on the divergence between the estimated probability distribution q(k) and the empirical distribution p. We continue until successive improvements fail to yield a sufficiently large decrease in the divergence.</Paragraph>
    <Paragraph position="11"> While all parameter estimation algorithms we will consider take the same general form, the method for computing the updates d(k) at each search step differs substantially. As we shall see, this difference can have a dramatic impact on the number of updates required to reach convergence.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Iterative Scaling
</SectionTitle>
      <Paragraph position="0"> One popular method for iteratively refining the model parameters is Generalized Iterative Scaling (GIS), due to Darroch and Ratcliff (1972). An extension of Iterative Proportional Fitting (Deming and Stephan, 1940), GIS scales the probability distribution q(k) by a factor proportional to the ratio of Ep[ f ] to Eq(k)[ f ], with the restriction that [?] j f j(x) = C for each event x in the training data (a condition which can be easily satisfied by the addition of a correction feature). We can adapt GIS to estimate the model parameters th rather than the model probabilities q, yielding the update rule:</Paragraph>
      <Paragraph position="2"> The step size, and thus the rate of convergence, depends on the constant C: the larger the value of C, the smaller the step size. In case not all rows of the training data sum to a constant, the addition of a correction feature effectively slows convergence to match the most difficult case. To avoid this slowed convergence and the need for a correction feature, Della Pietra et al. (1997) propose an Improved Iterative Scaling (IIS) algorithm, whose update rule is the solution to the equation:</Paragraph>
      <Paragraph position="4"> where M(x) is the sum of the feature values for an event x in the training data. This is a polynomial in exp d(k) , and the solution can be found straight-forwardly using, for example, the Newton-Raphson method.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 First order methods
</SectionTitle>
      <Paragraph position="0"> Iterative scaling algorithms have a long tradition in statistics and are still widely used for analysis of contingency tables. Their primary strength is that on each iteration they only require computation of the expected values Eq(k). They do not depend on evaluation of the gradient of the log-likelihood function, which, depending on the distribution, could be prohibitively expensive. In the case of ME models, however, the vector of expected values required by iterative scaling essentially is the gradient G. Thus, it makes sense to consider methods which use the gradient directly.</Paragraph>
      <Paragraph position="1"> The most obvious way of making explicit use of the gradient is by Cauchy's method, or the method of steepest ascent. The gradient of a function is a vector which points in the direction in which the function's value increases most rapidly. Since our goal is to maximize the log-likelihood function, a natural strategy is to shift our current estimate of the parameters in the direction of the gradient via the update rule:</Paragraph>
      <Paragraph position="3"> where the step size a(k) is chosen to maximize L(th(k) + d(k)). Finding the optimal step size is itself an optimization problem, though only in one dimension and, in practice, only an approximate solution is required to guarantee global convergence.</Paragraph>
      <Paragraph position="4"> Since the log-likelihood function is concave, the method of steepest ascent is guaranteed to find the global maximum. However, while the steps taken on each iteration are in a very narrow sense locally optimal, the global convergence rate of steepest ascent is very poor. Each new search direction is orthogonal (or, if an approximate line search is used, nearly so) to the previous direction. This leads to a characteristic &amp;quot;zig-zag&amp;quot; ascent, with convergence slowing as the maximum is approached.</Paragraph>
      <Paragraph position="5"> One way of looking at the problem with steepest ascent is that it considers the same search directions many times. We would prefer an algorithm which considered each possible search direction only once, in each iteration taking a step of exactly the right length in a direction orthogonal to all previous search directions. This intuition underlies conjugate gradient methods, which choose a search direction which is a linear combination of the steepest ascent direction and the previous search direction. The step size is selected by an approximate line search, as in the steepest ascent method. Several non-linear conjugate gradient methods, such as the Fletcher-Reeves (cg-fr) and the Polak-Ribi`ere-Positive (cf-prp) algorithms, have been proposed. While theoretically equivalent, they use slighly different update rules and thus show different numeric properties.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Second order methods
</SectionTitle>
      <Paragraph position="0"> Another way of looking at the problem with steepest ascent is that while it takes into account the gradient of the log-likelihood function, it fails to take into account its curvature, or the gradient of the gradient. The usefulness of the curvature is made clear if we consider a second-order Taylor series approx-</Paragraph>
      <Paragraph position="2"> where H is Hessian matrix of the log-likelihood function, the d d matrix of its second partial derivatives with respect to th. If we set the derivative of (4) to zero and solve for d, we get the update rule for Newton's method:</Paragraph>
      <Paragraph position="4"> Newton's method converges very quickly (for quadratic objective functions, in one step), but it requires the computation of the inverse of the Hessian matrix on each iteration.</Paragraph>
      <Paragraph position="5"> While the log-likelihood function for ME models in (2) is twice differentiable, for large scale problems the evaluation of the Hessian matrix is computationally impractical, and Newton's method is not competitive with iterative scaling or first order methods. Variable metric or quasi-Newton methods avoid explicit evaluation of the Hessian by building up an approximation of it using successive evaluations of the gradient. That is, we replace H 1(th(k)) in (5) with a local approximation of the inverse Hes-</Paragraph>
      <Paragraph position="7"> with B(k) a symmatric, positive definite matrix which satisfies the equation:</Paragraph>
      <Paragraph position="9"> Variable metric methods also show excellent convergence properties and can be much more efficient than using true Newton updates, but for large scale problems with hundreds of thousands of parameters, even storing the approximate Hessian is prohibitively expensive. For such cases, we can apply limited memory variable metric methods, which implicitly approximate the Hessian matrix in the vicinity of the current estimate of th(k) using the previous  m values of y(k) and d(k). Since in practical applications values of m between 3 and 10 suffice, this can offer a substantial savings in storage requirements over variable metric methods, while still giving favorable convergence properties.1 3 Comparing estimation techniques The performance of optimization algorithms is highly dependent on the specific properties of the problem to be solved. Worst-case analysis typically 1Space constraints preclude a more detailed discussion of these methods here. For algorithmic details and theoretical analysis of first and second order methods, see, e.g., Nocedal (1997) or Nocedal and Wright (1999).</Paragraph>
      <Paragraph position="10"> does not reflect the actual behavior on actual prob- null lems. Therefore, in order to evaluate the performance of the optimization techniques sketched in previous section when applied to the problem of parameter estimation, we need to compare the performance of actual implementations on realistic data sets (Dolan and Mor'e, 2002).</Paragraph>
      <Paragraph position="11"> Minka (2001) offers a comparison of iterative scaling with other algorithms for parameter estimation in logistic regression, a problem similar to the one considered here, but it is difficult to transfer Minka's results to ME models. For one, he evaluates the algorithms with randomly generated training data. However, the performance and accuracy of optimization algorithms can be sensitive to the specific numerical properties of the function being optimized; results based on random data may or may not carry over to more realistic problems.</Paragraph>
      <Paragraph position="12"> And, the test problems Minka considers are relatively small (100-500 dimensions). As we have seen, though, algorithms which perform well for small and medium scale problems may not always be applicable to problems with many thousands of dimensions.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Implementation
</SectionTitle>
      <Paragraph position="0"> As a basis for the implementation, we have used PETSc (the &amp;quot;Portable, Extensible Toolkit for Scientific Computation&amp;quot;), a software library designed to ease development of programs which solve large systems of partial differential equations (Balay et al., 2001; Balay et al., 1997; Balay et al., 2002).</Paragraph>
      <Paragraph position="1"> PETSc offers data structures and routines for parallel and sequential storage, manipulation, and visualization of very large sparse matrices.</Paragraph>
      <Paragraph position="2"> For any of the estimation techniques, the most expensive operation is computing the probability distribution q and the expectations Eq[ f ] for each iteration. In order to make use of the facilities provided by PETSc, we can store the training data as a (sparse) matrix F, with rows corresponding to events and columns to features. Then given a parameter vector th, the unnormalized probabilities .qth are the matrix-vector product: .qth = expFth and the feature expectations are the transposed matrix-vector product: Eqth[ f ] = FT qth By expressing these computations as matrix-vector operations, we can take advantage of the high performance sparse matrix primitives of PETSc.</Paragraph>
      <Paragraph position="3"> For the comparison, we implemented both Generalized and Improved Iterative Scaling in C++ using the primitives provided by PETSc. For the other optimization techniques, we used TAO (the &amp;quot;Toolkit for Advanced Optimization&amp;quot;), a library layered on top of the foundation of PETSc for solving non-linear optimization problems (Benson et al., 2002). TAO offers the building blocks for writing optimization programs (such as line searches and convergence tests) as well as high-quality implementations of standard optimization algorithms (including conjugate gradient and variable metric methods).</Paragraph>
      <Paragraph position="4"> Before turning to the results of the comparison, two additional points need to be made. First, in order to assure a consistent comparison, we need to use the same stopping rule for each algorithm.</Paragraph>
      <Paragraph position="5"> For these experiments, we judged that convergence was reached when the relative change in the log-likelihood between iterations fell below a predetermined threshold. That is, each run was stopped when:</Paragraph>
      <Paragraph position="7"> where the relative tolerance e = 10 7. For any particular application, this may or may not be an appropriate stopping rule, but is only used here for purposes of comparison.</Paragraph>
      <Paragraph position="8"> Finally, it should be noted that in the current implementation, we have not applied any of the possible optimizations that appear in the literature (Lafferty and Suhm, 1996; Wu and Khudanpur, 2000; Lafferty et al., 2001) to speed up normalization of the probability distribution q. These improvements take advantage of a model's structure to simplify the evaluation of the denominator in (1). The particular data sets examined here are unstructured, and such optimizations are unlikely to give any improvement.</Paragraph>
      <Paragraph position="9"> However, when these optimizations are appropriate, they will give a proportional speed-up to all of the algorithms. Thus, the use of such optimizations is independent of the choice of parameter estimation method.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Experiments
</SectionTitle>
      <Paragraph position="0"> To compare the algorithms described inx2, we applied the implementation outlined in the previous section to four training data sets (described in Table 1) drawn from the domain of natural language processing. The 'rules' and 'lex' datasets are examples dataset classes contexts features non-zeros  of stochastic attribute value grammars, one with a small set of SCFG-like features, and with a very large set of fine-grained lexical features (Bouma et al., 2001). The 'summary' dataset is part of a sentence extraction task (Osborne, to appear), and the 'shallow' dataset is drawn from a text chunking application (Osborne, 2002). These datasets vary widely in their size and composition, and are representative of the kinds of datasets typically encountered in applying ME models to NLP classification tasks.</Paragraph>
      <Paragraph position="1"> The results of applying each of the parameter estimation algorithms to each of the datasets is summarized in Table 2. For each run, we report the KL divergence between the fitted model and the training data at convergence, the prediction accuracy of fitted model on a held-out test set (the fraction of contexts for which the event with the highest probability under the model also had the highest probability under the reference distribution), the number of iterations required, the number of log-likelihood and gradient evaluations required (algorithms which use a line search may require several function evaluations per iteration), and the total elapsed time (in seconds).2 There are a few things to observe about these results. First, while IIS converges in fewer steps the GIS, it takes substantially more time. At least for this implementation, the additional bookkeeping overhead required by IIS more than cancels any improvements in speed offered by accelerated convergence. This may be a misleading conclusion, however, since a more finely tuned implementation of IIS may well take much less time per iteration than the one used for these experiments. However, even if each iteration of IIS could be made as fast as an 2The reported time does not include the time required to input the training data, which is difficult to reproduce and which is the same for all the algorithms being tested. All tests were run using one CPU of a dual processor 1700MHz Pentium 4 with 2 gigabytes of main memory at the Center for High Performance Computing and Visualisation, University of Groningen. null iteration of GIS (which seems unlikely), the benefits of IIS over GIS would in these cases be quite modest.</Paragraph>
      <Paragraph position="2"> Second, note that for three of the four datasets, the KL divergence at convergence is roughly the same for all of the algorithms. For the 'summary' dataset, however, they differ by up to two orders of magnitude. This is an indication that the convergence test in (6) is sensitive to the rate of convergence and thus to the choice of algorithm. Any degree of precision desired could be reached by any of the algorithms, with the appropriate value of e.</Paragraph>
      <Paragraph position="3"> However, GIS, say, would require many more iterations than reported in Table 2 to reach the precision achieved by the limited memory variable metric algorithm. null Third, the prediction accuracy is, in most cases, more or less the same for all of the algorithms.</Paragraph>
      <Paragraph position="4"> Some variability is to be expected--all of the data sets being considered here are badly ill-conditioned, and many different models will yield the same likelihood. In a few cases, however, the prediction accuracy differs more substantially. For the two SAVG data sets ('rules' and 'lex'), GIS has a small advantage over the other methods. More dramatically, both iterative scaling methods perform very poorly on the 'shallow' dataset. In this case, the training data is very sparse. Many features are nearly 'pseudo-minimal' in the sense of Johnson et al. (1999), and so receive weights approaching [?].</Paragraph>
      <Paragraph position="5"> Smoothing the reference probabilities would likely improve the results for all of the methods and reduce the observed differences. However, this does suggest that gradient-based methods are robust to certain problems with the training data.</Paragraph>
      <Paragraph position="6"> Finally, the most significant lesson to be drawn from these results is that, with the exception of steepest ascent, gradient-based methods outperform iterative scaling by a wide margin for almost all the datasets, as measured by both number of function evaluations and by the total elapsed time. And, in each case, the limited memory variable metric algo-</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>