<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1087">
  <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 692-699, Vancouver, October 2005. ©2005 Association for Computational Linguistics. Maximum Expected F-Measure Training of Logistic Regression Models</Title>
  <Section position="3" start_page="692" end_page="692" type="metho">
    <SectionTitle>
2 Review of Logistic Regression
</SectionTitle>
    <Paragraph position="0"> Bernoulli regression models are conditional probability models of a binary response variable Y given a vector X of k explanatory variables (X1,...,Xk).</Paragraph>
    <Paragraph position="1"> We will use the convention that Y takes on a value y ∈ {−1,+1}.</Paragraph>
    <Paragraph position="2"> Logistic regression models (Cox, 1958) are perhaps best viewed as instances of generalized linear models (Nelder and Wedderburn, 1972; McCullagh and Nelder, 1989) in which the response variable follows a Bernoulli distribution and the link function is the logit function. Let us summarize this first, before expanding the relevant definitions:</Paragraph>
    <Paragraph position="4"> What this means is that there is an unobserved quantity p, the success probability of the Bernoulli distribution, which we interpret as the probability that Y will take on the value +1:</Paragraph>
    <Paragraph position="6"> The logit function is used to transform a probability, constrained to fall within the interval (0,1), into a real number ranging over (−∞,+∞). (The natural choice may seem to be for Y to range over the set {0,1}, but the convention adopted here is more common for classification problems and has certain advantages which will become clear soon.)</Paragraph>
    <Paragraph position="7"> The inverse function of the logit is the cumulative distribution function of the standard logistic distribution (also known as the sigmoid or logistic function), which we call g:</Paragraph>
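    <Paragraph> As a concrete numerical sketch (ours, not part of the original paper), the logit and its inverse g can be written in a few lines of Python:

```python
import math

def logit(p):
    # maps a probability in (0,1) to a real number in (-inf, +inf)
    return math.log(p / (1.0 - p))

def g(z):
    # inverse of the logit: the sigmoid / logistic function, which
    # maps any real number back into the interval (0,1)
    return 1.0 / (1.0 + math.exp(-z))
```

Composing the two functions recovers the original probability, e.g. g(logit(0.3)) = 0.3, and g(0) = 1/2.</Paragraph>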
    <Paragraph position="9"> This allows us to write p = g(θ0 + x1θ1 + x2θ2 + ··· + xkθk). We also adopt the usual convention that x = (1, x1, x2, ..., xk), a (k+1)-dimensional vector whose first component is always 1 and whose remaining k components are the values of the k explanatory variables. So the Bernoulli probability can be expressed as</Paragraph>
    <Paragraph position="11"> The conditional probability model then takes the following abbreviated form, which will be used throughout the rest of this paper:</Paragraph>
    <Paragraph position="13"> A classifier can be constructed from this probability model using the MAP decision rule. This means predicting the label +1 if Pr(+1 | x, θ) exceeds 1/2, which amounts to the following:</Paragraph>
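    <Paragraph> A minimal sketch of this decision rule (ours; the function name is hypothetical): since g is monotonic and g(0) = 1/2, the predicted probability exceeds 1/2 exactly when the linear score is positive, so no sigmoid needs to be evaluated at decision time:

```python
def predict(theta, x):
    # theta and x are (k+1)-vectors; x[0] == 1 supplies the intercept
    z = sum(t * xi for t, xi in zip(theta, x))
    # Pr(+1 | x, theta) = g(z) exceeds 1/2 exactly when z is positive,
    # so the MAP decision reduces to a sign test on the linear score
    return +1 if z > 0 else -1
```

This is precisely the linear-threshold-unit behavior noted below.</Paragraph>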
    <Paragraph position="15"> This illustrates the well-known result that a MAP classifier derived from a logistic regression model is equivalent to a (single-layer) perceptron (Rosenblatt, 1958) or linear threshold unit.</Paragraph>
  </Section>
  <Section position="4" start_page="692" end_page="693" type="metho">
    <SectionTitle>
3 F-Measure
</SectionTitle>
    <Paragraph position="0"> Suppose the parameter vector θ of a logistic regression model is known. The performance of the resulting classifier can then be evaluated in terms of its recall (or sensitivity) and precision on an evaluation dataset. Recall (R) and precision (P) are defined in terms of the number of true positives (A), misses (B), and false alarms (C) of the classifier (cf. Table 1):</Paragraph>
    <Paragraph position="2"> The Fα measure, familiar from Information Retrieval, combines recall and precision into a single utility criterion by taking their α-weighted harmonic mean:</Paragraph>
    <Paragraph position="4"> The Fα measure can be expressed in terms of the triple (A,B,C) as Fα = A / (A + αB + (1−α)C).</Paragraph>
    <Paragraph position="6"> In order to define A, B, and C formally, we use the notation ⟦π⟧ to denote a variant of the Kronecker delta, where π is a Boolean expression: ⟦π⟧ equals 1 if π is true and 0 otherwise.</Paragraph>
    <Paragraph position="8"> Given an evaluation dataset (x1,y1),...,(xn,yn), the counts of hits (true positives), misses, and false alarms are, respectively:</Paragraph>
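    <Paragraph> Putting these definitions together, here is a small sketch (ours, not the paper's) that tallies A, B, and C over a dataset and evaluates Fα = A / (A + αB + (1−α)C):

```python
def f_alpha(alpha, data, classify):
    # data: list of (x, y) pairs with gold labels y in {-1, +1};
    # classify maps x to a predicted label in {-1, +1}
    A = B = C = 0  # hits, misses, false alarms
    for x, y in data:
        pred = classify(x)
        if y == +1 and pred == +1:
            A += 1       # hit (true positive)
        elif y == +1 and pred == -1:
            B += 1       # miss
        elif y == -1 and pred == +1:
            C += 1       # false alarm
    # the alpha-weighted harmonic mean of recall A/(A+B)
    # and precision A/(A+C)
    return A / (A + alpha * B + (1 - alpha) * C)
```

For example, on a dataset with three positives and one negative, a classifier that predicts +1 everywhere has recall 3/3, precision 3/4, and Fα=0.5 measure 6/7, matching the toy example discussed later.</Paragraph>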
    <Paragraph position="10"> Note that F-measure is seemingly a global measure of utility that applies to an evaluation dataset as a whole: while the F-measure of a classifier evaluated on a single supervised instance is well defined, the overall F-measure on a larger dataset is not a function of the F-measure evaluated on each instance in the dataset. This is in contrast to ordinary loss/utility, whose grand total (or average) on a dataset can be computed by direct summation.</Paragraph>
  </Section>
  <Section position="5" start_page="693" end_page="694" type="metho">
    <SectionTitle>
4 Relation to Expected Utility
</SectionTitle>
    <Paragraph position="0"> We reformulate F-measure as a scalar-valued rational function composed with a vector-valued utility function. This allows us to define notions of expected and average utility, setting up the discussion of parameter estimation in terms of empirical risk minimization (or rather, utility maximization).</Paragraph>
    <Paragraph position="1"> Define the following vector-valued utility function u, where u(ỹ | y) is the utility of choosing the label ỹ when the true label is y:</Paragraph>
    <Paragraph position="3"> This function indicates whether a classification decision is a hit, miss, or false alarm. Correct rejections are not counted.</Paragraph>
    <Paragraph position="4"> Expected values are, of course, well-defined for vector-valued functions; for example, the expected utility of a classifier under u is a vector whose components correspond to hits, misses, and false alarms. In empirical risk minimization we approximate the expected utility of a classifier by its average utility US on a given dataset S = (x1,y1),...,(xn,yn):</Paragraph>
    <Paragraph position="6"> where A, B, and C are as defined before. This means that we can interpret the F-measure of a classifier as a simple rational function of its empirical average utility (the scaling factor 1/n in (3) can in fact be omitted). This allows us to approach the parameter estimation task as an empirical risk minimization or utility maximization problem.</Paragraph>
  </Section>
  <Section position="6" start_page="694" end_page="697" type="metho">
    <SectionTitle>
5 Discriminative Parameter Estimation
</SectionTitle>
    <Paragraph position="0"> In the preceding two sections we assumed that the parameter vector θ was known. Now we turn to the problem of estimating θ by maximizing the F-measure formulated in terms of expected utility. We make the dependence on θ explicit in the formulation of the optimization task:</Paragraph>
    <Paragraph position="2"> where (A(θ), B(θ), C(θ)) = US(θ) as defined in (3).</Paragraph>
    <Paragraph position="3"> We encounter the usual problem: the basic quantities involved are integers (counts of hits, misses, and false alarms), and the optimization objective is a piecewise-constant function of the parameter vector θ, because θ occurs exclusively inside Kronecker deltas. For example:</Paragraph>
    <Paragraph position="5"> In general, we can set</Paragraph>
    <Paragraph position="7"> and in the case of logistic regression this arises as a special case of approximating the limit</Paragraph>
    <Paragraph position="9"> with a fixed value of γ = 1. The choice of γ does not matter much. The important point is that we are now dealing with approximate quantities which depend continuously on θ. In particular A(θ) ≈ Ã(θ), where</Paragraph>
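    <Paragraph> To make the approximation concrete, here is a small sketch (ours; γ fixed at 1 as in the text) of the soft counts Ã(θ) and C̃(θ) obtained by replacing each Kronecker delta with a sigmoid:

```python
import math

def g(z):
    # sigmoid: smooth surrogate for the 0/1 indicator
    return 1.0 / (1.0 + math.exp(-z))

def soft_counts(theta, data, gamma=1.0):
    # data: list of (x, y); x is a (k+1)-vector with x[0] == 1
    A = 0.0   # soft hit count, summed over positive instances only
    m = 0.0   # soft count of predicted positives, m~pos = A~ + C~
    for x, y in data:
        p = g(gamma * sum(t * xi for t, xi in zip(theta, x)))
        m += p
        if y == +1:
            A += p
    return A, m - A   # (A~(theta), C~(theta))
```

At θ = 0 every instance contributes 1/2, so the soft counts equal half the number of positives and half the number of negatives, respectively; unlike the exact counts, both vary smoothly with θ.</Paragraph>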
    <Paragraph position="11"> Since the marginal total of positive instances npos (cf. Table 1) does not depend on θ, we use the identities B̃(θ) = npos − Ã(θ) and m̃pos(θ) = Ã(θ) + C̃(θ) to rewrite the optimization objective as F̃α:</Paragraph>
    <Paragraph position="13"> g(gvectorxi*vectorth).</Paragraph>
    <Paragraph position="14"> Maximization of F̃ as defined in (6) can be carried out numerically using multidimensional optimization techniques like conjugate gradient search (Fletcher and Reeves, 1964) or quasi-Newton methods such as the BFGS algorithm (Broyden, 1967; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970). This requires the evaluation of partial derivatives. The jth partial derivative of F̃ is as follows:</Paragraph>
    <Paragraph position="16"> One can compute the value of F̃(θ) and its gradient ∇F̃(θ) simultaneously at a given point θ in O(nk) time and O(k) space. Pseudo-code for such an algorithm appears in Figure 1. In practice, the inner loops on lines 8-9 and 14-18 can be made more efficient by using a sparse representation of the row vectors x[i]. A concrete implementation of this algorithm can then be used as a callback to a multidimensional optimization routine. We use the BFGS minimizer provided by the GNU Scientific Library (Galassi et al., 2003). An important caveat: the function F̃ is generally not concave. We deal with this problem by taking the maximum across several runs of the optimization algorithm starting from random initial values. The next section illustrates this point further.</Paragraph>
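    <Paragraph> The joint objective-and-gradient evaluation can be sketched in vectorized form as follows (our NumPy rendering, not the paper's GSL implementation). Substituting the identities above, the objective simplifies to F̃α(θ) = Ã(θ) / (α·npos + (1−α)·m̃pos(θ)), and the gradient follows by the chain and quotient rules:

```python
import numpy as np

def soft_f_and_grad(theta, X, y, alpha=0.5, gamma=1.0):
    # X: (n, k+1) design matrix with a leading column of ones; y in {-1, +1}
    p = 1.0 / (1.0 + np.exp(-gamma * (X @ theta)))  # g(gamma * x_i . theta)
    pos = (y == 1)
    A = p[pos].sum()        # soft hit count A~(theta)
    m = p.sum()             # soft count of predicted positives m~pos(theta)
    D = alpha * pos.sum() + (1.0 - alpha) * m
    F = A / D               # F~_alpha(theta)
    dp = gamma * p * (1.0 - p)          # derivative of the sigmoid
    dA = X[pos].T @ dp[pos]             # gradient of A~
    dm = X.T @ dp                       # gradient of m~pos
    grad = (D * dA - A * (1.0 - alpha) * dm) / D**2  # quotient rule
    return F, grad
```

This pair can be handed as a callback to any quasi-Newton routine (the paper uses the GSL BFGS minimizer); since F̃ is not concave, one would restart from several random initial values of θ and keep the best run.</Paragraph>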
    <Paragraph position="17"> A comparison with the method of maximum likelihood illustrates two important properties of discriminative parameter estimation. Consider the toy dataset in Table 2, consisting of four supervised instances with a single explanatory variable. The logistic regression model thus has two parameters and takes the following form:</Paragraph>
    <Paragraph position="19"> +logPr toy(+1|3,th0,th1).</Paragraph>
    <Paragraph position="20"> A surface plot of L is shown in Figure 2. Observe that L is concave; its global maximum occurs near (θ0,θ1) ≈ (0.35, 0.57), and its value is always strictly negative because the toy dataset is not linearly separable. The classifier resulting from maximum likelihood training predicts the label +1 for all training instances and thus achieves a recall of 3/3 and a precision of 3/4 on its training data. The Fα=0.5 measure is 6/7.</Paragraph>
    <Paragraph position="21"> Contrast the shape of the log-likelihood function L with the function F̃α. Surface plots of F̃α=0.5 and F̃α=0.25 appear in Figure 3. The figures clearly illustrate the first important (but undesirable) property of F̃, namely the lack of concavity. They also illustrate a desirable property, namely the ability to take certain properties of the loss function into account during training. The F̃α=0.5 surface in the left panel of Figure 3 achieves its maximum in the right corner, for (θ0,θ1) → (+∞,+∞). If we choose (θ0,θ1) = (20,15), the classifier labels every instance of the training data with +1.</Paragraph>
    <Paragraph position="22"> The F̃α=0.25 surface in the right panel of Figure 3 achieves its maximum in the opposite corner, for (θ0,θ1) → (−∞,+∞). If we set (θ0,θ1) = (−20,15), the resulting classifier labels the first two instances (x = 0 and x = 1) as −1 and the last two instances (x = 2 and x = 3) as +1.</Paragraph>
    <Paragraph position="23"> The classifier trained according to the F̃α=0.5 criterion achieves an Fα=0.5 measure of 6/7 ≈ 0.86, compared with 4/5 = 0.80 for the classifier trained according to the F̃α=0.25 criterion. Conversely, the latter classifier achieves an Fα=0.25 measure of 8/9 ≈ 0.89, compared with 4/5 = 0.80 for the classifier trained according to the F̃α=0.5 criterion. This demonstrates that the training procedure can effectively take information from the utility function into account, producing a classifier that performs well under a given evaluation criterion. This is the result of optimizing a task-specific utility function during training, not simply a matter of adjusting the decision threshold of a trained classifier.</Paragraph>
    <Paragraph position="24"> 7 Evaluation on an Extraction Problem We evaluated our discriminative training procedure on a real extraction problem arising in broadcast news summarization. The overall task is to summarize the stories in an audio news broadcast (or in the audio portion of an A/V broadcast). We assume that story boundaries have been identified and that each story has been broken up into sentence-like units. A simple way of summarizing a story is then to classify each sentence as either belonging in a summary or not, so that a relevant subset of sentences can be extracted to form the basis of a summary. What makes the classification task hard, and therefore interesting, is the fact that reliable features are hard to come by. Existing approaches such as that of Maskey and Hirschberg (2005) do well only when combining diverse features such as lexical cues, acoustic properties, structural/positional features, etc.</Paragraph>
    <Paragraph position="25"> The task has another property which renders it problematic, and which prompted us to develop the discriminative training procedure described in this paper. Summarization, by definition, aims for brevity. This means that in any dataset the number of positive instances will be much smaller than the number of negative instances. Given enough data, balance could be restored by discarding negative instances. This, however, was not an option in our case: a moderate amount of manually labeled data had been produced and about one third would have had to be discarded to achieve a balance in the distribution of class labels. This would have eliminated precious supervised training data, which we were not prepared to do.</Paragraph>
    <Paragraph position="26"> The training and test data were prepared by Maskey and Hirschberg (2005), who performed the feature engineering, imputation of missing values, and the training-test split. We used the data unchanged in order to allow for a comparison between approaches. The dataset comprises 30 variables: one binary response variable, one binary explanatory variable, and 28 integer- and real-valued explanatory variables. The training portion consists of 3,535 instances, the test portion of 408 instances.</Paragraph>
    <Paragraph position="27"> We fitted logistic regression models in three different ways: by maximum likelihood (ML), by F̃α=0.5 maximization, and by F̃α=0.75 maximization. Each classifier was evaluated on the test dataset and its recall (R), precision (P), Fα=0.5 measure, and Fα=0.75 measure recorded. The results appear in Table 3.</Paragraph>
    <Paragraph position="28"> The row labeled ML+ is special: the classifier used here is the logistic regression model fitted by maximum likelihood; what is different is that the threshold for positive predictions was adjusted post hoc to match the number of true positives of the first discriminatively trained classifier. This has the same effect as manually adjusting the threshold parameter θ0 based on partial knowledge of the test data (via the performance of another classifier) and is thus not permissible. It is interesting to note, however, that the ML-trained classifier performs worse than the F̃α=0.5-trained classifier even when one parameter is adjusted by an oracle with knowledge of the test data and of the performance of the other classifier. Fitting a model based on F̃α=0.75, which gives increased weight to recall compared with F̃α=0.5, led to higher recall, as expected. However, we also expected the Fα=0.75 score of the F̃α=0.75-trained classifier to be higher than that of the F̃α=0.5-trained classifier. This is not the case, which could be due to the optimization getting stuck in a local maximum, or it may have been an unreasonable expectation to begin with.</Paragraph>
  </Section>
</Paper>