<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1028"> <Title>Training Conditional Random Fields with Multivariate Evaluation Measures</Title> <Section position="5" start_page="217" end_page="219" type="metho"> <SectionTitle> 3 MCE Criterion Training for CRFs </SectionTitle> <Paragraph position="0"> The Minimum Classification Error (MCE) framework first arose out of a broader family of approaches to pattern classifier design known as Generalized Probabilistic Descent (GPD) (Katagiri et al., 1991). The MCE criterion minimizes an empirical loss corresponding to a smooth approximation of the classification error. This MCE loss is itself defined in terms of a misclassification measure derived from the discriminant functions of a given task. Via the smoothing parameters, the MCE loss function can be made arbitrarily close to the binary classification error. An important property of this framework is that it makes it possible in principle to achieve the optimal Bayes error even under incorrect modeling assumptions.</Paragraph> <Paragraph position="1"> It is easy to extend the MCE framework to use evaluation measures other than the classification error, namely the linear combination of error rates.</Paragraph> <Paragraph position="2"> Thus, it is possible to optimize directly a variety of (smoothed) evaluation measures. This is the approach proposed in this article.</Paragraph> <Paragraph position="3"> We first introduce a framework for MCE criterion training, focusing only on error rate optimization. Sec. 
4 then describes an example of minimizing a different multivariate evaluation measure using MCE criterion training.</Paragraph> <Section position="1" start_page="218" end_page="218" type="sub_section"> <SectionTitle> 3.1 Brief Overview of MCE </SectionTitle> <Paragraph position="0"> Let x ∈ X be an input, and y ∈ Y be an output.</Paragraph> <Paragraph position="1"> The Bayes decision rule selects the most probable output ŷ for x by using the maximum a posteriori probability, ŷ = arg max_{y∈Y} p(y|x; λ). In general, p(y|x; λ) can be replaced by a more general discriminant function, that is, ŷ = arg max_{y∈Y} g(y, x, λ). (4)</Paragraph> <Paragraph position="3"> Using the discriminant functions for the possible outputs of the task, the misclassification measure d(·) is defined as follows: d(y*, x, λ) = −g(y*, x, λ) + max_{y∈Y\y*} g(y, x, λ), (5) where y* is the correct output for x. Note that, for a given x, d(·) ≥ 0 indicates misclassification. By using d(·), the minimization of the error rate can be rewritten as the minimization of the sum of 0-1 (step) losses over the given training data. That is, arg min_λ L_λ, where</Paragraph> <Paragraph position="5"> L_λ = (1/N) Σ_{k=1..N} δ(d(y*_k, x_k, λ)). (6) Here, δ(r) is a step function returning 0 if r < 0 and 1 otherwise. That is, δ is 0 if the value of the discriminant function of the correct output, g(y*_k, x_k, λ), is greater than that of the maximum incorrect output, g(y_k, x_k, λ), and δ is 1 otherwise.</Paragraph> <Paragraph position="6"> Eq. 5 is not an appropriate function for optimization since it is discontinuous w.r.t. the parameters λ. One choice of continuous misclassification measure consists of substituting 'max' with 'soft-max', max_k r_k ≈ log Σ_k exp(r_k). As a result, d(y*, x, λ) = −g* + (1/ψ) log[(1/A) Σ_{y∈Y\y*} exp(ψ·g)], (7)</Paragraph> <Paragraph position="8"> where g* = g(y*, x, λ), g = g(y, x, λ), and A = |Y| − 1. ψ is a positive constant that represents the L_ψ-norm. As ψ approaches ∞, Eq. 7 converges to Eq. 5.
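To make the behaviour of Eq. 7 concrete, the following Python sketch computes the soft-max misclassification measure from raw discriminant scores. The function name and the toy scores are illustrative only; the point is that as ψ grows, the value approaches the hard-max measure of Eq. 5.

```python
import math

def softmax_misclassification(g_correct, g_incorrect, psi=2.0):
    """Smoothed misclassification measure d() in the style of Eq. 7 (a sketch).

    g_correct: discriminant score g* of the correct output.
    g_incorrect: scores g(y, x, lambda) for the incorrect outputs.
    psi: L_psi-norm constant; larger psi pushes soft-max toward the true max.
    """
    A = len(g_incorrect)  # A = |Y| - 1
    soft_max = (1.0 / psi) * math.log(
        sum(math.exp(psi * g) for g in g_incorrect) / A
    )
    # d >= 0 indicates (smoothed) misclassification.
    return -g_correct + soft_max
```

Because the 1/A normalizer contributes only −log(A)/ψ, it vanishes as ψ → ∞, which is why Eq. 7 converges to the max-based measure.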
Note that we can design any misclassification measure, including non-linear measures, for d(·). Some examples are shown in the Appendices.</Paragraph> <Paragraph position="9"> Of even greater concern is the fact that the step function δ is discontinuous; minimization of Eq.</Paragraph> <Paragraph position="10"> 6 is therefore NP-complete. In the MCE formalism, δ(·) is replaced with an approximated 0-1 loss function, l(·), which we refer to as a smoothing function. A typical choice for l(·) is the sigmoid function, l_sig(·), which is differentiable and provides a good approximation of the 0-1 loss when the hyper-parameter α is large (see Eq. 8). Another choice is the (regularized) logistic function, l_log(·), which gives an upper bound of the 0-1 loss.</Paragraph> <Paragraph position="11"> Logistic loss is used as a conventional CRF loss function and provides convexity, while the sigmoid function does not. These two smoothing functions can be written as follows: l_sig(d) = [1 + exp(−α·d − β)]^{−1} and l_log(d) = (1/α)·log[1 + exp(α·d + β)], (8)</Paragraph> <Paragraph position="13"> where α and β are the hyper-parameters of the training.</Paragraph> <Paragraph position="14"> We can introduce a regularization term to reduce over-fitting, derived in the same sense as in MAP, Eq. 2. Finally, the objective function of the MCE criterion with the regularization term can be rewritten in the following form: L_λ = F_{l,d,g,λ} + ||λ||²/(2σ²). (9)</Paragraph> <Paragraph position="16"> Then, the objective function of the MCE criterion that minimizes the error rate is Eq. 9, where (1/N) Σ_{k=1..N} l(d(y*_k, x_k, λ)) (10)</Paragraph> <Paragraph position="18"> is substituted for F_{l,d,g,λ}. Since N is constant, we can eliminate the term 1/N in actual use.</Paragraph> </Section> <Section position="2" start_page="218" end_page="219" type="sub_section"> <SectionTitle> 3.2 Formalization </SectionTitle> <Paragraph position="0"> We simply substitute the discriminant function of the CRFs into that of the MCE criterion: g(y, x, λ) = log p(y|x; λ) ∝ λ·F(y, x). (11) Basically, CRF training with the MCE criterion optimizes Eq. 9 with Eq.
11 after the selection of an appropriate misclassification measure, d(·), and smoothing function, l(·). Although there is no restriction on the choice of d(·) and l(·), in this work we select the sigmoid or logistic function for l(·) and Eq. 7 for d(·).</Paragraph> <Paragraph position="1"> The gradient of the loss function Eq. 9 can be decomposed by the following chain rule: ∂L_λ/∂λ = (1/N) Σ_{k=1..N} [∂l(d_k)/∂d_k]·[∂d_k/∂λ] + λ/σ², (12) where d_k = d(y*_k, x_k, λ).</Paragraph> <Paragraph position="3"> The derivatives of l(·) w.r.t. d(·) given in Eq.</Paragraph> <Paragraph position="4"> 8 are written as: ∂l_sig/∂d = α·l_sig·(1 − l_sig) and ∂l_log/∂d = l_sig.</Paragraph> <Paragraph position="5"> The derivative of d(·) of Eq. 7 w.r.t. the parameters λ is written in this form: ∂d(y*, x, λ)/∂λ = −F(y*, x) + Σ_{y∈Y\y*} [exp(ψ·g)/Z_λ(x, ψ)]·F(y, x),</Paragraph> <Paragraph position="7"> where g = λ·F(y, x), g* = λ·F(y*, x), and Z_λ(x, ψ) = Σ_{y∈Y\y*} exp(ψ·g).</Paragraph> <Paragraph position="8"> Note that we can obtain exactly the same loss function as ML/MAP with appropriate choices of F(·), l(·) and d(·). The details are provided in the Appendices. Therefore, ML/MAP can be seen as one special case of the framework proposed here. In other words, our method provides a generalized framework of CRF training.</Paragraph> </Section> <Section position="3" start_page="219" end_page="219" type="sub_section"> <SectionTitle> 3.3 Optimization Procedure </SectionTitle> <Paragraph position="0"> With linear chain CRFs, we can calculate the objective function, Eq. 9 combined with Eq. 10, and the gradient, Eq. 12, by using a variant of the forward-backward and Viterbi algorithms described in (Sha and Pereira, 2003). Moreover, for the parameter optimization process, we can simply exploit gradient descent or quasi-Newton methods such as L-BFGS (Liu and Nocedal, 1989), as in ML/MAP optimization.</Paragraph> <Paragraph position="1"> If we select ψ = ∞ for Eq. 7, we only need to evaluate the correct and the maximum incorrect output.
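The two smoothing functions and their derivatives can be sketched as below. The displayed form of Eq. 8 is lost in this copy, so the parameterization here is an assumption, chosen so that the stated derivatives ∂l_sig/∂d = α·l_sig·(1 − l_sig) and ∂l_log/∂d = l_sig hold; the finite-difference check makes that consistency explicit.

```python
import math

def l_sig(d, alpha=1.0, beta=0.0):
    # Sigmoid-smoothed 0-1 loss: differentiable, bounded in (0, 1).
    return 1.0 / (1.0 + math.exp(-alpha * d - beta))

def l_log(d, alpha=1.0, beta=0.0):
    # Logistic smoothing: convex; exceeds the step loss for sufficiently
    # positive d (an upper bound for suitable alpha, beta).
    return (1.0 / alpha) * math.log(1.0 + math.exp(alpha * d + beta))

def dl_sig(d, alpha=1.0, beta=0.0):
    # d l_sig / d d = alpha * l_sig * (1 - l_sig)
    s = l_sig(d, alpha, beta)
    return alpha * s * (1.0 - s)

def dl_log(d, alpha=1.0, beta=0.0):
    # d l_log / d d = l_sig
    return l_sig(d, alpha, beta)
```

With this parameterization the two analytic derivatives quoted in the text follow directly from the chain rule.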
As we know, the maximum output can be efficiently calculated with the Viterbi algorithm, which is the same as calculating Eq. 1.</Paragraph> <Paragraph position="2"> Therefore, we can find the maximum incorrect output by using the A* algorithm (Hart et al., 1968) if the maximum output is the correct output, and by using the Viterbi algorithm otherwise. One might fear that, since the objective function is not differentiable everywhere for ψ = ∞, problems would occur during optimization. However, it has been shown (Le Roux and McDermott, 2005) that even simple gradient-based (first-order) optimization methods such as GPD, and (approximated) second-order methods such as QuickProp (Fahlman, 1988) and BFGS-based methods, have yielded good experimental optimization results.</Paragraph> </Section> </Section> <Section position="6" start_page="219" end_page="221" type="metho"> <SectionTitle> 4 Multivariate Evaluation Measures </SectionTitle> <Paragraph position="0"> Thus far, we have discussed the error rate version of MCE. Unlike ML/MAP, the framework of MCE criterion training allows the embedding of not only a linear combination of error rates, but also any evaluation measure, including non-linear measures.</Paragraph> <Paragraph position="1"> Several non-linear objective functions, such as F-score for text classification (Gao et al., 2003), and BLEU-score and some other evaluation measures for statistical machine translation (Och, 2003), have been introduced with reference to the framework of MCE criterion training.</Paragraph> <Section position="1" start_page="219" end_page="219" type="sub_section"> <SectionTitle> 4.1 Sequential Segmentation Tasks (SSTs) </SectionTitle> <Paragraph position="0"> Hereafter, we focus solely on CRFs over sequences, namely the linear chain CRF. We assume that x and y have the same length: x = (x1,...,xn) and y = (y1,...,yn).
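As a schematic stand-in for the Viterbi/A* search described above, the sketch below finds the maximum incorrect output by exhaustive enumeration over a toy label set. The scoring function and weights are hypothetical; a real implementation would use Viterbi decoding (falling back to a second-best search such as A* when the best path equals the correct output) rather than enumeration.

```python
from itertools import product

def g(y, weights):
    # Toy additive discriminant: per-position tag scores plus transition
    # bonuses, standing in for lambda . F(y, x) in a linear chain CRF.
    score = sum(weights.get((i, t), 0.0) for i, t in enumerate(y))
    score += sum(weights.get((y[i - 1], y[i]), 0.0) for i in range(1, len(y)))
    return score

def max_incorrect(y_correct, labels, weights):
    """Exhaustive stand-in for the Viterbi/A* search: return the
    highest-scoring output sequence that differs from y_correct."""
    candidates = (y for y in product(labels, repeat=len(y_correct))
                  if y != tuple(y_correct))
    return max(candidates, key=lambda y: g(y, weights))
```

Enumeration is exponential in sequence length, which is exactly why the paper relies on Viterbi/A* for this step; the sketch only illustrates the decision being made.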
In a linear chain CRF, y_i depends only on y_{i−1}.</Paragraph> <Paragraph position="1"> Sequential segmentation tasks (SSTs), such as text chunking (Chunking) and named entity recognition (NER), which constitute the shared tasks of the Conference on Computational Natural Language Learning (CoNLL) 2000, 2002 and 2003, are typical CRF applications. These tasks require the extraction of pre-defined segments, referred to as target segments, from given texts. Fig. 1 shows typical examples of SSTs. These tasks are generally treated as sequential labeling problems incorporating the IOB tagging scheme (Ramshaw and Marcus, 1995). The IOB tagging scheme, of which we only consider the IOB2 variant, is also shown in Fig. 1. B-X, I-X and O indicate that the word in question is the beginning of a segment of type 'X', inside a segment of type 'X', and outside any target segment, respectively. Thus, a segment is defined as a sequence of consecutive outputs.</Paragraph> </Section> <Section position="2" start_page="219" end_page="221" type="sub_section"> <SectionTitle> 4.2 Segmentation F-score Loss for SSTs </SectionTitle> <Paragraph position="0"> The segmentation F_γ-score is defined as F_γ = ((γ² + 1)·TP) / ((γ² + 1)·TP + γ²·FN + FP), (13) where TP, FP and FN represent true positive, false positive and false negative counts, respectively. The individual evaluation units used to calculate TP, FN and FP are not individual outputs y_i or output sequences y, but rather segments. We need to define a segment-wise loss, in contrast to the standard CRF loss, which is sometimes referred to as an (entire) sequential loss (Kakade et al., 2002; Altun et al., 2003). First, we consider the point-wise decision w.r.t. Eq. 1, that is, ŷ_i = arg max_{y_i∈Y_1} g(y, x, i, λ). The point-wise discriminant function can be written as follows: g(y, x, i, λ) = max_{y′∈Y_n∩Y[y_i]} g(y′, x, λ), (14)</Paragraph> <Paragraph position="2"> where Y_j represents the set of all y whose length is j, and Y[y_i] represents the set of all y that contain y_i in the i-th position. Note that the same output ŷ can be obtained with Eqs. 1 and 14, that is, ŷ = (ŷ_1,...,ŷ_n).
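A minimal sketch of how IOB2 tag sequences map to segments, the evaluation units used for the segmentation F-score; the function name and the convention of starting a new segment at a type-changing I- tag are illustrative choices, not taken from the paper.

```python
def iob2_segments(tags):
    """Extract (start, end, type) target segments from an IOB2 tag
    sequence; end is inclusive, and 'O' positions belong to no segment."""
    segments, start, kind = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and kind != tag[2:]):
            if kind is not None:           # close the previous segment
                segments.append((start, i - 1, kind))
            start, kind = i, tag[2:]       # open a new segment
        elif tag == "O":
            if kind is not None:
                segments.append((start, i - 1, kind))
            start, kind = None, None
    if kind is not None:                   # close a segment at end of sequence
        segments.append((start, len(tags) - 1, kind))
    return segments
```

TP, FP and FN for the F-score are then computed by comparing the segment sets extracted from the predicted and correct tag sequences.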
This point-wise discriminant function is different from that described in (Kakade et al., 2002; Altun et al., 2003), which is calculated based on marginals.</Paragraph> <Paragraph position="3"> Let y_{sj} be the output sequence corresponding to the j-th segment of y, where s_j represents a sequence of indices of y, that is, s_j = (s_{j,1},...,s_{j,|sj|}). For example, in the Chunking data shown in Fig. 1, y_{s4} is (B-VP, I-VP), where s_4 = (7,8). Let Y[y_{sj}] be the set of all outputs whose positions from s_{j,1} to s_{j,|sj|} are y_{sj} = (y_{sj,1},...,y_{sj,|sj|}). Then, we can define a segment-wise discriminant function w.r.t. Eq. 1: g(y, x, s_j, λ) = max_{y′∈Y_n∩Y[y_{sj}]} g(y′, x, λ). (15)</Paragraph> <Paragraph position="5"> Note again that the same output ŷ can be obtained using Eqs. 1 and 15, as with the point-wise discriminant function described above. This property is needed for evaluating segments since we do not know the correct segments of the test data; we can maintain consistency even if we use Eq. 1 for testing and Eq. 15 for training. Moreover, Eq. 15 obviously reduces to Eq. 14 if the length of every segment is 1. Then, the segment-wise misclassification measure d(y*, x, s_j, λ) can be obtained simply by replacing the discriminant function of the entire sequence, g(y, x, λ), with the segment-wise one, g(y, x, s_j, λ), in Eq. 7.</Paragraph> <Paragraph position="6"> Let s*_k be the segment sequence corresponding to the correct output y*_k for a given x_k, and S(x_k) be the set of all possible segments for a given x_k. Then, approximated evaluation functions of TP, FP and FN can be defined as follows: TP_λ = Σ_k Σ_{sj∈s*_k} δ(s_j)·[1 − l(d(y*_k, x_k, s_j, λ))], FN_λ = Σ_k Σ_{sj∈s*_k} δ(s_j)·l(d(y*_k, x_k, s_j, λ)), and FP_λ = Σ_k Σ_{s′j∈S(x_k)\s*_k} δ(s′_j)·[1 − l(d(y′, x_k, s′_j, λ))], with y′ an output containing the segment s′_j,</Paragraph> <Paragraph position="8"> where δ(s_j) returns 1 if segment s_j is a target segment, and returns 0 otherwise. For the NER data shown in Fig. 1, 'ORG', 'PER' and 'LOC' are the target segments, while segments that are labeled 'O' in y are not.
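Assuming sigmoid smoothing, the smoothed counts and the resulting F-score-based objective can be sketched numerically as follows. The per-segment misclassification values d(·) are supplied directly, and the exact aggregation is an assumption of this sketch: 1 − l(d) acts as a soft indicator that the model prefers a segment, so it contributes to TP for correct segments and to FP for incorrect ones.

```python
import math

def l_sig(d, alpha=5.0):
    # Sigmoid-smoothed 0-1 loss; bounded in (0, 1), so TP stays non-negative.
    return 1.0 / (1.0 + math.exp(-alpha * d))

def f_score_loss(correct_ds, incorrect_ds, gamma=1.0, alpha=5.0):
    """Smoothed (gamma^2 * FN + FP) / ((gamma^2 + 1) * TP): the quantity
    minimized in place of directly maximizing the segmentation F-score.

    correct_ds:   d() values for the correct target segments.
    incorrect_ds: d() values for candidate incorrect target segments.
    """
    tp = sum(1.0 - l_sig(d, alpha) for d in correct_ds)
    fn = sum(l_sig(d, alpha) for d in correct_ds)
    fp = sum(1.0 - l_sig(d, alpha) for d in incorrect_ds)
    return (gamma ** 2 * fn + fp) / ((gamma ** 2 + 1.0) * tp)
```

When every correct segment wins decisively (d strongly negative) and every incorrect one loses (d strongly positive), the loss approaches 0, matching an F-score near 1.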
Since TP_λ should not take a value less than zero, we select the sigmoid loss as the smoothing function l(·).</Paragraph> <Paragraph position="9"> The second summation in TP_λ and FN_λ is performed over the correct segments s*. In contrast, the second summation in FP_λ takes all possible segments into account, but excludes the correct segments s*. Although an efficient way to evaluate all possible segments has been proposed in the context of semi-Markov CRFs (Sarawagi and Cohen, 2004), we introduce a simple alternative method. If we select ψ = ∞ for d(·) in Eq. 7, we only need to evaluate the segments corresponding to the maximum incorrect output ~y to calculate FP_λ. That is, s′_j ∈ S(x_k)\s*_k can be reduced to s′_j ∈ ~s_k, where ~s_k represents the segments corresponding to the maximum incorrect output ~y.</Paragraph> <Paragraph position="10"> In practice, this reduces the calculation cost, and so we used this method for the experiments described in the next section.</Paragraph> <Paragraph position="11"> Maximizing the segmentation F_γ-score, Eq. 13, is equivalent to minimizing (γ²·FN + FP) / ((γ² + 1)·TP), since Eq. 13 can be rewritten as F_γ = [1 + (γ²·FN + FP) / ((γ² + 1)·TP)]^{−1}. Thus, an objective function closely reflecting the segmentation F_γ-score based on the MCE criterion can be written as Eq. 9 with F_{l,d,g,λ} replaced by: F_λ = (γ²·FN_λ + FP_λ) / ((γ² + 1)·TP_λ). (16)</Paragraph> <Paragraph position="13"> The derivative of Eq. 16 w.r.t. l(·) is given by the following equation: ∂F_λ/∂l = (1/Z_D)·(∂Z_N/∂l) − (Z_N/Z_D²)·(∂Z_D/∂l),</Paragraph> <Paragraph position="15"> where Z_N and Z_D represent the numerator and denominator of Eq. 16, respectively.</Paragraph> <Paragraph position="16"> In the optimization process of the segmentation F-score objective function, we can efficiently calculate Eq. 15 by using the forward and backward Viterbi algorithm, which is almost the same as calculating Eq. 3 with a variant of the forward-backward algorithm (Sha and Pereira, 2003). The same numerical optimization methods described in Sec. 3.3 can be employed for this optimization.</Paragraph> </Section> </Section> </Paper>