<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1028">
  <Title>Training Conditional Random Fields with Multivariate Evaluation Measures</Title>
  <Section position="4" start_page="217" end_page="217" type="intro">
    <SectionTitle>
2 CRFs and Training Criteria
</SectionTitle>
    <Paragraph position="0"> Given an input (observation) x2X and parameter vector l = fl1,...,lMg, CRFs define the conditional probability p(yjx) of a particular output y 2 Y as being proportional to a product of potential functions on the cliques of a graph, which represents the interdependency of y and x. That is:</Paragraph>
    <Paragraph position="2"> factor over all output values, Y.</Paragraph>
    <Paragraph position="3"> Following the definitions of (Sha and Pereira, 2003), a log-linear combination of weighted features, Phc(y,x; l) = exp(l fc(y,x)), is used as individual potential functions, where fc represents a feature vector obtained from the corresponding clique c. That is, producttextc[?]C(y,x) Phc(y,x) = exp(l F(y,x)), where F(y,x)=summationtextc fc(y,x) is the CRF's global feature vector for x and y.</Paragraph>
    <Paragraph position="4"> The most probable output ^y is given by ^y = arg maxy[?]Y p(yjx; l). However Zl(x) never affects the decision of ^y since Zl(x) does not depend on y. Thus, we can obtain the following discriminant function for CRFs:</Paragraph>
    <Paragraph position="6"> The maximum (log-)likelihood (ML) of the conditional probability p(yjx; l) of training data f(xk,y[?]k)gNk=1 w.r.t. parameters l is the most basic CRF training criterion, that is, arg maxl summationtextk logp(y[?]kjxk; l), where y[?]k is the correct output for the given xk. Maximizing the conditional log-likelihood given by CRFs is equivalent to minimizing the log-loss function, summationtext k logp(y[?]kjxk; l). We minimize the following loss function for the ML criterion training of CRFs:</Paragraph>
    <Paragraph position="8"> To reduce over-fitting, the Maximum a Posteriori (MAP) criterion of parameters l, that is, arg maxl summationtextk logp(ljy[?]k,xk) /summationtext k logp(y[?]kjxk; l)p(l), is now the most widely used CRF training criterion. Therefore, we minimize the following loss function for the MAP criterion training of CRFs: LMAPl = LMLl [?] logp( ). (2) There are several possible choices when selecting a prior distribution p(l). This paper only considers Lph-norm prior, p(l)/exp( jjljjph/phC), which becomes a Gaussian prior when ph=2. The essential difference between ML and MAP is simply that MAP has this prior term in the objective function. This paper sometimes refers to the ML and MAP criterion training of CRFs as ML/MAP.</Paragraph>
    <Paragraph position="9"> In order to estimate the parameters l, we seek a zero of the gradient over the parameters l:</Paragraph>
    <Paragraph position="11"> The gradient of ML is Eq. 3 without the gradient term of the prior, rlogp(l).</Paragraph>
    <Paragraph position="12"> The details of actual optimization procedures for linear chain CRFs, which are typical CRF applications, have already been reported (Sha and Pereira, 2003).</Paragraph>
  </Section>
class="xml-element"></Paper>