<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1086">
  <Title>Using Conditional Random Fields to Predict Pitch Accents in Conversational Speech</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Conditional Random Fields
</SectionTitle>
    <Paragraph position="0"> CRFs can be considered as a generalization of logistic regression to label sequences. They define a conditional probability distribution of a label sequence y given an observation sequence x. In this paper, x = (x1;x2;:::;xn) denotes a sentence of length n and y = (y1;y2;:::;yn) denotes the label sequence corresponding to x. In pitch accent prediction, xt is a word and yt is a binary label denoting whether xt is accented or not.</Paragraph>
    <Paragraph position="1"> CRFs specify a linear discriminative function F parameterized by over a feature representation of the observation and label sequence (x;y). The model is assumed to be stationary, thus the feature representation can be partitioned with respect to positions tin the sequence and linearly combined with respect to the importance of each feature k, denoted by k. Then the discriminative function can be stated as in Equation 1:</Paragraph>
    <Paragraph position="3"> Then, the conditional probability is given by</Paragraph>
    <Paragraph position="5"> tion constant which is computed by summing over all possible label sequences y of the observation sequence x.</Paragraph>
    <Paragraph position="6"> We extract two types of features from a sequence pair:  1. Current label and information about the observation sequence, such as part-of-speech tag of a word that is within a window centered at the word currently labeled, e.g. Is the current word pitch accented and the part-of-speech tag of the previous word=Noun? 2. Current label and the neighbors of that label,  i.e. features that capture the inter-label dependencies, e.g. Is the current word pitch accented and the previous word not accented? Since CRFs condition on the observation sequence, they can efficiently employ feature representations that incorporate overlapping features, i.e. multiple interacting features or long-range dependencies of the observations, as opposed to HMMs which generate observation sequences.</Paragraph>
    <Paragraph position="7"> In this paper, we limit ourselves to 1-order Markov model features to encode inter-label dependencies. The information used to encode the observation-label dependencies is explained in detail in Section 4.</Paragraph>
    <Paragraph position="8"> In CRFs, the objective function is the log-loss of the model with parameters with respect to a training set D. This function is defined as the negative sum of the conditional probabilities of each training label sequence yi, given the observation sequence xi, where D f(xi;yi) : i = 1;:::;mg. CRFs are known to overfit, especially with noisy data if not regularized. To overcome this problem, we penalize the objective function by adding a Gaussian prior (a term proportional to the squared norm jj jj2) as suggested in (Johnson et al., 1999). Then the loss function is given as:</Paragraph>
    <Paragraph position="10"> where c is a constant.</Paragraph>
    <Paragraph position="11"> Lafferty et al. (2001), proposed a modification of improved iterative scaling for parameter estimation in CRFs. However, gradient-based methods have often found to be more efficient for minimizing Equation 3 (Minka, 2001; Sha and Pereira, 2003).</Paragraph>
    <Paragraph position="12"> In this paper, we use the conjugate gradient descent method to optimize the above objective function.</Paragraph>
    <Paragraph position="13"> The gradients are computed as in Equation 4:</Paragraph>
    <Paragraph position="15"> where the expectation is with respect to all possible label sequences of the observation sequence xi and can be computed using the forward backward algorithm.</Paragraph>
    <Paragraph position="16"> Given an observation sequence x, the best label sequence is given by:</Paragraph>
    <Paragraph position="18"> where ^ is the parameter vector that minimizes L( ;D). The best label sequence can be identified by performing the Viterbi algorithm.</Paragraph>
  </Section>
class="xml-element"></Paper>