<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1028"> <Title>Shallow Parsing with Conditional Random Fields</Title> <Section position="3" start_page="0" end_page="2" type="metho"> <SectionTitle> 2 Conditional Random Fields </SectionTitle> <Paragraph position="0"> We focus here on conditional random fields on sequences, although the notion can be used more generally (Lafferty et al., 2001; Taskar et al., 2002). Such CRFs define conditional probability distributions p(Y|X) of label sequences given input sequences. We assume that the random variable sequences X and Y have the same length, and use x = x_1 ... x_n and y = y_1 ... y_n for the generic input sequence and label sequence, respectively.</Paragraph> <Paragraph position="1"> A CRF on (X, Y) is specified by a vector f of local features and a corresponding weight vector \lambda. Each local feature is either a state feature s(y, x, i) or a transition feature t(y, y', x, i), where y, y' are labels, x an input sequence, and i an input position. To make the notation more uniform, we also write f(y, y', x, i) for any local feature, with state features ignoring their first label argument: s(y, y', x, i) = s(y', x, i) for any state feature s, and t(y, y', x, i) for any transition feature t. Typically, features depend on the inputs around the given position, although they may also depend on global properties of the input, or be non-zero only at some positions, for instance features that pick out the first or last labels.</Paragraph> <Paragraph position="2"> The CRF's global feature vector for input sequence x and label sequence y is given by F(y, x) = \sum_i f(y_{i-1}, y_i, x, i), where i ranges over input positions. The conditional probability distribution defined by the CRF is then p_\lambda(Y|X) = \frac{\exp(\lambda \cdot F(Y, X))}{Z_\lambda(X)}, \quad Z_\lambda(x) = \sum_y \exp(\lambda \cdot F(y, x)). \quad (1) Any positive conditional distribution p(Y|X) that obeys the Markov property p(Y_i | \{Y_j\}_{j \neq i}, X) = p(Y_i | Y_{i-1}, Y_{i+1}, X) can be written in the form (1) for an appropriate choice of feature functions and weight vector (Hammersley and Clifford, 1971).</Paragraph> <Paragraph position="3"> The most probable label sequence for input sequence x is \hat{y} = \arg\max_y p_\lambda(y|x) = \arg\max_y \lambda \cdot F(y, x), because Z_\lambda(x) does not depend on y. F(y, x) decomposes into a sum of terms for consecutive pairs of labels, so the most likely y can be found with the Viterbi algorithm.</Paragraph> <Paragraph position="4"> We train a CRF by maximizing the log-likelihood of a given training set T = \{(x_k, y_k)\}_{k=1}^N, which we assume fixed for the rest of this section: L_\lambda = \sum_k \log p_\lambda(y_k | x_k) = \sum_k \left[ \lambda \cdot F(y_k, x_k) - \log Z_\lambda(x_k) \right], with gradient \nabla L_\lambda = \sum_k \left[ F(y_k, x_k) - E_{p_\lambda(Y|x_k)} F(Y, x_k) \right]. \quad (2) In words, the maximum of the training data likelihood is reached when the empirical average of the global feature vector equals its model expectation. The expectation E_{p_\lambda(Y|x)} F(Y, x) can be computed efficiently using a variant of the forward-backward algorithm. For a given x, define the transition matrix for position i as M_i[y, y'] = \exp(\lambda \cdot f(y, y', x, i)). Let f be any local feature, f_i[y, y'] = f(y, y', x, i), F_f(y, x) = \sum_i f(y_{i-1}, y_i, x, i), and let * denote component-wise matrix product. Then E_{p_\lambda(Y|x)} F_f(Y, x) = \frac{1}{Z_\lambda(x)} \sum_i \alpha_{i-1} (f_i * M_i) \beta_i^T, \quad Z_\lambda(x) = \alpha_n \cdot 1^T, where \alpha_i and \beta_i are the forward and backward state-cost vectors defined by \alpha_i = \alpha_{i-1} M_i for 0 < i \le n (with \alpha_0 the indicator vector of the start state) and \beta_i^T = M_{i+1} \beta_{i+1}^T for 1 \le i < n (with \beta_n = 1). Therefore, we can use a forward pass to compute the \alpha_i and a backward pass to compute the \beta_i and accumulate the feature expectations.</Paragraph>
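<Paragraph> To make the forward-backward computation above concrete, the following is a minimal Python/NumPy sketch, not the authors' implementation; the function name crf_expectations and the dense array layout for feats and lam are assumptions for illustration.

import numpy as np

def crf_expectations(feats, lam):
    """Expected global feature vector E_{p(Y|x)} F(Y, x) and Z(x).

    feats : array of shape (n, L, L, d); feats[i, y_prev, y] is the local
            feature vector f(y_prev, y, x, i)  (hypothetical dense layout).
    lam   : weight vector of shape (d,).
    """
    n, L, _, d = feats.shape
    # Transition matrices M_i[y', y] = exp(lam . f(y', y, x, i))
    M = np.exp(feats @ lam)                        # shape (n, L, L)

    # Forward and backward state-cost vectors, unnormalized as in the text.
    # (alpha[0] is set to all ones here for simplicity instead of an explicit
    #  start state; real implementations also rescale or work in log space
    #  to avoid overflow.)
    alpha = np.zeros((n + 1, L)); alpha[0] = 1.0
    for i in range(1, n + 1):
        alpha[i] = alpha[i - 1] @ M[i - 1]
    beta = np.zeros((n + 1, L)); beta[n] = 1.0
    for i in range(n - 1, -1, -1):
        beta[i] = M[i] @ beta[i + 1]

    Z = alpha[n].sum()

    # E F(Y, x) = sum_i alpha_{i-1} (f_i * M_i) beta_i^T / Z
    E = np.zeros(d)
    for i in range(n):
        P = np.outer(alpha[i], beta[i + 1]) * M[i] / Z   # pairwise marginals
        E += np.einsum('ab,abd->d', P, feats[i])
    return E, Z

The pairwise marginals P computed in the loop are also what the training methods of the next section consume when they need per-position expectations. </Paragraph>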
<Paragraph position="5"> To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior (Chen and Rosenfeld, 1999): L'_\lambda = \sum_k \left[ \lambda \cdot F(y_k, x_k) - \log Z_\lambda(x_k) \right] - \frac{\|\lambda\|^2}{2\sigma^2} + \text{const}.</Paragraph> </Section> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Training Methods </SectionTitle> <Paragraph position="0"> Lafferty et al. (2001) used iterative scaling algorithms for CRF training, following earlier work on maximum-entropy models for natural language (Berger et al., 1996; Della Pietra et al., 1997). Those methods are very simple and guaranteed to converge, but as Minka (2001) and Malouf (2002) showed for classification, their convergence is much slower than that of general-purpose convex optimization algorithms when many correlated features are involved. Concurrently with the present work, Wallach (2002) tested conjugate gradient and second-order methods for CRF training, showing significant training speed advantages over iterative scaling on a small shallow parsing problem. Our work shows that preconditioned conjugate gradient (CG) (Shewchuk, 1994) and limited-memory quasi-Newton (L-BFGS) (Nocedal and Wright, 1999) perform comparably on very large problems (around 3.8 million features). We compare those algorithms to generalized iterative scaling (GIS) (Darroch and Ratcliff, 1972), non-preconditioned CG, and voted perceptron training (Collins, 2002). All algorithms except voted perceptron maximize the penalized log-likelihood: \lambda^* = \arg\max_\lambda L'_\lambda. However, for ease of exposition, this discussion of training methods uses the unpenalized log-likelihood L_\lambda.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.1 Preconditioned Conjugate Gradient </SectionTitle> <Paragraph position="0"> Conjugate-gradient (CG) methods have been shown to be very effective in linear and non-linear optimization (Shewchuk, 1994). Instead of searching along the gradient, conjugate gradient searches along a carefully chosen linear combination of the gradient and the previous search direction.</Paragraph> <Paragraph position="1"> CG methods can be accelerated by linearly transforming the variables with a preconditioner (Nocedal and Wright, 1999; Shewchuk, 1994). The purpose of the preconditioner is to improve the condition number of the quadratic form that locally approximates the objective function, so the inverse of the Hessian is a reasonable preconditioner. However, this is not directly applicable to CRFs for two reasons. First, the size of the Hessian is dim(\lambda)^2, leading to unacceptable space and time requirements for the inversion. In such situations, it is common to use instead the inverse of the diagonal of the Hessian. However, in our case the Hessian has the form \nabla^2 L_\lambda = -\sum_k \left( E[F(Y, x_k) F(Y, x_k)^T] - E[F(Y, x_k)] \, E[F(Y, x_k)]^T \right), where the expectations are taken with respect to p_\lambda(Y|x_k). Therefore, every Hessian element, including the diagonal ones, involves the expectation of a product of global feature values. Unfortunately, computing those expectations is quadratic in sequence length, as the forward-backward algorithm can only compute expectations of quantities that are additive along label sequences. We solve both problems by discarding the off-diagonal terms and approximating the expectation of the square of a global feature by the expectation of the sum of squares of the corresponding local features at each position. The approximated diagonal term H_f for feature f then has the form H_f = -\sum_k \left( \sum_i E\left[f(Y_{i-1}, Y_i, x_k, i)^2\right] - \left( E\, F_f(Y, x_k) \right)^2 \right). If this approximation is semidefinite, which is trivial to check, its inverse is an excellent preconditioner for early iterations of CG training. However, when the model is close to the maximum, the approximation becomes unstable, which is not surprising since it is based on feature independence assumptions that become invalid as the weights of interaction features move away from zero.</Paragraph> <Paragraph position="2"> Therefore, we disable the preconditioner after a certain number of iterations, determined from held-out data. We call this strategy mixed CG training.</Paragraph> </Section>
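<Paragraph> The following Python sketch illustrates the diagonal approximation just described; it is not the authors' code, and the array layouts (feats_list, pairwise_marginals, expectations) are assumptions that mirror the forward-backward sketch in Section 2.

import numpy as np

def approx_curvature(feats_list, pairwise_marginals, expectations, d):
    """Approximate (negated) diagonal Hessian of the log-likelihood.

    For each feature f this accumulates, over training sequences k,
      sum_i E[f(Y_{i-1}, Y_i, x_k, i)^2]  -  (E[F_f(Y, x_k)])^2,
    i.e. the expectation of the square of the global feature is replaced
    by the expectation of the sum of squares of the local features.
    Assumed layouts:
      feats_list[k]         : (n_k, L, L, d) local feature values
      pairwise_marginals[k] : (n_k, L, L)    p(Y_{i-1}, Y_i | x_k)
      expectations[k]       : (d,)           E F(Y, x_k)
    """
    C = np.zeros(d)
    for feats, P, E in zip(feats_list, pairwise_marginals, expectations):
        C += np.einsum('iab,iabd->d', P, feats ** 2) - E ** 2
    return C

def preconditioned_direction(grad, C):
    """Divide the gradient componentwise by the approximate curvature when
    the approximation passes the (trivial) sign check; otherwise fall back
    to the unpreconditioned gradient."""
    return grad / C if np.all(C > 0) else grad

In mixed CG training, the fallback branch corresponds to simply switching the preconditioner off after the chosen number of iterations. </Paragraph>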
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Limited-Memory Quasi-Newton </SectionTitle> <Paragraph position="0"> Newton methods for nonlinear optimization use second-order (curvature) information to find search directions. As discussed in the previous section, it is not practical to obtain exact curvature information for CRF training. Limited-memory BFGS (L-BFGS) is a second-order method that estimates the curvature numerically from previous gradients and updates, avoiding the need for an exact Hessian inverse computation. Compared with preconditioned CG, L-BFGS can also handle large-scale problems but does not require a specialized Hessian approximation. An earlier study indicates that L-BFGS performs well in maximum-entropy classifier training (Malouf, 2002).</Paragraph> <Paragraph position="1"> There is no theoretical guidance on how much information from previous steps we should keep to obtain sufficiently accurate curvature estimates. In our experiments, storing 3 to 10 pairs of previous gradients and updates worked well, so the extra memory required over preconditioned CG was modest. A more detailed description of this method can be found elsewhere (Nocedal and Wright, 1999).</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 Voted Perceptron </SectionTitle> <Paragraph position="0"> Unlike the other methods discussed so far, voted perceptron training (Collins, 2002) attempts to minimize the difference between the global feature vector for a training instance and the same feature vector for the best-scoring labeling of that instance according to the current model. More precisely, for each training instance the method computes a weight update \lambda_{t+1} = \lambda_t + F(y_k, x_k) - F(\hat{y}_k, x_k), in which \hat{y}_k is the Viterbi path \hat{y}_k = \arg\max_y \lambda_t \cdot F(y, x_k). Like the familiar perceptron algorithm, this algorithm repeatedly sweeps over the training instances, updating the weight vector as it considers each instance. Instead of taking just the final weight vector, the voted perceptron algorithm takes the average of the \lambda_t. Collins (2002) reported, and we confirmed, that this averaging reduces overfitting considerably.</Paragraph>
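<Paragraph> A minimal sketch of this update in Python, assuming hypothetical helpers global_features(x, y) returning F(y, x) and viterbi_decode(lam, x) returning the best labeling under the current weights:

import numpy as np

def voted_perceptron(train, global_features, viterbi_decode, d, epochs=10):
    """Averaged ('voted') perceptron training for a sequence model.

    train            : list of (x, y) training instances
    global_features  : callable returning the global feature vector F(y, x)
    viterbi_decode   : callable returning argmax_y  lam . F(y, x)
    d                : number of features
    Returns the average of the weight vectors over all updates.
    """
    lam = np.zeros(d)
    lam_sum = np.zeros(d)
    updates = 0
    for _ in range(epochs):
        for x, y in train:
            y_hat = viterbi_decode(lam, x)   # best labeling under current weights
            lam += global_features(x, y) - global_features(x, y_hat)
            lam_sum += lam
            updates += 1
    return lam_sum / updates                 # averaging reduces overfitting

</Paragraph>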
</Section> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Shallow Parsing </SectionTitle> <Paragraph position="0"> Figure 1 shows the base NPs in an example sentence. Following Ramshaw and Marcus (1995), the input to the NP chunker consists of the words in a sentence annotated automatically with part-of-speech (POS) tags. The chunker's task is to label each word with a label indicating whether the word is outside a chunk (O), starts a chunk (B), or continues a chunk (I). For example, the tokens in the first line of Figure 1 would be labeled BIIBIIOBOBIIO.</Paragraph> <Paragraph position="1"> [Figure 1 (example sentence with base NPs): Rockwell International Corp. 's Tulsa unit said it signed a tentative agreement extending its contract with Boeing Co. to provide structural parts for Boeing 's 747 jetliners .]</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.1 Data Preparation </SectionTitle> <Paragraph position="0"> NP chunking results have been reported on two slightly different data sets: the original RM data set of Ramshaw and Marcus (1995), and the modified CoNLL-2000 version of Tjong Kim Sang and Buchholz (2000). Although the chunk tags in the RM and CoNLL-2000 data sets are somewhat different, we found no significant accuracy differences between models trained on these two data sets. Therefore, all our results are reported on the CoNLL-2000 data set. We also used a development test set, provided by Michael Collins, derived from WSJ section 21 tagged with the Brill (1995) POS tagger.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.2 CRFs for Shallow Parsing </SectionTitle> <Paragraph position="0"> Our chunking CRFs have a second-order Markov dependency between chunk tags. This is easily encoded by making the CRF labels pairs of consecutive chunk tags. That is, the label at position i is y_i = c_{i-1}c_i, where c_i is the chunk tag of word i, one of O, B, or I. Since B must be used to start a chunk, the label OI is impossible. In addition, successive labels are constrained: y_{i-1} = c_{i-2}c_{i-1} and y_i = c_{i-1}c_i must agree on the shared chunk tag c_{i-1}. These constraints on the model topology are enforced by giving appropriate features a weight of -\infty, forcing all the forbidden labelings to have zero probability.</Paragraph> <Paragraph position="1"> Our choice of features was mainly governed by computing power, since we do not use feature selection and all features are used in training and testing. We use the following factored representation for features: f(y_{i-1}, y_i, x, i) = p(x, i) \, q(y_{i-1}, y_i), where p(x, i) is a predicate on the input sequence x and current position i, and q(y_{i-1}, y_i) is a predicate on pairs of labels. For instance, p(x, i) might be "word at position i is the" or "the POS tags at positions i-1, i are DT, NN". Because the label set is finite, such a factoring of f(y_{i-1}, y_i, x, i) is always possible, and it allows each input predicate to be evaluated just once for the many features that use it, making it possible to work with millions of features on large training sets.</Paragraph>
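<Paragraph> As an illustration of the pair-of-chunk-tags labels and the factored feature representation, here is a small Python sketch; the predicate inventory is a stand-in for illustration, not the feature set of Table 1.

from itertools import product

CHUNK_TAGS = ["O", "B", "I"]
# Second-order labels are pairs of consecutive chunk tags; OI cannot occur
# because a chunk must start with B.
LABELS = [c1 + c2 for c1, c2 in product(CHUNK_TAGS, repeat=2)
          if (c1, c2) != ("O", "I")]

def input_predicates(words, tags, i):
    """A few example input predicates p(x, i) at position i
    (hypothetical; a stand-in for the full predicate set)."""
    preds = [f"w[0]={words[i]}", f"t[0]={tags[i]}"]
    if i > 0:
        preds.append(f"t[-1]|t[0]={tags[i-1]}|{tags[i]}")
    return preds

def features(words, tags, i, prev_label, label):
    """Factored features f(y_{i-1}, y_i, x, i) = p(x, i) q(y_{i-1}, y_i):
    each input predicate is evaluated once and then paired with the
    label-pair predicate for every allowed (prev_label, label)."""
    q = f"y[-1]|y[0]={prev_label}|{label}"
    return [f"{p} :: {q}" for p in input_predicates(words, tags, i)]

</Paragraph>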
<Paragraph position="2"> Table 1 summarizes the feature set. For a given position i, w_i is the word, t_i its POS tag, and y_i its label. For any label y = c'c, c(y) = c is the corresponding chunk tag; for example, c(OB) = B. The use of chunk tags as well as labels provides a form of backoff from the very small feature counts that may arise in a second-order model, while still allowing significant associations between tag pairs and input predicates to be modeled. To save time in some of our experiments, we used only the 820,000 features that are supported in the CoNLL training set, that is, the features that are on at least once. For our highest F score, we used the complete feature set, around 3.8 million in the CoNLL training set, which contains all the features whose predicate is on at least once in the training set. The complete feature set may in principle perform better because it can place negative weights on transitions that should be discouraged if a given predicate is on.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.3 Parameter Tuning </SectionTitle> <Paragraph position="0"> As discussed previously, we need a Gaussian weight prior to reduce overfitting. We also need to choose the number of training iterations, since we found that the best F score is attained while the log-likelihood is still improving. The reasons for this are not clear, but the Gaussian prior may not be enough to keep the optimization from making weight adjustments that slightly improve training log-likelihood but cause large F score fluctuations. We used the development test set mentioned in Section 4.1 to set the prior and the number of iterations.</Paragraph> </Section> <Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.4 Evaluation Metric </SectionTitle> <Paragraph position="0"> The standard evaluation metrics for a chunker are precision P (the fraction of output chunks that exactly match the reference chunks), recall R (the fraction of reference chunks returned by the chunker), and their harmonic mean, the F1 score F1 = 2PR/(P + R) (which we call just F score in what follows). The relationships between F score and labeling error or log-likelihood are not direct, so we report both F score and the other metrics for the models we tested. For comparisons with other reported results we use F score.</Paragraph> </Section> <Section position="5" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.5 Significance Tests </SectionTitle> <Paragraph position="0"> Ideally, comparisons among chunkers would control for feature sets, data preparation, training and test procedures, and parameter tuning, and would estimate the statistical significance of performance differences. Unfortunately, reported results sometimes leave out details needed for accurate comparisons. We report F scores for comparison with previous work, but we also give statistical significance estimates using McNemar's test for those methods that we evaluated directly.</Paragraph> <Paragraph position="1"> Testing the significance of F scores is tricky because the wrong chunks generated by two chunkers are not directly comparable. Yeh (2000) examined randomized tests for estimating the significance of F scores, in particular the bootstrap over the test set (Efron and Tibshirani, 1993; Sang, 2002). However, bootstrap variances in preliminary experiments were too high to allow any conclusions, so we used instead a McNemar paired test on labeling disagreements (Gillick and Cox, 1989).</Paragraph> </Section> </Section> </Paper>