<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1028">
  <Title>Shallow Parsing with Conditional Random Fields</Title>
  <Section position="6" start_page="2" end_page="2" type="evalu">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"> All the experiments were performed with our Java implementation of CRFs,designed to handle millions of features, on 1.7 GHz Pentium IV processors with Linux and IBM Java 1.3.0. Minor variants support voted perceptron (Collins, 2002) and MEMMs (McCallum et al., 2000) with the same ef cient feature encoding. GIS, CG, and L-BFGS were used to train CRFs and MEMMs.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
5.1 F Scores
</SectionTitle>
      <Paragraph position="0"> Table 2 gives representative NP chunking F scores for previous work and for our best model, with the complete set of 3.8 million features. The last row of the table gives the score for an MEMM trained with the mixed CG method using an approximate preconditioner. The published F score for voted perceptron is 93.53% with a different feature set (Collins, 2002). The improved result given here is for the supported feature set; the complete feature set gives a slightly lower score of 94.07%. Zhang et al. (2002) reported a higher F score (94.38%) with generalized winnow using additional linguistic features that were not available to us.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
5.2 Convergence Speed
</SectionTitle>
      <Paragraph position="0"> All the results in the rest of this section are for the smaller supported set of 820,000 features. Figures 2a and 2b show how preconditioning helps training convergence.</Paragraph>
      <Paragraph position="1"> Since each CG iteration involves a line search that may require several forward-backward procedures (typically between 4 and 5 in our experiments), we plot the progress of penalized log-likelihood L0 with respect to the number of forward-backward evaluations. The objective function increases rapidly, achieving close proximity to the maximum in a few iterations (typically 10). In contrast, GIS training increases L0 rather slowly, never reaching the value achieved by CG. The relative slowness of iterative scaling is also documented in a recent evaluation of training methods for maximum-entropy classi cation (Malouf, 2002). In theory, GIS would eventually converge to the L0 optimum, but in practice convergence may be so slow that L0 improvements may fall below numerical accuracy, falsely indicating convergence.</Paragraph>
      <Paragraph position="2"> training method time F score L0  preconditioner converges much more slowly than both preconditioned CG and mixed CG training. However, it is still much faster than GIS. We believe that the superior convergence rate of preconditioned CG is due to the use of approximate second-order information. This is conrmed by the performance of L-BFGS, which also uses approximate second-order information.2 Although there is no direct relationship between F scores and log-likelihood, in these experiments F score tends to follow log-likelihood. Indeed, Figure 3 shows that preconditioned CG training improves test F scores much more rapidly than GIS training.</Paragraph>
      <Paragraph position="3"> Table 3 compares run times (in minutes) for reaching a target penalized log-likelihood for various training methods with prior = 1:0. GIS is the only method that failed to reach the target, after 3,700 iterations. We cannot place the voted perceptron in this table, as it does not optimize log-likelihood and does not use a prior. However, it reaches a fairly good F-score above 93% in just two training sweeps, but after that it improves more slowly, to a somewhat lower score, than preconditioned CG training. null</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
5.3 Labeling Accuracy
</SectionTitle>
      <Paragraph position="0"> The accuracy rate for individual labeling decisions is over-optimistic as an accuracy measure for shallow parsing. For instance, if the chunk BIIIIIII is labled as OIIIIIII, the labeling accuracy is 87.5%, but recall is  likelihood, its log-likelihood on the data is actually lower than that of preconditioned CG and mixed CG training.</Paragraph>
      <Paragraph position="1">  such test is McNemar test on paired observations (Gillick and Cox, 1989).</Paragraph>
      <Paragraph position="2"> With McNemar's test, we compare the correctness of the labeling decisions of two models. The null hypothesis is that the disagreements (correct vs. incorrect) are due to chance. Table 4 summarizes the results of tests between the models for which we had labeling decisions. These tests suggest that MEMMs are signi cantly less accurate, but that there are no signi cant differences in accuracy among the other models.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>