
<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1028">
  <Title>Training Conditional Random Fields with Multivariate Evaluation Measures</Title>
  <Section position="7" start_page="221" end_page="222" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> We used the same Chunking and 'English' NER task data used for the shared tasks of CoNLL-2000 (Sang and Buchholz, 2000) and CoNLL-2003 (Sang and De Meulder, 2003), respectively. Chunking data was obtained from the Wall Street Journal (WSJ) corpus: sections 15-18 as training data (8,936 sentences and 211,727 tokens), and section 20 as test data (2,012 sentences and 47,377 tokens), with 11 different chunk-tags, such as NP and VP plus the 'O' tag, which represents the outside of any target chunk (segment). The English NER data was taken from the Reuters Corpus21. The data consists of 203,621, 51,362 and 46,435 tokens from 14,987, 3,466 and 3,684 sentences in training, development and test data, respectively, with four named entity tags, PERSON, LOCATION, ORGANIZATION and MISC, plus the 'O' tag.</Paragraph>
    <Section position="1" start_page="221" end_page="221" type="sub_section">
      <SectionTitle>
5.1 Comparison Methods and Parameters
</SectionTitle>
      <Paragraph position="0"> For ML and MAP, we performed exactly the same training procedure described in (Sha and Pereira, 2003) with L-BFGS optimization. For MCE, we  only considered d() with ps = 1 as described in Sec. 4.2, and used QuickProp optimization2. For MAP, MCE and MCE-F, we used the L2norm regularization. We selected a value of C from 1.0 10n where n takes a value from -5 to 5 in intervals 1 by development data3. The tuning of smoothing function hyper-parameters is not considered in this paper; that is, a=1 and b=0 were used for all the experiments.</Paragraph>
      <Paragraph position="1"> We evaluated the performance by Eq. 13 with g = 1, which is the evaluation measure used in CoNLL-2000 and 2003. Moreover, we evaluated the performance by using the average sentence accuracy, since the conventional ML/MAP objective function reflects this sequential accuracy.</Paragraph>
    </Section>
    <Section position="2" start_page="221" end_page="221" type="sub_section">
      <SectionTitle>
5.2 Features
</SectionTitle>
      <Paragraph position="0"> As regards the basic feature set for Chunking, we followed (Kudo and Matsumoto, 2001), which is the same feature set that provided the best result in CoNLL-2000. We expanded the basic features by using bigram combinations of the same types of features, such as words and part-of-speech tags, within window size 5.</Paragraph>
      <Paragraph position="1"> In contrast to the above, we used the original feature set for NER. We used features derived only from the data provided by CoNLL-2003 with the addition of character-level regular expressions of uppercases [A-Z], lowercases [a-z], digits [0-9] or others, and prefixes and suffixes of one to four letters. We also expanded the above basic features by using bigram combinations within window size 5.</Paragraph>
      <Paragraph position="2"> Note that we never used features derived from external information such as the Web, or a dictionary, which have been used in many previous studies but which are difficult to employ for validating the experiments. null</Paragraph>
    </Section>
    <Section position="3" start_page="221" end_page="222" type="sub_section">
      <SectionTitle>
5.3 Results and Discussion
</SectionTitle>
      <Paragraph position="0"> Our experiments were designed to investigate the impact of eliminating the inconsistency between objective functions and evaluation measures, that is, to compare ML/MAP and MCE-F.</Paragraph>
      <Paragraph position="1"> Table 1 shows the results of Chunking and NER.</Paragraph>
      <Paragraph position="2"> The Fg=1 and 'Sent' columns show the perfor- null train the systems with all but the last 2000 sentences in the training data as a development set to obtain C, and then re-train them with all the training data.</Paragraph>
      <Paragraph position="3">  sentence accuracy, respectively. MCE-F refers to the results obtained from optimizing Eq. 9 based on Eq. 16. In addition, we evaluated the error rate version of MCE. MCE(log) and MCE(sig) indicate that logistic and sigmoid functions are selected for l(), respectively, when optimizing Eq. 9 based on Eq. 10. Moreover, MCE(log) and MCE(sig) used d() based on ps=1, and were optimized using QuickProp; these are the same conditions as used for MCE-F. We found that MCE-F exhibited the best results for both Chunking and NER. There is a significant difference (p&lt;0.01) between MCE-F and ML/MAP with the McNemar test, in terms of the correctness of both individual outputs, yki , and sentences, yk.</Paragraph>
      <Paragraph position="4"> NER data has 83.3% (170524/204567) and 82.6% (38554/46666) of 'O' tags in the training and test data, respectively while the corresponding values of the Chunking data are only 13.1% (27902/211727) and 13.0% (6180/47377). In general, such an imbalanced data set is unsuitable for accuracy-based evaluation. This may be one reason why MCE-F improved the NER results much more than the Chunking results.</Paragraph>
      <Paragraph position="5"> The only difference between MCE(sig) and MCE-F is the objective function. The corresponding results reveal the effectiveness of using an objective function that is consistent as the evaluation measure for the target task. These results show that minimizing the error rate is not optimal for improving the segmentation F-score evaluation measure. Eliminating the inconsistency between the task evaluation measure and the objective function during the training can improve the overall performance.</Paragraph>
      <Paragraph position="6">  While ML/MAP and MCE(log) is convex w.r.t.</Paragraph>
      <Paragraph position="7"> the parameters, neither the objective function of MCE-F, nor that of MCE(sig), is convex. Therefore, initial parameters can affect the optimization  results, since QuickProp as well as L-BFGS can only find local optima.</Paragraph>
      <Paragraph position="8"> The previous experiments were only performed with all parameters initialized at zero. In this experiment, the parameters obtained by the MAPtrained model were used as the initial values of MCE-F and MCE(sig). This evaluation setting appears to be similar to reranking, although we used exactly the same model and feature set.</Paragraph>
      <Paragraph position="9"> Table 2 shows the results of Chunking and NER obtained with this parameter initialization setting. When we compare Tables 1 and 2, we find that the initialization with the MAP parameter values further improves performance.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>