
<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1081">
  <Title>Chinese Segmentation and New Word Detection using Conditional Random Fields</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Experiments and Analysis
</SectionTitle>
    <Paragraph position="0"> To make a comprehensive evaluation, we use all four of the datasets from a recent Chinese word segmentation bake-off competition (Sproat and Emerson, 2003). These datasets represent four different segmentation standards. A summary of the datasets is shown in Table 1. The standard bake-off scoring program is used to calculate precision, recall, F1, and OOV word recall.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Experimental design
</SectionTitle>
      <Paragraph position="0"> Since CTB and PK are provided in the GB encoding while AS and HK use the Big5 encoding, we convert AS and HK datasets to GB in order to make cross-training-and-testing possible. Note that this conversion could potentially worsen performance slightly due to a few conversion errors.</Paragraph>
      <Paragraph position="1"> We use cross-validation to choose Markov-order and perform feature selection. Thus, each training set is randomly split--80% used for training and the remaining 20% for validation--and based on validation set performance, choices are made for model structure, prior, and which word lexicons to include.</Paragraph>
      <Paragraph position="2"> The choices of prior and model structure shown in Table 2 are used for our final testing.</Paragraph>
      <Paragraph position="3"> We conduct closed and open tests on all four datasets. The closed tests use only material from the training data for the particular corpus being tested.</Paragraph>
      <Paragraph position="4"> Open tests allows using other material, such as lexicons from Internet. In open tests, we use lexicons obtained from various resources as described  in Section 3.1. In addition, we conduct cross-dataset tests, in which we train on one dataset and test on other datasets.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Overall results
</SectionTitle>
      <Paragraph position="0"> Final results of CRF based segmentation with new word detection are summarized in Table 3. The upper part of the table contains the closed test results, and the lower part contains the open test results.</Paragraph>
      <Paragraph position="1"> Each entry is the performance of the given metric (precision, recall, F1, and Roov) on the test set.</Paragraph>
      <Paragraph position="2">  To compare our results against other systems, we summarize the competition results reported in (Sproat and Emerson, 2003) in Table 4. XXc and XXo indicate the closed and open runs on dataset XX respectively. Entries contain the F1 performance of each participating site on different runs, with the best performance in bold. Our results are in the last row. Column SITE-AVG is the average F1 performance over the datasets on which a site reported results. Column OUR-AVG is the average F1 performance of our system over the same datasets.</Paragraph>
      <Paragraph position="3"> Comparing performance across systems is difficult since none of those systems reported results on all eight datasets (open and closed runs on 4 datasets). Nevertheless, several observations could be made from Table 4. First, no single system achieved best results in all tests. Only one site (S01) achieved two best runs (CTBc and PKc) with an average of 91.8% over 6 runs. S01 is one of the best segmentation systems in mainland China (Zhang et al., 2003). We also achieve two best runs (ASo and HKc), with a comparable average of 91.9% over the same 6 runs, and a 92.7% average over all the 8 runs.</Paragraph>
      <Paragraph position="4"> Second, performance varies significantly across different datasets, indicating that the four datasets have different characteristics and use very different segmentation guidelines. We also notice that the worst results were obtained on CTB dataset for all systems. This is due to significant inconsistent segmentation in training and testing (Sproat and Emerson, 2003). We verify this by another test. We randomly split the training data into 80% training and 20% testing, and run the experiments for 3 times, resulting in a testing F1 of 97:13%. Third, consider a comparison of our results with site S12, who use a sliding-window maximum entropy model (Xue, 2003). They participated in two datasets, with an average of 93.8%. Our average over the same two runs is 94.2%. This gives some empirical evidence of the advantages of linear-chain CRFs over sliding-window maximum entropy models, however, this comparison still requires further investigation since there are many factors that could affect the performance such as different features used in both systems. null To further study the robustness of our approach to segmentation, we perform cross-testing--that is, training on one dataset and testing on other datasets. Table 5 summarizes these results, in which the rows are the training datasets and the columns are the testing datasets. Not surprisingly, cross testing results are worse than the results using the same ASc ASo CTBc CTBo HKc HKo PKc PKo SITE-AVG OUR-AVG  competition; the second to the ninth columns contain their results on the 8 runs, where a bold entry is the winner of that run; column SITE-AVG contains the average performance of the site over the runs in which it participated, where a bold entry indicates that this site performs better than our system; column OUR-AVG is the average of our system over the same runs, where a bolded entry indicates our system performs better than the other site; the last row is the performance of our system over all the runs and the overall average. source as training due to different segmentation policies, with an exception on CTB where models trained on other datasets perform better than the model trained on CTB itself. This is due to the data problem mentioned above. Overall, CRFs perform robustly well across all datasets.</Paragraph>
      <Paragraph position="5"> From both Table 3 and 5, we see, as expected, improvement from closed tests to open tests, indicating the significant contribution of domain knowledge lexicons.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Effects of new word detection
</SectionTitle>
      <Paragraph position="0"> Table 6 shows the effect of new word detection on the closed tests. An interesting observation is  the results without new word detection and NWD is the results with new word detection.</Paragraph>
      <Paragraph position="1"> that the improvement is monotonically related to the OOV rate (OOV rates are listed in Table 1). This is desirable because new word detection is most needed in situations that have high OOV rate. At low OOV rate, noisy new word detection can result in worse performance, as seen in the AS dataset.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Error analysis and discussion
</SectionTitle>
      <Paragraph position="0"> Several typical errors are observed in error analysis. One typical error is caused by inconsistent segmentation labeling in the test set. This is most notorious in CTB dataset. The second most typical error is in new, out-of-vocabulary words, especially proper names. Although our new word detection fixes many of these problems, it is not effective enough to recognize proper names well. One solution to this problem could use a named entity extractor to recognize proper names; this was found to be very helpful in Wu (2003).</Paragraph>
      <Paragraph position="1"> One of the most attractive advantages of CRFs (and maximum entropy models in general) is its the flexibility to easily incorporate arbitrary features, here in the form domain-knowledge-providing lexicons. However, obtaining these lexicons is not a trivial matter. The quality of lexicons can affect the performance of CRFs significantly. In addition, compared to simple models like n-gram language models (Teahan et al., 2000), another shortcoming of CRF-based segmenters is that it requires significantly longer training time. However, training is a one-time process, and testing time is still linear in the length of the input.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>