<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1711">
  <Title>A Chinese Efficient Analyser Integrating Word Segmentation, Part-Of-Speech Tagging, Partial Parsing and Full Parsing</Title>
  <Section position="8" start_page="111" end_page="111" type="evalu">
    <SectionTitle>
6 Experimental Results
</SectionTitle>
    <Paragraph position="0"> The Chinese efficient analyser is implemented in C++, providing a rapid and easy code-compile-train-test development cycle. In fact, many NLP systems suffer from a lack of software engineering effort: running efficiency is key to performing numerous experiments, which, in turn, is key to improving performance. A system may achieve excellent performance on a given task, but if it takes a long time to compile and/or run on test data, its rate of improvement will be constrained compared with that of a system that runs very efficiently.</Paragraph>
    <Paragraph position="1"> Moreover, speed plays a critical role in many applications such as text mining.</Paragraph>
    <Paragraph position="2"> All the experiments are conducted on a Pentium II/450MHz PC. Performance is measured in precision, recall and F-measure.</Paragraph>
    <Paragraph position="3"> Here, precision (P) is the number of correct units in the answer file over the total number of units in the answer file, and recall (R) is the number of correct units in the answer file over the total number of units in the key file, while F-measure is the weighted harmonic mean of precision and recall: F = (beta^2 + 1)PR / (beta^2 P + R), which with beta = 1 reduces to F = 2PR / (P + R).</Paragraph>
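The definitions above can be sketched in Python (a minimal illustration; the function name and the toy counts are assumptions, not taken from the paper):

```python
# Precision, recall and F-measure as defined above:
# "answer file" = system output, "key file" = gold standard.
def prf(correct, answer_total, key_total, beta=1.0):
    p = correct / answer_total          # correct units / units in answer file
    r = correct / key_total             # correct units / units in key file
    f = (beta**2 + 1) * p * r / (beta**2 * p + r)
    return p, r, f

p, r, f = prf(correct=90, answer_total=100, key_total=120)
# p = 0.90, r = 0.75, f = 1.35/1.65, about 0.818
```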
    <Section position="1" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
6.1 Word Segmentation and POS Tagging
</SectionTitle>
      <Paragraph position="0"> Table 2 shows the integrated word segmentation and POS tagging results on the Chinese tag bank PFR1.0 of 3.69M Chinese characters (1.12M Chinese words) developed by the Institute of Computational Linguistics at Beijing University. Here, 80% of the corpus is used as formal training data, another 10% as development data and the remaining 10% as formal test data.</Paragraph>
      <Paragraph position="1"> (Table 2: Word Segmentation and POS Tagging; wps: words per second.) Word segmentation corresponds to the bracketing of the chunking model, while POS tagging corresponds to bracketing plus labelling. Table 2 shows that recall (R) is higher than precision (P). The main reason may be the existence of unknown words: in the Chinese efficient analyser, unknown words are segmented into individual Chinese characters, which makes the number of segmented words/POS-tagged words in the system output higher than that in the correct answer.</Paragraph>
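A toy illustration of this unknown-word effect (the lexicon, the segment function and the example words are hypothetical, chosen only to show why the system output contains more units than the key):

```python
# Toy illustration of the unknown-word effect described above:
# out-of-lexicon words fall back to one unit per character, so the
# system emits more units than the gold-standard key file.
LEXICON = {"中国", "经济", "发展"}  # assumed toy dictionary

def segment(word):
    # Keep known words whole; split unknown words into single
    # characters, mirroring the analyser's unknown-word strategy.
    return [word] if word in LEXICON else list(word)

key = ["中国", "克隆", "经济"]                 # gold segmentation: 3 units
system = [u for w in key for u in segment(w)]  # "克隆" is unknown
# system has 4 units ("中国", "克", "隆", "经济"), so precision drops below recall
```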
    </Section>
    <Section position="2" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
6.2 Partial Parsing and Full Parsing
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the results of 1st-level partial parsing and full parsing, using the PARSEVAL evaluation methodology (Black et al. 1991) on the UPenn Chinese Treebank of 100K words developed by the University of Pennsylvania. Here, 80% of the corpus is used as formal training data, another 10% as development data and the remaining 10% as formal test data.</Paragraph>
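PARSEVAL-style bracket scoring can be sketched as follows (a minimal sketch, assuming constituents are compared as (label, start, end) spans; this is an illustration, not the evaluation code used in the paper):

```python
# Minimal PARSEVAL-style scorer: a constituent is a (label, start, end)
# tuple over token positions; a bracket is correct if it appears in
# both the gold and the test parse.
def parseval(gold, test):
    gold_set, test_set = set(gold), set(test)
    correct = len(gold_set.intersection(test_set))
    p = correct / len(test_set)   # labelled precision
    r = correct / len(gold_set)   # labelled recall
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = [("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)]
test = [("NP", 0, 2), ("VP", 3, 5), ("S", 0, 5)]  # one wrong VP span
p, r, f = parseval(gold, test)  # p = r = 2/3
```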
    </Section>
    <Section position="3" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
Table 3: 1st-level Partial Parsing and Full Parsing (wps: words per second)
</SectionTitle>
      <Paragraph position="0"> Table 3 shows that the performance of partial parsing and full parsing is quite low compared to that of state-of-the-art partial parsing and full parsing for English (Zhou et al. 2000a; Collins 1997). The main reason is the small size of the training corpus used in our experiments.</Paragraph>
      <Paragraph position="1"> However, the UPenn Chinese Treebank is the largest corpus we can find for partial parsing and full parsing. Therefore, developing a much larger Chinese treebank (comparable to the UPenn English Treebank) is an urgent task for the Chinese language processing community. Notably, the best individual system (Zhou et al. 2000b) in the CoNLL-2000 chunking shared task for English (Tjong et al. 2000) used the same HMM-based tagging engine.</Paragraph>
    </Section>
  </Section>
</Paper>