<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1145">
  <Title>Building a Large-Scale Annotated Chinese Corpus</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> In this paper we address issues related to building a large-scale Chinese corpus. We try to answer four questions: (i) how to speed up annotation, (ii) how to maintain high annotation quality, (iii) for what purposes the corpus is applicable, and finally (iv) what future work we anticipate.</Paragraph>
    <Paragraph position="1"> Introduction The Penn Chinese Treebank (CTB) is an ongoing project whose objective is to create a segmented Chinese corpus annotated with POS tags and syntactic brackets. The first installment of the project (CTB-I) consists of Xinhua newswire from 1994 to 1998, totaling 100,000 words, fully segmented, POS-tagged and syntactically bracketed, and it has been released to the public via the Penn Linguistic Data Consortium (LDC). The preliminary results of this phase of the project were reported in Xia et al. (2000). The second installment of the project, the 400,000-word CTB-II, is currently being developed and is expected to be completed early in 2003. CTB-II will follow the standards set in the segmentation (Xia 2000b), POS tagging (Xia 2000a) and bracketing (Xue and Xia 2000) guidelines. In an effort to diversify the sources, it will use articles from People's Daily, Hong Kong newswire and material translated into Chinese from other languages, in addition to the Xinhua newswire used in CTB-I.</Paragraph>
    <Paragraph position="2"> The availability of CTB-I changed our approach to CTB-II considerably. Because CTB-I exists, we were able to train new automatic Chinese language processing (CLP) tools, which crucially use annotated corpora as training material. These tools are then used for preprocessing in the development of CTB-II. We also developed tools to control the quality of the corpus. In this paper, we address three issues in the development of the Chinese Treebank: annotation speed, annotation accuracy and usability of the corpus. Specifically, we attempt to answer four questions: (i) how do we speed up the annotation process, (ii) how do we maintain high quality, i.e., annotation accuracy and inter-annotator consistency, during the annotation process, (iii) for what purposes is the corpus applicable, and (iv) what are our future plans? Although we will touch upon linguistic problems that are specific to Chinese, we believe these issues are general enough to apply to the development of any single-language corpus.</Paragraph>
  </Section>
</Paper>