<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1114">
  <Title>The Construction of A Chinese Shallow Treebank</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Overview and Design Principles
</SectionTitle>
    <Paragraph position="0"> The objective of this project is to manually construct a large shallow Treebank with high accuracy and consistency.</Paragraph>
    <Paragraph position="1"> The design principles of the PolyU Treebank are high resource-sharing ability, low structural complexity, sufficient syntactic information, and large data scale. First, the PolyU Treebank is designed and constructed to be as general-purpose as possible, so that different applications can use it as an NLP resource. With this objective, we chose the well-known Phrase-based Grammar as the annotation framework: this grammar is widely accepted by Chinese language researchers, so our work can be easily understood and accepted.</Paragraph>
    <Paragraph position="2"> Because Chinese text has no word delimiters, word segmentation must be performed before any further syntactic annotation, and high segmentation accuracy is very important for this project. We chose to use the segmented and tagged People's Daily corpus annotated by Peking University. The annotated corpus contains articles that appeared in the People's Daily newspaper in 1998. The segmentation follows the guidelines given in the Chinese national standard GB13715 (Liu et al. 1993), and the POS tagging specification was developed according to the 'Grammatical Knowledge-base of contemporary Chinese'.</Paragraph>
    <Paragraph position="3"> According to the report from Peking University, the accuracies of this annotated corpus in segmentation and POS tagging are 99.9% and 99.5%, respectively (Yu et al. 2001). Using such a mature and widely adopted resource effectively reduces our cost and helps ensure the quality of the syntactic annotation. With consistency in segmentation, POS, and syntactic annotation, the resulting Treebank can be readily shared with other researchers as a public resource.</Paragraph>
    <Paragraph position="4"> The second design principle is low structural complexity. That is, the annotation framework should be clear and simple, and the labeled syntactic and functional information should be commonly used and accepted. Considering the characteristics of shallow annotation, our project focuses on the annotation of phrases and headwords, while sentence-level syntax is ignored.</Paragraph>
    <Paragraph position="5"> Following the framework of Phrase-based Grammar, a base-phrase is regarded as the smallest unit, where a base-phrase is defined as a 'stable' and 'simple' phrase without nested components. Studies of Chinese syntactic analysis suggest that phrases, rather than words, should be the fundamental unit in a sentence. First, the usage of Chinese words is very flexible: a word may take different POS tags and serve different functions in sentences. In contrast, the use of Chinese phrases is much more stable; that is, a phrase has very limited functional use in a sentence. Second, the construction rules of Chinese phrases are nearly the same as those of Chinese sentences. Therefore, the analysis of phrases can help identify the POS and grammatical functions of words, and the phrase should naturally be regarded as the basic syntactic unit. Usually, a base-phrase is driven by a lexical word as its headword. Examples of base-phrases include base NP, base VP and so on, such as the sample shown below.</Paragraph>
    <Paragraph position="6"> Using base-phrases as the starting point, nested levels of phrases are then identified until the maximal-phrases (defined later) are reached. Since we do not intend to provide full parsing information, there has to be a limit on the level of nesting. For practical reasons, we choose to limit the nesting of brackets to 3 levels. That is, the depth of our shallow parsed Treebank is limited to 3. This restriction keeps the structural complexity at a manageable level.</Paragraph>
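The 3-level nesting limit can be illustrated with a small depth check. This is only a minimal sketch under stated assumptions: the Phrase class, the functions bracket_depth and within_limit, the labels BNP, NP, VP and IP, and the placeholder words w0 to w3 are all hypothetical names introduced here and are not taken from the Treebank specification.

```python
from dataclasses import dataclass, field
from typing import List, Union

MAX_DEPTH = 3  # nesting limit stated in the text


@dataclass
class Phrase:
    """A bracketed phrase (hypothetical representation, not the Treebank format)."""
    label: str
    children: List[Union["Phrase", str]] = field(default_factory=list)


def bracket_depth(node: Union[Phrase, str]) -> int:
    """Return the number of nested bracket levels at or below `node`."""
    if isinstance(node, str):  # a bare word adds no bracket level
        return 0
    return 1 + max((bracket_depth(child) for child in node.children), default=0)


def within_limit(node: Phrase, max_depth: int = MAX_DEPTH) -> bool:
    """True when the bracketing under `node` is no deeper than `max_depth`."""
    return max_depth >= bracket_depth(node)


# Placeholder words and labels, chosen only for illustration.
base = Phrase("BNP", ["w1", "w2"])    # level 1: no nested phrase inside
mid = Phrase("NP", [base, "w3"])      # level 2: wraps the base-phrase
maximal = Phrase("VP", ["w0", mid])   # level 3: outermost bracket

print(bracket_depth(maximal))                  # 3
print(within_limit(maximal))                   # True
print(within_limit(Phrase("IP", [maximal])))   # False: a fourth level
```

A check of this kind could be run over each annotated sentence to flag brackets that exceed the limit; how such violations were handled in practice is not described here.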
    <Paragraph position="7"> Our nested bracketing is not strictly bottom up.</Paragraph>
    <Paragraph position="8"> That is, we do not simply extend from a base-phrase and move up until the 3rd level. Instead, we first identify the maximal-phrase, which is used to identify the backbone of the sentence. The maximal-phrase provides the framework under which base-phrases of up to 2 levels can be identified. The principles for identifying the scope and depth of phrase bracketing are briefly explained below, and the operating procedure follows the order in which these principles are presented. More details are given in Section 5.</Paragraph>
    <Paragraph position="9"> Step 1: Annotation of the maximal-phrase, which is the shortest word sequence of maximally spanning non-overlapping edges that plays a distinct semantic role of a predicate. A maximal-phrase contains two or more lexical words.</Paragraph>
    <Paragraph position="10"> Step 2: Annotation of base-phrases within a maximal-phrase. In case a base-phrase and a maximal-phrase are identical and the maximal-phrase is already bracketed in Step 1, no bracketing is done in this step. For each identified base-phrase, its headword will be marked.</Paragraph>
    <Paragraph position="11"> Step 3: Annotation of the next level of bracketing, called the mid-phrase, which is expanded from a base-phrase. A mid-phrase is annotated only if it is deemed necessary. The process starts from the identified base-phrase. One more level of syntactic structure is then bracketed if it exists within the maximal-phrase.</Paragraph>
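The rule in Step 2 can be sketched in a few lines. The sketch assumes brackets are stored as word-index spans; the Bracket type, its field names, the level names and the function annotate_base_phrases are hypothetical and serve only to illustrate the rule. It is not the annotation tool used in the project. A base-phrase identical to the maximal-phrase is simply skipped, which is one possible reading of "no bracketing is done in this step".

```python
from typing import List, NamedTuple, Optional


class Bracket(NamedTuple):
    """One bracket over a sentence, as a half-open span of word indices."""
    level: str            # "maximal", "base" or "mid" (illustrative names)
    start: int            # index of the first word, inclusive
    end: int              # index one past the last word
    head: Optional[int]   # headword index, marked for base-phrases


def annotate_base_phrases(maximal: Bracket,
                          candidates: List[Bracket]) -> List[Bracket]:
    """Step 2 sketch: keep base-phrases that lie inside the maximal-phrase."""
    kept = [maximal]
    for cand in candidates:
        inside = cand.start >= maximal.start and maximal.end >= cand.end
        identical = (cand.start, cand.end) == (maximal.start, maximal.end)
        if not inside or identical:
            # Outside the maximal-phrase, or already bracketed in Step 1.
            continue
        if cand.head is None or not (cand.end > cand.head >= cand.start):
            raise ValueError("each base-phrase needs a headword inside its span")
        kept.append(cand)
    return kept


maximal = Bracket("maximal", 0, 5, None)
candidates = [
    Bracket("base", 0, 5, 2),   # same span as the maximal-phrase: skipped
    Bracket("base", 1, 3, 2),   # inside the maximal-phrase: kept
    Bracket("base", 4, 7, 5),   # crosses the maximal-phrase boundary: skipped
]
for bracket in annotate_base_phrases(maximal, candidates):
    print(bracket)
```

Spans are used instead of nested objects so that the identity test between a base-phrase and a maximal-phrase is a plain tuple comparison.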
    <Paragraph position="12"> The third design principle is to provide sufficient syntactic information for natural language applications, even though shallow annotation does not necessarily contain complete syntactic information at the sentence level. Some past research on Chinese shallow parsing addressed single-level base-phrases only (Sun 2001). However, for certain applications, such as collocation extraction, identifying base-phrases alone is not very useful. In this project, we have decided to annotate phrases within three levels of nesting within a sentence. For each phrase, a label is given to indicate its syntactic information, and an optional semantic or structural label is given if applicable. Furthermore, the headword of a base-phrase is annotated. We believe this information is sufficient for much natural language processing research, and it is also manageable for this project within its working schedule.</Paragraph>
    <Paragraph position="13"> Fourth, to support practical language processing, a reasonably large annotated Treebank is expected. Studies on English have shown that a Treebank of 500K to 1M words is reasonable for syntactic structure analysis (Leech and Garside 1996). Considering the resources available and with reference to the studies on English, we have set our Treebank size at one million words. We hope that data of this scale can effectively support language research such as collocation extraction.</Paragraph>
    <Paragraph position="14"> We chose to use the XML format to record the annotated data. Additional information, such as details of the original article (author, date, etc.) and the annotator name, is also given through the meta-tags provided by XML.</Paragraph>
    <Paragraph position="15"> All the meta-tags can be removed by a program to recover the original data.</Paragraph>
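As an illustration of how meta-tags can be removed by a program, the following sketch uses Python's standard xml.etree.ElementTree module. The element names (article, phrase), the attribute names (annotator, label) and the placeholder tokens are assumptions made here for illustration; the actual tag set of the Treebank is not given in this section.

```python
import xml.etree.ElementTree as ET


def strip_meta_tags(root: ET.Element) -> str:
    """Drop all element markup and attributes, keeping the text in document order."""
    return "".join(root.itertext())


# Build a tiny annotated fragment; all tag and attribute names are hypothetical.
article = ET.Element("article", {"annotator": "A01"})
phrase = ET.SubElement(article, "phrase", {"label": "NP"})
phrase.text = "w1/n w2/n"   # placeholder segmented and tagged words
phrase.tail = " w3/v"       # text that follows the closing phrase tag

print(strip_meta_tags(article))   # w1/n w2/n w3/v
```

Because itertext yields only text and tail content, attributes such as the annotator name are discarded together with the tags, leaving the segmented and tagged text.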
    <Paragraph position="16"> We performed a small-scale experiment to compare the cost of shallow annotation with that of full annotation (following the Penn Chinese Treebank specification) on 500 Chinese sentences, using the same annotators. The time cost of shallow annotation is only 25% of that of full annotation. Meanwhile, due to the reduced structural complexity of shallow annotation, the accuracy of first-pass shallow annotation is much higher than that of full annotation.</Paragraph>
  </Section>
</Paper>