<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1114">
  <Title>The Construction of A Chinese Shallow Treebank</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Corpus Materials Preparation
</SectionTitle>
    <Paragraph position="0"> The People Daily corpus, developed by PKU, consists of more than 13k articles totaling 5M words. As we need one million words for our Treebank, we have selected articles covering different areas in different time span to avoid duplications due to short-lived events and news topics. Our selection takes each day's news as one single unit, and then several distant dates are randomly selected among the whole 182 days in the entire collection. We have also decided to keep the original articles' structures and topics indicators as they may be useful for some applications.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Word Segmentation and Part-of-Speech
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Tagging
</SectionTitle>
      <Paragraph position="0"> The articles selected from PKU corpus are already segmented into words following the guidelines given in GB13715. The annotated corpus has a basic lexicon of over 60,000 words. We simply use this segmentation without any change and the accuracy is claimed to be 99.9%.</Paragraph>
      <Paragraph position="1"> Each word in the PKU corpus is given a POS tag.</Paragraph>
      <Paragraph position="2"> In this tagging scheme, a total of 43 POS tags are listed (Yu et al. 2001). Our project takes the PKU POS tags with only notational changes explained as follows: The morphemes tags including Ag (Adjectives morphemes), Bg, Dg, Ng, Mg, Rg, Tg, Qg, and Ug are re-labeled as lowercase letters, ag, bg, dg, ng, mg, rg, tg, qg and ug, respectively. This modification is to ensure consistent labeling in our system where the lower cases are used to indicate word-level tags and upper cases are used to indicate phrase-level labels.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="4" type="metho">
    <SectionTitle>
5 Phrase Bracketing and Annotation
</SectionTitle>
    <Paragraph position="0"> Phrase bracketing and annotation is the core part of this project. Not only all the original annotated files are converted to XML files, results of our annotations are also given in XML form. The meta tags provided by XML are very helpful for further processing and searching to the annotated text. .</Paragraph>
    <Paragraph position="1"> Note that in our project, the basic phrasal analysis looks at the context of a clause, not a sentence. Here, the term clause refers the text string ended by some punctuations including comma (,), semicolon (;), colon (:), or period (.). Certain punctuation marks such as ' ', '&lt;', and '&gt;' are not considered clause separators. For example, is considered having two clauses and thus will be bracketed separately. It should be pointed out that he set of Chinese punctuation marks are different from that of English and their usage can also be different.</Paragraph>
    <Paragraph position="2"> Therefore, an English sentence and their Chinese translation may use different punctuation marks.</Paragraph>
    <Paragraph position="3"> For example, the sentence is the translation of the English 'Tom, John, and Jack go back to school together' , which uses ' ' rather than comma(,) to indicate parallel structures, and is thus considered one clause.</Paragraph>
    <Paragraph position="4"> Each clause will then be processed according to the principles discussed in Section 2. The symbols '[' and ']' are used to indicate the left and right boundaries of a phrase. The right bracket is appended with syntactic labels as described in the general form of [Phrase]SS-FF, where SS is a mandatory syntactic label such as NP(noun phrase) and AP(adjective phrase), and FF is an optional label indicating internal structures and semantic functions such as BL(parallel), SB(a noun is the object of verb within a verb phrase). A total of 21 SS labels and 20 FF labels are given in our phrase annotation specification. For example, the functional label BL identifies parallel components in a phrase as indicated in the example .</Paragraph>
    <Paragraph position="5"> As in another example shown below, the phrase is a verb phrase, thus it is labeled as VP. Furthermore, the verb phrase can be further classified as a verb-complement type. Thus an additional SBU function label is marked. We should point out that since the FF labels are not syntactical information and are thus not expected to be used by any shallow parsers. The FF labels carry structural and/or semantic information which are of help in annotation. We consider it useful for other applications and thus decide to keep them in the Treebank. Appendix 1 lists all the FF labels used in the annotation.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Identification of Maximal-phrase:
</SectionTitle>
      <Paragraph position="0"> The maximal-phrases are the main syntactical structures including subject, predicate, and objects in a clause. Again, maximal-phrase is defined as the phrase with the maximum spanning non-overlapping length, and it is a predicate playing a distinct semantic role and containing more than one lexical word. That means a maximal-phrase contains at least one base-phrase. As this is the first stage in the bracketing process, no nesting should occur. In the following annotated sentence,</Paragraph>
      <Paragraph position="2"> that is considered a base-phrase, but not a maximal-phrase because it contains only one lexical word. Unlike many annotations where the object of a sentence is included as a part of the verb phrase, we treat them as separate maximal-phrases both due to our requirement and also for reducing nesting.</Paragraph>
      <Paragraph position="3"> If a clause is completely embedded in a larger clause, it is considered a special clause and given a special name called an internal clause . We will bracket such an internal clause as a maximal phrase with the tag 'IC' as shown in the following example,</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Annotation of Base-phrases:
</SectionTitle>
      <Paragraph position="0"> A base-phrase is the phrase with stable, close and simple structure without nesting components.</Paragraph>
      <Paragraph position="1"> Normally a base-phrase contains a lexical word as headword. Taking the maximal-phrase in Eg.1 as an example, and , are base-phrases in this maximal-phrase. Thus, the sentence is annotated as In fact, and are also base-phrases. is not bracketed because it is a single lexical word as a base-phrase without any ambiguity and it is thus by default not being bracketed. is not further bracketed because it overlaps with a maximal-phrase. Our annotation principle here is that if a base-phrase overlaps with a maximal-phrase, it will not be bracketed twice.</Paragraph>
      <Paragraph position="2"> The identification of base-phrase is done only within an already identified maximal-phrase. In other words, if a base-phrase is identified, it must be nested inside a maximal-phrase or at most overlaps with it. It should be pointed out that the identification of a base-phrase is the most fundamental and most important goal of Treebank annotation. The identification of maximal-phrases can be considered as parsing a clause using a top-down approach. On the other hand, the identification of a base-phrase is a bottom up approach to find the most basic units within a maximal-phrase.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="4" type="sub_section">
      <SectionTitle>
5.3 Mid-Phrase Identification:
</SectionTitle>
      <Paragraph position="0"> Due to the fact that sometimes there may be more syntactic structures between the base-phrases and maximal-phrases, this step uses base-phrase as the starting point to further identify one more level of the syntactical structure in a maximal-phrase. Takes Eg.1 as an example, it is further annotated as where the underlined text shows the additional annotation.</Paragraph>
      <Paragraph position="1"> As we only limit our nesting to three levels, any further nested phrases will be ignored. The following sentence shows the result of our annotation with three levels of nesting: However, a full annotation should have 4 levels of nesting as shown below. The underlined text is the  th level annotation skipped by our system.</Paragraph>
    </Section>
    <Section position="4" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
5.4 Annotation of Headword
</SectionTitle>
      <Paragraph position="0"> In our system, a '#' tag will be appended after a word to indicate that it is a headword of the base-phrase. Here, a headword must be a lexical word rather than a function word.</Paragraph>
      <Paragraph position="1"> In most cases, a headword stays in a fixed position of a base-phrase. For example, the headword of a noun phrase is normally the last noun in this phrase. Thus, we call this position the default position. If a headword is in the default position, annotation is not needed. Otherwise, a '#' tag is used to indicate the headword.</Paragraph>
      <Paragraph position="2"> For example, in a clause, , is a verb phrase, and the headword of the phrase is , which is not in the default position of a verb phrase. Thus, this phrase is further annotated as: Note that is also a headword, but since it is in the default position, no explicit annotation is needed.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="4" end_page="4" type="metho">
    <SectionTitle>
6 Annotation and Quality Assurance
</SectionTitle>
    <Paragraph position="0"> Our research team is formed by four people at the Hong Kong Polytechnic University, two linguists from Beijing Language and Culture University and some research collaborators from Peking University.</Paragraph>
    <Paragraph position="1"> Furthermore, the annotation work has been conducted by four post-graduate students in language studies and computational linguistics from the Beijing Language and Culture University.</Paragraph>
    <Paragraph position="2"> The annotation work is conducted in 5 separate stages to ensure quality output of the annotation work. The preparation of annotation specification and corpus selection was done in the first stage.</Paragraph>
    <Paragraph position="3"> Researchers in Hong Kong invited two linguists from China to come to Hong Kong to prepare for the corpus collection and selection work. A thorough study on the reported work in this area was conducted. After the project scope was defined, the SS labels and the FF labels were then defined. A Treebank specification was then documented. The Treebank was given the name PolyU Treebank to indicate that it is produced at the Hong Kong Polytechnic University. In order to validate the specifications drafted, all the six members first manually annotated 10k-word material, separately.</Paragraph>
    <Paragraph position="4"> The outputs were then compared, and the problems and ambiguities occurred were discussed and consolidated and named Version 1.0. Stage 1 took about 5 months to complete. Details of the specification can be downloaded from the project website www.comp.polyu.edu.hk/~cclab.</Paragraph>
    <Paragraph position="5"> In Stage 2, the annotators in Beijing were then involved. They had to first study the specification and understand the requirement of the annotation.</Paragraph>
    <Paragraph position="6"> Then, the annotators under the supervision of a team member in Stage 1 annotated 20k-word materials together and discussed the problems occurred.</Paragraph>
    <Paragraph position="7"> During this two-month work, the annotators were trained to understand the specification. The emphasis at this stage was to train the annotators' good understanding of the specification as well as consistency by each annotator and consistency by different annotators. Further problems occurred in the actual annotation practice were then solved and the specification was also further refined or modified.</Paragraph>
    <Paragraph position="8"> In Stage 3, which took about 2 months, each annotator was assigned 40k-word material each in which 5k-words material were duplicate annotated to all the annotators. Meanwhile, the team members in Hong Kong also developed a post-annotation checking tool to verify the annotation format, phrase bracketing, annotation tags, and phrase marks to remove ambiguities and mistakes. Furthermore, an evaluation tool was built to check the consistency of annotation output. The detected annotation errors were then sent back to the annotators for discussion and correction. Any further problems occurred were submitted for group discussion and minor modification on the specification was also done.</Paragraph>
    <Paragraph position="9"> In stage 4, each annotator was dispatched with one set of 50k-word material each time. For each distribution, 15k-word data in each set were distributed to more than two annotators in duplicates so that for any three annotators, there would be 5K duplicated materials. When the annotators finished the first pass annotation, we used the post-annotation checking tool to do format checking in order to remove the obvious annotation errors such as wrong tag annotation and cross bracketing. However, it was quite difficult to check the difference in annotation due to different interpretation of a sentence. What we did was to make use of the annotations done on the duplicate materials to compare for consistency.</Paragraph>
    <Paragraph position="10"> When ambiguity or differences were identified, discussions were conducted and a result used by the majority would be chosen as the accepted result. The re-annotated results were regarded as the Golden Standard to evaluate the accuracy of annotation and consistency between different annotators. The annotators were required to study this Golden Standard and go back to remove similar mistakes.</Paragraph>
    <Paragraph position="11"> The annotated 50k data was accepted only after this.</Paragraph>
    <Paragraph position="12"> Then, a new 50k-word materials was distributed and repeated in the same way. During this stage, the ambiguous and out-of-tag-set phrase structures were marked as OT for further process. The annotation specification was not modified in order to avoid frequent revisit to already annotated data. About 4 months were spent on this stage.</Paragraph>
    <Paragraph position="13"> In Stage 5, all the members and annotators were grouped and discuss the OT cases. Some typical new phrase structure and function types were appended in the specification and thus the final formal annotation specification was established. Using this final specification, the annotators had to go back to check their output, modify the mistakes and substitute the OT tags by the agreed tags. Currently, the project was already in Stage 5 with 2 months of work finished. A further 2 months was expected to complete this work.</Paragraph>
    <Paragraph position="14"> Since it is impossible to do all the checking and analysis manually, a series of checking and evaluating tools are established. One of the tools is to check the consistency between text corpus files and annotated XML files including checking the XML format, the filled XML header, and whether the original txt material is being altered by accident. This program ensures that the XML header information is correctly filled and during annotation process, no additional mistakes are introduced due to typing errors.</Paragraph>
    <Paragraph position="15"> Furthermore, we have developed and trained a shallow parser using the Golden Standard data. This shallow parser is performed on the original text data, and its output and manually annotated result are compared for verification to further remove errors Now, we are in the process of developing an effective analyzer to evaluate the accuracy and consistency for the whole annotated corpus. For the exactly matched bracketed phrases, we check whether the same phrase labels are given. Abnormal cases will be manually checked and confirmed. Our final goal is to ensure the bracketing can reach 99% accuracy and consistency.</Paragraph>
  </Section>
  <Section position="8" start_page="4" end_page="4" type="metho">
    <SectionTitle>
7 Current Progress and Future Work
</SectionTitle>
    <Paragraph position="0"> As mentioned earlier, we are now in Stage 5 of the annotation. The resulting annotation contains 2,639 articles selected from PKU People Daily corpus.</Paragraph>
    <Paragraph position="1"> These articles contains 1, 035, 058 segmented Chinese words, with on average, around 394 words in each article. There are a total of 284, 665 bracketed phrases including nested phrases. A summary of the different SS labels used are given in  For each bracketed phrase, if its FF label does not fit into the corresponding default pattern, (like for the noun phrase(NP), the default grammatical structure is that the last noun in the phrase is the headword and other components are the modifiers, using PZ tags), its FF labels should then be explicitly labeled. The statistics of annotated FF tags are listed in Table 2.</Paragraph>
    <Paragraph position="2"> Table 2. Statistics of function and structure tags For the material annotated by multiple annotators as duplicates, the evaluation program has reported that the accuracy of phrase annotation is higher than 99.5% and the consistency between different annotators is higher than 99.8%. As for other annotated materials, the quality evaluation program preliminarily reports the accuracy of phrase annotation is higher than 98%. Further checking and evaluation work are ongoing to ensure the final overall accuracy achieves 99%.</Paragraph>
    <Paragraph position="3"> Up to now, the FF labels of 5,255 phrases are annotated as OT. That means about 1.8% (5,255 out of a total of 284,665) of them do not fit into any patterns listed in Table 2. Most of them are proper noun phrase, syntactically labeled as PP. We are investigating these cases and trying to identify whether some of them can be in new function and structure patterns and give a new label.</Paragraph>
    <Paragraph position="4"> It is also our intention to further develop our tools to improve the automatic annotation analysis and evaluation program to find out the potential annotation error and inconsistency. Other visualization tools are also being developed to support keyword searching, context indexing, and annotation case searching. Once we complete Stage 5, we intend to make the PolyU Treebank data available for public access. Furthermore, we are developing a shallow parser and using The PolyU Treebank as training and testing data.</Paragraph>
  </Section>
class="xml-element"></Paper>