<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1145">
  <Title>Building a Large-Scale Annotated Chinese Corpus</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Annotation Speed
</SectionTitle>
    <Paragraph position="0"> There are three main factors that affect annotation speed: the annotators' background, the design of the guidelines and, more importantly, the availability of preprocessing tools. We discuss below how each of these factors affects annotation speed.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Annotator Background
</SectionTitle>
      <Paragraph position="0"> Even with the best sets of guidelines, it is important that annotators have received considerable training in linguistics, particularly in syntax. In both the segmentation/POS tagging phase and the syntactic bracketing phase, understanding the structure of the sentences is essential for correct annotation at reasonable speed. For example, 的/de is assigned one of two part-of-speech tags, depending on where it occurs in the sentence. It is tagged as DEC when it marks the end of a preceding modifying clause and DEG when it follows a nominal phrase. This distinction is useful in that it marks two different relations: between the clause and the noun head, and between the nominal phrase and the noun head, respectively.</Paragraph>
      <Paragraph position="2"> 'recently held demonstration' During the bracketing phase, the modifying clause is further divided into relative clauses and complement (appositive) clauses. The structures of these two types of clauses are different, as illustrated in 2: 'the attitude that one is responsible to the nation' The annotator needs to make his/her own judgment as to whether the preceding constituent is a phrase or a clause. If it is a clause, he/she then needs to decide whether it is a complement clause or a relative clause. That is just one of the numerous places where the annotator has to draw upon training in syntax in order to annotate the sentence correctly and efficiently. Although it is hard to quantify how an annotator's background affects annotation speed, it is safe to assume that basic training in syntax is very important for performance.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.2 How Guideline Design can
Affect Annotation Speed
</SectionTitle>
      <Paragraph position="0"> In addition to the annotator's background, the way the guidelines are designed also affects the annotation speed and accuracy. It is important to factor in how a particular decision in guideline design can affect the speed of the annotation. In general, the more complex a construction is, the more difficult and error-prone its annotation is.</Paragraph>
      <Paragraph position="1"> In contemporary theoretical linguistics the structure of a sentence can be very elaborate.</Paragraph>
      <Paragraph position="2"> The example in 3 shows how complicated the structure of a simple sentence "they seem to understand" can be. The pronoun "they" cyclically moves up in the hierarchy in three steps.</Paragraph>
      <Paragraph position="4"> However, such a representation is infeasible for annotation guidelines. Wherever possible, we try to simplify structures without loss of information. For example, in a raising construction, instead of introducing a trace in the subject position of the complement clause of the verb, we allow the verb to take another verb phrase as its complement. Information is not lost because raising verbs are the only verbs that take a verb phrase as their complement. The structure can be automatically expanded to the "linguistically correct" structure if necessary: 4.a. before simplification</Paragraph>
      <Paragraph position="6"> 'Leaders should be responsible.' In some cases, we have to leave some structures flat in order not to slow down our annotation speed. One such example is the annotation of noun phrases. It is very useful to mark which noun modifies which, but sometimes it is hard to decide because there is too much ambiguity. We decided against annotating the internal structure of noun phrases where they consist of a string of nouns:</Paragraph>
      <Paragraph position="8"> We believe decisions like these make our guidelines simple and easy to follow, without compromising the requirement to annotate the most important grammatical relations.</Paragraph>
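To make the raising simplification above concrete, the following minimal Python sketch shows how the flattened structure (a raising verb taking a VP complement directly) could be expanded mechanically into a structure with an embedded clause whose subject is an empty category. This is an illustration under stated assumptions, not the project's actual tooling: the bracket labels (IP, NP-SBJ, -NONE-) and the raising-verb list are hypothetical stand-ins for the guidelines' exact conventions.

# Minimal sketch (not the authors' code): expanding the simplified raising
# structure (VP raising-verb VP) into a fuller structure with an embedded
# clause whose subject is an empty category. Labels (IP, NP-SBJ, -NONE-)
# and the raising-verb list are illustrative assumptions.

RAISING_VERBS = {"seem", "appear"}  # hypothetical lexicon of raising verbs

def expand_raising(tree):
    """Recursively rewrite (VP (VV raising-verb) (VP ...)) as
    (VP (VV raising-verb) (IP (NP-SBJ (-NONE- *)) (VP ...)))."""
    if isinstance(tree, str):
        return tree
    label, children = tree[0], [expand_raising(c) for c in tree[1:]]
    if (label == "VP" and len(children) == 2
            and children[0][0] == "VV" and children[0][1] in RAISING_VERBS
            and children[1][0] == "VP"):
        embedded = ["IP", ["NP-SBJ", ["-NONE-", "*"]], children[1]]
        return [label, children[0], embedded]
    return [label, *children]

if __name__ == "__main__":
    # Simplified annotation of "they seem to understand" (cf. example 3).
    simplified = ["IP", ["NP-SBJ", ["PN", "they"]],
                  ["VP", ["VV", "seem"], ["VP", ["VV", "understand"]]]]
    print(expand_raising(simplified))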
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.3 Speeding up Annotation with
Automatic Tools
</SectionTitle>
      <Paragraph position="0"> The availability of CTB-I makes it possible to train increasingly accurate CLP tools. When used as preprocessors, these tools substantially, and sometimes dramatically, accelerate our annotation. We briefly describe below how we trained segmenters, taggers and parsers for use as preprocessors.</Paragraph>
      <Paragraph position="1"> to Chinese Word Segmentation
Using the data from CTB-I, we trained an automatic word segmenter using the maximum entropy approach. In general, machine learning approaches to Chinese word segmentation hinge crucially on the observation that word components (here we loosely define word components to be Chinese characters) can occur on the left, in the middle or on the right within a word. It would be a trivial exercise if a given character always occurred in the same position across all words, but in actuality a character can be ambiguous with regard to its position within a word. This ambiguity can be resolved by looking at the context, specifically the neighboring characters and the positions (left, middle or right) already assigned to the preceding characters. So the word segmentation problem can be modeled as an ambiguity resolution problem that readily lends itself to machine learning approaches. It should be pointed out that the ambiguity cannot be completely resolved just by looking at neighboring words; sometimes syntactic context is also needed (Xue 2001). As a preliminary step, we looked only at the immediate context in our experiments.</Paragraph>
      <Paragraph position="2"> In training our maximum entropy segmenter, we reformulated the segmentation problem as a tagging problem. Specifically, we tagged the characters as LL (left), RR (right), MM (middle) and LR (single-character word), based on their distribution within words. A character can have multiple tags if it occurs in different positions in different words. Using 80,000 words from CTB-I as training data and the remaining 20,000 words as testing data, the maximum entropy segmenter achieved an accuracy of 91%, calculated by the F-measure, which combines precision and recall1. Compared with 'industrial strength' segmenters that have reported segmentation accuracy in the upper 90% range (Wu and Jiang 2000), this accuracy may seem relatively low. There are two reasons for this. First, the 'industrial strength' segmenters usually go through several additional steps (name identification and number identification, to name a few), which we did not do.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 F-measure = (precision * recall * 2) / (precision + recall).
</SectionTitle>
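For illustration with hypothetical values (not figures from the paper): if precision were 0.92 and recall 0.90, the F-measure would be (0.92 * 0.90 * 2) / (0.92 + 0.90) ≈ 0.91.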
    <Paragraph position="1"> Second, CTB-I is a relatively small corpus, and we believe that as more data become available we will be able to retrain our segmenters on more data and obtain increasingly accurate segmenters. More accurate segmenters in turn help speed up our annotation.</Paragraph>
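As a concrete illustration of the character-tagging reformulation described above (LL, MM, RR, LR), the following Python sketch converts between word-segmented text and character-level position tags. It is a minimal example of the data preparation such a segmenter would need, not the project's actual training code; the maximum entropy model itself is omitted, and the example words are only illustrative.

# Minimal sketch (not the project's code): recasting Chinese word segmentation
# as character tagging with the LL / MM / RR / LR scheme described above.
# A real system would train a maximum entropy classifier over contextual
# features; here we only show the encoding and decoding steps.

def words_to_char_tags(words):
    """Turn a list of words into (character, position-tag) pairs."""
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "LR"))          # single-character word
        else:
            pairs.append((word[0], "LL"))       # word-initial character
            for ch in word[1:-1]:
                pairs.append((ch, "MM"))        # word-internal character
            pairs.append((word[-1], "RR"))      # word-final character
    return pairs

def char_tags_to_words(pairs):
    """Recover a segmentation from (character, tag) pairs."""
    words, current = [], ""
    for ch, tag in pairs:
        if tag == "LR":
            if current:                 # flush any unfinished word (robustness)
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "LL":
            if current:
                words.append(current)
            current = ch
        elif tag == "MM":
            current += ch
        else:                           # "RR"
            words.append(current + ch)
            current = ""
    if current:
        words.append(current)
    return words

if __name__ == "__main__":
    segmented = ["上海", "浦东", "开发", "与", "法制", "建设"]
    tagged = words_to_char_tags(segmented)
    assert char_tags_to_words(tagged) == segmented
    print(tagged)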
    <Paragraph position="2">  Unlike segmenters, POS taggers are standard tools for the processing of Indo-European languages, where words are trivially identified by white space in text form. Once sentences are segmented into words, Chinese POS taggers can be trained in much the same way as POS taggers for English. The contexts used to predict the part-of-speech tag are roughly the same in both Chinese and English: the surrounding words, the previous tags and word components. One notable difference is that Chinese words lack the rich prefix and suffix morphology of Indo-European languages, which is generally a good predictor of a word's part of speech. Another difference is that Chinese words are not as long as English words in terms of the number of characters or letters they contain. Still, some characters are useful predictors for the part of speech of the words they are components of.</Paragraph>
    <Paragraph position="3"> Our POS tagger is essentially the maximum entropy tagger of Ratnaparkhi (1996) retrained on the CTB-I data. We used the same 80,000-word chunk that was used to train the segmenter and used the remaining 20,000 words for testing. Our results show that the accuracy of this tagger is about 93% when tested on Chinese data. Considering that our corpus is relatively small, this result is very promising. We expect that better accuracy will be achieved as more data become available.</Paragraph>
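For concreteness, here is a minimal Python sketch of the kind of contextual features such a tagger might use: surrounding words, previously assigned tags and word-internal characters. The feature names and context windows are illustrative assumptions, not the actual feature templates of the retrained Ratnaparkhi (1996) tagger.

# Minimal sketch (illustrative only): contextual features for a maximum
# entropy POS tagger over segmented Chinese words. These templates
# approximate the cues mentioned in the text and are not the actual
# templates of the retrained Ratnaparkhi (1996) tagger.

def pos_features(words, tags_so_far, i):
    """Features for predicting the tag of words[i], given tags for words[:i]."""
    word = words[i]
    feats = {
        "word": word,
        "prev_word": words[i - 1] if i > 0 else "<S>",
        "next_word": words[i + 1] if i + 1 < len(words) else "</S>",
        "prev_tag": tags_so_far[i - 1] if i > 0 else "<S>",
        "prev_two_tags": "+".join(tags_so_far[max(0, i - 2):i]) or "<S>",
        "first_char": word[0],      # word-internal characters as predictors
        "last_char": word[-1],
        "num_chars": len(word),
    }
    return feats

if __name__ == "__main__":
    words = ["上海", "浦东", "开发"]
    print(pos_features(words, ["NR"], 1))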
    <Paragraph position="4"> The training and development of Chinese segmenters and taggers speeds up our annotation, and at the same time, as more data are annotated, we are able to train more accurate preprocessing tools. This is a bootstrapping cycle that benefits both the annotation and the tools. The value of preprocessing in segmentation and POS tagging is substantial: these automatic tools turn annotation into an error-correction activity rather than annotation from scratch. By our estimate, correcting the output of a segmenter and a POS tagger is nearly twice as fast as annotating the same data from scratch in the segmentation and POS-tagging phase.</Paragraph>
    <Paragraph position="5"> The value of a parser as a preprocessing tool is less obvious, since when the parser makes an error, the human annotator has to do considerable backtracking and undo the incorrect parses it produced. We therefore conducted an experiment, and our results show that even with this apparent drawback, the parser is still a useful preprocessing tool that helps annotation substantially. We discuss this result in the next subsection.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.3.3 Training a Statistical Parser
</SectionTitle>
      <Paragraph position="0"> In order to determine the usefulness of the parser as a preprocessing tool, we used Chiang's parser (Chiang 2000), originally developed for English, retrained on data from CTB-I. We used 80,000 words of fully bracketed data for training and 10,000 words for testing. The parser obtains 73.9% labeled precision and 72.2% labeled recall. We then conducted an experiment to determine whether the use of a parser as a preprocessor improves annotation speed. We randomly selected a 13,469-word chunk of data from the corpus. This chunk was blindly divided into two portions of roughly equal size (6,731 words for portion 1, 6,738 words for portion 2). The first portion was annotated from scratch. The second portion was first preprocessed by the parser, and an annotator then corrected its output. The throughput rate was carefully recorded. In both cases, another annotator made a final pass over the first annotator's annotation and discussed discrepancies with the first annotator. The adjudicated data was designated as the Gold Standard. This allows us to measure the "quality" of each portion in addition to the throughput rate. The experimental results are tabulated in 8: The results clearly show that using the parser as a preprocessor greatly reduces the time needed for annotation (by 42%), compared with the time spent on annotation from scratch.</Paragraph>
      <Paragraph position="1"> This suggests that even in the bracketing phase, despite the occasional need to backtrack, preprocessing can greatly benefit treebank annotation. In addition, the results show that annotation accuracy remains roughly constant.</Paragraph>
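As a side note on how labeled precision and recall figures such as those reported above are typically computed, the sketch below compares gold and parser constituents in the standard way, matching on label and span. It is a simplified PARSEVAL-style illustration under that assumption, not the exact scoring script used in the project.

# Minimal sketch (illustrative, not the project's scoring script): labeled
# precision and recall over constituents, where each constituent is a
# (label, start, end) span over the same tokenisation.

from collections import Counter

def labeled_prf(gold_spans, test_spans):
    """gold_spans / test_spans: lists of (label, start, end) tuples."""
    gold, test = Counter(gold_spans), Counter(test_spans)
    matched = sum((gold & test).values())      # multiset intersection
    precision = matched / sum(test.values()) if test else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

if __name__ == "__main__":
    gold = [("IP", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
    test = [("IP", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
    print(labeled_prf(gold, test))   # 3 of 4 match -> (0.75, 0.75, 0.75)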
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Quality Control
</SectionTitle>
    <Paragraph position="0"> While the preprocessing tools give a substantial boost to our annotation speed, evaluation tools, especially in the bracketing phase, help us monitor annotation accuracy and inter-annotator consistency, and thus the overall quality of the corpus. From our experience, we have learned that despite the best effort of human annotators, they are bound to make errors, especially mechanical errors due to oversight or fatigue. These mechanical errors often happen to be the errors that automatic tools are good at detecting. In this section, we describe how we monitor our annotation quality and the tools we use to detect errors.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Double Annotation and
Evaluation
</SectionTitle>
      <Paragraph position="0"> To monitor our annotation accuracy and inter-annotator consistency, we randomly selected 20% of the files to double annotate; that is, for these files, each annotator annotates them independently. The annotators meet weekly to compare the double annotated files. This is done in three steps. First, an evaluation tool2 is run on each double annotated file to determine the inter-annotator consistency. Second, the annotators examine the results of the comparison and the inconsistencies detected by the evaluation tool. These inconsistencies are generally in the form of crossed brackets, extra brackets, wrong labels, etc. The annotators examine the errors and decide on the correct annotation. Most of the errors are obvious and the annotators can agree on the correct annotation. On rare occasions, the errors are due to misinterpretation of the guidelines, which is possible given the complexity of the syntactic constructions encountered in the corpus. (2 The tool was written by Satoshi Sekine and Mike Collins. More information can be found at &lt;www.cs.nyu.edu/cs/projects/proteus/evalb&gt;.)</Paragraph>
      <Paragraph position="1"> The comparison is therefore also an opportunity to continue the annotators' training. After the inconsistencies are corrected or adjudicated, the corrected and adjudicated files are designated as the Gold Standard. The final step is to compare the Gold Standard against each annotator's annotation and determine each annotator's accuracy. Our results show that both inter-annotator consistency and annotator accuracy are in the high 90% range, which is a very satisfactory result.</Paragraph>
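To illustrate the kind of comparison such an evaluation tool performs, here is a small Python sketch that classifies discrepancies between two annotators' bracketings into the categories mentioned above: label disagreements, brackets present for only one annotator, and crossing brackets. It is a simplified stand-in written for this illustration, not the evalb tool itself, and it assumes both annotations share the same tokenisation.

# Minimal sketch (not evalb itself): classifying discrepancies between two
# annotators' bracketings of the same sentence. Constituents are
# (label, start, end) spans over the same tokenisation.

def compare_annotations(ann_a, ann_b):
    """Return label mismatches, brackets unique to one annotator, and crossings."""
    spans_a = {(s, e): lab for lab, s, e in ann_a}
    spans_b = {(s, e): lab for lab, s, e in ann_b}

    label_mismatches = [(span, spans_a[span], spans_b[span])
                        for span in spans_a.keys() & spans_b.keys()
                        if spans_a[span] != spans_b[span]]
    only_a = sorted(spans_a.keys() - spans_b.keys())
    only_b = sorted(spans_b.keys() - spans_a.keys())

    def crosses(x, y):
        # Two spans cross if they overlap but neither contains the other.
        return x[0] < y[0] < x[1] < y[1] or y[0] < x[0] < y[1] < x[1]

    crossings = [(a, b) for a in only_a for b in only_b if crosses(a, b)]
    return label_mismatches, only_a, only_b, crossings

if __name__ == "__main__":
    ann_a = [("IP", 0, 6), ("NP", 0, 2), ("VP", 2, 6), ("NP", 4, 6)]
    ann_b = [("IP", 0, 6), ("NP", 0, 2), ("VP", 2, 6), ("PP", 3, 5)]
    print(compare_annotations(ann_a, ann_b))   # reports one crossing pair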
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Post-annotation Checking with Automatic Tools
</SectionTitle>
      <Paragraph position="0"> As a final quality control step, we run LexTract (Xia 2001) and a corpus search tool developed by Beth Randall3. These tools are generally very good at picking up mechanical errors made by the human annotators. For example, they detect errors such as missing brackets, wrong phrasal labels and wrong POS tags. If a phrasal label is not found in the bracketing guidelines, the tools detect it, and the annotators then manually fix the error. Using these tools allows us to fix mechanical errors and get the data ready for the final release.</Paragraph>
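As an illustration of the kind of mechanical checking described here, the sketch below scans bracketed trees for phrasal labels and POS tags that are not licensed by the guidelines and for unbalanced brackets. The label inventories shown are small illustrative subsets, and the code is a hypothetical example, not LexTract or the Randall search tool.

# Minimal sketch (not LexTract or the Randall search tool): flag phrasal
# labels and POS tags that are not in the guideline inventories, plus
# unbalanced brackets, in Penn-Treebank-style bracketed strings.
# The inventories below are small illustrative subsets, not the full
# CTB label sets.

PHRASE_LABELS = {"IP", "NP", "VP", "PP", "CP", "ADVP", "ADJP"}
POS_TAGS = {"NN", "NR", "VV", "VA", "AD", "P", "DEC", "DEG", "PN", "PU"}

def check_bracketing(tree_str):
    """Return a list of (position, message) problems found in one tree."""
    problems, depth = [], 0
    tokens = tree_str.replace("(", " ( ").replace(")", " ) ").split()
    for i, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
            label = tokens[i + 1] if i + 1 < len(tokens) else ""
            # A label followed by another "(" heads a phrase; otherwise
            # it is a POS tag over a terminal.
            is_phrase = i + 2 < len(tokens) and tokens[i + 2] == "("
            inventory = PHRASE_LABELS if is_phrase else POS_TAGS
            base = label.split("-")[0]          # strip functional tags like -SBJ
            if base and base not in inventory:
                problems.append((i, f"unknown label {label!r}"))
        elif tok == ")":
            depth -= 1
            if depth < 0:
                problems.append((i, "extra closing bracket"))
    if depth > 0:
        problems.append((len(tokens), f"{depth} unclosed bracket(s)"))
    return problems

if __name__ == "__main__":
    tree = "(IP (NP-SBJ (NR 上海)) (VP (VV 开发) (NPX (NN 浦东)))"
    for pos, msg in check_bracketing(tree):
        print(pos, msg)   # reports unknown label 'NPX' and 1 unclosed bracket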
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Usability
</SectionTitle>
    <Paragraph position="0"> As we have discussed earlier, in order to finish this project in a reasonable time frame, some decisions were made to simplify this phase of the project. In this section, we briefly describe what has been achieved and then anticipate future work that can build on the current phase of the project.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Current Annotation
</SectionTitle>
      <Paragraph position="0"> As we have briefly mentioned in previous sections, the bracketing phase of this project focuses on the syntactic relationships between constituents. In our guidelines, we selected three</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>