<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1119">
  <Title>A Semi-Supervised Approach to Build Annotated Corpus for Chinese Named Entity Recognition</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 A semi-supervised approach to improve context model estimation
</SectionTitle>
    <Paragraph position="0"> In this study, the context model is a trigram model which estimates the probability of a word class.</Paragraph>
    <Paragraph position="1"> Ideally, given an annotated corpus, where each sentence is segmented into words which are tagged by their word types, the trigram word class probabilities can be calculated using MLE, together with a backoff schema (Katz, 1987) to deal with the sparse data problem. Unfortunately, building such annotated training corpora is very expensive. Our basic solution is the bootstrapping approach described in Gao et al. (2002). It consists of three steps: (1) Initially, a greedy word segmenter (i.e. FMM) is used to annotate the corpus, and an initial context model is obtained based on the initial annotated corpus; (2) Re-annotate the corpus using the obtained models; (3) Re-train the context model using the re-annotated corpus. Steps 2 and 3 are iterated until the performance of the system converges.</Paragraph>
    <Paragraph position="2"> In the above approach, the quality of the context model depends to a large degree upon the quality of the initial annotated corpus, which is however not satisfied due to the fact that many named entities cannot be identifying using the greedy word segmenter which is based on the dictionary. As a consequence, the above approach achieves a low accuracy in detecting Chinese named entities.</Paragraph>
    <Paragraph position="3"> A straightforward solution to the above problem is to obtain large amount of high-quality annotated corpus for context model estimation. Unfortunately, manually creating such annotated corpus is very expensive. For example, Douglas (1999) pointed out that at least up to about 1.2 million words of training data are necessary to train an HMM name recognizer. To guarantee a high degree of accuracy (e.g. 90% F-measure), it requires about 800 hours, or 20 person*weeks of labor to annotate and check the amount of data. This is almost certainly more time than would be required by a skilled rule writer to write a rule-based name recognizer achieving the same level of performance, assuming all the necessary resources, such as lexicons and name lists, are already available.</Paragraph>
    <Paragraph position="4"> Our training data contains approximately 80 million Chinese characters from various domains of text. We are facing three questions in annotating the training data. (1) How to generate a high quality hand-annotated corpus? (2) How to best use the valuable hand-annotated corpus so as to achieve a satisfying performance? (3) What is the optimal size of the hand-annotated corpus, considering the tradeoff between the cost of human labor and the performance of the resulting segmenter? We leave the answers to the first and third questions to Section 4. In what follows, we describe our method of using small set of human-annotated corpus to boost the quality of the annotation of the entire corpus. It consists of 6 steps.</Paragraph>
    <Paragraph position="5"> Step 1: Manually annotate named entities on a small subset (call seed set) of the training data.</Paragraph>
    <Paragraph position="6"> Step 2: Obtain a context model on the seed set (called seed model).</Paragraph>
    <Paragraph position="7"> Step 3: Re-annotate the training corpus using the seed model and then train an improved context model using the re-annotated corpus.</Paragraph>
    <Paragraph position="8"> Step 4: Manually annotate another small subset of the training data. Repeat Steps (2) and (3) until the entire training data have been annotated.</Paragraph>
    <Paragraph position="9"> Step 5: Repeat steps 1 to 4 using different seed sets (we used three seed sets in our experiments, as we shall describe in Section 4).</Paragraph>
    <Paragraph position="10"> Step 6: Combine all context models obtained in Step 5 via linear interpolation:</Paragraph>
    <Paragraph position="11"> $P(xyz) = \sum_{i} \lambda_i \cdot P_i(xyz)$</Paragraph>
    <Paragraph position="12"> where $P_i(xyz)$ is the trigram probability assigned by the i-th context model, and the $\lambda_i$ are the interpolation weights, each varying from 0 to 1.</Paragraph>
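    <Paragraph> A minimal sketch of the Step 6 combination, assuming each context model is represented as a mapping from class trigrams to probabilities; the uniform weights, and the constraint that they sum to one (standard for linear interpolation), are our assumptions, since the paper does not say how the $\lambda_i$ were set:

def interpolate(models, weights=None):
    """Linearly interpolate trigram models: P(xyz) = sum_i lambda_i * P_i(xyz).

    models  -- list of dicts mapping a trigram (x, y, z) to P_i(xyz)
    weights -- interpolation weights lambda_i; assumed uniform if not given
    """
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    combined = {}
    for lam, model in zip(weights, models):
        for trigram, prob in model.items():
            combined[trigram] = combined.get(trigram, 0.0) + lam * prob
    return combined

# Usage with the three seed-set models of Section 4 (names hypothetical):
# combined = interpolate([model_10m, model_20m, model_30m])
    </Paragraph>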
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> In this section, we first present our experiments on the generation and evaluation of hand-annotated corpus to answer the first two questions. Then, the answer to the third question is given in subsection 4.2.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.1 The generation and evaluation of hand-annotated corpus
</SectionTitle>
      <Paragraph position="0"> Four students, all majoring in Chinese language, annotated the corpus according to a pre-defined MSRA guideline for Chinese named entities. We found that we had to revise the guideline while they were annotating the corpus. For example, the Chinese character string "Shen Cheng Bo Lan Hui (Shanghai Exposition)" can be tagged as either "[L Shen] Cheng Bo Lan Hui" or "[L Shen Cheng] Bo Lan Hui". Here "Shen" is the abbreviation of "Shang Hai (Shanghai)" and "Cheng" is the abbreviation of "Cheng Shi (city)"; L is the tag for location names. The guideline did not clearly describe where such a named entity's right boundary lies.</Paragraph>
      <Paragraph position="1"> We obtain in total three manually annotated sub-sets (i.e. seed sets) by the following process:  1. Annotate the training data using a greedy word segmenter. Highlight the NEs and their tags.</Paragraph>
      <Paragraph position="2"> 2. Randomly select 10 million characters from  the annotated training data and then ask the students to manually refine these 10 million characters. The refinement includes correcting the wrong NE tags and adding missing NE tags.</Paragraph>
      <Paragraph position="3"> 3. Repeat the second step, and then combine the newly obtained 10-million-character subset with the first one. Hence, a 20-million-character subset of the training data is obtained.</Paragraph>
  <Section position="7" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4. Repeat the second step, and then combine the
</SectionTitle>
    <Paragraph position="0"> obtained new 10-million-character subset with the 20-million-character subset. Hence, a 30million-character subset of the training data is obtained.</Paragraph>
    <Paragraph position="1"> A manually annotated test set was developed as well. The text corpus contains approximately a half million Chinese characters that have been proofread and balanced in terms of domain, styles, and times.</Paragraph>
    <Paragraph position="2">  To evaluate the quality of our annotated corpus, we trained a context model using the method described in Section 3, with the first-obtained 10-million-character seed set. We then compare the performance of the resulting segmenter with those of other state-of-the-art segmenters and the FMM segmenter.</Paragraph>
    <Paragraph position="3">  We conduct evaluations in terms of precision (P) and recall (R).</Paragraph>
    <Paragraph position="4">  1. The MSWS system is one of the best available products. It is released by Microsoft null (r) (as a set of Windows APIs). MSWS first conducts the word-breaking using MM (augmented by heuristic rules for disambiguation), and then conducts factoid detection and NER using rules.</Paragraph>
    <Paragraph position="5"> 2. The LCWS system is one of the best re- null search systems in mainland China. It is released by Beijing Language University. The system works similarly to MSWS, but has a larger dictionary containing more PNs and LNs.</Paragraph>
      <Paragraph position="10"> 3. The PBWS system is a rule-based Chinese parser which can also output word segmentation results. It exploits high-level linguistic knowledge, such as syntactic structure, for Chinese word segmentation and NER.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>