<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1119">
<Title>A Semi-Supervised Approach to Build Annotated Corpus for Chinese Named Entity Recognition</Title>
<Section position="9" start_page="1" end_page="1" type="evalu">
<SectionTitle>4.1.2.3 Results</SectionTitle>
<Paragraph position="0">Table 1 compares the performance of the resulting segmenter with that of three state-of-the-art segmenters and the FMM segmenter. Here PN, LN and ON stand for person name, location name and organization name, respectively. The first column lists the segmenters.</Paragraph>
<Paragraph position="1">As can be seen from Table 1, the resulting segmenter (SSSC.10m) achieves results comparable to those of the other three state-of-the-art word segmenters. From Table 1 we also find that our semi-supervised approach brings a 2.4%-49% improvement. These results show that a moderate amount of hand-annotated corpus leads our segmenter to state-of-the-art performance.</Paragraph>
<Section position="1" start_page="1" end_page="1" type="sub_section">
<SectionTitle>4.2 The optimal size of the hand-annotated corpus</SectionTitle>
<Paragraph position="0">The third question is: what is the optimal size of the hand-annotated subset, given the tradeoff between the cost of human labor and the performance of the resulting segmenter? To answer it, we obtain a series of results using 10-30-million-character subsets as seed sets, and then plot three graphs showing the relationship between the performance and the corresponding human effort of constructing each seed set.</Paragraph>
<Paragraph position="1">We use two baselines.</Paragraph>
<Paragraph position="2">The first is the FMM method, which does not use annotated training data. It is used to evaluate the performance obtained with the 10-30-million-character subsets (see Figure 1).</Paragraph>
<Paragraph position="3">The second is the full human effort of annotating the whole training data, which takes about 1920 person*hours. It is used to calculate how much labor we save by using the semi-supervised approach described in Section 3.</Paragraph>
</Section>
</Section>
<Section position="10" start_page="1" end_page="1" type="evalu">
<SectionTitle>4.2.2 Results</SectionTitle>
<Paragraph position="0">Figure 1 shows the relationship between the performance and the corresponding human effort of constructing the seed sets. The X-axes give the human effort of building the 10-, 20-, and 30-million-character subsets: 360, 720, and 1080 person*hours, respectively. The Y-axes show the recall and precision results on person names, location names and organization names, separately.</Paragraph>
<Paragraph position="1">We observe that both the recall and precision results first go up and then level off after 720 person*hours, the effort corresponding to the 20-million-character subset. This makes 20 million characters a saturation point: more human effort does not lead to any further improvement in performance, and less human effort leads to lower performance.</Paragraph>
<Paragraph position="2">Given that manually annotating the whole training data costs 1920 person*hours, our semi-supervised approach saves 62.5% of the human labor of corpus annotation.</Paragraph>
</Section>
</Paper>
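<!-- Editor's note: a minimal sketch of forward maximum matching (FMM), the dictionary-based baseline segmenter referred to above, which requires no annotated training data, only a word list. The toy lexicon and the maximum word length below are illustrative assumptions, not details taken from the paper.

def fmm_segment(text, lexicon, max_word_len=4):
    """Greedy left-to-right segmentation: at each position, take the longest
    lexicon entry that matches; fall back to a single character otherwise."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

# Hypothetical usage with a toy lexicon:
toy_lexicon = {"北京", "大学", "北京大学", "学生"}
print(fmm_segment("北京大学学生", toy_lexicon))  # ['北京大学', '学生']
-->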
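<!-- Editor's note: a worked check of the 62.5% labor-saving figure, using only the effort numbers reported above (1920 person*hours to annotate the whole training data versus 720 person*hours at the 20-million-character saturation point):

\frac{1920 - 720}{1920} = \frac{1200}{1920} = 0.625 = 62.5\%
-->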