<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1026"> <Title>How to get a Chinese Name (Entity): Segmentation and Combination Issues</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Data </SectionTitle> <Paragraph position="0"> We used three annotated Chinese corpora in our experiments. The IBM-FBIS Corpus. The Foreign Broadcast Information Service (FBIS) offers an extensive collection of translations and transcriptions of open-source information monitored worldwide on diverse topics such as military affairs, politics, economics, and science and technology. The IBM-FBIS corpus consists of approximately 3,000 Chinese articles obtained from FBIS (about 3.2 million Chinese characters in total). A native Chinese speaker tagged this corpus with 32 named-entity (NE) categories, such as person, location, organization, country, people, date, time, percentage, cardinal, ordinal, product, substance, and salutation. The entire corpus contains approximately 300,000 NEs, of which 16% are labeled as person, 16% as organization, and 11% as location.</Paragraph> <Paragraph position="1"> The IBM-CT Corpus. The Chinese Treebank (Xia et al., 2000), available from the Linguistic Data Consortium, is a 100,000-word corpus (approximately 160,000 characters) annotated with word segmentation, part-of-speech tags, and syntactic bracketing. It includes 325 articles from the Xinhua newswire published between 1994 and 1998. The annotator who tagged the IBM-FBIS data also annotated the Chinese Treebank data with NE information, using the same 32 categories; we refer to the result as the IBM-CT corpus.</Paragraph> <Paragraph position="2"> The IEER Data. The National Institute of Standards and Technology organized the Information Extraction - Entity Recognition (IEER) evaluation, which involves entity recognition from textual information sources in both English and Mandarin. 
The Mandarin training data consists of approximately 10 hours of broadcast news transcripts, comprising approximately 390 stories. The test data also contains transcripts of broadcast news (see footnote 1). The training data includes approximately 170,000 characters and the test data approximately 6,500 characters. Ten NE categories were annotated, including person, location, organization, date, duration, and measure.</Paragraph> <Paragraph position="3"> Footnote 1: Other types of test data were also used in the IEER evaluation, including newswire text and real automatic speech recognition transcripts, but we did not use them in our experiments.</Paragraph> </Section> </Paper>