<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1708">
  <Title>CHINERS: A Chinese Named Entity Recognition System for the Sports Domain</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Research on Chinese information extraction is one of the topics of the COLLATE project (Computational Linguistics and Language Technology for Real World Applications). The main motivation is to investigate information extraction strategies for Chinese, especially with respect to its particular linguistic phenomena, to build a reasonable information extraction model, and to implement an application system. The Chinese Named Entity Recognition System (CHINERS) is a component of a Chinese information extraction system that is under development. CHINERS is based mainly on machine learning and shallow parsing techniques. We adopt football competition news as our corpus because it contains a variety of named entities (NEs) and relations. We select six NE types as recognition targets: personal name (PN), date or time (DT), location name (LN), team name (TN), competition title (CT) and personal identity (PI), e.g. Mo Chenyue, ...</Paragraph>
    <Paragraph position="2"> ...ward, foreign player, chief coach, judge, correspondent, etc.</Paragraph>
    <Paragraph position="3"> Figure 1 shows the system architecture. The system has three main components. The first is the Modern Chinese Word Segmentation and POS Tagging System from Shan Xi University, China (Liu, 2000), which serves as our baseline system. The second is an error repairer, which corrects the word segmentation and POS tagging errors made by the first component. The third is a shallow parser consisting of Finite State Cascades (FSC) with three recognition levels. In Figure 1, the dotted line shows the processing flow for training texts and the solid line shows the flow for test texts. During training, the texts are segmented and tagged, candidate error-repair rules are generated, and those that satisfy the appropriate conditions are selected as regular rules. The regular rules are then used to repair word segmentation and POS tagging errors in test texts automatically. Of the six NE types, PN, DT and LN are tagged by the first component and repaired by the second, so they are recognized as soon as error repair is complete; TN, CT and PI are recognized and then tagged by the third component.</Paragraph>
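The train-then-repair flow described above can be illustrated with a minimal Python sketch. It is not the CHINERS implementation: the rule form (a word plus its wrong tag mapped to a corrected tag), the selection thresholds `min_count` and `min_precision`, the function names, and the `n` and `v` tags are all assumptions made for illustration. The sketch also assumes the system output and the reference annotation are already token-aligned, so it covers tagging repair only and leaves segmentation repair out.

```python
from collections import Counter


def learn_repair_rules(system_tagged, gold_tagged, min_count=2, min_precision=0.8):
    """Collect candidate repair rules and keep only the reliable ones.

    Both arguments are lists of sentences; a sentence is a list of
    (word, tag) pairs. The two inputs are assumed to be token-aligned.
    """
    seen = Counter()   # occurrences of each (word, system tag) pair
    fixes = Counter()  # occurrences of each (word, system tag, gold tag) mismatch
    for sys_sent, gold_sent in zip(system_tagged, gold_tagged):
        for (word, sys_tag), (_, gold_tag) in zip(sys_sent, gold_sent):
            seen[(word, sys_tag)] += 1
            if sys_tag != gold_tag:
                fixes[(word, sys_tag, gold_tag)] += 1

    rules = {}
    for (word, sys_tag, gold_tag), count in fixes.items():
        precision = count / seen[(word, sys_tag)]
        # Hypothetical selection condition: frequent enough and rarely wrong.
        if count >= min_count and precision >= min_precision:
            rules[(word, sys_tag)] = gold_tag
    return rules


def apply_repair_rules(sentence, rules):
    """Rewrite the tag of every (word, tag) pair that matches a selected rule."""
    return [(word, rules.get((word, tag), tag)) for word, tag in sentence]


# Toy training data: the baseline tagger labels a personal name as a noun.
system_train = [
    [("Mo Chenyue", "n"), ("scores", "v")],
    [("Mo Chenyue", "n"), ("wins", "v")],
]
gold_train = [
    [("Mo Chenyue", "PN"), ("scores", "v")],
    [("Mo Chenyue", "PN"), ("wins", "v")],
]

rules = learn_repair_rules(system_train, gold_train)
repaired = apply_repair_rules([("Mo Chenyue", "n"), ("plays", "v")], rules)
print(rules)
print(repaired)
```

Separating rule learning from rule application mirrors the dotted (training) and solid (testing) flows in Figure 1: rules are selected once from the training texts and then applied unchanged to the test texts.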
    <Paragraph position="4"> Section 2 presents an effective approach to repairing Chinese word segmentation and POS tagging errors. Section 3 describes the principle of the automatically constructed FSC and the NE recognition procedure. Building on this, Section 4 reports the three experimental conditions and their results. Finally, Section 5 draws conclusions and outlines future work.</Paragraph>
  </Section>
</Paper>