<?xml version="1.0" standalone="yes"?> <Paper uid="P06-4010"> <Title>Chinese Named Entity and Relation Identification System</Title> <Section position="4" start_page="37" end_page="38" type="metho"> <SectionTitle> 2 System Design </SectionTitle> <Paragraph position="0"> In the model, IE processing is divided into three stages: (i) word segmentation and part-of-speech (POS) tagging; (ii) named entity (NE) recognition; (iii) named entity relation (NER) identification. Figure 1 shows a Chinese IE computational model comprising these three stages. Each component in the system corresponds to a stage.</Paragraph> <Paragraph position="2"> Figure 1: A three-stage Chinese IE computational model.</Paragraph> <Paragraph position="3"> In general, the accuracy of the first stage has considerable influence on the performance of the two subsequent stages, as our experiments have demonstrated (Yao et al., 2002). To reduce this unfavorable influence, we use a trainable approach (Brill, 1995) to automatically generate effective rules, with which the first component can repair different errors caused by word segmentation and POS tagging.</Paragraph> <Paragraph position="4"> At the second stage, there are two kinds of NE constructions to be processed (Yao et al., 2003): NEs that involve trigger words and NEs without trigger words. For the former, a shallow parsing mechanism, i.e., finite-state cascades (FSC) (Abney, 1996) that are automatically constructed from sets of NE recognition rules, is adopted to reliably identify the different categories of NEs. For the latter, special strategies are designed, such as the valence constraints of domain verbs, the constituent analysis of NE candidates, global context clues, and the analysis of preposition objects. 
After NE recognition, NER identification is performed in the last stage. Because of the diversity and complexity of NERs, and considering the portability requirement of the identification, we propose a novel supervised machine learning approach called positive and negative case-based learning (PNCBL) for this stage (Yao and Uszkoreit, 2005).</Paragraph> <Paragraph position="5"> The learning in this approach is a variant of memory-based learning (Daelemans et al., [...]). Its goal is to capture valuable information from NER and non-NER patterns, which is implicit in different features. Because not all predefined features are necessary for each NER or non-NER, we select them with a reasonable measure. Our selection criterion is self-similarity, a quantitative measure of how concentrated NERs or non-NERs of the same kind are in the corresponding pattern library. Using it, we build the effective feature sets: General-Character Feature (GCF) sets for NERs and Individual-Character Feature (ICF) sets for non-NERs. Moreover, the GCF and ICF feature weights indicate each feature's degree of importance for distinguishing NERs from non-NERs. Subsequently, identification thresholds can also be determined.</Paragraph> <Paragraph position="6"> This approach therefore seeks to improve the identification of NERs by simultaneously learning two opposite kinds of cases, automatically selecting effective multi-level linguistic features from a predefined feature set for each NER and non-NER, and making an optimal identification tradeoff. 
Further, two other strategies, resolving relationship conflicts and inferring missing relationships, are also integrated in this stage.</Paragraph> <Paragraph position="7"> Considering the actual requirements for domain knowledge, we defined a hierarchical taxonomy and constructed conceptual relationships among the Object, Movement and Property concept categories under the taxonomy in a lexical sports ontology (Yao, 2005). This ontology can thus be used for recognizing NEs with special constructions (those without trigger words), determining NE boundaries, providing feature values, and computing the semantic distance between two concepts during identification. During the implementation, object-oriented design and programming methods are used thoroughly in the system development. To avoid repeated development, we integrate other application systems and resources into the system, e.g., the Modern Chinese Word Segmentation and POS Tagging System (Liu, 2000) and HowNet (Dong and Dong, 2000). Additionally, we use Protege-2000 (version 1.9) (Stanford Medical Informatics, 2003) as the development environment for implementing the lexical sports ontology.</Paragraph> <Paragraph position="8"> The prototype system CHINERIS has been implemented in Java. The system can automatically identify 6 types of NEs in the sports domain. Furthermore, its run-time efficiency is acceptable and its user interfaces are friendly.</Paragraph> </Section> <Section position="5" start_page="38" end_page="39" type="metho"> <SectionTitle> 4 Testing and Evaluation </SectionTitle> <Paragraph position="0"> We have finished three experiments for testing the three components. Table 1 shows the experimental results for the performance of these components.</Paragraph> <Paragraph position="1"> In the first experiment, the training set consists of 94 texts including 3473 sentences collected from the soccer match reports of the Jie Fang Daily (http://www.jfdaily.com/) in 2001. During manual error-correction, we adopted a double-person annotation method. 
After training, we obtain error repair rules, each of which can repair at least one error in the training corpus. The rules in the rule library are ranked according to the errors they correct. The testing set is a separate set that contains [...] The use of error repair rules with context constraints has priority over the use of those without context constraints, and the use of error repair rules for word segmentation has priority over the use of those for POS tagging. Experimental observation shows that this processing sequence allows the rules to repair many more errors while preventing new errors from occurring during the repair of existing ones. The results indicate that, after correction, the average F-measure of word segmentation increased from 87.75% to 92.86%, while that of POS tagging increased from 77.47% to 90.01%. That is, the performance of both processes has been distinctly enhanced.</Paragraph> <Paragraph position="2"> In the second experiment, we use the same testing set as for the error repair component to check named entity identification. The testing set includes regular and special entity constructions. The rule sets provided for TN, CT, and PI recognition have 35, 50, and 20 rules respectively. In the lexical sports ontology, there are more than 350 domain verbs used for the identification of TN with special constructions. Among the six NEs, the average F-measure of DT, PI, and CT exceeds 85%. This indicates that the identification performance for named entities, after adding the special recognition strategies to this component, has reached a good level.</Paragraph> <Paragraph position="3"> In the third experiment, both pattern libraries are established from the annotated texts and the lexical sports ontology during learning. The NER pattern library has 142 sentence groups (534 NERs) and the non-NER pattern library has 98 sentence groups (572 non-NERs). 
To test the performance of our approach, we randomly chose 32 sentence groups from the Jie Fang Daily in 2002 (these sentence groups appear in neither the NER nor the non-NER pattern library), which contain 117 different NER candidates. Table 1 shows the total average recall, precision, and F-measure for 14 different NERs obtained by positive and negative case-based learning and identification. Among the 14 types of NERs, the highest total average F-measure is 95.65, for the relation LOC_CPC, and the lowest is 34.09, for TM_CPC. The total average F-measure is 70.46. In addition, we compared the total average recall, precision, and F-measure for all NERs obtained with positive cases only against those obtained with both positive and negative cases. The total average F-measure rises from 63.61% to 70.46% overall, owing to the adoption of both positive and negative cases.</Paragraph> <Paragraph position="4"> From the results, we also see that the selection of relation features is critical. First, they should be selected from multiple linguistic levels, e.g., morphology, syntax and semantics. Second, they should also embody the crucial information of Chinese language processing, such as word order, the context of words, and particles. Moreover, the proposed self-similarity is a reasonable measure for selecting GCF and ICF for NER and non-NER identification respectively.</Paragraph> <Paragraph position="5"> [...] is a part of the COLLATE project under [...]01B, which is supported by [...] Ministry for Education and Research.</Paragraph> </Section> </Paper>