<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1207"> <Title>Event-based Information Extraction for the biomedical domain: the Caderige project</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Developments in biology and biomedicine are reported in large bibliographic databases, either focused on a specific species (e.g.</Paragraph> <Paragraph position="1"> Flybase, which specializes in Drosophila melanogaster) or not (e.g. Medline). Information sources of this type are crucial for biologists, but there is a lack of tools to explore them and extract relevant information. While recent named entity recognition tools have achieved some success in these domains, event-based Information Extraction (IE) is still a challenge. The Caderige project aims at designing and integrating Natural Language Processing (NLP) and Machine Learning (ML) techniques to explore, analyze and extract targeted information in biological textual databases. We promote a corpus-based approach focusing on text pre-analysis and normalization, which is intended to reduce linguistic variation as much as possible. Indeed, the MUC (1995) conferences have demonstrated that extraction is more efficient when performed on normalized texts. The extraction patterns are thus easier to acquire or learn, more abstract and easier to maintain. Beyond extraction patterns, it is also possible to acquire from the corpus, via ML methods, part of the knowledge necessary for text normalization, as shown here.</Paragraph> <Paragraph position="2"> This paper gives an overview of current research activities and achievements of the Caderige project. The paper first presents our approach and compares it with the one developed in the framework of a similar project called Genia (Collier et al. 1999).
We then give an account of the Caderige techniques for various filtering and normalization tasks, namely sentence filtering, resolution of named entity synonymy, syntactic parsing, and ontology learning.</Paragraph> <Paragraph position="3"> Finally, we show how extraction patterns can be learned from normalized and annotated documents, all applied to biological texts.</Paragraph> </Section> </Paper>