<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1207"> <Title>Event-based Information Extraction for the biomedical domain: the Caderige project</Title> <Section position="3" start_page="0" end_page="43" type="metho"> <SectionTitle> 2 Description of our approach </SectionTitle> <Paragraph position="0"> In this section, we give some details about our motivations and implementation choices.</Paragraph> <Paragraph position="1"> We then briefly compare our approach with that of the Genia project.</Paragraph> <Section position="1" start_page="43" end_page="43" type="sub_section"> <SectionTitle> 2.1 Project organization </SectionTitle> <Paragraph position="0"> The Caderige project is a multidisciplinary French research project on the automatic mining of textual data from the biomedical domain, and is mainly exploratory in orientation. From 2000 to 2003, it involved biology teams (INRA), computer science teams (LIPN, INRA and Leibniz-IMAG) and NLP teams (LIPN) as major partners, plus LRI and INRIA.</Paragraph> </Section> <Section position="2" start_page="43" end_page="43" type="sub_section"> <SectionTitle> 2.2 Project motivations </SectionTitle> <Paragraph position="0"> Biologists can search bibliographic databases via the Internet, using keyword queries that retrieve a large superset of relevant papers.</Paragraph> <Paragraph position="1"> Alternatively, they can navigate through hyperlinks between genome databanks and referenced papers. To extract the requisite knowledge from the retrieved papers, they must identify the relevant abstracts or paragraphs. Such manual processing is time-consuming and repetitive, because of the size of the bibliography, the sparseness of relevant data, and the continuous updating of the databases. From the Medline database, the focused query &quot;Bacillus subtilis and transcription&quot;, which returned 2,209 abstracts in 2002, retrieves 2,693 of them today. 
We chose this example because Bacillus subtilis is a model bacterium and transcription is a central phenomenon in functional genomics involved in genic interaction, a popular IE problem.</Paragraph> <Paragraph position="2"> GerE stimulates cotD transcription and inhibits cotA transcription in vitro by sigma K RNA polymerase, as expected from in vivo studies, and, unexpectedly, profoundly inhibits in vitro transcription of the gene (sigK) that encodes sigma K.</Paragraph> <Paragraph position="4"> Still, applying IE a la MUC to genomics, and more generally to biology, is not an easy task, because IE systems require deep analysis methods to locate relevant fragments. As shown in the example in Figures 1 and 2, retrieving that GerE is the agent of the inhibition of the transcription of the gene sigK requires at least syntactic dependency analysis and coordination processing. In most genomics IE tasks (function, localization, homology), the methods should therefore combine the semantic-conceptual analysis of text understanding methods with IE through pattern matching.</Paragraph> </Section> <Section position="3" start_page="43" end_page="43" type="sub_section"> <SectionTitle> 2.3 Comparison with the Genia project </SectionTitle> <Paragraph position="0"> Our approach is very close to that of the Genia project (Collier et al., 1999). Both projects rely on precise high-level linguistic analysis to be able to perform IE. The kind of information being searched for is similar, mainly concerning gene and protein interactions, as in most of the research in this domain. The Genia corpus (Ohta et al., 2001) is not specialized in a specific species, whereas ours is based on Bacillus subtilis.</Paragraph> <Paragraph position="1"> Both projects develop annotation tools and Document Type Definitions (DTD), which are, for the most part, compatible. 
The aim here is to build a training corpus to which various NLP and ML techniques are applied in order to acquire efficient event-based extraction patterns. The choice of ML and NLP methods differs, but their aim is similar to ours: normalizing text into predicate-argument structures in order to learn better patterns. For example, Genia uses a combination of parsers to finally perform an HPSG-like analysis. The Caderige syntactic analysis is based on the specialization of the Link Parser (Sleator and Temperley, 1993; see Section 4) to the biological domain.</Paragraph> <Paragraph position="2"> In the following two sections, we detail our text filtering and normalization methods.</Paragraph> <Paragraph position="3"> Filtering aims at pruning the irrelevant parts of the corpus, while normalization aims at building an abstract representation of the relevant text. Section 5 is devoted to the acquisition of extraction patterns from the filtered and normalized text.</Paragraph> </Section> </Section> <Section position="4" start_page="43" end_page="44" type="metho"> <SectionTitle> 3 Text filtering </SectionTitle> <Paragraph position="0"> IR and text filtering are a prerequisite step to IE, as IE methods (including normalization and learning) cannot be applied to large and irrelevant corpora: they are not robust enough and they are computationally expensive. IR here is done through the Medline interface by keyword queries that filter the appropriate document subset. Then, text filtering reduces the variability of textual data under the following assumptions: - desired information is local to sentences; - relevant sentences contain at least two gene names.</Paragraph> <Paragraph position="1"> These hypotheses may cause some genic interactions to be missed, but we assume that information redundancy is such that at least one instance of each interaction is contained in a single sentence in the corpus. 
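A minimal sketch of this filtering step in Python, assuming a toy gene-name lexicon (GENE_NAMES) and a naive sentence splitter; the actual Caderige pipeline relies on dedicated gene-name resources and segmentation modules:

```python
import re

# Hypothetical gene-name lexicon; the real system uses curated resources.
GENE_NAMES = {"GerE", "cotD", "cotA", "sigK", "spoIIG", "sigA"}

def split_sentences(text):
    # Naive sentence segmentation on sentence-final punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def filter_sentences(text):
    """Keep only sentences mentioning at least two known gene names."""
    kept = []
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        if len(tokens & GENE_NAMES) >= 2:
            kept.append(sentence)
    return kept

abstract = ("GerE stimulates cotD transcription. "
            "This phenomenon is widely studied. "
            "GerE inhibits transcription of sigK.")
print(filter_sentences(abstract))
```

Here the middle sentence is discarded because it mentions no gene name, while the other two pass the two-gene-name test.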
The documents retrieved are thus segmented into sentences, and the sentences with at least two gene names are selected.</Paragraph> <Paragraph position="2"> To identify only the relevant sentences among those, classical supervised ML methods have been applied to a Bacillus subtilis corpus in which relevant and irrelevant sentences had been annotated by a biological expert. Among SVMs, Naive Bayes (NB) methods, neural networks and decision trees (Marcotte et al., 2001; Nedellec et al., 2001), simple NB methods coupled with feature selection perform well, yielding around 85% precision and recall (Nedellec et al., 2001). Moreover, our first experiments show that linguistic-based representation changes, such as the use of lemmatization, terminology and named entities, do not lead to significant improvements. The relevant sentences filtered at this step are then used as input to the next tasks, normalization and IE.</Paragraph> </Section> <Section position="5" start_page="44" end_page="47" type="metho"> <SectionTitle> 4 Normalization </SectionTitle> <Paragraph position="0"> This section briefly presents three text normalization tasks: normalization of entity names, normalization of relations between text elements through syntactic dependency parsing, and semantic labeling. The normalization process, by providing an abstract representation of the sentences, allows the identification of regularities that simplify the acquisition or learning of pattern rules.</Paragraph> <Section position="1" start_page="44" end_page="44" type="sub_section"> <SectionTitle> 4.1 Entity names normalization </SectionTitle> <Paragraph position="0"> Named Entity recognition is a critical point in biological text analysis, and much work has previously been done to detect gene names in text (Proux et al., 1998; Fukuda et al., 1998).</Paragraph> <Paragraph position="1"> So, in Caderige, we do not develop any original NE extraction tool. 
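The sentence classification experiment can be illustrated with a minimal multinomial Naive Bayes implementation; the training sentences and labels below are invented toy data, and this sketch omits the feature selection step used in the actual experiments:

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words features (toy sketch)."""
    def fit(self, sentences, labels):
        self.classes = set(labels)
        self.priors = {c: math.log(labels.count(c) / len(labels))
                       for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for sent, label in zip(sentences, labels):
            self.counts[label].update(sent.lower().split())
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict(self, sentence):
        def score(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            return self.priors[c] + sum(
                math.log((self.counts[c][w] + 1) / total)  # Laplace smoothing
                for w in sentence.lower().split())
        return max(self.classes, key=score)

train = ["GerE activates cotD transcription",
         "sigA controls spoIIG expression",
         "the cells were grown overnight",
         "samples were incubated at 37 degrees"]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
nb = NaiveBayes().fit(train, labels)
print(nb.predict("GerE controls sigK transcription"))
```

Unseen words (here, sigK) are handled by Laplace smoothing rather than being discarded, which keeps the classifier usable on the sparse vocabularies typical of this corpus.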
We focus on a less studied problem: synonym recognition.</Paragraph> <Paragraph position="2"> Beyond typographical variations and abbreviations, biological entities often have several different names. Synonymy of gene names is a well-known problem, partly due to the huge amount of data manipulated (43,238 references registered in FlyBase for Drosophila melanogaster, for example). Genes are often given a temporary name by a biologist. This name is then changed according to information on the gene concerned: for example, SYGP-ORF50 is a gene name temporarily attributed by a sequencing project to the PMD1 yeast gene. We have shown that, in addition to the data available in genomic databases (GenBank, SwissProt, ...), it is possible to acquire many synonymy relations with good precision through text analysis. By focusing on synonymy trigger phrases such as &quot;also called&quot; or &quot;formerly&quot;, we can extract text fragments of the type: gene trigger gene.</Paragraph> <Paragraph position="3"> However, the triggers themselves are subject to variation, and the arguments of the synonymy relation must be precisely identified. We have shown that it is possible to define patterns to recognize synonymy expressions. These patterns have been trained on a representative set of sentences from Medline and then tested on a new corpus made of 106 sentences containing the keyword formerly. Results on the test corpus are the following: 97.5% precision, 75% recall. We chose to favor high precision since the acquired information must be valid for further acquisition steps (Weissenbacher, 2004).</Paragraph> <Paragraph position="4"> The approach that has been developed is very modular, since abstract patterns like gene trigger gene (the trigger being a linguistic marker or a simple punctuation mark) can be instantiated by various linguistic items. A score can be computed for each instantiation of the pattern during a learning phase on a large representative corpus. 
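The gene trigger gene pattern can be sketched with a regular expression; the trigger list and the gene-name pattern below are simplified illustrations, since the actual patterns were trained on Medline sentences and are far more elaborate:

```python
import re

# Illustrative trigger phrases; longest alternatives come first so that
# "formerly known as" is not pre-empted by the shorter "formerly".
TRIGGERS = r"(?:formerly known as|formerly|also known as|also called)"
# Simplified gene-name pattern (real recognizers are far more elaborate).
GENE = r"[A-Za-z][A-Za-z0-9-]+"

SYNONYMY = re.compile(rf"({GENE})\s*[,(]?\s*{TRIGGERS}\s+({GENE})")

def extract_synonyms(sentence):
    """Return (name, synonym) pairs matching the 'gene trigger gene' pattern."""
    return SYNONYMY.findall(sentence)

print(extract_synonyms("PMD1 (formerly SYGP-ORF50) is a yeast gene."))
```

The ordering of alternatives inside TRIGGERS matters because regular-expression alternation is tried left to right; this is one concrete way the triggers themselves are subject to variation.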
The use of a reduced tagged corpus and of a large untagged corpus justifies the use of semi-supervised learning techniques.</Paragraph> </Section> <Section position="2" start_page="44" end_page="45" type="sub_section"> <SectionTitle> 4.2 Sentence parsing </SectionTitle> <Paragraph position="0"> The extraction of structured information from texts requires precise sentence parsing tools that exhibit the relevant relations between domain entities. Contrary to (Akane et al. 2001), we chose a partial parsing approach: the analysis is focused on relevant parts of texts and, from these chunks, on specific relations. Several reasons motivate this choice: among others, the fact that relevant information generally appears in predefined syntactic patterns and, moreover, the fact that we want to learn domain knowledge ontologies from specific syntactic relations (Faure and Nedellec, 2000; Bisson et al., 2000).</Paragraph> <Paragraph position="1"> First experiments were done with several shallow parsers. It appeared that constituent-based parsers are efficient at segmenting the text into syntactic phrases but fail to extract relevant functional relationships between phrases.</Paragraph> <Paragraph position="2"> Dependency grammars are more adequate, since they try to establish links between the heads of syntactic phrases. In addition, as described in Schneider (1998), dependency grammars are looser on word order, which is an advantage when working on a domain-specific language.</Paragraph> <Paragraph position="3"> Two dependency-based syntactic parsers have been tested (Aubin, 2003): a hybrid commercial parser (henceforth HCP) that combines constituent and dependency analysis, and a pure dependency analyzer: the Link Parser.</Paragraph> <Paragraph position="4"> Prasad and Sarkar (2000) promote a twofold evaluation for parsers: on the one hand, the use of a representative corpus and, on the other hand, the use of specific manually elaborated sentences. 
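To illustrate the kind of normalized output a dependency analysis provides, the sketch below represents a parsed sentence as head-relation-dependent triples; the relation labels are illustrative and do not correspond to actual Link Parser link types:

```python
# A parsed sentence normalized as head-relation-dependent triples.
# The labels below are illustrative, not actual Link Parser link types.
sentence = "GerE inhibits transcription of sigK"
dependencies = [
    ("inhibits", "subject", "GerE"),
    ("inhibits", "object", "transcription"),
    ("transcription", "of", "sigK"),
]

def dependents(deps, head, relation):
    """All dependents of `head` linked by `relation`."""
    return [d for h, r, d in deps if h == head and r == relation]

# Follow links from the verb to recover the agent and target of the event.
agent = dependents(dependencies, "inhibits", "subject")[0]
obj = dependents(dependencies, "inhibits", "object")[0]
target = dependents(dependencies, obj, "of")[0]
print(agent, target)
```

This head-to-head linking is what constituent parsers failed to provide in our experiments, and it is the representation the extraction patterns of Section 5 operate on.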
The idea is to evaluate analyzers on real data (corpus evaluation) and then to check their performance on specific syntactic phenomena. In this experiment, we chose to have only one corpus, made of sentences selected from the Medline corpus according to their syntactic particularities. This strategy ensures representative results on real data.</Paragraph> <Paragraph position="5"> A set of syntactic relations was then selected and manually evaluated. This led to the results presented, for major relations only, in Table 1.</Paragraph> <Paragraph position="6"> For each analyzer and relation, we compute a recall and precision score (recall = # relevant relations found / # relations to be found; precision = # relevant relations found / # relations found by the system).</Paragraph> <Paragraph position="7"> The Link Parser generally obtains better results than HCP. One reason is that a major particularity of our corpus (Medline abstracts) is that sentences are often very long (27 words on average) and contain several clauses.</Paragraph> <Paragraph position="8"> The dependency analyzer is more accurate at identifying relevant relationships between headwords, whereas the constituent parser gets lost in the sentence complexity. We finally opted for the Link Parser. Another advantage of the Link Parser is the possibility of modifying its set of rules (see next subsection). The Link Parser is currently used at INRA to extract syntactic relationships from texts in order to learn domain ontologies on the basis of a distributional analysis (Harris, 1951; Faure and Nedellec, 1999).</Paragraph> </Section> <Section position="3" start_page="45" end_page="46" type="sub_section"> <SectionTitle> 4.3 Recycling a general parser for biology </SectionTitle> <Paragraph position="0"> During the evaluation tests, we noticed that some changes had to be applied either to the parser or to the text itself to improve the syntactic analysis of our biomedical corpus. 
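The recall and precision scores defined above can be computed directly over sets of relation instances; the gold and system relation sets below are invented for illustration:

```python
def precision_recall(found, gold):
    """Precision and recall of extracted relations against a gold set."""
    relevant_found = found & gold
    precision = len(relevant_found) / len(found)
    recall = len(relevant_found) / len(gold)
    return precision, recall

# Made-up relation instances: (relation, head, dependent) triples.
gold = {("subject", "inhibits", "GerE"),
        ("object", "inhibits", "transcription"),
        ("of", "transcription", "sigK"),
        ("object", "stimulates", "cotD")}
found = {("subject", "inhibits", "GerE"),
         ("object", "inhibits", "transcription"),
         ("subject", "stimulates", "sigK")}

p, r = precision_recall(found, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three relations found are correct (precision 0.67), covering two of the four gold relations (recall 0.50).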
The corpus needs to be preprocessed: sentence segmentation, named entity and term recognition are thus performed using generic modules tuned for the biology domain1. Term recognition removes numerous structural ambiguities, which clearly benefits parsing quality and execution time.</Paragraph> <Paragraph position="1"> 1 A term analyser is currently being built at LIPN using existing term resources like Gene Ontology</Paragraph> <Paragraph position="3"> Concerning the Link Parser, we have manually introduced new rules and lexical entries to allow the parsing of syntactic structures specific to the domain. For instance, the Latin-derived Noun-Adjective phrase &quot;Bacillus subtilis&quot; has a structure inverse to that of the canonical English noun phrase (Adjective Noun). Another major task was to loosen the rule constraints, because Medline abstracts are written by biologists who sometimes express themselves in broken English. A typical error is the omission of the determiner before some nouns that require one. Finally, we added words unknown to the original parser.</Paragraph> </Section> <Section position="4" start_page="46" end_page="46" type="sub_section"> <SectionTitle> 4.4 Semantic labelling </SectionTitle> <Paragraph position="0"> The Asium software is used to semi-automatically acquire relevant semantic categories by distributional semantic analysis of the parsed corpus. These categories contribute to text normalization at two levels: disambiguating syntactic parsing, and typing entities and actions for IE. Asium is based on an original bottom-up hierarchical clustering method that builds a hierarchy of semantic classes from the syntactic dependencies parsed in the training corpus. 
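A sketch of how term recognition can reduce structural ambiguity before parsing, by grouping known multiword terms into single tokens; the term list is an invented stand-in for resources such as the Gene Ontology:

```python
# Illustrative multiword-term lexicon; the real module draws on term
# resources such as the Gene Ontology.
TERMS = {"sigma K RNA polymerase", "RNA polymerase", "Bacillus subtilis"}

def merge_terms(tokens):
    """Greedy longest-match grouping of known multiword terms."""
    max_len = max(len(t.split()) for t in TERMS)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n > 1 and candidate in TERMS:
                out.append(candidate.replace(" ", "_"))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_terms("transcription by sigma K RNA polymerase in vitro".split()))
```

Collapsing "sigma K RNA polymerase" into one token spares the parser from attaching each of its words separately, which is where much of the ambiguity (and parsing time) comes from.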
Manual validation is required in order to distinguish between different meanings expressed by identical syntactic structures.</Paragraph> <Paragraph position="1"> 5 Extraction pattern learning Extraction pattern learning requires a training corpus from which the relevant and discriminant regularities can be automatically identified. This relies on two processes: text normalization, which is domain-oriented but not task-oriented (as described in the previous sections), and task-oriented annotation by an expert of the task.</Paragraph> </Section> <Section position="5" start_page="46" end_page="46" type="sub_section"> <SectionTitle> 5.1 Annotation procedure </SectionTitle> <Paragraph position="0"> The Caderige annotation language is based on XML and a specific DTD (Document Type Definition) that can be used to annotate both prokaryote and eukaryote organisms with 50 tags carrying up to 8 attributes. Such precision is required for learning feasibility and extraction efficiency. Practically, each annotation aims at highlighting the set of words in the sentence describing: - Agents (A): the entities activating or controlling the interaction; - Targets (T): the entities that are produced or controlled; - Interaction (I): the kind of control performed during the interaction; - Confidence (C): the confidence level in this interaction.</Paragraph> <Paragraph position="1"> The annotation of &quot;A low level of GerE activated transcription of CotD by GerE RNA polymerase in vitro ...&quot; is given below. The attributes associated with the tag <GENIC-INTERACTION> express the fact that the interaction is a transcriptional activation and that it is certain. The other tags (<IF>, <AF1>, ...) 
mark the agent (AF1 and AF2), the target (TF1) and the interaction (IF).</Paragraph> </Section> <Section position="6" start_page="46" end_page="47" type="sub_section"> <SectionTitle> 5.2 The annotation editor2 </SectionTitle> <Paragraph position="0"> Annotations cannot be processed in raw text form by biologists. The annotation framework developed by Caderige therefore provides a general XML editor with a graphic interface for creating, checking and revising annotated documents.</Paragraph> <Paragraph position="1"> For instance, it displays the text with graphic attributes as defined in the editor's XML style sheet, it allows tags to be added without strong constraints on the insertion order, and it automatically performs some checking.</Paragraph> <Paragraph position="2"> The editor interface is composed of four main parts (see Figure 3): the editable text zone for annotation, the list of XML tags that can be used at a given time, the attribute zone to edit the values of the selected tag, and the XML code currently generated. In the text zone, the above sentence is displayed as follows: A low level of GerE activated transcription of CotD by GerE RNA polymerase in vitro This editor is currently used by some of the Caderige project partners and at the SIB (Swiss Institute of Bioinformatics) with another DTD, in the framework of the European BioMint project. Several corpora on various species have been annotated using this tool, mainly by biologists from INRA.</Paragraph> </Section> <Section position="7" start_page="47" end_page="47" type="sub_section"> <SectionTitle> 5.3 Learning </SectionTitle> <Paragraph position="0"> The vast majority of approaches rely on hand-written pattern rules that are based on shallow representations of the sentences (e.g.</Paragraph> <Paragraph position="1"> Ono et al., 2001). In Caderige, the deep analysis methods increase the complexity of the sentence representation, and thus of the IE patterns. 
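Annotations of this kind can be consumed programmatically; the fragment below is a hypothetical annotated sentence built from the tag names cited above (<GENIC-INTERACTION>, <AF1>, <TF1>, <IF>), with invented attribute names, parsed with Python's standard XML library:

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence using the tag names cited in the text;
# the attribute names ("type", "valence") are illustrative, not the DTD's.
annotated = """<GENIC-INTERACTION type="transcription" valence="positive">
A low level of <AF1>GerE</AF1> <IF>activated</IF> transcription
of <TF1>CotD</TF1> in vitro
</GENIC-INTERACTION>"""

root = ET.fromstring(annotated)
agent = root.findtext("AF1")
target = root.findtext("TF1")
interaction = root.findtext("IF")
print(interaction, agent, target)
```

The mixed-content layout (tags embedded in running text) is what lets the same document serve both the annotation editor and the learner.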
ML techniques therefore appear very appealing for automating the process of rule acquisition (Freitag, 1998; Califf et al., 1998; Craven et al., 1999).</Paragraph> <Paragraph position="2"> Learning IE rules is seen as a discrimination task, where the concept to learn is an n-ary relation between arguments that correspond to the template fields. For example, the template in Figure 2 can be filled by learning a ternary relation genic-interaction(X,Y,Z), where X, Y and Z are the type, the agent and the target of the interaction. The learning algorithm is provided with a set of positive and negative examples built from the annotated and normalized sentences. We use the relational learning algorithm Propal (Alphonse et al., 2000). The appeal of using a relational method for this task is that it can naturally represent the relational structure of the syntactic dependencies in the normalized sentences and, if needed, background knowledge such as semantic relations. For instance, from the following sentence: &quot;In this mutant, expression of the spoIIG gene, whose transcription depends on both sigA and the phosphorylated Spo0A protein, Spo0AP, a major transcription factor during early stages of sporulation, was greatly reduced at 43 degrees C.&quot;, the IE rules learned by Propal successfully extract the two relations genic-interaction(positive, sigA, spoIIG) and genic-interaction(positive, Spo0AP, spoIIG). For preliminary experiments, we selected a subset of sentences similar to this one as the learning dataset. The performance of the learner, evaluated by ten-fold cross-validation, is 69±6.5% recall and 86±3.2% precision. This result is encouraging, showing that the normalization process provides a good representation for learning IE rules with both high recall and high precision.</Paragraph> </Section> </Section> </Paper>