<?xml version="1.0" standalone="yes"?> <Paper uid="H05-2019"> <Title>POSBIOTM/W: A Development Workbench For Machine Learning Oriented Biomedical Text Mining System</Title> <Section position="3" start_page="0" end_page="36" type="intro"> <SectionTitle> 2 System Description </SectionTitle> <Paragraph position="0"> POSBIOTM/W comprises a set of tools that give users a convenient environment for gathering, managing and analyzing biomedical text and for named-entity annotation. The workbench consists of four components: the Managing tool, the NER tool, the Event Extraction tool and the Annotation tool.</Paragraph> <Paragraph position="1"> We also incorporate active learning into the workbench to improve the performance of the NER and Event Extraction modules. The overall design is shown in Figure 1.</Paragraph> <Section position="1" start_page="0" end_page="36" type="sub_section"> <SectionTitle> 2.1 Managing tool </SectionTitle> <Paragraph position="0"> The main objective of the Managing tool is to help biologists search, collect and manage literature relevant to their interests. Users can access the PubMed database of bibliographic information through a quick search bar and an incremental PubMed search engine.</Paragraph> </Section> <Section position="2" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 2.2 NER tool </SectionTitle> <Paragraph position="0"> The NER tool is a client of the POSBIOTM-NER module and automatically annotates biomedical texts. It provides access to three target-specific named-entity models: the GENIA-NER model, the GENE-NER model and the GPCR-NER model. These models are trained on the GENIA corpus (Kim et al., 2003), the BioCreative data (Blaschke et al., 2004) and the POSBIOTM/NE corpus (see footnote 2), respectively. In the POSBIOTM-NER system, we adopt the Conditional Random Fields (CRF) model (Lafferty et
al., 2001) for the biomedical NER task.</Paragraph> </Section> <Section position="3" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 2.3 Event Extraction tool </SectionTitle> <Paragraph position="0"> The Event Extraction tool extracts several biological events from texts using automatically generated rules. We use a supervised machine learning method to overcome the knowledge-engineering bottleneck by learning event extraction rules automatically. We modify the WHISK algorithm (Soderland, 1999) to provide a two-level rule learning method as a divide-and-conquer strategy. In two-level rule learning, the system first learns event extraction rules that apply inside noun chunks, and then learns rules for the whole sentence.</Paragraph> <Paragraph position="1"> Because the system extracts biological events using automatically generated rules, and many different rules can apply to the same sentence, we cannot guarantee that every extracted event is correct. We therefore verify the results with a Maximum Entropy (ME) classifier to remove incorrectly extracted events. For each extracted event, we verify each component of the event with the ME classifier model. If any component contradicts the class assigned by the classification model, we remove the event. For details of the event extraction process, please consult our previous paper (Kim et al., 2004).</Paragraph> </Section> <Section position="4" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 2.4 Annotation tool </SectionTitle> <Paragraph position="0"> Our workbench provides a Graphical User Interface based Annotation tool that enables users to annotate and correct the results of named-entity recognition and event extraction. 
Users can also upload the revised data to the POSBIOTM system, which contributes to the incremental build-up of the named-entity and relation annotation corpus.</Paragraph> <Paragraph position="1"> Footnote 2: The POSBIOTM/NE corpus, our own corpus, is used to identify four target named entities: protein, gene, small molecule and cellular process.</Paragraph> </Section> <Section position="5" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 2.5 Active learning </SectionTitle> <Paragraph position="0"> To minimize the human labeling effort, we employ active learning to select the most informative samples. We have proposed a new active learning paradigm, to be published soon, that considers not only the uncertainty of the classifier but also the diversity of the corpus.</Paragraph> </Section> </Section> </Paper>