<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1308"> <Title>Bio-Medical Entity Extraction using Support Vector Machines</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"> The names that we are trying to extract fall into a number of categories that are outside the definitions of the traditional named-entity task used in MUC. For this reason we consider the task of term identification and classification to be an extended named entity task (NE+) in which the goal is to find types as well as individuals and where the term classes belong to an explicitly defined ontology. The use of an ontology allows us to associate human-readable terms in the domain with a set of computer-readable classes, relations, properties and axioms (Gruber, 1993).</Paragraph> <Paragraph position="1"> The particular difficulties with identifying and classifying terms in scientific and technical domains are the size of the vocabulary (Lindberg et al., 1993), an open, growing vocabulary (Lovis et al., 1995), irregular naming conventions, and extensive cross-over in vocabulary between named entity classes. The irregular naming arises in part because of the number of researchers and practitioners from different fields who are working on the same knowledge discovery area, and in part because of the large number of entities that need to be named. Despite the best efforts of major journals to standardize the terminology, synonymy is also a significant problem: an entity often has more than one widely used name. In molecular biology, for example, class cross-over of terms may arise because many DNAs and RNAs are named after the protein with which they transcribe. 
This semantic ambiguity, which depends on often complex contextual conditions, is one of the main reasons why we need learnable models and why it is difficult to re-use existing term lists and vocabularies such as MeSH (NLM, 1997), UMLS (Lindberg et al., 1993) or those found in databases such as SwissProt (Bairoch and Apweiler, 1997). An additional obstacle to re-use is that the classification scheme used within an existing thesaurus or database may not be the same as the one in the users' ontology, which may change from time to time as the consensus view of the structure of knowledge is refined. Our work has focussed on identifying names belonging to the classes shown in Table 1, which are all taken from the domain of molecular biology. Example sentences from a marked up abstract are given in Figure 1. The ontology that underlies this classification scheme describes a simple top-level model which is almost flat except for the source class, which shows places where genetic activity occurs and has a number of sub-types. 
Further discussion of our use of deep semantic structures in the ontology is given elsewhere, and we will now focus our attention on the machine learning model used to capture low level semantics.</Paragraph> <Paragraph position="2"> The training set we used in our experiments, called Bio1, consists of 100 MEDLINE abstracts, marked up in XML by a doctoral-qualified domain expert for the name classes given in Table 1.
Class        #     Description
PROTEIN      2125  proteins, protein groups, families, complexes and substructures</Paragraph> <Paragraph position="3">
DNA          358   DNAs, DNA groups, regions and genes
RNA          30    RNAs, RNA groups, regions and genes
SOURCE.cl    93    cell line
SOURCE.ct    417   cell type
SOURCE.mo    21    mono-organism
SOURCE.mu    64    multiorganism
SOURCE.vi    90    virus
SOURCE.sl    77    sublocation
SOURCE.ti    37    tissue
Table 1: Markup classes used in Bio1 with the number of word tokens for each class.</Paragraph> <Paragraph position="4"> TI - Differential interactions of <NAME cl="PROTEIN" >Rel </NAME >- <NAME cl="PROTEIN" >NF-kappa B </NAME > complexes with <NAME cl="PROTEIN" >I kappa B alpha </NAME > determine pools of constitutive and inducible <NAME cl="PROTEIN" >NF-kappa B </NAME > activity.</Paragraph> <Paragraph position="5"> AB - The <NAME cl="PROTEIN" >Rel </NAME ><NAME cl="PROTEIN" >NF-kappa B </NAME > family of transcription factors plays a crucial role in the regulation of genes involved in inflammatory and immune responses. We demonstrate that in vivo, in contrast to the other members of the family, <NAME cl="PROTEIN" >RelB </NAME >associates efficiently only with <NAME cl="PROTEIN" >NF-kappa B1 </NAME > ( <NAME cl="PROTEIN" >p105-p50 </NAME >) and <NAME cl="PROTEIN" >NF-kappa B2 </NAME > ( <NAME cl="PROTEIN" >p100-p52 </NAME >), but not with <NAME cl="PROTEIN" >cRel </NAME > or <NAME cl="PROTEIN" >p65 </NAME >. 
The <NAME cl="PROTEIN" >RelB </NAME >- <NAME cl="PROTEIN" >p52 </NAME >heterodimers display a much lower affinity for <NAME cl="PROTEIN" >I kappa B alpha </NAME > than <NAME cl="PROTEIN" >RelB </NAME >- <NAME cl="PROTEIN" >p50 </NAME > heterodimers or <NAME cl="PROTEIN" >p65 </NAME > [...]
Figure 1: [...] in XML for molecular biology named-entities.</Paragraph> <Paragraph position="6"> The number of named entities that were marked up by class is also given in Table 1, and the total number of words in the corpus is 29,940. The abstracts were chosen from a sub-domain of molecular biology that we formulated by searching under the terms human, blood cell, transcription factor in the PubMed database. An example can be seen in Figure 1.</Paragraph> </Section> </Paper>