<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2702"> <Title>Annotation and Disambiguation of Semantic Types in Biomedical Text: a Cascaded Approach to Named Entity Recognition</Title> <Section position="3" start_page="11" end_page="11" type="metho"> <SectionTitle> 2 Biomedical Named Entity Recognition </SectionTitle> <Paragraph position="0"> Terms and named entities (NEs) are the means of scientific communication as they are used to identify the main concepts in a domain. The identification of terminology in the biomedical literature is one of the most challenging research topics in both the NLP and biomedical communities (Hirschman et al., 2005; Kirsch et al., 2005).</Paragraph> <Paragraph position="1"> Identification of NEs in a document can be viewed as a three-step procedure (Krauthammer and Nenadic, 2004). In the first step, single or multiple adjacent words that indicate the presence of domain concepts are recognised (term recognition). In the second step, called term categorisation, the recognised terms are classified into broader domain classes (e.g. as genes, proteins, species). The final step is the mapping of terms into referential databases. The first two steps are commonly referred to as named entity recognition (NER).</Paragraph> <Paragraph position="2"> One of the main challenges in NER is the huge number of new terms and entities that appear in the biomedical domain. 
Further, terminological variation, recognition of boundaries of multiword terms, identification of nested terms and ambiguity of terms are difficult issues when mapping terms from the literature to biomedical database entries (Hirschman et al., 2005; Krauthammer and Nenadic, 2004).</Paragraph> <Paragraph position="3"> On the one hand, NER in the biomedical domain (in particular the recognition part) profits from large, freely available terminological resources, which are either provided as ontologies (e.g.</Paragraph> <Paragraph position="4"> Gene Ontology, ChEBI, UMLS) or result from biomedical databases containing named entities (e.g. UniProt/Swiss-Prot). On the other hand, combining sets of terms from different terminological</Paragraph> <Paragraph position="5"> resources leads to naming conflicts such as homonymous use of names and terminological ambiguities. The most obvious problem is when the same span of text is assigned to different semantic types (e.g. 'rat' denotes a species and a protein). In this case, there are three types of ambiguities: (Amb1) A name is used for different entries in the same database, e.g. the same protein name serves for a given protein in different species (Chen et al., 2005).</Paragraph> <Paragraph position="6"> (Amb2) A name is used for entries in multiple databases and thus represents different types, e.g. 'rat' is a protein and a species.</Paragraph> <Paragraph position="7"> (Amb3) A name is not only used as a biomedical term but also as part of common English (in contrast to the biomedical terminology), e.g.</Paragraph> <Paragraph position="8"> 'who' and 'how', which are used as protein names.</Paragraph> <Paragraph position="9"> In some cases (i.e. Amb2), broader classification can help to disambiguate between different entries (e.g. differentiate between 'CAT' as a protein, animal or medical device). 
However, it is ineffective in situations where names can be mapped to several different entries in the same data source. In such situations, disambiguation on the resource level is needed (see, for example, Liu et al. (2002) for disambiguation of terms associated with several entries in the UMLS Metathesaurus). In many solutions, the three steps in biomedical NER (namely, recognition, categorisation and mapping to databases) are merged within one module. For example, using an existing terminological database for recognition of NEs effectively leads to complete term identification (in cases where there are no ambiguities). Some researchers, however, have stressed the advantages of tackling each step as a separate task, pointing to the different sources and methods needed to accomplish each of the subtasks (Torii et al., 2003; Lee et al., 2003). Also, in the case of modularisation, it is easier to integrate different solutions for each specific problem. However, it has been suggested that it remains an open issue whether a clear separation into single steps would improve term identification (Krauthammer and Nenadic, 2004). In this paper we discuss a cascaded, modular approach to biomedical NER.</Paragraph> </Section> <Section position="4" start_page="11" end_page="15" type="metho"> <SectionTitle> 3 Biomedical NER based on XML annotation: Modules in a pipeline </SectionTitle> <Paragraph position="0"> In this Section we present a modular approach to identification, disambiguation and annotation of several biomedical semantic types in the text. Full identification of NEs, and resolving ambiguities in particular, may require a full parse tree of a sentence in addition to the analysis of local context information. On the other hand, full parse trees may only be derivable after NEs are resolved. 
Methods that efficiently overcome these problems are not yet available, so in order to come up with an applicable solution it was necessary to choose a more pragmatic approach.</Paragraph> <Paragraph position="1"> We first discuss the basic principles and design of the processing pipeline, which is based on a pragmatic cascade of modules, and then present each of the modules separately.</Paragraph> <Section position="1" start_page="12" end_page="12" type="sub_section"> <SectionTitle> 3.1 Modular design of a text processing pipeline </SectionTitle> <Paragraph position="0"> Our methodology is based on the idea of separating the process into clearly defined functions applied one after another to text, in a processing pipeline characterized by the following statements: (P1) The complete text processing task consists of separate and independent modules. (P2) The task is performed by running all modules exactly once in a fixed sequence. (P3) Each module operates continuously on an input stream and performs its function on stretches or &quot;windows&quot; of text that are usually much smaller than the whole input. As soon as a window is processed, the module produces the resulting output.</Paragraph> <Paragraph position="1"> (P4) After the startup phase, all modules run in parallel. Incoming requests for annotation are accepted by a master process that ensures that all required modules are invoked in the right order. (P5) Communication of information between the modules is strictly downstream and all meta-information is contained in the data stream itself in the form of XML markup.</Paragraph> <Paragraph position="2"> An instance of a processing pipeline (which is actually embedded in EBIMed) is presented in Figure 1. The modules M-1 to M-8 are run in this order, and no communication between them is needed apart from streaming the text from the output of one module to the input of another. The text contains the meta-data as XML markup. 
The modules are described below.</Paragraph> <Paragraph position="3"> Although this is the standard pipeline for EBIMed, it is possible to re-arrange the modules to favour identification of specific semantic types. More precisely, in our modular approach, after identification of a term in the text, disambiguation only decides whether the term is of that type or not. If it is not, the specific annotation is removed and it is left to the downstream modules to tag the term differently. While this requires n identification steps (one per semantic type), adding identification of new types is independent of modules already present. However, the prioritization of semantic types is enforced by the order of the associated term identification modules.</Paragraph> </Section> <Section position="2" start_page="12" end_page="14" type="sub_section"> <SectionTitle> 3.2 Input documents and pre-processing </SectionTitle> <Paragraph position="0"> Input documents are XML-formatted Medline abstracts as provided by the National Library of Medicine (NLM). The XML structure of Medline abstracts includes meta-information attached to the original document, such as the journal, author list, affiliations and publication dates, as well as annotations inserted by the NLM, such as the creation date of the Medline entry, the list of chemicals associated with the document and related MeSH headings.</Paragraph> <Paragraph position="1"> The text processing modules are only concerned with the document parts that consist of natural language text. In Medline abstracts, these stretches of text are marked up as ArticleTitle and AbstractText. Inside these elements we add another XML element, called text, to flag natural language text independent of the original input document format (module M-1 in Figure 1). Thereby the subsequent text processing modules become independent of the document structure: other document types, e.g. 
BioMed Central full-text papers, can easily be fed into the pipeline, provided the input pre-processor is adapted accordingly.</Paragraph> <Paragraph position="2"> As a final pre-processing step (M-2), sentences are identified and marked using the <SENT> tag.</Paragraph> <Paragraph position="3"> 3.3 Finding protein names in text For identification of protein names (M-3 in Figure 1), we use an existing protein repository. UniProt/Swiss-Prot contains roughly 190,000 protein/gene names (PGNs) in database entries that also annotate proteins with protein function, species and tissue type. PGNs from UniProt/Swiss-Prot are matched with regular expressions which account for morphological variability. These terms are tagged using the <z:uniprot> tag (see Figure 2). The list of identifiers (ids attribute) contains the accession numbers of the mentioned protein in the UniProt/Swiss-Prot database. All synonyms from a database entry are kept, and in the case of homonymy, where one name refers to several database entries, all accession numbers are stored. The pair consisting of the database name and the accession number(s) forms a unique identifier (UID) that represents the semantics of the term and can be trivially rewritten into a URL pointing to the database entry. Each entity also contains the attribute fb, which provides the frequency of the term in the British National Corpus (BNC).</Paragraph> <Paragraph position="4"> 3.4 Resolving (some) protein name ambiguities The approach to finding names that we presented can create the three types of ambiguity described above in Section 2.</Paragraph> <Paragraph position="5"> In the current implementation, Amb1 (ambiguity within a given resource) is not resolved. Rather, the links to all entries in the same database are maintained. Amb2 and Amb3 are partially resolved for protein/gene names as explained below (steps M-4 and M-5). 
Note that Amb2 is resolved on a &quot;first-come, first-served&quot; basis, meaning that an annotation introduced by one module is not overwritten by a subsequent module.</Paragraph> <Paragraph position="6"> Many protein names are, or at least look like, abbreviations. It has been shown that ambiguities of abbreviations and acronyms found in Medline abstracts can be automatically resolved with high accuracy (Yu et al., 2002; Schwartz and Hearst, 2003; Gaudan et al., 2005).</Paragraph> <Paragraph position="7"> [Figure 2: example sentence with a protein name marked up by a z:uniprot element (markup only partly recoverable). Surviving text: &quot;Aberrant Wnt signaling, which results from mutations of either [...] catenin [...] resistant to degradation, and has been associated with multiple types of human [...]&quot;]</Paragraph> <Paragraph position="9"> In our approach (Gaudan et al., 2005) all acronyms from Medline have been gathered together with their expanded forms, called senses. In addition, all morphological and syntactical variants of a known expanded form have been extracted from Medline. Expanded forms were categorised into classes of semantically equivalent forms. Feature representations of Medline abstracts containing the acronym and the expanded form were used to train support vector machines (SVMs). Disambiguation of acronyms to their senses in Medline abstracts based on the SVMs was achieved with an accuracy above 98%. This was independent of the presence of the expanded form in the Medline abstract. This disambiguation solution led to the solution integrated into the processing pipeline.</Paragraph> <Paragraph position="10"> A potential protein name has to be evaluated against three possible outcomes: either a name is an acronym and can be resolved as (a) a protein or (b) not a protein, or (c) a name cannot be resolved. To distinguish cases (a) and (b) the document content is processed to identify the expanded form of the acronym and to check whether the expanded form refers to a protein name. 
In case (c), the frequency of the name in the British National Corpus (BNC) is compared with a threshold. If the frequency is higher than the threshold, the name is assumed not to be a protein name. The threshold was chosen so as not to exclude important protein names that have already entered common English (such as insulin).</Paragraph> <Paragraph position="11"> The disambiguation module (M-4) runs on the results of the previous module, which performs protein-name matching and indiscriminately assumes each match to be a protein name. Module M-4 marks up all known acronym expansions in the text and combines the two pieces of information: a marked-up protein name is looked up in the list of abbreviations. If the abbreviation has an expansion that is marked up in the vicinity and denotes a protein name, the abbreviation is verified as a protein name (case (a) above) by adding an attribute with a suitable value to the protein tag. The annotation also includes the normalised form of the acronym, which serves as an identifier for further database lookups. Similarly, if the expansion is clearly not a protein name, the same attribute is used with the corresponding value.</Paragraph> <Paragraph position="12"> Finally, module M-5 removes the protein name markup if the name is either clearly not a protein (case (b)) or, in case (c), has a BNC frequency above the threshold.</Paragraph> <Paragraph position="13"> 3.5 Finding other names in text Further modules (M-6, M-7 and M-8 in Fig. 1) perform matching and markup for drugs from MedlinePlus, species from Entrez Taxonomy and terms from the Gene Ontology (GO). As for proteins, the semantic type is signified by the element name and a unique ID referencing the source database is added as an attribute. 
Disambiguation for these names and terms is, however, not yet available.</Paragraph> <Paragraph position="14"> Finding GO terms in text can be difficult, as these names are typically &quot;descriptions&quot; rather than real terms (e.g. GO:0016886, ligase activity, forming phosphoric ester bonds), and therefore are not likely to appear in text frequently (McCray et al., 2002; Verspoor et al., 2003; Nenadic et al., 2004).</Paragraph> <Paragraph position="15"> Figure 3 shows an example of a sentence annotated for semantic types and POS information using the pipeline from Figure 1. Note that POS tags are inside the type tags although type annotation has been performed prior to the POS tagging.</Paragraph> </Section> <Section position="3" start_page="14" end_page="15" type="sub_section"> <SectionTitle> 3.6 Other modules in the pipeline </SectionTitle> <Paragraph position="0"> The modular text processing pipeline of EBIMed is currently being extended to include other modules. The part-of-speech tagger (POS-tagger) is a separate module and combines tokenization and POS annotation. It leaves previously annotated entities as single tokens, even for multi-word terms, and assigns a noun POS tag to every such entity. Shallow parsing is introduced as another layer in the multidimensional annotation of biomedical documents. After the NER modules, the shallow parsing modules extract events of protein-protein interactions. Shallow parsing essentially annotates noun phrases (NP) and verb groups. Noun phrases that contain a protein name receive a modified NP tag (Protein-NP) to simplify the finding of protein-protein interaction phrases. 
Patterns of Protein-NPs in conjunction with selected verb groups are annotated as the final result.</Paragraph> <Paragraph position="1"> [Figure 3: example sentence containing different semantic types and POS tags.]</Paragraph> </Section> </Section> <Section position="5" start_page="15" end_page="15" type="metho"> <SectionTitle> 4 EBIMed </SectionTitle> <Paragraph position="0"> This cascaded approach to NER has been incorporated into EBIMed, a system for mining biomedical literature.</Paragraph> <Paragraph position="1"> EBIMed is a service that combines document retrieval with co-occurrence-based summarization of Medline abstracts. Upon a keyword query, EBIMed retrieves abstracts from EMBL-EBI's installation of Medline and filters for biomedical terminology. The final result is organised in a view displaying pairs of concepts. Each pair co-occurs in at least one sentence in the retrieved abstracts. The findings (e.g.</Paragraph> <Paragraph position="2"> UniProt/Swiss-Prot proteins, GO annotations, drugs and species) are listed in conjunction with the UniProt/Swiss-Prot protein that appears in the same biological context. All terms, retrieved abstracts and extracted sentences are automatically linked to contextual information, e.g. entries in biomedical databases.</Paragraph> <Paragraph position="3"> The annotation modules are also available via HTTP requests that specify which modules to run (cf. Whatizit). 
Note that with suitable pre-processing to insert the <text> tags, even well-formed HTML can be processed.</Paragraph> </Section> <Section position="6" start_page="15" end_page="16" type="metho"> <SectionTitle> 5 Lessons Learnt so far </SectionTitle> <Paragraph position="0"> Our text mining solution EBIMed successfully applies multi-dimensional markup in a pipeline of text processing modules to facilitate online retrieval and mining of the biomedical literature.</Paragraph> <Paragraph position="1"> The final goal is semantic annotation of biomedical terms with UIDs and, in the next step, shallow-parsing-based text processing for relationship identification. The following lessons have been learnt during design, implementation and use of our system.</Paragraph> <Paragraph position="2"> The end-users expect to see the original document at all times and therefore we have to rely on proper formatting of the original and the processed text. Consequently, when adding semantic information, all other meta-information must be preserved to allow for rendering that is as similar as possible to the original document.</Paragraph> <Paragraph position="3"> Therefore, our approach does not remove any pre-existing annotations supplied by the publisher, i.e. the original document could be recovered by removing all introduced markup.</Paragraph> <Paragraph position="4"> All modules process only the sections of the document that contain natural language text, which improves modularisation. Because the document structure is irrelevant to individual modules, each module can read from the input stream and write to the output stream without taking notice of the beginning or the end of a single document.</Paragraph> <Paragraph position="5"> All information exchanged between modules is contained in the data stream. This facilitates running all the modules in a given pipeline in parallel, after an initial start-up. 
Moreover, the modules can be distributed on separate machines with no implementation overhead for the communication over the network. Adding more modules with their own processors does not significantly impair overall runtime behaviour for large data sets. The result is fast text processing throughput combined with a reasonable, albeit not yet perfect, quality, which allows for new and practically useful text mining solutions such as EBIMed.</Paragraph> <Paragraph position="6"> Modularisation of the text processing tasks leads to the improved scalability and maintainability inherent in all modular software solutions. In the case of the presented solution, the modular approach allows for a selection of the setup and ordering of the modules, leading to a flexible software design, which can be adapted to different types of documents and which allows for an (incremental) replacement of methods to improve the quality of the output. This can also facilitate improved interoperability of XML-based NLP tools.</Paragraph> <Paragraph position="7"> Semantic annotation of named entities and terms blends effectively with logical markup, simply because there is no overlap between document structure and named entities and terms.</Paragraph> <Paragraph position="8"> On the other hand, some physical markup (such as <i> in the BMC corpus) is in some documents used to highlight names or terms of a semantic type, e.g. gene names. With consistent semantic markup, this kind of physical tag could be abandoned and replaced by external style information. 
However, some semantic annotations must still be combined with physical markup, as in the term B-sup, which was initially annotated by a publisher as <b>B</b>-sup and which now (after NER) would be marked as <z:uniprot><b>B</b>-sup</z:uniprot>.</Paragraph> <Paragraph position="9"> Matching of names of a semantic type, e.g.</Paragraph> <Paragraph position="10"> protein/gene, is done on a &quot;longest of the leftmost&quot; basis and prioritization of semantic types is enforced by the order of the term identification modules. As a result of both choices, overlapping annotations are preempted and each annotation automatically carries a link to a unique identifier, unless there is ambiguity at the level of the biomedical resource. This type of ambiguity is not resolved in our text processing solution. Instead, for a given biomedical term, links to all entries referring to this term in the same database are kept.</Paragraph> <Paragraph position="11"> One approach to the disambiguation of Amb2 (multiple resources) and Amb3 (common English words) ambiguities would be to integrate all terms into one massive dictionary, identify the strings in the text and then disambiguate between n semantic types. This would require the disambiguation module to be trained to distinguish all semantic types. If a new type is added, the disambiguation module would need to be retrained, which limits the possibilities for expansion and tailoring of text mining solutions.</Paragraph> <Paragraph position="12"> Open Problems: We consider two categories of open problems: NLP-based and XML-based problems.</Paragraph> <Paragraph position="13"> Bio-NLP-based problems include challenges in recognition and disambiguation of biomedical names in text. One of the main issues in our approach is annotation of compound and nested terms. The presented methodology can lead to the following annotations: 1. 
the head noun belongs to the same semantic type, but is not part of the protein name (as represented in the terminological resource): <z:uniprot>Wnt-2</z:uniprot> protein 2. the head noun belongs to a different semantic type not covered by any of the available terminological resources: <z:uniprot>WNT8B</z:uniprot> mRNA 3. a compound term consists of terms from different semantic types, but its semantic type is not known: <z:uniprot fb=&quot;0&quot; ids=&quot;...&quot;>betacatenin</z:uniprot> <z:go ids=&quot;...&quot; onto=&quot;...&quot;>binding</z:go> domain Therefore, an important open problem is the annotation of nested terms where an entity name is part of a larger term that may or may not be in one of the dictionaries. Once the inner term is marked up with inline annotation, simple string pattern matching (utilised in our approach) cannot easily be used to find the outer term, because the XML structure is in the way. A more effective solution could be a combination of inline with stand-off annotation.</Paragraph> <Paragraph position="14"> Further, in a more complex case such as htr-wnt-<uniprot>A protein</uniprot>, neither wnt nor htr refers to a single protein but to a protein family, and whereas A protein is a known protein, this is not the case for wnt-A. The most obvious annotation <uniprot>htr-wnt-A protein</uniprot> cannot be resolved by the terminology from the UniProt/Swiss-Prot database, as it simply does not exist in the database.</Paragraph> <Paragraph position="15"> More work is also needed on disambiguation of terms that correspond to common English words.</Paragraph> <Paragraph position="16"> Annotation (i.e. XML)-based problems mainly relate to the open question of whether different tag names should be used for various semantic types, or semantic types should be represented via attributes of a generalised named entity or term tag. In EBIMed, specific tags are used to denote specific semantic types. 
A similar challenge is how to treat and make use of entities such as in-line references, citations and formulas (typically annotated in journals), which are commonly ignored by NLP modules.</Paragraph> <Paragraph position="17"> The most important issue, however, is how to represent still unresolved ambiguities, so that annotations might be modified at a later stage, e.g. when POS information or even the full parse tree is available. This also includes the question of what kind of information should be made available for later processing. For example, as (compound) term identification is done before POS tagging, an open question is whether POS information should be assigned to individual components of a compound term (in addition to the term itself), since this information could be used to complete NER or adjust the results at a later stage.</Paragraph> </Section> </Paper>