<?xml version="1.0" standalone="yes"?> <Paper uid="N04-2007"> <Title>A Preliminary Look into the Use of Named Entity Information for Bioscience Text Tokenization</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Related work </SectionTitle> <Paragraph position="0"> As noted in Habert et al. (1998), standard methods for evaluating the quality of tokens produced by tokenization systems do not exist. Although tokenization is a necessary first step for tasks such as document retrieval, sentence boundary detection, and parsing, some work on these tasks takes tokenization for granted (e.g. Chang, Schutze and Altman (2002), Seki and Mostafa (2003)), mentions tokenization without detailing the tokenization scheme (e.g. Fukuda et al. (1998)), or indicates use of a tokenization system without reporting its performance (e.g. Bennet et al. (1999), Yamamoto et al. (2003)). To the author's knowledge, there exists no work analyzing the impact of tokenization performance on bioinformatics tasks.</Paragraph> <Paragraph position="1"> Tokenization methods for bioinformatics tasks range from simple to complex. Bennet et al. (1999) tokenized on whitespace for noun phrase extraction, with additional modification to take &quot;specialized nomenclature&quot; into account. Yamamoto et al. (2003) developed a morphological analyzer for protein name tagging that tokenized, part-of-speech tagged, and stemmed documents. Seki and Mostafa (2003) essentially tokenized by dictionary lookup for protein name extraction, using hand-crafted rules and filtering to identify protein name candidates to check against their dictionary.</Paragraph> <Paragraph position="2"> Relevant work on normalization can be found in the proceedings of the 2003 Text REtrieval Conference (TREC) Genomics track competition. The competition involved two tasks. 
The first task was, for a given gene or protein X, to find all MEDLINE references that focus on the basic biology of the gene/protein from the designated organism. Basic biology includes isolation, structure, genetics and function of genes/proteins in normal and disease states. The second task was to extract GeneRIF statements from records in the MEDLINE biomedical and health abstract repository.</Paragraph> <Paragraph position="3"> Kayaalp et al. (2003) normalized by converting all letters to lower case, and expanded queries by identifying terms with both alphabetic and numerical characters and searching for hyphenated variants, e.g. JAK2 and JAK-2.</Paragraph> <Paragraph position="4"> de Bruijn and Martin (2003) used morphological query expansion along with a relevance feedback engine. Osborne et al. (2003) used a number of query expansion strategies, including appending parenthetical information, acronym expansions, words following hyphens, and lowercase and uppercase versions of terms.</Paragraph> <Paragraph position="5"> de Bruijn and Martin (2003) and Osborne et al. (2003) both indicate that query expansion was beneficial to the performance of their systems.</Paragraph> <Paragraph position="6"> However, no authors gave performance measures for their query expansion methods independent of their final systems.</Paragraph> <Paragraph position="7"> To the author's knowledge, there exists no work analyzing the performance of normalization systems for bioscience literature.</Paragraph> <Paragraph position="8"> Named entities are &quot;proper names and quantities of interest&quot; (Chinchor (1998)) in a document. Named entity tagging involves discovering and marking these entities in a document, e.g. finding all proteins in a document and labeling them as such. Having biomedical documents tagged with named entities (NEs) allows for better information extraction, archival, and searching of those documents. The GENIA corpus (Kim et al. 
(2003)) is a corpus of 2000 MEDLINE abstracts tagged for parts of speech and hand-tagged for NEs. NE tags in the GENIA corpus are based on an ontology, consisting of amino acids, proteins, organisms and their tissues, cells, and other.</Paragraph> </Section></Paper>