File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/04/n04-2007_abstr.xml
Size: 1,437 bytes
Last Modified: 2025-10-06 13:43:30
<?xml version="1.0" standalone="yes"?> <Paper uid="N04-2007"> <Title>A Preliminary Look into the Use of Named Entity Information for Bioscience Text Tokenization</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> Tokenization in the bioscience domain is often difficult. New terms, technical terminology, and nonstandard orthography, all common in bioscience text, contribute to this difficulty.</Paragraph> <Paragraph position="1"> This paper introduces the tasks of tokenization and normalization before presenting BAccHANT, a system built for bioscience text normalization. Casting tokenization and normalization as a problem of punctuation classification motivates the use of machine learning methods in implementing this system.</Paragraph> <Paragraph position="2"> The evaluation of BAccHANT included an error analysis of the system's performance inside and outside of named entities (NEs) from the GENIA corpus, which led to the creation of BAccHANT-N, a normalization system trained solely on data from inside NEs. Evaluation of this new system indicated that normalization systems trained on data inside NEs perform better than systems trained on data both inside and outside NEs, motivating a merging of the tokenization and named entity tagging processes as opposed to the standard pipelining approach.</Paragraph> </Section> </Paper>