File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/04/n04-2007_abstr.xml

Size: 1,437 bytes

Last Modified: 2025-10-06 13:43:30

<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-2007">
  <Title>A Preliminary Look into the Use of Named Entity Information for Bioscience Text Tokenization</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Tokenization in the bioscience domain is often difficult. New terms, technical terminology, and nonstandard orthography, all common in bioscience text, contribute to this difficulty.</Paragraph>
    <Paragraph position="1"> This paper introduces the tasks of tokenization and normalization before presenting BAccHANT, a system built for bioscience text normalization. Casting tokenization/normalization as a problem of punctuation classification motivates the use of machine learning methods in implementing this system.</Paragraph>
    <Paragraph position="2"> The evaluation of BAccHANT included an error analysis of its performance inside and outside of named entities (NEs) from the GENIA corpus. This analysis led to the creation of BAccHANT-N, a normalization system trained solely on data from inside NEs. Evaluation of this new system indicated that normalization systems trained on data inside NEs perform better than systems trained both inside and outside NEs, motivating a merging of the tokenization and named entity tagging processes as opposed to the standard pipelining approach.</Paragraph>
  </Section>
</Paper>