<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1211">
  <Title>Creating a Test Corpus of Clinical Notes Manually Tagged for Part-of-Speech Information</Title>
  <Section position="3" start_page="62" end_page="62" type="intro">
    <SectionTitle>
2 Annotation
</SectionTitle>
    <Paragraph position="0"> Prior to this study, the three annotators who participated in it had substantial experience in coding clinical diagnoses but virtually no experience in POS markup. The training process consisted of a general and rather superficial introduction to issues in linguistics, as well as some formal training using the POS tagging guidelines developed by Santorini (1991) for tagging Penn Treebank data. The formal training was followed by informal discussions of the data and of difficult cases pertinent to the clinical notes domain, which often resulted in slight modifications to the Penn Treebank guidelines.</Paragraph>
    <Paragraph position="1"> The annotation process consisted of preprocessing and editing. Preprocessing included sentence boundary detection, tokenization, and priming with part-of-speech tags generated by a MaxEnt tagger (Maxent 1.2.4 package (Baldridge et al.)) trained on Penn Treebank data.</Paragraph>
    <Paragraph position="2"> Automatically annotated notes were then presented to the domain experts for editing.</Paragraph>
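    <Paragraph position="3"> The preprocessing steps described above (sentence boundary detection, tokenization, and priming with POS tags) can be illustrated with a minimal sketch. This is not the pipeline used in the study: the regular expressions are naive, and the function names, the TOY_LEXICON table, and the default NN tag are hypothetical stand-ins for the MaxEnt tagger trained on Penn Treebank data.

```python
import re

# Hypothetical toy lexicon. The study used a MaxEnt tagger trained on
# Penn Treebank data; this lookup table only imitates its output format.
TOY_LEXICON = {
    "the": "DT", "patient": "NN", "denies": "VBZ", "chest": "NN",
    "pain": "NN", "she": "PRP", "is": "VBZ", "stable": "JJ", ".": ".",
}


def split_sentences(text):
    """Naive sentence boundary detection: cut after '.', '!' or '?'."""
    chunks = re.findall(r"[^.!?]+[.!?]?", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def prime_tags(tokens):
    """Attach a provisional tag to each token (assumed default: NN)."""
    return [(token, TOY_LEXICON.get(token.lower(), "NN")) for token in tokens]


def preprocess(text):
    """Return one list of (token, tag) pairs per detected sentence."""
    return [prime_tags(tokenize(sentence)) for sentence in split_sentences(text)]


if __name__ == "__main__":
    note = "The patient denies chest pain. She is stable."
    for tagged_sentence in preprocess(note):
        print(" ".join(token + "/" + tag for token, tag in tagged_sentence))
```

In a workflow like the one described, the provisional tags produced by such a step would be what the domain experts review and correct during editing.</Paragraph>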
  </Section>
</Paper>