<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1306">
  <Title>Corpus design for biomedical natural language processing</Title>
  <Section position="4" start_page="0" end_page="39" type="intro">
    <SectionTitle>
2 Materials and methods
</SectionTitle>
    <Paragraph position="0"> Table 1 lists the publicly available biomedical corpora of which we are aware. We omit discussion here of the corpus currently in production by the University of Pennsylvania and the Children's Hospital of Philadelphia (Kulick et al. 2004), since it is not yet available in finished form. We also omit text collections from our discussion. By text collection we mean textual data sets that may include metadata about documents, but do not contain mark-up of the document contents. So, the OHSUMED text collection (Hersh et al. 1994) and the TREC Genomics track data sets (Hersh and Bhupatiraju 2003, Hersh et al. 2004) are excluded from this work, although their utility in information retrieval is clear.</Paragraph>
    <Paragraph position="1"> Table 2 (caption fragment): ... corpora are applicable. SS is sentence segmentation, T is tokenization, and POS is part-of-speech tagging. EI is entity identification, IE is information extraction, A is acronym/abbreviation definition, and C is coreference resolution.</Paragraph>
    <Paragraph position="4"> Table 1 lists the corpora and, for each corpus, gives its release date (or the year of the corresponding publication), the genre of its contents, and its size (see footnote 7).</Paragraph>
    <Paragraph position="5"> The left-hand side of Table 2 lists the data sets and, for each one, indicates the lower-level general language processing problems that it could be applied to, either as a source of training data or for evaluating systems that perform these tasks. We considered here sentence segmentation, word tokenization, and part-of-speech (POS) tagging.</Paragraph>
    <Paragraph position="6"> Footnote 7: Sizes are given in words. Published descriptions of the corpora don't generally give size in words, so this data is based on our own counts. See the web site at http://compbio.uchsc.edu/corpora for details on how we did the count for each corpus.</Paragraph>
    <Paragraph position="7"> The right-hand side of Table 2 shows the higher-level tasks to which the various corpora can be applied. We considered here entity identification, information (relation) extraction, abbreviation/acronym definition, and coreference resolution. (Information retrieval is approached via text collections rather than corpora.) These tasks are directly related to the types of semantic annotation present in each corpus. The three EI-only corpora (GENIA, Yapex, GENETAG) are annotated with semantic classes of relevance to the molecular biology domain. In the case of the Yapex and GENETAG corpora, this annotation uses a single semantic class, roughly equivalent to the gene or gene product. In the case of the GENIA corpus, the annotation reflects a more sophisticated, if not widely used, ontology. The Medstract corpus uses multiple semantic classes, including gene, protein, cell type, and molecular process. In all of these cases, the semantic annotation was carefully curated, and in one (GENETAG) it includes alternative analyses. Two of the corpora (PDG, Wisconsin) are indicated in Table 2 as being applicable to both entity identification and information extraction tasks. From a biological perspective, the PDG corpus has exceptionally well-curated positive examples. From a linguistic perspective, it is almost unannotated. For each sentence, the entities are listed, but their locations in the text are not indicated, making the corpus applicable to some definitions of the entity identification task but not others. The Wisconsin corpus contains both positive and negative examples. For each example, entities are listed in a normalized form, but without clear pointers to their locations in the text, making this corpus similarly difficult to apply to many definitions of the entity identification task.</Paragraph>
    <Paragraph position="8"> The Medstract corpus is unique among these in being annotated with coreferential equivalence sets, and also with acronym expansions.</Paragraph>
    <Paragraph position="9"> All six corpora draw on the same subject matter domain--molecular biology--but they vary widely with respect to their level of semantic restriction within that relatively broad category. One (GENIA) is restricted to the subdomain of human blood cell transcription factors. Another (Yapex) combines data from this domain with abstracts on protein binding in humans. The GENETAG corpus is considerably broader in topic, with all of PubMed/MEDLINE serving as a potential data source. The Medstract corpus contains biomedical material not apparently related to molecular biology. The PDG corpus is drawn from a very narrow subdomain on protein-protein interactions. The Wisconsin corpus is composed of data from three separate sub-domains: protein-protein interactions, subcellular localization of proteins, and gene/disease associations.</Paragraph>
    <Paragraph position="10"> Table 3 (caption fragment): ... gives the count of the number of systems that actually used the dataset, as opposed to publications that cited the paper but did not use the data itself. Age is in years as of 2005.</Paragraph>
    <Paragraph position="11"> Table 3 shows, for each of the data sets described in Tables 1 and 2, the number of systems that used it and that were built outside of the lab that created the corpus. The counts in this table reflect work that actually used the data sets, as opposed to work that cites the publication describing a data set but doesn't actually use the data. We assembled the data for these counts by consulting with the creators of the data sets and by doing our own literature searches (footnote 8). If a system is described in multiple publications, we count it only once, so the number of systems is slightly smaller than the number of publications.</Paragraph>
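The counting rule above (one count per system, however many publications describe it) can be sketched as follows. This is a minimal illustration, not the procedure we actually used: the record layout (corpus, system, publication) and every name in the sample data are hypothetical and do not correspond to any corpus or system in Table 3.

```python
from collections import defaultdict

# Hypothetical usage records: (corpus, system, publication).
# None of these names come from the paper; they only illustrate the rule.
USAGE = [
    ("CorpusA", "system-1", "pub-2003a"),
    ("CorpusA", "system-1", "pub-2004b"),  # same system, second paper
    ("CorpusA", "system-2", "pub-2004c"),
    ("CorpusB", "system-3", "pub-2005a"),
]


def count_systems(records):
    """Return {corpus: number of distinct systems that used it}."""
    systems_by_corpus = defaultdict(set)
    for corpus, system, _publication in records:
        # A set keeps each system once, even if several papers describe it.
        systems_by_corpus[corpus].add(system)
    return {corpus: len(systems) for corpus, systems in systems_by_corpus.items()}


def count_publications(records):
    """Return {corpus: number of distinct publications that used it}."""
    pubs_by_corpus = defaultdict(set)
    for corpus, _system, publication in records:
        pubs_by_corpus[corpus].add(publication)
    return {corpus: len(pubs) for corpus, pubs in pubs_by_corpus.items()}


if __name__ == "__main__":
    print(count_systems(USAGE))
    print(count_publications(USAGE))
```

Under these assumptions, the system count for a corpus can never exceed its publication count, which is why the number of systems comes out slightly smaller than the number of publications.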
  </Section>
</Paper>