<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1306">
  <Title>Corpus design for biomedical natural language processing</Title>
  <Section position="5" start_page="39" end_page="40" type="metho">
    <SectionTitle>
3 Results
</SectionTitle>
    <Paragraph position="0"> Even without examining the external usage data, two points are immediately evident from Tables 1 and 2: a0 Only one of the currently publicly available corpora (GENIA) is suitable for evaluating performance on basic preprocessing tasks.</Paragraph>
    <Paragraph position="1"> 8In the cases of the two corpora for which we found only zero or one external usage, this search was repeated by an experienced medical librarian, and included reviewing 67 abstracts or full papers that cite Blaschke et al. (1999) and 37 that cite Craven and Kumlein (1999).</Paragraph>
    <Paragraph position="2"> a0 These corpora include only a very limited range of genres: only abstracts and roughly sentencesized inputs are represented.</Paragraph>
    <Paragraph position="3"> Examination of Table 3 makes another point immediately clear. The currently publicly available corpora fall into two groups: ones that have had a number of external applications (GENIA, GENE-TAG, and Yapex), and ones that have not (Medstract, Wisconsin, and PDG). We now consider a number of design features and other characteristics of these corpora that might explain these groupings9.</Paragraph>
    <Section position="1" start_page="39" end_page="39" type="sub_section">
      <SectionTitle>
3.1 Effect of age
</SectionTitle>
      <Paragraph position="0"> We considered the very obvious hypothesis that it might be length of time that a corpus has been available that determines the amount of use to which it has been put. (Note that we use the terms &amp;quot;hypothesis&amp;quot; and &amp;quot;effect&amp;quot; in a non-statistical sense, and there is no significance-testing in the work reported here.) Tables 1 and 3 show clearly that this is not the case.</Paragraph>
      <Paragraph position="1"> The age of the PDG, Wisconsin, and GENIA data is the same, but the usage rates are considerably different--the GENIA corpus has been much more widely used. The GENETAG corpus is the newest, but has a relatively high usage rate. Usage of a corpus is determined by factors other than the length of time that it has been available.</Paragraph>
    </Section>
    <Section position="2" start_page="39" end_page="40" type="sub_section">
      <SectionTitle>
3.2 Effect of size
</SectionTitle>
      <Paragraph position="0"> We considered the hypothesis that size might be the determinant of the amount of use to which a corpus is put--perhaps smaller corpora simply do not provide enough data to be helpful in the development and validation of learning-based systems. We can 9Three points should be kept in mind with respect to this data. First, although the sample includes all of the corpora that we are aware of, it is small. Second, there is a variety of potential confounds related to sociological factors which we are aware of, but do not know how to quantify. One of these is the effect of association between a corpus and a shared task. This would tend to increase the usage of the corpus, and could explain the usage rates of GENIA and GENETAG, although not that of Yapex. Another is the effect of association between a corpus and an influential scientist. This might tend to increase the usage of the corpus, and could explain the usage rate of GENIA, although not that of GENETAG. Finally, there may be interactions between any of these factors, or as a reviewer pointed out, there may be a separate explanation for the usage rate of each corpus in this study. Nevertheless, the analysis of the quantifiable factors presented above clearly provides useful information about the design of successful corpora.</Paragraph>
      <Paragraph position="1">  reject this hypothesis: the Yapex corpus is one of the smallest (a fraction of the size of the largest, and only roughly a tenth of the size of GENIA), but has achieved fairly wide usage. The Wisconsin corpus is the largest, but has a very low usage rate.</Paragraph>
    </Section>
    <Section position="3" start_page="40" end_page="40" type="sub_section">
      <SectionTitle>
3.3 Effect of structural and linguistic annotation
</SectionTitle>
      <Paragraph position="0"> We expected a priori that the corpus with the most extensive structural and linguistic annotation would have the highest usage rate. (In this context, by structural annotation we mean tokenization and sentence segmentation, and by linguistic annotation we mean POS tagging and shallow parsing.) There isn't a clear-cut answer to this.</Paragraph>
      <Paragraph position="1"> The GENIA corpus is the only one with curated structural and POS annotation, and it has the highest usage rate. This is consistent with our initial hypothesis. null On the other hand, the Wisconsin corpus could be considered the most &amp;quot;deeply&amp;quot; linguistically annotated, since it has both POS annotation and-unique among the various corpora--shallow parsing. It nevertheless has a very low usage rate. However, the comparison is not clearcut, since both the POS tagging and the shallow parsing are fully automatic and not manually corrected. (Additionally, the shallow parsing and the tokenization on which it is based are somewhat idiosyncratic.) It is clear that the Yapex corpus has relatively high usage despite the fact that it is, from a linguistic perspective, very lightly annotated (it is marked up for entities only, and nothing else). To our surprise, structural and linguistic annotation do not appear to uniquely determine usage rate.</Paragraph>
    </Section>
    <Section position="4" start_page="40" end_page="40" type="sub_section">
      <SectionTitle>
3.4 Effect of format
</SectionTitle>
      <Paragraph position="0"> Annotation format has a large effect on usage. It bears repeating that these six corpora are distributed in six different formats--even the presumably simple task of populating the Size column in Table 1 required writing six scripts to parse the various data files. The two lowest-usage corpora are annotated in remarkably unique formats. In contrast, the three more widely used corpora are distributed in relatively more common formats. Two of them (GENIA and Yapex) are distributed in XML, and one of them (GENIA) offers a choice for POS tagging information between XML and the whitespace-separated, one-token-followed-by-tags-per-line format that is common to a number of POS taggers and parsers.</Paragraph>
      <Paragraph position="1"> The third (GENETAG) is distributed in the widely used slash-attached format (e.g. sense/NN).</Paragraph>
    </Section>
    <Section position="5" start_page="40" end_page="40" type="sub_section">
      <SectionTitle>
3.5 Effect of semantic annotation
</SectionTitle>
      <Paragraph position="0"> The data in Table 2 and Table 3 are consistent with the hypothesis that semantic annotation predicts usage. The claim would be that corpora that are built specifically for entity identification purposes are more widely used than corpora of other types, presumably due to a combination of the importance of the entity identification task as a prerequisite to a number of other important applications (e.g. information extraction and retrieval) and the fact that it is still an unsolved problem. There may be some truth to this, but we doubt that this is the full story: there are large differences in the usage rates of the three EI corpora, suggesting that semantic annotation is not the only relevant design feature. If this analysis is in fact correct, then certainly we should see a reduction in the use of all three of these corpora once the EI problem is solved, unless their semantic annotations are extended in new directions.</Paragraph>
    </Section>
    <Section position="6" start_page="40" end_page="40" type="sub_section">
      <SectionTitle>
3.6 Effect of semantic domain
</SectionTitle>
      <Paragraph position="0"> Both the advantages and the disadvantages of restricted domains as targets for language processing systems are well known, and they seem to balance out here. The scope of the domain does not affect usage: both the low-use and higher-use groups of corpora contain at least one highly restricted domain (GENIA in the high-use group, and PDG in the low-use group) and one broader domain (GENETAG in the high-use group, and Wisconsin in the lower-use group).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="40" end_page="43" type="metho">
    <SectionTitle>
4 Discussion
</SectionTitle>
    <Paragraph position="0"> The data presented in this paper show clearly that external usage rates vary widely for publicly available biomedical corpora. This variability is not related to the biological relevance of the corpora--the PDG and Wisconsin corpora are clearly of high biological relevance as evinced by the number of systems that have tackled the information extraction tasks that they are meant to support. Additionally, from a biological perspective, the quality of the data in the  PDG corpus is exceptionally high. Rather, our data suggest that basic issues of distribution format and of structural and linguistic annotation seem to be the strongest predictors of how widely used a biomedical corpus will be. This means that as builders of of data sources for BLP, we can benefit from the extensive experience of the corpus linguistics world.</Paragraph>
    <Paragraph position="1"> Based on that experience, and on the data that we have presented in this paper, we offer a number of suggestions for the design of the next generation of biomedical corpora.</Paragraph>
    <Paragraph position="2"> We also suggest that the considerable investments already made in the construction of the lessfrequently-used corpora can be protected by modifying those corpora in accordance with these suggestions. null Leech (1993) and McEnery and Wilson (2001), coming from the perspective of corpus linguistics, identify a number of definitional issues and design maxims for corpus construction. Some of these are quite relevant to the current state of biomedical corpus construction. We frame the remainder of our discussion in terms of these issues and maxims.</Paragraph>
    <Section position="1" start_page="41" end_page="43" type="sub_section">
      <SectionTitle>
4.1 Level of annotation
</SectionTitle>
      <Paragraph position="0"> From a definitional point of view, annotation is one of the distinguishing points of a corpus, as opposed to a text collection. Perhaps the most salient characteristic of the currently publicly available corpora is that from a linguistic or language processing perspective, with the exception of GENIA and GENE-TAG, they are barely annotated at all. For example, although POS tagging has possibly been the sine qua non of the usable corpus since the earliest days of the modern corpus linguistic age, five of the six corpora listed in Table 2 either have no POS tagging or have only automatically generated, uncorrected POS tags. The GENIA corpus, with its carefully curated annotation of sentence segmentation, tokenization, and part-of-speech tagging, should serve as a model for future biomedical corpora in this respect.</Paragraph>
      <Paragraph position="1"> It is remarkable that with just these levels of annotation (in addition to its semantic mark-up), the GENIA corpus has been applied to a wide range of task types other than the one that it was originally designed for. Eight papers from COLING 2004 (Kim et al. 2004) used it for evaluating entity identification tasks. Yang et al. (2002) adapted a subset of the corpus for use in developing and testing a coreference resolution system. Rinaldi et al. (2004) used it to develop and test a question-answering system.</Paragraph>
      <Paragraph position="2"> Locally, it has been used in teaching computational corpus linguistics for the past two years. We do not claim that it has not required extension for some of these tasks--our claim is that it is its annotation on these structural and linguistic levels, in combination with its format, that has made these extensions practical. null  standardization A basic desideratum for a corpus is recoverability: it should be possible to map from the annotation to the raw text. A related principle is that it should be easy for the corpus user to extract all annotation information from the corpus, e.g. for external storage and processing: &amp;quot;in other words, the annotated corpus should allow the maximum flexibility for manipulation by the user&amp;quot; (McEnery and Wilson, p.</Paragraph>
      <Paragraph position="3"> 33). The extent to which these principles are met is a function of the annotation format. The currently available corpora are distributed in a variety of one-off formats. Working with any one of them requires learning a new format, and typically writing code to access it. At a minimum, none of the non-XML corpora meet the recoverability criterion. None10 of these corpora are distributed in a standoff annotation format. Standoff annotation is the strategy of storing annotation and raw text separately (Leech 1993). Table 4 contrasts the two. Non-standoff annotation at least obscures--more frequently, destroys-important aspects of the structure of the text itself, such as which textual items are and are not immediately adjacent. Using standoff annotation, there is no information loss whatsoever. Furthermore, in the standoff annotation strategy, the original input text is immediately available in its raw form. In contrast, in the non-standoff annotation strategy, the original must be retrieved independently or recovered from the annotation (if it is recoverable at all). The stand-off annotation strategy was relatively new at the time that most of the corpora in Table 1 were designed, but by now has become easy to implement, in part 10The semantic annotation of the GENETAG corpus is in a standoff format, but neither the tokenization nor the POS tagging is.</Paragraph>
      <Paragraph position="4">  due to the availability of tools such as the University of Pennsylvania's WordFreak (Morton and LaCivita 2003).</Paragraph>
      <Paragraph position="5"> Crucially, this annotation should be based on character offsets, avoiding a priori assumptions about tokenization. See Smith et al. (2005) for an approach to refactoring a corpus to use character offsets. null  The maxim of documentation suggests that annotation guidelines should be published. Further, basic data on who did the annotations and on their level of agreement should be available. The currently available datasets mostly lack assessments of inter-annotator agreement, utilize a small or unspecified number of annotators, and do not provide published annotation guidelines. (We note the Yang et al. (2002) coreference annotation guidelines, which are excellent, but the corresponding corpus is not publicly available.) This situation can be remedied by editors, who should insist on publication of all of these. The GENETAG corpus is notable for the detailed documentation of its annotation guidelines.</Paragraph>
      <Paragraph position="6"> We suspect that the level of detail of these guidelines contributed greatly to the success of some rule-based approaches to the EI task in the BioCreative competition, which utilized an early version of this corpus.  Corpus linguists generally strive for a well-structured stratified sample of language, seeking to &amp;quot;balance&amp;quot; in their data the representation of text types, different sorts of authors, and so on. Within the semantic domain of molecular biology texts, an important dimension on which to balance is the genre or text type.</Paragraph>
      <Paragraph position="7"> As is evident from Table 1, the extant datasets draw on a very small subset of the types of genres that are relevant to BLP: we have not done a good job yet of observing the principle of balance or representativeness. The range of genres that exist in the research (as opposed to clinical) domain alone includes abstracts, full-text articles, GeneRIFs, definitions, and books. We suggest that all of these should be included in future corpus development efforts.</Paragraph>
      <Paragraph position="8"> Some of these genres have been shown to have distinguishing characteristics that are relevant to BLP. Abstracts and isolated sentences from them are inadequate, and also unsuited to the opportunities that are now available to us for text data mining with the recent announcement of the NIH's new policy on availability of full-text articles (NIH 2005). This policy will result in the public availability of a large and constantly growing archive of current, full-text publications. Abstracts and sentences are inadequate in that experience has shown that significant amounts of data are not found in abstracts at all, but are present only in the full texts of articles, sometimes not even in the body of the text itself, but rather in tables and figure captions (Shatkay and Feldman 2003). They are not suited to the upcoming opportunities in that it is not clear that practicing on abstracts will let us build the necessary skills for dealing with the flood of full-text articles that PubMedCentral is poised to deliver to us. Furthermore, there are other types of data--GeneRIFs and domain-specific dictionary definitions, for instance--that are fruitful sources of biological knowledge, and which may actually be easier to process automatically than abstracts. Space does not permit justifying the importance of all of these genres, but we discuss the rationale for including full text at some length due to the recent NIH announcement and due to the large body of evidence that can currently be brought to bear on the issue. A growing body of recent research makes  two points clear: full-text articles are different from abstracts, and full-text articles must be tapped if we are to build high-recall text data mining systems.</Paragraph>
      <Paragraph position="9"> Corney et al. (2004) looked directly at the effectiveness of information extraction from full-text articles versus abstracts. They found that recall from full-text articles was more than double that from abstracts. Analyzing the relative contributions of the abstracts and the full articles, they found that more than half of the interactions that they were able to extract were found in the full text and were absent in the abstract.</Paragraph>
      <Paragraph position="10"> Tanabe and Wilbur (2002) looked at the performance on full-text articles of an entity identification system that had originally been developed and tested using abstracts. They found different false positive rates in the Methods sections compared to other sections of full-text articles. This suggests that full-text articles, unlike abstracts, will require parsing of document structure. They also noted a range of problems related to the wider range of characters (including, e.g., superscripts and Greek letters) that occurs in full-text articles, as opposed to abstracts.</Paragraph>
      <Paragraph position="11"> Schuemie et al. (2004) examined a set of 3902 full-text articles from Nature Genetics and BioMed Central, along with their abstracts. They found that about twice as many MeSH concepts were mentioned in the full-text articles as in the abstracts. They also found that full texts contained a larger number of unique gene names than did abstracts, with an average of 2.35 unique gene names in the full-text articles, but an average of only 0.61 unique gene names in the abstracts.</Paragraph>
      <Paragraph position="12"> It seems clear that for biomedical text data mining systems to reach anything like their full potential, they will need to be able to handle full-text inputs. However, as Table 1 shows, no publicly available corpus contains full-text articles. This is a deficiency that should be remedied.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML