<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1306"> <Title>Corpus design for biomedical natural language processing</Title> <Section position="7" start_page="43" end_page="43" type="concl"> <SectionTitle> 5 Conclusion </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="43" end_page="43" type="sub_section"> <SectionTitle> 5.1 Best practices in biomedical corpus construction </SectionTitle> <Paragraph position="0"> We have discussed the importance of recoverability, publication of guidelines, balance and representativeness, and linguistic annotation. Corpus maintenance is also important. Bada et al. (2004) point out the role that an organized and responsive maintenance plan has played in the success of the Gene Ontology. It seems likely that the continued development and maintenance reflected in the three major releases of GENIA (Ohta et al. 2002, Kim et al. 2003) have contributed to its improved quality and continued use over the years.</Paragraph> </Section> <Section position="2" start_page="43" end_page="43" type="sub_section"> <SectionTitle> 5.2 A testable prediction </SectionTitle> <Paragraph position="0"> We have interpreted the data on the characteristics and usage rates of the various datasets discussed in this paper as suggesting that datasets that are developed in accordance with basic principles of corpus linguistics are more useful, and therefore more used, than datasets that are not.</Paragraph> <Paragraph position="1"> A current project at the University of Pennsylvania and the Children's Hospital of Philadelphia (Kulick et al. 2004) is producing a corpus that follows many of these basic principles.
We predict that this corpus will see wide use by groups other than the one that created it.</Paragraph> </Section> <Section position="3" start_page="43" end_page="43" type="sub_section"> <SectionTitle> 5.3 The next step: grounded references </SectionTitle> <Paragraph position="0"> The logical next step for BLP corpus construction efforts is the production of corpora in which entities and concepts are grounded with respect to external models of the world (Morgan et al. 2004).</Paragraph> <Paragraph position="1"> The BioCreative Task 1B data set construction effort provides a proof-of-concept of the plausibility of building BLP corpora that are grounded with respect to external models of the world, and in particular, biological databases. These will be crucial in taking us beyond the stage of extracting information about text strings, and towards mining knowledge about known, biologically relevant entities.</Paragraph> </Section> </Section> </Paper>