<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3113"> <Title>A Design Methodology for a Biomedical Literature Indexing Tool Using the Rhetoric of Science</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 The aim of citation indexing </SectionTitle> <Paragraph position="0"> Indexing tools, such as CiteSeer (Bollacker et al., 1999), play an important role in the scientific endeavour by providing researchers with a means to navigate through the network of scholarly scientific papers using the connections provided by citations. Citations relate articles within a research field by linking together works whose methods and results are in some way mutually relevant.</Paragraph> <Paragraph position="1"> Customarily, authors include citations in their papers to indicate works that are foundational in their field, background for their own work, or representative of complementary or contradictory research. Another researcher may then use the presence of citations to locate articles she needs to know about when entering a new field or to read in order to keep track of progress in a field where she is already well-established. But, with the explosion in the amount of scientific literature, a means to provide more information in order to give more intelligent control to the navigation process is warranted. A user normally wants to navigate more purposefully than &quot;Find all articles citing a source article&quot;. Rather, the user may wish to know whether other experiments have used similar techniques to those used in the source article, or whether other works have reported conflicting experimental results. 
In order to navigate a citation index in this more-sophisticated manner, the citation index must contain not only the citation-link information but must also indicate the function of the citation in the citing article.</Paragraph> <Paragraph position="2"> The goal of our research project is the design and implementation of an indexing tool for scholarly biomedical literature which uses the text surrounding the citation to provide information about the binary relation between the two papers connected by a citation. In particular, we are interested in how the scientific method structures the way in which ideas, results, theories, etc. are presented in scientific writing and how the style of presentation indicates the purpose of citations, that is, what the relationship is between the cited and citing papers.</Paragraph> <Paragraph position="3"> Our interest in the connection between scientific literature (our focus), ontologies, and databases is that the content and structure of each of these three repositories of scientific knowledge has its foundations in the method of science. Our purpose here is twofold: to make explicit our design methodology for an indexing tool that uses the rhetoric of science as its foundation, in order to see whether the ideas that underlie our methodology can cross-fertilize the enquiry into the other two areas, and to discuss the tool itself, in order to make known that there exists a working tool which can assist the development of other projects.</Paragraph> <Paragraph position="4"> Association for Computational Linguistics. Linking Biological Literature, Ontologies and Databases, pp. 77-84. HLT-NAACL 2004 Workshop: Biolink 2004.</Paragraph> <Paragraph position="5"> A citation may be formally defined as a portion of a sentence in a citing document which references another document or a set of other documents collectively. For example, in sentence 1 below, there are two citations: the first citation is Although the 3-D structure. . . 
progress, with the set of references (Eger et al., 1994; Kelly, 1994); the second citation is it was shown. . . submasses with the single reference (Coughlan et al., 1986).</Paragraph> <Paragraph position="6"> (1) Although the 3-D structure analysis by x-ray crystallography is still in progress (Eger et al., 1994; Kelly, 1994), it was shown by electron microscopy that XO consists of three submasses (Coughlan et al., 1986).</Paragraph> <Paragraph position="7"> A citation index enables efficient retrieval of documents from a large collection--a citation index consists of source items and their corresponding lists of bibliographic descriptions of citing works. The use of citation indexing of scientific articles was invented by Dr. Eugene Garfield in the 1950s as a result of studies on problems of medical information retrieval and indexing of biomedical literature. Dr. Garfield later founded the Institute for Scientific Information (ISI), whose Science Citation Index (Garfield, no date) is now one of the most popular citation indexes. Recently, with the advent of digital libraries, Web-based indexing systems have begun to appear (e.g., ISI's 'Web of Knowledge', CiteSeer (Bollacker et al., 1999)).</Paragraph> <Paragraph position="8"> Authors of scientific papers normally include citations in their papers to indicate works that are connected in an important way to their paper. Thus, a citation connecting the source document and a citing document serves one of many functions. For example, one function is that the citing work gives some form of credit to the work reported in the source article. Another function is to criticize previous work. 
Other functions include identifying foundational works in the field, providing background for the authors' own work, and pointing to works that are representative of complementary or contradictory research.</Paragraph> <Paragraph position="9"> The aim of citation analysis studies has been to categorize and, ultimately, to classify the function of scientific citations automatically. Many citation classification schemes have been developed, with great variance in the number and nature of categories used. Garfield (1965) was the first to define a classification scheme, while Finney (1979) was the first to suggest that a citation classifier could be automated. Other classification schemes include those by Cole (1975), Duncan, Anderson, and McAleese (1981), Frost (1979), Lipetz (1965), Moravcsik and Murugesan (1975), Peritz (1983), Small (1978), Spiegel-Rösing (1977), and Weinstock (1971). Within this representative group of classification schemes, the number of categories ranges from four to 26. Examples of these categories include a contrastive, supportive, or corrective relationship between citing and cited works.</Paragraph> <Paragraph position="10"> But the author's purpose for including a citation is not apparent in the citation per se. Determining the nature of the exact relationship between a citing and cited paper often requires some level of understanding of the text in which the citation is embedded.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.2 Citation indexing in biomedical literature analysis </SectionTitle> <Paragraph position="0"> In the biomedical field, we believe that the usefulness of automated citation classification in literature indexing can be found both in the larger context of managing entire databases of scientific articles and in specific information-extraction problems. 
On the larger scale, database curators need accurate and efficient methods for building new collections by retrieving articles on the same topic from huge general databases. Simple systems (e.g., (Andrade and Valencia, 1998), (Marcotte et al., 2001)) consider only keyword frequencies in measuring article similarity.</Paragraph> <Paragraph position="1"> More-sophisticated systems, such as the Neighbors utility (Wilbur and Coffee, 1994), may be able to locate articles that appear to be related in some way (e.g., finding related Medline abstracts for a set of protein names (Blaschke et al., 1999)), but the lack of specific information about the nature and validity of the relationship between articles may still make the resulting collection a less-than-ideal resource for subsequent analysis. Citation classification to indicate the nature of the relationships between articles in a database would make the task of building collections of related articles both easier and more accurate. And, the existence of additional knowledge about the nature of the linkages between articles would greatly enhance navigation among a space of documents to retrieve meaningful information about the related content.</Paragraph> <Paragraph position="2"> A specific problem in information extraction that may benefit from the use of citation categorization involves mining the literature for protein-protein interactions (e.g., (Blaschke et al., 1999), (Marcotte et al., 2001), (Thomas et al., 2000)). Currently, even the most-sophisticated systems are not yet capable of dealing with all the difficult problems of resolving ambiguities and detecting hidden knowledge. For example, Blaschke et al.'s system (1999) is able to handle fairly complex problems in detecting protein-protein interactions, including constructing the network of protein interactions in cell-cycle control, but important implicit knowledge is not recognized. In the phosphorylates Cdk2. 
However, the system is not able to detect that Cak is actually a complex formed by Cdk7 and CycH, and that the Cak complex regulates Cdk2.</Paragraph> <Paragraph position="3"> While the earlier literature describes inter-relationships among these proteins, the recognition of the generalization in their structure, i.e., that these proteins are part of a complex, is contained only in more-recent articles: &quot;There is an element of generalization implicit in later publications, embodying previous, more dispersed findings. A clear improvement here would be the generation of associated weights for texts according to their level of generality&quot; (Blaschke et al., 1999). Citation categorization could provide just this kind of 'ancestral' relationship between articles--whether an article is foundational in the field or builds directly on closely related work--and, if automated, could be used in forming collections of articles for study that are labelled with explicit semantic and rhetorical links to one another. Such collections of semantically linked articles might then be used as 'thematic' document clusters (cf. Wilbur (2002)) to elicit much more meaningful information from documents known to be closely related.</Paragraph> <Paragraph position="4"> An added benefit of having citation categories available in text corpora used for studies such as extracting protein-protein interactions is that more, and more-meaningful, information may be obtained. In a potential application for our research, Blaschke et al. (1999) noted that they were able to discover many more protein-protein interactions when including in the corpus those articles found to be related by the Neighbors facility (Wilbur and Coffee, 1994) (285 versus only 28 when relevant protein names alone were used in building the corpus). 
Lastly, very difficult problems in scientific and biomedical information extraction that involve aspects of deep-linguistic meaning may be resolved through the availability of citation categorization in curated texts: synonym detection, for example, may be enhanced if different names for the same entity occur in articles that can be recognized as being closely related in the scientific research process.</Paragraph> </Section> </Section> </Paper>