<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3113"> <Title>A Design Methodology for a Biomedical Literature Indexing Tool Using the Rhetoric of Science</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Our Guiding Principles </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Scientific writing and the rhetoric of science </SectionTitle> <Paragraph position="0"> The automated labelling of citations with a specific citation function requires an analysis of the linguistic features in the text surrounding the citation, coupled with a knowledge of the author's pragmatic intent in placing the citation at that point in the text. The author's purpose for including citations in a research article reflects the fact that researchers wish to communicate their results to their scientific community in such a way that their results, or knowledge claims, become accepted as part of the body of scientific knowledge. This persuasive nature of the scientific research article, how it contributes to making and justifying a knowledge claim, is recognized as the defining property of scientific writing by rhetoricians of science, e.g., (Gross, 1996), (Gross et al., 2002), (Hyland, 1998), (Myers, 1991). Style (lexical and syntactic choice), presentation (organization of the text and display of the data), and argumentation structure are noted as the rhetorical means by which authors build a convincing case for their results.</Paragraph> <Paragraph position="1"> Our approach to automated citation classification is based on the detection of fine-grained linguistic cues in scientific articles that help to communicate these rhetorical stances and thereby map to the pragmatic purpose of citations. As part of our overall research methodology, our goal is to map the various types of pragmatic cues in scientific articles to rhetorical meaning. 
Our previous work has described the importance of discourse cues in enhancing inter-article cohesion signalled by citation usage (Mercer and Di Marco, 2003), (Di Marco and Mercer, 2003). We have also been investigating another class of pragmatic cues, hedging cues, (Mercer, Di Marco, and Kroon, 2004), that are deeply involved in creating the pragmatic effects that contribute to the author's knowledge claim by linking together a mutually supportive network of researchers within a scientific community.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Results of our previous studies </SectionTitle> <Paragraph position="0"> In our preliminary study (Mercer and Di Marco, 2003), we analyzed the frequency of the cue phrases from (Marcu, 1997) in a set of scholarly scientific articles. We reported strong evidence that these cue phrases are used in the citation sentences and the surrounding text with the same frequency as in the article as a whole. In subsequent work (Di Marco and Mercer, 2003), we analyzed the same dataset of articles to begin to catalogue the fine-grained discourse cues that exist in citation contexts. This study confirmed that authors do indeed have a rich set of linguistic and non-linguistic methods to establish discourse cues in citation contexts.</Paragraph> <Paragraph position="1"> Another type of linguistic cue that we are studying is related to hedging effects in scientific writing that are used by an author to modify the affect of a 'knowledge claim'. 
Hedging in scientific writing has been extensively studied by Hyland (1998), including cataloging the pragmatic functions of the various types of hedging cues.</Paragraph> <Paragraph position="2"> As Hyland (1998) explains, &quot;[Hedging] has subsequently been applied to the linguistic devices used to qualify a speaker's confidence in the truth of a proposition, the kind of caveats like I think, perhaps, might, and maybe which we routinely add to our statements to avoid commitment to categorical assertions. Hedges therefore express tentativeness and possibility in communication, and their appropriate use in scientific discourse is critical (p. 1)&quot;. The following examples illustrate some of the ways in which hedging may be used to deliberately convey an attitude of uncertainty or qualification. In the first example, the use of the verb suggested hints at the author's hesitancy to declare the absolute certainty of the claim: (2) The functional significance of this modulation is suggested by the reported inhibition of MeSo-induced differentiation in mouse erythroleukemia cells constitutively expressing c-myb.</Paragraph> <Paragraph position="3"> In the second example, the syntactic structure of the sentence, a fronted adverbial clause, emphasizes the effect of qualification through the rhetorical cue Although. The subsequent phrase, a certain degree, is a lexical modifier that also serves to limit the scope of the result: (3) Although many neuroblastoma cell lines show a certain degree of heterogeneity in terms of neurotransmitter expression and differentiative potential, each cell has a prevalent behavior in response to differentiation inducers.</Paragraph> <Paragraph position="4"> In Mercer (2004), we showed that the hedging cues proposed by Hyland occur more frequently in citation contexts than in the text as a whole. 
With this information we conjecture that hedging cues are an important aspect of the rhetorical relations found in citation contexts and that the pragmatics of hedges may help in determining the purpose of citations.</Paragraph> <Paragraph position="5"> We investigated this hypothesis by doing a frequency analysis of hedging cues in citation contexts in a corpus of 985 biology articles. We obtained statistically significant results (summarized in Table 1) indicating that hedging is used more frequently in citation contexts than in the text as a whole. Given the presumption that writers make stylistic and rhetorical choices purposefully, we propose that we have further evidence that connections between fine-grained linguistic cues and rhetorical relations exist in citation contexts.</Paragraph> <Paragraph position="6"> Table 1 shows the proportions of the various types of sentences that contain hedging cues, broken down by hedging-cue category (verb or nonverb cues), according to the different sections in the articles (background, methods, results and discussion, conclusions). For all but one combination, citation sentences are more likely to contain hedging cues than would be expected from the overall frequency of hedge sentences (a statistically significant difference). Citation 'window' sentences (i.e., sentences in the text close to a citation) generally are also significantly more likely to contain hedging cues than expected, though for certain combinations (methods, verbs and nonverbs; res+disc, verbs) the difference was not significant.</Paragraph> <Paragraph position="7"> Tables 2, 3, and 4 summarize the occurrence of hedging cues in citation 'contexts' (a citation sentence and the surrounding citation window). 
Table 5 shows the proportion of hedge sentences that either contain a citation, or fall within a citation window; Table 5 suggests (last 3-column column) that the proportion of hedge sentences containing citations or being part of citation windows is at least as great as what would be expected just by the distribution of citation sentences and citation windows.</Paragraph> <Paragraph position="8"> Table 1 indicates, with statistical significance, that in most cases the proportion of hedge sentences in the citation contexts is greater than what would be expected by the distribution of hedge sentences. Taken together, these conditional probabilities support the conjecture that hedging cues and citation contexts correlate strongly. Hyland (1998) has catalogued a variety of pragmatic uses of hedging cues, so it is reasonable to speculate that these uses can be mapped to the rhetorical meaning of the text surrounding a citation, and from thence to the function of the citation.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Our Design Methodology </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 The Tool </SectionTitle> <Paragraph position="0"> The indexing tool that we are designing enhances a standard citation index by labelling each citation with the function of that citation. That is, given an agreed-upon set of citation functions, our tool will categorize a citation automatically into one of these functional categories.</Paragraph> <Paragraph position="1"> To accomplish this automatic categorization we are using a decision tree: given a set of features, which combinations of features map to which citation function. 
Our current focus is the biomedical literature, but we are certain that our tool can be used for the experimental sciences.</Paragraph> <Paragraph position="2"> We are not certain whether the tool can be generalized beyond this corpus (Frost, 1979).</Paragraph> <Paragraph position="3"> In the following we describe in more detail the three aspects of our design methodology: the research program, the tool implementation, and its evaluation. Our basic assumption is that citations form links to other articles for much the same purpose and in much the same way as links to other parts of the same article. These intra-textual and inter-textual linkages are made to create a coherent presentation to convince the reader that the content of the article is of value. The presentation is made cohesive by use of linguistic and stylistic devices that have been catalogued by rhetoricians and which we believe may be detected by automated means.</Paragraph> <Paragraph position="4"> The research program will • develop a catalogue of linguistic and non-linguistic cues that capture both the linguistic and stylistic techniques as well as the extensive body of knowledge that has accumulated about the rhetoric of science and how science is written about; [...]tions represent as features in a decision tree that produces the intended function of the citation.</Paragraph> <Paragraph position="5"> Our purpose in using a decision tree is three-fold.</Paragraph> <Paragraph position="6"> Firstly, the decision tree gives us ready access to the citation-function decision rules. Secondly, we aim to have a working indexing tool whenever we add more knowledge to the categorization process. This goal appears very feasible given our design choice to use a decision tree: adding more knowledge only refines the decision-making procedure of the previous version. 
And thirdly, as we gain more experience (currently, we are building the decision tree by hand), we intend to use machine learning techniques to enhance our tool by inducing a decision tree.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 The Research Program </SectionTitle> <Paragraph position="0"> Our basic assumption is that the rhetorical relations that will allow the tool to categorize the citations in a biomedical article are evident to the reader through three kinds of cues: surface linguistic cues, cues which are linguistically-based but require some knowledge that is not directly derivable from the text, and cues which are known to the culture of scientific readers-writers because of the practice of science and how this practice influences communication through the writing.</Paragraph> <Paragraph position="1"> We rely on the notion that rhetorical information is realized in linguistic 'cues' in the text, some of which, although not all, are evident in surface features (cf. Hyland (1998) on surface hedging cues in scientific writing).</Paragraph> <Paragraph position="2"> Since we anticipate that many such cues will map to the same rhetorical features that give evidence of the text's argumentative and pragmatic meaning, and that the interaction of these cues will likely influence the text's overall rhetorical effect, the formal rhetorical relation (cf. (Mann and Thompson, 1988)) appears to be the appropriate feature for the basis of the decision tree. So, our long-term goal is to map between the textual cues and rhetorical relations. Having noted that many of the cue words in the prototype are discourse cues, and with two recent important works linking discourse cues and rhetorical relations (Knott, 1996; Marcu, 1997), we began our investigation of this mapping with discourse cues. 
We have some early results that show that discourse cues are used extensively with citations and that some cues appear much more frequently in the citation context than in the full text (Mercer and Di Marco, 2003). Another textual device is the hedging cue, which we are currently investigating (Mercer, Di Marco, and Kroon, 2004).</Paragraph> <Paragraph position="3"> Although our current efforts focus on cue words which are connected to organizational effects (discourse cues), and writer intent (hedging cues), we are also interested in other types of cues that are associated more closely to the purpose and method of science. For example, the scientific method is, more or less, to establish a link to previous work, set up an experiment to test an hypothesis, perform the experiment, make observations, then finally compile and discuss the importance of the results of the experiment. Scientific writing reflects this scientific method and its purpose: one may find evidence even at the coarsest granularity of the IMRaD structure in scientific articles. At a finer granularity, we have many targeted words to convey the notions of procedure, observation, reporting, supporting, explaining, refining, contradicting, etc. More specifically, science categorizes into taxonomies or creates polarities. Scientific writing then tends to compare and contrast or refine. Not surprisingly, the morphology of scientific terminology exhibits comparison and contrasting features, for example, exo- and endo-. Science needs to measure, so scientific writing contains measurement cues by referring to scales (0-100), or using comparatives (larger, brighter, etc.). Experiments are described as a sequence of steps, so this is an implicit method cue.</Paragraph> <Paragraph position="4"> Since the inception of the formal scientific article in the seventeenth century, the process of scientific discovery has been inextricably linked with the actions of writing and publishing the results of research. 
Rhetoricians of science have gradually moved from a purely descriptive characterization of the science genre to full-fledged field studies detailing the evolution of the scientific article. During the first generation of rhetoricians of science, e.g., (Myers, 1991), (Gross, 1996), (Fahnestock, 1999), the persuasive nature of the scientific article, how it contributes to making and justifying a knowledge claim, was recognized as the defining property of scientific writing.</Paragraph> <Paragraph position="5"> Style (lexical and syntactic choice), presentation (organization of the text and display of the data), and argumentation structure were noted as the rhetorical means by which authors build a convincing case for their results.</Paragraph> <Paragraph position="6"> Recently, second-generation rhetoricians of science (e.g., (Hyland, 1998), (Gross et al., 2002)) have begun to methodically analyze large corpora of scientific texts with the purpose of cataloguing specific stylistic and rhetorical features that are used to create the pragmatic effects that contribute to the author's knowledge claim. One particular type of pragmatic effect, hedging, is especially common in scientific writing and can be realized through a wide variety of linguistic choices.</Paragraph> <Paragraph position="7"> To catalogue these cues and to propose a mapping from these cues to rhetorical relations, we suggest a research program that consists of two phases. One phase is theory-based: we are applying our knowledge from computational linguistics and the rhetoric of science to develop a set of principles that guide the development of rules. Another phase is data-driven. 
This phase will use machine-learning techniques to induce a decision tree.</Paragraph> <Paragraph position="8"> Our two approaches are guided by a number of factors.</Paragraph> <Paragraph position="9"> Firstly, the initial set of 35 categories ((Garzone, 1996), (Garzone and Mercer, 2000)) was developed by combining and adding to the previous work from the information science community with a preliminary manual study of citations in biochemistry and physics articles. Secondly, our next stages, cataloguing linguistic cues, will require manual work by rhetoricians. Thirdly, and perhaps most importantly, one group of cues is not found in the text, but is rather a set of cultural guidelines that are accepted by the scientific community for which the article is being written. Lastly, we are interested not in the connection between the citation functions and these cues per se, but rather the citation functions and the rhetorical relations that are signalled by the cues.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 The Tool Implementation </SectionTitle> <Paragraph position="0"> Concerning the features on which the decision tree makes its decisions, we have started with a simple, yet fully automatic prototype (Garzone, 1996) which takes journal articles as input and classifies every citation found therein. Its decision tree is very shallow, using only sets of cue-words and polarity-switching words (not, however, etc.), some simple knowledge about the IMRaD structure of the article together with some simple syntactic structure of the citation-containing sentence. The prototype uses 35 citation categories. 
In addition to having a design which allows for easy incorporation of more-sophisticated knowledge, it also gives flexibility to the tool: categories can be easily coalesced to give users a tool that can be tailored to a variety of uses.</Paragraph> <Paragraph position="1"> Although we anticipate some small changes to the number of categories due to category refinement, the major modifications to the decision tree will be driven by a more-sophisticated set of features associated with each citation. When investigating a finer granularity of the IMRaD structure, we came to realize that the structure of scientific writing at all levels of granularity was founded on rhetoric, which involves both argumentation structure as well as stylistic choices of words and syntax. This was the motivation for choosing the rhetoric of science as our guiding principle.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Evaluation of the Tool </SectionTitle> <Paragraph position="0"> Finally, as for our prototype system, at each stage of development the tool will be evaluated: • A test set of citations will be developed and will be initially manually categorized by humans knowledgeable in the scientific field that the articles represent.</Paragraph> <Paragraph position="1"> • Of most essential interest, the classification accuracy of the citation-indexing tool will be evaluated: we propose to use a combination of statistical testing and validation by human experts.</Paragraph> <Paragraph position="2"> • In addition, we would like to assess the tool's utility in real-world applications such as database curation for studies in biomedical literature analysis. We have suggested earlier that there may be many uses of this tool, so a significant aspect of the value of our tool will be its ability to enhance other research projects.</Paragraph> </Section> </Section> </Paper>