<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3103"> <Title>The Language of Bioscience: Facts, Speculations, and Statements in Between</Title> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Manual annotation experiment </SectionTitle> <Paragraph position="0"> In this experiment, four human annotators manually marked sentences as highly speculative, low speculative, or definite.</Paragraph> <Paragraph position="1"> Some of the questions we hoped to answer with this experiment were: can we characterize what a speculative sentence is (as demonstrated by good inter-annotator agreement), can a distinction between high and low speculation be made, how much speculative speech is there, where are speculative sentences located in the abstract, and is there variation across topics? The annotators were instructed to follow written annotation guidelines, which we provide in the appendix of this paper. We wanted to explore how well the annotators agreed on relatively abstract classifications such as &quot;requires extrapolation from actual findings&quot; and thus we refrained from writing instructions such as &quot;if the sentence contains a form of suggest, then mark it as speculative&quot; into the guidelines.</Paragraph> <Paragraph position="2"> We chose three topics to work on and used the following PubMed queries to gather abstracts: &quot;gene regulation&quot; AND &quot;transcription factor&quot;</Paragraph> <Paragraph position="4"> turmeric OR curcumin OR curcuma The first topic is gene regulation and is about molecular biology research on transcription factors, promoter regions, gene expression, etc. The second topic is Crohn's disease, which is a chronic relapsing intestinal inflammation and has a number of genes (CARD15) or chromosomal loci associated with it.</Paragraph> <Paragraph position="5"> The third topic is turmeric (aka curcumin), a spice widely used in Asia and highly regarded for its curative and analgesic properties. 
These include the treatment of burns, stomach ulcers and ailments, and various skin diseases. There has been a surge of interest in curcumin over the last decade.</Paragraph> <Paragraph position="6"> Each abstract set was prepared for annotation as follows: the order of the abstracts was randomized and the abstracts were broken into sentences using Mxterminator (Reynar and Ratnaparkhi, 1997).</Paragraph> <Paragraph position="7"> The following people performed the annotations: Padmini Srinivasan, who has analyzed crohns and turmeric documents for a separate knowledge discovery research task; Xin Ying Qiu, who is completely new to all three topics; Marc Light, who has some experience with gene regulation texts (e.g., (Light et al., 2003)); and Vladimir Leontiev, who is a research scientist in an anatomy and cell biology department. It certainly would have been preferable to have four experts on the topics do the annotation but this was not possible.</Paragraph> <Paragraph position="8"> The following manual annotations were performed: a. 63 gene regulation abstracts (all sentences) by both Leontiev and Light, b. 47 additional gene regulation abstracts (all sentences) by Light, c. 100 crohns abstracts (last 2 sentences) by both Srinivasan and Qiu, d. 400 additional crohns abstracts (last 2 sentences) by Qiu, e. 100 turmeric abstracts (all sentences) by Srinivasan, f. 400 additional turmeric abstracts (last 2 sentences) by Srinivasan.</Paragraph> <Paragraph position="9"> The 63 double-annotated gene regulation abstracts (set a) contained 547 sentences. The additional abstracts (set b) marked by Light contained 344 sentences, for a total of 891 sentences of gene regulation abstracts. Thus, there is an average of almost 9 sentences per gene regulation abstract. The 100 turmeric abstracts (set e) contained 738 sentences. The other sets contain twice as many sentences as abstracts since only the last two sentences were annotated. 
The annotation of each sentence was performed in the context of its abstract. This was true even when only the last two sentences were annotated. The annotation guidelines in the appendix were used by all annotators. In addition, at the start of the experiment general issues were discussed but none of the specific examples in the sets a-f.</Paragraph> <Paragraph position="10"> We worked with three categories: Low Speculative, High Speculative, and Definite. All sentences were annotated with one of these. The general idea behind the low speculative level was that the authors expressed a statement in such a way that it is clear that it follows almost directly from results but not quite. There is a small leap of faith. A high speculative statement would contain a more dramatic leap from the results mentioned in the abstract.</Paragraph> <Paragraph position="11"> Our inter-annotator agreement results are expressed in the following four tables. The first table contains values for the kappa statistic of agreement (see (Siegel and Castellan, 1988)) for the gene regulation data (set a) and the crohns data (set c). Three values were computed: kappa for three-way agreement (High vs. Low vs. Definite), two-way (Speculative vs. Definite) and two-way (High vs. Low). Due to the lack of any sentences marked High in set c, a kappa value for High vs. Low (HvsL) cannot be computed. Kappa scores between 0.6 and 0.8 are generally considered encouraging but not outstanding. The following two tables are confusion matrices, the first for gene regulation data (set a) and the second for the crohns data (set c).</Paragraph> <Paragraph position="12"> If we consider one of the annotators as defining truth (gold standard), then we can compute precision and recall numbers for the other annotator on finding speculative sentences. 
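As a minimal sketch of how the agreement figures discussed here can be computed, the Python below derives a kappa value for two annotators and, treating one annotator as the gold standard, precision and recall for the speculative class. The label strings, function names, and toy data are assumptions made for illustration. The kappa shown is the two-annotator Cohen's kappa; the formulation in Siegel and Castellan (1988) may differ in detail.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    # Undefined (division by zero) when chance agreement is exactly 1.
    return (observed - expected) / (1 - expected)


def precision_recall(gold, predicted, positive="speculative"):
    """Precision and recall of `predicted` against `gold` for one class."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(g == positive and p == positive for g, p in pairs)
    predicted_pos = sum(p == positive for p in predicted)
    gold_pos = sum(g == positive for g in gold)
    # Assumes both annotators used the positive label at least once.
    return true_pos / predicted_pos, true_pos / gold_pos


# Toy labels (hypothetical, not taken from the annotated sets).
gold = ["speculative", "definite", "definite",
        "speculative", "definite", "definite"]
other = ["speculative", "definite", "speculative",
         "definite", "definite", "definite"]

kappa = cohen_kappa(gold, other)
precision, recall = precision_recall(gold, other)
```

For the two-way Speculative vs. Definite figure, the High and Low labels would first be merged into a single speculative label before calling these functions.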
If we choose Leontiev and Srinivasan as defining truth, then Light and Qiu receive the scores below.</Paragraph> <Paragraph position="13"> As is evident from the confusion matrices, the amount of data that we redundantly annotated is small and thus the kappa numbers are at best to be taken as trends. However, it does seem that the speculative vs. definite distinction can be made with some reliability. In contrast, the high speculation vs. low speculation distinction cannot.</Paragraph> <Paragraph position="14"> The gene regulation annotations marked by Light (sets a and b, using only Light's annotations) can be used to answer questions about the position of speculative fragments in abstracts. Consider the histogram-like table below. The first row refers to speculative sentences and the second to definite. The columns refer to the last sentence of an abstract, the penultimate, elsewhere, and a row sum. The number in brackets is the raw count. Remember that the number of abstracts in sets a and b together is 100. last 2nd last earlier total</Paragraph> <Paragraph position="16"> It is clear that almost all of the speculations come towards the end of the abstract. In fact, the final sentence contains a speculation more often than not.</Paragraph> <Paragraph position="17"> In addition, in the data where all sentences in an abstract were annotated (sets a, b, and e, using Light's annotation of a), there were 1456 definite sentences (89%) and 173 speculative sentences (11%). Finally, if we consider the last two sentences of all the data (sets a-f), we have 1712 definite sentences (82%) and 381 speculative sentences (18%).</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Automatic classifier experiment </SectionTitle> <Paragraph position="0"> We decided to explore the ability of an SVM-based text classifier to select speculative sentences from the abstracts. 
For this, the abstracts were first processed using the SMART retrieval system (Salton, 1971) in order to obtain term-based representation vectors. Alternative representations were tried involving stemming and term weighting (no weights versus TF*IDF weights). Since the results obtained were similar, we present only results using stemming and no weights.</Paragraph> <Paragraph position="1"> The classifier experiments followed a 10-fold cross-validation design. We used the SVMlight package with all settings at default values. We ran experiments in two modes. First, we considered only the last 2 sentences. For this we pooled all hand-tagged sentences from the three topic areas (sets a-f). Second, we explored classification on all sentences in the document (sets a, b, e).</Paragraph> <Paragraph position="2"> If we assume a default strategy as a simple baseline, where the majority decision is always made, then we get an accuracy of 82% for the classification problem on the last two sentences data set and 89% for the all sentences data set. Another baseline option is to use a set of strings and look for them as substrings in the sentences. The following 14 strings were identified by Light while annotating the gene regulation abstracts (sets a and b): suggest, potential, likely, may, at least, in part, possibl, further investigation, unlikely, putative, insights, point toward, promise, propose. The automated system then looks for these substrings in a sentence; if any is found, the sentence is marked as speculative, and otherwise as definite.</Paragraph> <Paragraph position="3"> In the table below the scores for the three methods of annotation are listed as rows. We give accuracy on the categorization task and precision and recall numbers for finding speculative sentences. The format is precision/recall(accuracy), all as percentages. The Majority method, annotating every sentence as definite, does not receive precision and recall values. 
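The substring baseline described above can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the function name, the lowercasing step, and the example sentences are assumptions, while the cue strings are those listed in the text.

```python
# Cue strings listed in the text; "possibl" is a truncated form that
# matches both "possible" and "possibly".
CUES = (
    "suggest", "potential", "likely", "may", "at least", "in part",
    "possibl", "further investigation", "unlikely", "putative",
    "insights", "point toward", "promise", "propose",
)


def classify(sentence):
    """Mark a sentence speculative if any cue occurs as a substring."""
    text = sentence.lower()  # assumption: matching ignores case
    if any(cue in text for cue in CUES):
        return "speculative"
    return "definite"


# Hypothetical example sentences, not drawn from the annotated sets.
examples = [
    "These results suggest a role for the factor in repression.",
    "The protein binds the promoter region.",
]
labels = [classify(s) for s in examples]
```

Because the match is on raw substrings, a cue such as "may" also fires inside longer words like "dismay", which is one way the lexical marking can be ambiguous.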
The substring method was run on a subset of the datasets where the gene regulation data (sets a and b) was removed. (It performs extremely well on the gene regulation data due to the fact that it was developed on that data.) Again, the results are preliminary since the amount of data is small and the feature set we explored was limited to words. However, it should be noted that both the substring and the SVM systems perform well, suggesting that speculation in abstracts is lexically marked but in a somewhat ambiguous fashion. This conclusion is also supported by the fact that neither system used positional features and yet the precision and recall on the all sentences data set are similar to those on the last two sentences data set.</Paragraph> </Section> </Paper>