<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1212"> <Title>Classification from Full Text: A Comparison of Canonical Sections of Scientific Papers</Title> <Section position="3" start_page="0" end_page="66" type="intro"> <SectionTitle> 2 Methods </SectionTitle> <Paragraph position="0"> The current study was concerned with two issues which sections of full text journal articles are most informative with regards to gene product and which Natural Language Processing techniques are most useful in associating those products with particular articles. The scores from the Raychaudhuri et al. study are used as a baseline (see Table 1).</Paragraph> <Section position="1" start_page="66" end_page="66" type="sub_section"> <SectionTitle> 2.1 Data </SectionTitle> <Paragraph position="0"> PubMed gives access to information in MEDLINE - the title and abstract of articles along with manual annotations such as MeSH Headings and Registry Numbers. PubMed Central1, on the other hand, gives access to the full text (in HTML) of (currently 98) journals that are indexed in MEDLINE. Also, many other publishers are now making their journal articles available online for free on their own sites. PubMed Central will also list articles from these publishers.</Paragraph> <Paragraph position="1"> BioMed Central (BMC) is another resource for full text articles. BMC, like PubMed Central, contains full text from many journals as well as having many of its own online journals. Authors can submit articles to these BMC journals and have them reviewed and published in the same month2.</Paragraph> <Paragraph position="2"> The same queries that were used in the Raychaudhuri et al. study were used to query PubMed Central in order to find full text articles relating to the 21 biological process GO codes. The DP field was omitted, in order to access as many full text articles as possible. For some of the 21 GO codes, there were not enough free full text articles available to be deemed representative and so only those codes that had 50+ full text articles associated with them were used in the rest of the study. These can be seen in Table 1.</Paragraph> <Paragraph position="3"> Most journals have a format to which authors must adhere in order for an article to be considered for publication, including rules concerning the naming of sections.</Paragraph> <Paragraph position="4"> With respect to the structure of scientific papers (or, more specifically, papers in biology), many people talk about them having a canonical structure consisting of a Title, Abstract, Introduction, Materials and Methods, Results, and Discussion in either this order or with Materials and Methods at the end. For the experiments reported here, those articles were extracted from the full text of journals that adhere closely to this canonical structure and other sections were ignored. Sections named simply Methods were included with the Materials and</Paragraph> </Section> <Section position="2" start_page="66" end_page="66" type="sub_section"> <SectionTitle> 2.2 Tools </SectionTitle> <Paragraph position="0"> Because the current study concerns whether NLP techniques can help to improve performance of classification, we have postponed experimenting with different machine learning techniques. We will do so after we find which NLP techniques are the most useful. The Rainbow3 Naive Bayes classification tool was used.</Paragraph> <Paragraph position="1"> Raychaudhuri et al. induced a single N-ary classifier, whereas this study induced 21 binary classifiers, i.e. 
<Section position="2" start_page="66" end_page="66" type="sub_section">
<SectionTitle> 2.2 Tools </SectionTitle>
<Paragraph position="0"> Because the current study concerns whether NLP techniques can help to improve classification performance, we have postponed experimenting with different machine learning techniques; we will do so once we have found which NLP techniques are the most useful. The Rainbow Naive Bayes classification tool was used.</Paragraph>
<Paragraph position="1"> Raychaudhuri et al. induced a single N-ary classifier, whereas this study induced 21 binary classifiers, i.e. an article was classified as either related to a particular biological process or unrelated to it. We applied both part-of-speech tagging and stemming. The LT-TTT tagger (Grover et al., 2000) was used to tag each word with its part of speech. This allowed us to experiment with building classifiers based only on single parts of speech as well as ones based on all words.</Paragraph>
<Paragraph position="2"> The most widely used stemmer in the NLP community is the Porter stemmer (Porter, 1980). A Perl implementation of it was used to produce stemmed versions of the articles.</Paragraph>
<Paragraph position="3"> We experimented with four strategies to find the best classification performance: bag of words, bag of nouns, bag of stems, and bag of stemmed nouns. There were too few full-text articles to both train and test on, so the classifiers were trained on the titles and abstracts of the original Raychaudhuri et al. articles and then tested on the full text and sections thereof. The negative training instances for each category were the articles related to the other categories (approximately 2,000). Four sets of classifiers were trained: one set each for the bags of words, nouns, stems, and stemmed nouns.</Paragraph>
</Section>
</Section>
</Paper>
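The study used the LT-TTT tagger and a Perl Porter stemmer; those tools are not reproduced here. As a stand-in sketch only, the same four representations (words, nouns, stems, stemmed nouns) can be built with NLTK's tagger and Porter stemmer.

    # Stand-in sketch (NLTK in place of LT-TTT and the Perl Porter stemmer): build the
    # four representations used in the experiments from one article's text.
    # Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
    import nltk
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def four_bags(text):
        tokens = nltk.word_tokenize(text)
        tagged = nltk.pos_tag(tokens)                     # Penn Treebank tags
        nouns = [w for w, tag in tagged if tag.startswith("NN")]
        return {
            "words": tokens,
            "nouns": nouns,
            "stems": [stemmer.stem(w) for w in tokens],
            "stemmed_nouns": [stemmer.stem(w) for w in nouns],
        }

    bags = four_bags("The kinase regulates progression through the cell cycle.")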
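Finally, the training and testing regime described above can be sketched as follows. Rainbow was the tool actually used; scikit-learn's multinomial Naive Bayes is substituted here purely for illustration, and the data-loading names are hypothetical.

    # Illustrative sketch (scikit-learn substituted for Rainbow): one binary Naive Bayes
    # classifier per GO code, trained on title+abstract text and applied to full-text sections.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    def train_binary_classifier(positive_abstracts, negative_abstracts):
        """Train a bag-of-words Naive Bayes classifier for one GO code."""
        texts = positive_abstracts + negative_abstracts
        labels = [1] * len(positive_abstracts) + [0] * len(negative_abstracts)
        vectorizer = CountVectorizer()
        features = vectorizer.fit_transform(texts)
        return vectorizer, MultinomialNB().fit(features, labels)

    def classify_sections(vectorizer, model, section_texts):
        """Apply a trained classifier to the text of one canonical section per article."""
        return model.predict(vectorizer.transform(section_texts))

    # Negatives for each code are the (approximately 2,000) abstracts of articles
    # belonging to the other codes, as described in the paragraph above.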