<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1301">
<Title>Weakly Supervised Learning Methods for Improving the Quality of Gene Name Normalization Data</Title>
<Section position="3" start_page="1" end_page="2" type="intro">
<SectionTitle> 2 Background and Related Work 2.1 Gene Name Normalization and Extraction </SectionTitle>
<Paragraph position="0"> The task of normalizing and identifying biological entities, genes in particular, has received considerable attention in the biological text mining community. The recent Task 1B from BioCreAtIvE [5] challenged systems to identify the unique gene identifiers associated with paper abstracts from the literature for three organisms: mouse, fly, and yeast. Task 1A from the same workshop focused on identifying (i.e., tagging) mentions of genes in biomedical journal abstracts.</Paragraph>
<Section position="1" start_page="1" end_page="1" type="sub_section">
<SectionTitle> 2.2 NLP with Noisy and Unlabeled Training Data </SectionTitle>
<Paragraph position="0"> Within biomedical text processing, a number of approaches to both identification and normalization of entities have attempted to make use of the many available structured biological resources to &quot;bootstrap&quot; systems by deriving noisy training data for the task at hand. A novel method for using noisy (or &quot;weakly labeled&quot;) training data from biological databases to learn to identify relations in biomedical texts is presented in [6]. Noisy training data was created in [7] to identify gene name mentions in text. Similarly, [8] employed essentially the same approach, using the FlyBase database to identify normalized genes within articles.</Paragraph>
</Section>
<Section position="2" start_page="1" end_page="2" type="sub_section">
<SectionTitle> 2.3 Weakly Supervised Learning </SectionTitle>
<Paragraph position="0"> Weakly supervised learning remains an active area of research in machine learning. Such methods are very appealing: they offer a way for a learning system given only a small amount of labeled training data and a large amount of unlabeled data to perform better than it would using the labeled data alone.</Paragraph>
<Paragraph position="1"> In certain situations (see [2]) the improvement can be substantial.</Paragraph>
<Paragraph position="2"> Situations with small amounts of labeled data and large amounts of unlabeled data are very common in real-world applications, where labeling large quantities of data is prohibitively expensive. Weakly supervised learning approaches can be broken down into multi-view and single-view methods.</Paragraph>
<Paragraph position="3"> Multi-view methods [2] incrementally label unlabeled data as follows. Two classifiers are trained on the labeled training data with different &quot;views&quot; of the data. The different views are realized by splitting the set of features in such a way that the features for one classifier are conditionally independent of the features for the other given the class label. Each classifier then selects the most confidently classified instances from the unlabeled data (or some random subset thereof) and adds them to the training set. The process is repeated until all data has been labeled or some other stopping criterion is met.
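To make the co-training loop just described concrete, the following is a minimal sketch, not the exact method of [2]: it assumes two pre-split feature views (X_view1, X_view2), scikit-learn-style classifiers, and a simple confidence threshold as the stopping criterion; all names are illustrative.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def cotrain(X_view1, X_view2, y, labeled_idx, unlabeled_idx,
            n_per_round=10, max_rounds=20, min_confidence=0.9):
    # Co-training over two feature views of the same instances (sketch).
    # Entries of y at unlabeled_idx are placeholders and get overwritten.
    labeled = list(labeled_idx)
    unlabeled = list(unlabeled_idx)
    clf1, clf2 = GaussianNB(), GaussianNB()

    for _ in range(max_rounds):
        if not unlabeled:
            break
        clf1.fit(X_view1[labeled], y[labeled])
        clf2.fit(X_view2[labeled], y[labeled])

        newly_labeled = set()
        for clf, X in ((clf1, X_view1), (clf2, X_view2)):
            proba = clf.predict_proba(X[unlabeled])
            conf = proba.max(axis=1)
            # each classifier nominates its most confidently classified
            # unlabeled instances for addition to the shared training set
            for j in np.argsort(-conf)[:n_per_round]:
                if conf[j] >= min_confidence:
                    idx = unlabeled[j]
                    y[idx] = clf.classes_[np.argmax(proba[j])]
                    newly_labeled.add(idx)
        if not newly_labeled:
            break  # stopping criterion: no confident predictions left
        labeled.extend(newly_labeled)
        unlabeled = [i for i in unlabeled if i not in newly_labeled]
    return clf1, clf2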
The intuition behind the approach is that, since the two classifiers have different views of the data, a new training instance that was classified with high confidence by one classifier (and is thus &quot;redundant&quot; from that classifier's point of view) will serve as a novel, informative training instance for the other classifier, and vice versa.
Single-view methods avoid the need to find an appropriate feature split, which is not possible or appropriate in many domains. One common approach here [4] involves learning an ensemble of classifiers using bagging. With bagging, the training data is repeatedly sampled at random with replacement, and a separate classifier is trained on each sample.</Paragraph>
<Paragraph position="4"> Unlabeled instances are then labeled only if all of the separate classifiers agree on the label for that instance. Other approaches are based on the expectation maximization (EM) algorithm [9].</Paragraph>
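As a concrete illustration of the bagging-based single-view approach, here is a minimal sketch, not the implementation of [4]: it assumes scikit-learn-style classifiers and hypothetical arrays X_labeled, y_labeled, and X_unlabeled, and keeps only the unlabeled instances on which every bagged classifier agrees.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_self_label(X_labeled, y_labeled, X_unlabeled, n_models=5, seed=0):
    # Train an ensemble on bootstrap samples of the labeled data, then
    # accept an unlabeled instance only if every classifier agrees on it.
    rng = np.random.RandomState(seed)
    n = len(X_labeled)
    models = []
    for _ in range(n_models):
        sample = rng.randint(0, n, size=n)  # bootstrap: sample with replacement
        clf = DecisionTreeClassifier(random_state=seed)
        clf.fit(X_labeled[sample], y_labeled[sample])
        models.append(clf)

    preds = np.array([m.predict(X_unlabeled) for m in models])
    agree = np.all(preds == preds[0], axis=0)  # unanimous predictions only
    return X_unlabeled[agree], preds[0][agree]

The instances and labels returned here would then be appended to the labeled set and the procedure repeated, mirroring the incremental labeling loop described above.
</Section>
</Section>
</Paper>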