<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2018"> <Title>MMR-based Active Machine Learning for Bio Named Entity Recognition</Title> <Section position="2" start_page="3" end_page="69" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Named-entity recognition is one of the most elementary and central problems in biomedical text mining. To achieve good recognition performance, we use a supervised machine-learning approach, which is standard for the named-entity recognition task. The obstacle for supervised machine-learning methods is the lack of annotated training data, which is essential for achieving good performance.</Paragraph> <Paragraph position="1"> Building a training corpus manually is time-consuming, labor-intensive, and expensive. Creating training corpora for the biomedical domain is particularly expensive because it requires domain-specific expert knowledge.</Paragraph> <Paragraph position="2"> One way to address this problem is to use an active learning method that selects the most informative samples for training. Active selection of training examples can significantly reduce the number of labeled training examples needed without degrading performance.</Paragraph> <Paragraph position="3"> Existing work on active learning explores two approaches: certainty- or uncertainty-based methods (Lewis and Gale 1994; Scheffer and Wrobel 2001; Thompson et al. 1999) and committee-based methods (Cohn et al. 1994; Dagan and Engelson 1995; Freund et al. 1997; Liere and Tadepalli 1997). Uncertainty-based systems begin with an initial classifier, which assigns an uncertainty score to each un-annotated example. The k examples with the highest scores are annotated by human experts, and the classifier is retrained. In committee-based systems, a diverse committee of classifiers is generated. Each committee member examines the un-annotated examples. 
The degree of disagreement among the committee members is evaluated, and the examples with the highest disagreement are selected for manual annotation.</Paragraph> <Paragraph position="4"> Our work differs from previous active learning approaches and addresses two aspects: uncertainty and diversity. First, we propose an entropy-based measure to quantify the uncertainty of the current classifier, and the most uncertain samples are selected for human annotation. Second, we assume that the selected training samples should expose the classification system to different aspects of the learning features, so we try to pick the most representative sentences in each sampling round. The divergence measure between two sentences reflects the novelty of their features and how representative they are, and is described by the minimum similarity among the examples. The two measures, for uncertainty and diversity, are combined using the MMR (Maximal Marginal Relevance) method (Carbonell and Goldstein 1998) to give the sampling scores in our active learning strategy.</Paragraph> <Paragraph position="5"> We incorporate the MMR-based active machine-learning idea into the POSBIOTM/NER system (Song et al. 2005), a trainable biomedical named-entity recognition system that uses the Conditional Random Fields (Lafferty et al. 2001) machine-learning technique to automatically identify different sets of biological entities in text.</Paragraph> </Section> </Paper>
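The selection strategy outlined in the introduction, an entropy-based uncertainty score balanced against diversity in the MMR style, can be illustrated with a minimal Python sketch. This is not the POSBIOTM/NER implementation. The introduction gives no formulas, so everything concrete here is an assumption made for illustration: the function names (`normalized_entropy`, `cosine`, `mmr_select`), the weight `lam`, the use of cosine similarity over feature vectors, and the greedy rule that penalizes similarity to samples already picked in the same round.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a label distribution, scaled to [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mmr_select(candidates, k, lam=0.5):
    """Greedily choose up to k sample ids for annotation.

    candidates maps a sample id to (label_probs, feature_vector).
    Each step scores a sample as
        lam * uncertainty - (1 - lam) * max similarity to picked samples
    (an assumed MMR-style combination, not the paper's exact formula).
    """
    selected = []
    remaining = set(candidates)

    def score(sample_id):
        probs, vec = candidates[sample_id]
        uncertainty = normalized_entropy(probs)
        redundancy = max(
            (cosine(vec, candidates[s][1]) for s in selected),
            default=0.0,
        )
        return lam * uncertainty - (1.0 - lam) * redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic across runs
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Toy pool: "a" and "b" are near-duplicates; "c" is less uncertain but novel.
    pool = {
        "a": ([0.50, 0.50], [1.0, 0.0]),
        "b": ([0.55, 0.45], [1.0, 0.0]),
        "c": ([0.70, 0.30], [0.0, 1.0]),
    }
    print(mmr_select(pool, k=2, lam=0.5))
    print(mmr_select(pool, k=2, lam=1.0))
```

With `lam=1.0` the sketch reduces to plain uncertainty sampling. Lower values increasingly favor samples that differ from those already chosen, which is the trade-off the uncertainty and diversity measures are meant to capture.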