<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3104"> <Title>A Study of Text Categorization for Model Organism Databases</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Background and Related Work </SectionTitle> <Paragraph position="0"> Keyword-based retrieval uses a list of keywords to retrieve the articles that contain them; the main task is therefore to derive a list of keywords for each category. Retrieving relevant articles with supervised machine learning techniques instead requires a collection of category-labeled documents, from which classifiers are learned. Constructing a classifier from category-labeled documents involves two components: the first transforms each document into a feature representation, and the second applies a supervised learning algorithm to learn the classification knowledge that forms a classifier.</Paragraph> <Paragraph position="1"> Machine learning for text categorization requires transforming each document into a feature representation (usually a feature vector), where features are typically words or word stems in the document. In our study, in addition to words or word stems in free text, we also explored other features that could be extracted from the material used in the study.</Paragraph> <Paragraph position="2"> Several supervised learning algorithms have been adapted for text categorization: Naive Bayes learning (Yang and Liu, 1999), neural networks (Wiener, 1995), instance-based learning (Iwayama and Tokunaga, 1995), and Support Vector Machines (Joachims, 1998). Yang and Liu (1999) provided an overview and a comparative study of different learning algorithms. 
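The feature-representation step described above can be sketched as a simple bag-of-words transform. The vocabulary, tokenization, and example document below are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal bag-of-words featurization: map a document string to a
# term-frequency feature vector over a fixed vocabulary.
# Vocabulary and tokenizer here are illustrative assumptions.
from collections import Counter

def bag_of_words(doc, vocabulary):
    """Return term-frequency counts for each vocabulary term."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["gene", "protein", "expression", "mutant"]
vec = bag_of_words("Gene expression of the mutant gene", vocab)
# vec is [2, 0, 1, 1]
```

In practice, word stems or the additional features mentioned above would replace the naive whitespace tokenizer.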
In previous studies of applying supervised machine learning to word sense disambiguation, we investigated and implemented several supervised learning algorithms, including Naive Bayes learning, Decision List learning, and Support Vector Machines. The word sense disambiguation task and the text categorization task differ little: a word sense disambiguation task can be formulated as a text categorization task by treating the senses of a word as categories (Sebastiani, 2002), and conversely a text categorization task can be formulated as disambiguating a hidden word (e.g., TC) in the text whose senses are the categories. Note that in word sense disambiguation one occurrence of a word usually holds a unique sense, whereas in text categorization one document can sometimes belong to multiple categories. After verifying that fewer than 1% of documents held multiple categories (shown in detail in the following section), for simplicity we applied our implementations of the supervised machine learning algorithms (used for word sense disambiguation) directly to text categorization by treating it as the disambiguation of a hidden word (TC) in context. The following summarizes the algorithms used in the study; for detailed implementations of these algorithms, readers can refer to (Liu, 2004).</Paragraph> <Paragraph position="3"> Naive Bayes learning (NBL) (Duda, 1973) is widely used in machine learning due to its efficiency and its ability to combine evidence from a large number of features. An NBL classifier chooses the category with the highest conditional probability for a given feature vector; the conditional probabilities are computed under the Naive Bayes assumption that the presence of one feature is independent of another when conditioned on the category variable. 
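The Naive Bayes decision rule described above can be sketched as follows. The priors, likelihoods, and smoothing floor are illustrative assumptions; the paper's actual implementation details are in (Liu, 2004).

```python
import math

# Sketch of the Naive Bayes decision rule: pick the category maximizing
# the prior times the product of per-feature likelihoods, computed in
# log space for numerical stability.
def nb_classify(features, priors, likelihoods, categories):
    """priors[c] approximates P(c); likelihoods[c][f] approximates P(f given c)."""
    def log_score(c):
        score = math.log(priors[c])
        for f in features:
            # Small floor for unseen features (an illustrative assumption).
            score += math.log(likelihoods[c].get(f, 1e-6))
        return score
    return max(categories, key=log_score)

priors = {"yeast": 0.5, "fly": 0.5}
likelihoods = {"yeast": {"budding": 0.9, "wing": 0.01},
               "fly": {"budding": 0.01, "wing": 0.9}}
print(nb_classify(["budding"], priors, likelihoods, ["yeast", "fly"]))
# yeast
```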
The training of the Naive Bayes classifier consists of estimating the prior probabilities of the different categories as well as the conditional probabilities of each feature for each category.</Paragraph> <Paragraph position="5"> The Decision List method (DLL) (Yarowsky, 1994) is equivalent to simple case statements in most programming languages. In a DLL classifier, a sequence of tests is applied to each feature vector. If a test succeeds, the sense associated with that test is returned; if it fails, the next test in the sequence is applied. This continues until the end of the list, where a default test simply returns the majority sense. Learning a decision list classifier consists of generating and ordering individual tests based on the characteristics of the training data.</Paragraph> <Paragraph position="6"> Support Vector Machines (SVM) (Vapnik, 1998) is a supervised learning algorithm proposed by Vladimir Vapnik and his co-workers. For a binary classification task with classes {+1, -1}, given a training set with n class-labeled instances, (x1, y1), (x2, y2), ..., (xi, yi), ..., (xn, yn), where xi is a feature vector for the ith instance and yi indicates the class, an SVM classifier learns a linear decision rule, which is represented by a hyperplane. The tag of an unlabelled instance x is determined by which side of the hyperplane x lies on. The purpose of training the SVM is to find the hyperplane that has the maximum margin separating the two classes.</Paragraph> <Paragraph position="7"> Keyword lists have frequently been used to retrieve relevant articles for NLP systems in the biological domain. For example, Iliopoulos et al. (2001) used keywords pertinent to a biological process or a single species to select a set of abstracts for their system. Supervised machine learning has been used by Donaldson et al. (2003) to recognize abstracts describing bio-molecular interactions. 
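The linear decision rule an SVM learns, as described above, can be sketched as follows. The weights and bias here are illustrative, not learned; training would choose them to maximize the margin between the two classes.

```python
# Sketch of a linear SVM decision rule: classify an instance x by which
# side of the hyperplane w . x + b = 0 it lies on.
# Weights and bias below are illustrative assumptions, not learned values.
def svm_predict(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w, b = [0.5, -1.0], 0.25
print(svm_predict(w, b, [2.0, 0.0]))   # 1  (positive side of the hyperplane)
print(svm_predict(w, b, [0.0, 2.0]))   # -1 (negative side)
```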
The training articles in their study were collected and judged by domain experts. In our study, we compared keyword retrieval with supervised machine learning algorithms. The category-labeled training documents used in our study were automatically obtained from model organism databases and MEDLINE.</Paragraph> </Section> </Paper>