File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/w04-1215_intro.xml
Size: 6,583 bytes
Last Modified: 2025-10-06 14:02:38
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1215"> <Title>Annotating Multiple Types of Biomedical Entities: A Single Word Classification Approach</Title>
<Section position="3" start_page="0" end_page="80" type="intro"> <SectionTitle> 2 Methods </SectionTitle>
<Paragraph position="0"> Most past work on recognizing named entities in the biomedical domain has focused on identifying a single type of entity, such as protein or gene names. Annotating multiple types of named entities simultaneously is clearly more challenging. Intuitively, one could develop a specific recognizer for each entity type, run the recognizers one by one to annotate all types of named entities, and merge the results; the problems with this approach are boundary decisions and annotation conflicts among the recognizers. Instead of constructing five individual recognizers, we regarded the multiple-class annotation task as a classification problem and tried to learn a single classifier capable of identifying all five types of named entities.</Paragraph>
<Paragraph position="1"> Before classification, we have to decide on the unit of classification. Since it is difficult to mark the boundary of a name correctly, the simplest approach is to treat each individual word as an instance and assign a type to it. After type assignment, consecutive words of the same type are marked as a complete named entity of that type. The feature extraction process is described in the following subsections.</Paragraph>
<Section position="1" start_page="80" end_page="80" type="sub_section"> <SectionTitle> 2.1 Feature Extraction </SectionTitle>
<Paragraph position="0"> The first step in classification is to extract informative and useful features to represent each instance to be classified. In our work, a word is represented by the attributes it carries itself, the attributes contributed by the two surrounding words, and other contextual information. The details are as follows.</Paragraph>
<Paragraph position="1"> The word &quot;attribute&quot; is sometimes used interchangeably with &quot;feature&quot;, but in this article they denote two different concepts. Features are what we use to represent a classification instance, and the information enclosed in the features is not necessarily contributed by the word itself.</Paragraph>
<Paragraph position="2"> In this paper, attributes are defined as the information that can be derived from the word alone. The attributes assigned to each word are: whether it is part of a gene/protein name, whether it is part of a species name, whether it is part of a tissue name, whether it is a stop word, whether it is a number, whether it is punctuation, and its part of speech. Instead of using a lexicon for gene/protein name annotation, we employed two gene/protein name taggers, Yapex and GAPSCORE, for this job. For part-of-speech tagging, Brill's tagger was adopted. Contextual information has been shown to be helpful in annotating gene/protein names, and therefore two strategies for extracting contextual information at different levels are used: one at the usual word level and the other at the pattern level. Since the training data released at the beginning of the task does not mark abstract boundaries, we have to assume that sentences are independent of each other, and contextual information extraction is therefore limited to within a sentence.</Paragraph>
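To make the single-word classification setup above concrete, here is a minimal sketch of per-word attribute and feature extraction in Python. It only illustrates the idea described in Section 2.1: every helper resource (the Yapex/GAPSCORE span set, the species and tissue lexicons, the stop-word list, and the Brill POS tags) is a hypothetical stand-in supplied by the caller, not part of the original system.

import string

STOP_WORDS = {"the", "of", "and", "in", "to", "a"}  # placeholder stop-word list

def word_attributes(tokens, pos_tags, gene_protein_idx, species_terms, tissue_terms, i):
    """Attributes derivable from the word alone, as listed in Section 2.1."""
    token = tokens[i]
    return {
        "in_gene_protein_name": i in gene_protein_idx,       # from Yapex/GAPSCORE output (assumed format)
        "in_species_name": token.lower() in species_terms,   # species lexicon (assumed)
        "in_tissue_name": token.lower() in tissue_terms,      # tissue lexicon (assumed)
        "is_stop_word": token.lower() in STOP_WORDS,
        "is_number": token.replace(".", "", 1).isdigit(),
        "is_punctuation": all(c in string.punctuation for c in token),
        "pos": pos_tags[i],                                    # from Brill's tagger (assumed input)
    }

def instance_features(tokens, pos_tags, gene_protein_idx, species_terms, tissue_terms, i):
    """One classification instance per word: its own attributes plus those of the
    two surrounding words; collocate and pattern features would be appended later."""
    feats = {}
    for offset in (-1, 0, 1):
        j = i + offset
        if 0 <= j < len(tokens):
            attrs = word_attributes(tokens, pos_tags, gene_protein_idx,
                                    species_terms, tissue_terms, j)
            feats.update({f"{name}@{offset}": value for name, value in attrs.items()})
    return feats

After every word in a sentence receives a predicted type, consecutive words sharing the same type would then be merged into a single entity, as described above.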
<Paragraph position="3"> For contextual information extraction at the word level (Hou and Chen, 2003), collocates were extracted along with four statistics: the frequency, the average and standard error of the distance between the word and the entity, and the t-test score. The frequency and t-test score were normalized to [0, 1]. Five lists of collocates were obtained, one each for cell-line, cell-type, DNA, RNA, and protein.</Paragraph>
<Paragraph position="4"> For contextual information extraction at the pattern level, we first gathered a list of the words constituting a specific type of named entity.</Paragraph>
<Paragraph position="5"> Then hierarchical clustering with a cutoff threshold was performed on these words, with edit distance adopted as the measure of dissimilarity (see Figure 1). Afterwards, common substrings were obtained to form the list of patterns. With the list of patterns at hand, we estimated the pattern distribution, i.e., the occurrence frequencies of a pattern at and around the current position, given the type of the word at the current position. Figure 2 shows an example of an estimated distribution. The average KL-divergence between any two of a pattern's type-conditional distributions was computed to measure the discriminative power of that pattern.</Paragraph>
<Paragraph position="6"> The formula is as follows: the discriminative power of pattern p is the average, over all ordered pairs of distinct types (t1, t2), of D_KL(Prob_p(. | t1) || Prob_p(. | t2)), where D_KL(P || Q) = sum_x P(x) log (P(x) / Q(x)).</Paragraph> </Section>
<Section position="2" start_page="80" end_page="80" type="sub_section"> <SectionTitle> 2.2 Constructing Training Data </SectionTitle>
<Paragraph position="0"> For each word in a sentence, the attributes of the word and of the two adjacent words are put into the feature vector. Then the five words to the left and the five words to the right are searched for previously extracted collocates; the 15 variables thus added are derived from the maximum likelihood estimates of the mean and standard deviation of the collocate distance for wi given each type. Next, the three words to the left and the three words to the right, along with the current word, are searched for patterns, adding six variables to the feature vector.</Paragraph>
<Paragraph position="2"> One of these variables is computed for each type from the pattern pmfs, where type is one of the six types including 'O', P_wi is the set of patterns matching wi, and Prob_p denotes the pmf for pattern p. Finally, the type of the previous word is added to the feature vector, mimicking the concept of a stochastic model.</Paragraph> </Section>
<Section position="3" start_page="80" end_page="80" type="sub_section"> <SectionTitle> 2.3 Classification </SectionTitle>
<Paragraph position="0"> Support vector machine (SVM) classification with a radial basis function kernel was adopted for this task, and the package LIBSVM - A Library for Support Vector Machines (Hsu et al., 2003) was used for training and prediction. The penalty coefficient C in the optimization and the gamma parameter in the kernel function were tuned using a script provided with the package.</Paragraph>
<Paragraph position="1"> The constructed training data contains 492,551 instances, which is too large for training. Moreover, the training data is extremely unbalanced (see Table 1), a known problem in SVM classification. Therefore, we performed stratified sampling to form a smaller, balanced data set for training.</Paragraph> </Section> </Section> </Paper>
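As an illustration of the final training step described in Section 2.3, the sketch below performs per-class stratified downsampling and trains an RBF-kernel SVM. It uses scikit-learn's SVC (which wraps LIBSVM) rather than the original LIBSVM command-line scripts, and the per-class quota and the C/gamma values are placeholders, not the settings used in the paper.

import random
from collections import defaultdict
from sklearn.svm import SVC  # scikit-learn's SVC wraps LIBSVM

def stratified_sample(X, y, per_class=2000, seed=0):
    """Downsample to at most `per_class` instances per label to obtain a
    smaller, more balanced training set (the quota is a placeholder value)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for xi, yi in zip(X, y):
        by_label[yi].append(xi)
    X_s, y_s = [], []
    for label, items in by_label.items():
        chosen = items if len(items) <= per_class else rng.sample(items, per_class)
        X_s.extend(chosen)
        y_s.extend([label] * len(chosen))
    return X_s, y_s

# X: list of numeric feature vectors, y: list of type labels (five entity types plus 'O')
# X_small, y_small = stratified_sample(X, y)
# clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # C and gamma would be tuned, e.g. by grid search
# clf.fit(X_small, y_small)
# predictions = clf.predict(X_test)

Capping every class at the same quota is one simple way to realize the smaller, balanced training set described above; other stratified schemes would work equally well for the illustration.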