<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1309"> <Title>Protein Name Tagging for Biomedical Annotation in Text</Title> <Section position="3" start_page="0" end_page="1" type="metho"> <SectionTitle> 2 Protein Name Tagging </SectionTitle> <Paragraph position="0"> Our task is to identify non-overlapping strings that represent protein names in text. Figure 1 gives an overview. (A graphic word is defined to be a string of contiguous alphanumeric characters with spaces on either side; it may include hyphens and apostrophes, but no other punctuation marks. Quoted from p. 125 of Manning and Schütze (1999).)</Paragraph> <Paragraph position="1"> [Figure 2 caption: ↑ is a cps start and ↓ a cps end. M is a mark and D is a delimiter. The cps starts and cps ends can be determined by the marks M and the delimiters D. → is a token found in the dictionary by common prefix search. A bold → is the optimal path in the trellis. w is a word; m is a morpheme.]</Paragraph> <Paragraph position="2"> A plain sentence undergoes morphological analysis and BaseNP recognition. The latter preprocessing step reflects the intuition that most protein names are found in noun phrases. We extract features from these preprocessing steps and represent them as feature vectors. SVM-based chunking is performed using the features to yield a protein-name-tagged sentence.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Morphological Analysis </SectionTitle> <Paragraph position="0"> Our morphological analysis gives (a) sophisticated tokenization, (b) part-of-speech tagging and (c) annotation of value-added information such as the stemmed form of a word and accession numbers referring to biomedical resources. Our morphological analyzer for biomedical English, cocab, is inspired by the work of Yamashita and Matsumoto (2000).</Paragraph> <Paragraph position="1"> We first define the terms used in this paper, with an illustration in Figure 2.</Paragraph> <Paragraph position="2"> A lexeme is an entry in a dictionary. A common prefix search (cps) is a standard technique for looking up lexemes in the morphological analysis of non-segmented languages. A dictionary is often stored in a trie data structure so that all possible lexemes that match a prefix starting at a given position in the sentence are retrieved efficiently. A common prefix search start position (cps start) is a position in a sentence at which a dictionary lookup can start. A common prefix search end position (cps end) is a position in a sentence by which a matched lexeme must end.</Paragraph> <Paragraph position="3"> A token is a substring in a sentence which matches a lexeme in the dictionary and is enclosed by a cps start and a cps end. Note that the matched lexeme is retrieved from the dictionary by common prefix search. A mark is a special symbol or substring that by itself can form a token even when it appears within a graphic word. A delimiter is a special symbol or code that by itself cannot form a token but works to delimit tokens. Note also that a delimiter cannot appear on the boundaries of a token, but can appear inside a token. Examples of marks and delimiters are shown in Table 1.</Paragraph>
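<Paragraph> As an illustration of common prefix search, consider the following minimal Python sketch (our own illustration, not the authors' code; the dict-of-dicts trie layout and all names are assumptions). It retrieves every lexeme in a small dictionary that matches a prefix of the text starting at a given cps start and ending at a legal cps end:

# Minimal common-prefix-search sketch over a dict-of-dicts trie.
# "$" marks the terminal node of a lexeme.
def build_trie(lexemes):
    root = {}
    for lex in lexemes:
        node = root
        for ch in lex:
            node = node.setdefault(ch, {})
        node["$"] = lex  # store the full lexeme at its terminal node
    return root

def common_prefix_search(trie, text, start, cps_ends):
    """Return every lexeme matching text[start:end] for some legal cps end."""
    node, matches = trie, []
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        # A match only becomes a token if it finishes at a legal cps end.
        if "$" in node and (i + 1) in cps_ends:
            matches.append(node["$"])
    return matches

trie = build_trie(["SLP76", "SLP-76-associated substrate"])
# Suppose marks and delimiters license a cps end after position 27.
print(common_prefix_search(trie, "SLP-76-associated substrate binds X", 0, {27}))
# -> ['SLP-76-associated substrate']
</Paragraph>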
<Paragraph position="4"> [Table 1: Examples of delimiters and marks. Delimiters: space, tab, CR/LF. Marks: . , : ; " ' % / [ ] { } ! ? $ & - ( ), the numerals 0123456789, and transcriptions of Greek letters such as alpha, beta, gamma, delta, epsilon, kappa, sigma and zeta.]</Paragraph> <Paragraph position="5"> A word is a substring in a sentence whose segmentation is given by the morphological analysis. (The marks include transcriptions of Greek letters that often appear in MEDLINE abstracts.) A morpheme is the smallest unit of a word, enclosed by a cps start and the nearest cps end to that cps start.</Paragraph> <Paragraph position="6"> The task of morphological analysis is to find the best pair $\langle W^*, T^* \rangle$ of word segmentation $W^* = w^*_1, \ldots, w^*_n$ and its part-of-speech assignment $T^* = t^*_1, \ldots, t^*_n$, in the sense that the joint probability $P(W,T)$ of the word sequence and the tag sequence is maximized when $W = W^*$ and $T = T^*$. Formally, $(W^*, T^*) = \mathrm{argmax}_{W,T}\, P(W,T)$. The approximate solution for this equation is given by $(W^*, T^*) = \mathrm{argmax}_{W,T} \prod_{i=1}^{n} p(w_i \mid t_i)\, p(t_i \mid t_{i-1}, t_{i-2})$.</Paragraph> <Paragraph position="7"> In order to avoid spurious segmentation, we determine the cps starts and cps ends in a sentence. Marks and delimiters in the sentence are used to find the cps starts and cps ends, shown as ↑ and ↓ respectively in Figure 2.</Paragraph> <Paragraph position="8"> Once the cps starts and cps ends are determined, the problem is to solve the equation of morphological analysis. This consists of (a) finding a set of tokens that match lexemes in the dictionary, (b) building a trellis from the tokens, and (c) running Viterbi-like dynamic programming on the trellis to find the path that best explains the input sentence.</Paragraph> <Paragraph position="9"> In Figure 2, → indicates tokens. Both "SLP76" and "SLP-76-associated+substrate" (+ denotes a space character) are tokens since they are lexemes in the dictionary, but "SLP-76-" is not a token since it is not a lexeme in the dictionary. This allows a lexeme-based tokenization which can accommodate a token that is shorter than, the same as, or longer than a graphic word.</Paragraph> <Paragraph position="10"> The optimal path in the trellis gives the sequence of words by which the input sentence is 'best' tokenized and part-of-speech tagged. This is the word-based output, shown as a sequence of w in Figure 2. In addition, our morphological analyzer produces a morpheme-based output derived from the word-based output. This is the sequence of the smallest units in each segmented word, shown as a sequence of m in Figure 2. Our chunking is based on morphemes and takes note of words as features to overcome the under-segmentation problem.</Paragraph> <Paragraph position="11"> GENIA Corpus 3.0p is used to calculate the word probability p(w|t) and the tag probability p(t|t',t''), which is modeled by a simple trigram. To better cope with biomedical English, we enhance the dictionary (i.e., p(w|t)) in a number of ways.</Paragraph> <Paragraph position="12"> First, we collect human protein names (including synonyms) and their accession numbers from the protein sequence repositories SwissProt (SP) (Boeckmann et al., 2003) and the Protein Information Resource (PIR) (Wu et al., 2002). We convert each entry description to a lexeme. The part-of-speech of the lexeme is set to common noun (NN), and the minimum word probability of NN is assigned for p(w|t). The accession number of the entry is also recorded in the miscellaneous information field of the lexeme.</Paragraph>
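<Paragraph> The dynamic programming step (c) can be sketched as follows. This is our own simplified Python illustration, not the authors' implementation: it decodes a token lattice with a tag bigram for brevity, whereas the model above uses a tag trigram p(t_i|t_{i-1},t_{i-2}); costs are negative log probabilities, so the argmax becomes a minimum-cost path search.

import math

# Illustrative lattice Viterbi. Each edge is (start, end, word, tag) for a
# token found by common prefix search between a cps start and a cps end.
def viterbi(n, edges, p_word_given_tag, p_tag_given_prev):
    # best[(position, tag)] = (accumulated cost, backpointer)
    best = {(0, "<s>"): (0.0, None)}
    for start in range(n):  # positions are processed left to right
        for (s, e, w, t) in edges:
            if s != start:
                continue
            for (pos, prev_t), (cost, _) in list(best.items()):
                if pos != start:
                    continue
                step = -math.log(p_word_given_tag[(w, t)]) \
                       - math.log(p_tag_given_prev[(t, prev_t)])
                if (e, t) not in best or cost + step < best[(e, t)][0]:
                    best[(e, t)] = (cost + step, (pos, prev_t, w))
    # Pick the cheapest state spanning the whole sentence, then backtrace.
    end_key = min((k for k in best if k[0] == n), key=lambda k: best[k][0])
    words, key = [], end_key
    while best[key][1] is not None:
        pos, prev_t, w = best[key][1]
        words.append((w, key[1]))
        key = (pos, prev_t)
    return list(reversed(words))

Given the tokens of Figure 2 as edges and probabilities estimated from the GENIA corpus as described above, the returned path is the 'best' word segmentation and tag assignment.</Paragraph>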
<Paragraph position="13"> Similarly, Gene Ontology (GO) terms (The Gene Ontology Consortium, 2000) are converted to lexemes, where the accession number as well as the root category are kept in the miscellaneous information field. Third, we use the UMLS Specialist Lexicon (NLM, 2002) to obtain the stemmed form of a lexeme. A final twist is to associate constituent information with each compound lexeme. A lexeme is compound if it consists of multiple morphemes, and single otherwise. An example of a compound lexeme is shown in Table 2.</Paragraph> <Paragraph position="14"> [Table 2: An example of a compound lexeme, given as key-value fields; the surface string is "ERK activator kinase 1". One field stores the log of the inverse of p(w|t); the "constituents" field is obtained by searching single lexemes in the dictionary; the "sp" and "pir" fields are associated with accession numbers; the "go" field is associated with an accession number and a root category from molecular function, biological process, and cellular component.]</Paragraph> <Paragraph position="15"> In the conventional paradigm, a token cannot contain a white space character. However, 71.6% of the name descriptions in SP are entries of multiple graphic words. This has been a bottleneck in adapting biomedical resources for language processing. In contrast, our morphological analysis can deal with a lexeme containing a white space character, and thus offers a simple way to incorporate biomedical resources into language processing. When a sentence is morphologically analyzed, the miscellaneous information field is attached, which can be used by the feature extraction component.</Paragraph> </Section> <Section position="3" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.2 BaseNP Recognition </SectionTitle> <Paragraph position="0"> BaseNP recognition is applied to obtain approximate boundaries of BaseNPs in a sentence. The CoNLL-1999 shared task dataset is used for training with YamCha, a general-purpose SVM-based chunker (http://cl.aist-nara.ac.jp/~taku-ku/software/yamcha/). There are four kinds of chunk tags in the CoNLL-1999 dataset, namely IOB1, IOB2, IOE1 and IOE2 (Tjong Kim Sang and Veenstra, 1999).</Paragraph> <Paragraph position="1"> We follow Kudo and Matsumoto (2001) to train four BaseNP recognizers, one for each chunk tag. The word-based output from the morphological analysis is cascaded to each BaseNP recognizer to mark BaseNP boundaries. We collect the outputs from the four recognizers and interpret a word as outside of a BaseNP if all recognizers estimate the "O(utside)" tag, and otherwise as inside of a BaseNP. The intention is to distinguish words that are definitely not a constituent of a BaseNP (outside) from words that may be a constituent of a BaseNP (inside). In this way, we obtain approximate boundaries of BaseNPs in a sentence.</Paragraph> <Paragraph position="2"> Introducing BaseNP recognition as part of preprocessing is motivated by the intuition that most protein names reside in a noun phrase. Our chunking is based on morphemes, and an indication of whether a morpheme lies within or outside a BaseNP boundary seems informative. In addition, the morpheme-based chunking has a narrower local context than the word-based chunking for the same window size. Our intention with approximate BaseNP boundaries is to provide the feature extraction component with top-down information about a morpheme's global scope.</Paragraph>
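<Paragraph> The combination of the four recognizers can be sketched as follows (our own Python illustration; it assumes each recognizer's output has already been aligned to the same word sequence):

# Combine four BaseNP recognizers (trained on IOB1, IOB2, IOE1, IOE2).
# A word is outside a BaseNP only if all four recognizers say "O";
# otherwise it is treated as (possibly) inside.
def combine_basenp(outputs):
    """outputs: four chunk-tag sequences over the same words."""
    assert len(outputs) == 4
    return ["O" if all(t == "O" for t in tags) else "I"
            for tags in zip(*outputs)]

print(combine_basenp([
    ["I", "O", "O"],   # IOB1 recognizer
    ["B", "O", "O"],   # IOB2 recognizer
    ["I", "O", "O"],   # IOE1 recognizer
    ["I", "O", "O"],   # IOE2 recognizer
]))  # -> ['I', 'O', 'O']
</Paragraph>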
</Section> <Section position="4" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.3 Feature Extraction </SectionTitle> <Paragraph position="0"> We extract four kinds of features from preprocessing. Morphological analysis gives information for boundary features, lexical features and biomedical features. BaseNP recognition gives information for syntactic features.</Paragraph> <Paragraph position="1"> Our chunking is based on morphemes, whose boundaries may or may not coincide with those of graphic words. The boundary feature reflects the following observation: a general English word tends to have the same morpheme-based and word-based segmentation, i.e., its degree of boundary ambiguity is low, whereas a protein-coding word tends to have different morpheme-based and word-based segmentations, i.e., its degree of boundary ambiguity is high.</Paragraph> <Paragraph position="2"> For each morpheme, we have four binary features lmor, ldel, rmor and rdel. lmor is 1 if the morpheme is the leftmost morpheme of a word tokenized by the morphological analyzer, and 0 otherwise. ldel is 1 if the morpheme is the leftmost morpheme of a graphic word, and 0 otherwise. Similarly, rmor is 1 if the morpheme is the rightmost morpheme of a word tokenized by the morphological analyzer, and 0 otherwise. rdel is 1 if the morpheme is the rightmost morpheme of a graphic word, and 0 otherwise.</Paragraph> <Paragraph position="3"> The lexical features are multi-valued features. In this work, we consider part-of-speech, stemmed form and string features (e.g. lower-cased string, upper-case letters, numerals, prefix and suffix).</Paragraph> <Paragraph position="4"> The biomedical feature is designed to encode biomedical domain resource information. The morphological analyzer tokenizes a sentence into words with relevant references to biomedical resources. In addition, if a word is derived from a compound lexeme, constituent morpheme information is also attached. (Recall Table 2 for a compound lexeme example.) The biomedical feature is subdivided into a sequence feature and an ontology feature. The sequence feature is a binary feature of accession number reference to SP or PIR. For each word, sp-word is set to 1 if the word has an accession number of SP. For each morpheme, sp-morpheme is set to 1 if the morpheme has an accession number of SP. pir-word and pir-morpheme of PIR are defined in the same way as those of SP. The ontology feature is a binary feature of accession number reference to GO; we have go-word and go-morpheme for GO. Suppose a sentence contains the compound lexeme of Table 2. For the word "ERK activator kinase 1", sp-word is set to 1, but pir-word and go-word are set to 0. For the morpheme "ERK", both sp-morpheme and pir-morpheme are set to 1, but go-morpheme is set to 0.</Paragraph> <Paragraph position="5"> If sp-word or pir-word is set to 1, it means that the word exactly matches a protein name description in SP or PIR. Unfortunately, this is rare due to variant writings of protein names. However, we can expect a sort of approximate matching by considering the morpheme-based features sp-morpheme and pir-morpheme. Moreover, we add the ontology features (go-word, go-morpheme) in order to obtain thesaurus effects.</Paragraph>
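<Paragraph> A minimal sketch of the boundary feature computation (our own Python illustration; morphemes, analyzer words and graphic words are assumed to be given as character spans (start, end) over the sentence):

# Compute lmor/ldel/rmor/rdel for each morpheme span.
def boundary_features(morpheme_spans, word_spans, graphic_spans):
    w_left = {s for s, _ in word_spans}      # analyzer-word left edges
    w_right = {e for _, e in word_spans}     # analyzer-word right edges
    g_left = {s for s, _ in graphic_spans}   # graphic-word left edges
    g_right = {e for _, e in graphic_spans}  # graphic-word right edges
    feats = []
    for s, e in morpheme_spans:
        feats.append({
            "lmor": int(s in w_left),   # leftmost morpheme of an analyzer word
            "ldel": int(s in g_left),   # leftmost morpheme of a graphic word
            "rmor": int(e in w_right),  # rightmost morpheme of an analyzer word
            "rdel": int(e in g_right),  # rightmost morpheme of a graphic word
        })
    return feats

For a general English word the four values tend to agree, while for a protein-coding word they tend to disagree, which is exactly the boundary ambiguity the feature is meant to capture.</Paragraph>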
<Paragraph position="8"> The syntactic feature reflects the intuition that most protein names are found in noun phrases.</Paragraph> <Paragraph position="9"> We use two syntactic features, an indicator morpheme feature and a head-morpheme candidate feature. Both features are relevant only for BaseNP constituent morphemes.</Paragraph> <Paragraph position="10"> Fukuda et al. (1998) observe that terms such as "receptor" or "enzyme" that describe the function or characteristic of a protein tend to occur in or near a protein name. They use those terms as indicators of the presence of a protein name. We also express them as an indicator morpheme feature, but with the additional constraint that indicators only influence morphemes found in the same BaseNP.</Paragraph> <Paragraph position="11"> In addition, Arabic and Roman numerals and transcriptions of Greek letters are frequently used to specify an individual protein. We call these specifiers in this paper. Without a deep analysis of compound words, it is hard to determine the morpheme that a specifier depends on, since the specifier could be on the left ("alpha-2 catenin") or on the right ("interleukin 2") of the head morpheme. We assume that such a specifier morpheme and its head candidate morpheme exist within the same BaseNP boundary, and we express this observation as the head-morpheme candidate feature for each specifier morpheme.</Paragraph> <Paragraph position="12"> In the absence of a powerful parser, the syntactic features provide only an approximation. However, the indicator morpheme feature suggests the existence of a protein name, and the head-morpheme candidate feature is intended to discriminate specifiers appearing near protein-coding morphemes from the rest.</Paragraph> </Section> <Section position="5" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.4 Chunking as Sequential Classification </SectionTitle> <Paragraph position="0"> Our protein name tagging is formulated as IOB2/IOE2 chunking (Tjong Kim Sang and Veenstra, 1999). Essentially, our method is the same as that of Kudo and Matsumoto (2001) in viewing the task as a sequence of classifications of each chunk label by SVM. The main difference is that our chunking is based on morphemes, and uses the features described in Section 2.3 to serve the needs of protein name tagging.</Paragraph> </Section> </Section> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 Experimental Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.1 Experiment using Yapex Corpus </SectionTitle> <Paragraph position="0"> We first conduct experiments with the Yapex corpus, the same corpus used in Olsson et al. (2002), to get a direct comparison with a well-performing rule-based approach. (The approach of Fukuda et al. (1998) was also evaluated with the Yapex corpus; to date, Fukuda et al. (1998) report the best result among rule-based approaches, evaluated with their own closed corpus.) There are 99 abstracts for training and 101 abstracts for testing.</Paragraph>
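<Paragraph> For concreteness, the two chunk representations can be produced as follows (our own Python illustration; a protein name is assumed to be given as a span of morpheme indices):

# Convert protein-name spans into IOB2 and IOE2 labels per morpheme.
def to_iob2(n, spans):
    tags = ["O"] * n
    for s, e in spans:              # a span covers morphemes s .. e-1
        tags[s] = "B-PROTEIN"       # IOB2 marks the Begin boundary
        for i in range(s + 1, e):
            tags[i] = "I-PROTEIN"
    return tags

def to_ioe2(n, spans):
    tags = ["O"] * n
    for s, e in spans:
        tags[e - 1] = "E-PROTEIN"   # IOE2 marks the End boundary
        for i in range(s, e - 1):
            tags[i] = "I-PROTEIN"
    return tags

# "the / SLP / - / 76 / protein": morphemes 1-3 form a protein name.
print(to_iob2(5, [(1, 4)]))  # ['O', 'B-PROTEIN', 'I-PROTEIN', 'I-PROTEIN', 'O']
print(to_ioe2(5, [(1, 4)]))  # ['O', 'I-PROTEIN', 'I-PROTEIN', 'E-PROTEIN', 'O']

Classifying IOB2 labels left to right concentrates the learner on left boundaries, while classifying IOE2 labels right to left concentrates it on right boundaries, which is visible in the results below.</Paragraph>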
<Paragraph position="1"> [Table 3: Parameters for the SVM-based chunker YamCha; see Kudo and Matsumoto (2001) for more information about the parameters. Type of kernel: polynomial. Degree of kernel: 2. Direction of parsing: forward for IOB2, backward for IOE2. Context window: -2, -1, 0, +1, +2.]</Paragraph> <Paragraph position="2"> [Table 4: Boundary conditions. strict: correct if the boundaries of the system output match those of the answer on both sides. left: correct if the left boundary of the system output matches that of the answer. right: correct if the right boundary of the system output matches that of the answer. sloppy: correct if any morpheme estimated by the system overlaps with any morpheme defined by the answer.]</Paragraph> <Paragraph position="3"> Each sentence undergoes preprocessing, feature extraction and SVM-based chunking to obtain a protein-name-tagged sentence. We also use YamCha for this task. The parameters for YamCha are summarized in Table 3. Our evaluation criteria follow those of Olsson et al. (2002). We calculate the standard measures of precision, recall and f-score for each of the boundary conditions strict, left, right and sloppy described in Table 4.</Paragraph> <Paragraph position="4"> The performance of our method on the Yapex corpus is summarized in Tables 5 and 6, along with that of the Yapex protein tagger. Our method achieves results as good as the hand-crafted rule-based approach, despite the small set of training data (99 abstracts), which works unfavorably for machine learning approaches. The better performance in strict could be attributed to chunking based on morphemes instead of words.</Paragraph> <Paragraph position="5"> Yapex has a good recall rate, while our method enjoys good precision in all boundary conditions. A possible explanation for the low recall is that the training data (99 abstracts) was too small for the SVM to generalize from. As we report in the next subsection, we no longer observe a low recall when training with the medium-sized (590 abstracts) and the large-sized (1600 abstracts) data.</Paragraph> <Paragraph position="6"> IOB2 chunking with forward parsing gives better results in left, while IOE2 chunking with backward parsing gives better results in right. The result follows our intuition that IOB2 chunking with forward parsing intensively learns the left boundary between B(egin) and O(utside), while IOE2 chunking with backward parsing intensively learns the right boundary between E(nd) and O(utside). Use of a weighted voting of multiple system outputs, as discussed in Kudo and Matsumoto (2001), is left for future research.</Paragraph> <Paragraph position="7"> Effects of each feature in IOB2 chunking with forward parsing are summarized in Table 7. Each feature is assessed by subtracting the focused feature from the maximal model in Table 5. Since the test dataset is only 101 abstracts, it is difficult to observe statistical significance. Based on the offsets, the result suggests that the incorporation of biomedical features (sequence and ontology) is crucial in protein name tagging. The contribution of the syntactic features is not as significant as we originally expected. Considering that the syntactic features we use are approximations obtained from BaseNP boundaries, this outcome may be inevitable. We plan to investigate further effective syntactic features, such as word dependencies from a word dependency parser. [Table 7: Effects of each feature under each boundary condition. The f-score is subtracted from that of the maximal model in IOB2 chunking with forward parsing (Table 5). The upper rows show the effect of removing a single feature; the lower rows show the effect of removing multiple features of the same class.]</Paragraph>
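<Paragraph> The four boundary conditions can be sketched as follows (our own Python illustration; system outputs and answers are assumed to be morpheme-index spans):

# Decide whether a system span is correct under each condition of Table 4.
def is_correct(sys_span, answers, condition):
    s, e = sys_span
    if condition == "strict":   # both boundaries must match
        return (s, e) in answers
    if condition == "left":     # left boundary must match
        return any(s == a_s for a_s, _ in answers)
    if condition == "right":    # right boundary must match
        return any(e == a_e for _, a_e in answers)
    if condition == "sloppy":   # any morpheme overlap is enough
        return any(s < a_e and a_s < e for a_s, a_e in answers)
    raise ValueError(condition)

answers = [(1, 4)]
print(is_correct((1, 4), answers, "strict"))  # True
print(is_correct((1, 3), answers, "left"))    # True
print(is_correct((2, 4), answers, "right"))   # True
print(is_correct((3, 6), answers, "sloppy"))  # True

Precision, recall and f-score are then computed from these counts in the standard way.</Paragraph>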
</Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.2 Experiment with GENIA Corpus </SectionTitle> <Paragraph position="0"> In order to test our method on a larger dataset, we use the recently released GENIA corpus 3.01. Unlike the Yapex corpus, the GENIA corpus contains 2000 abstracts and uses a hierarchical tagset. For our experiment, we use two definitions of a protein: one to identify G#protein_molecule and the other to identify G#protein_X. The former is a narrower sense of protein names, and closer to a protein name in the Yapex corpus, where a protein name is defined as something that denotes a single biological entity composed of one or more amino acid chains. The latter covers a broader sense of protein, including families and domains. We evaluate our method with the two versions of protein names since the desired granularity of a protein name depends on the application.</Paragraph> <Paragraph position="1"> Two datasets are prepared in this experiment. One is the GENIA 1.1 subset and the other is the GENIA 3.01 set. The GENIA 1.1 subset contains the 670 abstracts from GENIA 3.01 whose Medline IDs are also found in GENIA corpus 1.1. We split the GENIA 1.1 subset into the test dataset of 80 abstracts used in Kazama et al. (2002) and the training dataset of the remaining 590 abstracts. The GENIA 3.01 set is the entire GENIA corpus 3.01. We randomly split it so that 4/5 is used for training and the remaining 1/5 for testing. [Table 8: Results on the GENIA 1.1 subset of 670 abstracts (590 abstracts for training and 80 abstracts for testing). Table 9: Results on the GENIA 3.01 set of 2000 abstracts (1600 abstracts for training and 400 abstracts for testing).]</Paragraph> <Paragraph position="2"> Results in Tables 8 and 9 show that the broader class G#protein_X is easier to learn than the narrower class G#protein_molecule. The results of protein name recognition in Kazama et al. (2002) using GENIA 1.1 are 0.492, 0.664 and 0.565 for precision, recall and f-score, respectively. GENIA 1.1 has only one class for protein names (GENIA#protein), while GENIA 3.01 has hierarchically organized tags for the protein name class. Assuming that GENIA#protein in GENIA 1.1 corresponds to G#protein_X in GENIA 3.01, we could claim that our method gives better results than their SVM approach. The better performance could be attributed to chunking based on morphemes instead of graphic words, and to better adaptation of biomedical resources. Next, we compare the Yapex performance with that of G#protein_molecule trained with 1600 abstracts (cf. Table 5 and Table 9), though the tagging policy and corpus are different. Our method significantly outperforms in strict, is better in left and right, and is slightly worse in sloppy. With a large set of training data (1600 abstracts), we obtain an f-score of 70 points for G#protein_molecule and 75 points for G#protein_X, which are comparable to approaches reported in the literature. An increase of training data from 590 abstracts to 1600 abstracts helps improve the overall performance, provided that corpus errors are minimized. Our internal experiments with GENIA 3.0 (the version later corrected to GENIA 3.01) reveal that corpus errors are critical to our method.</Paragraph>
<Paragraph position="3"> Even if corpus errors were successfully removed, it would not be practical to keep increasing the size of a labor-intensively annotated corpus. The use of unlabeled data in conjunction with a small but high-quality set of labeled data, e.g. Collins and Singer (1999), would have to be explored.</Paragraph> </Section> </Section> </Paper>