<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1315">
  <Title>An Investigation of Various Information Sources for Classifying Biological Names</Title>
  <Section position="3" start_page="1" end_page="2" type="metho">
    <SectionTitle>
2 Sources of Information for Name
Classification
</SectionTitle>
    <Paragraph position="0"> To classify a name we consider both the words within the name (i.e., name internal) as well as the nearby words, the context of occurrences.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.1 Using Name Internal Information
</SectionTitle>
      <Paragraph position="0"> Methods for learning to identify names try to induce patterns of words and special characters that might constitute names. Hence the entire sequence of words in a name is important and necessary for name identification purposes. In contrast, for classification purposes, some parts of the names are more important than the others and some may play no role at all. For example, in the name cyclic AMP response element-binding protein, the last word, protein, is sufficient for its classification. Similarly, Adherence-isolated monocytes, can be classified on the basis of its last word, monocytes.</Paragraph>
      <Paragraph position="1"> The fact that the last word of a name often bears the most information about the class of the name is not surprising. In English, often the type of object referred by a noun phrase is given by the head noun.</Paragraph>
      <Paragraph position="2"> Viewing a name as a noun phrase, the head noun is likely to determine its class. And in English noun phrases, the head noun is often the rightmost word because of the right-branching structure of English noun phrases. Quite often the nouns correspond to concepts (or classes) in an ontology. In such cases, we call these nouns functional terms or f-terms,following the terminology used in some name recognizers proposed for the biomedical domain.</Paragraph>
      <Paragraph position="3">  The notion of f-terms, was first introduced in the design of KEX (Fukuda et al., 1998). In this work, asetofwordssuchasproteins and receptors,were manually selected as f-terms. In this protein name recognition system, as well as in Yapex (Franzen et al., 2002), f-terms are only used for locating names in text. On the other hand, the system reported in (Narayanaswamy et al., 2003), which identifies the names of other classes as well, generalizes them to also classify names as well. Thus, f-terms are identified with types/classes.</Paragraph>
      <Paragraph position="4"> The existing methods that use f-terms rely on a manually selected list of f-terms. However, manual selection methods are usually susceptible to errors of omission. In Section 4.1, we consider a method that tries to automatically select a list of f-terms and the resultant word classes based on the GENIA corpus.</Paragraph>
      <Paragraph position="5"> We then use this generated list to test our intuitions about f-terms.</Paragraph>
      <Paragraph position="6"> We also consider f-terms extended to consist of two consecutive words. We refer to these as bigram f-terms to differentiate them from single word only (unigram) f-terms. The use of bigrams will help us to classify names when the last word is not an f-term, but the last two words together can uniquely classify the name. For example, Allergen -specific T cell clones cannot be classified using the last word alone. However, a name ending with cell clones as the last bigram is likely to be a 'Source'.</Paragraph>
      <Paragraph position="7">  Often the information about the class designated by a noun can be found in its suffix, particularly in a scientific domain. If f-terms can be viewed as words that designate a class of entities then note that suffixes also play the same role. For example, words ending with the suffix -amine are nitrogen compounds and those ending with -cytes are cells.</Paragraph>
      <Paragraph position="8"> Thus using suffixes results in a generalization at the word level. A method of selecting a list of suffixes and associating classes with them is described in Section 4.1.</Paragraph>
      <Paragraph position="9">  Of course, not all names can be classified on the basis of f-terms and suffixes only. Sometimes names are chosen on a more ad hoc manner and do not reflect any underlying meaning. In such cases, matching with names found in a dictionary would be the only name-internal method possible.</Paragraph>
      <Paragraph position="10"> We cannot simply use an &amp;quot;exact matching&amp;quot; algorithm since such a method would only work if the name was already present in our dictionary. As it is not reasonable at this time to have a dictionary that contains all possible names, we can attempt to use approximate matches to find similar names in the dictionary and use them for classification purposes.</Paragraph>
      <Paragraph position="11"> Such a method then can be thought of finding a way to generalize from the names in a dictionary, instead of relying on simple memorization.</Paragraph>
      <Paragraph position="12"> However, assuming a large dictionary is not feasible at this time especially for all the classes. So our alternate is to look at examples from GENIA corpus.</Paragraph>
      <Paragraph position="13"> The candidate examples that we will use for classification would be the ones that most closely match a given name that needs to be classified. Hence, the method we are following here essentially becomes an example-based classification method such as k-nearest neighbor method. One approach to this task is described in Section 4.3.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
2.2 Using Context
</SectionTitle>
      <Paragraph position="0"> We now turn our attention to looking at clues that are outside the name being classified. Using context has been widely used for WSD and has also been applied to name classification (for example, in (Collins and Singer, 1999; Cucerzan and Yarowsky, 1999)).</Paragraph>
      <Paragraph position="1"> This approach has also been adopted for the biomedical domain as illustrated in the work of (Hatzivassiloglou et al., 2001; Narayanaswamy et al., 2003; Casta~no et al., 2002)  .</Paragraph>
      <Paragraph position="2"> In the WSD work involving the use of context, we can find two approaches: one that uses few strong contextual evidences for disambiguation purposes, as exemplified by (Yarowsky, 1995); and the other that uses weaker evidences but considers a combination of a number of them, as exemplified by (Gale et al., 1992). We explore both the methods. In Section 4.4, we discuss our formulation and present a simple way of extracting contextual clues.</Paragraph>
      <Paragraph position="3">  (Casta~no et al., 2002) can be seen as using context in its type coercion rules.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3 Experimental Setup
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Division of the corpus
</SectionTitle>
      <Paragraph position="0"> We divided the name-annotated GENIA corpus (consisting of 2000 abstracts) into two parts--1500 abstracts were used to derive all the clues: f-terms, suffixes, examples (for matching) and finally contextual features. These derived sources of information were then used to classify the names found in the remaining 500 abstracts. The keys from the annotated corpus were then used to compute the precision and recall figures. We will call these two parts the training and test sections.</Paragraph>
      <Paragraph position="1"> Since we pick the names from the test section and classify them, we are entirely avoiding the name identification task. Of course, this means that we do not account for errors in classification that might result from errors in identifying names. However, we believe that this is appropriate for two reasons. Our investigation focuses on how useful the above mentioned features are for classification and we felt that this might be slanted based on the name identifier we use and its characteristics. Secondly, most of the errors are due to not finding the correct extent of the name, either because additional neighboring words are included or because some words/characters are not included. In our experience, most of these errors happen at the beginning part of the name and, hence, should not unduly affect the classification.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Classes of Names
</SectionTitle>
      <Paragraph position="0"> In our method, we classify names into one of the five classes that we call Protein, Protein Part, Chemical, Source and Others. We don't have any particularly strong reasons for this set of classes although we wish to point out that the first four in this choice corresponds to the classes used by the name recognizer of (Narayanaswamy et al., 2003). It must be noted that the class proteins not only include proteins but also protein families, and genes; all of which are recognized by many protein name recognizers. The GENIA class names were then mapped onto our class names.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Tokenization
</SectionTitle>
      <Paragraph position="0"> After the assignment of classes, all the extracted names were tokenized. Noting that changing a digit by another, a Greek character by another, a Roman numeral by another rarely ever results in obtaining another name of a different class, our name tokenization marks these occurrences accordingly.</Paragraph>
      <Paragraph position="1"> To remove variability in naming, hyphens and extra spaces were removed. Also, as acronyms are not useful for detecting types, their presence is identified (in our case we use a simplistic heuristic that acronyms are words with 2 or more consecutive upper case characters).</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.4 Evaluation Methodology
</SectionTitle>
      <Paragraph position="0"> We used an n-fold cross-validation to verify that the results and conclusions we draw are not slanted by a particular division of the 2000 abstracts. The corpus was divided into sets of 500 abstracts - the composition of each set being random - thus obtaining 4 different partitions. In the first partition, the first three sets were combined to form the Training Set and the last was used as the Test Set. In the second partition, the second, third and fourth sets formed the Training Set and the first was used as the Test Set and so on.</Paragraph>
      <Paragraph position="1"> The overall results that we report in Section 5 were the average of results on the four partitions.</Paragraph>
      <Paragraph position="2"> However, the first partition was used for more detailed investigation.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="2" end_page="5" type="metho">
    <SectionTitle>
4 Classification Method
</SectionTitle>
    <Paragraph position="0"> Given an unclassified name, we first tried to classify it on the basis of the f-terms and the suffixes. If that failed, we applied our string matcher to try to find a match and assign a category to the unknown name. Finally, we used context to assign classes to the names that were still left unclassified.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.1 F-Term and Suffix Extraction
</SectionTitle>
      <Paragraph position="0"> Since we consider f-terms to be nouns that appear at the end of a name and denote a type of entity, their presence in the name suffices for its classification. Hence, we use the last words of names found in the training set to see if they can uniquely identify the class. To generate a list of f-terms and their respective classes, we count each word or pair of words that is found at the end of any name. A unigram or bigram, w, was selected as an f-term if it appeared at least 5 times and if the conditional probability P(classj w) for any class exceeds a threshold which we set at 0.95.</Paragraph>
      <Paragraph position="1"> In the counting to estimate this conditional probability we ignore the presence of digits, Greek characters and Roman numerals as discussed in the Section 3.3. For example, in latent membrane protein 1 the '1' at the end is ignored and 'protein' will be selected as the unigram for the count.</Paragraph>
      <Paragraph position="2"> The number of f-terms selected for chemicals was the lowest. This is not surprising considering chemical names have few words defining subtypes of chemicals. acetate was an example chosen for this class. Some other examples of extracted f-terms and their associated classes are: cell, tissue, virus (for Source); kinase, plasmid and protein (for Proteins); subunit, site and chain (for Protein Parts) and bindings and defects (for Others). A couple of surprising words were selected. Due to the limitations of our method, we do not check if a last name indeed denotes a class of entities but merely note that the name is strongly associated with a class. Hence, protein names like Ras and Tax were also selected.</Paragraph>
      <Paragraph position="3"> For suffix extraction, we considered suffixes of length three, four and five. Since we argued earlier that the suffixes that we are considering play the same role as f-terms, we only consider the suffixes of the last word. This prevents the classification of cortisol- dependent BA patients (a 'Source') as a 'Chemical' on the basis of the suffix -isol.Also, like in the case of f-terms, digits, Greek characters etc at the end of a name were ignored. However, unlike f-terms, if the last word is an acronym the whole name is dropped, as taking the suffix of an acronym wouldn't result in any generalization. The probability of a class given a suffix is then calculated and only those suffixes which had a probability of greater than the probability threshold were selected.</Paragraph>
      <Paragraph position="4"> When generating the list of suffixes, we have two possibilities. We could choose to consider names which ended with an f-term that was selected or not consider these names under the assumption that f-terms would be sufficient to classify such names. We found that considering the suffixes of the f-terms results in a significant increase in the recall with little or no change in precision. This rather surprising result can be understood if we consider the kinds of names that show up in the class Others.Asuffix such as ation was selected because a number of names ending with selected f-terms like transplantation, transformation,andassociation. This suffix allows us to classify AP-1 translocation on the basis of the suffix despite the fact that translocation was not chosen as an f-term.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.2 Classification based on f-terms and suffixes
</SectionTitle>
      <Paragraph position="0"> Given a set of f-terms and suffixes, along with their associated classes, selected from the training part, names in the test portion were classified by looking at the words that end the names. If a name ended with a selected f-term, then the name was tagged as belonging to the corresponding class. If a match was not found, the suffix of the last word of the name was extracted and a match was attempted with the known list of suffixes. If no match was found, the name was left unclassified.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
4.3 Classifying Names using Similar Examples
</SectionTitle>
      <Paragraph position="0"> We had discussed earlier the use of similar examples to classify a new occurrence of a name. To find similar examples, standard string matching algorithms are often used which produce a similarity score that varies inversely with the number of edit operations needed to match two strings identically. However, we abandoned the use of standard string matching programs as their performance for classification purposes was rather poor. Primarily this was due to the fact that these algorithms do not distinguish between matches at the beginning and at the end of the name strings. As discussed before, for classification purposes the position of words is important and we noticed that matches at the beginning of the strings were hardly ever relevant unlike the case with those at the end. For this reason, we developed our own matching algorithm.</Paragraph>
      <Paragraph position="1"> Given a name in the test corpus, we try to find how similar it is to candidate examples taken from the training portion. For each pair of names, we first try to pair together the individual words that make up the names allowing for some partial matching.</Paragraph>
      <Paragraph position="2"> These partial matches allow for certain kinds of substitutions that we do not believe will affect the classification. These include dropping a plural &amp;quot;s&amp;quot;, substituting one Greek character by another, changing an uppercase character by the same character in lower case, changing an Arabic/Roman single digit by another, changing a Roman numeral by an Arabic one, and dropping digits. Each substitution draws a small penalty (although dropping digits incurs a slightly greater penalty) and only a perfect match receives a score of 1 for matching of individual words. Complete mismatches receive a score of 0.</Paragraph>
      <Paragraph position="3"> We then try to assign a score to the whole pair of names. We begin by assigning position numbers to each pair of words (including matches, mismatches and drops) starting from the rightmost match which is assigned a position of zero. Mismatches to the right of the first match, if any, are assigned negative positions. We then use a weight table that gives more weightage to lower position numbers (i.e., towards the end of the strings rather than the beginning) to assign a weight to each pair of words depending on the position. Then the score of the entire match is given by a weighted sum of the match scores, normalized for length of the string. Assigning a score of 0 for a mismatch is tantamount to saying that a mismatch does not contribute towards the similarity score. A negative score for a mismatch would result in assigning a penalty.</Paragraph>
      <Paragraph position="4"> We only consider those strings as candidate examples if their similarity score is greater than a threshold #0B. To assign a class to a name instance, we look at the k nearest neighbors, as determined by their similarity scores to the name being classified. To assign a class to the name, we weight the voting of each of the k (or fewer) candidates by their similarity score. A class is assigned only if the the ratio of the scores of the top two candidates exceeds a threshold, #0C. The precision should tend to increase with this ratio.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="5" type="sub_section">
      <SectionTitle>
4.4 Classifying Based on Context
</SectionTitle>
      <Paragraph position="0"> To identify the best sources of contextual information for classifying names, we considered two possibilities -- the use of a single strong piece of evidence and the use of a combination of weak evidences. For the former we made use Decision Lists similar to Yarowsky's method for Word Sense Disambiguation (WSD) (Yarowsky, 1995). However, we found that this method had a poor recall.</Paragraph>
      <Paragraph position="1">  As always, the reason for using a threshold is that it allows us to find the appropriate level of compromise between precision and recall. Given that there are different sources of information, there is no need to insist that particular method assign a class tag if we are not comfortable with the level of confidence that we have in such an assignment.</Paragraph>
      <Paragraph position="2">  Due to space limitations, we don't discuss why we might have obtained the poor recall that we got for the decision list Hence, we decided to use a combination of weak evidences and employ the Naive-Bayes assumption of independence between evidences, similar to the method described in (Gale et al., 1992). To do this, the words that occurred within a window and that matched some template pattern were selected as features if their scores  exceeded some threshold (which we name a). Also, unlike Decision Lists, all the features presented in the context of a name instance were involved in its classification and the probability that a name instance has a certain class was calculated by multiplying probabilities associated with all the features. As some of the evidences might be fairly weak, we wanted to classify only those cases where the combination of features strongly indicated a particular class. This is done by comparing the two probabilities associated with the best two classes for an instance. A class was assigned to a particular name instance only when the ratio of the two probabilities was beyond a certain threshold (which will call b). Together with the threshold, a for the feature selection, choice of this threshold could allow trade-off between precision and recall for classification accuracies.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>