File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/n04-2002_intro.xml

Size: 5,058 bytes

Last Modified: 2025-10-06 14:02:15

<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-2002">
  <Title>Identifying Chemical Names in Biomedical Text: An Investigation of the Substring Co-occurrence Based Approaches</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Chemical name recognition is one of the first tasks in building an information extraction system for the biomedical domain. Chemicals, especially organic chemicals, are among the main agents in many of the processes and relationships such a system would need to find. In this work, we investigate a number of approaches to the problem of chemical name identification. We focus on approaches that use string-internal information for classification, that is, those based on character co-occurrence statistics within the strings we would like to classify. We also wanted to avoid spending much time and effort on manual annotation, and hence used readily available public data to train all the models. Because of that, we would be satisfied with only moderate results. In the course of this investigation, we found that N-gram methods work best given these restrictions on the models.</Paragraph>
    <Paragraph position="1"> Work has been done on the related task of named entity recognition (Bikel et al., 1999; Riloff, 1996; Cucerzan, 1999; and others). The aim of the named entity task is usually to find names of people, organizations, and other similar entities in text.</Paragraph>
    <Paragraph position="2"> Cucerzan et al. (1999) found it useful to add features based on internal substring patterns. For finding chemicals, internal substring patterns are an even more important source of information. Many substrings of chemical names are highly characteristic. For example, seeing &quot;methyl&quot; as a substring of a word is a strong indicator of a chemical name. Systematic chemical names are constructed from substrings like that, but even generic names follow certain conventions and have many characteristic substrings.</Paragraph>
    <Paragraph position="3"> In this work, character co-occurrence patterns are extracted from available lists of chemicals that were compiled for other purposes. We built models based on the difference between substrings that occur in chemical names and substrings that occur in other words.</Paragraph>
    <Paragraph position="4"> Using only string-internal information prevents us from disambiguating word senses, but we accept this as a minor source of error.</Paragraph>
    <Paragraph position="5"> Classification based solely on string-internal information makes the chemical name recognition task similar to language identification. In the language identification task, character co-occurrence patterns are used to detect strings from a different language embedded in text.</Paragraph>
    <Paragraph position="6"> Because chemical names differ so much from ordinary words, we can view them as a different language and borrow some language identification techniques. Danning (1994) achieved good results using character N-gram models for language identification even on short strings (20 symbols long). This suggests that his approach might be successful in the chemical name identification setting.</Paragraph>
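The language identification view in the paragraph above can be sketched roughly as follows: train one character N-gram model on chemical names and one on other words, then assign a string to whichever model gives it the higher likelihood. This is only an illustrative sketch, not the models evaluated in this paper. The names CharNgramModel and is_chemical, the trigram order, the boundary markers, the add-one smoothing, and the tiny word lists are all assumptions made for the example.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character N-gram model with add-one smoothing (an illustrative choice)."""

    def __init__(self, n=3):
        self.n = n
        self.ngrams = Counter()    # counts of n-character sequences
        self.contexts = Counter()  # counts of their (n-1)-character prefixes
        self.alphabet = set()

    def _pad(self, word):
        # Boundary markers let the model learn typical prefixes and suffixes.
        return "^" * (self.n - 1) + word.lower() + "$"

    def train(self, words):
        for word in words:
            padded = self._pad(word)
            self.alphabet.update(padded)
            for i in range(len(padded) - self.n + 1):
                gram = padded[i:i + self.n]
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def log_prob(self, word):
        """Sum of smoothed log P(character | previous n-1 characters)."""
        padded = self._pad(word)
        vocab = len(self.alphabet) + 1  # one extra slot for unseen characters
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            numerator = self.ngrams[gram] + 1
            denominator = self.contexts[gram[:-1]] + vocab
            total += math.log(numerator / denominator)
        return total


def is_chemical(word, chem_model, other_model):
    # Pick whichever "language" assigns the string the higher likelihood.
    return chem_model.log_prob(word) > other_model.log_prob(word)


# Toy training lists, hypothetical and far too small for real use.
chem_model = CharNgramModel(n=3)
chem_model.train(["methylamine", "dimethyl", "ethylbenzene",
                  "methanol", "trimethylsilane"])
other_model = CharNgramModel(n=3)
other_model.train(["patient", "treatment", "protein",
                   "binding", "results", "method"])

print(is_chemical("methylbenzene", chem_model, other_model))
print(is_chemical("treatment", chem_model, other_model))
```

Because each character is conditioned on the characters before it, overlapping N-grams are combined through the chain rule rather than being treated as independent pieces of evidence.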
    <Paragraph position="7"> N-gram based methods have previously been used for chemical recognition. Wilbur et al. (1999) used all substrings of a fixed length N, but they combined the training counts in a Bayesian framework, ignoring the non-independence of overlapping substrings. They claimed good performance on their data, but this approach showed significantly lower performance than the alternatives on our data. See the results section for more details. The difference is that their data was carefully constructed so that the training data contains only chemicals and covers all types of chemicals in the test data, i.e., their training and testing data correspond very closely.</Paragraph>
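To make the independence issue in the paragraph above concrete, here is a minimal naive Bayes sketch over all substrings of a fixed length N. It is a generic formulation and should not be read as the exact method of Wilbur et al.; the function names, the add-one smoothing, the equal prior, and the toy word lists are assumptions made for illustration.

```python
import math
from collections import Counter


def substrings(word, n):
    """All overlapping substrings of length n (shorter words yield none)."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def train_counts(words, n):
    counts = Counter()
    for word in words:
        counts.update(substrings(word, n))
    return counts


def log_odds(word, chem_counts, other_counts, n, prior_chem=0.5):
    """Naive Bayes log-odds that the word is a chemical name.

    Every substring contributes its own likelihood ratio as if it were
    independent of its neighbours, although adjacent substrings share
    n - 1 characters. That shared evidence is effectively counted
    several times, which is the simplification discussed in the text.
    """
    # Add-one smoothing over the substrings seen in either list.
    distinct = len(set(chem_counts) | set(other_counts)) + 1
    chem_total = sum(chem_counts.values())
    other_total = sum(other_counts.values())
    score = math.log(prior_chem / (1.0 - prior_chem))
    for gram in substrings(word, n):
        p_chem = (chem_counts[gram] + 1) / (chem_total + distinct)
        p_other = (other_counts[gram] + 1) / (other_total + distinct)
        score += math.log(p_chem / p_other)
    return score


# Toy training lists, hypothetical and far too small for real use.
N = 3
chem_counts = train_counts(["methylamine", "dimethyl", "ethylbenzene",
                            "methanol", "trimethylsilane"], N)
other_counts = train_counts(["patient", "treatment", "protein",
                             "binding", "results", "method"], N)

print(log_odds("methylbenzene", chem_counts, other_counts, N))
print(log_odds("treatment", chem_counts, other_counts, N))
```

A positive score favours the chemical class and a negative score favours ordinary words. With so little data the scores are only indicative.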
    <Paragraph position="8"> We, on the other hand, tried to use readily available chemical lists without putting much manual labor into their construction. Most of our training data comes from a single source, the National Cancer Institute website, and hence represents only a very specific domain of chemicals, while the testing data comes from a random sample of MEDLINE. In addition, these lists were designed for use by humans, and hence contain many comments and descriptions that are not easily separable from the chemical names themselves. Several attempts at cleaning these out were made. The most aggressive attempts deleted about half the text from the lists. Although this deleted many useful names, it improved the results significantly.</Paragraph>
    <Paragraph position="9"> While we found that N-grams worked best among the approaches we tried, other approaches are also possible. We did not explore the possibility of using substrings as features for a generic classification algorithm such as support vector machines (Burges, 1998).</Paragraph>
  </Section>
</Paper>