<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-2002">
  <Title>Identifying Chemical Names in Biomedical Text: An Investigation of the Substring Co-occurrence Based Approaches</Title>
  <Section position="7" start_page="11" end_page="11" type="concl">
    <SectionTitle>
6 Conclusions and Future Work
</SectionTitle>
    <Paragraph position="0"> We have investigated several approaches to chemical name identification using string-internal information. We used readily available training data and a small amount of human-annotated text, which served primarily for testing. We achieved good performance on general biomedical text taken from MEDLINE abstracts. N-gram models showed the best performance, and the specific details of parameter tuning for these models produced only small variations in the results. We have also introduced a method for computing interpolated N-gram model parameters without tuning them on development data. The results produced by this method were slightly better than those of the other approaches. We believe this approach performed better because only one parameter, the N-gram length, needed to be tuned on the development data. This is a significant advantage when little development data is available. In general, we discovered many similarities with previous work on language identification, which suggests that other techniques introduced for language identification may carry over well to chemical identification.</Paragraph>
    <Paragraph position="1"> As a short-term goal, we would like to determine N-gram interpolation coefficients by the usefulness of the corresponding context for discrimination. This would incorporate the same technique that we used for the Naive Bayes system, hopefully combining the advantages of both approaches. There are other alternatives for learning a classification rule. Recently, support vector machines (Burges, 1998) have been a popular approach.</Paragraph>
    <Paragraph position="2"> More traditionally, decision trees (Breiman et al., 1984) have been used for similar tasks. It would be interesting to try these approaches on our task and compare them with the Naive Bayes and N-gram approaches discussed here.</Paragraph>
    <Paragraph position="3"> One limitation of the current system is that it does not find the boundaries of chemical names, but only classifies predetermined tokens as being part of a chemical name or not. The system can be improved by removing the prior tokenization requirement and attempting to identify chemical name boundaries based on the learned information.</Paragraph>
    <Paragraph position="4"> In this work we explored just one dimension of the possible features useful for finding chemical names. We intend to incorporate other types of features, including context-based features, into this work.</Paragraph>
  </Section>
</Paper>