<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1030">
  <Title>Extracting the Names of Genes and Gene Products with a Hidden Markov Model</Title>
  <Section position="7" start_page="205" end_page="205" type="concl">
    <SectionTitle>
5 Conclusion
</SectionTitle>
    <Paragraph position="0"> HMMs are proving their worth for various tasks in information extraction, and the results here show that this good performance can be achieved across domains, i.e. in molecular biology as well as in newspaper reports. The task itself, while similar to the named entity task in MUC, is we believe more challenging due to the large number of terms which are not proper nouns, such as those in the source sub-classes, as well as the large lexical overlap between classes such as PROTEIN and DNA. A useful line of future work would be to find empirical methods for comparing the difficulty of domains.</Paragraph>
    <Paragraph position="1"> Unlike traditional dictionary-based methods, the method we have shown has the advantage of being portable, and it uses no hand-made patterns. Additionally, since the character features are quite powerful, yet very general, there is little need for intervention to create domain-specific features, although other types of features could be added within the interpolation framework. Indeed, the only thing required is a fairly small corpus of text containing entities tagged by a domain expert.</Paragraph>
    <Paragraph position="2"> Currently we have optimized the λ constants by hand, but clearly a better way would be to do this automatically. An obvious strategy would be to use an iterative learning method such as Expectation Maximization (Dempster et al., 1977).</Paragraph>
    <Paragraph position="3"> The model still has limitations, most obviously when it needs to identify term boundaries for phrases containing potentially ambiguous local structures such as coordination and parentheses. For such cases we will need to add post-processing rules.</Paragraph>
    <Paragraph position="4"> There are of course many NE models that are not based on HMMs that have had success in the NE task at the MUC conferences.</Paragraph>
    <Paragraph position="5"> Our main requirements in implementing a model for the domain of molecular biology have been ease of development, accuracy and portability to other sub-domains, since molecular biology itself is a wide field. HMMs seemed to be the most favourable option at this time. Alternatives that have also had considerable success are decision trees, e.g. (Nobata et al., 1999), and maximum entropy. The maximum entropy model shown in (Borthwick et al., 1998) in particular seems a promising approach because of its ability to handle overlapping and large feature sets within a well-founded mathematical framework. However, this implementation of the method seems to incorporate a number of hand-coded domain-specific lexical features and dictionary lists that reduce portability.</Paragraph>
    <Paragraph position="6"> Undoubtedly we could incorporate richer features into our model and, based on the evidence of others, we would like to add head nouns as one type of feature in the future.</Paragraph>
  </Section>
</Paper>