<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1217">
  <Title>Exploiting Context for Biomedical Entity Recognition: From Syntax to the Web</Title>
  <Section position="4" start_page="89" end_page="90" type="evalu">
    <SectionTitle>
3 Results and Discussion
</SectionTitle>
    <Paragraph position="0"> Our results on the evaluation data and a confusion matrix are shown in Tables 2 and 4. Table 4 suggests areas for further work. Collapsing the B- and I- tags does cost us quite a bit. Otherwise confusions between some named entity and being nothing are most of the errors, although protein/DNA and cellline/cell-type confusions are also noticeable.</Paragraph>
    <Paragraph position="1"> Analysis of performance in biomedical Named Entity Recognition tends to be dominated by the perceived poorness of the results, stemming from the twin beliefs that performance of roughly ninety percent is the state-of-the-art and that performance of 100% (or close to that) is possible and the goal to be aimed for. Both of these beliefs are questionable, as the top MUC 7 performance of 93.39%  (Mikheev et al., 1998) in the domain of newswire text used an easier performance metric where incorrect boundaries were given partial credit, while both the biomedical NER shared tasks to date have used an exact match criterion where one is doubly penalized (both as a FP and as a FN) for incorrect boundaries. However, the difference in metric clearly cannot account entirely for the performance discrepancy between newswire NER and biomedical NER.</Paragraph>
    <Paragraph position="2"> Biomedical NER appears to be a harder task due to the widespread ambiguity of terms out of context, the complexity of medical language, and the apparent need for expert domain knowledge. These are problems that more sophisticated machine learning systems using resources such as ontologies and deep processing might be able to overcome. However, one should also consider the inherent &amp;quot;fuzziness&amp;quot; of the classification task. The few existing studies of inter-annotator agreement for biomedical named entities have measured agreement between 87%(Hirschman, 2003) and 89%(Demetrious and Gaizauskas, 2003). As far as we know there are no inter-annotator agreement results for the GENIA corpus, and it is necessary to have such results before properly evaluating the performance of systems. In particular, the fact that BioNLP sought to distinguish between gene and protein names, when these are known to be systematically ambiguous, and when in fact in the GENIA corpus many entities were doubly classified as &amp;quot;protein molecule or  region&amp;quot; and &amp;quot;DNA molecule or region&amp;quot;, suggests that inter-annotator agreement could be low, and that many entities in fact have more than one classification. null One area where GENIA appears inconsistent is in the labeling of preceding adjectives. The data was selected by querying for the term human, yet the term is labeled inconsistently, as is shown in Table 4. Of the 1790 times the term human occurred before or at the beginning of an entity in the training data, it was not classified as part of the entity 110 times. In the test data, there is only on instance (out of 130) where the term is excluded. Adjectives are excluded approximately 25% of the time in both the training and evaluation data. There are also inconsistencies when two entities are separated by the word and.</Paragraph>
  </Section>
class="xml-element"></Paper>