File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/00/w00-0904_concl.xml

Size: 2,150 bytes

Last Modified: 2025-10-06 13:52:55

<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0904">
  <Title>Comparison between Tagged Corpora for the Named Entity Task</Title>
  <Section position="9" start_page="97" end_page="97" type="concl">
    <SectionTitle>
7 Conclusion
</SectionTitle>
    <Paragraph position="0"> Soundly motivated metrics for comparing the usefulness of corpora for specific tasks and systems are clearly necessary for the development of robust and portable information extraction systems.</Paragraph>
    <Paragraph position="1"> In this paper we have shown that measures for comparing corpora based just on class-token ratios have difficulty predicting system performance and cannot adequately explain the difficulty of the NE task either generally or for specific systems.</Paragraph>
    <Paragraph position="2"> While we should be cautious about drawing sweeping conclusions given the small size of the corpora in our study, our results from gain ratio and cross entropy indicate that counts from the features of both systems will be more useful in the MUC6 corpus than in the biology corpus. We can also see that, while the coverage is limited, surface words play a leading role for both systems. Gain ratio statistics for surface words in the two domains were far closer than for any other type of feature, and given that this is also the dominant knowledge type, this seems to be one likely reason that the performance of the systems is about the same in both domains.</Paragraph>
    <Paragraph position="3"> We have presented the results of applying two models based on supervised learning to the named entity task in two widely different domains and explained their performance through class-token ratios, entropy and gain ratio. Measures such as entropy and gain ratio have been found to have the best predictive power, although the features used to calculate gain ratio are not sufficient to describe all the information that is necessary for the named entity task. In future work we intend to extend our study to new and larger NE corpora in various domains and to try to reduce the error factor in our calculations that results from corpus size.</Paragraph>
  </Section>
</Paper>