<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1130"> <Title>Fine Grained Classification of Named Entities</Title> <Section position="8" start_page="3" end_page="3" type="concl"> <SectionTitle> 7. Conclusions </SectionTitle> <Paragraph position="0"> The results of these experiments, though preliminary, are very promising. Our research makes clear that positive results are possible with relatively simple statistical techniques.</Paragraph> <Paragraph position="1"> This research has shown that the construction of training data is critical. The failure of our automatic data generation algorithm to produce a good sample of training data is evident in the large disparity between performance on the validation and held-out test sets. There are at least two reasons for the algorithm's poor sampling.</Paragraph> <Paragraph position="2"> First, because only high-confidence guesses from the seed-trained classifier are used, the training data may contain a disproportionate number of instances that are easy to classify. This is evident in the number of partial names present in the held-out test set versus the training set. Partial names, such as &quot;Simon&quot; instead of &quot;Paul Simon,&quot; usually occur with weaker evidence for classification than full names. In the training set, only 45.1% of the instances are partial names, whereas in the more realistic distribution of the held-out set, 58.4% are partial names.</Paragraph> <Paragraph position="3"> The second reason for the poor sampling stems from the use of lists of person names.</Paragraph> <Paragraph position="4"> Because the training set is derived from the individuals on these lists, the coverage of individuals included in the training set is inherently limited. For example, in the businessperson category, lists of individuals were taken from resources such as Forbes' annual ranking of the nation's wealthiest people, under the assumption that wealthy people are often in the news. 
However, the list fails to mention the countless vice presidents and analysts who frequent the pages of the Wall Street Journal. This failure to include such lower-level businesspersons means that a large portion of the classification domain is not covered by the training set, which in turn leads to poor results on the held-out test set.</Paragraph> <Paragraph position="5"> The results of these experiments suggest that better fine-grained classification of named entities will require not only more sophisticated feature selection but also a better data generation procedure. In future work, we will investigate more sophisticated bootstrapping methods, such as that of (Collins &amp; Singer, 1999), as well as co-training and co-testing (Muslea et al., 2000). We will also examine adapting the hierarchical decision list algorithm of (Yarowsky, 2000) to our task. Treating fine-grained classification of named entities as a word sense disambiguation problem (where categories are treated as different senses of a generic &quot;person name&quot;) makes these methods directly applicable. The algorithm is particularly relevant in that it provides an intuitive way to take advantage of the similarities between certain categories (e.g., Athlete and Entertainer).</Paragraph> <Paragraph position="6"> Of more theoretical concern are the problems of miscellaneous instances that do not fit easily into any category, as well as instances that may fit into more than one category (e.g., Ronald Reagan can be either a Politician or an Entertainer). We plan to address these issues, as well as problems that may arise in extending this system to other classes, such as organizations.</Paragraph> </Section></Paper>