<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0701"> <Title>Overlap Features</Title> <Section position="3" start_page="0" end_page="1" type="intro"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> While there has been a great deal of work on coreference resolution within a single document, little work has focused on the challenges of resolving the reference of identical person names across multiple documents.</Paragraph> <Paragraph position="1"> Mann and Yarowsky (2003) are among the few who have examined this problem. They treat it as a clustering task in which a combination of features (such as a weighted bag of words and biographic information extracted from the text) is given to an agglomerative clustering algorithm, which outputs two clusters representing the two referents of the query name.</Paragraph> <Paragraph position="2"> Mann and Yarowsky (2003) report results on two types of evaluation: using hand-annotated web pages returned from truly ambiguous searches, they report precision/recall scores of 0.88/0.73; using &quot;pseudonames&quot;, they report an accuracy of 86.4%.</Paragraph> <Paragraph position="3"> Borrowing from techniques in word sense disambiguation, they create a test set of 28 &quot;pseudonames&quot; by ran- While Mann and Yarowsky (2003) describe a number of useful features for multi-document person name resolution, their technique is limited by allowing only a fixed number of referent clusters. Further, as discussed below, their use of artificial test data makes it difficult to determine how well it generalizes to real-world problems.</Paragraph> <Paragraph position="4"> Bagga and Baldwin (1998) also examine multi-document person name resolution. They first perform within-document coreference resolution to form coreference chains for each entity in each document. 
They then use the text surrounding each coreference chain to create summaries about each entity in each document.</Paragraph> <Paragraph position="5"> These summaries are then converted to bag-of-words feature vectors and clustered using the standard vector space model often employed in IR. They evaluated their system on 11 entities named John Smith taken from a set of 173 New York Times articles. Using an evaluation metric similar to a weighted sum of precision and recall, they obtain an F-measure of 0.846.</Paragraph> <Paragraph position="6"> Although their technique allows for the discovery of a variable number of referents, its use of simplistic bag-of-words clustering is an inherently limiting aspect of their methodology. Further, the fact that they evaluate their system on only a single person name raises the question of how well such a technique would fare on a more real-world challenge.</Paragraph> </Section> </Paper>