File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/03/w03-0602_abstr.xml

Size: 884 bytes

Last Modified: 2025-10-06 13:43:07

<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0602">
  <Title>Words and Pictures in the News</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> We discuss the properties of a collection of news photos and captions, collected from the Associated Press and Reuters. Captions have a vocabulary dominated by proper names. We have implemented various text clustering algorithms to organize these items by topic, as well as an iconic matcher that identifies articles that share a picture. We have found that the special structure of captions allows us to extract some names of people actually portrayed in the image quite reliably, using a simple syntactic analysis. We have been able to build a directory of face images of individuals from this collection.</Paragraph>
  </Section>
</Paper>