<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1030">
  <Title>First Story Detection using a Composite Document Representation.</Title>
  <Section position="3" start_page="0" end_page="1" type="intro">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> The goal of Topic Detection and Tracking (TDT) is to monitor and reorganize a stream of broadcast news stories so as to help a user recognize and explore the different news events that have occurred in the data set. First story detection (FSD), or online new event detection [1], is one aspect of the detection problem, which is one of the three technical tasks defined by the TDT initiative (the other two being segmentation and tracking). Given a stream of news stories arriving in chronological order, a detection system must group or cluster articles that discuss distinct news events in the data stream. The TDT initiative has further clarified the notion of topic detection by differentiating between classification in a retrospective environment (Event Clustering) and in an online environment (First Story Detection). In FSD the system must identify all stories in the data stream that discuss novel news events. This classification decision is made by considering only those documents that arrived before the document currently being evaluated, forcing the system to adhere to the temporal constraints of a real-time news stream.</Paragraph>
    <Paragraph position="1"> In other words, the system must make an irrevocable classification decision (i.e. the document discusses either a new event or a previously detected event) as soon as the document arrives on the input stream. The goal of event clustering, on the other hand, is to partition the data stream into clusters of related documents that discuss distinct events. This decision can be made after the system has considered all the stories in the input stream.</Paragraph>
    <Paragraph position="2"> In addition to defining three research problems associated with broadcast news, the TDT initiative also attempted to formally define an event and to distinguish it from the traditional IR notion of a subject or topic as defined by the TREC community. An event is defined as 'something that happens at some specific time and place (e.g. an assassination attempt, or a volcanic eruption in Greece)'. A topic, on the other hand, is a 'seminal event or activity along with all directly related events and activities (e.g. an investigation or a political campaign)' [1]. Initial TDT research into event tracking and detection focused on developing a classification algorithm to address this subtle distinction between an event and a topic. For example, successful attempts were made to address the temporal nature of news stories (stories closer together on the input stream are more likely to discuss the same event than stories further apart) by exploiting the time between stories when determining their similarity in the detection process [1]. However, current research is focusing on the use of NLP techniques such as language modeling [2, 3], or on other forms of feature selection such as the identification of events based on the domain dependencies between words [4] or the extraction of certain word classes from stories, i.e. noun phrases and noun phrase heads [5]. All these techniques offer a means of determining the most informative features of an event, as opposed to classifying documents based on all the words in the document. Our research is also based on this notion of feature selection. In this paper we investigate whether the use of lexical chains to classify documents can better encapsulate this notion of an event. In particular, we look at the effect on FSD when a composite document representation (combining a lexical chain representation and a free text representation) is used to represent events in the TDT domain.</Paragraph>
    <Paragraph position="3"> In Sections 2 and 3 we describe the first component of our composite document representation, which is derived from lexical chains, followed by a description of FSD classification based on our data fusion strategy in Section 4. The remaining sections of this paper give a detailed account of our experimental results and conclude with a discussion of their significance in terms of two general criteria for successful data fusion.</Paragraph>
  </Section>
</Paper>